How it works

The reconcile loop

Every controller follows the same shape: watch a resource, compare desired to observed, take one safe step, record status, requeue. The operator decides; the agent performs host mutations; the agent reports; the operator rolls forward on the next pass.

sequenceDiagram
    autonumber
    participant U as nwctl / UI
    participant A as K8s API
    participant O as Operator (controller)
    participant N as Node agent
    participant H as Host / storage

    U->>A: apply custom resource (intent)
    A-->>O: watch event
    O->>O: compare desired vs observed
    O->>A: patch .status (phase, next step)
    O->>N: delegate one host op
    N->>H: mutate (ip link / drbdadm / kairos)
    N-->>A: report observed state
    A-->>O: requeue
    O->>O: honor gates, then advance one step
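
A single pass of the loop above can be sketched as follows. This is a minimal illustration, not the operator's actual code: the `Resource` class, the `observedMode` and `phase` status keys, and the `delegate` callback standing in for the node agent are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical custom resource: a desired spec plus a status map."""
    desired_mode: str
    status: dict = field(default_factory=dict)


def reconcile(resource: Resource, observed_mode: str, delegate) -> bool:
    """Run one pass. Returns True if the controller should requeue."""
    # Status records what was observed, never what was requested.
    resource.status["observedMode"] = observed_mode

    if observed_mode == resource.desired_mode:
        resource.status["phase"] = "Ready"
        return False  # converged, nothing left to do

    # Take exactly one step: hand a single host operation to the agent.
    resource.status["phase"] = "Progressing"
    delegate(resource.desired_mode)
    return True  # re-check on the next pass, after the agent reports


# Usage: a stand-in agent that applies the mode to an in-memory "host".
host = {"mode": "active-backup"}
res = Resource(desired_mode="balance-rr")

first = reconcile(res, host["mode"], lambda m: host.update(mode=m))
second = reconcile(res, host["mode"], lambda m: host.update(mode=m))
```

The first pass delegates one operation and asks for a requeue; the second pass observes the converged host and only then reports `Ready`.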

These invariants are non-negotiable — each is a hard-won lesson from the DaemonSet's incident history:

Idempotence. Every pass is safe to retry; never leave partial state.
Honor explicit gates. Refuse to advance past a safety gate rather than racing it.
Backpressure. If the storage layer is unhealthy, wait — don't queue more work.
Fail loud, don't partially proceed. The motivating failure was one node flipping while the other didn't, and the next step firing anyway. Controllers stop instead.
Status reflects observed state, never desired state.
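
One way to picture how these invariants combine is a single decision function that every pass consults before doing work. The sketch below is illustrative only: `next_action`, `HaltError`, and the shape of `node_results` are assumptions made for the example, not names from the operator.

```python
class HaltError(RuntimeError):
    """Raised to stop a workflow loudly instead of proceeding partially."""


def next_action(gate_open: bool, storage_healthy: bool, node_results: dict) -> str:
    """Decide what a pass may do. Every input is observed state."""
    # Fail loud: if nodes disagree about the last step, stop the workflow.
    if len(set(node_results.values())) > 1:
        raise HaltError(f"nodes diverged: {node_results}")
    # Honor explicit gates: wait rather than race a closed gate.
    if not gate_open:
        return "wait-for-gate"
    # Backpressure: do not queue more work onto unhealthy storage.
    if not storage_healthy:
        return "wait-for-storage"
    return "advance"


# Usage: one node flipped and the other did not, so the pass must halt.
try:
    next_action(True, True, {"node1": "flipped", "node2": "unchanged"})
except HaltError as err:
    halted = str(err)
```

Because the function only reads observed state and has no side effects, calling it again on a retry is safe, which is the idempotence property in miniature.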

The storage-upgrade machine

The most battle-hardened workflow is the single→multi-node storage upgrade: flipping the storage bond from active-backup to balance-rr so DRBD can replicate across nodes, without disrupting the VMs running on top. In the DaemonSet this is a 21-state machine persisted in a ConfigMap. The operator only observes it today; the next step is to model it as a first-class workflow (BondModeFlipPlan / ReplicationUpgrade).

stateDiagram-v2
    direction TB
    [*] --> Idle
    Idle --> PreconditionsCheck: ≥2 eligible nodes, peers live
    PreconditionsCheck --> AcquireLock
    AcquireLock --> SuspendMigration: cluster lock held

    state "Protect workloads" as prot {
        SuspendMigration
    }
    SuspendMigration --> DisconnectDRBD: abort VMIMs, pin evictionStrategy=None
    DisconnectDRBD --> DrainSatellite
    DrainSatellite --> FlipNode1
    FlipNode1 --> FlipNode2
    FlipNode2 --> ResumeDRBD: both on balance-rr

    ResumeDRBD --> RestoreMigration: peers reconnected + synced
    RestoreMigration --> ReleaseLock: restore evictionStrategy=LiveMigrate
    ReleaseLock --> Complete
    Complete --> [*]

    FlipNode1 --> BondFlipRetry: transient failure
    FlipNode2 --> BondFlipRetry: transient failure
    BondFlipRetry --> FlipNode1: forward-only retry (max 3)
    BondFlipRetry --> Rollback: retries exhausted
    ResumeDRBD --> Rollback: cannot resume safely

    state "Rollback ladder" as Rollback {
        direction TB
        RbRestoreMigration --> RbResumeSatellite
        RbResumeSatellite --> RbReconnectDrbd
        RbReconnectDrbd --> RbRevertNodes
    }
    Rollback --> Failed
    Failed --> [*]
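
The transitions in the diagram can be written down as a small pure function, which makes the retry budget and the rollback edges easy to test. This is a sketch of the states shown here only, not the full 21-state machine; `advance`, `run`, and `fails_at` are hypothetical names. The diagram routes every retry back to FlipNode1, and the sketch assumes an idempotent flip makes that a no-op for a node that has already flipped.

```python
from typing import List, Optional, Tuple

MAX_FLIP_RETRIES = 3  # "max 3" from the diagram

HAPPY_PATH = [
    "Idle", "PreconditionsCheck", "AcquireLock", "SuspendMigration",
    "DisconnectDRBD", "DrainSatellite", "FlipNode1", "FlipNode2",
    "ResumeDRBD", "RestoreMigration", "ReleaseLock", "Complete",
]
FLIP_STATES = {"FlipNode1", "FlipNode2"}
TERMINAL = {"Complete", "Failed"}


def advance(state: str, ok: bool, retries: int) -> Tuple[str, int]:
    """Return (next_state, retries_used) for one transition."""
    if state == "BondFlipRetry":
        if retries >= MAX_FLIP_RETRIES:
            return "Rollback", retries  # retries exhausted
        return "FlipNode1", retries + 1  # forward-only retry
    if state == "Rollback":
        return "Failed", retries
    if state in TERMINAL:
        return state, retries
    if ok:
        return HAPPY_PATH[HAPPY_PATH.index(state) + 1], retries
    if state in FLIP_STATES:
        return "BondFlipRetry", retries
    if state == "ResumeDRBD":
        return "Rollback", retries
    # The diagram shows no failure edge for other states: stay put.
    return state, retries


def run(fails_at: Optional[str] = None, max_steps: int = 50) -> List[str]:
    """Walk the machine, failing every visit to `fails_at`; return the trace."""
    state, retries, trace = "Idle", 0, ["Idle"]
    for _ in range(max_steps):
        if state in TERMINAL:
            break
        state, retries = advance(state, ok=(state != fails_at), retries=retries)
        trace.append(state)
    return trace
```

Calling `run()` walks the happy path to `Complete`, while `run(fails_at="FlipNode1")` uses up the three retries and then ends in `Rollback` followed by `Failed`.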

What the diagram encodes — and why each guard exists:

| Stage | Guard | Why |
| --- | --- | --- |
| PreconditionsCheck | ≥2 eligible nodes; every peer live | Never flip into a cluster that can't hold replicas. |
| SuspendMigration | abort in-flight VM migrations, pin `evictionStrategy=None` | If KubeVirt live-migrates a VM mid-flip, it can land on a node with asymmetric DRBD, a data-availability risk. |
| DisconnectDRBD → DrainSatellite | disconnect replication, move storage pods off the node | Flip the bond on a quiesced storage path, not a live one. |
| FlipNode1 → FlipNode2 | one node at a time, forward-only retry | A half-flipped cluster is the original incident; retries never revert an already-flipped node. |
| ResumeDRBD | reconnect only after peer convergence + two-replica sync | Never resume replication over a marginal or asymmetric link. |
| Rollback ladder | undo in reverse: restore VMs → resume storage → reconnect → revert bonds | Every workflow has a defined stuck-state and a reverse path. |
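
Guards like these are easiest to verify when each one is a pure predicate over observed state. The sketch below covers two rows of the table under stated assumptions: the `Node` fields and both function names are hypothetical, "every peer live" is read as applying to each eligible node, and "two-replica sync" is read as at least two replicas in sync.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Node:
    """Hypothetical observed view of one node."""
    name: str
    eligible: bool
    peer_live: bool


def preconditions_met(nodes: Iterable[Node]) -> bool:
    """PreconditionsCheck: at least 2 eligible nodes, every peer live."""
    eligible = [n for n in nodes if n.eligible]
    return len(eligible) >= 2 and all(n.peer_live for n in eligible)


def may_resume_drbd(peers_converged: bool, synced_replicas: int) -> bool:
    """ResumeDRBD: reconnect only after convergence and two-replica sync."""
    return peers_converged and synced_replicas >= 2


# Usage: a two-node cluster where one peer is down must not start the flip.
cluster = [Node("node1", True, True), Node("node2", True, False)]
can_start = preconditions_met(cluster)
```

Keeping each guard free of side effects means it can be asserted in a unit test without touching a cluster.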

This is why it's gated, not autonomous

Every step here mutates host networking or storage on a live cluster, so the blast radius is high. The productized version deliberately over-specifies these guards as testable acceptance criteria, and its dangerous paths run supervised with human sign-off, never as an unattended loop. The read-only observation bridge comes first precisely because it carries none of that risk.