How it works

The reconcile loop

Every controller follows the same shape: watch a resource, compare desired state to observed state, take one safe step, record status, and requeue. The operator decides, the node agent performs the host mutation and reports what it observed, and the operator advances on the next pass.

sequenceDiagram
    autonumber
    participant U as nwctl / UI
    participant A as K8s API
    participant O as Operator (controller)
    participant N as Node agent
    participant H as Host / storage

    U->>A: apply CRD (intent)
    A-->>O: watch event
    O->>O: compare desired vs observed
    O->>A: patch .status (phase, next step)
    O->>N: delegate one host op
    N->>H: mutate (ip link / drbdadm / kairos)
    N-->>A: report observed state
    A-->>O: requeue
    O->>O: honor gates, then advance one step
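
The loop above can be sketched in a few lines. This is a minimal illustration rather than the operator's actual code: `Resource`, `FakeAgent`, `reconcile`, and the op names `create-bond` and `attach-vlan` are all hypothetical, and the Kubernetes API is replaced by plain in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    desired_ops: list                      # ordered host ops implied by the intent (hypothetical)
    status: dict = field(default_factory=dict)


class FakeAgent:
    """Stand-in for the node agent: performs one host op, reports what it sees."""

    def __init__(self):
        self._done = []

    def perform(self, op):
        if op not in self._done:           # idempotent: repeating an op changes nothing
            self._done.append(op)

    def observed(self):
        return list(self._done)


def reconcile(res, agent):
    """Run one pass; return True if the controller should requeue."""
    observed = agent.observed()
    remaining = [op for op in res.desired_ops if op not in observed]
    # Status is built from what the agent reported, not from the intent alone.
    res.status = {
        "phase": "Progressing" if remaining else "Ready",
        "completed": observed,
        "nextStep": remaining[0] if remaining else None,
    }
    if not remaining:
        return False
    agent.perform(remaining[0])            # delegate exactly one host op per pass
    return True


res = Resource(desired_ops=["create-bond", "attach-vlan"])
agent = FakeAgent()
passes = 0
while reconcile(res, agent):               # each True result models a requeue
    passes += 1
print(passes, res.status["phase"])
```

Each pass does at most one host operation, so a crash between passes leaves the system at a step boundary, and the next pass picks up from whatever the agent reports.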

These invariants are non-negotiable — each is a hard-won lesson from the DaemonSet's incident history:

  • Idempotence. Every pass is safe to retry; never leave partial state.
  • Honor explicit gates. Refuse to advance past a safety gate rather than racing it.
  • Backpressure. If the storage layer is unhealthy, wait — don't queue more work.
  • Fail loud, don't partially proceed. The motivating failure was one node flipping while the other didn't, and the next step firing anyway. Controllers stop instead.
  • Status reflects observed, never desired.
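
One way to picture how these invariants combine is a single decision function that runs before any step is taken. The sketch below is an assumption about how such a check could look; `Decision`, `decide`, and its parameters are hypothetical names, not part of the operator.

```python
from enum import Enum


class Decision(Enum):
    STOP = "stop"        # fail loud: surface the error, take no further step
    WAIT = "wait"        # backpressure or closed gate: requeue without new work
    ADVANCE = "advance"  # safe to take the next single step


def decide(storage_healthy, gate_open, last_step_results):
    """Pick the next move from observed facts only.

    last_step_results maps node name -> True if the previous step
    succeeded on that node (hypothetical shape).
    """
    # Fail loud first: if any node did not complete the previous step,
    # never let the next step fire.
    if not all(last_step_results.values()):
        return Decision.STOP
    # Backpressure: an unhealthy storage layer means wait, not queue more work.
    if not storage_healthy:
        return Decision.WAIT
    # Honor explicit gates: refuse to advance rather than race the gate.
    if not gate_open:
        return Decision.WAIT
    return Decision.ADVANCE


print(decide(True, True, {"node-a": True, "node-b": False}))
```

The ordering matters in this sketch: a divergence between nodes is checked before anything else, so a healthy storage layer and an open gate cannot mask a half-applied step.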

The storage-upgrade machine

The most battle-hardened workflow is the single→multi-node storage upgrade: flipping the storage bond from active-backup to balance-rr so DRBD can replicate across nodes, without disrupting the VMs running on top. In the DaemonSet this is a 21-state machine persisted in a ConfigMap. Today the operator only observes that machine; the next step is to model it as a first-class workflow (BondModeFlipPlan / ReplicationUpgrade).

stateDiagram-v2
    direction TB
    [*] --> Idle
    Idle --> PreconditionsCheck: ≥2 eligible nodes, peers live
    PreconditionsCheck --> AcquireLock
    AcquireLock --> SuspendMigration: cluster lock held

    state "Protect workloads" as prot {
        SuspendMigration
    }
    SuspendMigration --> DisconnectDRBD: abort VMIMs, pin evictionStrategy=None
    DisconnectDRBD --> DrainSatellite
    DrainSatellite --> FlipNode1
    FlipNode1 --> FlipNode2
    FlipNode2 --> ResumeDRBD: both on balance-rr

    ResumeDRBD --> RestoreMigration: peers reconnected + synced
    RestoreMigration --> ReleaseLock: restore evictionStrategy=LiveMigrate
    ReleaseLock --> Complete
    Complete --> [*]

    FlipNode1 --> BondFlipRetry: transient failure
    FlipNode2 --> BondFlipRetry: transient failure
    BondFlipRetry --> FlipNode1: forward-only retry (max 3)
    BondFlipRetry --> Rollback: retries exhausted
    ResumeDRBD --> Rollback: cannot resume safely

    state "Rollback ladder" as Rollback {
        direction TB
        RbRestoreMigration --> RbResumeSatellite
        RbResumeSatellite --> RbReconnectDrbd
        RbReconnectDrbd --> RbRevertNodes
    }
    Rollback --> Failed
    Failed --> [*]
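
The transitions in the diagram can be expressed as a small pure function, which makes the retry budget easy to test. This sketch covers only the states shown in the diagram, not the full 21-state machine; `advance`, `run`, and the choice to stay in place when a step has no failure edge are assumptions made for illustration.

```python
HAPPY_PATH = [
    "Idle", "PreconditionsCheck", "AcquireLock", "SuspendMigration",
    "DisconnectDRBD", "DrainSatellite", "FlipNode1", "FlipNode2",
    "ResumeDRBD", "RestoreMigration", "ReleaseLock", "Complete",
]
MAX_FLIP_RETRIES = 3  # "max 3" from the diagram


def advance(state, ok, retries):
    """Return (next_state, retries) for one transition."""
    if state in ("Complete", "Failed"):
        return state, retries                      # terminal states
    if state == "Rollback":
        return "Failed", retries                   # ladder collapsed to one step here
    if state == "BondFlipRetry":
        if retries < MAX_FLIP_RETRIES:
            return "FlipNode1", retries + 1        # forward-only: re-enter the flip
        return "Rollback", retries                 # retries exhausted
    if state in ("FlipNode1", "FlipNode2") and not ok:
        return "BondFlipRetry", retries
    if state == "ResumeDRBD" and not ok:
        return "Rollback", retries
    if not ok:
        return state, retries                      # assumption: no failure edge, so wait
    return HAPPY_PATH[HAPPY_PATH.index(state) + 1], retries


def run(outcome, max_steps=100):
    """Drive the machine; outcome(state) says whether that state's work succeeded."""
    state, retries, trace = "Idle", 0, ["Idle"]
    while state not in ("Complete", "Failed") and len(trace) < max_steps:
        state, retries = advance(state, outcome(state), retries)
        trace.append(state)
    return trace


print(run(lambda s: True)[-1])
print(run(lambda s: s != "FlipNode2")[-2:])
```

Re-entering at `FlipNode1` after a `FlipNode2` failure only matches the forward-only rule if flipping an already-flipped node is a no-op, which is what the idempotence invariant requires.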

What the diagram encodes — and why each guard exists:

| Stage | The guard | Why |
| --- | --- | --- |
| PreconditionsCheck | ≥2 eligible nodes; every peer live | Never flip into a cluster that can't hold replicas. |
| SuspendMigration | abort in-flight VM migrations, pin evictionStrategy=None | If KubeVirt live-migrates a VM mid-flip, it can land on a node with asymmetric DRBD, which is a data-availability risk. |
| DisconnectDRBD → DrainSatellite | disconnect replication, move storage pods off the node | Flip the bond on a quiesced storage path, not a live one. |
| FlipNode1 → FlipNode2 | one node at a time, forward-only retry | A half-flipped cluster is the original incident; retries never revert an already-flipped node. |
| ResumeDRBD | reconnect only after peer convergence + two-replica sync | Never resume replication over a marginal or asymmetric link. |
| Rollback ladder | undo in reverse: restore VMs → resume storage → reconnect → revert bonds | Every workflow has a defined stuck-state and a reverse path. |
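
Guards like these are easiest to hold to account when they are pure predicates over observed state. The following is a sketch of the PreconditionsCheck and ResumeDRBD guards under stated assumptions: the `Node` fields, the function names, and the reading of "two-replica sync" as "at least two replicas in sync" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    eligible: bool     # node may take part in the upgrade (hypothetical field)
    peer_live: bool    # node's replication peer is reachable (hypothetical field)


def preconditions_met(nodes):
    """PreconditionsCheck: at least two eligible nodes, and every peer live."""
    eligible = [n for n in nodes if n.eligible]
    return len(eligible) >= 2 and all(n.peer_live for n in eligible)


def may_resume_drbd(peers_converged, synced_replicas):
    """ResumeDRBD: reconnect only after peer convergence and two-replica sync."""
    return peers_converged and synced_replicas >= 2


cluster = [Node("node-a", True, True), Node("node-b", True, False)]
print(preconditions_met(cluster))      # one peer is down, so the guard refuses
print(may_resume_drbd(True, 2))
```

Because each guard takes only observed values and returns a boolean, every row of the table can become a unit test without touching a live cluster.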

This is why it's gated, not autonomous

Every step here mutates host networking or storage on a live cluster, so the blast radius is high. The productized version deliberately over-specifies these guards as testable acceptance criteria, and its dangerous paths run supervised with human sign-off, never as an unattended loop. The read-only observation bridge comes first precisely because it carries none of that risk.