harmony/docs/fleet-operator-recovery-scenarios.md
Jean-Gabriel Gill-Couture 81b0f79f55 docs(fleet): rewrite operator recovery scenarios with diagrams
ASCII diagrams per scenario replace dense paragraphs. Architecture
overview shows the three watchers, FleetState, and convergence latches.
Cold-start sequence and key invariants in scannable tables.
2026-06-09 16:45:35 -04:00

Fleet operator recovery scenarios

The operator is stateless across restarts. It can be killed, upgraded, or rescheduled at any time. On restart it cold-rebuilds from two durable sources — kube CRs and NATS KV — with no customer-visible "unknown state" window.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Operator process                         │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │ Deployment   │  │ Device       │  │ Device-state KV      │  │
│  │ CR watcher   │  │ CR watcher   │  │ watcher (history)    │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         │                 │                      │              │
│         ▼                 ▼                      ▼              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    FleetState (mutex)                     │  │
│  │  deployments: {key → CachedDeployment}                   │  │
│  │  devices:     {name → labels}                            │  │
│  │  states:      {(device,deployment) → DeploymentState}    │  │
│  │  owned_targets: {deployment → {device…}}                 │  │
│  │  dirty:       {key…}  ──► patch_tick (1 Hz)              │  │
│  └──────────────────────────────────────────────────────────┘  │
│         │                                                       │
│         ▼                                                       │
│  ┌──────────────────┐     ┌────────────────────────────────┐   │
│  │ desired-state KV │     │ Deployment.status.aggregate    │   │
│  │ put / delete     │     │ (patched to kube at 1 Hz)      │   │
│  └──────────────────┘     └────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ OperatorLiveness (3 atomic latches, shared w/ dashboard) │  │
│  │   deployments_ready ─┐                                   │  │
│  │   devices_ready ─────┼──► all three → Converged          │  │
│  │   states_ready ──────┘    (else: Recovering → banner)    │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
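
The latch block at the bottom of the diagram can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the field names and `phase()` come from the diagram, while the `Phase` enum, the use of `AtomicBool`, and the memory orderings are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Phase reported to the dashboard (enum shape is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Recovering,
    Converged,
}

/// Three one-way latches, one per watcher. Each flips to true once and is
/// never reset for the lifetime of the process.
#[derive(Debug, Default)]
pub struct OperatorLiveness {
    deployments_ready: AtomicBool,
    devices_ready: AtomicBool,
    states_ready: AtomicBool,
}

impl OperatorLiveness {
    pub fn mark_deployments_ready(&self) {
        self.deployments_ready.store(true, Ordering::Release);
    }

    pub fn mark_devices_ready(&self) {
        self.devices_ready.store(true, Ordering::Release);
    }

    pub fn mark_states_ready(&self) {
        self.states_ready.store(true, Ordering::Release);
    }

    /// Converged only when all three latches have tripped.
    pub fn phase(&self) -> Phase {
        let all_ready = self.deployments_ready.load(Ordering::Acquire)
            && self.devices_ready.load(Ordering::Acquire)
            && self.states_ready.load(Ordering::Acquire);
        if all_ready {
            Phase::Converged
        } else {
            Phase::Recovering
        }
    }
}
```

Because every method takes `&self`, a single instance can be shared between the watchers and the dashboard behind an `Arc` without a lock.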

Cold-start sequence

Operator starts
  │
  ├─ seed_owned_targets()       reads desired-state KV keys
  │                               → populates owned_targets
  │                                 (so first reconcile diffs correctly)
  │
  ├─ spawn 3 watchers (JoinSet) │
  │   ├─ Deployment CR watcher  │  replay → InitDone → mark_deployments_ready
  │   │                         │           + gc_orphaned_desired_state
  │   ├─ Device CR watcher      │  replay → InitDone → mark_devices_ready
  │   └─ Device-state KV watch  │  replay → seen_current → mark_states_ready
  │                             │  (or: empty bucket → mark immediately)
  │
  └─ spawn patch_tick (1 Hz)    flushes dirty CR status patches
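
Why seeding has to happen before the first reconcile can be illustrated with a small diff sketch. The names `seed_owned_targets` and `owned_targets` come from the sequence above; `reconcile_diff`, the `OwnedTargets` alias, and the input shape (already-parsed `(deployment, device)` pairs rather than raw KV keys) are hypothetical and chosen only for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// deployment key -> devices that currently hold a desired-state entry.
pub type OwnedTargets = BTreeMap<String, BTreeSet<String>>;

/// Rebuild owned_targets from (deployment, device) pairs recovered from the
/// desired-state KV keys. Parsing the real key format is omitted here.
pub fn seed_owned_targets<'a>(
    pairs: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> OwnedTargets {
    let mut owned = OwnedTargets::new();
    for (deployment, device) in pairs {
        owned
            .entry(deployment.to_string())
            .or_default()
            .insert(device.to_string());
    }
    owned
}

/// Diff what the operator believes it owns against what matches now.
/// Returns (devices to put, devices to delete).
pub fn reconcile_diff(
    owned: &OwnedTargets,
    deployment: &str,
    matched: &BTreeSet<String>,
) -> (Vec<String>, Vec<String>) {
    let empty = BTreeSet::new();
    let current = owned.get(deployment).unwrap_or(&empty);
    // Puts are idempotent, so every matched device is (re)written.
    let puts = matched.iter().cloned().collect();
    // Only entries the operator knows it owns can be scheduled for deletion.
    let deletes = current.difference(matched).cloned().collect();
    (puts, deletes)
}
```

With an unseeded (empty) `owned` map the delete list is always empty, so an entry for a device that no longer matches would stay in the KV bucket indefinitely. Seeding first lets the same diff catch it.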

Convergence banner

Dashboard reads OperatorLiveness.phase():

  Recovering ──────► Converged
  ┌──────────────┐   ┌─────┐
  │ ⚠ rebuilding │   │     │  (empty div, polling stops)
  │ polls /2s    │   │     │
  └──────────────┘   └─────┘

The banner clears itself once all three latches have tripped, which typically takes less than a second.

Key invariants

| Invariant | Mechanism |
|---|---|
| Writes are byte-deterministic | Same CR → same serialized score JSON. Two operators write identical bytes. |
| No leader election needed | Idempotent multi-writer at current fleet size. (HA = D3, deferred.) |
| Cancel = clean teardown | JoinSet owns all watchers; dropping run's future aborts children. |
| owned_targets seeded before first reconcile | seed_owned_targets() reads KV keys so diff doesn't orphan entries. |
| Orphan GC on InitDone | Deployment CRs deleted while operator was down → purged at convergence. |
| Out-of-order state dedup | apply_state rejects entries with older last_event_at. |
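
The out-of-order dedup rule can be sketched as below. Only the name `apply_state`, the `(device, deployment)` key, and the `last_event_at` field come from this document; the struct layout, the integer timestamp, the boolean return value, and the choice to accept equal timestamps are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Minimal stand-in for the per-device state record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeploymentState {
    pub phase: String,
    /// Event timestamp; a plain integer (e.g. Unix millis) in this sketch.
    pub last_event_at: u64,
}

/// (device, deployment), matching the `states` map in FleetState.
pub type StateKey = (String, String);

/// Returns true if the entry was applied, false if it was rejected as stale.
pub fn apply_state(
    states: &mut HashMap<StateKey, DeploymentState>,
    key: StateKey,
    incoming: DeploymentState,
) -> bool {
    if let Some(existing) = states.get(&key) {
        // An older event arriving late must not overwrite newer state.
        if incoming.last_event_at < existing.last_event_at {
            return false;
        }
    }
    states.insert(key, incoming);
    true
}
```

Rejecting by timestamp rather than by arrival order is what makes a KV history replay safe: the final map does not depend on the order in which entries are delivered.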

Scenarios

1. Cold restart, healthy fleet

Operator killed and restarted. Devices kept reporting to KV throughout.

Time ──────────────────────────────────────────────────►

Operator:   [running]─── KILL ───[restart]──────────────────►
                                │
                                ├─ seed_owned_targets (from KV keys)
                                ├─ replay CRs → InitDone
                                ├─ replay device-state → seen_current
                                └─ Converged

KV:         [desired-state ✓]  [device-state ✓]  (untouched)

Expected:   desired-state re-written identically (agents byte-compare → no-op)
            health counts rebuilt from device-state KV
            liveness → Converged

Test: aggregator_converges_from_kv_after_restart

2. Stale KV — CR deleted while operator down

A Deployment CR is force-deleted (finalizer bypassed: namespace teardown, --force) while no operator is running. The desired-state KV entry is orphaned.

Time ──────────────────────────────────────────────────►

Operator:   [running]─── KILL ──────────────[restart]──────────►
                                │                  │
                                │                  ├─ seed_owned_targets
                                │                  │   (picks up orphan key)
                                │                  │
                                │                  ├─ CR replay: InitDone
                                │                  │   deployment NOT in cache
                                │                  │
                                │                  └─ gc_orphaned_desired_state
                                │                      deletes orphan from KV
                                │
CR:         [exists]─── force-delete ──[gone]─────────────────►
KV:         [desired-state ✓]──────────[orphan]───[purged]────►

Expected:   orphan desired-state deleted at convergence
            agents stop running the dead deployment

Test: aggregator_gcs_desired_state_for_deleted_cr
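
The GC decision in this scenario reduces to a set difference between the seeded owned_targets and the deployments present in the cache after CR replay. A minimal sketch, assuming both are available as plain collections; `orphaned_entries` is a hypothetical helper name, and the actual KV deletes are left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns the (deployment, device) desired-state entries whose Deployment CR
/// no longer exists. Only meaningful after the CR watcher has reached
/// InitDone: before that, the cache is incomplete and live deployments would
/// look like orphans.
pub fn orphaned_entries(
    owned_targets: &BTreeMap<String, BTreeSet<String>>,
    live_deployments: &BTreeSet<String>,
) -> Vec<(String, String)> {
    owned_targets
        .iter()
        .filter(|(deployment, _)| !live_deployments.contains(*deployment))
        .flat_map(|(deployment, devices)| {
            devices
                .iter()
                .map(move |device| (deployment.clone(), device.clone()))
        })
        .collect()
}
```

Each returned pair corresponds to one desired-state key to delete from KV.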

3. Two operators racing (rolling deploy overlap)

Brief period where two operator replicas run simultaneously.

Time ──────────────────────────────────────────────────►

Operator A: [running]────────────────────────────────────────►
Operator B:        [start]───────────────────────────────────►
                   │
                   ├─ both seed_owned_targets (same KV keys)
                   ├─ both replay same CRs
                   ├─ both compute same matched_devices
                   └─ both put identical score_json to same KV keys

KV:         [desired-state: bytes_A]  ==  [desired-state: bytes_B]

Expected:   KV value stable regardless of write order
            no orphan, no flapping, no leader election needed

Test: two_aggregators_produce_identical_desired_state
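
One way to get byte-identical output from independent writers is to serialize from an ordered map so that field order never depends on insertion order. The sketch below is a std-only illustration of that idea, not the operator's serializer: `canonical_bytes` and `escape` are hypothetical helpers, and the flat string-to-string payload is a simplification of the real score JSON.

```rust
use std::collections::BTreeMap;

/// Escape the characters JSON requires inside a string literal.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize a flat object with keys in sorted order and no optional
/// whitespace, so equal inputs always produce equal bytes.
pub fn canonical_bytes(fields: &BTreeMap<String, String>) -> Vec<u8> {
    let body: Vec<String> = fields
        .iter() // BTreeMap iterates in key order
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
        .collect();
    format!("{{{}}}", body.join(",")).into_bytes()
}
```

Two operators that build the map from the same CR, in any order, write the same bytes, which is what lets agents treat the second write as a no-op.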

4. Partial KV — device offline during restart

A device hasn't reported state (no device-state entry) when the operator restarts.

Time ──────────────────────────────────────────────────►

Operator:   [running]─── KILL ───[restart]──────────────────►

Device A:   [reporting ✓]────────────────────────────────────►
Device B:   [offline]────────────────────────────────────────►
            (no device-state entry in KV)

Expected:   Device A → Running (rebuilt from KV)
            Device B → Pending (no state entry = pending in aggregate)
            Device B recovers to Running when it reports

Test: device_offline_during_restart_counts_as_pending
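
The counting rule in this scenario (no state entry counts as pending) can be sketched as follows. The `Aggregate` and `ReportedPhase` types and the `aggregate` function are hypothetical; the real aggregate may track more phases than the two named here.

```rust
use std::collections::{BTreeSet, HashMap};

/// Phases named in this scenario; the real state type may have more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReportedPhase {
    Running,
    Pending,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub running: usize,
    pub pending: usize,
}

/// Count every targeted device exactly once. A device with no device-state
/// entry is counted as pending rather than dropped from the totals.
pub fn aggregate(
    targets: &BTreeSet<String>,
    reported: &HashMap<String, ReportedPhase>,
) -> Aggregate {
    let mut agg = Aggregate::default();
    for device in targets {
        match reported.get(device) {
            Some(ReportedPhase::Running) => agg.running += 1,
            Some(ReportedPhase::Pending) | None => agg.pending += 1,
        }
    }
    agg
}
```

Iterating over the target set rather than over the reported states is the design point: the totals always add up to the number of targeted devices, so an offline device shows up as pending instead of silently disappearing.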

5. Chaos — kill under write load

Operator repeatedly killed and restarted while deployments are being created.

Time ──────────────────────────────────────────────────►

Deployments:  [create d0] [create d1] [create d2] [create d3] [create d4]
                                           ▲
Operator:     [running]──────── KILL ──────┘──[restart]──── KILL ──[final]──►
                                                                     │
                                                                     └─ converge
                                                                        all 5

Expected:   final operator converges full desired-state set within 30s

Test: chaos_kill_under_write_load_converges


Running the regression tests

HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery

Prerequisites: k3d + podman on PATH. The shared harness brings up NATS in a fresh namespace.

How it works: tests drive fleet_aggregator::run in-process against the real NATS + k3d. spawn_aggregator returns a JoinHandle; kill aborts that handle and then awaits it, so the JoinSet children are torn down before the next instance starts.