Fleet operator recovery scenarios
The operator is stateless across restarts: it can be killed, upgraded, or rescheduled at any time. On restart it cold-rebuilds its in-memory state from two durable sources, Kubernetes CRs and NATS KV, with no customer-visible "unknown state" window.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Operator process │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Deployment │ │ Device │ │ Device-state KV │ │
│ │ CR watcher │ │ CR watcher │ │ watcher (history) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ FleetState (mutex) │ │
│ │ deployments: {key → CachedDeployment} │ │
│ │ devices: {name → labels} │ │
│ │ states: {(device,deployment) → DeploymentState} │ │
│ │ owned_targets: {deployment → {device…}} │ │
│ │ dirty: {key…} ──► patch_tick (1 Hz) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ ┌────────────────────────────────┐ │
│ │ desired-state KV │ │ Deployment.status.aggregate │ │
│ │ put / delete │ │ (patched to kube at 1 Hz) │ │
│ └──────────────────┘ └────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ OperatorLiveness (3 atomic latches, shared w/ dashboard) │ │
│ │ deployments_ready ─┐ │ │
│ │ devices_ready ─────┼──► all three → Converged │ │
│ │ states_ready ──────┘ (else: Recovering → banner) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
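The dirty set coalesces updates: however many state changes touch a deployment between ticks, each dirty key produces one status patch per tick. A minimal sketch of that idea, assuming a simplified FleetState that holds only the dirty set. The names `mark_dirty`, `drain_dirty`, and `patch_tick_once` are hypothetical and not taken from the operator's code.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

/// Simplified stand-in for the operator's FleetState: only the dirty set.
#[derive(Default)]
pub struct FleetState {
    dirty: BTreeSet<String>,
}

impl FleetState {
    /// Record that a deployment's aggregate needs re-patching.
    /// Marking the same key twice is a no-op (set semantics).
    pub fn mark_dirty(&mut self, deployment_key: &str) {
        self.dirty.insert(deployment_key.to_string());
    }

    /// Take the current dirty keys, leaving the set empty for the next tick.
    pub fn drain_dirty(&mut self) -> Vec<String> {
        std::mem::take(&mut self.dirty).into_iter().collect()
    }
}

/// One iteration of a hypothetical patch tick: lock, drain, unlock,
/// then hand back the keys whose status should be patched.
pub fn patch_tick_once(state: &Mutex<FleetState>) -> Vec<String> {
    let mut guard = state.lock().expect("FleetState mutex poisoned");
    guard.drain_dirty()
}
```

Draining under the lock and patching after releasing it keeps the mutex hold time short, so the watchers are not blocked on kube API latency.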
Cold-start sequence
Operator starts
│
├─ seed_owned_targets() reads desired-state KV keys
│ → populates owned_targets
│ (so first reconcile diffs correctly)
│
├─ spawn 3 watchers (JoinSet) │
│ ├─ Deployment CR watcher │ replay → InitDone → mark_deployments_ready
│ │ │ + gc_orphaned_desired_state
│ ├─ Device CR watcher │ replay → InitDone → mark_devices_ready
│ └─ Device-state KV watch │ replay → seen_current → mark_states_ready
│ │ (or: empty bucket → mark immediately)
│
└─ spawn patch_tick (1 Hz) flushes dirty CR status patches
Convergence banner
Dashboard reads OperatorLiveness.phase():
Recovering ──────► Converged
┌──────────────┐ ┌─────┐
│ ⚠ rebuilding │ │ │ (empty div, polling stops)
│ polls /2s │ │ │
└──────────────┘ └─────┘
The banner clears itself once all three latches are set, typically in under a second.
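The latch logic can be sketched with three atomic booleans. The field names, `mark_*_ready`, and `phase()` follow the diagrams above; the `Phase` enum name and the memory orderings are assumptions for illustration, not the operator's actual code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Recovering,
    Converged,
}

/// Three one-way latches, one per watcher. Can be shared behind an Arc
/// between the operator tasks and the dashboard.
#[derive(Default)]
pub struct OperatorLiveness {
    deployments_ready: AtomicBool,
    devices_ready: AtomicBool,
    states_ready: AtomicBool,
}

impl OperatorLiveness {
    pub fn mark_deployments_ready(&self) {
        self.deployments_ready.store(true, Ordering::Release);
    }

    pub fn mark_devices_ready(&self) {
        self.devices_ready.store(true, Ordering::Release);
    }

    pub fn mark_states_ready(&self) {
        self.states_ready.store(true, Ordering::Release);
    }

    /// Converged only when all three latches are set. Nothing ever
    /// resets a latch, so the phase cannot move back to Recovering.
    pub fn phase(&self) -> Phase {
        let all_ready = self.deployments_ready.load(Ordering::Acquire)
            && self.devices_ready.load(Ordering::Acquire)
            && self.states_ready.load(Ordering::Acquire);
        if all_ready {
            Phase::Converged
        } else {
            Phase::Recovering
        }
    }
}
```

Because each latch only ever goes from false to true, the dashboard can poll `phase()` without taking the FleetState mutex.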
Key invariants
| Invariant | Mechanism |
|---|---|
| Writes are byte-deterministic | Same CR → same serialized score JSON. Two operators write identical bytes. |
| No leader election needed | Idempotent multi-writer at current fleet size. (HA = D3, deferred.) |
| Cancel = clean teardown | JoinSet owns all watchers; dropping run's future aborts children. |
| owned_targets seeded before first reconcile | seed_owned_targets() reads KV keys so diff doesn't orphan entries. |
| Orphan GC on InitDone | Deployment CRs deleted while operator was down → purged at convergence. |
| Out-of-order state dedup | apply_state rejects entries with older last_event_at. |
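The out-of-order dedup row can be sketched as a timestamp guard on insert. This is an illustration under stated assumptions: `last_event_at` is modeled as a `u64` of milliseconds, `phase` as a plain string, and an entry with an equal timestamp is accepted. The operator's real types and tie-breaking may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeploymentState {
    /// Simplified to a string for this sketch.
    pub phase: String,
    /// Event timestamp, modeled here as milliseconds since the epoch.
    pub last_event_at: u64,
}

/// Keyed by (device, deployment), as in the FleetState diagram.
pub type States = HashMap<(String, String), DeploymentState>;

/// Store `incoming` unless a strictly newer entry is already held.
/// Returns true when the map changed.
pub fn apply_state(
    states: &mut States,
    key: (String, String),
    incoming: DeploymentState,
) -> bool {
    match states.get(&key) {
        // Older than what we hold: a late or replayed entry, drop it.
        Some(current) if incoming.last_event_at < current.last_event_at => false,
        // No entry yet, or the incoming one is at least as new: accept.
        _ => {
            states.insert(key, incoming);
            true
        }
    }
}
```

With this guard, a history replay that delivers entries out of order still ends with the newest state per (device, deployment) pair.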
Scenarios
1. Cold restart, healthy fleet
Operator killed and restarted. Devices kept reporting to KV throughout.
Time ──────────────────────────────────────────────────►
Operator: [running]─── KILL ───[restart]──────────────────►
│
├─ seed_owned_targets (from KV keys)
├─ replay CRs → InitDone
├─ replay device-state → seen_current
└─ Converged
KV: [desired-state ✓] [device-state ✓] (untouched)
Expected: desired-state re-written identically (agents byte-compare → no-op)
health counts rebuilt from device-state KV
liveness → Converged
Test: aggregator_converges_from_kv_after_restart
2. Stale KV — CR deleted while operator down
A Deployment CR is force-deleted (finalizer bypassed: namespace teardown,
--force) while no operator is running. The desired-state KV entry is orphaned.
Time ──────────────────────────────────────────────────►
Operator: [running]─── KILL ──────────────[restart]──────────►
│ │
│ ├─ seed_owned_targets
│ │ (picks up orphan key)
│ │
│ ├─ CR replay: InitDone
│ │ deployment NOT in cache
│ │
│ └─ gc_orphaned_desired_state
│ deletes orphan from KV
│
CR: [exists]─── force-delete ──[gone]─────────────────►
KV: [desired-state ✓]──────────[orphan]───[purged]────►
Expected: orphan desired-state deleted at convergence
agents stop running the dead deployment
Test: aggregator_gcs_desired_state_for_deleted_cr
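The orphan computation in this scenario is a set difference: everything seeded from KV minus everything present in the CR cache after InitDone. A minimal sketch, assuming owned_targets maps a deployment key to its device names; the function signature and return shape are hypothetical, and the actual KV deletes are left to the caller.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// deployment key -> devices that currently have a desired-state KV entry.
pub type OwnedTargets = BTreeMap<String, BTreeSet<String>>;

/// Returns the (deployment, device) pairs whose desired-state entry should
/// be deleted: every seeded target whose deployment has no live CR.
/// The orphaned deployments are also removed from `owned_targets`.
pub fn gc_orphaned_desired_state(
    owned_targets: &mut OwnedTargets,
    live_deployments: &BTreeSet<String>,
) -> Vec<(String, String)> {
    // Collect first, so the map is not mutated while it is being iterated.
    let orphans: Vec<String> = owned_targets
        .keys()
        .filter(|d| !live_deployments.contains(*d))
        .cloned()
        .collect();

    let mut to_delete = Vec::new();
    for deployment in orphans {
        if let Some(devices) = owned_targets.remove(&deployment) {
            for device in devices {
                to_delete.push((deployment.clone(), device));
            }
        }
    }
    to_delete
}
```

This only makes sense after the Deployment CR replay has finished. Run any earlier, the cache is still empty and every seeded entry would look like an orphan.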
3. Two operators racing (rolling deploy overlap)
Brief period where two operator replicas run simultaneously.
Time ──────────────────────────────────────────────────►
Operator A: [running]────────────────────────────────────────►
Operator B: [start]───────────────────────────────────►
│
├─ both seed_owned_targets (same KV keys)
├─ both replay same CRs
├─ both compute same matched_devices
└─ both put identical score_json to same KV keys
KV: [desired-state: bytes_A] == [desired-state: bytes_B]
Expected: KV value stable regardless of write order
no orphan, no flapping, no leader election needed
Test: two_aggregators_produce_identical_desired_state
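Byte-determinism comes down to never letting insertion or iteration order leak into the output. A sketch of the principle, assuming a flat string-to-string payload and a hand-rolled encoding. The operator serializes score JSON; the field names in the usage note, `canonical_bytes`, and `needs_apply` are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Encode a payload with keys in sorted order, so the output depends only
/// on the contents and never on the order in which fields were inserted.
pub fn canonical_bytes(fields: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = String::from("{");
    // BTreeMap iterates in ascending key order.
    for (i, (key, value)) in fields.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // {:?} quotes and escapes the strings using Rust's escaping rules,
        // which match JSON for plain ASCII identifiers.
        out.push_str(&format!("{:?}:{:?}", key, value));
    }
    out.push('}');
    out.into_bytes()
}

/// What an agent might do on receipt: skip the work when the bytes it
/// already holds are identical to the incoming ones.
pub fn needs_apply(current: Option<&[u8]>, incoming: &[u8]) -> bool {
    current != Some(incoming)
}
```

Two operators that build the same payload, say with hypothetical fields `image` and `replicas` inserted in opposite orders, get identical bytes from `canonical_bytes`, so whichever write lands last leaves the KV value unchanged and `needs_apply` returns false on the agent.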
4. Partial KV — device offline during restart
A device hasn't reported state (no device-state entry) when the operator
restarts.
Time ──────────────────────────────────────────────────►
Operator: [running]─── KILL ───[restart]──────────────────►
Device A: [reporting ✓]────────────────────────────────────►
Device B: [offline]────────────────────────────────────────►
(no device-state entry in KV)
Expected: Device A → Running (rebuilt from KV)
Device B → Pending (no state entry = pending in aggregate)
Device B recovers to Running when it reports
Test: device_offline_during_restart_counts_as_pending
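The "no state entry = pending" rule can be sketched as a lookup with a default. This assumes a two-variant phase and a flat aggregate; `DevicePhase`, `Aggregate`, and `aggregate` are hypothetical names, and the real aggregate likely tracks more phases.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DevicePhase {
    Pending,
    Running,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub running: usize,
    pub pending: usize,
}

/// Count every targeted device exactly once. A device with no reported
/// state entry is treated as Pending rather than being left out.
pub fn aggregate(
    targets: &BTreeSet<String>,
    reported: &BTreeMap<String, DevicePhase>,
) -> Aggregate {
    let mut agg = Aggregate::default();
    for device in targets {
        match reported.get(device).copied().unwrap_or(DevicePhase::Pending) {
            DevicePhase::Running => agg.running += 1,
            DevicePhase::Pending => agg.pending += 1,
        }
    }
    agg
}
```

Iterating over the targets rather than over the reported states is what keeps the totals stable across a restart: the counts always sum to the number of matched devices, whether or not a device has reported yet.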
5. Chaos — kill under write load
Operator repeatedly killed and restarted while deployments are being created.
Time ──────────────────────────────────────────────────►
Deployments: [create d0] [create d1] [create d2] [create d3] [create d4]
▲
Operator: [running]──────── KILL ──────┘──[restart]──── KILL ──[final]──►
│
└─ converge
all 5
Expected: final operator converges full desired-state set within 30s
Test: chaos_kill_under_write_load_converges
Running the regression tests
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
Prerequisites: k3d + podman on PATH. The shared harness brings up NATS in a fresh namespace.
How it works: tests drive fleet_aggregator::run in-process against the
real NATS + k3d. spawn_aggregator returns a JoinHandle; kill aborts the
handle and then awaits it, so the JoinSet children have torn down before
the next instance starts.