harmony/docs/fleet-operator-recovery-scenarios.md
Jean-Gabriel Gill-Couture 81b0f79f55 docs(fleet): rewrite operator recovery scenarios with diagrams
ASCII diagrams per scenario replace dense paragraphs. Architecture
overview shows the three watchers, FleetState, and convergence latches.
Cold-start sequence and key invariants in scannable tables.
2026-06-09 16:45:35 -04:00

# Fleet operator recovery scenarios
The operator is **stateless across restarts**. It can be killed, upgraded, or
rescheduled at any time. On restart it cold-rebuilds from two durable sources —
kube CRs and NATS KV — with no customer-visible "unknown state" window.
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│  Operator process                                               │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Deployment  │  │    Device    │  │  Device-state KV     │   │
│  │  CR watcher  │  │  CR watcher  │  │  watcher (history)   │   │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘   │
│         │                 │                     │               │
│         ▼                 ▼                     ▼               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  FleetState (mutex)                                       │  │
│  │    deployments:   {key → CachedDeployment}                │  │
│  │    devices:       {name → labels}                         │  │
│  │    states:        {(device,deployment) → DeploymentState} │  │
│  │    owned_targets: {deployment → {device…}}                │  │
│  │    dirty:         {key…} ──► patch_tick (1 Hz)            │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌──────────────────┐   ┌────────────────────────────────┐      │
│  │ desired-state KV │   │ Deployment.status.aggregate    │      │
│  │ put / delete     │   │ (patched to kube at 1 Hz)      │      │
│  └──────────────────┘   └────────────────────────────────┘      │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  OperatorLiveness (3 atomic latches, shared w/ dashboard) │  │
│  │    deployments_ready ─┐                                   │  │
│  │    devices_ready ─────┼──► all three → Converged          │  │
│  │    states_ready ──────┘    (else: Recovering → banner)    │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
### Cold-start sequence
```
Operator starts
├─ seed_owned_targets()           reads desired-state KV keys
│                                 → populates owned_targets
│                                 (so first reconcile diffs correctly)
├─ spawn 3 watchers (JoinSet)  │
│  ├─ Deployment CR watcher    │  replay → InitDone → mark_deployments_ready
│  │                           │  + gc_orphaned_desired_state
│  ├─ Device CR watcher        │  replay → InitDone → mark_devices_ready
│  └─ Device-state KV watch    │  replay → seen_current → mark_states_ready
│                              │  (or: empty bucket → mark immediately)
└─ spawn patch_tick (1 Hz)        flushes dirty CR status patches
```
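The seeding step can be sketched as below. This is a minimal illustration, not the operator's code: the key layout `<deployment>.<device>` is a hypothetical stand-in for whatever the desired-state bucket really uses, and the function takes plain strings instead of a NATS KV handle.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Rebuilds `owned_targets` (deployment -> devices) from desired-state KV keys.
/// ASSUMPTION: keys look like "<deployment>.<device>"; the real layout may differ.
pub fn seed_owned_targets<'a>(
    keys: impl IntoIterator<Item = &'a str>,
) -> BTreeMap<String, BTreeSet<String>> {
    let mut owned: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for key in keys {
        // Keys that do not match the assumed layout are skipped.
        if let Some((deployment, device)) = key.split_once('.') {
            owned
                .entry(deployment.to_string())
                .or_default()
                .insert(device.to_string());
        }
    }
    owned
}
```

Seeding before the watchers start means the first reconcile already knows which entries this operator wrote earlier, so entries that should no longer exist show up in the diff and can be deleted.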
### Convergence banner
```
Dashboard reads OperatorLiveness.phase():
  Recovering ──────► Converged
  ┌──────────────┐   ┌─────┐
  │ ⚠ rebuilding │   │     │  (empty div, polling stops)
  │ polls /2s    │   │     │
  └──────────────┘   └─────┘
```
The banner self-clears when all three latches trip. Typically sub-second.
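A minimal sketch of the three-latch idea, assuming one `AtomicBool` per watcher. The method names follow the diagrams; the struct layout, the `Phase` enum, and the memory orderings are illustrative assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Phase reported to the dashboard (enum shape is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Recovering,
    Converged,
}

/// Three one-way latches, one per watcher. All start untripped (false).
#[derive(Default)]
pub struct OperatorLiveness {
    deployments_ready: AtomicBool,
    devices_ready: AtomicBool,
    states_ready: AtomicBool,
}

impl OperatorLiveness {
    pub fn mark_deployments_ready(&self) {
        self.deployments_ready.store(true, Ordering::Release);
    }

    pub fn mark_devices_ready(&self) {
        self.devices_ready.store(true, Ordering::Release);
    }

    pub fn mark_states_ready(&self) {
        self.states_ready.store(true, Ordering::Release);
    }

    /// Converged only once every latch has tripped; nothing resets a latch.
    pub fn phase(&self) -> Phase {
        let all_ready = self.deployments_ready.load(Ordering::Acquire)
            && self.devices_ready.load(Ordering::Acquire)
            && self.states_ready.load(Ordering::Acquire);
        if all_ready {
            Phase::Converged
        } else {
            Phase::Recovering
        }
    }
}
```

Because the latches only ever move from false to true, the watchers and the dashboard can share one instance (for example behind an `Arc`) without taking a lock.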
### Key invariants
| Invariant | Mechanism |
|---|---|
| Writes are byte-deterministic | Same CR → same serialized score JSON. Two operators write identical bytes. |
| No leader election needed | Idempotent multi-writer at current fleet size. (HA = D3, deferred.) |
| Cancel = clean teardown | `JoinSet` owns all watchers; dropping `run`'s future aborts children. |
| `owned_targets` seeded before first reconcile | `seed_owned_targets()` reads KV keys so diff doesn't orphan entries. |
| Orphan GC on `InitDone` | Deployment CRs deleted while operator was down → purged at convergence. |
| Out-of-order state dedup | `apply_state` rejects entries with older `last_event_at`. |
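
The out-of-order dedup rule from the table can be sketched as follows. Only `apply_state` and `last_event_at` come from this document; the `DeploymentState` fields, the millisecond timestamp, and the boolean return value are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a device-state entry.
#[derive(Debug, Clone, PartialEq)]
pub struct DeploymentState {
    pub phase: String,
    /// Event time as Unix milliseconds (representation is an assumption).
    pub last_event_at: u64,
}

/// Cache key: (device, deployment).
pub type StateKey = (String, String);

/// Stores `incoming` unless the cache already holds a newer entry.
/// Returns true when the entry was applied.
pub fn apply_state(
    states: &mut HashMap<StateKey, DeploymentState>,
    key: StateKey,
    incoming: DeploymentState,
) -> bool {
    match states.get(&key) {
        // A replayed or delayed entry that is older than the cached one is dropped.
        Some(existing) if incoming.last_event_at < existing.last_event_at => false,
        _ => {
            states.insert(key, incoming);
            true
        }
    }
}
```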
---
## Scenarios
### 1. Cold restart, healthy fleet
Operator killed and restarted. Devices kept reporting to KV throughout.
```
Time      ──────────────────────────────────────────────────►
Operator: [running]─── KILL ───[restart]──────────────────►
                               ├─ seed_owned_targets (from KV keys)
                               ├─ replay CRs → InitDone
                               ├─ replay device-state → seen_current
                               └─ Converged
KV:       [desired-state ✓] [device-state ✓] (untouched)
Expected: desired-state re-written identically (agents byte-compare → no-op)
          health counts rebuilt from device-state KV
          liveness → Converged
**Test:** `aggregator_converges_from_kv_after_restart`
### 2. Stale KV — CR deleted while operator down
A Deployment CR is force-deleted (finalizer bypassed: namespace teardown,
`--force`) while no operator is running. The desired-state KV entry is orphaned.
```
Time      ──────────────────────────────────────────────────►
Operator: [running]─── KILL ──────────────[restart]──────────►
                        │                 │
                        │                 ├─ seed_owned_targets
                        │                 │  (picks up orphan key)
                        │                 │
                        │                 ├─ CR replay: InitDone
                        │                 │  deployment NOT in cache
                        │                 │
                        │                 └─ gc_orphaned_desired_state
                        │                    deletes orphan from KV
CR:       [exists]─── force-delete ──[gone]─────────────────►
KV:       [desired-state ✓]──────────[orphan]───[purged]────►
Expected: orphan desired-state deleted at convergence
          agents stop running the dead deployment
```
**Test:** `aggregator_gcs_desired_state_for_deleted_cr`
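The purge step amounts to a set difference between what the operator owns in KV and what the CR replay produced. A minimal sketch, assuming `owned_targets` maps a deployment key to its devices and the cache is reduced to a set of deployment keys; the signature and return type are hypothetical, and the real function would issue the KV deletes itself.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Removes deployments that own desired-state entries but have no CR in the
/// cache, and returns the (deployment, device) pairs whose KV entries should
/// be deleted. Intended to run once, after the Deployment watcher's InitDone.
pub fn gc_orphaned_desired_state(
    owned_targets: &mut BTreeMap<String, BTreeSet<String>>,
    cached_deployments: &BTreeSet<String>,
) -> Vec<(String, String)> {
    // Collect first so the map is not mutated while it is being iterated.
    let orphans: Vec<String> = owned_targets
        .keys()
        .filter(|deployment| !cached_deployments.contains(*deployment))
        .cloned()
        .collect();

    let mut deletions = Vec::new();
    for deployment in orphans {
        if let Some(devices) = owned_targets.remove(&deployment) {
            for device in devices {
                deletions.push((deployment.clone(), device));
            }
        }
    }
    deletions
}
```

Running this only after `InitDone` matters: before the replay finishes, the cache is incomplete and live deployments would look like orphans.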
### 3. Two operators racing (rolling deploy overlap)
Brief period where two operator replicas run simultaneously.
```
Time        ──────────────────────────────────────────────────►
Operator A: [running]────────────────────────────────────────►
Operator B:        [start]───────────────────────────────────►
                   ├─ both seed_owned_targets (same KV keys)
                   ├─ both replay same CRs
                   ├─ both compute same matched_devices
                   └─ both put identical score_json to same KV keys
KV:         [desired-state: bytes_A] == [desired-state: bytes_B]
Expected:   KV value stable regardless of write order
            no orphan, no flapping, no leader election needed
```
**Test:** `two_aggregators_produce_identical_desired_state`
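Byte-determinism is what makes the race harmless. The sketch below shows one way to get it, assuming the score is a flat string map: keys are emitted in sorted order with no optional whitespace. The hand-rolled writer and the field names are illustrative stand-ins, not the operator's actual serializer.

```rust
use std::collections::BTreeMap;

/// Escapes the two characters that matter for this sketch; a real JSON
/// serializer handles the full string grammar.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Serializes a flat string map with keys in sorted order and no optional
/// whitespace, so equal maps always produce equal bytes.
pub fn score_json(fields: &BTreeMap<String, String>) -> Vec<u8> {
    // BTreeMap iterates in key order regardless of insertion order.
    let body: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
        .collect();
    format!("{{{}}}", body.join(",")).into_bytes()
}
```

With output like this, whichever operator writes last leaves the same bytes in KV, and an agent that compares bytes before acting sees no change.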
### 4. Partial KV — device offline during restart
A device hasn't reported state (no `device-state` entry) when the operator
restarts.
```
Time      ──────────────────────────────────────────────────►
Operator: [running]─── KILL ───[restart]──────────────────►
Device A: [reporting ✓]────────────────────────────────────►
Device B: [offline]────────────────────────────────────────►
          (no device-state entry in KV)
Expected: Device A → Running (rebuilt from KV)
          Device B → Pending (no state entry = pending in aggregate)
          Device B recovers to Running when it reports
```
**Test:** `device_offline_during_restart_counts_as_pending`
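The "no state entry = pending" rule can be sketched as a fold over the matched devices. The `DevicePhase` and `Aggregate` types here are hypothetical and cover only the two phases this scenario mentions; the real aggregate presumably tracks more.

```rust
use std::collections::{BTreeSet, HashMap};

/// Only the phases named in this scenario (assumed, not exhaustive).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DevicePhase {
    Pending,
    Running,
}

/// Hypothetical health counts for one deployment.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub running: usize,
    pub pending: usize,
}

/// Counts every matched device exactly once. A device with no entry in
/// `states` (keyed by (device, deployment)) is counted as pending.
pub fn aggregate(
    deployment: &str,
    matched_devices: &BTreeSet<String>,
    states: &HashMap<(String, String), DevicePhase>,
) -> Aggregate {
    let mut agg = Aggregate::default();
    for device in matched_devices {
        match states.get(&(device.clone(), deployment.to_string())) {
            Some(DevicePhase::Running) => agg.running += 1,
            Some(DevicePhase::Pending) | None => agg.pending += 1,
        }
    }
    agg
}
```

Iterating over the matched devices instead of the state entries is what keeps an offline device visible in the counts.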
### 5. Chaos — kill under write load
Operator repeatedly killed and restarted while deployments are being created.
```
Time         ──────────────────────────────────────────────────►
Deployments: [create d0] [create d1] [create d2] [create d3] [create d4]
Operator:    [running]──────── KILL ──────┘──[restart]──── KILL ──[final]──►
                                                                  └─ converge
                                                                     all 5
Expected:    final operator converges full desired-state set within 30s
```
**Test:** `chaos_kill_under_write_load_converges`

---
## Running the regression tests
```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
```
**Prerequisites:** k3d + podman on PATH. The shared harness brings up NATS in
a fresh namespace.

**How it works:** tests drive `fleet_aggregator::run` in-process against the
real NATS + k3d. `spawn_aggregator` returns a `JoinHandle`; `kill` aborts that
handle and awaits it, so the `JoinSet` children are torn down before the next
instance starts.