# Fleet operator recovery scenarios

The operator is **stateless across restarts**. It can be killed, upgraded, or
rescheduled at any time. On restart it cold-rebuilds from two durable sources
(kube CRs and NATS KV), with no customer-visible "unknown state" window.

## Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                        Operator process                        │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │ Deployment   │  │ Device       │  │ Device-state KV      │  │
│  │ CR watcher   │  │ CR watcher   │  │ watcher (history)    │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         │                 │                     │              │
│         ▼                 ▼                     ▼              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    FleetState (mutex)                    │  │
│  │  deployments:   {key → CachedDeployment}                 │  │
│  │  devices:       {name → labels}                          │  │
│  │  states:        {(device,deployment) → DeploymentState}  │  │
│  │  owned_targets: {deployment → {device…}}                 │  │
│  │  dirty:         {key…} ──► patch_tick (1 Hz)             │  │
│  └──────────────────────────────────────────────────────────┘  │
│                               │                                │
│                               ▼                                │
│  ┌──────────────────┐      ┌────────────────────────────────┐  │
│  │ desired-state KV │      │ Deployment.status.aggregate    │  │
│  │ put / delete     │      │ (patched to kube at 1 Hz)      │  │
│  └──────────────────┘      └────────────────────────────────┘  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ OperatorLiveness (3 atomic latches, shared w/ dashboard) │  │
│  │   deployments_ready ─┐                                   │  │
│  │   devices_ready ─────┼──► all three → Converged          │  │
│  │   states_ready ──────┘    (else: Recovering → banner)    │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
```

### Cold-start sequence

```
Operator starts
  │
  ├─ seed_owned_targets()          reads desired-state KV keys
  │                                → populates owned_targets
  │                                (so first reconcile diffs correctly)
  │
  ├─ spawn 3 watchers (JoinSet)    │
  │    ├─ Deployment CR watcher    │ replay → InitDone → mark_deployments_ready
  │    │                           │   + gc_orphaned_desired_state
  │    ├─ Device CR watcher        │ replay → InitDone → mark_devices_ready
  │    └─ Device-state KV watch    │ replay → seen_current → mark_states_ready
  │                                │   (or: empty bucket → mark immediately)
  │
  └─ spawn patch_tick (1 Hz)       flushes dirty CR status patches
```
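
The seeding step can be pictured with a small sketch. It is illustrative only: the key layout `<deployment>.<device>` and the helper `stale_targets` are assumptions made for this example, not the operator's actual KV schema or API.

```rust
use std::collections::{BTreeSet, HashMap};

/// deployment name -> devices that currently hold a desired-state entry.
pub type OwnedTargets = HashMap<String, BTreeSet<String>>;

/// Rebuild `owned_targets` from KV key names alone.
/// ASSUMPTION: keys look like "<deployment>.<device>"; the real layout may differ.
pub fn seed_owned_targets<'a>(keys: impl IntoIterator<Item = &'a str>) -> OwnedTargets {
    let mut owned = OwnedTargets::new();
    for key in keys {
        // Keys that do not match the assumed layout are skipped.
        if let Some((deployment, device)) = key.split_once('.') {
            owned
                .entry(deployment.to_string())
                .or_default()
                .insert(device.to_string());
        }
    }
    owned
}

/// Hypothetical helper: devices that were owned before the restart but are
/// no longer matched, i.e. the entries the first reconcile should delete.
pub fn stale_targets(owned: &BTreeSet<String>, matched: &BTreeSet<String>) -> Vec<String> {
    owned.difference(matched).cloned().collect()
}
```

Without the seed, `owned` would start empty after a restart, the difference would be empty too, and entries written by the previous instance would never be cleaned up.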

### Convergence banner

```
Dashboard reads OperatorLiveness.phase():

  Recovering ──────► Converged
  ┌──────────────┐   ┌─────┐
  │ ⚠ rebuilding │   │     │  (empty div, polling stops)
  │ polls /2s    │   │     │
  └──────────────┘   └─────┘
```

The banner clears itself once all three latches trip, typically in under a second.
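
A minimal sketch of the three-latch idea, using only the standard library. The field and method names mirror the diagrams above, but the struct layout, the `Phase` enum, and the memory orderings are assumptions for illustration rather than the operator's actual definition.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Recovering,
    Converged,
}

/// One-way latches: each flips to true once and is never reset.
#[derive(Default)]
pub struct OperatorLiveness {
    deployments_ready: AtomicBool,
    devices_ready: AtomicBool,
    states_ready: AtomicBool,
}

impl OperatorLiveness {
    pub fn mark_deployments_ready(&self) {
        self.deployments_ready.store(true, Ordering::Release);
    }

    pub fn mark_devices_ready(&self) {
        self.devices_ready.store(true, Ordering::Release);
    }

    pub fn mark_states_ready(&self) {
        self.states_ready.store(true, Ordering::Release);
    }

    /// Converged only when all three watchers have finished their replay.
    pub fn phase(&self) -> Phase {
        let all_ready = self.deployments_ready.load(Ordering::Acquire)
            && self.devices_ready.load(Ordering::Acquire)
            && self.states_ready.load(Ordering::Acquire);
        if all_ready {
            Phase::Converged
        } else {
            Phase::Recovering
        }
    }
}
```

Wrapping the struct in an `Arc` would let the watchers and the dashboard handler share it without a lock, since every method takes `&self`.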

### Key invariants

| Invariant | Mechanism |
|---|---|
| Writes are byte-deterministic | Same CR → same serialized score JSON. Two operators write identical bytes. |
| No leader election needed | Idempotent multi-writer at current fleet size. (HA = D3, deferred.) |
| Cancel = clean teardown | `JoinSet` owns all watchers; dropping `run`'s future aborts children. |
| `owned_targets` seeded before first reconcile | `seed_owned_targets()` reads KV keys so diff doesn't orphan entries. |
| Orphan GC on `InitDone` | Deployment CRs deleted while operator was down → purged at convergence. |
| Out-of-order state dedup | `apply_state` rejects entries with older `last_event_at`. |
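
The out-of-order dedup row can be sketched as follows. The `DeploymentState` fields, the `u64` timestamp, the boolean return value, and the choice to accept equal timestamps are all assumptions for this example; only the rule "reject entries with an older `last_event_at`" comes from the table.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeploymentState {
    pub phase: String,
    /// ASSUMPTION: a comparable timestamp such as Unix milliseconds.
    pub last_event_at: u64,
}

/// (device, deployment) -> latest accepted state.
pub type States = HashMap<(String, String), DeploymentState>;

/// Returns true if `incoming` was stored, false if it was rejected as stale.
pub fn apply_state(
    states: &mut States,
    device: &str,
    deployment: &str,
    incoming: DeploymentState,
) -> bool {
    let key = (device.to_string(), deployment.to_string());
    // An entry is stale only if something strictly newer is already cached.
    let is_stale = states
        .get(&key)
        .map_or(false, |current| incoming.last_event_at < current.last_event_at);
    if is_stale {
        return false;
    }
    states.insert(key, incoming);
    true
}
```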

---

## Scenarios

### 1. Cold restart, healthy fleet

The operator is killed and restarted. Devices keep reporting to KV throughout.

```
Time ──────────────────────────────────────────────────►

Operator:  [running]─── KILL ───[restart]──────────────────►
                                    │
                                    ├─ seed_owned_targets (from KV keys)
                                    ├─ replay CRs → InitDone
                                    ├─ replay device-state → seen_current
                                    └─ Converged

KV:        [desired-state ✓]  [device-state ✓]   (untouched)

Expected:  desired-state re-written identically (agents byte-compare → no-op)
           health counts rebuilt from device-state KV
           liveness → Converged
```

**Test:** `aggregator_converges_from_kv_after_restart`

### 2. Stale KV: CR deleted while operator down

A Deployment CR is force-deleted (finalizer bypassed: namespace teardown,
`--force`) while no operator is running. The desired-state KV entry is orphaned.

```
Time ──────────────────────────────────────────────────►

Operator:  [running]─── KILL ──────────────[restart]──────────►
                         │                     │
                         │                     ├─ seed_owned_targets
                         │                     │    (picks up orphan key)
                         │                     │
                         │                     ├─ CR replay: InitDone
                         │                     │    deployment NOT in cache
                         │                     │
                         │                     └─ gc_orphaned_desired_state
                         │                          deletes orphan from KV
                         │
CR:        [exists]─── force-delete ──[gone]─────────────────►
KV:        [desired-state ✓]──────────[orphan]───[purged]────►

Expected:  orphan desired-state deleted at convergence
           agents stop running the dead deployment
```

**Test:** `aggregator_gcs_desired_state_for_deleted_cr`
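
The orphan check amounts to a set difference between the deployments named in KV keys and the deployments present in the CR cache. The sketch below assumes a `<deployment>.<device>` key layout and a hypothetical function name; it only computes which keys to delete and leaves the KV calls out.

```rust
use std::collections::BTreeSet;

/// Return the desired-state keys whose deployment has no CR in the cache.
/// Only meaningful after `InitDone`, when the cache is known to be complete;
/// before that, a missing deployment may simply not have been replayed yet.
/// ASSUMPTION: keys look like "<deployment>.<device>".
pub fn orphaned_desired_state_keys<'a>(
    kv_keys: impl IntoIterator<Item = &'a str>,
    cached_deployments: &BTreeSet<String>,
) -> Vec<&'a str> {
    kv_keys
        .into_iter()
        .filter(|key| match key.split_once('.') {
            Some((deployment, _device)) => !cached_deployments.contains(deployment),
            // Unknown layout: leave the key alone rather than guess.
            None => false,
        })
        .collect()
}
```

Running the check before the replay finishes would purge live deployments, which is why the cold-start sequence ties it to `InitDone`.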

### 3. Two operators racing (rolling deploy overlap)

For a brief period, two operator replicas run simultaneously.

```
Time ──────────────────────────────────────────────────►

Operator A: [running]────────────────────────────────────────►
Operator B:           [start]───────────────────────────────────►
                         │
                         ├─ both seed_owned_targets (same KV keys)
                         ├─ both replay same CRs
                         ├─ both compute same matched_devices
                         └─ both put identical score_json to same KV keys

KV:  [desired-state: bytes_A] == [desired-state: bytes_B]

Expected:  KV value stable regardless of write order
           no orphan, no flapping, no leader election needed
```

**Test:** `two_aggregators_produce_identical_desired_state`
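
One common way to get byte-stable output is to serialize from a sorted map with fixed formatting, so insertion order never leaks into the bytes. The sketch below is a hand-rolled illustration for flat string fields only; `canonical_json` and `escape` are hypothetical names, and the operator's real score serialization is not shown in this document.

```rust
use std::collections::BTreeMap;

/// Escape the characters JSON requires inside a string literal.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render a flat string map as JSON with keys in sorted order and no
/// optional whitespace, so equal inputs always yield equal bytes.
pub fn canonical_json(fields: &BTreeMap<String, String>) -> String {
    let body: Vec<String> = fields
        .iter() // BTreeMap iterates in key order
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
        .collect();
    format!("{{{}}}", body.join(","))
}
```

Two replicas that build the map in different orders still produce the same string, which is what lets agents treat the second write as a no-op.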

### 4. Partial KV: device offline during restart

A device hasn't reported state (no `device-state` entry) when the operator
restarts.

```
Time ──────────────────────────────────────────────────►

Operator:  [running]─── KILL ───[restart]──────────────────►

Device A:  [reporting ✓]────────────────────────────────────►
Device B:  [offline]────────────────────────────────────────►
           (no device-state entry in KV)

Expected:  Device A → Running  (rebuilt from KV)
           Device B → Pending  (no state entry = pending in aggregate)
           Device B recovers to Running when it reports
```

**Test:** `device_offline_during_restart_counts_as_pending`
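
The "no state entry = pending" rule can be expressed as a default when counting. This sketch models only the two phases this scenario mentions; the `Aggregate` struct, the `DevicePhase` enum, and the `aggregate` function are hypothetical stand-ins for whatever `Deployment.status.aggregate` actually contains.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DevicePhase {
    Running,
    Pending,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub running: usize,
    pub pending: usize,
}

/// Count phases over every targeted device. A device with no reported
/// state is counted as Pending instead of being dropped from the total.
pub fn aggregate(targets: &[&str], reported: &HashMap<&str, DevicePhase>) -> Aggregate {
    let mut agg = Aggregate::default();
    for device in targets {
        match reported.get(device).copied().unwrap_or(DevicePhase::Pending) {
            DevicePhase::Running => agg.running += 1,
            DevicePhase::Pending => agg.pending += 1,
        }
    }
    agg
}
```

Iterating over the targets rather than over the reported states is what keeps the offline device visible in the counts.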

### 5. Chaos: kill under write load

The operator is repeatedly killed and restarted while deployments are being created.

```
Time ──────────────────────────────────────────────────►

Deployments: [create d0] [create d1] [create d2] [create d3] [create d4]
                                          ▲
Operator:    [running]──────── KILL ──────┘──[restart]──── KILL ──[final]──►
                                                                     │
                                                                     └─ converge
                                                                        all 5

Expected:  final operator converges full desired-state set within 30s
```

**Test:** `chaos_kill_under_write_load_converges`

---

## Running the regression tests

```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
```

**Prerequisites:** `k3d` and `podman` on `PATH`. The shared harness brings up NATS in
a fresh namespace.

**How it works:** tests drive `fleet_aggregator::run` in-process against the
real NATS and k3d. `spawn_aggregator` returns a `JoinHandle`; `kill` aborts it
and awaits it, which ensures the `JoinSet` children tear down before the next
instance starts.