Files
harmony/docs/fleet-operator-recovery-scenarios.md
Jean-Gabriel Gill-Couture 56602b505c
All checks were successful
Run Check Script / check (pull_request) Successful in 2m35s
feat(fleet-operator): aggregator recovery signal + orphan GC + recovery e2e (Ch2)
Operator restart + aggregator recovery (v0.3 plan Ch2). The aggregator already
cold-rebuilds from NATS KV + CR watches; this makes recovery observable, closes
an orphan gap, and pins each failure shape with a regression test.

- OperatorLiveness: a shared in-process latch (Recovering → Converged) the
  aggregator sets once all three cold-start sources replay (Deployment/Device
  watcher InitDone, device-state KV seen_current; empty-bucket short-circuit).
  The in-process dashboard reads it and shows a self-clearing banner via an
  HTMX self-poll (/__recovery), so the customer sees progress, not a blank.
- gc_orphaned_desired_state: at convergence, purge desired-state whose
  Deployment CR no longer exists (force-deleted while the operator was down,
  finalizer bypassed). Belt-and-suspenders with the controller finalizer.
- run() now owns its watchers in a JoinSet, so cancelling the aggregator
  aborts its children — no orphan tasks outliving a restart (matters for the
  restart-simulation tests and clean process teardown). Also made run() Send
  (hoisted a .await out of a tracing macro) so it can be spawned.
- docs/fleet-operator-recovery-scenarios.md enumerates the failure shapes and
  maps each to its test.
- harmony-fleet-e2e/tests/operator_recovery.rs: regression test per scenario
  (cold restart converges from KV; orphan GC; two operators write identical
  bytes; chaos kill under write load converges <30s) + AdminKv::put_device_state.

Writes stay idempotent + byte-deterministic, so two operators racing agree
without leader election (operator HA = D3, deferred).
2026-06-05 15:26:00 -04:00

4.2 KiB

Fleet operator recovery scenarios

The operator can be killed, upgraded, or rescheduled at any time. When it comes back it must converge from NATS KV + the CRs alone — no customer-visible "unknown state" window. This doc enumerates the failure shapes, what correct recovery looks like, and the regression test that pins each one.

How recovery works

The operator is stateless across restarts. Everything it needs is durable:

  • Deployment / Device CRs live in etcd (kube). On restart the aggregator's watcher replays the full list (Event::InitApply …→ Event::InitDone).
  • desired-state KV is what the operator previously wrote per (device, deployment). On restart seed_owned_targets reloads it so the first reconcile diffs correctly instead of orphaning entries.
  • device-state KV is per-device phase reported by agents. On restart the state watcher replays it (watch_with_history), rebuilding health counts.

All writes are idempotent and byte-deterministic: the desired-state payload is the same serialized score for a given CR, and status patches are computed from the rebuilt caches. So a second operator racing the first writes identical bytes — no leader election needed at current fleet size (operator HA is D3, deferred). See fleet_aggregator.rs.

Convergence signal

Until the cold replay finishes, health counts can be stale. The aggregator latches an OperatorLiveness from RecoveringConverged once all three sources have replayed (Deployment InitDone, Device InitDone, device-state KV seen_current). The in-process dashboard reads it and shows a banner, so the customer sees "recovering", never a silent stale view. Convergence is typically sub-second; the banner self-clears.

Scenarios

# Shape Correct recovery Regression test
1 Cold restart, healthy fleet. Operator killed and restarted; devices kept reporting state to KV. Rebuild desired-state + health counts from KV alone. Desired-state entries are re-written identically, not churned (agents byte-compare and no-op). Liveness reaches Converged. aggregator_converges_from_kv_after_restart
2 Stale KV: CR deleted while operator down. A Deployment CR is force-deleted (finalizer bypassed — namespace teardown, --force) while the operator is down, leaving orphan desired-state. At convergence, GC any desired-state whose Deployment CR no longer exists (gc_orphaned_desired_state). Agents stop running the dead deployment. aggregator_gcs_desired_state_for_deleted_cr
3 Two operators racing. A rolling deploy briefly runs two operator replicas. Both write identical desired-state bytes; the KV value is stable; no orphan, no flapping. Idempotent multi-writer, so no leader election required. two_aggregators_produce_identical_desired_state
4 Partial KV: device offline during reset. A device hasn't reported (no device-state entry) when the operator restarts. The deployment still converges; the missing device counts as pending, not lost. Recovers to Running when the device reports. covered by #1 (one target has no state entry) + unit apply_state dedup
5 Chaos: kill under write load. Operator repeatedly killed/restarted while deployments are being created. Final state converges to the full desired-state set in < 30 s once a replica stays up. chaos_kill_under_write_load_converges

Out-of-order / replayed device-state (an older event arriving after a newer one) is handled by apply_state's last_event_at dedup — unit-tested in fleet_aggregator.rs, exercised on every replay.

Running the regression tests

HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery

Needs k3d + podman on PATH (the shared harness brings up NATS in a fresh namespace). The tests drive fleet_aggregator::run in-process against the real NATS + k3d, aborting and respawning it to simulate restarts — run owns its watchers in a JoinSet, so a cancelled aggregator leaves no orphan tasks.