Operator restart + aggregator recovery (v0.3 plan Ch2).

The aggregator already cold-rebuilds from NATS KV + CR watches; this makes recovery observable, closes an orphan gap, and pins each failure shape with a regression test.

- OperatorLiveness: a shared in-process latch (Recovering → Converged) the aggregator sets once all three cold-start sources replay (Deployment/Device watcher InitDone, device-state KV seen_current; empty-bucket short-circuit). The in-process dashboard reads it and shows a self-clearing banner via an HTMX self-poll (/__recovery), so the customer sees progress, not a blank.
- gc_orphaned_desired_state: at convergence, purge desired-state whose Deployment CR no longer exists (force-deleted while the operator was down, finalizer bypassed). Belt-and-suspenders with the controller finalizer.
- run() now owns its watchers in a JoinSet, so cancelling the aggregator aborts its children: no orphan tasks outlive a restart (matters for the restart-simulation tests and clean process teardown). Also made run() Send (hoisted a .await out of a tracing macro) so it can be spawned.
- docs/fleet-operator-recovery-scenarios.md enumerates the failure shapes and maps each to its test.
- harmony-fleet-e2e/tests/operator_recovery.rs: regression test per scenario (cold restart converges from KV; orphan GC; two operators write identical bytes; chaos kill under write load converges <30s) + AdminKv::put_device_state.

Writes stay idempotent + byte-deterministic, so two operators racing agree without leader election (operator HA = D3, deferred).
Fleet operator recovery scenarios
The operator can be killed, upgraded, or rescheduled at any time. When it comes back it must converge from NATS KV + the CRs alone — no customer-visible "unknown state" window. This doc enumerates the failure shapes, what correct recovery looks like, and the regression test that pins each one.
How recovery works
The operator is stateless across restarts. Everything it needs is durable:
- Deployment / Device CRs live in etcd (kube). On restart the aggregator's watcher replays the full list (Event::InitApply… → Event::InitDone).
- desired-state KV is what the operator previously wrote per (device, deployment). On restart seed_owned_targets reloads it so the first reconcile diffs correctly instead of orphaning entries.
- device-state KV is per-device phase reported by agents. On restart the state watcher replays it (watch_with_history), rebuilding health counts.
All writes are idempotent and byte-deterministic: the desired-state payload
is the same serialized score for a given CR, and status patches are computed
from the rebuilt caches. So a second operator racing the first writes identical
bytes — no leader election needed at current fleet size (operator HA is D3,
deferred). See fleet_aggregator.rs.
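
The byte-determinism argument can be illustrated with a minimal sketch. It is not the code in fleet_aggregator.rs: the `DesiredState` struct, its fields, the line-based `encode` format, and `needs_write` are all hypothetical names chosen for illustration. The point is only that a fixed field order plus sorted map keys makes the payload a pure function of its input, so two writers that see the same CR produce the same bytes.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the desired-state payload derived from one CR.
#[derive(Clone, Debug)]
pub struct DesiredState {
    pub deployment: String,
    pub image: String,
    /// BTreeMap iterates in key order, so encoding ignores insertion order.
    pub env: BTreeMap<String, String>,
}

/// Encode with a fixed field order and sorted map keys (assumed format,
/// not the real serialization): equal inputs always yield equal bytes.
pub fn encode(state: &DesiredState) -> Vec<u8> {
    let mut out = String::new();
    out.push_str(&format!("deployment={}\n", state.deployment));
    out.push_str(&format!("image={}\n", state.image));
    for (k, v) in &state.env {
        out.push_str(&format!("env.{k}={v}\n"));
    }
    out.into_bytes()
}

/// True when a put is needed: nothing is stored yet, or the stored bytes
/// differ from what this operator would write.
pub fn needs_write(stored: Option<&[u8]>, state: &DesiredState) -> bool {
    let fresh = encode(state);
    stored != Some(fresh.as_slice())
}
```

With an encoding like this, a second operator that rebuilds the same state from the same CR finds `needs_write` false, or overwrites the key with identical bytes. Either way the stored value does not change.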
Convergence signal
Until the cold replay finishes, health counts can be stale. The aggregator
latches an OperatorLiveness from Recovering → Converged once all three
sources have replayed (Deployment InitDone, Device InitDone, device-state KV
seen_current). The in-process dashboard reads it and shows a banner, so the
customer sees "recovering", never a silent stale view. Convergence is typically
sub-second; the banner self-clears.
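
One way to picture the latch is the sketch below. It assumes a bitmask over the three sources held in a shared atomic; the type name `LivenessLatch`, the flag constants, and the method names are hypothetical, and the real OperatorLiveness may be built differently (for example, it also has an empty-bucket short-circuit that is not modelled here).

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

/// One bit per cold-start source (hypothetical constants).
pub const DEPLOYMENTS: u8 = 0b001; // Deployment watcher reached InitDone
pub const DEVICES: u8 = 0b010; // Device watcher reached InitDone
pub const DEVICE_STATE: u8 = 0b100; // device-state KV replay caught up
const ALL: u8 = DEPLOYMENTS | DEVICES | DEVICE_STATE;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Liveness {
    Recovering,
    Converged,
}

/// Cloneable handle: the aggregator sets bits, the dashboard only reads.
#[derive(Clone, Default)]
pub struct LivenessLatch {
    replayed: Arc<AtomicU8>,
}

impl LivenessLatch {
    /// Record that one source finished replaying. Bits are only ever set,
    /// never cleared, so the state cannot fall back to Recovering.
    pub fn mark_replayed(&self, source: u8) -> Liveness {
        let prev = self.replayed.fetch_or(source, Ordering::AcqRel);
        Self::classify(prev | source)
    }

    /// Read the current state without modifying it.
    pub fn get(&self) -> Liveness {
        Self::classify(self.replayed.load(Ordering::Acquire))
    }

    fn classify(bits: u8) -> Liveness {
        if bits & ALL == ALL {
            Liveness::Converged
        } else {
            Liveness::Recovering
        }
    }
}
```

In this sketch a dashboard handler would call `get()` on its clone of the handle on each poll and render the banner only while the result is `Liveness::Recovering`.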
Scenarios
| # | Shape | Correct recovery | Regression test |
|---|---|---|---|
| 1 | Cold restart, healthy fleet. Operator killed and restarted; devices kept reporting state to KV. | Rebuild desired-state + health counts from KV alone. Desired-state entries are re-written identically, not churned (agents byte-compare and no-op). Liveness reaches Converged. | aggregator_converges_from_kv_after_restart |
| 2 | Stale KV: CR deleted while operator down. A Deployment CR is force-deleted (finalizer bypassed: namespace teardown, --force) while the operator is down, leaving orphan desired-state. | At convergence, GC any desired-state whose Deployment CR no longer exists (gc_orphaned_desired_state). Agents stop running the dead deployment. | aggregator_gcs_desired_state_for_deleted_cr |
| 3 | Two operators racing. A rolling deploy briefly runs two operator replicas. | Both write identical desired-state bytes; the KV value is stable; no orphan, no flapping. Idempotent multi-writer, so no leader election required. | two_aggregators_produce_identical_desired_state |
| 4 | Partial KV: device offline during reset. A device hasn't reported (no device-state entry) when the operator restarts. | The deployment still converges; the missing device counts as pending, not lost. Recovers to Running when the device reports. | covered by #1 (one target has no state entry) + unit apply_state dedup |
| 5 | Chaos: kill under write load. Operator repeatedly killed/restarted while deployments are being created. | Final state converges to the full desired-state set in < 30 s once a replica stays up. | chaos_kill_under_write_load_converges |
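
The orphan GC in scenario 2 amounts to a set difference between the desired-state keys and the Deployment CRs that still exist. The sketch below shows that idea over in-memory collections. The function name `gc_orphans`, the `(device, deployment)` key tuple, and the map standing in for the KV bucket are assumptions for illustration, not the signature of gc_orphaned_desired_state.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Remove every desired-state entry whose deployment is not in
/// `live_deployments`. Keys are (device, deployment); the map is a
/// stand-in for the desired-state KV bucket. Returns the removed keys.
pub fn gc_orphans(
    desired: &mut BTreeMap<(String, String), Vec<u8>>,
    live_deployments: &BTreeSet<String>,
) -> Vec<(String, String)> {
    // Collect first, then delete, so the map is not mutated while iterated.
    let orphans: Vec<(String, String)> = desired
        .keys()
        .filter(|(_, deployment)| !live_deployments.contains(deployment))
        .cloned()
        .collect();
    for key in &orphans {
        desired.remove(key);
    }
    orphans
}
```

A pass like this is only safe once the Deployment list is known to be complete. Running it before the list has fully replayed would treat not-yet-seen CRs as deleted, which is presumably why the doc ties the GC to convergence.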
Out-of-order / replayed device-state (an older event arriving after a newer
one) is handled by apply_state's last_event_at dedup — unit-tested in
fleet_aggregator.rs, exercised on every replay.
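
A last_event_at dedup of this kind can be sketched as follows. The `DeviceState` struct, the integer timestamp, and the `apply_state_sketch` name are hypothetical. Dropping events with an equal timestamp is also an assumption, since the doc only states that older events are handled.

```rust
use std::collections::HashMap;

/// Hypothetical per-device record; the timestamp type is assumed.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct DeviceState {
    pub phase: String,
    pub last_event_at: u64,
}

/// Apply `event` unless the cache already holds a same-or-newer event for
/// this device. Returns true when the cache changed.
pub fn apply_state_sketch(
    cache: &mut HashMap<String, DeviceState>,
    device: &str,
    event: DeviceState,
) -> bool {
    let stale = cache
        .get(device)
        .map_or(false, |current| current.last_event_at >= event.last_event_at);
    if stale {
        return false; // older or duplicate event: ignore it
    }
    cache.insert(device.to_string(), event);
    true
}
```

Because the comparison depends only on the stored timestamp, replaying the full history after a restart converges to the same cache regardless of delivery order.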
Running the regression tests
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
Needs k3d + podman on PATH (the shared harness brings up NATS in a fresh
namespace). The tests drive fleet_aggregator::run in-process against the real
NATS + k3d, aborting and respawning it to simulate restarts — run owns its
watchers in a JoinSet, so a cancelled aggregator leaves no orphan tasks.