harmony/docs/fleet-operator-recovery-scenarios.md
Jean-Gabriel Gill-Couture 56602b505c
feat(fleet-operator): aggregator recovery signal + orphan GC + recovery e2e (Ch2)
Operator restart + aggregator recovery (v0.3 plan Ch2). The aggregator already
cold-rebuilds from NATS KV + CR watches; this makes recovery observable, closes
an orphan gap, and pins each failure shape with a regression test.

- OperatorLiveness: a shared in-process latch (Recovering → Converged) the
  aggregator sets once all three cold-start sources replay (Deployment/Device
  watcher InitDone, device-state KV seen_current; empty-bucket short-circuit).
  The in-process dashboard reads it and shows a self-clearing banner via an
  HTMX self-poll (/__recovery), so the customer sees progress, not a blank.
- gc_orphaned_desired_state: at convergence, purge desired-state whose
  Deployment CR no longer exists (force-deleted while the operator was down,
  finalizer bypassed). Belt-and-suspenders with the controller finalizer.
- run() now owns its watchers in a JoinSet, so cancelling the aggregator
  aborts its children — no orphan tasks outliving a restart (matters for the
  restart-simulation tests and clean process teardown). Also made run() Send
  (hoisted a .await out of a tracing macro) so it can be spawned.
- docs/fleet-operator-recovery-scenarios.md enumerates the failure shapes and
  maps each to its test.
- harmony-fleet-e2e/tests/operator_recovery.rs: regression test per scenario
  (cold restart converges from KV; orphan GC; two operators write identical
  bytes; chaos kill under write load converges <30s) + AdminKv::put_device_state.

Writes stay idempotent + byte-deterministic, so two operators racing agree
without leader election (operator HA = D3, deferred).
2026-06-05 15:26:00 -04:00


# Fleet operator recovery scenarios
The operator can be killed, upgraded, or rescheduled at any time. When it comes
back it must **converge from NATS KV + the CRs alone** — no customer-visible
"unknown state" window. This doc enumerates the failure shapes, what correct
recovery looks like, and the regression test that pins each one.
## How recovery works
The operator is stateless across restarts. Everything it needs is durable:
- **Deployment / Device CRs** live in etcd (kube). On restart the aggregator's
`watcher` replays the full list (`Event::InitApply` …→ `Event::InitDone`).
- **`desired-state` KV** is what the operator previously wrote per
`(device, deployment)`. On restart `seed_owned_targets` reloads it so the
first reconcile diffs correctly instead of orphaning entries.
- **`device-state` KV** is per-device phase reported by agents. On restart the
state watcher replays it (`watch_with_history`), rebuilding health counts.

All writes are **idempotent and byte-deterministic**: the desired-state payload
is the same serialized score for a given CR, and status patches are computed
from the rebuilt caches. A second operator racing the first therefore writes
identical bytes, so no leader election is needed at current fleet size
(operator HA is D3, deferred). See [`fleet_aggregator.rs`].
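
To illustrate why byte-deterministic writes make racing writers harmless, here is a minimal sketch. It is not the operator's actual encoding: `DesiredState`, `encode`, and `build_example` are hypothetical names, and the line-based format is an assumption chosen to keep the example dependency-free. The point is only that a fixed field order plus an ordered map yields the same bytes regardless of who builds the value or in what order.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the desired-state derived from one Deployment CR.
pub struct DesiredState {
    pub deployment: String,
    pub image: String,
    /// BTreeMap iterates in key order, so the encoded output never depends
    /// on insertion order or on a per-process hash seed.
    pub env: BTreeMap<String, String>,
}

/// Encode with a fixed field order (illustrative format, not the real payload).
pub fn encode(state: &DesiredState) -> Vec<u8> {
    let mut out = String::new();
    out.push_str(&format!("deployment={}\n", state.deployment));
    out.push_str(&format!("image={}\n", state.image));
    for (key, value) in &state.env {
        out.push_str(&format!("env.{key}={value}\n"));
    }
    out.into_bytes()
}

/// Build the same logical state from env pairs given in any order.
pub fn build_example(env_pairs: &[(&str, &str)]) -> DesiredState {
    DesiredState {
        deployment: "web".to_string(),
        image: "registry.example/web:1".to_string(),
        env: env_pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}
```

Two writers that call `encode(&build_example(..))` with the pairs in different orders get equal byte vectors, so whichever write lands last leaves the KV value unchanged.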
### Convergence signal
Until the cold replay finishes, health counts can be stale. The aggregator
latches an [`OperatorLiveness`] from `Recovering` → `Converged` once all three
sources have replayed (Deployment `InitDone`, Device `InitDone`, device-state KV
`seen_current`). The in-process dashboard reads it and shows a banner, so the
customer sees "recovering", never a silent stale view. Convergence is typically
sub-second; the banner self-clears.
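
A one-way latch over three replay sources could look roughly like the sketch below. This is an assumption about shape, not the contents of `liveness.rs`: `LivenessLatch`, `Source`, and `mark_replayed` are hypothetical names, and the bitmask is just one simple way to make the transition irreversible.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Liveness {
    Recovering,
    Converged,
}

/// One bit per cold-start source (hypothetical names).
#[repr(u8)]
#[derive(Debug, Clone, Copy)]
pub enum Source {
    DeploymentWatcher = 0b001,
    DeviceWatcher = 0b010,
    DeviceStateKv = 0b100,
}

const ALL_SOURCES: u8 = 0b111;

/// Bits are only ever set, never cleared, so the latch cannot fall back
/// from Converged to Recovering.
#[derive(Debug, Default)]
pub struct LivenessLatch {
    replayed: AtomicU8,
}

impl LivenessLatch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record that one source has finished its initial replay.
    pub fn mark_replayed(&self, source: Source) {
        self.replayed.fetch_or(source as u8, Ordering::AcqRel);
    }

    /// Converged only once every source has replayed.
    pub fn state(&self) -> Liveness {
        if self.replayed.load(Ordering::Acquire) == ALL_SOURCES {
            Liveness::Converged
        } else {
            Liveness::Recovering
        }
    }
}
```

Wrapped in an `Arc`, a value like this can be shared between the aggregator, which marks sources, and a dashboard handler, which only reads `state()`.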
## Scenarios
| # | Shape | Correct recovery | Regression test |
|---|-------|------------------|-----------------|
| 1 | **Cold restart, healthy fleet.** Operator killed and restarted; devices kept reporting state to KV. | Rebuild desired-state + health counts from KV alone. Desired-state entries are **re-written identically, not churned** (agents byte-compare and no-op). Liveness reaches `Converged`. | `aggregator_converges_from_kv_after_restart` |
| 2 | **Stale KV: CR deleted while operator down.** A Deployment CR is force-deleted (finalizer bypassed — namespace teardown, `--force`) while the operator is down, leaving orphan desired-state. | At convergence, GC any desired-state whose Deployment CR no longer exists (`gc_orphaned_desired_state`). Agents stop running the dead deployment. | `aggregator_gcs_desired_state_for_deleted_cr` |
| 3 | **Two operators racing.** A rolling deploy briefly runs two operator replicas. | Both write **identical** desired-state bytes; the KV value is stable; no orphan, no flapping. Idempotent multi-writer, so no leader election required. | `two_aggregators_produce_identical_desired_state` |
| 4 | **Partial KV: device offline during reset.** A device hasn't reported (no `device-state` entry) when the operator restarts. | The deployment still converges; the missing device counts as `pending`, not lost. Recovers to `Running` when the device reports. | covered by #1 (one target has no state entry) + unit `apply_state` dedup |
| 5 | **Chaos: kill under write load.** Operator repeatedly killed/restarted while deployments are being created. | Final state converges to the full desired-state set in < 30 s once a replica stays up. | `chaos_kill_under_write_load_converges` |
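
The orphan GC in scenario 2 amounts to a set difference between the desired-state keys and the live Deployment CRs. The sketch below is illustrative only: the `<device>.<deployment>` key shape, `deployment_of`, and `gc_orphans` are assumptions, and in-memory maps stand in for the KV bucket and the CR cache. It is only safe to run after the CR list has fully replayed, which matches the doc's "at convergence"; run earlier, an empty `live` set would make every entry look orphaned.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Extract the deployment name from a key shaped "<device>.<deployment>"
/// (hypothetical key layout).
pub fn deployment_of(key: &str) -> Option<&str> {
    key.split_once('.').map(|(_device, deployment)| deployment)
}

/// Remove every desired-state entry whose deployment is not in `live`.
/// Returns the removed keys so the caller can log them.
pub fn gc_orphans(
    desired: &mut BTreeMap<String, Vec<u8>>,
    live: &BTreeSet<String>,
) -> Vec<String> {
    let orphans: Vec<String> = desired
        .keys()
        .filter(|key| match deployment_of(key) {
            Some(deployment) => !live.contains(deployment),
            // Unrecognised key shape: leave it alone rather than guess.
            None => false,
        })
        .cloned()
        .collect();

    for key in &orphans {
        desired.remove(key);
    }
    orphans
}
```

Calling `gc_orphans` a second time with the same inputs removes nothing, so the purge is idempotent like the other writes.
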

Out-of-order / replayed `device-state` (an older event arriving after a newer
one) is handled by `apply_state`'s `last_event_at` dedup, which is unit-tested
in `fleet_aggregator.rs` and exercised on every replay.
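
A timestamp-based dedup of this kind can be sketched as follows. This is not the code in `fleet_aggregator.rs`: `apply_state_sketch` and `DeviceState` are hypothetical, the `u64` timestamp is an assumption, and dropping events with an equal timestamp (treating them as replays) is a design choice made here for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceState {
    pub phase: String,
    /// Event time; milliseconds in a u64 is an assumption of this sketch.
    pub last_event_at: u64,
}

/// Apply an incoming state unless the cache already holds one that is as new
/// or newer. Returns true if the cache changed.
pub fn apply_state_sketch(
    cache: &mut HashMap<String, DeviceState>,
    device: &str,
    incoming: DeviceState,
) -> bool {
    let is_stale = cache
        .get(device)
        .map_or(false, |current| incoming.last_event_at <= current.last_event_at);
    if is_stale {
        // Older or replayed event: ignore it so health counts do not regress.
        return false;
    }
    cache.insert(device.to_string(), incoming);
    true
}
```

With this rule, replaying the full history in any order leaves the cache holding the newest event per device.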
## Running the regression tests
```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
```
Needs k3d + podman on PATH (the shared harness brings up NATS in a fresh
namespace). The tests drive `fleet_aggregator::run` in-process against the real
NATS + k3d, aborting and respawning it to simulate restarts. `run` owns its
watchers in a `JoinSet`, so a cancelled aggregator leaves no orphan tasks.

[`fleet_aggregator.rs`]: ../fleet/harmony-fleet-operator/src/fleet_aggregator.rs
[`OperatorLiveness`]: ../fleet/harmony-fleet-operator/src/liveness.rs