harmony/docs/fleet-operator-recovery-scenarios.md

# Fleet operator recovery scenarios

The operator can be killed, upgraded, or rescheduled at any time. When it comes
back it must **converge from NATS KV + the CRs alone** — no customer-visible
"unknown state" window. This doc enumerates the failure shapes, what correct
recovery looks like, and the regression test that pins each one.

## How recovery works

The operator is stateless across restarts. Everything it needs is durable:

- **Deployment / Device CRs** live in etcd (kube). On restart the aggregator's
  `watcher` replays the full list (`Event::InitApply` …→ `Event::InitDone`).
- **`desired-state` KV** is what the operator previously wrote per
  `(device, deployment)`. On restart `seed_owned_targets` reloads it so the
  first reconcile diffs correctly instead of orphaning entries.
- **`device-state` KV** is per-device phase reported by agents. On restart the
  state watcher replays it (`watch_with_history`), rebuilding health counts.
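
The reason seeding matters for the first reconcile can be sketched as a set diff. This is a minimal illustration, not the operator's code: `OwnedTarget`, `ReconcilePlan`, and `plan_reconcile` are hypothetical names, and the `(device, deployment)` key is modeled as a pair of strings.

```rust
use std::collections::BTreeSet;

/// Hypothetical key shape: (device, deployment).
pub type OwnedTarget = (String, String);

/// What a reconcile pass would do to the desired-state KV.
#[derive(Debug)]
pub struct ReconcilePlan {
    pub to_write: Vec<OwnedTarget>,
    pub to_delete: Vec<OwnedTarget>,
}

/// Diff the targets the operator believes it owns against the targets the
/// current CRs want. Entries that are owned but no longer wanted get deleted.
pub fn plan_reconcile(
    owned: &BTreeSet<OwnedTarget>,
    wanted: &BTreeSet<OwnedTarget>,
) -> ReconcilePlan {
    ReconcilePlan {
        // Every wanted entry is (re)written; identical bytes make this harmless.
        to_write: wanted.iter().cloned().collect(),
        // Only entries known to be owned can be detected as stale.
        to_delete: owned.difference(wanted).cloned().collect(),
    }
}
```

With an empty `owned` set (nothing seeded after a restart), `to_delete` is always empty, so an entry written before the restart and no longer wanted would stay in KV. Seeding `owned` from the KV first lets the same diff find it.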

All writes are **idempotent and byte-deterministic**: the desired-state payload
is the same serialized score for a given CR, and status patches are computed
from the rebuilt caches. So a second operator racing the first writes identical
bytes — no leader election needed at current fleet size (operator HA is D3,
deferred). See [`fleet_aggregator.rs`].
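
As a rough illustration of what "byte-deterministic" buys, the sketch below encodes a stand-in score with a fixed field order and sorted keys, then models the agent-side byte comparison. The `Score` type, `encode`, and `agent_needs_apply` are assumptions made for this example; the real payload format is not described in this document.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the score derived from a Deployment CR.
#[derive(Debug, Clone)]
pub struct Score {
    pub image: String,
    /// BTreeMap iterates in sorted key order, independent of insertion order.
    pub env: BTreeMap<String, String>,
}

/// Encode with a fixed field order and sorted keys, so equal inputs always
/// produce equal bytes regardless of which writer builds them.
pub fn encode(score: &Score) -> Vec<u8> {
    let mut out = String::new();
    out.push_str("image=");
    out.push_str(&score.image);
    out.push('\n');
    for (key, value) in &score.env {
        out.push_str("env.");
        out.push_str(key);
        out.push('=');
        out.push_str(value);
        out.push('\n');
    }
    out.into_bytes()
}

/// Agent-side check: re-apply only when the incoming bytes differ from what
/// is already running.
pub fn agent_needs_apply(current: Option<&[u8]>, incoming: &[u8]) -> bool {
    current != Some(incoming)
}
```

Two writers that build the same `Score` in a different insertion order still emit the same bytes, so the second write is a no-op from the agent's point of view.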

### Convergence signal

Until the cold replay finishes, health counts can be stale. The aggregator holds
an [`OperatorLiveness`] value that latches from `Recovering` to `Converged` once
all three sources have replayed: Deployment `InitDone`, Device `InitDone`, and
device-state KV `seen_current`. The in-process dashboard reads it and shows a
banner while the operator is `Recovering`, so the customer sees "recovering",
never a silent stale view. Convergence is typically sub-second, after which the
banner clears itself.
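
The latch can be pictured as three one-way flags. This is a sketch under assumptions: the actual shape of `OperatorLiveness` in `liveness.rs` is not shown here, and `ReplayProgress` and `ReplaySource` are hypothetical names introduced for illustration.

```rust
/// Assumed two-state shape; the real type may carry more detail.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperatorLiveness {
    Recovering,
    Converged,
}

/// Hypothetical identifiers for the three replay sources.
#[derive(Debug, Clone, Copy)]
pub enum ReplaySource {
    Deployments,
    Devices,
    DeviceStateKv,
}

/// Flags only ever flip from false to true, so once the state is
/// `Converged` it cannot fall back to `Recovering`.
#[derive(Debug, Default)]
pub struct ReplayProgress {
    deployments_done: bool,
    devices_done: bool,
    device_state_done: bool,
}

impl ReplayProgress {
    /// Record that one source finished its initial replay.
    pub fn mark_replayed(&mut self, source: ReplaySource) -> OperatorLiveness {
        match source {
            ReplaySource::Deployments => self.deployments_done = true,
            ReplaySource::Devices => self.devices_done = true,
            ReplaySource::DeviceStateKv => self.device_state_done = true,
        }
        self.liveness()
    }

    /// `Converged` only when all three sources have replayed.
    pub fn liveness(&self) -> OperatorLiveness {
        if self.deployments_done && self.devices_done && self.device_state_done {
            OperatorLiveness::Converged
        } else {
            OperatorLiveness::Recovering
        }
    }
}
```

The order in which the sources finish does not matter, and marking a source twice has no effect.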

## Scenarios

| # | Shape | Correct recovery | Regression test |
|---|-------|------------------|-----------------|
| 1 | **Cold restart, healthy fleet.** Operator killed and restarted; devices kept reporting state to KV. | Rebuild desired-state + health counts from KV alone. Desired-state entries are **re-written identically, not churned** (agents byte-compare and no-op). Liveness reaches `Converged`. | `aggregator_converges_from_kv_after_restart` |
| 2 | **Stale KV: CR deleted while operator down.** A Deployment CR is force-deleted (finalizer bypassed — namespace teardown, `--force`) while the operator is down, leaving orphan desired-state. | At convergence, GC any desired-state whose Deployment CR no longer exists (`gc_orphaned_desired_state`). Agents stop running the dead deployment. | `aggregator_gcs_desired_state_for_deleted_cr` |
| 3 | **Two operators racing.** A rolling deploy briefly runs two operator replicas. | Both write **identical** desired-state bytes; the KV value is stable; no orphan, no flapping. Idempotent multi-writer, so no leader election required. | `two_aggregators_produce_identical_desired_state` |
| 4 | **Partial KV: device offline during restart.** A device hasn't reported (no `device-state` entry) when the operator restarts. | The deployment still converges; the missing device counts as `pending`, not lost. Recovers to `Running` when the device reports. | covered by #1 (one target has no state entry) + the `apply_state` dedup unit test |
| 5 | **Chaos: kill under write load.** Operator repeatedly killed/restarted while deployments are being created. | Final state converges to the full desired-state set in < 30 s once a replica stays up. | `chaos_kill_under_write_load_converges` |
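
The garbage collection in scenario 2 amounts to selecting desired-state entries whose deployment is absent from the replayed CR set. The sketch below is an illustration only: `orphaned_targets` and `TargetKey` are hypothetical names, not the signature of `gc_orphaned_desired_state`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical key shape: (device, deployment).
pub type TargetKey = (String, String);

/// Return the desired-state keys whose deployment no longer has a CR.
/// The caller is assumed to invoke this only after the Deployment replay has
/// finished; before that, `live_deployments` is incomplete and live entries
/// would look orphaned.
pub fn orphaned_targets(
    desired_state: &BTreeMap<TargetKey, Vec<u8>>,
    live_deployments: &BTreeSet<String>,
) -> Vec<TargetKey> {
    desired_state
        .keys()
        .filter(|key| !live_deployments.contains(&key.1))
        .cloned()
        .collect()
}
```

Running the selection before the Deployment list has fully replayed would delete entries for deployments that still exist, which is presumably why the table ties GC to convergence.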

Out-of-order / replayed `device-state` (an older event arriving after a newer
one) is handled by `apply_state`'s `last_event_at` dedup — unit-tested in
`fleet_aggregator.rs`, exercised on every replay.
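
A timestamp-based dedup of this kind can be sketched as follows. This is not the body of `apply_state`: the `DeviceState` struct, the `apply_if_newer` name, and the use of a `u64` for `last_event_at` are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical cached view of one device's reported state.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceState {
    pub phase: String,
    /// Assumed to be a monotonically comparable timestamp.
    pub last_event_at: u64,
}

/// Apply `incoming` only if it is strictly newer than the cached entry.
/// Returns true when the cache changed.
pub fn apply_if_newer(
    cache: &mut HashMap<String, DeviceState>,
    device: &str,
    incoming: DeviceState,
) -> bool {
    let is_stale = cache
        .get(device)
        .map_or(false, |existing| existing.last_event_at >= incoming.last_event_at);
    if is_stale {
        // Older or replayed event: keep the cached state untouched.
        return false;
    }
    cache.insert(device.to_string(), incoming);
    true
}
```

Using `>=` rather than `>` means an exact replay of the latest event is also dropped, so replaying the KV history after a restart leaves the cache where the newest event put it.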

## Running the regression tests

```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
```

Needs k3d + podman on PATH (the shared harness brings up NATS in a fresh
namespace). The tests drive `fleet_aggregator::run` in-process against the real
NATS + k3d, aborting and respawning it to simulate restarts — `run` owns its
watchers in a `JoinSet`, so a cancelled aggregator leaves no orphan tasks.

[`fleet_aggregator.rs`]: ../fleet/harmony-fleet-operator/src/fleet_aggregator.rs
[`OperatorLiveness`]: ../fleet/harmony-fleet-operator/src/liveness.rs