# Fleet operator recovery scenarios

Inventory of every failure shape the IoT operator pod must survive on restart,
re-schedule, or upgrade. Companion to ROADMAP `v0_3_plan.md` **Chapter 2 —
Operator restart + aggregator recovery**.

The operator's in-memory aggregate (`Phase`, `DeploymentAggregate`, per-device
state) is rebuilt from scratch on every startup from NATS KV. The fleet uses
four buckets, although the aggregator does not watch `device-heartbeat` yet
(see scenarios 3 and 17):

- `desired-state` — operator-written, `<device>.<deployment>` keys
- `device-info`   — agent-written, static-ish facts
- `device-state`  — agent-written, per `(device, deployment)` phase
- `device-heartbeat` — agent liveness pings

The aggregator entry point is `harmony_fleet_operator::fleet_aggregator::run`
(see `fleet/harmony-fleet-operator/src/fleet_aggregator.rs`).

## Convention used in this document

- **Code path** cites the most relevant file and line range in
  `fleet/harmony-fleet-operator/src/fleet_aggregator.rs`, abbreviated
  `FA:Lxxx`. Other crates use `crate:path:Lxxx`.
- **Coverage** points at a smoke script under `fleet/scripts/` or a test under
  `fleet/harmony-fleet-e2e/tests/`. `none` means we currently rely on it
  working by inspection.
- **Risk** is impact if mishandled, not likelihood. `high` means a customer
  could see stale or wrong data on the dashboard; `medium` means the operator
  self-heals but logs a noisy error; `low` means transient correctness with
  no customer-visible effect.

## Scenarios

| # | Name                                            | Risk    | Coverage |
|---|-------------------------------------------------|---------|----------|
| 1 | Cold restart with full KV                       | high    | this PR (`tests/operator_restart.rs`) |
| 2 | Cold restart with desired-state seed only       | medium  | none |
| 3 | Partial KV — device offline during restart      | high    | none |
| 4 | Stale KV — Deployment CR deleted while down     | high    | none — Phase 2 |
| 5 | Stale KV — Device CR deleted while down         | medium  | none — Phase 2 |
| 6 | Selector change while operator is down          | high    | none |
| 7 | Two operator pods racing on rolling deploy      | high    | none — Phase 2 (leader election) |
| 8 | NATS reconnect mid-rebuild                      | medium  | partial (`async_nats` retries; no test) |
| 9 | NATS stream loss after rebuild                  | high    | none |
| 10 | KV revision wraparound                         | low     | none |
| 11 | Malformed `device-state` payload               | medium  | none (swallow path at `FA:L266-271`) |
| 12 | Malformed `desired-state` payload              | low     | none (swallow path at `FA:L619-625`) |
| 13 | Invalid `DeploymentName` in KV key             | low     | none (swallow path at `FA:L623-625`) |
| 14 | High write load during rebuild (slow rebuild)  | medium  | none |
| 15 | Missing KV bucket on startup                   | low     | implicit (`smoke-a1.sh`; bucket created at `FA:L154-164`) |
| 16 | Concurrent CR mutation during rebuild          | medium  | none |
| 17 | Heartbeat-only liveness (no state) on restart  | low     | none |
| 18 | Aggregator panics on the kube patch path       | medium  | none — Phase 2 (liveness signal) |
| 19 | Kube apiserver unreachable mid-rebuild         | high    | none |
| 20 | Operator killed mid-`patch_tick`               | medium  | none |

---

### 1. Cold restart with full KV

- **Trigger.** Operator pod is killed (OOM, node reschedule, `kubectl rollout
  restart`) when every device has previously published `device-info` and
  `device-state`, and every desired-state KV entry the previous operator
  wrote is still present.
- **Expected behavior.** New operator replays `device-state` via
  `bucket.watch_with_history(">")` (`FA:L251`) and seeds `owned_targets` from
  `desired-state` via `seed_owned_targets` (`FA:L611-633`). After both kube
  watchers fire `Event::InitDone`, the aggregate converges and emits status
  patches byte-identical to those the previous operator produced.
- **Code path.** `run` (`FA:L152-235`); `seed_owned_targets` (`FA:L611-633`);
  `run_state_kv_watcher` (`FA:L250-282`); `run_deployment_watcher`
  (`FA:L358-376`); `run_device_watcher` (`FA:L466-484`); `patch_tick`
  (`FA:L640-679`).
- **Coverage.** `fleet/harmony-fleet-e2e/tests/operator_restart.rs` (this PR).
- **Risk.** **High.** The customer-facing happy path. If this regresses,
  every dashboard reads stale or empty status after an upgrade.

### 2. Cold restart with desired-state seed only

- **Trigger.** Operator restart at a moment when `desired-state` has entries
  from a previous run but `device-state` is empty (all agents asleep or
  freshly provisioned).
- **Expected behavior.** `owned_targets` is seeded correctly. Aggregate
  reports `matched_device_count = N`, `pending = N`, no false-positive
  `Running` counts.
- **Code path.** `seed_owned_targets` (`FA:L611-633`); `compute_aggregate`
  (`FA:L684-710`).
- **Coverage.** none.
- **Risk.** **Medium.** Surfaces as a transiently inflated `pending` count
  until agents re-publish. Self-heals on the next state watch delivery.
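
The counting rule in this scenario can be sketched as follows. This is a minimal illustration, not the operator's `compute_aggregate`: the `Phase` variants, the `Aggregate` struct, and the rule that a seeded target with no reported state counts as pending are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical phase enum; the real `Phase` type may have other variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

/// Hypothetical aggregate shape, named after the fields quoted above.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub matched_device_count: usize,
    pub pending: usize,
    pub running: usize,
    pub failed: usize,
}

/// Counts phases over the seeded targets. A target with no reported state
/// is counted as pending (assumption), never as running.
pub fn compute_aggregate(targets: &[&str], reported: &HashMap<&str, Phase>) -> Aggregate {
    let mut agg = Aggregate {
        matched_device_count: targets.len(),
        ..Aggregate::default()
    };
    for device in targets {
        match reported.get(device).copied().unwrap_or(Phase::Pending) {
            Phase::Pending => agg.pending += 1,
            Phase::Running => agg.running += 1,
            Phase::Failed => agg.failed += 1,
        }
    }
    agg
}
```

With three seeded targets and an empty `reported` map, this yields `matched_device_count = 3`, `pending = 3`, and `running = 0`, which is the shape a test for this scenario would assert.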

### 3. Partial KV — device offline during restart

- **Trigger.** A device was running and reporting before the operator went
  down; it has now lost power or NATS connectivity. Its `device-state` entry
  is still present (KV is persistent), but it can no longer republish.
- **Expected behavior.** Operator replays the cached state and renders the
  device as `Running` (or whatever the last phase was) until the agent comes
  back. Dashboard does not show the device as "missing" unless the heartbeat
  bucket says so. Phase 2 will surface staleness via the heartbeat watcher.
- **Code path.** `apply_state` (`FA:L286-323`); no heartbeat watcher yet.
- **Coverage.** none.
- **Risk.** **High.** A long-offline device that the operator believes is
  `Running` could mask a real incident. Mitigation deferred to Chapter 2
  liveness signalling.
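
One possible shape for the deferred staleness check is sketched below. Nothing here exists in the operator today: the `Liveness` enum, the `classify` function, and the `max_age` threshold are hypothetical, and the sketch assumes the caller has already computed the age of the last heartbeat.

```rust
use std::time::Duration;

/// Hypothetical liveness verdict for a device whose cached state was replayed.
#[derive(Debug, PartialEq, Eq)]
pub enum Liveness {
    Fresh,
    Stale,
    /// No heartbeat has ever been seen for this device.
    Unknown,
}

/// Classifies a device from the age of its last heartbeat.
/// `max_age` is an illustrative threshold, not a value from the operator.
pub fn classify(last_heartbeat_age: Option<Duration>, max_age: Duration) -> Liveness {
    match last_heartbeat_age {
        None => Liveness::Unknown,
        Some(age) if age > max_age => Liveness::Stale,
        Some(_) => Liveness::Fresh,
    }
}
```

Keeping `Unknown` separate from `Stale` would let the dashboard distinguish a device that has never sent a heartbeat from one that stopped sending them.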

### 4. Stale KV — Deployment CR deleted while operator was down

- **Trigger.** While the operator pod is down, a customer deletes a
  `Deployment` CR. The CR finalizer never gets to run because no controller
  is alive to process it (the apiserver waits). When the operator restarts,
  the CR is in `Terminating` with a finalizer; the corresponding
  `desired-state.<device>.<name>` keys are still in NATS.
- **Expected behavior.** Operator processes the `Event::Delete` for the CR
  (`FA:L369`), drops `owned_targets`, deletes desired-state entries
  (`FA:L450-455`), removes the finalizer. Agents observe the KV delete and
  reconcile-tear-down.
- **Code path.** `on_deployment_delete` (`FA:L428-456`). Controller-side
  finalizer in `controller.rs` does a belt-and-braces scan of the same
  prefix.
- **Coverage.** none — **Phase 2 work** (stale-KV reconciliation rule, step
  3 of Chapter 2). Not in this PR.
- **Risk.** **High.** Orphaned KV entries make agents reconcile a long-dead
  Deployment forever. Customer-visible as "I deleted it but it's still
  running."

### 5. Stale KV — Device CR deleted while operator was down

- **Trigger.** Device CR is deleted (admin removes a Pi from the fleet) while
  the operator is down. `device-info` entry may also have been deleted
  separately; if so, the operator never rebuilds a Device CR for it.
- **Expected behavior.** Stale `desired-state` entries keyed on the deleted
  device should be cleaned up. Today they're not — the operator only deletes
  them on a live `Event::Delete` for the Device (`FA:L552-576`). Phase 2
  reconciliation must walk `owned_targets` after init and prune entries with
  no matching device.
- **Code path.** `on_device_delete` (`FA:L552-576`). No init-time prune.
- **Coverage.** none — **Phase 2 work**.
- **Risk.** **Medium.** Smaller blast radius than #4 because the device is
  also gone; nothing reconciles against the orphan key.
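
The init-time prune described above could look roughly like this. It is a sketch under stated assumptions: `Target` and `orphaned_targets` are hypothetical names, and the set of live devices is assumed to be known once the device watcher has finished its initial listing.

```rust
use std::collections::BTreeSet;

/// `(device, deployment)` pair, mirroring the `<device>.<deployment>` key layout.
pub type Target = (String, String);

/// Returns the seeded targets whose device no longer exists, in sorted order.
/// These are the keys an init-time prune would delete from `desired-state`.
pub fn orphaned_targets(owned: &BTreeSet<Target>, live_devices: &BTreeSet<String>) -> Vec<Target> {
    owned
        .iter()
        .filter(|(device, _)| !live_devices.contains(device))
        .cloned()
        .collect()
}
```

The function only computes the orphan list. Issuing the KV deletes and updating `owned_targets` would stay with the caller, so the rule itself can be unit-tested without NATS.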

### 6. Selector change while operator is down

- **Trigger.** Customer edits `spec.targetSelector` on a `Deployment` CR
  while the operator pod is down. On restart, the watcher delivers a single
  `Event::Apply` for the updated CR (kube collapses the history).
- **Expected behavior.** Operator computes new matched set, diffs against
  seeded `owned_targets`, writes new desired-state entries, deletes
  newly-orphaned ones. `reconcile_kv` (`FA:L582-604`) is responsible.
- **Code path.** `on_deployment_upsert` (`FA:L378-426`); `reconcile_kv`
  (`FA:L582-604`).
- **Coverage.** none. Critical because the seed step is what makes the diff
  correct on a cold restart — without `seed_owned_targets`, a selector
  reduction would leak orphan entries.
- **Risk.** **High.** Orphan keys reach agents that no longer match the
  selector, causing them to run a deployment they shouldn't.
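
The diff this scenario depends on can be illustrated with plain sets. This is not the body of `reconcile_kv`: `KvPlan` and `plan_reconcile` are hypothetical, and rewriting every matched device (rather than only new ones) is an assumption that relies on the writes being idempotent.

```rust
use std::collections::BTreeSet;

/// KV operations for one deployment after its selector is re-evaluated.
#[derive(Debug, PartialEq, Eq)]
pub struct KvPlan {
    /// Devices whose `desired-state` entry is (re)written.
    pub puts: Vec<String>,
    /// Devices whose `desired-state` entry is removed as orphaned.
    pub deletes: Vec<String>,
}

/// Diffs the previously owned devices against the newly matched set.
pub fn plan_reconcile(owned: &BTreeSet<String>, matched: &BTreeSet<String>) -> KvPlan {
    KvPlan {
        // Assumption: every matched device is rewritten, not only new ones.
        puts: matched.iter().cloned().collect(),
        // Anything owned before but no longer matched is an orphan.
        deletes: owned.difference(matched).cloned().collect(),
    }
}
```

If `owned` is seeded as `{pi-01, pi-02}` and the selector now matches only `pi-01`, the plan deletes `pi-02`. With an empty `owned` set (no seeding), `deletes` is empty and the `pi-02` entry leaks, which is the failure the Coverage note describes.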

### 7. Two operator pods racing on rolling deploy

- **Trigger.** A `kubectl rollout restart deploy/harmony-fleet-operator`
  briefly runs the old and new pods in parallel. Both watch the same KV
  and CRs, both write desired-state entries.
- **Expected behavior.** Writes are idempotent and byte-deterministic
  (`reconcile_kv` is a put-or-delete on the same content). Status patches
  via `Patch::Merge` collide harmlessly. **However**, the loser's
  `owned_targets` snapshot can lag and re-delete a key the winner just
  wrote, causing the key to flap.
- **Code path.** `reconcile_kv` (`FA:L582-604`); kube patch (`FA:L655-666`).
- **Coverage.** none — **Phase 2 work** (leader election decision, step 4
  of Chapter 2). Not in this PR.
- **Risk.** **High.** Customer sees the dashboard flicker during a rolling
  upgrade. Self-heals once the old pod terminates.

### 8. NATS reconnect mid-rebuild

- **Trigger.** Operator's NATS connection drops during cold rebuild.
  `async_nats` reconnects transparently. The KV `watch_with_history(">")`
  call has already returned a `Stream`; the underlying connection drop
  surfaces as a delivery error on the stream.
- **Expected behavior.** Stream loop logs the error and continues
  (`FA:L256-258`). The history replay may be incomplete — a follow-up watch
  refresh would be needed to guarantee correctness.
- **Code path.** `run_state_kv_watcher` (`FA:L250-282`).
- **Coverage.** partial — `async_nats` reconnect is covered by its own
  tests; no operator-level test asserts post-reconnect convergence.
- **Risk.** **Medium.** The watch stream may silently miss messages until a
  manual restart.

### 9. NATS stream loss after rebuild

- **Trigger.** NATS server crashes or the JetStream stream is deleted
  out-of-band after the operator has finished its cold rebuild.
- **Expected behavior.** The bucket is re-created on first reconnect
  (`create_key_value` is idempotent, `FA:L154-164`). The operator should
  then detect the empty stream, clear in-memory state, and rebuild. Today
  the watcher loop instead exits silently and `select!` cancels the process.
- **Code path.** `run` (`FA:L229-234`).
- **Coverage.** none.
- **Risk.** **High.** Possible silent data loss on a NATS incident.

### 10. KV revision wraparound

- **Trigger.** NATS JetStream KV uses a `u64` revision counter. At ~10 Hz
  it would take ~58 billion years to wrap. Included for completeness;
  reachable in practice only through a corrupted stream.
- **Expected behavior.** No special handling needed.
- **Code path.** N/A.
- **Coverage.** none.
- **Risk.** **Low.** Theoretical.
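
The ~58 billion year figure can be checked with a few lines of arithmetic. The helper below is illustrative only; `years_to_wrap` is a hypothetical name and the constant write rate is a simplification.

```rust
/// Years until a `u64` counter wraps at a constant `writes_per_sec` rate.
pub fn years_to_wrap(writes_per_sec: f64) -> f64 {
    // Julian year: 365.25 days of 86,400 seconds.
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
    (u64::MAX as f64) / writes_per_sec / SECONDS_PER_YEAR
}
```

At 10 writes per second this gives roughly 5.8e10 years, matching the estimate in the trigger.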

### 11. Malformed `device-state` payload

- **Trigger.** A buggy agent (or a manual `nats kv put`) writes a value to
  `state.<device>.<deployment>` that doesn't deserialize as
  `DeploymentState`.
- **Expected behavior.** Operator logs `aggregator: bad device_state
  payload` and skips the entry.
- **Code path.** `run_state_kv_watcher` deserialize arm (`FA:L266-271`).
- **Coverage.** none. The error path is type-checked but never executed by a
  test.
- **Risk.** **Medium.** A single bad entry shouldn't poison the whole
  rebuild; today's swallow-and-log handles this. Should be unit-tested.

### 12. Malformed `desired-state` payload

- **Trigger.** A previous operator wrote a value that no longer matches
  the current `ReconcileScore` shape (older version, manual mutation).
- **Expected behavior.** `seed_owned_targets` doesn't deserialize the
  value — it only reads keys. The next CR upsert from kube rewrites it.
- **Code path.** `seed_owned_targets` (`FA:L611-633`).
- **Coverage.** none.
- **Risk.** **Low.** Score format evolution is covered by CR validation;
  the KV is a derived projection.

### 13. Invalid `DeploymentName` in KV key

- **Trigger.** A key like `pi-01.hello.world` snuck into the bucket
  (multiple dots) — manual mutation or an older operator version that
  didn't validate names.
- **Expected behavior.** `seed_owned_targets` logs `Invalid deployment
  name for key …` and skips it (`FA:L623-625`).
- **Code path.** `seed_owned_targets` (`FA:L619-632`).
- **Coverage.** none.
- **Risk.** **Low.** Belt-and-braces: the CR layer already enforces this
  via `DeploymentName::try_new`.
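
The skip-on-invalid-key behavior can be sketched without NATS. This is a simplified stand-in for `seed_owned_targets`: the real code validates through `DeploymentName::try_new`, whereas the hypothetical `parse_desired_state_key` below only enforces the "exactly two non-empty segments" shape, and its log line is not the operator's.

```rust
/// Splits a `desired-state` key into `(device, deployment)`.
/// Returns `None` unless the key has exactly two non-empty,
/// dot-separated segments.
pub fn parse_desired_state_key(key: &str) -> Option<(&str, &str)> {
    let (device, deployment) = key.split_once('.')?;
    if device.is_empty() || deployment.is_empty() || deployment.contains('.') {
        return None;
    }
    Some((device, deployment))
}

/// Collects valid targets and skips malformed keys instead of failing.
pub fn seed_from_keys<'a>(keys: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut targets = Vec::new();
    for &key in keys {
        match parse_desired_state_key(key) {
            Some(target) => targets.push(target),
            // Illustrative message only; the operator's wording differs.
            None => eprintln!("skipping malformed desired-state key: {}", key),
        }
    }
    targets
}
```

A unit test for this scenario could feed `pi-01.hello.world` alongside valid keys and assert that only the valid pairs are seeded.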

### 14. High write load during rebuild

- **Trigger.** Hundreds of devices publishing `device-state` updates per
  second while the operator is rebuilding. The watch history replay races
  the live stream.
- **Expected behavior.** Deliveries are ordered last-writer-wins; the
  per-pair `last_event_at` dedup in `apply_state` (`FA:L287-291`) prevents
  out-of-order entries from clobbering newer ones.
- **Code path.** `apply_state` (`FA:L286-323`).
- **Coverage.** none. No load test exists.
- **Risk.** **Medium.** Likely fine in practice given the dedup, but
  unverified at scale.
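
A minimal sketch of timestamp-based dedup, in the spirit of the `last_event_at` check, is shown below. It is not the operator's `apply_state`: the `StateEntry` struct, the `u64` timestamp, and the choice to accept entries with an equal timestamp are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical per-pair state; the real `DeploymentState` carries more fields.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StateEntry {
    pub phase: String,
    /// Agent-side event timestamp; the unit (e.g. Unix millis) is an assumption.
    pub last_event_at: u64,
}

/// Applies `incoming` for `key` unless a strictly newer entry is stored.
/// Returns `true` when the entry was written.
pub fn apply_state(
    states: &mut HashMap<(String, String), StateEntry>,
    key: (String, String),
    incoming: StateEntry,
) -> bool {
    let stored_is_newer = states
        .get(&key)
        .map_or(false, |current| current.last_event_at > incoming.last_event_at);
    if stored_is_newer {
        // Out-of-order delivery: keep the newer entry already in the map.
        return false;
    }
    states.insert(key, incoming);
    true
}
```

Because the check is per `(device, deployment)` pair, a history replay that interleaves with live updates can only move each pair forward in time, regardless of delivery order.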

### 15. Missing KV bucket on startup

- **Trigger.** First-ever operator start on a fresh NATS cluster, or after
  someone wiped JetStream state.
- **Expected behavior.** `create_key_value` is idempotent — creates the
  bucket if absent, no-ops if present.
- **Code path.** `run` (`FA:L153-164`).
- **Coverage.** implicit in every smoke run that starts from a clean NATS.
  `smoke-a1.sh:182-195` asserts `KV bucket ready` log.
- **Risk.** **Low.** Idiomatic NATS pattern.

### 16. Concurrent CR mutation during rebuild

- **Trigger.** User applies a new Deployment CR while the operator is still
  replaying KV history.
- **Expected behavior.** Kube watcher delivers the `Event::Apply` after
  `Event::InitDone`; the upsert handler runs against the partially-seeded
  state and correctly diffs against any matching seeded targets.
- **Code path.** `on_deployment_upsert` (`FA:L378-426`).
- **Coverage.** none.
- **Risk.** **Medium.** Possible race between KV seed and CR init; today
  the mutex behind `state.lock().await` serializes both, but no test
  asserts the order in which they observe state.

### 17. Heartbeat-only liveness (no state) on restart

- **Trigger.** Device has been publishing heartbeats but has no deployments
  assigned. Operator restart finds heartbeats but no `device-state` or
  `desired-state` entries for it.
- **Expected behavior.** Device is recognized via its `Device` CR
  (rebuilt from `device-info` in `device_reconciler.rs`) and shown idle.
  No phase counts. The heartbeat bucket isn't watched by the aggregator.
- **Code path.** `device_reconciler` (separate from this module).
- **Coverage.** none.
- **Risk.** **Low.** Expected dashboard rendering.

### 18. Aggregator panics on the kube patch path

- **Trigger.** A bug or upstream change makes `patch_status` panic. Tokio
  unwinds the spawned task; the process keeps running because of `select!`
  — but the patcher silently stops.
- **Expected behavior.** Process should exit-and-restart on any subsystem
  failure. The dashboard should also surface "operator unhealthy" so a
  customer doesn't trust stale status.
- **Code path.** `patch_tick` (`FA:L640-679`); `run` select (`FA:L229-234`).
- **Coverage.** none — **Phase 2 work** (liveness signalling, step 5 of
  Chapter 2). Not in this PR.
- **Risk.** **Medium.** Status freezes silently; detection depends on the
  dashboard noticing the lack of updates.

### 19. Kube apiserver unreachable mid-rebuild

- **Trigger.** Apiserver hiccup during operator startup. `Api::list` or
  the initial `watcher::watcher` invocation fails.
- **Expected behavior.** Watcher loop logs and exits; `select!` cancels
  the process; k8s restarts the pod with exponential backoff.
- **Code path.** `run_deployment_watcher` (`FA:L362-374`);
  `run_device_watcher` (`FA:L470-482`).
- **Coverage.** none.
- **Risk.** **High.** A flapping apiserver can keep the operator from
  ever reaching steady state.

### 20. Operator killed mid-`patch_tick`

- **Trigger.** Pod terminated between draining the `dirty` set
  (`FA:L643`) and the actual `patch_status` calls (`FA:L656-666`).
- **Expected behavior.** Lost dirty entries are re-marked on the next KV
  watch delivery. Worst case is a one-tick lag in `.status.aggregate` —
  the patch tick runs at 1 Hz.
- **Code path.** `patch_tick` (`FA:L640-679`).
- **Coverage.** none.
- **Risk.** **Medium.** Self-heals on next event, but unverified.

## Phase 2 follow-ups (out of scope for this PR)

The Chapter 2 roadmap lists five steps. This PR ships only steps 1 and 2.

| Step | What it does | Scenarios it closes |
|------|--------------|---------------------|
| 1. Scenario inventory (this doc) | — | covers all 20 above by enumeration |
| 2. Cold-restart regression test  | gates #1 in CI | #1 |
| 3. Stale-KV reconciliation rule  | init-time prune of orphan keys | #4, #5 |
| 4. Leader election decision      | single-writer or idempotent multi-writer | #7 |
| 5. Liveness signalling           | dashboard "operator converging" banner | #18, parts of #19/#20 |

Each Phase 2 step is its own PR. The scenarios above tagged "Phase 2 work"
are the entry points.