
Jean-Gabriel Gill-Couture 8648d05ff7 docs(fleet): enumerate operator recovery scenarios

Tabletop inventory of every failure mode the fleet operator must
survive on restart, re-schedule, or upgrade. Companion to v0.3
roadmap Chapter 2; each scenario lists trigger, expected behavior,
code-path citation, current test coverage, and risk classification.

Step 1 of Chapter 2. Steps 3-5 (stale-KV reconciliation, leader
election, liveness signalling) deferred to follow-up PRs and tagged
"Phase 2 work" in the table.

2026-05-24 15:27:38 -04:00


Fleet operator recovery scenarios

Inventory of every failure shape the IoT operator pod must survive on restart, re-schedule, or upgrade. Companion to ROADMAP v0_3_plan.md Chapter 2 — Operator restart + aggregator recovery.

The operator's in-memory aggregate (Phase, DeploymentAggregate, per-device state) is rebuilt from scratch on every startup from NATS KV. The fleet uses four buckets (the aggregator does not yet watch device-heartbeat; see scenarios 3 and 17):

desired-state — operator-written, <device>.<deployment> keys
device-info — agent-written, static-ish facts
device-state — agent-written, per (device, deployment) phase
device-heartbeat — agent liveness pings

The aggregator entry point is harmony_fleet_operator::fleet_aggregator::run (see fleet/harmony-fleet-operator/src/fleet_aggregator.rs).

Convention used in this document

Code path cites the most relevant file and line range. Unless stated otherwise, the file is fleet/harmony-fleet-operator/src/fleet_aggregator.rs, abbreviated FA:Lxxx. Citations into other crates use crate:path:Lxxx.
Coverage points at a smoke script under fleet/scripts/ or a test under fleet/harmony-fleet-e2e/tests/. none means we currently rely on it working by inspection.
Risk is impact if mishandled, not likelihood. high means a customer could see stale or wrong data on the dashboard; medium means the operator self-heals but logs a noisy error; low means transient correctness with no customer-visible effect.

Scenarios

#	Name	Risk	Coverage
1	Cold restart with full KV	high	this PR (`tests/operator_restart.rs`)
2	Cold restart with desired-state seed only	medium	none
3	Partial KV — device offline during restart	high	none
4	Stale KV — Deployment CR deleted while down	high	none — Phase 2
5	Stale KV — Device CR deleted while down	medium	none — Phase 2
6	Selector change while operator is down	high	none
7	Two operator pods racing on rolling deploy	high	none — Phase 2 (leader election)
8	NATS reconnect mid-rebuild	medium	partial (`async_nats` retries; no test)
9	NATS stream loss after rebuild	high	none
10	KV revision wraparound	low	none
11	Malformed `device-state` payload	medium	none (swallow path at `FA:L266-271`)
12	Malformed `desired-state` payload	low	none (swallow path at `FA:L619-625`)
13	Invalid `DeploymentName` in KV key	low	none (swallow path at `FA:L623-625`)
14	High write load during rebuild (slow rebuild)	medium	none
15	Missing KV bucket on startup	low	created on first run (`FA:L154-164`)
16	Concurrent CR mutation during rebuild	medium	none
17	Heartbeat-only liveness (no state) on restart	low	none
18	Aggregator panics on the kube patch path	medium	none — Phase 2 (liveness signal)
19	Kube apiserver unreachable mid-rebuild	high	none
20	Operator killed mid-`patch_tick`	medium	none

1. Cold restart with full KV

Trigger. Operator pod is killed (OOM, node reschedule, kubectl rollout restart) when every device has previously published device-info and device-state, and every desired-state KV entry the previous operator wrote is still present.
Expected behavior. New operator replays device-state via bucket.watch_with_history(">") (FA:L251) and seeds owned_targets from desired-state via seed_owned_targets (FA:L611-633). After both kube watchers fire Event::InitDone, the aggregate converges and the status patches it emits are byte-identical to those emitted before the restart.
Code path. run (FA:L152-235); seed_owned_targets (FA:L611-633); run_state_kv_watcher (FA:L250-282); run_deployment_watcher (FA:L358-376); run_device_watcher (FA:L466-484); patch_tick (FA:L640-679).
Coverage. fleet/harmony-fleet-e2e/tests/operator_restart.rs (this PR).
Risk. High. The customer-facing happy path. If this regresses, every dashboard reads stale or empty status after an upgrade.

2. Cold restart with desired-state seed only

Trigger. Operator restart at a moment when desired-state has entries from a previous run but device-state is empty (all agents asleep or freshly provisioned).
Expected behavior. owned_targets is seeded correctly. Aggregate reports matched_device_count = N, pending = N, no false-positive Running counts.
Code path. seed_owned_targets (FA:L611-633); compute_aggregate (FA:L684-710).
Coverage. none.
Risk. Medium. Surfaces as transient over-pending until agents re-publish. Self-heals on the next state watch delivery.
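
The expected counts for this scenario can be sketched as a pure function. This is a minimal illustration, not the real compute_aggregate: the Phase variants, the Aggregate struct, and every field other than matched_device_count and pending are assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical phase enum; the real `Phase` type may have other variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

/// Hypothetical aggregate shape. Only `matched_device_count` and `pending`
/// are named in this document; `running` and `failed` are illustrative.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub matched_device_count: usize,
    pub pending: usize,
    pub running: usize,
    pub failed: usize,
}

/// Count phases over the seeded devices of one deployment. A device with
/// no `device-state` entry is counted as pending, never as running.
pub fn compute_aggregate(
    owned_devices: &BTreeSet<String>,
    reported: &BTreeMap<String, Phase>,
) -> Aggregate {
    let mut agg = Aggregate {
        matched_device_count: owned_devices.len(),
        ..Aggregate::default()
    };
    for device in owned_devices {
        match reported.get(device).copied().unwrap_or(Phase::Pending) {
            Phase::Pending => agg.pending += 1,
            Phase::Running => agg.running += 1,
            Phase::Failed => agg.failed += 1,
        }
    }
    agg
}
```

With three seeded devices and an empty reported map, this sketch yields matched_device_count = 3, pending = 3, and running = 0, which is the shape a regression test for this scenario would assert.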

3. Partial KV — device offline during restart

Trigger. A device was running and reporting before the operator went down; it has now lost power or NATS connectivity. Its device-state entry is still present (KV is persistent), but it can no longer republish.
Expected behavior. Operator replays the cached state and renders the device as Running (or whatever the last phase was) until the agent comes back. Dashboard does not show the device as "missing" unless the heartbeat bucket says so. Phase 2 will surface staleness via the heartbeat watcher.
Code path. apply_state (FA:L286-323); no heartbeat watcher yet.
Coverage. none.
Risk. High. A long-offline device that the operator believes is Running could mask a real incident. Mitigation deferred to Chapter 2 liveness signalling.

4. Stale KV — Deployment CR deleted while operator was down

Trigger. While the operator pod is down, a customer deletes a Deployment CR. The CR finalizer never gets to run because no controller is alive to process it (the apiserver waits). When the operator restarts, the CR is in Terminating with a finalizer; the corresponding desired-state.<device>.<name> keys are still in NATS.
Expected behavior. Operator processes the Event::Delete for the CR (FA:L369), drops owned_targets, deletes desired-state entries (FA:L450-455), removes the finalizer. Agents observe the KV delete and reconcile-tear-down.
Code path. on_deployment_delete (FA:L428-456). Controller-side finalizer in controller.rs does a belt-and-braces scan of the same prefix.
Coverage. none — Phase 2 work (stale-KV reconciliation rule, step 3 of Chapter 2). Not in this PR.
Risk. High. Orphaned KV entries make agents reconcile a long-dead Deployment forever. Customer-visible as "I deleted it but it's still running."

5. Stale KV — Device CR deleted while operator was down

Trigger. Device CR is deleted (admin removes a Pi from the fleet) while the operator is down. device-info entry may also have been deleted separately; if so, the operator never rebuilds a Device CR for it.
Expected behavior. Stale desired-state entries keyed on the deleted device should be cleaned up. Today they're not — the operator only deletes them on a live Event::Delete for the Device (FA:L552-576). Phase 2 reconciliation must walk owned_targets after init and prune entries with no matching device.
Code path. on_device_delete (FA:L552-576). No init-time prune.
Coverage. none — Phase 2 work.
Risk. Medium. Smaller blast radius than #4 because the device is also gone; nothing reconciles against the orphan key.
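
One possible shape for the Phase 2 init-time prune is sketched below. It is a sketch under assumptions, not a design decision: the Target alias, the orphaned_targets name, and the idea of passing the live device set in as a plain set are all hypothetical.

```rust
use std::collections::BTreeSet;

/// A `desired-state` target as a `(device, deployment)` pair.
pub type Target = (String, String);

/// Return the owned targets whose device has no live Device CR.
/// An init-time prune would delete these keys from `desired-state`.
pub fn orphaned_targets(
    owned_targets: &BTreeSet<Target>,
    live_devices: &BTreeSet<String>,
) -> Vec<Target> {
    owned_targets
        .iter()
        .filter(|(device, _)| !live_devices.contains(device))
        .cloned()
        .collect()
}
```

The function only computes the orphan list. Deciding when it is safe to act on it (for example, only after the device watcher has reported Event::InitDone) is left to the Phase 2 design.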

6. Selector change while operator is down

Trigger. Customer edits spec.targetSelector on a CR while the operator pod is down. On restart, watcher delivers a single Event::Apply for the updated CR (kube collapses the history).
Expected behavior. Operator computes new matched set, diffs against seeded owned_targets, writes new desired-state entries, deletes newly-orphaned ones. reconcile_kv (FA:L582-604) is responsible.
Code path. on_deployment_upsert (FA:L378-426); reconcile_kv (FA:L582-604).
Coverage. none. Critical because the seed step is what makes the diff correct on a cold restart — without seed_owned_targets, a selector reduction would leak orphan entries.
Risk. High. Orphan keys reach agents that no longer match the selector, causing them to run a deployment they shouldn't.
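
The diff described above can be sketched with two set differences. This is an illustration only: KvPlan and plan_reconcile are hypothetical names, and the real reconcile_kv may rewrite unchanged keys as well, since its writes are idempotent.

```rust
use std::collections::BTreeSet;

/// Keys to put and keys to delete for one deployment.
#[derive(Debug, PartialEq, Eq)]
pub struct KvPlan {
    pub to_write: Vec<String>,
    pub to_delete: Vec<String>,
}

/// Diff the previously owned devices against the newly matched devices
/// and return the `<device>.<deployment>` keys that need to change.
pub fn plan_reconcile(
    deployment: &str,
    owned: &BTreeSet<String>,
    matched: &BTreeSet<String>,
) -> KvPlan {
    let key = |device: &String| format!("{device}.{deployment}");
    KvPlan {
        // Newly matched devices get an entry.
        to_write: matched.difference(owned).map(&key).collect(),
        // Devices that no longer match lose theirs.
        to_delete: owned.difference(matched).map(&key).collect(),
    }
}
```

If owned is empty because the seed step was skipped, to_delete is always empty, so a selector reduction would leave the old keys in place. That is the leak the Coverage note for this scenario describes.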

7. Two operator pods racing on rolling deploy

Trigger. A kubectl rollout restart deploy/harmony-fleet-operator briefly runs the old and new pods in parallel. Both watch the same KV and CRs, both write desired-state entries.
Expected behavior. Writes are idempotent and byte-deterministic (reconcile_kv is a put-or-delete on the same content). Status patches via Patch::Merge collide harmlessly. However, whichever pod holds the older owned_targets snapshot can re-delete a key the other pod just wrote, causing flap.
Code path. reconcile_kv (FA:L582-604); kube patch (FA:L655-666).
Coverage. none — Phase 2 work (leader election decision, step 4 of Chapter 2). Not in this PR.
Risk. High. Customer sees the dashboard flicker during a rolling upgrade. Self-heals once the old pod terminates.

8. NATS reconnect mid-rebuild

Trigger. Operator's NATS connection drops during cold rebuild. async_nats reconnects transparently. The KV watch_with_history(">") call has already returned a Stream; the underlying connection drop surfaces as a delivery error on the stream.
Expected behavior. Stream loop logs the error and continues (FA:L256-258). The history replay may be incomplete — a follow-up watch refresh would be needed to guarantee correctness.
Code path. run_state_kv_watcher (FA:L250-282).
Coverage. partial — async_nats reconnect is covered by its own tests; no operator-level test asserts post-reconnect convergence.
Risk. Medium. The watch stream may silently miss messages until a manual restart.

9. NATS stream loss after rebuild

Trigger. NATS server crashes or the JetStream stream is deleted out-of-band after the operator has finished its cold rebuild.
Expected behavior. Bucket re-creation on first reconnect (create_key_value is idempotent, FA:L154-164). Operator should detect the empty stream, clear in-memory state, and rebuild. Today the watcher loop exits silently and select! cancels the process.
Code path. run (FA:L229-234).
Coverage. none.
Risk. High. Possible silent data loss on a NATS incident.

10. KV revision wraparound

Trigger. NATS JetStream KV uses a u64 revision counter. At ~10 Hz it would take ~58 billion years to wrap. Included for completeness; practical only with a corrupted stream.
Expected behavior. No special handling needed.
Code path. N/A.
Coverage. none.
Risk. Low. Theoretical.
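
The estimate in the trigger can be checked with a few lines of arithmetic. The function name is hypothetical, and a Julian year of 365.25 days is assumed.

```rust
/// Years until a u64 revision counter wraps at a constant write rate.
pub fn years_to_wrap(writes_per_second: f64) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
    // u64::MAX is about 1.8e19; the f64 rounding is irrelevant at this scale.
    let total_revisions = u64::MAX as f64;
    total_revisions / writes_per_second / SECONDS_PER_YEAR
}
```

At 10 writes per second this evaluates to roughly 5.8e10 years, consistent with the ~58 billion figure above.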

11. Malformed `device-state` payload

Trigger. A buggy agent (or a manual nats kv put) writes a value to state.<device>.<deployment> that doesn't deserialize as DeploymentState.
Expected behavior. Operator logs aggregator: bad device_state payload and skips the entry.
Code path. run_state_kv_watcher deserialize arm (FA:L266-271).
Coverage. none. The error path is type-checked by the compiler but never executed by a test.
Risk. Medium. A single bad entry shouldn't poison the whole rebuild; today's swallow-and-log handles this. Should be unit-tested.
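
A unit test for the swallow-and-log behavior could follow the shape below. This is a sketch under loud assumptions: decode_phase stands in for deserializing DeploymentState, whose real payload format is not shown in this document, and replay is a hypothetical synchronous stand-in for the watcher loop.

```rust
use std::collections::BTreeMap;

/// Hypothetical phase enum used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

/// Stand-in decoder: accepts a bare phase name as UTF-8 text.
/// The real code deserializes a `DeploymentState` instead.
pub fn decode_phase(payload: &[u8]) -> Result<Phase, String> {
    match std::str::from_utf8(payload).map(str::trim) {
        Ok("Pending") => Ok(Phase::Pending),
        Ok("Running") => Ok(Phase::Running),
        Ok("Failed") => Ok(Phase::Failed),
        Ok(other) => Err(format!("unknown phase {other:?}")),
        Err(e) => Err(format!("payload is not UTF-8: {e}")),
    }
}

/// Replay entries in order, skipping any that fail to decode.
/// Returns the rebuilt state and the number of skipped entries.
pub fn replay(entries: &[(&str, &[u8])]) -> (BTreeMap<String, Phase>, usize) {
    let mut state = BTreeMap::new();
    let mut skipped = 0;
    for (key, payload) in entries {
        match decode_phase(payload) {
            Ok(phase) => {
                state.insert(key.to_string(), phase);
            }
            Err(err) => {
                // Swallow and log: one bad entry must not abort the replay.
                eprintln!("aggregator: bad device_state payload (key {key}): {err}");
                skipped += 1;
            }
        }
    }
    (state, skipped)
}
```

A test would feed replay one malformed entry between two valid ones and assert that both valid entries land in the map and that exactly one entry was skipped.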

12. Malformed `desired-state` payload

Trigger. A previous operator wrote a value that no longer matches the current ReconcileScore shape (older version, manual mutation).
Expected behavior. seed_owned_targets doesn't deserialize the value — it only reads keys. The next CR upsert from kube rewrites it.
Code path. seed_owned_targets (FA:L611-633).
Coverage. none.
Risk. Low. Score format evolution is covered by CR validation; the KV is a derived projection.

13. Invalid `DeploymentName` in KV key

Trigger. A key like pi-01.hello.world snuck into the bucket (multiple dots) — manual mutation or an older operator version that didn't validate names.
Expected behavior. seed_owned_targets logs Invalid deployment name for key … and skips it (FA:L623-625).
Code path. seed_owned_targets (FA:L619-632).
Coverage. none.
Risk. Low. This check is belt-and-braces: the CR layer already enforces valid names via DeploymentName::try_new.
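
The skip rule can be illustrated with a small key parser. The validation shown is an assumption made for the example (non-empty parts, no dot inside the deployment name); the real rules live in DeploymentName::try_new and are not reproduced here, and parse_target_key is a hypothetical name.

```rust
/// Split a `desired-state` key of the form `<device>.<deployment>`.
/// Returns `None` for keys that do not have exactly that shape.
pub fn parse_target_key(key: &str) -> Option<(&str, &str)> {
    // Split at the first dot: everything after it is the deployment name.
    let (device, deployment) = key.split_once('.')?;
    // Assumed rule: both parts non-empty, no further dots in the name.
    if device.is_empty() || deployment.is_empty() || deployment.contains('.') {
        return None;
    }
    Some((device, deployment))
}
```

Under these assumed rules, pi-01.hello parses to ("pi-01", "hello"), while pi-01.hello.world is rejected and would be logged and skipped by the seed step.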

14. High write load during rebuild

Trigger. Hundreds of devices publishing device-state updates per second while the operator is rebuilding. The watch history replay races the live stream.
Expected behavior. Deliveries resolve as last-writer-wins; the per-pair last_event_at dedup in apply_state (FA:L287-291) prevents out-of-order entries from clobbering newer ones.
Code path. apply_state (FA:L286-323).
Coverage. none. No load test exists.
Risk. Medium. Likely fine in practice given the dedup, but unverified at scale.
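
The dedup rule can be sketched as follows. This is not the real apply_state: the StateEntry struct, the u64 timestamp representation, and the choice to drop only strictly older deliveries are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stored state for one (device, deployment) pair.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StateEntry {
    pub phase: String,
    /// Agent-side event time; milliseconds since the epoch is assumed.
    pub last_event_at: u64,
}

pub type Pair = (String, String);

/// Apply `incoming` unless a newer entry is already stored for the pair.
/// Returns true when the entry was applied.
pub fn apply_state(
    states: &mut HashMap<Pair, StateEntry>,
    pair: Pair,
    incoming: StateEntry,
) -> bool {
    if let Some(existing) = states.get(&pair) {
        if incoming.last_event_at < existing.last_event_at {
            // Stale delivery, e.g. history replay arriving after a live update.
            return false;
        }
    }
    states.insert(pair, incoming);
    true
}
```

A load test for this scenario would interleave replayed and live deliveries and assert that the stored entry for each pair always carries the highest last_event_at seen so far.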

15. Missing KV bucket on startup

Trigger. First-ever operator start on a fresh NATS cluster, or after someone wiped JetStream state.
Expected behavior. create_key_value is idempotent — creates the bucket if absent, no-ops if present.
Code path. run (FA:L153-164).
Coverage. implicit in every smoke run that starts from a clean NATS. smoke-a1.sh:182-195 asserts KV bucket ready log.
Risk. Low. Idiomatic NATS pattern.

16. Concurrent CR mutation during rebuild

Trigger. User applies a new Deployment CR while the operator is still replaying KV history.
Expected behavior. Kube watcher delivers the Event::Apply after Event::InitDone; the upsert handler runs against the partially-seeded state and correctly diffs against any matching seeded targets.
Code path. on_deployment_upsert (FA:L378-426).
Coverage. none.
Risk. Medium. Possible race between KV seed and CR init; today the locking in state.lock().await serializes both, but the order in which they observe state is not asserted.

17. Heartbeat-only liveness (no state) on restart

Trigger. Device has been publishing heartbeats but has no deployments assigned. Operator restart finds heartbeats but no device-state or desired-state entries for it.
Expected behavior. Device is recognized via its Device CR (rebuilt from device-info in device_reconciler.rs) and shown idle. No phase counts. The heartbeat bucket isn't watched by the aggregator.
Code path. device_reconciler (separate from this module).
Coverage. none.
Risk. Low. Expected dashboard rendering.

18. Aggregator panics on the kube patch path

Trigger. A bug or upstream change makes patch_status panic. Tokio unwinds the spawned task; the process keeps running because of select! — but the patcher silently stops.
Expected behavior. Process should exit-and-restart on any subsystem failure. The dashboard should also surface "operator unhealthy" so a customer doesn't trust stale status.
Code path. patch_tick (FA:L640-679); run select (FA:L229-234).
Coverage. none — Phase 2 work (liveness signalling, step 5 of Chapter 2). Not in this PR.
Risk. Medium. Status freezes silently; depends on dashboard noticing the lack of updates.

19. Kube apiserver unreachable mid-rebuild

Trigger. Apiserver hiccup during operator startup. Api::list or the initial watcher::watcher invocation fails.
Expected behavior. Watcher loop logs and exits; select! cancels the process; k8s restarts the pod with exponential backoff.
Code path. run_deployment_watcher (FA:L362-374); run_device_watcher (FA:L470-482).
Coverage. none.
Risk. High. A flapping apiserver can keep the operator from ever reaching steady state.

20. Operator killed mid-`patch_tick`

Trigger. Pod terminated between draining the dirty set (FA:L643) and the actual patch_status calls (FA:L656-666).
Expected behavior. Lost dirty entries are re-marked on the next KV watch delivery. Worst case is a one-tick lag in .status.aggregate — the patch tick runs at 1 Hz.
Code path. patch_tick (FA:L640-679).
Coverage. none.
Risk. Medium. Self-heals on next event, but unverified.
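
The window in this scenario comes from draining the dirty set before patching. The sketch below illustrates that pattern with a hypothetical Patcher type; the real patch_tick holds its state behind an async lock and is not reproduced here.

```rust
use std::collections::BTreeSet;
use std::mem;

/// Hypothetical holder for the names of deployments awaiting a status patch.
#[derive(Debug, Default)]
pub struct Patcher {
    pub dirty: BTreeSet<String>,
}

impl Patcher {
    /// Called when a watch delivery changes a deployment's aggregate.
    pub fn mark_dirty(&mut self, deployment: &str) {
        self.dirty.insert(deployment.to_string());
    }

    /// Take the whole dirty set and leave an empty one behind. Until the
    /// caller finishes patching, the drained names exist only in the
    /// returned value, so a kill at that point loses them.
    pub fn drain_dirty(&mut self) -> BTreeSet<String> {
        mem::take(&mut self.dirty)
    }
}
```

After a restart the full KV replay marks every deployment dirty again, which is why the loss is expected to self-heal.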

Phase 2 follow-ups (out of scope for this PR)

The Chapter 2 roadmap lists five steps. This PR ships only steps 1 and 2.

Step	What it does	Scenarios it closes
1. Scenario inventory (this doc)	—	covers all 20 above by enumeration
2. Cold-restart regression test	gates #1 in CI	#1
3. Stale-KV reconciliation rule	init-time prune of orphan keys	#4, #5
4. Leader election decision	single-writer or idempotent multi-writer	#7
5. Liveness signalling	dashboard "operator converging" banner	#18, parts of #19/#20

Each Phase 2 step is its own PR. The scenarios above tagged "Phase 2 work" are the entry points.
