Tabletop inventory of every failure mode the fleet operator must survive on restart, re-schedule, or upgrade. Companion to v0.3 roadmap Chapter 2; each scenario lists trigger, expected behavior, code-path citation, current test coverage, and risk classification. Step 1 of Chapter 2. Steps 3-5 (stale-KV reconciliation, leader election, liveness signalling) deferred to follow-up PRs and tagged "Phase 2 work" in the table.
Fleet operator recovery scenarios
Inventory of every failure shape the IoT operator pod must survive on restart,
re-schedule, or upgrade. Companion to ROADMAP v0_3_plan.md Chapter 2 —
Operator restart + aggregator recovery.
The operator's in-memory aggregate (Phase, DeploymentAggregate, per-device
state) is rebuilt from scratch on every startup by watching the four NATS KV
buckets:
- `desired-state`: operator-written, `<device>.<deployment>` keys
- `device-info`: agent-written, static-ish facts
- `device-state`: agent-written, per `(device, deployment)` phase
- `device-heartbeat`: agent liveness pings
The aggregator entry point is harmony_fleet_operator::fleet_aggregator::run
(see fleet/harmony-fleet-operator/src/fleet_aggregator.rs).
Convention used in this document
- Code path cites the most relevant file and line range in `fleet/harmony-fleet-operator/src/fleet_aggregator.rs` (call site `FA:Lxxx`). Other crates use `crate:path:Lxxx`.
- Coverage points at a smoke script under `fleet/scripts/` or a test under `fleet/harmony-fleet-e2e/tests/`. `none` means we currently rely on it working by inspection.
- Risk is impact if mishandled, not likelihood. `high` means a customer could see stale or wrong data on the dashboard; `medium` means the operator self-heals but logs a noisy error; `low` means transient correctness with no customer-visible effect.
Scenarios
| # | Name | Risk | Coverage |
|---|---|---|---|
| 1 | Cold restart with full KV | high | this PR (tests/operator_restart.rs) |
| 2 | Cold restart with desired-state seed only | medium | none |
| 3 | Partial KV — device offline during restart | high | none |
| 4 | Stale KV — Deployment CR deleted while down | high | none — Phase 2 |
| 5 | Stale KV — Device CR deleted while down | medium | none — Phase 2 |
| 6 | Selector change while operator is down | high | none |
| 7 | Two operator pods racing on rolling deploy | high | none — Phase 2 (leader election) |
| 8 | NATS reconnect mid-rebuild | medium | partial (async_nats retries; no test) |
| 9 | NATS stream loss after rebuild | high | none |
| 10 | KV revision wraparound | low | none |
| 11 | Malformed device-state payload | medium | FA:L266-271 swallow path |
| 12 | Malformed desired-state payload | low | FA:L619-625 swallow path |
| 13 | Invalid DeploymentName in KV key | low | FA:L623-625 swallow path |
| 14 | High write load during rebuild (slow rebuild) | medium | none |
| 15 | Missing KV bucket on startup | low | created on first run (FA:L154-164) |
| 16 | Concurrent CR mutation during rebuild | medium | none |
| 17 | Heartbeat-only liveness (no state) on restart | low | none |
| 18 | Aggregator panics on the kube patch path | medium | none — Phase 2 (liveness signal) |
| 19 | Kube apiserver unreachable mid-rebuild | high | none |
| 20 | Operator killed mid-patch_tick | medium | none |
1. Cold restart with full KV
- Trigger. Operator pod is killed (OOM, node reschedule, `kubectl rollout restart`) when every device has previously published `device-info` and `device-state`, and every desired-state KV entry the previous operator wrote is still present.
- Expected behavior. New operator replays `device-state` via `bucket.watch_with_history(">")` (FA:L251) and seeds `owned_targets` from `desired-state` via `seed_owned_targets` (FA:L611-633). After both kube watchers fire `Event::InitDone`, the aggregate converges to byte-identical status patches.
- Code path. `run` (FA:L152-235); `seed_owned_targets` (FA:L611-633); `run_state_kv_watcher` (FA:L250-282); `run_deployment_watcher` (FA:L358-376); `run_device_watcher` (FA:L466-484); `patch_tick` (FA:L640-679).
- Coverage. `fleet/harmony-fleet-e2e/tests/operator_restart.rs` (this PR).
- Risk. High. The customer-facing happy path. If this regresses, every dashboard reads stale or empty status after an upgrade.
2. Cold restart with desired-state seed only
- Trigger. Operator restart at a moment when `desired-state` has entries from a previous run but `device-state` is empty (all agents asleep or freshly provisioned).
- Expected behavior. `owned_targets` is seeded correctly. Aggregate reports `matched_device_count = N`, `pending = N`, no false-positive `Running` counts.
- Code path. `seed_owned_targets` (FA:L611-633); `compute_aggregate` (FA:L684-710).
- Coverage. none.
- Risk. Medium. Surfaces as transient over-`pending` until agents re-publish. Self-heals on the next state watch delivery.
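
The expected counts for this scenario can be pinned down with a small model. The sketch below is not the real `compute_aggregate`; the `Phase` variants, the `Aggregate` fields other than `matched_device_count` and `pending`, and the rule "a matched device with no reported state counts as pending" are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical phase enum; the real `Phase` type may have other variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

/// Hypothetical aggregate shape, reduced to plain counters.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub matched_device_count: usize,
    pub pending: usize,
    pub running: usize,
    pub failed: usize,
}

/// Count phases for one deployment. A matched device that has not
/// reported any state is counted as pending, never as running.
pub fn compute_aggregate(
    matched: &BTreeSet<String>,
    reported: &BTreeMap<String, Phase>,
) -> Aggregate {
    let mut agg = Aggregate {
        matched_device_count: matched.len(),
        ..Default::default()
    };
    for device in matched {
        match reported.get(device) {
            None | Some(Phase::Pending) => agg.pending += 1,
            Some(Phase::Running) => agg.running += 1,
            Some(Phase::Failed) => agg.failed += 1,
        }
    }
    agg
}
```

With three seeded devices and an empty `reported` map, this model yields `matched_device_count = 3`, `pending = 3`, and `running = 0`, which is the shape a regression test for this scenario would assert.
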
3. Partial KV — device offline during restart
- Trigger. A device was running and reporting before the operator went down; it has now lost power or NATS connectivity. Its `device-state` entry is still present (KV is persistent), but it can no longer republish.
- Expected behavior. Operator replays the cached state and renders the device as `Running` (or whatever the last phase was) until the agent comes back. Dashboard does not show the device as "missing" unless the heartbeat bucket says so. Phase 2 will surface staleness via the heartbeat watcher.
- Code path. `apply_state` (FA:L286-323); no heartbeat watcher yet.
- Coverage. none.
- Risk. High. A long-offline device that the operator believes is `Running` could mask a real incident. Mitigation deferred to Chapter 2 liveness signalling.
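
To make the deferred mitigation concrete, here is one possible shape for a staleness check. Nothing like this exists in the operator today; the `Liveness` enum, the `classify` function, and the `max_age` threshold are all hypothetical, and the Phase 2 design may look quite different.

```rust
use std::time::Duration;

/// Hypothetical liveness classification for a device.
#[derive(Debug, PartialEq, Eq)]
pub enum Liveness {
    /// A heartbeat was seen within `max_age`.
    Fresh,
    /// The last heartbeat is older than `max_age`.
    Stale,
    /// No heartbeat has ever been recorded for this device.
    Unknown,
}

/// Classify a device from its last heartbeat timestamp (Unix seconds).
/// `saturating_sub` keeps a heartbeat stamped slightly in the future
/// (clock skew) from underflowing; such a heartbeat counts as fresh.
pub fn classify(
    now_unix_secs: u64,
    last_heartbeat_unix_secs: Option<u64>,
    max_age: Duration,
) -> Liveness {
    match last_heartbeat_unix_secs {
        None => Liveness::Unknown,
        Some(ts) if now_unix_secs.saturating_sub(ts) > max_age.as_secs() => Liveness::Stale,
        Some(_) => Liveness::Fresh,
    }
}
```

A dashboard could then render the cached phase together with the liveness value, so a device that last reported `Running` but is `Stale` is not mistaken for a healthy one.
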
4. Stale KV — Deployment CR deleted while operator was down
- Trigger. While the operator pod is down, a customer deletes a `Deployment` CR. The CR finalizer never gets to run because no controller is alive to process it (the apiserver waits). When the operator restarts, the CR is in `Terminating` with a finalizer; the corresponding `desired-state.<device>.<name>` keys are still in NATS.
- Expected behavior. Operator processes the `Event::Delete` for the CR (FA:L369), drops `owned_targets`, deletes desired-state entries (FA:L450-455), removes the finalizer. Agents observe the KV delete and reconcile-tear-down.
- Code path. `on_deployment_delete` (FA:L428-456). Controller-side finalizer in `controller.rs` does a belt-and-braces scan of the same prefix.
- Coverage. none; Phase 2 work (stale-KV reconciliation rule, step 3 of Chapter 2). Not in this PR.
- Risk. High. Orphaned KV entries make agents reconcile a long-dead Deployment forever. Customer-visible as "I deleted it but it's still running."
5. Stale KV — Device CR deleted while operator was down
- Trigger. Device CR is deleted (admin removes a Pi from the fleet) while the operator is down. `device-info` entry may also have been deleted separately; if so, the operator never rebuilds a Device CR for it.
- Expected behavior. Stale `desired-state` entries keyed on the deleted device should be cleaned up. Today they're not: the operator only deletes them on a live `Event::Delete` for the Device (FA:L552-576). Phase 2 reconciliation must walk `owned_targets` after init and prune entries with no matching device.
- Code path. `on_device_delete` (FA:L552-576). No init-time prune.
- Coverage. none; Phase 2 work.
- Risk. Medium. Smaller blast radius than #4 because the device is also gone; nothing reconciles against the orphan key.
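
The init-time prune described above could look roughly like the following. This is a sketch of the Phase 2 idea, not existing code: the shape of `owned_targets` (deployment name mapped to a set of device ids) and the helper name `orphaned_keys` are assumptions.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Return the `<device>.<deployment>` keys whose device no longer exists.
///
/// `owned_targets` maps a deployment name to the devices it was written
/// for (assumed shape); `live_devices` is the set of devices known once
/// the device watcher has finished its initial listing.
pub fn orphaned_keys(
    owned_targets: &BTreeMap<String, BTreeSet<String>>,
    live_devices: &BTreeSet<String>,
) -> Vec<String> {
    let mut orphans = Vec::new();
    for (deployment, devices) in owned_targets {
        // Devices that were targeted before the restart but are gone now.
        for device in devices.difference(live_devices) {
            orphans.push(format!("{device}.{deployment}"));
        }
    }
    orphans
}
```

The caller would delete each returned key from `desired-state` and drop it from `owned_targets`. Running the prune only after the device watcher reports its initial state matters: pruning earlier would treat every device as missing.
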
6. Selector change while operator is down
- Trigger. Customer edits `spec.targetSelector` on a CR while the operator pod is down. On restart, watcher delivers a single `Event::Apply` for the updated CR (kube collapses the history).
- Expected behavior. Operator computes new matched set, diffs against seeded `owned_targets`, writes new desired-state entries, deletes newly-orphaned ones. `reconcile_kv` (FA:L582-604) is responsible.
- Code path. `on_deployment_upsert` (FA:L378-426); `reconcile_kv` (FA:L582-604).
- Coverage. none. Critical because the seed step is what makes the diff correct on a cold restart: without `seed_owned_targets`, a selector reduction would leak orphan entries.
- Risk. High. Orphan keys reach agents that no longer match the selector, causing them to run a deployment they shouldn't.
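
The dependency on the seed step is easiest to see as a set difference. The sketch below models the diff only; `KvPlan` and `plan_reconcile` are hypothetical names, and the real `reconcile_kv` may structure its puts and deletes differently.

```rust
use std::collections::BTreeSet;

/// KV operations needed to move from the seeded set to the new matched set.
#[derive(Debug, PartialEq, Eq)]
pub struct KvPlan {
    /// Keys to (re)write; writing an unchanged key is assumed idempotent.
    pub puts: Vec<String>,
    /// Keys that were owned before but no longer match the selector.
    pub deletes: Vec<String>,
}

/// Diff the previously owned devices against the newly matched ones for
/// a single deployment and emit `<device>.<deployment>` keys.
pub fn plan_reconcile(
    deployment: &str,
    owned: &BTreeSet<String>,
    matched: &BTreeSet<String>,
) -> KvPlan {
    let key = |device: &String| format!("{device}.{deployment}");
    KvPlan {
        puts: matched.iter().map(|d| key(d)).collect(),
        deletes: owned.difference(matched).map(|d| key(d)).collect(),
    }
}
```

If `owned` is empty because the seed step was skipped, `deletes` is always empty, so a selector that shrank from `{pi-01, pi-02}` to `{pi-02}` would leave `pi-01.hello` in the bucket. With the seeded set, the same call yields a delete for `pi-01.hello`.
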
7. Two operator pods racing on rolling deploy
- Trigger. A `kubectl rollout restart deploy/harmony-fleet-operator` briefly runs the old and new pods in parallel. Both watch the same KV and CRs, both write desired-state entries.
- Expected behavior. Writes are idempotent and byte-deterministic (`reconcile_kv` is a put-or-delete on the same content). Status patches via `Patch::Merge` collide harmlessly. However, the loser's `owned_targets` snapshot can lag and re-delete a key the winner just wrote, causing flap.
- Code path. `reconcile_kv` (FA:L582-604); kube patch (FA:L655-666).
- Coverage. none; Phase 2 work (leader election decision, step 4 of Chapter 2). Not in this PR.
- Risk. High. Customer sees the dashboard flicker during a rolling upgrade. Self-heals once the old pod terminates.
8. NATS reconnect mid-rebuild
- Trigger. Operator's NATS connection drops during cold rebuild. `async_nats` reconnects transparently. The KV `watch_with_history(">")` call has already returned a `Stream`; the underlying connection drop surfaces as a delivery error on the stream.
- Expected behavior. Stream loop logs the error and continues (FA:L256-258). The history replay may be incomplete; a follow-up watch refresh would be needed to guarantee correctness.
- Code path. `run_state_kv_watcher` (FA:L250-282).
- Coverage. partial: `async_nats` reconnect is covered by its own tests; no operator-level test asserts post-reconnect convergence.
- Risk. Medium. The watch stream may silently miss messages until a manual restart.
9. NATS stream loss after rebuild
- Trigger. NATS server crashes or the JetStream stream is deleted out-of-band after the operator has finished its cold rebuild.
- Expected behavior. Bucket re-creation on first reconnect (`create_key_value` is idempotent, FA:L154-164). Operator should detect the empty stream, clear in-memory state, and rebuild. Today the watcher loop exits silently, `select!` returns, and the process exits.
- Code path. `run` (FA:L229-234).
- Coverage. none.
- Risk. High. Possible silent data loss on a NATS incident.
10. KV revision wraparound
- Trigger. NATS JetStream KV uses a `u64` revision counter. At ~10 Hz it would take ~58 billion years to wrap. Included for completeness; practical only with a corrupted stream.
- Expected behavior. No special handling needed.
- Code path. N/A.
- Coverage. none.
- Risk. Low. Theoretical.
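
The "~58 billion years" figure follows from dividing the `u64` range by the write rate. A quick check of the arithmetic, using a 365.25-day year (the function name is illustrative only):

```rust
/// Years until a `u64` counter wraps at a constant write rate.
pub fn years_to_wrap(writes_per_second: f64) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
    // u64::MAX is about 1.8447e19; the cast to f64 rounds but the
    // estimate only needs a couple of significant digits.
    (u64::MAX as f64) / writes_per_second / SECONDS_PER_YEAR
}
```

At 10 writes per second this gives roughly 5.85e10 years, consistent with the estimate in the trigger.
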
11. Malformed device-state payload
- Trigger. A buggy agent (or a manual `nats kv put`) writes a value to `state.<device>.<deployment>` that doesn't deserialize as `DeploymentState`.
- Expected behavior. Operator logs `aggregator: bad device_state payload` and skips the entry.
- Code path. `run_state_kv_watcher` deserialize arm (FA:L266-271).
- Coverage. none. The error path is only compile-checked; no test executes it.
- Risk. Medium. A single bad entry shouldn't poison the whole rebuild; today's swallow-and-log handles this. Should be unit-tested.
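
A unit test for the swallow-and-log path could take the following shape. This is a stand-in, not the operator's code: the real watcher deserializes a `DeploymentState`, whereas `decode_phase` here parses a bare phase name so the example stays dependency-free, and `replay` is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Hypothetical phase enum standing in for the real payload type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

/// Stand-in decoder: accepts a bare phase name, rejects everything else.
pub fn decode_phase(payload: &[u8]) -> Result<Phase, String> {
    match std::str::from_utf8(payload).map(str::trim) {
        Ok("Pending") => Ok(Phase::Pending),
        Ok("Running") => Ok(Phase::Running),
        Ok("Failed") => Ok(Phase::Failed),
        Ok(other) => Err(format!("unknown phase {other:?}")),
        Err(e) => Err(format!("payload is not UTF-8: {e}")),
    }
}

/// Replay KV entries, skipping (and logging) any that fail to decode.
/// Returns the applied entries and the number of skipped ones.
pub fn replay<'a>(entries: &[(&'a str, &[u8])]) -> (BTreeMap<&'a str, Phase>, usize) {
    let mut applied = BTreeMap::new();
    let mut skipped = 0;
    for (key, payload) in entries {
        match decode_phase(payload) {
            Ok(phase) => {
                applied.insert(*key, phase);
            }
            Err(err) => {
                // One bad entry must not abort the rest of the replay.
                eprintln!("aggregator: bad device_state payload (key={key}): {err}");
                skipped += 1;
            }
        }
    }
    (applied, skipped)
}
```

The property worth asserting is that entries after the malformed one are still applied, which is what "a single bad entry shouldn't poison the whole rebuild" means in practice.
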
12. Malformed desired-state payload
- Trigger. A previous operator wrote a value that no longer matches the current `ReconcileScore` shape (older version, manual mutation).
- Expected behavior. `seed_owned_targets` doesn't deserialize the value; it only reads keys. The next CR upsert from kube rewrites it.
- Code path. `seed_owned_targets` (FA:L611-633).
- Coverage. none.
- Risk. Low. Score format evolution is covered by CR validation; the KV is a derived projection.
13. Invalid DeploymentName in KV key
- Trigger. A key like `pi-01.hello.world` snuck into the bucket (multiple dots), through manual mutation or an older operator version that didn't validate names.
- Expected behavior. `seed_owned_targets` logs `Invalid deployment name for key …` and skips it (FA:L623-625).
- Code path. `seed_owned_targets` (FA:L619-632).
- Coverage. none.
- Risk. Low. Belt-and-braces: the CR layer already enforces this via `DeploymentName::try_new`.
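
The key check can be sketched as a split on the first dot followed by a name validation. The validation rule below (lowercase ASCII letters, digits, and `-`) is an assumption standing in for whatever `DeploymentName::try_new` actually enforces, and `parse_desired_state_key` and `KeyError` are hypothetical names.

```rust
/// Reasons a `desired-state` key can be rejected in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum KeyError {
    /// The key contains no `.` separator at all.
    MissingSeparator,
    /// The part before the first `.` is empty.
    EmptyDevice,
    /// The part after the first `.` is not a valid deployment name.
    InvalidDeploymentName(String),
}

/// Split `<device>.<deployment>` and validate the deployment part.
pub fn parse_desired_state_key(key: &str) -> Result<(&str, &str), KeyError> {
    let (device, deployment) = key.split_once('.').ok_or(KeyError::MissingSeparator)?;
    if device.is_empty() {
        return Err(KeyError::EmptyDevice);
    }
    // Assumed rule: non-empty, lowercase alphanumerics and dashes only,
    // so any extra dot lands in the deployment part and fails here.
    let valid = !deployment.is_empty()
        && deployment
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    if !valid {
        return Err(KeyError::InvalidDeploymentName(deployment.to_string()));
    }
    Ok((device, deployment))
}
```

Under these assumptions `pi-01.hello` parses to `("pi-01", "hello")`, while `pi-01.hello.world` is rejected because `hello.world` is not a valid name. A seeding loop would log the error and continue with the next key.
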
14. High write load during rebuild
- Trigger. Hundreds of devices publishing `device-state` updates per second while the operator is rebuilding. The watch history replay races the live stream.
- Expected behavior. Deliveries are ordered last-writer-wins; the per-pair `last_event_at` dedup in `apply_state` (FA:L287-291) prevents out-of-order entries from clobbering newer ones.
- Code path. `apply_state` (FA:L286-323).
- Coverage. none. No load test exists.
- Risk. Medium. Likely fine in practice given the dedup, but unverified at scale.
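
The dedup rule can be modelled as a timestamp comparison per `(device, deployment)` pair. This is a simplified sketch, not the real `apply_state`: `StateEntry`, the `u64` timestamp, and the choice to drop entries with an equal timestamp as duplicates are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical per-pair state record.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StateEntry {
    pub phase: String,
    /// Event time reported by the agent (assumed monotonic per pair).
    pub last_event_at: u64,
}

/// Store `incoming` unless an entry at least as new is already present.
/// Returns `true` when the map was updated.
pub fn apply_state(
    states: &mut HashMap<(String, String), StateEntry>,
    device: &str,
    deployment: &str,
    incoming: StateEntry,
) -> bool {
    let key = (device.to_string(), deployment.to_string());
    let is_stale = states
        .get(&key)
        .map_or(false, |existing| existing.last_event_at >= incoming.last_event_at);
    if is_stale {
        // A history-replay entry arrived after a newer live update.
        return false;
    }
    states.insert(key, incoming);
    true
}
```

A load test for this scenario would interleave replayed and live entries in arbitrary order and assert that the final map holds the entry with the highest `last_event_at` for every pair.
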
15. Missing KV bucket on startup
- Trigger. First-ever operator start on a fresh NATS cluster, or after someone wiped JetStream state.
- Expected behavior. `create_key_value` is idempotent: creates the bucket if absent, no-ops if present.
- Code path. `run` (FA:L153-164).
- Coverage. implicit in every smoke run that starts from a clean NATS. `smoke-a1.sh:182-195` asserts `KV bucket ready` log.
- Risk. Low. Idiomatic NATS pattern.
16. Concurrent CR mutation during rebuild
- Trigger. User applies a new Deployment CR while the operator is still replaying KV history.
- Expected behavior. Kube watcher delivers the `Event::Apply` after `Event::InitDone`; the upsert handler runs against the partially-seeded state and correctly diffs against any matching seeded targets.
- Code path. `on_deployment_upsert` (FA:L378-426).
- Coverage. none.
- Risk. Medium. Possible race between KV seed and CR init; today the locking in `state.lock().await` serializes both, but the order in which they observe state is not asserted.
17. Heartbeat-only liveness (no state) on restart
- Trigger. Device has been publishing heartbeats but has no deployments assigned. Operator restart finds heartbeats but no `device-state` or `desired-state` entries for it.
- Expected behavior. Device is recognized via its `Device` CR (rebuilt from `device-info` in `device_reconciler.rs`) and shown idle. No phase counts. The heartbeat bucket isn't watched by the aggregator.
- Code path. `device_reconciler` (separate from this module).
- Coverage. none.
- Risk. Low. Expected dashboard rendering.
18. Aggregator panics on the kube patch path
- Trigger. A bug or upstream change makes `patch_status` panic. Tokio unwinds the spawned task; the process keeps running because of `select!`, but the patcher silently stops.
- Expected behavior. Process should exit-and-restart on any subsystem failure. The dashboard should also surface "operator unhealthy" so a customer doesn't trust stale status.
- Code path. `patch_tick` (FA:L640-679); `run` select (FA:L229-234).
- Coverage. none; Phase 2 work (liveness signalling, step 5 of Chapter 2). Not in this PR.
- Risk. Medium. Status freezes silently; depends on dashboard noticing the lack of updates.
19. Kube apiserver unreachable mid-rebuild
- Trigger. Apiserver hiccup during operator startup. `Api::list` or the initial `watcher::watcher` invocation fails.
- Expected behavior. Watcher loop logs and exits; `select!` returns and the process exits; k8s restarts the pod with exponential backoff.
- Code path. `run_deployment_watcher` (FA:L362-374); `run_device_watcher` (FA:L470-482).
- Coverage. none.
- Risk. High. A flapping apiserver can keep the operator from ever reaching steady state.
20. Operator killed mid-patch_tick
- Trigger. Pod terminated between draining the `dirty` set (FA:L643) and the actual `patch_status` calls (FA:L656-666).
- Expected behavior. Lost dirty entries are re-marked on the next KV watch delivery. Worst case is a one-tick lag in `.status.aggregate`; the patch tick runs at 1 Hz.
- Code path. `patch_tick` (FA:L640-679).
- Coverage. none.
- Risk. Medium. Self-heals on next event, but unverified.
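
The window in question exists because the dirty set is emptied before the patches are sent. A minimal model of that drain, assuming the set is keyed by deployment name and drained with `std::mem::take` (both assumptions; the real `patch_tick` may differ):

```rust
use std::collections::BTreeSet;

/// Hypothetical slice of the aggregator state: only the dirty set.
#[derive(Default)]
pub struct AggregatorState {
    pub dirty: BTreeSet<String>,
}

impl AggregatorState {
    /// Called when a watch delivery changes a deployment's aggregate.
    pub fn mark_dirty(&mut self, deployment: &str) {
        self.dirty.insert(deployment.to_string());
    }

    /// Hand the current dirty set to the patcher and leave an empty one
    /// behind. If the process dies after this call but before the
    /// patches go out, the drained names exist nowhere else.
    pub fn drain_dirty(&mut self) -> BTreeSet<String> {
        std::mem::take(&mut self.dirty)
    }
}
```

In this model a deployment drained but never patched only becomes dirty again when a later event calls `mark_dirty` for it, which is the self-healing path the expected behavior relies on and the part that remains unverified.
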
Phase 2 follow-ups (out of scope for this PR)
The Chapter 2 roadmap lists five steps. This PR ships only steps 1 and 2.
| Step | What it does | Scenarios it closes |
|---|---|---|
| 1. Scenario inventory (this doc) | — | covers all 20 above by enumeration |
| 2. Cold-restart regression test | gates #1 in CI | #1 |
| 3. Stale-KV reconciliation rule | init-time prune of orphan keys | #4, #5 |
| 4. Leader election decision | single-writer or idempotent multi-writer | #7 |
| 5. Liveness signalling | dashboard "operator converging" banner | #18, parts of #19/#20 |
Each Phase 2 step is its own PR. The scenarios above tagged "Phase 2 work" are the entry points.