feat/v0-3-operator-restart-baseline #294

Open
johnride wants to merge 2 commits from feat/v0-3-operator-restart-baseline into feat/smoke-test-contract

2 Commits

Author SHA1 Message Date
13e5549d6b test(fleet-e2e): cold-restart regression baseline for operator
All checks were successful
Run Check Script / check (pull_request) Successful in 2m22s
Adds the operator_restart integration test that gates scenario #1
in docs/fleet-operator-recovery-scenarios.md: when the operator Pod
is killed and replaced, the new instance must rebuild Deployment
status + desired-state KV from NATS alone, byte-for-byte matching
the pre-kill snapshot within 30 s.

Pattern: deploy via FleetOperatorScore (no handrolled manifests),
seed a Device + Deployment CR, wait for first patch, snapshot the
aggregate counts + desired-state bytes, delete the operator pod,
wait for the replacement Ready, then poll the snapshot until it
matches or the budget elapses.

Gated behind HARMONY_FLEET_E2E=1 so cargo test --workspace stays
cheap; runs in its own test binary to isolate the pod-kill blast
radius from the existing operator suite.

Step 2 of v0.3 Chapter 2. Steps 3-5 deferred.
2026-05-24 15:27:49 -04:00
8648d05ff7 docs(fleet): enumerate operator recovery scenarios
Tabletop inventory of every failure mode the fleet operator must
survive on restart, re-schedule, or upgrade. Companion to v0.3
roadmap Chapter 2; each scenario lists trigger, expected behavior,
code-path citation, current test coverage, and risk classification.

Step 1 of Chapter 2. Steps 3-5 (stale-KV reconciliation, leader
election, liveness signalling) deferred to follow-up PRs and tagged
"Phase 2 work" in the table.
2026-05-24 15:27:38 -04:00