feat(fleet-operator): aggregator recovery signal + orphan GC + recovery e2e (Ch2) #328

johnride · 2026-06-05T01:34:12Z

Operator restart + aggregator recovery (v0.3 plan Ch2). The aggregator already
cold-rebuilds from NATS KV + CR watches; this makes recovery observable, closes
an orphan gap, and pins each failure shape with a regression test.

- OperatorLiveness: a shared in-process latch (Recovering → Converged) the
  aggregator sets once all three cold-start sources replay (Deployment/Device
  watcher InitDone, device-state KV seen_current; empty-bucket short-circuit).
  The in-process dashboard reads it and shows a self-clearing banner via an
  HTMX self-poll (/__recovery), so the customer sees progress, not a blank.
- gc_orphaned_desired_state: at convergence, purge desired-state whose
  Deployment CR no longer exists (force-deleted while the operator was down,
  finalizer bypassed). Belt-and-suspenders with the controller finalizer.
- run() now owns its watchers in a JoinSet, so cancelling the aggregator
  aborts its children, leaving no orphan tasks to outlive a restart (matters
  for the restart-simulation tests and clean process teardown). Also made
  run() Send (hoisted a .await out of a tracing macro) so it can be spawned.
- docs/fleet-operator-recovery-scenarios.md enumerates the failure shapes and
  maps each to its test.
- harmony-fleet-e2e/tests/operator_recovery.rs: regression test per scenario
  (cold restart converges from KV; orphan GC; two operators write identical
  bytes; chaos kill under write load converges <30s) + AdminKv::put_device_state.

Writes stay idempotent + byte-deterministic, so two operators racing agree
without leader election (operator HA = D3, deferred).
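
For readers unfamiliar with the latch idea, here is a minimal sketch of a one-way Recovering → Converged signal. It is not the code in this PR: the bit-flag constants, the `mark_ready` and `phase` methods, and the choice to track each source inside the latch are all assumptions made for illustration. Only the name `OperatorLiveness` and the two states come from the description above.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

// Hypothetical bit flags, one per cold-start source named in the description.
pub const DEPLOYMENT_WATCHER: u8 = 0b001;
pub const DEVICE_WATCHER: u8 = 0b010;
pub const DEVICE_STATE_KV: u8 = 0b100;
const ALL_SOURCES: u8 = DEPLOYMENT_WATCHER | DEVICE_WATCHER | DEVICE_STATE_KV;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Recovering,
    Converged,
}

/// Illustrative latch: clones share the same atomic, so the aggregator can
/// write while the dashboard reads.
#[derive(Clone, Default)]
pub struct OperatorLiveness {
    ready: Arc<AtomicU8>,
}

impl OperatorLiveness {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record that one source finished its initial replay.
    /// Bits are only ever set, never cleared, so the latch is one-way
    /// and calling this twice for the same source is harmless.
    pub fn mark_ready(&self, source: u8) {
        self.ready.fetch_or(source & ALL_SOURCES, Ordering::AcqRel);
    }

    /// Converged only once every source has reported; Recovering otherwise.
    pub fn phase(&self) -> Phase {
        if self.ready.load(Ordering::Acquire) == ALL_SOURCES {
            Phase::Converged
        } else {
            Phase::Recovering
        }
    }
}
```

Under these assumptions, a banner endpoint such as /__recovery would only need to call `phase()` on its clone and stop polling once it sees `Converged`.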

johnride added 1 commit 2026-06-05 01:34:13 +00:00

feat(fleet-operator): aggregator recovery signal + orphan GC + recovery e2e (Ch2)

Run Check Script / check (pull_request) Failing after 1m0s

01118dc87b

Operator restart + aggregator recovery (v0.3 plan Ch2). The aggregator already
cold-rebuilds from NATS KV + CR watches; this makes recovery observable, closes
an orphan gap, and pins each failure shape with a regression test.

- OperatorLiveness: a shared in-process latch (Recovering → Converged) the
  aggregator sets once all three cold-start sources replay (Deployment/Device
  watcher InitDone, device-state KV seen_current; empty-bucket short-circuit).
  The in-process dashboard reads it and shows a self-clearing banner via an
  HTMX self-poll (/__recovery), so the customer sees progress, not a blank.
- gc_orphaned_desired_state: at convergence, purge desired-state whose
  Deployment CR no longer exists (force-deleted while the operator was down,
  finalizer bypassed). Belt-and-suspenders with the controller finalizer.
- run() now owns its watchers in a JoinSet, so cancelling the aggregator
  aborts its children — no orphan tasks outliving a restart (matters for the
  restart-simulation tests and clean process teardown). Also made run() Send
  (hoisted a .await out of a tracing macro) so it can be spawned.
- docs/fleet-operator-recovery-scenarios.md enumerates the failure shapes and
  maps each to its test.
- harmony-fleet-e2e/tests/operator_recovery.rs: regression test per scenario
  (cold restart converges from KV; orphan GC; two operators write identical
  bytes; chaos kill under write load converges <30s) + AdminKv::put_device_state.

Writes stay idempotent + byte-deterministic, so two operators racing agree
without leader election (operator HA = D3, deferred).
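
The orphan GC described in the list above amounts to a set difference: keep desired-state entries whose Deployment still exists, purge the rest. The sketch below models that with in-memory collections. The `DesiredState` alias, the `(deployment, device)` key shape, and the function name `purge_orphaned_desired_state` are hypothetical stand-ins; the real bucket layout and the real `gc_orphaned_desired_state` signature are not shown in this PR page.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the desired-state bucket:
/// key = (deployment name, device id), value = payload bytes.
pub type DesiredState = BTreeMap<(String, String), Vec<u8>>;

/// Remove every entry whose deployment is absent from `live_deployments`
/// and return the purged keys.
///
/// Callers should only run this after the Deployment watcher has finished
/// its initial replay. Before that point the live set is incomplete and
/// valid entries would be deleted.
pub fn purge_orphaned_desired_state(
    desired: &mut DesiredState,
    live_deployments: &BTreeSet<String>,
) -> Vec<(String, String)> {
    // Collect first, then remove, so the map is not mutated while iterated.
    let orphans: Vec<(String, String)> = desired
        .keys()
        .filter(|(deployment, _)| !live_deployments.contains(deployment))
        .cloned()
        .collect();
    for key in &orphans {
        desired.remove(key);
    }
    orphans
}
```

Running the purge a second time with the same live set removes nothing, which fits the idempotent-writes property noted above.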

johnride force-pushed feat/fleet-ch2-operator-recovery from 01118dc87b to 56602b505c

2026-06-05 19:48:55 +00:00

johnride changed target branch from feat/fleet-ch1-role-gate-followups to master

2026-06-09 19:44:34 +00:00

johnride changed target branch from master to feat/fleet-device-exec-logs

2026-06-09 19:44:57 +00:00

johnride reviewed 2026-06-09 20:34:24 +00:00

fleet/harmony-fleet-e2e/src/kv_admin.rs

						
@@ -175,0 +175,4 @@
    /// Publish a `device-state` entry directly, simulating an agent report.
    /// Recovery tests use this to assert the operator rebuilds health counts
    /// from KV alone after a restart.
    pub async fn put_device_state(&self, state: &DeploymentState) -> Result<(), AdminKvError> {

johnride commented

2026-06-09 20:34:23 +00:00

This should not exist on a live code path. At the very least, gate it behind `cfg(test)` so it does not make it into the release binary.

johnride added 2 commits 2026-06-10 20:06:50 +00:00

docs(fleet): rewrite operator recovery scenarios with diagrams 81b0f79f55

ASCII diagrams per scenario replace dense paragraphs. Architecture
overview shows the three watchers, FleetState, and convergence latches.
Cold-start sequence and key invariants in scannable tables.

fix(fleet): clarify recovery tests + add missing scenario 4 + dedup test

Run Check Script / check (pull_request) Failing after 52s

086d905586

- Rewrite e2e tests with explicit SETUP/ACTION/ASSERT structure per
  scenario. Each test header documents devices, deployments, and
  expected end state.
- Add scenario 4 test: device offline during restart counts as pending
  (2 devices, 1 deployment, only 1 reports → succeeded=1, pending=1).
- Add apply_state_rejects_older_timestamp unit test (previously claimed
  in docs but missing).
- Fix split_once → rsplit_once in parse_state_key and seed_owned_targets
  (device IDs with dots would silently drop entries).
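
To illustrate the split_once → rsplit_once fix from the last bullet, here is a small sketch. The key layout `<device_id>.<deployment_name>` is an assumption chosen for the example, and both function names are hypothetical; the real `parse_state_key` is not shown on this page.

```rust
/// Splits at the FIRST dot. With a dotted device ID this cuts the ID short,
/// so the resulting (device, deployment) pair matches nothing downstream.
pub fn parse_key_first_dot(key: &str) -> Option<(&str, &str)> {
    key.split_once('.')
}

/// Splits at the LAST dot. Assuming the deployment name itself contains no
/// dots, the device ID survives intact even when it contains dots.
pub fn parse_key_last_dot(key: &str) -> Option<(&str, &str)> {
    key.rsplit_once('.')
}
```

For the key `pi.lab.01.web`, the first variant yields the device `pi` and the second yields `pi.lab.01`. The second only holds up if deployment names never contain dots, which is an assumption here rather than something the PR states.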

This pull request can be merged automatically.

This branch is out-of-date with the base branch

Reference: NationTech/harmony#328