feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4) #330

Open
johnride wants to merge 1 commit from feat/fleet-ch4-agent-upgrade into feat/fleet-ch3-log-streaming

76ecf6da42 feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4)
All checks were successful
Run Check Script / check (pull_request) Successful in 2m35s
Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.

Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
  the heartbeat, Verb::UpgradeStop on the command protocol.

Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
  on it (DRY: the agent is the second consumer). Tested with httptest.

Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
  heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
  `--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
  executor self-heals the on-disk layout so first-upgrade rollback is safe even
  before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
  UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
  subscribed). The agent never self-stops.
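
The `drive` flow and the executor seam described above can be sketched roughly as follows. This is an illustrative outline, not the code in this PR: the trait methods, the `Committed`/`RolledBack` phases, the `FakeExecutor` fields, and the point at which the symlink swap happens are all assumptions made for the example.

```rust
use std::time::Duration;

/// Phases named in the commit message, plus two hypothetical terminal phases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Staging,
    Verifying,
    CutoverReady,
    Committed,  // hypothetical: the operator's stop arrived
    RolledBack, // hypothetical: a step failed or the wait timed out
}

/// Hypothetical seam over OS side effects (real on-device, faked in tests).
pub trait UpgradeExecutor {
    /// Download the new binary and verify its SHA-256.
    fn stage(&mut self) -> Result<(), String>;
    /// Run the staged binary with `--self-test`.
    fn self_test(&mut self) -> Result<(), String>;
    /// Swap the symlink and start the new agent (assumed to happen here).
    fn cutover(&mut self) -> Result<(), String>;
    /// Returns true if the operator's stop arrived before `timeout`.
    fn wait_for_stop(&mut self, timeout: Duration) -> bool;
    /// Restore the previous binary.
    fn revert(&mut self);
}

/// Walks the phases in order and returns the trace of phases visited.
pub fn drive<E: UpgradeExecutor>(exec: &mut E, stop_timeout: Duration) -> Vec<Phase> {
    let mut trace = vec![Phase::Staging];
    if exec.stage().is_err() {
        return rollback(exec, trace);
    }
    trace.push(Phase::Verifying);
    if exec.self_test().is_err() || exec.cutover().is_err() {
        return rollback(exec, trace);
    }
    trace.push(Phase::CutoverReady);
    if exec.wait_for_stop(stop_timeout) {
        trace.push(Phase::Committed);
        trace
    } else {
        // No stop within the timeout: treat the upgrade as failed and revert.
        rollback(exec, trace)
    }
}

fn rollback<E: UpgradeExecutor>(exec: &mut E, mut trace: Vec<Phase>) -> Vec<Phase> {
    exec.revert();
    trace.push(Phase::RolledBack);
    trace
}

/// Test double: each flag forces one outcome; `reverted` records the rollback.
#[derive(Default)]
pub struct FakeExecutor {
    pub fail_stage: bool,
    pub fail_self_test: bool,
    pub stop_arrives: bool,
    pub reverted: bool,
}

impl UpgradeExecutor for FakeExecutor {
    fn stage(&mut self) -> Result<(), String> {
        if self.fail_stage { Err("checksum mismatch".into()) } else { Ok(()) }
    }
    fn self_test(&mut self) -> Result<(), String> {
        if self.fail_self_test { Err("self-test failed".into()) } else { Ok(()) }
    }
    fn cutover(&mut self) -> Result<(), String> {
        Ok(())
    }
    fn wait_for_stop(&mut self, _timeout: Duration) -> bool {
        self.stop_arrives
    }
    fn revert(&mut self) {
        self.reverted = true;
    }
}
```

With a fake whose `stop_arrives` is true, `drive` ends in `Committed` and never calls `revert`; with the default fake the wait times out and the trace ends in `RolledBack`. Keeping every side effect behind the trait is what lets each failure path be exercised without touching systemd or the filesystem.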

Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
  version's heartbeat (single source of truth); reflects currentVersion + the
  upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.
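
One way to picture the operator's commit decision is as a pure function of what the operator itself has observed. The sketch below is a guess at its shape, not the coordinator's actual code: `CommitDecision`, `decide_commit`, and the version strings are hypothetical names chosen for the example.

```rust
/// Hypothetical outcome of one evaluation of the commit decision.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitDecision {
    /// The new version's heartbeat was observed: tell the old agent to stop.
    SendStop,
    /// No heartbeat from the target version yet: do nothing.
    KeepWaiting,
}

/// Decides whether to send the stop.
///
/// `observed_heartbeat_version` is the `agent_version` on the latest heartbeat
/// the operator has seen for this device, or `None` if it has seen none.
/// The agent's self-reported upgrade status is deliberately not an input here
/// (an assumption about what "single source of truth" means in practice).
pub fn decide_commit(
    target_version: &str,
    observed_heartbeat_version: Option<&str>,
) -> CommitDecision {
    match observed_heartbeat_version {
        Some(v) if v == target_version => CommitDecision::SendStop,
        _ => CommitDecision::KeepWaiting,
    }
}
```

For example, `decide_commit("0.2.0", Some("0.1.0"))` keeps waiting, while `decide_commit("0.2.0", Some("0.2.0"))` yields `SendStop`. Because the function has no side effects, it can be unit-tested without a NATS connection or a cluster.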

Deviations and flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) are
recorded in ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. The marker and
status ride NATS KV, so they survive an operator restart (per Ch2).
2026-06-05 15:26:38 -04:00