harmony/ROADMAP/fleet_platform/ch4-agent-upgrade-status.md
Jean-Gabriel Gill-Couture 76ecf6da42
feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4)
Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.

Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
  the heartbeat, Verb::UpgradeStop on the command protocol.

Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
  on it (DRY — second consumer is the agent). Tested with httptest.

Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
  heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
  `--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
  executor self-heals the on-disk layout so first-upgrade rollback is safe even
  before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
  UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
  subscribed). The agent never self-stops.

Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
  version's heartbeat (single source of truth); reflects currentVersion + the
  upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.

Deviations + flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) in
ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. Marker/status ride NATS KV
(survive operator restart, per Ch2).
2026-06-05 15:26:38 -04:00


Ch4 — Agent self-upgrade + auto-rollback (ADR-022): status

Built the full ADR-022 protocol end to end. The state-machine "brain" and the operator's commit decision are exhaustively unit-tested; the OS side-effects sit behind a seam so they're faked in tests and real on-device.

Shipped

| Piece | Where | Tested |
| --- | --- | --- |
| Wire types: marker, phase, status, agent_version on heartbeat, Verb::UpgradeStop | harmony-reconciler-contracts/src/upgrade.rs, kv.rs, commands.rs, fleet.rs | unit |
| Shared download+SHA-256 verify (lifted from k3d) | new crate harmony_downloadable_asset | unit (httptest) |
| Agent state machine drive (Staging→Verifying→CutoverReady→stop/revert) | harmony-fleet-agent/src/upgrade.rs | 6 unit tests incl. timeout-revert, stage/self-test/cutover failure |
| UpgradeExecutor seam + real SystemdUpgradeExecutor (download, --self-test, atomic symlink swap, systemd-run transient unit, revert) | same | seam fake-tested; real impl self-heals layout |
| --self-test flag | harmony-fleet-agent/src/main.rs | |
| Verb::UpgradeStop handling + armed UpgradeStopSignal (only the cutover-waiting old agent acts) | command_server.rs, upgrade.rs | |
| Operator coordinator: send stop only after observing the new version's heartbeat; reflect version + phase to the Device CR | harmony-fleet-operator/src/upgrade_coordinator.rs | 2 unit tests on the commit decision |
| FleetCommandsClient::upgrade_stop | commands.rs | |
| Device.status.{currentVersion, upgrade} | crd.rs | |
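To make the "seam" idea concrete, here is a minimal sketch of how a `drive`-style state machine could sit on top of an executor trait with a fake implementation. It is not the code in harmony-fleet-agent/src/upgrade.rs: the trait methods, the `Committed`/`RolledBack` terminal phases, `CutoverOutcome`, and `FakeExecutor` are all hypothetical names chosen for illustration, and the exact point at which cutover runs relative to CutoverReady is an assumption.

```rust
use std::time::Duration;

/// Phases from the table above; `Committed` and `RolledBack` are
/// hypothetical terminal names added for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Staging,
    Verifying,
    CutoverReady,
    Committed,
    RolledBack,
}

/// What the old agent observes while waiting at cutover (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CutoverOutcome {
    OperatorStop,
    HeartbeatTimeout,
}

/// Seam for OS side effects: real on-device, faked in tests.
/// Method names are assumptions, not the crate's actual trait.
pub trait UpgradeExecutor {
    fn stage(&mut self) -> Result<(), String>; // download + checksum verify
    fn self_test(&mut self) -> Result<(), String>; // run new binary with --self-test
    fn cutover(&mut self) -> Result<(), String>; // symlink swap + transient unit
    fn wait_for_stop(&mut self, timeout: Duration) -> CutoverOutcome;
    fn revert(&mut self); // restore symlink, stop transient unit
}

fn run_steps<E: UpgradeExecutor>(
    exec: &mut E,
    timeout: Duration,
    trace: &mut Vec<Phase>,
) -> Result<(), String> {
    trace.push(Phase::Staging);
    exec.stage()?;
    trace.push(Phase::Verifying);
    exec.self_test()?;
    exec.cutover()?;
    trace.push(Phase::CutoverReady);
    match exec.wait_for_stop(timeout) {
        CutoverOutcome::OperatorStop => Ok(()),
        CutoverOutcome::HeartbeatTimeout => Err("heartbeat timeout".to_string()),
    }
}

/// Runs the upgrade and returns the phases visited. Every failure funnels
/// into the same `revert` call, so rollback is a single code path.
pub fn drive<E: UpgradeExecutor>(exec: &mut E, timeout: Duration) -> Vec<Phase> {
    let mut trace = Vec::new();
    match run_steps(exec, timeout, &mut trace) {
        Ok(()) => trace.push(Phase::Committed),
        Err(_) => {
            exec.revert();
            trace.push(Phase::RolledBack);
        }
    }
    trace
}

/// Test double: each flag forces one failure path.
#[derive(Default)]
pub struct FakeExecutor {
    pub fail_stage: bool,
    pub fail_self_test: bool,
    pub fail_cutover: bool,
    pub timeout_at_cutover: bool,
    pub reverts: u32,
}

fn check(fail: bool, why: &str) -> Result<(), String> {
    if fail { Err(why.to_string()) } else { Ok(()) }
}

impl UpgradeExecutor for FakeExecutor {
    fn stage(&mut self) -> Result<(), String> {
        check(self.fail_stage, "stage failed")
    }
    fn self_test(&mut self) -> Result<(), String> {
        check(self.fail_self_test, "self-test failed")
    }
    fn cutover(&mut self) -> Result<(), String> {
        check(self.fail_cutover, "cutover failed")
    }
    fn wait_for_stop(&mut self, _timeout: Duration) -> CutoverOutcome {
        if self.timeout_at_cutover {
            CutoverOutcome::HeartbeatTimeout
        } else {
            CutoverOutcome::OperatorStop
        }
    }
    fn revert(&mut self) {
        self.reverts += 1;
    }
}
```

With a layout like this, a unit test sets one flag on `FakeExecutor`, calls `drive`, and asserts on the returned phase trace and the revert count, without touching systemd or the filesystem.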

Load-bearing properties from the ADR are intact: old verifies new (--self-test); operator commits the stop (single source of truth, never the agent); rollback is the same code path (revert symlink + stop transient unit on self-test failure / heartbeat-timeout); no version is GC'd.
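The commit rule and the armed stop signal can be sketched as two small pieces. This is an illustration under assumptions, not the code in upgrade_coordinator.rs or command_server.rs: `decide_commit`, `CommitDecision`, and the methods on this `UpgradeStopSignal` are hypothetical, and versions are compared as plain strings for simplicity.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Operator-side outcome of looking at the latest heartbeat (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum CommitDecision {
    SendStop,
    Wait,
}

/// The operator sends the stop only after it has itself observed a heartbeat
/// carrying the target version. No heartbeat, or an old version, means wait.
pub fn decide_commit(target_version: &str, observed_heartbeat_version: Option<&str>) -> CommitDecision {
    match observed_heartbeat_version {
        Some(v) if v == target_version => CommitDecision::SendStop,
        _ => CommitDecision::Wait,
    }
}

/// Agent-side gate: both the old and the new agent receive the stop command,
/// but only an agent that armed the signal (the one waiting at cutover) acts.
#[derive(Default)]
pub struct UpgradeStopSignal {
    armed: AtomicBool,
    stop_requested: AtomicBool,
}

impl UpgradeStopSignal {
    /// Called by the old agent when it starts waiting for the operator.
    pub fn arm(&self) {
        self.armed.store(true, Ordering::SeqCst);
    }

    /// Called when the stop command arrives. Returns true only if this
    /// agent was armed and therefore recorded the stop request.
    pub fn on_upgrade_stop(&self) -> bool {
        if self.armed.load(Ordering::SeqCst) {
            self.stop_requested.store(true, Ordering::SeqCst);
            true
        } else {
            false
        }
    }

    pub fn stop_requested(&self) -> bool {
        self.stop_requested.load(Ordering::SeqCst)
    }
}
```

In this sketch the agent never decides to stop on its own: `stop_requested` can only become true after the operator's command arrives at an armed signal, which mirrors the "operator commits the stop" property.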

Deviations (deliberate)

  • Marker + status ride NATS KV (agent-upgrade / agent-upgrade-status), not a fire-and-forget subject, so they survive an operator restart (same ethos as Ch2). The ADR's device-cmd.* / device-state.*.upgrade subjects map onto the existing command protocol (Verb::UpgradeStop) and the status KV.
  • First-upgrade rollback without M1. The real executor's capture_revert_target preserves the running binary at its versioned path on first cutover, even when the initial install put a plain file at /usr/local/bin/fleet-agent. This makes M1 a clean-install nicety, not a rollback-correctness prerequisite.
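The self-heal and the atomic symlink swap can be sketched with the standard library alone. This is a Unix-only illustration, not the real SystemdUpgradeExecutor: `swap_symlink` is a hypothetical helper, the function below only borrows the name `capture_revert_target`, and its signature (taking the versioned path as a parameter) is an assumption.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Atomically repoint `link` at `target`: build a temporary symlink next to
/// it, then rename it over `link`. rename(2) replaces the destination in one
/// step, so there is no moment where `link` is missing.
pub fn swap_symlink(link: &Path, target: &Path) -> io::Result<()> {
    let tmp = link.with_extension("tmp-swap");
    let _ = fs::remove_file(&tmp); // ignore a leftover from an earlier crash
    symlink(target, &tmp)?;
    fs::rename(&tmp, link)
}

/// Returns the path to revert to. If `link` is already a symlink, that is
/// simply its current target. If the initial install left a plain file at
/// `link`, copy it to `versioned` first and replace the file with a symlink,
/// so the running binary survives the upcoming cutover.
pub fn capture_revert_target(link: &Path, versioned: &Path) -> io::Result<PathBuf> {
    if fs::symlink_metadata(link)?.file_type().is_symlink() {
        return fs::read_link(link);
    }
    fs::copy(link, versioned)?; // copies contents and permission bits
    swap_symlink(link, versioned)?;
    Ok(versioned.to_path_buf())
}
```

Cutover is then `swap_symlink(link, new_binary)` and revert is `swap_symlink(link, revert_target)`, which is why rollback works on the first upgrade even without the clean-install layout.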

Flagged for a supervised run (not done tonight)

  1. M1 clean install layout. FleetDeviceSetupScore should install to /usr/bin/fleet-agent-v<ver> + symlink /usr/local/bin/fleet-agent from the start. Needs a new agent_version config field (≈9 construction sites) and a FileSource::Symlink delivery primitive (ansible state: link). The executor self-heal above covers correctness in the meantime.
  2. libvirt vX→vX+1 e2e + corrupt-binary auto-revert — needs two built agent binaries, a served URL reachable from the VM, and a KVM run. The VM harness exists (harmony-fleet-e2e/src/vm); the protocol brain is unit-green, so this is an integration proof to run on real hardware. The corrupt-binary path is already unit-proven via stage_failure_* / heartbeat_timeout_reverts_*.