All checks were successful
Run Check Script / check (pull_request) Successful in 2m35s
Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.
Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
the heartbeat, Verb::UpgradeStop on the command protocol.
Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
on it (DRY — second consumer is the agent). Tested with httptest.
Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
`--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
executor self-heals the on-disk layout so first-upgrade rollback is safe even
before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
subscribed). The agent never self-stops.
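
A minimal sketch of the armed-signal gating described above, assuming a pair of atomic flags. `StopSignal`, `arm`, and `on_upgrade_stop` are hypothetical names for illustration; the real `UpgradeStopSignal` may be shaped differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the agent's `UpgradeStopSignal`.
#[derive(Clone, Default)]
pub struct StopSignal {
    armed: Arc<AtomicBool>,
    stop_requested: Arc<AtomicBool>,
}

impl StopSignal {
    /// Called by the old agent once it is waiting for the operator's stop.
    pub fn arm(&self) {
        self.armed.store(true, Ordering::SeqCst);
    }

    /// Called by the command handler when `UpgradeStop` arrives.
    /// Returns true only if this agent was armed and therefore acts on it.
    pub fn on_upgrade_stop(&self) -> bool {
        if self.armed.load(Ordering::SeqCst) {
            self.stop_requested.store(true, Ordering::SeqCst);
            true
        } else {
            // An unarmed agent (for example the freshly started new version)
            // ignores the verb even though it is subscribed.
            false
        }
    }

    /// Polled by the upgrade loop to learn whether the operator committed.
    pub fn stop_requested(&self) -> bool {
        self.stop_requested.load(Ordering::SeqCst)
    }
}
```

Because both agents receive the verb, the gate lives on the receiving side: the new agent never arms the signal, so the same message is a no-op for it.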
Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
version's heartbeat (single source of truth); reflects currentVersion + the
upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.
Deviations + flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) in
ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. Marker/status ride NATS KV
(survive operator restart, per Ch2).
3.3 KiB
Ch4 — Agent self-upgrade + auto-rollback (ADR-022): status
Built the full ADR-022 protocol end to end. The state-machine "brain" and the operator's commit decision are exhaustively unit-tested; the OS side-effects sit behind a seam so they're faked in tests and real on-device.
Shipped
| Piece | Where | Tested |
|---|---|---|
| Wire types: marker, phase, status, `agent_version` on heartbeat, `Verb::UpgradeStop` | `harmony-reconciler-contracts/src/upgrade.rs`, `kv.rs`, `commands.rs`, `fleet.rs` | unit |
| Shared download+SHA-256 verify (lifted from k3d) | new crate `harmony_downloadable_asset` | unit (httptest) |
| Agent state machine `drive` (Staging→Verifying→CutoverReady→stop/revert) | `harmony-fleet-agent/src/upgrade.rs` | 6 unit tests incl. timeout-revert, stage/self-test/cutover failure |
| `UpgradeExecutor` seam + real `SystemdUpgradeExecutor` (download, `--self-test`, atomic symlink swap, `systemd-run` transient unit, revert) | same | seam fake-tested; real impl self-heals layout |
| `--self-test` flag | `harmony-fleet-agent/src/main.rs` | - |
| `Verb::UpgradeStop` handling + armed `UpgradeStopSignal` (only the cutover-waiting old agent acts) | `command_server.rs`, `upgrade.rs` | - |
| Operator coordinator: send stop only after observing the new version's heartbeat; reflect version + phase to the Device CR | `harmony-fleet-operator/src/upgrade_coordinator.rs` | 2 unit tests on the commit decision |
| `FleetCommandsClient::upgrade_stop` | `commands.rs` | - |
| `Device.status.{currentVersion, upgrade}` | `crd.rs` | - |
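
The `drive` row can be illustrated with a small synchronous sketch. It is not the code in `harmony-fleet-agent/src/upgrade.rs`: the `Executor` trait, `FakeExecutor`, the `Committed` and `Reverted` phases, and the placement of the cutover step between Verifying and CutoverReady are all assumptions made for illustration. It only shows how a seam lets every failure path funnel into one revert.

```rust
/// Phases; `Committed` and `Reverted` are assumed terminal names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase { Staging, Verifying, CutoverReady, Committed, Reverted }

/// What ended the wait after cutover.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopWait { OperatorStop, HeartbeatTimeout }

/// Hypothetical seam: OS side effects live behind this trait.
pub trait Executor {
    fn stage(&mut self) -> Result<(), String>;     // download + verify
    fn self_test(&mut self) -> Result<(), String>; // run new binary's self-test
    fn cutover(&mut self) -> Result<(), String>;   // symlink swap + start new unit
    fn wait_for_stop(&mut self) -> StopWait;
    fn revert(&mut self);
}

/// Walk the phases, recording each one; any failure takes the same revert path.
pub fn drive<E: Executor>(exec: &mut E, trace: &mut Vec<Phase>) -> Phase {
    trace.push(Phase::Staging);
    if exec.stage().is_err() { return roll_back(exec, trace); }
    trace.push(Phase::Verifying);
    if exec.self_test().is_err() { return roll_back(exec, trace); }
    if exec.cutover().is_err() { return roll_back(exec, trace); }
    trace.push(Phase::CutoverReady);
    match exec.wait_for_stop() {
        StopWait::OperatorStop => { trace.push(Phase::Committed); Phase::Committed }
        StopWait::HeartbeatTimeout => roll_back(exec, trace),
    }
}

fn roll_back<E: Executor>(exec: &mut E, trace: &mut Vec<Phase>) -> Phase {
    exec.revert();
    trace.push(Phase::Reverted);
    Phase::Reverted
}

/// Test double: each flag decides whether the matching step succeeds.
pub struct FakeExecutor {
    pub stage_ok: bool,
    pub self_test_ok: bool,
    pub cutover_ok: bool,
    pub operator_stops: bool,
    pub reverts: u32,
}

fn outcome(ok: bool, step: &str) -> Result<(), String> {
    if ok { Ok(()) } else { Err(format!("{} failed", step)) }
}

impl Executor for FakeExecutor {
    fn stage(&mut self) -> Result<(), String> { outcome(self.stage_ok, "stage") }
    fn self_test(&mut self) -> Result<(), String> { outcome(self.self_test_ok, "self-test") }
    fn cutover(&mut self) -> Result<(), String> { outcome(self.cutover_ok, "cutover") }
    fn wait_for_stop(&mut self) -> StopWait {
        if self.operator_stops { StopWait::OperatorStop } else { StopWait::HeartbeatTimeout }
    }
    fn revert(&mut self) { self.reverts += 1; }
}
```

With a fake like this, each failure or rollback path becomes a one-line fixture change, which is the kind of coverage the unit tests in the table describe.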
Load-bearing properties from the ADR are intact: old verifies new
(--self-test); operator commits the stop (single source of truth, never the
agent); rollback is the same code path (revert symlink + stop transient unit on
self-test failure / heartbeat-timeout); no version is GC'd.
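
The "operator commits the stop" property reduces to a pure decision that is easy to unit-test. The sketch below assumes the coordinator knows the target version from the marker and the version last seen on a heartbeat; `CommitDecision` and `decide` are hypothetical names, not the API of `upgrade_coordinator.rs`.

```rust
/// Hypothetical result of the operator's commit check.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitDecision {
    /// Keep waiting: no upgrade in flight, or the new version has not been seen.
    Hold,
    /// The new version's heartbeat was observed: tell the old agent to stop.
    SendStop,
}

/// Decide from the operator's own observations only, never from the agent's claim.
pub fn decide(target_version: Option<&str>, heartbeat_version: Option<&str>) -> CommitDecision {
    match (target_version, heartbeat_version) {
        (Some(target), Some(seen)) if target == seen => CommitDecision::SendStop,
        _ => CommitDecision::Hold,
    }
}
```

Keeping the decision free of I/O means the NATS and Kubernetes plumbing can be exercised separately from the rule itself.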
Deviations (deliberate)
- Marker + status ride NATS KV (`agent-upgrade` / `agent-upgrade-status`), not a fire-and-forget subject, so they survive an operator restart (same ethos as Ch2). The ADR's `device-cmd.*` / `device-state.*.upgrade` subjects map onto the existing command protocol (`Verb::UpgradeStop`) and the status KV.
- First-upgrade rollback without M1. The real executor's `capture_revert_target` preserves the running binary at its versioned path on first cutover even when the initial install put a plain file at `/usr/local/bin/fleet-agent`. This makes M1 a clean-install nicety, not a rollback-correctness prerequisite.
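
The self-heal idea can be sketched with the standard library alone (Unix only). This is not the body of `capture_revert_target`: `swap_symlink` and `preserve_running_binary` are hypothetical helpers, and copying the plain file to its versioned path before swapping is an assumption about how the layout might be repaired.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Point `link` at `target` atomically: create a temporary symlink next to it,
/// then rename it over `link`, so the path is never missing.
pub fn swap_symlink(link: &Path, target: &Path) -> io::Result<()> {
    let mut tmp = link.as_os_str().to_owned();
    tmp.push(".tmp-swap");
    let tmp = PathBuf::from(tmp);
    let _ = fs::remove_file(&tmp); // clear a leftover from an interrupted swap
    symlink(target, &tmp)?;
    fs::rename(&tmp, link)
}

/// If `link` is still a plain file, keep its contents at `versioned` and turn
/// `link` into a symlink to it, so a later revert has a target to return to.
pub fn preserve_running_binary(link: &Path, versioned: &Path) -> io::Result<()> {
    if fs::symlink_metadata(link)?.file_type().is_symlink() {
        return Ok(()); // layout is already versioned, nothing to heal
    }
    fs::copy(link, versioned)?; // on Unix this also copies permission bits
    swap_symlink(link, versioned)
}
```

Reverting is then the same call as cutting over: `swap_symlink(link, old_versioned_path)`.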
Flagged for a supervised run (not done tonight)
- M1 clean install layout: `FleetDeviceSetupScore` should install to `/usr/bin/fleet-agent-v<ver>` + symlink `/usr/local/bin/fleet-agent` from the start. Needs a new `agent_version` config field (≈9 construction sites) and a `FileSource::Symlink` delivery primitive (ansible `state: link`). The executor self-heal above covers correctness in the meantime.
- libvirt vX→vX+1 e2e + corrupt-binary auto-revert: needs two built agent binaries, a served URL reachable from the VM, and a KVM run. The VM harness exists (`harmony-fleet-e2e/src/vm`); the protocol brain is unit-green, so this is an integration proof to run on real hardware. The corrupt-binary path is already unit-proven via `stage_failure_*` / `heartbeat_timeout_reverts_*`.