feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4) #330

Open
johnride wants to merge 1 commit from feat/fleet-ch4-agent-upgrade into feat/fleet-ch3-log-streaming

Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.

Contracts (harmony-reconciler-contracts):

  • agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
    the heartbeat, Verb::UpgradeStop on the command protocol.
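
As a rough illustration of the contract surface listed above, the types might look something like the sketch below. Only the names `AgentUpgradePhase`, `agent_version`, `Verb::UpgradeStop` and the three phases mentioned in this PR come from the description; the `Committed` and `RolledBack` variants, the `device_id` field and the `is_terminal` helper are assumptions made for the example, not the crate's actual definitions.

```rust
/// Sketch of the upgrade phase reported in the status KV bucket.
/// `Staging`, `Verifying` and `CutoverReady` are named in this PR;
/// `Committed` and `RolledBack` are assumed terminal states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentUpgradePhase {
    Staging,
    Verifying,
    CutoverReady,
    Committed,  // assumption
    RolledBack, // assumption
}

impl AgentUpgradePhase {
    /// Hypothetical helper: true once the upgrade can make no further progress.
    pub fn is_terminal(self) -> bool {
        matches!(self, Self::Committed | Self::RolledBack)
    }
}

/// Heartbeat carrying the version of the agent that emitted it.
/// `agent_version` is from the PR; `device_id` is an assumed field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Heartbeat {
    pub device_id: String,
    pub agent_version: String,
}

/// Command verbs; only the verb added by this PR is shown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verb {
    UpgradeStop,
}
```

Carrying the version on the heartbeat is what lets the operator confirm a cutover from data it already receives, without a separate acknowledgement message.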

Shared (new crate harmony_downloadable_asset):

  • download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
    on it (DRY — second consumer is the agent). Tested with httptest.

Agent (harmony-fleet-agent):

  • drive: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
    heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
  • UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
    --self-test, atomic symlink swap, systemd-run transient unit, revert. The
    executor self-heals the on-disk layout so first-upgrade rollback is safe even
    before M1 (preserves the running binary at its versioned path).
  • --self-test flag; Verb::UpgradeStop handling gated by an armed
    UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
    subscribed). The agent never self-stops.
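
The shape of `drive` and the executor seam described above can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions, not the code in this PR: the trait method names, the `Outcome` and `WaitResult` types, the mapping of each executor step to a phase, and `FakeExecutor` are all hypothetical.

```rust
/// Phases named in this PR's description of `drive`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Staging,
    Verifying,
    CutoverReady,
}

/// Hypothetical result of one upgrade attempt, seen from the old agent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The operator sent the stop; the old agent may now exit.
    StoppedByOperator,
    /// A step failed or the wait timed out; the old binary was restored.
    RolledBack { failed_in: Phase },
}

/// What the old agent observes while waiting after cutover (assumed type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WaitResult {
    OperatorStop,
    HeartbeatTimeout,
}

/// The seam: every OS side-effect goes through this trait (method names assumed).
pub trait UpgradeExecutor {
    fn stage(&mut self) -> Result<(), String>; // download + checksum verify
    fn self_test(&mut self) -> Result<(), String>; // run the new binary's self-test
    fn cutover(&mut self) -> Result<(), String>; // symlink swap + start new unit
    fn wait_for_operator_stop(&mut self) -> WaitResult;
    fn revert(&mut self);
}

/// Walk the phases in order; any failure or timeout reverts.
pub fn drive<E: UpgradeExecutor>(exec: &mut E) -> Outcome {
    if exec.stage().is_err() {
        return rollback(exec, Phase::Staging);
    }
    if exec.self_test().is_err() {
        return rollback(exec, Phase::Verifying);
    }
    if exec.cutover().is_err() {
        return rollback(exec, Phase::CutoverReady);
    }
    // The old agent never stops itself: it only reacts to the operator.
    match exec.wait_for_operator_stop() {
        WaitResult::OperatorStop => Outcome::StoppedByOperator,
        WaitResult::HeartbeatTimeout => rollback(exec, Phase::CutoverReady),
    }
}

fn rollback<E: UpgradeExecutor>(exec: &mut E, failed_in: Phase) -> Outcome {
    exec.revert();
    Outcome::RolledBack { failed_in }
}

/// Test double: each flag forces one failure path; `reverted` records the rollback.
#[derive(Debug, Default)]
pub struct FakeExecutor {
    pub fail_stage: bool,
    pub fail_self_test: bool,
    pub fail_cutover: bool,
    pub time_out: bool,
    pub reverted: bool,
}

fn step(fail: bool, what: &str) -> Result<(), String> {
    if fail { Err(format!("{} failed", what)) } else { Ok(()) }
}

impl UpgradeExecutor for FakeExecutor {
    fn stage(&mut self) -> Result<(), String> { step(self.fail_stage, "stage") }
    fn self_test(&mut self) -> Result<(), String> { step(self.fail_self_test, "self-test") }
    fn cutover(&mut self) -> Result<(), String> { step(self.fail_cutover, "cutover") }
    fn wait_for_operator_stop(&mut self) -> WaitResult {
        if self.time_out { WaitResult::HeartbeatTimeout } else { WaitResult::OperatorStop }
    }
    fn revert(&mut self) { self.reverted = true; }
}
```

Because the state machine only sees the trait, each failure and rollback path can be exercised by flipping one flag on the fake, while the systemd-backed executor stays out of unit tests.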

Operator (harmony-fleet-operator):

  • upgrade_coordinator: sends the stop ONLY after independently observing the new
    version's heartbeat (single source of truth); reflects currentVersion + the
    upgrade phase onto the Device CR. 2 unit tests on the commit decision.
  • FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.
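
The commit rule described above reduces to a small pure function, which is what makes it easy to unit-test. The sketch below is one possible shape, assuming versions are compared as plain strings; `CommitDecision` and `commit_decision` are hypothetical names, not the coordinator's actual API.

```rust
/// Hypothetical result of the operator's commit check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommitDecision {
    /// The new version's heartbeat was observed: tell the old agent to stop.
    SendUpgradeStop,
    /// No heartbeat yet, or it still reports another version: do nothing.
    KeepWaiting,
}

/// Decide from the operator's own observation of the heartbeat only,
/// never from what the upgrading agent claims about itself.
pub fn commit_decision(
    target_version: &str,
    observed_heartbeat_version: Option<&str>,
) -> CommitDecision {
    match observed_heartbeat_version {
        Some(v) if v == target_version => CommitDecision::SendUpgradeStop,
        _ => CommitDecision::KeepWaiting,
    }
}
```

If no matching heartbeat ever arrives, the operator never sends the stop, and the agent-side heartbeat-timeout revert takes over.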

Deviations + flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) in
ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. Marker/status ride NATS KV
(survive operator restart, per Ch2).

johnride added 1 commit 2026-06-05 01:35:01 +00:00
feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4)
Some checks failed
Run Check Script / check (pull_request) Failing after 1m11s
654b979da3
johnride force-pushed feat/fleet-ch4-agent-upgrade from 654b979da3 to 76ecf6da42 2026-06-05 19:48:56 +00:00 Compare
All checks were successful
Run Check Script / check (pull_request) Successful in 2m35s
This pull request can be merged automatically.

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin feat/fleet-ch4-agent-upgrade:feat/fleet-ch4-agent-upgrade
git checkout feat/fleet-ch4-agent-upgrade
Reference: NationTech/harmony#330