Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.
Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
the heartbeat, Verb::UpgradeStop on the command protocol.
Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
on it (DRY — second consumer is the agent). Tested with httptest.
Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
`--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
executor self-heals the on-disk layout so first-upgrade rollback is safe even
before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
subscribed). The agent never self-stops.
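
The arming rule can be sketched as below. This is a minimal illustration under assumptions, not the agent's actual type: the field and method names are hypothetical, and it uses plain atomics where the real `UpgradeStopSignal` may well use async primitives.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical sketch: a stop signal that only reacts once it has been armed.
#[derive(Default)]
pub struct UpgradeStopSignal {
    armed: AtomicBool,
    stop_requested: AtomicBool,
}

impl UpgradeStopSignal {
    /// Called by the old agent when it starts waiting for the operator's stop.
    pub fn arm(&self) {
        self.armed.store(true, Ordering::SeqCst);
    }

    /// Called by the command handler when a `Verb::UpgradeStop` arrives.
    /// Returns true only if this agent was armed and therefore acts on it.
    pub fn on_upgrade_stop(&self) -> bool {
        if self.armed.load(Ordering::SeqCst) {
            self.stop_requested.store(true, Ordering::SeqCst);
            true
        } else {
            // The freshly started new agent is subscribed too, but was never
            // armed, so it ignores the command.
            false
        }
    }

    /// Polled by the waiting side to learn whether the stop was delivered.
    pub fn stop_requested(&self) -> bool {
        self.stop_requested.load(Ordering::SeqCst)
    }
}
```

In this sketch an unarmed signal ignores the verb, so the same command can safely reach both agents.
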
Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
version's heartbeat (single source of truth); reflects currentVersion + the
upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.
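
A minimal sketch of the commit decision, assuming it reduces to comparing the target version against the `agent_version` on the latest heartbeat the operator itself observed. The function name, its parameters, and the `Decision` enum are hypothetical and not taken from `upgrade_coordinator`.

```rust
/// Hypothetical outcome of one evaluation of the commit decision.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// The new version has not been seen heartbeating yet: keep waiting.
    Wait,
    /// The new version's heartbeat was observed: tell the old agent to stop.
    SendStop,
}

/// Decide from the operator's own observations only.
/// `last_heartbeat_version` is the `agent_version` carried by the most recent
/// heartbeat for the device, if any heartbeat was seen.
pub fn decide(target_version: &str, last_heartbeat_version: Option<&str>) -> Decision {
    match last_heartbeat_version {
        Some(seen) if seen == target_version => Decision::SendStop,
        // No heartbeat yet, or still the old version: never stop early.
        _ => Decision::Wait,
    }
}
```

Keeping the decision a pure function of observed state is what makes it cheap to unit-test without NATS or a cluster.
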
Deviations + flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) in
ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. Marker/status ride NATS KV
(survive operator restart, per Ch2).
49 lines
3.3 KiB
Markdown
# Ch4 — Agent self-upgrade + auto-rollback (ADR-022): status

Built the full ADR-022 protocol end to end. The state-machine "brain" and the
operator's commit decision are exhaustively unit-tested; the OS side-effects sit
behind a seam so they're faked in tests and real on-device.

## Shipped

| Piece | Where | Tested |
|---|---|---|
| Wire types: marker, phase, status, `agent_version` on heartbeat, `Verb::UpgradeStop` | `harmony-reconciler-contracts/src/upgrade.rs`, `kv.rs`, `commands.rs`, `fleet.rs` | unit |
| Shared download+SHA-256 verify (lifted from k3d) | new crate `harmony_downloadable_asset` | unit (httptest) |
| Agent state machine `drive` (Staging→Verifying→CutoverReady→stop/revert) | `harmony-fleet-agent/src/upgrade.rs` | **6 unit tests** incl. timeout-revert, stage/self-test/cutover failure |
| `UpgradeExecutor` seam + real `SystemdUpgradeExecutor` (download, `--self-test`, atomic symlink swap, `systemd-run` transient unit, revert) | same | seam fake-tested; real impl self-heals layout |
| `--self-test` flag | `harmony-fleet-agent/src/main.rs` | - |
| `Verb::UpgradeStop` handling + armed `UpgradeStopSignal` (only the cutover-waiting old agent acts) | `command_server.rs`, `upgrade.rs` | - |
| Operator coordinator: send stop **only after** observing the new version's heartbeat; reflect version + phase to the `Device` CR | `harmony-fleet-operator/src/upgrade_coordinator.rs` | **2 unit tests** on the commit decision |
| `FleetCommandsClient::upgrade_stop` | `commands.rs` | - |
| `Device.status.{currentVersion, upgrade}` | `crd.rs` | - |

Load-bearing properties from the ADR are intact: old verifies new
(`--self-test`); operator commits the stop (single source of truth, never the
agent); rollback is the same code path (revert symlink + stop transient unit on
self-test failure / heartbeat-timeout); no version is GC'd.

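
The phase order and the single rollback path can be sketched as follows. This is a minimal illustration, not the code in `harmony-fleet-agent/src/upgrade.rs`: the trait methods, `FakeExecutor`, and `Outcome` are hypothetical names, the sketch is synchronous, and reporting a cutover failure under `Verifying` is an assumption.

```rust
/// Hypothetical seam: one method per OS side-effect the upgrade needs.
pub trait UpgradeExecutor {
    fn stage(&mut self) -> Result<(), String>; // download + SHA-256 verify
    fn self_test(&mut self) -> Result<(), String>; // run the new binary with --self-test
    fn cutover(&mut self) -> Result<(), String>; // symlink swap + start transient unit
    fn revert(&mut self); // restore symlink + stop transient unit
    /// True if the operator's stop arrived, false if the heartbeat timeout elapsed.
    fn wait_for_operator_stop(&mut self) -> bool;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Staging,
    Verifying,
    CutoverReady,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The operator committed the upgrade; the old agent may exit.
    Stopped,
    /// Something failed or timed out; the old version stays in charge.
    RolledBack { failed_in: Phase },
}

/// Every failure funnels through this one function, so rollback is one code path.
fn rollback<E: UpgradeExecutor>(exec: &mut E, failed_in: Phase) -> Outcome {
    exec.revert();
    Outcome::RolledBack { failed_in }
}

pub fn drive<E: UpgradeExecutor>(exec: &mut E) -> Outcome {
    if exec.stage().is_err() {
        return rollback(exec, Phase::Staging);
    }
    if exec.self_test().is_err() || exec.cutover().is_err() {
        return rollback(exec, Phase::Verifying);
    }
    // The agent never stops itself: it only waits for the operator's decision.
    if exec.wait_for_operator_stop() {
        Outcome::Stopped
    } else {
        rollback(exec, Phase::CutoverReady)
    }
}

/// Test double: each step fails or succeeds as configured; reverts are counted.
#[derive(Default)]
pub struct FakeExecutor {
    pub fail_stage: bool,
    pub fail_self_test: bool,
    pub fail_cutover: bool,
    pub operator_stops: bool,
    pub reverts: u32,
}

fn step(fail: bool, name: &str) -> Result<(), String> {
    if fail { Err(format!("{name} failed")) } else { Ok(()) }
}

impl UpgradeExecutor for FakeExecutor {
    fn stage(&mut self) -> Result<(), String> { step(self.fail_stage, "stage") }
    fn self_test(&mut self) -> Result<(), String> { step(self.fail_self_test, "self-test") }
    fn cutover(&mut self) -> Result<(), String> { step(self.fail_cutover, "cutover") }
    fn revert(&mut self) { self.reverts += 1; }
    fn wait_for_operator_stop(&mut self) -> bool { self.operator_stops }
}
```

With a fake like this, a test for the heartbeat-timeout path only needs `FakeExecutor::default()` (the operator never stops the agent) and an assertion that `revert` ran exactly once.
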
## Deviations (deliberate)

- **Marker + status ride NATS KV** (`agent-upgrade` / `agent-upgrade-status`),
  not a fire-and-forget subject, so they survive an operator restart; same
  ethos as Ch2. The ADR's `device-cmd.*`/`device-state.*.upgrade` subjects map
  onto the existing command protocol (`Verb::UpgradeStop`) and the status KV.
- **First-upgrade rollback without M1.** The real executor's `capture_revert_target`
  preserves the running binary at its versioned path on first cutover even when
  the initial install put a plain file at `/usr/local/bin/fleet-agent`. This
  makes M1 a clean-install nicety, not a rollback-correctness prerequisite.

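
The self-heal and the atomic swap could look roughly like this on a Unix filesystem. It is a sketch under assumptions, not the body of `capture_revert_target`: `preserve_running_binary` and `swap_symlink` are hypothetical helpers, and the temporary-name scheme is invented for illustration.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Atomically repoint `link` at `target`: create a temporary symlink next to
/// it, then rename it over `link`. rename(2) replaces the destination in one
/// step, so readers see either the old or the new target, never a gap.
pub fn swap_symlink(link: &Path, target: &Path) -> io::Result<()> {
    let tmp = link.with_extension("tmp-swap"); // hypothetical temp name
    let _ = fs::remove_file(&tmp); // clear leftovers from an interrupted swap
    symlink(target, &tmp)?;
    fs::rename(&tmp, link)
}

/// Make sure a revert target exists before the first cutover.
/// If `link` is already a symlink, its current target is the revert target.
/// If it is a plain file (an install that predates the versioned layout),
/// copy it to `versioned` and turn `link` into a symlink pointing there.
pub fn preserve_running_binary(link: &Path, versioned: &Path) -> io::Result<PathBuf> {
    if fs::symlink_metadata(link)?.file_type().is_symlink() {
        return fs::read_link(link);
    }
    fs::copy(link, versioned)?; // on Unix this also copies permission bits
    swap_symlink(link, versioned)?;
    Ok(versioned.to_path_buf())
}
```

Reverting is then the same `swap_symlink` call with the preserved path as the target, which is why the rollback needs no special case for the first upgrade.
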
## Flagged for a supervised run (not done tonight)

1. **M1 clean install layout**: `FleetDeviceSetupScore` should install to
   `/usr/bin/fleet-agent-v<ver>` + symlink `/usr/local/bin/fleet-agent` from the
   start. Needs a new `agent_version` config field (≈9 construction sites) and a
   `FileSource::Symlink` delivery primitive (ansible `state: link`). The executor
   self-heal above covers correctness in the meantime.
2. **libvirt vX→vX+1 e2e + corrupt-binary auto-revert**: needs two built agent
   binaries, a served URL reachable from the VM, and a KVM run. The VM harness
   exists (`harmony-fleet-e2e/src/vm`); the protocol brain is unit-green, so this
   is an integration proof to run on real hardware. The corrupt-binary path is
   already unit-proven via `stage_failure_*` / `heartbeat_timeout_reverts_*`.