Two design documents framing the next push. `ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push. Replaces the open-ended chapter structure of v0_1_plan.md for the period between the walking-skeleton merge and v0.1.0 in production. Focus is locking the fleet module's public API surface so the inevitable physical refactor (out of `harmony/modules/fleet/`, into `fleet/harmony-fleet/`) is mechanical when we get to it. Anchored in the principle from JG's *Pour l'amour des compilateurs* talk: design the brick before moving the brick. `docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure. K8s rolling-update shape applied to one host: drain in-flight work, stage versioned binary alongside old, smoke-test, atomic symlink swap, both agents alive briefly, operator verifies new agent's heartbeat then sends explicit stop signal to old, old exits cleanly. No version is ever erased — N-history on disk is the rollback target. Operator-driven cutover (not self-stopping) so the most-trusted side decides the handoff. Implementation deferred to post-v0.1 backlog; spec exists so anyone can build it without reinventing the design. ADR README index updated.
Architecture Decision Record: Fleet Agent Upgrade Procedure
Initial Author: Jean-Gabriel Gill-Couture
Initial Date: 2026-05-06
Last Updated Date: 2026-05-06
Status
Accepted (design); implementation deferred — see roadmap
ROADMAP/fleet_platform/v0_2_plan.md.
Context
The v0.1 fleet agent ships as a single static aarch64-musl binary
sitting at /usr/local/bin/fleet-agent, started by a systemd
unit dropped at install time by FleetDeviceSetupScore. Every
managed device runs one. Today the only "upgrade procedure" is
scp + systemctl restart — fine for the bring-up phase, not
fine once paying customers run real workloads on the fleet.
Without a defined upgrade story we cannot ship a v0.1 agent into the field. The contract a customer needs is:
- New agent versions can be rolled out without operator-side manual intervention per device.
- Workloads currently reconciled on the device do not flap (start/stop/start) during the upgrade.
- A failed new version automatically reverts to the last known-good version on its own, without paging anyone.
- The operator (the central one in the cluster, not the human) sees what version each device is running, can drive a target version per device, and observes upgrade progress.
The agent itself is the only process on-device with full context on what's reconciling and what's healthy. Anything centralized (Ansible-pushed, OS-package-managed) doesn't have that signal. The agent must be the one driving its own swap, with the operator coordinating but not executing.
Decision
We adopt a K8s rolling-update–shape upgrade, single-host, agent-driven, operator-coordinated. Old version stays alive until new is verified healthy from the operator's vantage point; only then does the operator signal old to exit. No version is ever erased from disk. Symlinks select the active binary.
On-disk layout
/usr/bin/fleet-agent-v0.1.1 ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.2 ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.3 ← versioned binary, immutable
/usr/local/bin/fleet-agent → symlink to current versioned binary
- Versioned binaries are the source of truth. They live forever (history-preserving, no GC). Disk use is bounded by humans cleaning up explicitly, not by the upgrade procedure.
- The systemd unit installed by FleetDeviceSetupScore references /usr/local/bin/fleet-agent. Symlink swap is the cutover primitive: replacing the symlink with rename(2) is atomic on POSIX.
- Naming convention: exact crate version string, v<MAJOR>.<MINOR>.<PATCH>, no build metadata in the path. Build metadata lives in the agent's reported version string but not in the file path (otherwise you can't predict the path from a version pin).
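The layout and the swap primitive above can be sketched as follows. This is a minimal illustration in Python, not the agent's implementation: the helper names `versioned_path` and `swap_symlink` are hypothetical, and the demo runs in a temporary directory rather than under /usr/bin. It assumes a POSIX filesystem, where renaming a temporary symlink over the live one replaces it in a single step.

```python
import os
import re
import tempfile

# Plain v<MAJOR>.<MINOR>.<PATCH>; build metadata such as "+build5" is rejected.
VERSION_RE = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def versioned_path(version: str, bin_dir: str = "/usr/bin") -> str:
    """Map a version pin like 'v0.1.2' to its predictable on-disk path."""
    if not VERSION_RE.fullmatch(version):
        raise ValueError(f"not a plain v<MAJOR>.<MINOR>.<PATCH> version: {version!r}")
    return os.path.join(bin_dir, f"fleet-agent-{version}")


def swap_symlink(target: str, link: str) -> None:
    """Point `link` at `target` by renaming a temporary symlink over it."""
    tmp = link + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)      # clear a leftover from an interrupted swap
    os.symlink(target, tmp)
    os.replace(tmp, link)   # rename(2): the link is old or new, never missing


# Demo in a scratch directory (hypothetical stand-in for the real layout).
with tempfile.TemporaryDirectory() as root:
    for v in ("v0.1.1", "v0.1.2"):
        with open(versioned_path(v, root), "w") as f:
            f.write(v)
    link = os.path.join(root, "fleet-agent")
    swap_symlink(versioned_path("v0.1.1", root), link)
    swap_symlink(versioned_path("v0.1.2", root), link)
    current = os.path.basename(os.readlink(link))
```

Reverting is the same call with the previous version's path, which is why keeping every versioned binary on disk makes rollback cheap.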
State machine on the agent side
Running ──[operator publishes desired_version != current]──▶ Draining
   ▲                                                            │
   │                                                            │
   │                                                            ▼
   │                                                         Staging
   │                                                            │
   │                                                            ▼
   │                                                        Verifying
   │                                                            │
   │                                                            ▼
   │   ┌──────────────────────────────[smoke fails]─────────────┤
   │   │                                                        │
   │ [revert: symlink → previous,                               ▼
   │  stay at current]                                    Cutover-Ready
   │                                                            │
   │ [Cutover-Ready persists ≥ T_OPERATOR_OBSERVE               │
   │  until operator publishes stop_signal]                     │
   │                                                            ▼
   └──────────────────────────────────────────────────────── Stopping
                                                                │
                                                                ▼
                                                             (exit)
States in detail:
- Running: normal reconcile loop.
- Draining: refuses to start new podman services for new desired-state writes. In-flight reconciles complete and report their final state to the operator. Existing services stay running. Heartbeat continues. State is published as part of the agent's heartbeat (agent_state: "draining").
- Staging: fetch new versioned binary URL (signed, hash-pinned), verify, place at /usr/bin/fleet-agent-v<new>. Set chmod, ownership. No other state mutation.
- Verifying: invoke the staged binary with --self-test. New binary parses its config, opens NATS connection, validates JWT, prints version + "ok", exits 0. No state mutation. Catches obvious breakage (missing dependency, wrong arch, corrupt download, broken config-schema migration) before swap.
- Cutover-Ready: staged binary is healthy. Old agent updates the symlink atomically:
    ln -sfn /usr/bin/fleet-agent-v0.1.2 /usr/local/bin/fleet-agent.new
    mv -T /usr/local/bin/fleet-agent.new /usr/local/bin/fleet-agent
  Old agent then runs systemctl start fleet-agent-v0.1.2.service (a parallel transient service, not systemctl restart of itself). Both old and new are now running. New publishes its first heartbeat with version=v0.1.2. Operator sees two heartbeats per device for a brief window.
- Stopping: operator publishes a stop signal to the old agent's NATS subject. Old agent receives it and exits gracefully. systemd's Restart=on-failure does not trigger because the exit is a success (rc=0, code-path-driven). New agent is now the only one running. systemd unit is reconfigured to point at the current symlink target on its next restart, but that's cosmetic; the symlink already does the job.
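The states above can be read as a small transition table. The following is a sketch only: the event names (`desired_version_changed`, `drained`, and so on) are hypothetical labels chosen for illustration, and the failure edges follow the "agent stays in Running" wording used under failure modes.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    DRAINING = "draining"
    STAGING = "staging"
    VERIFYING = "verifying"
    CUTOVER_READY = "cutover-ready"
    STOPPING = "stopping"


# (state, event) -> next state. Event names are illustrative assumptions.
TRANSITIONS = {
    (State.RUNNING, "desired_version_changed"): State.DRAINING,
    (State.DRAINING, "drained"): State.STAGING,
    (State.STAGING, "staged"): State.VERIFYING,
    (State.STAGING, "stage_failed"): State.RUNNING,
    (State.VERIFYING, "self_test_ok"): State.CUTOVER_READY,
    (State.VERIFYING, "self_test_failed"): State.RUNNING,
    (State.CUTOVER_READY, "heartbeat_timeout"): State.RUNNING,
    (State.CUTOVER_READY, "stop_signal"): State.STOPPING,
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no table entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, start=State.RUNNING) -> State:
    """Fold a sequence of events over the transition table."""
    state = start
    for event in events:
        state = step(state, event)
    return state


happy_path = ["desired_version_changed", "drained", "staged", "self_test_ok", "stop_signal"]
final_state = run(happy_path)
```

Note that Stopping is reachable only through `stop_signal`, which mirrors the rule that the old agent never exits on its own observations.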
Operator-side coordination
The operator is the only source of truth for "what version should this device run". Two new fields per device, two new subjects.
New on Device CR / KV device-info:
- current_version: what the agent is running right now. Reported in heartbeat; reflected to the CR.
- desired_version: what the operator wants the agent to run. Set by operator-side logic (default: latest published; eventually canary / %-based).
New NATS subjects (per-device, scoped by callout permissions):
- device-cmd.<device_id>.upgrade-stop: operator → old agent. Payload: {"reason": "...", "deadline_ms": ...}. Sent only after operator has observed a heartbeat from the new version with current_version == desired_version AND agent_state == "running".
- device-state.<device_id>.upgrade: agent → operator. Status events: staging, verifying, cutover-ready, failed, done. Drives Device.status.upgrade.{phase, last_error, ...}.
The operator only emits upgrade-stop after it has independently
verified the new agent is up. Old agent does not stop itself
based on its own observations. This is the load-bearing
property: the same operator that disagreed with the upgrade
("haven't seen new version's heartbeat") would never have sent
the stop signal. Single-source-of-truth handoff.
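The operator-side gate can be sketched as a pure predicate over observed heartbeats. The `Heartbeat` shape and the function names below are assumptions made for illustration; only the field names, the subject pattern, and the condition itself come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    # Hypothetical heartbeat shape; field names follow the ADR text.
    device_id: str
    current_version: str
    agent_state: str


def should_send_upgrade_stop(desired_version: str, heartbeats) -> bool:
    """True only if some heartbeat shows the desired version in state 'running'."""
    return any(
        hb.current_version == desired_version and hb.agent_state == "running"
        for hb in heartbeats
    )


def upgrade_stop_subject(device_id: str) -> str:
    """Per-device subject the operator publishes the stop signal on."""
    return f"device-cmd.{device_id}.upgrade-stop"


old = Heartbeat("pi-01", "v0.1.1", "cutover-ready")
new = Heartbeat("pi-01", "v0.1.2", "running")

before_new_is_seen = should_send_upgrade_stop("v0.1.2", [old])
after_new_is_seen = should_send_upgrade_stop("v0.1.2", [old, new])
```

Because the predicate depends only on what the operator itself has observed, an operator that has not seen the new agent cannot emit the stop signal by accident.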
Failure modes and rollback
- Staging fails (download / hash mismatch): Agent stays in Running. Reports phase: "failed", last_error. Operator sees the failure; can fix the artifact + retry by re-publishing the same desired_version (any change to desired_version re-triggers the state machine).
- Verifying fails (smoke test rc != 0): Agent stays in Running. Reports failure. Staged binary stays on disk for inspection. Operator can collect, debug, ship a fixed version.
- Cutover-ready, but new agent never publishes a heartbeat with the new version within T_HEARTBEAT_TIMEOUT (suggested 60s): Old agent reverts the symlink, stops the parallel systemd transient service, transitions back to Running with the old version. Reports failed. Same recovery path.
- Operator never sends stop signal (e.g., operator-side outage): Old agent stays in Cutover-Ready indefinitely. Both agents are running; only the new one is publishing as the active one (the old one's writes are gated on its state). This is expensive (2× resource use) but safe: the operator is the authoritative coordinator and any other behavior would risk losing both agents at once.
- Both agents alive but new agent crashes: systemd's Restart=on-failure on the new agent's transient unit retries. If it can't come back, the operator never sends the stop signal, the old agent stays Cutover-Ready, and a human investigates. The fleet keeps working on the old version; the rollback is implicit.
- Operator publishes an older desired_version: Reverse rollout. Same mechanism, just with old/new swapped. The "new" binary is older, but the procedure is identical. The fact that no version is ever GC'd is what makes this work.
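The old agent's choices while it sits in Cutover-Ready reduce to three outcomes. The sketch below is illustrative: the function name and its inputs are hypothetical, and it assumes the old agent has some way to learn whether the new version's heartbeat has appeared. The 60-second value is the suggested T_HEARTBEAT_TIMEOUT from the list above.

```python
T_HEARTBEAT_TIMEOUT_S = 60.0  # suggested value from the ADR; tune per fleet


def cutover_action(elapsed_s: float, new_heartbeat_seen: bool, stop_signal_received: bool) -> str:
    """Decide what the old agent does while in Cutover-Ready.

    Returns one of:
      "exit"   - operator sent the stop signal; leave cleanly with rc=0
      "revert" - new agent never showed up in time; restore the old symlink
      "wait"   - keep running alongside the new agent
    """
    if stop_signal_received:
        return "exit"
    if not new_heartbeat_seen and elapsed_s >= T_HEARTBEAT_TIMEOUT_S:
        return "revert"
    # Covers both "still inside the timeout" and "operator outage": the old
    # agent stays alive for as long as the operator stays silent.
    return "wait"
```

There is deliberately no branch that exits on elapsed time alone: an operator outage leaves the device running two agents rather than none.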
What this isn't
- Not fleet-wide. Per-device. Fleet-wide canary / %-based rollouts are operator-side orchestration on top of this primitive. The operator would publish desired_version to a rolling subset of devices and watch heartbeats. Out of scope for v0.2: single-device upgrade is sufficient for a 100-Pi fleet, which is more than the 12-month customer roadmap.
- Not blue/green of the entire OS. We swap one userspace binary. The OS, podman, the systemd unit text, the kernel: all unchanged. Out of scope.
- Not a package manager. Versioned binaries land at fixed paths because we control them. apt / dpkg / OSTree are orthogonal and not in the loop.
Rationale
- No version ever erased. Trivializes rollback (the previous binary is a ln -sfn away). Simplifies the failure tree: every "what if" branch resolves to "old still on disk". Disk cost on aarch64-musl is ~5–10 MB per version; at 12 versions / year, that's roughly 60–120 MB per year, or about 0.6–1.2 GB after a decade of upgrades. Small compared to Pi storage.
- Symlink swap as cutover. POSIX-atomic. No daemon state. Cheap to revert. Compatible with systemd unit references that point at a stable path.
- Old verifies new, then reports up. This is the load-bearing property: it places the verification at the agent (which has the only complete view of its own runtime state) but the commitment at the operator (which is the only thing safe to trust as the cluster-wide source of truth). Either side alone can fail safe; only consensus advances the upgrade.
- Operator-driven stop, not agent self-stop. A self-stopping agent could decide to exit before the operator agrees, leaving the cluster blind. Forcing the stop through the operator means any disagreement keeps the old agent alive — which is the desired bias.
- Drains in-flight work first. Mirrors K8s pod-shutdown semantics. A workload reconciling at the moment of swap finishes its current step, reports state, then queues. New agent picks up the queue once it's the active version. No observable flap on the workload.
- Heartbeat-driven version reporting. The agent already publishes heartbeats; adding the version field is one line. No new transport.
Consequences
Pros:
- Bounded blast radius per upgrade (one device).
- Rollback is the same code path as upgrade — no special-case bug class.
- Operator's view is monotonic: heartbeats with versions are immutable history; there's no "did the upgrade really happen" state.
- Old agent never decides to exit on its own. The most dangerous failure mode in self-upgrading software (premature exit) is designed out.
- Compatible with eventual fleet-wide rollouts (canary, %-based) which become operator-side orchestration on top of this primitive.
Cons:
- Briefly runs two agents per device (Cutover-Ready window). Memory and connection-count both ~2× during that window. Acceptable for the upgrade duration (typically <60s).
- Requires reliable connectivity between agent and operator to complete the handoff. A device whose NATS link fails mid-upgrade stays in Cutover-Ready until link recovers.
- Disk grows monotonically with version count. Bounded by human cleanup. We do not GC.
- New NATS subjects, new heartbeat fields, new Device.status fields. Schema bump that operators-in-the-field need to handle (the operator must understand "old agent reporting no version field" as version: unknown, not crash).
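The schema-bump concern in the last point can be illustrated with a tolerant heartbeat reader. This is a sketch under assumptions: the JSON encoding, the `version` key, and the function name are hypothetical, since the heartbeat wire format is not specified here.

```python
import json


def reported_version(raw: bytes) -> str:
    """Extract the agent version from a heartbeat, tolerating pre-upgrade agents.

    Agents that predate the version field are reported as "unknown" instead of
    raising, so the operator keeps reconciling them.
    """
    payload = json.loads(raw)
    version = payload.get("version")
    if isinstance(version, str) and version:
        return version
    return "unknown"


legacy = reported_version(b'{"device_id": "pi-01"}')
modern = reported_version(b'{"device_id": "pi-01", "version": "v0.1.2"}')
```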
Alternatives considered
- OS-package upgrade (apt / dpkg / OSTree). Pros: zero custom code, standard toolchain, GPG-signed. Cons: Loses the "agent verifies the new agent before swap" property. apt's restart hook flips the symlink and runs systemctl restart; if the new binary is broken, the device is bricked until human intervention. Doesn't drain in-flight work. Doesn't know about NATS-managed pause states. Couples the upgrade schedule to the distro's repo, not to the cluster operator's intent. Rejected.
- Pull-from-OCI-registry on each agent restart. Pros: same primitive as podman / kube node-image-rotation. Cons: Coupling to a registry the device must reach; many customer fleets are on private subnets without registry access. Would mean shipping a registry mirror per fleet. Adds a dependency for a problem we can solve with a signed binary on a CDN.
- Two systemd units, blue/green at the unit level. fleet-agent-v0.1.1.service and fleet-agent-v0.1.2.service, ratchet via systemctl enable/disable. Pros: no symlink dance. Cons: duplicates a lot of unit-file content; harder to reason about what the "active" unit is (you have to ask systemd, not readlink); doesn't compose well with the ExecStart=/usr/local/bin/fleet-agent line we already ship. Symlink swap is the lighter primitive.
- Self-stopping agent (no operator stop signal). New agent tells old agent "I'm up, you can go" via NATS. Pros: one fewer subject. Cons: The new agent is also the agent we're least sure of; putting it in charge of the old one's lifecycle inverts the trust model. If the new agent has a bug that causes it to announce ready prematurely, the cluster goes blind. The operator path is the conservative choice.
- Operator-pushed binary (instead of agent-pulled). The operator sshes / executes a one-off command per device. Pros: operator controls timing precisely. Cons: Reintroduces SSH as a control plane (we just spent a month getting rid of it for the enrollment flow). Doesn't scale to fleets where most devices are NATted away from the operator.
Implementation milestones
(For a future implementer; not committed to a date here. Lives in the v0.2+ backlog.)
- M1: Versioned binary layout. Builds produce fleet-agent-v<version> artifacts; install Score writes them to /usr/bin/fleet-agent-v<version> + creates the /usr/local/bin/fleet-agent symlink. Existing tests cover the rest.
- M2: Version field in heartbeat + Device.status.current_version reflection on the operator side. No upgrade behavior yet.
- M3: desired_version field on the device-info KV + operator setter. No agent-side action yet.
- M4: Agent state machine, end to end, gated by a feature flag. Operator publishes desired_version → agent does the dance → operator sends stop signal → done. Includes failure-mode tests (download fail, smoke fail, heartbeat-timeout revert).
- M5: Remove the feature flag. Default-on.
- M6: Operator-side rollout strategies (canary, %-based), only after M5 has been in production for 30 days against a real fleet.
Additional Notes
- Binary signing + signature verification is in scope for the Staging step, but the choice of signing scheme (cosign / Rekor / minisign) is deferred until the M1 implementation. Whatever we pick must work on aarch64-musl Pi devices without additional system dependencies.
- The N-versions-on-disk policy is "all of them, forever" per the constraint above. If disk pressure becomes real on some customer fleet, a manual GC tool can prune /usr/bin/fleet-agent-v* by date; never automatic, never as part of the upgrade itself.
- See JG's Pour l'amour des compilateurs talk (Botpress Meetup, 2026-04-30) for the framing applied here: cardinality-matched types and operator-as-coordinator are the same idea, applied to one function and to one platform.