
Jean-Gabriel Gill-Couture 064fa1da0d docs: v0.2 roadmap + ADR-022 fleet agent upgrade procedure

Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push.
Replaces the open-ended chapter structure of v0_1_plan.md for the
period between the walking-skeleton merge and v0.1.0 in production.
Focus is locking the fleet module's public API surface so the
inevitable physical refactor (out of `harmony/modules/fleet/`,
into `fleet/harmony-fleet/`) is mechanical when we get to it.
Anchored in the principle from JG's *Pour l'amour des compilateurs*
talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure.
K8s rolling-update shape applied to one host: drain in-flight
work, stage versioned binary alongside old, smoke-test, atomic
symlink swap, both agents alive briefly, operator verifies new
agent's heartbeat then sends explicit stop signal to old, old
exits cleanly. No version is ever erased — N-history on disk is
the rollback target. Operator-driven cutover (not self-stopping)
so the most-trusted side decides the handoff. Implementation
deferred to post-v0.1 backlog; spec exists so anyone can build
it without reinventing the design.

ADR README index updated.

2026-05-06 22:51:14 -04:00


Architecture Decision Record: Fleet Agent Upgrade Procedure

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-05-06

Last Updated Date: 2026-05-06

Status

Accepted (design); implementation deferred — see roadmap ROADMAP/fleet_platform/v0_2_plan.md.

Context

The v0.1 fleet agent ships as a single static aarch64-musl binary sitting at /usr/local/bin/fleet-agent, started by a systemd unit dropped at install time by FleetDeviceSetupScore. Every managed device runs one. Today the only "upgrade procedure" is scp + systemctl restart — fine for the bring-up phase, not fine once paying customers run real workloads on the fleet.

Without a defined upgrade story we cannot ship a v0.1 agent into the field. The contract a customer needs is:

New agent versions can be rolled out without operator-side manual intervention per device.
Workloads currently reconciled on the device do not flap (start/stop/start) during the upgrade.
A failed new version automatically reverts to the last known-good version on its own, without paging anyone.
The operator (the central one in the cluster, not the human) sees what version each device is running, can drive a target version per device, and observes upgrade progress.

The agent itself is the only process on-device with full context on what's reconciling and what's healthy. Anything centralized (Ansible-pushed, OS-package-managed) doesn't have that signal. The agent must be the one driving its own swap, with the operator coordinating but not executing.

Decision

We adopt a K8s rolling-update–shape upgrade, single-host, agent-driven, operator-coordinated. Old version stays alive until new is verified healthy from the operator's vantage point; only then does the operator signal old to exit. No version is ever erased from disk. Symlinks select the active binary.

On-disk layout

/usr/bin/fleet-agent-v0.1.1            ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.2            ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.3            ← versioned binary, immutable
/usr/local/bin/fleet-agent             → symlink to current versioned binary

Versioned binaries are the source of truth. They live forever (history-preserving, no GC). Disk use is bounded by humans cleaning up explicitly, not by the upgrade procedure.
The systemd unit installed by FleetDeviceSetupScore references /usr/local/bin/fleet-agent. Symlink swap is the cutover primitive: the new link is renamed over the old one, and rename(2) is atomic on POSIX.
Naming convention: exact crate version string, v<MAJOR>.<MINOR>.<PATCH>, no build metadata in the path. Build metadata lives in the agent's reported version string but not in the file path (otherwise you can't predict the path from a version pin).
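
The naming convention can be checked mechanically. The following is a minimal sketch, not the agent's actual code: the function name `versioned_binary_path` and the constant `VERSIONED_DIR` are hypothetical, and it assumes the version pin is passed without the leading `v`.

```rust
use std::path::PathBuf;

/// Directory holding the immutable versioned binaries (from the layout above).
/// The constant name is an assumption made for this sketch.
const VERSIONED_DIR: &str = "/usr/bin";

/// Returns the on-disk path for a plain `MAJOR.MINOR.PATCH` version pin,
/// or `None` if the string carries anything else (pre-release tag, build metadata).
pub fn versioned_binary_path(version: &str) -> Option<PathBuf> {
    let parts: Vec<&str> = version.split('.').collect();
    if parts.len() != 3 {
        return None;
    }
    // Each component must be a non-empty run of ASCII digits.
    let all_numeric = parts
        .iter()
        .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()));
    if !all_numeric {
        return None;
    }
    Some(PathBuf::from(format!(
        "{}/fleet-agent-v{}",
        VERSIONED_DIR, version
    )))
}
```

Rejecting build metadata at this point keeps the path predictable from the pin alone, which is the property the naming convention asks for.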

State machine on the agent side

Running ──[operator publishes desired_version != current]──▶ Draining
   ▲                                                            │
   │                                                            │
   │                                                            ▼
   │                                                          Staging
   │                                                            │
   │                                                            ▼
   │                                                         Verifying
   │                                                            │
   │                                                            ▼
   │       ┌──────────────────────────────[smoke fails]────────┤
   │       │                                                   │
   │   [revert: symlink → previous,                            ▼
   │    stay at current]                                  Cutover-Ready
   │                                                            │
   │   [Cutover-Ready persists ≥ T_OPERATOR_OBSERVE              │
   │    until operator publishes stop_signal]                   │
   │                                                            ▼
   └────────────────────────────────────────────────────── Stopping
                                                                │
                                                                ▼
                                                              (exit)
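
One possible encoding of the diagram as a pure transition function is sketched below. The type and event names are illustrative assumptions, and the failure transitions follow the "Failure modes and rollback" section of this record rather than any existing implementation.

```rust
/// Agent-side upgrade states, named after the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    Running,
    Draining,
    Staging,
    Verifying,
    CutoverReady,
    Stopping,
    Exited,
}

/// Events that move the state machine. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    DesiredVersionChanged, // operator published desired_version != current
    DrainComplete,
    StageOk,
    StageFailed,
    SmokeOk,
    SmokeFailed,
    NewAgentHeartbeatTimeout,
    StopSignal, // operator published upgrade-stop
    ExitClean,
}

/// Returns the state that follows `state` when `event` arrives.
pub fn next_state(state: AgentState, event: Event) -> AgentState {
    use AgentState::*;
    use Event::*;
    match (state, event) {
        (Running, DesiredVersionChanged) => Draining,
        (Draining, DrainComplete) => Staging,
        (Staging, StageOk) => Verifying,
        (Staging, StageFailed) => Running,
        (Verifying, SmokeOk) => CutoverReady,
        (Verifying, SmokeFailed) => Running,
        (CutoverReady, NewAgentHeartbeatTimeout) => Running,
        (CutoverReady, StopSignal) => Stopping,
        (Stopping, ExitClean) => Exited,
        // Any other pairing is ignored: the agent stays where it is.
        (s, _) => s,
    }
}
```

Keeping the transitions in one exhaustive `match` means that the only way out of `CutoverReady` toward `Stopping` is the operator's stop signal, which mirrors the operator-driven cutover described in this record.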

States in detail:

Running — normal reconcile loop.
Draining — refuses to start new podman services for new desired-state writes. In-flight reconciles complete and report their final state to the operator. Existing services stay running. Heartbeat continues. State is published as part of the agent's heartbeat (agent_state: "draining").
Staging — fetch the new versioned binary from its URL (signed, hash-pinned), verify it, and place it at /usr/bin/fleet-agent-v<new>. Set file mode and ownership. No other state mutation.
Verifying — invoke the staged binary with --self-test. New binary parses its config, opens NATS connection, validates JWT, prints version + "ok", exits 0. No state mutation. Catches obvious breakage (missing dependency, wrong arch, corrupt download, broken config-schema migration) before swap.
Cutover-Ready — staged binary is healthy. Old agent updates the symlink atomically:
```
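# Build the new link under a temporary name, then rename it over the live one.
ln -sfn /usr/bin/fleet-agent-v0.1.2 /usr/local/bin/fleet-agent.new
# -T treats the destination as a plain file, so the old symlink is replaced atomically.
mv -T /usr/local/bin/fleet-agent.new /usr/local/bin/fleet-agent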
```
The old agent then runs systemctl start fleet-agent-v0.1.2.service (a parallel transient service, not a systemctl restart of itself). Both old and new are now running. The new agent publishes its first heartbeat with version=v0.1.2. The operator sees two heartbeats per device for a brief window.
Stopping — operator publishes a stop signal to the old agent's NATS subject. Old agent receives, gracefully exits. systemd's Restart=on-failure does not trigger because the exit is success (rc=0, code-path-driven). New agent is now the only one running. systemd unit is reconfigured to point at the current symlink target on its next restart, but that's cosmetic — the symlink already does the job.
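
If the agent performs the Cutover-Ready swap in-process rather than shelling out to ln and mv, the same two steps could look like the sketch below. The function name `swap_symlink` and the `.new` suffix handling are assumptions; only standard-library calls are used.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Points `link` at `target` by creating a temporary symlink next to it and
/// renaming it over the old one. rename(2) replaces the destination atomically,
/// so readers see either the old target or the new one, never a missing link.
pub fn swap_symlink(target: &Path, link: &Path) -> io::Result<()> {
    // Temporary name in the same directory, so the rename stays on one filesystem.
    let mut tmp_name = link.as_os_str().to_owned();
    tmp_name.push(".new");
    let tmp = PathBuf::from(tmp_name);

    // Clear a stale temporary link left behind by an interrupted attempt.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    symlink(target, &tmp)?;
    // Renames the link itself (it is not followed) over the live path.
    fs::rename(&tmp, link)
}
```

Reverting after a failed cutover would be the same call with the previous versioned binary as `target`.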

Operator-side coordination

The operator is the only source of truth for "what version should this device run". Two new fields per device, two new subjects.

New on Device CR / KV device-info:

current_version — what the agent is running right now. Reported in heartbeat; reflected to the CR.
desired_version — what the operator wants the agent to run. Set by operator-side logic (default: latest published; eventually canary / %-based).

New NATS subjects (per-device, scoped by callout permissions):

device-cmd.<device_id>.upgrade-stop — operator → old agent. Payload: {"reason": "...", "deadline_ms": ...}. Sent only after operator has observed a heartbeat from the new version with current_version == desired_version AND agent_state == "running".
device-state.<device_id>.upgrade — agent → operator. Status events: staging, verifying, cutover-ready, failed, done. Drives Device.status.upgrade.{phase, last_error, ...}.

The operator only emits upgrade-stop after it has independently verified the new agent is up. The old agent does not stop itself based on its own observations. This is the load-bearing property: an operator that has not yet seen the new version's heartbeat has, by construction, not sent the stop signal, so the old agent is still alive. Single-source-of-truth handoff.
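
The operator-side gate can be expressed as a small predicate over observed heartbeats. This is a sketch under stated assumptions: the `Heartbeat` struct and both function names are hypothetical, while the field names, the "running" state string, and the subject pattern are taken from this record.

```rust
/// The heartbeat fields that matter for the stop decision.
/// The struct is illustrative; field names follow the record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Heartbeat {
    pub current_version: String,
    pub agent_state: String,
}

/// True only when some observed heartbeat reports the desired version
/// in the "running" state. The old agent's heartbeats carry the old
/// version, so they can never satisfy this check on their own.
pub fn may_send_upgrade_stop(observed: &[Heartbeat], desired_version: &str) -> bool {
    observed
        .iter()
        .any(|hb| hb.current_version == desired_version && hb.agent_state == "running")
}

/// Subject for the stop command, following the per-device naming above.
pub fn upgrade_stop_subject(device_id: &str) -> String {
    format!("device-cmd.{}.upgrade-stop", device_id)
}
```

With no heartbeat from the new version, the predicate stays false and no stop command is published, which is the fail-safe bias the handoff relies on.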

Failure modes and rollback

Staging fails (download / hash mismatch): Agent stays in Running. Reports phase: "failed", last_error. Operator sees the failure; can fix the artifact + retry by re-publishing the same desired_version (any change to desired_version re-triggers the state machine).
Verifying fails (smoke test rc != 0): Agent stays in Running. Reports failure. Staged binary stays on disk for inspection. Operator can collect, debug, ship a fixed version.
Cutover-ready, but new agent never publishes a heartbeat with the new version within T_HEARTBEAT_TIMEOUT (suggested 60s): Old agent reverts the symlink, stops the parallel systemd transient service, transitions back to Running with the old version. Reports failed. Same recovery path.
Operator never sends stop signal (e.g., operator-side outage): Old agent stays in Cutover-Ready indefinitely. Both agents are running; only the new one is publishing as the active one (the old one's writes are gated on its state). This is expensive (2× resource use) but safe — the operator is the authoritative coordinator and any other behavior would risk losing both agents at once.
Both agents alive but new agent crashes: systemd's Restart=on-failure on the new agent's transient unit retries. If it can't come back, the operator never sends the stop signal, the old agent stays Cutover-Ready, and a human investigates. The fleet keeps working on the old version — the rollback is implicit.
Operator publishes an older desired_version: Reverse rollout. Same mechanism, just with old/new swapped. The "new" binary is older, but the procedure is identical. The fact that no version is ever GC'd is what makes this work.
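
The failure branches above reduce to a small lookup from failure kind to what the old agent does next. The sketch below is one way to write that table down; all type and function names are hypothetical, and the 60-second timeout is the suggested value from this record, not a measured one.

```rust
use std::time::Duration;

/// Suggested value from the record; expected to be tuned per fleet.
pub const T_HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(60);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Failure {
    StagingFailed,
    SmokeTestFailed,
    NewAgentHeartbeatTimeout,
    OperatorSilent,
    NewAgentCrashLoop,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OldAgentState {
    Running,
    CutoverReady,
}

/// What the old agent does in response to a failure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Recovery {
    pub revert_symlink: bool,
    pub stop_new_unit: bool,
    pub old_agent_state: OldAgentState,
}

pub fn recovery_for(failure: Failure) -> Recovery {
    use Failure::*;
    match failure {
        // The symlink was never touched, so there is nothing to undo.
        StagingFailed | SmokeTestFailed => Recovery {
            revert_symlink: false,
            stop_new_unit: false,
            old_agent_state: OldAgentState::Running,
        },
        // The swap already happened: undo it and stop the parallel unit.
        NewAgentHeartbeatTimeout => Recovery {
            revert_symlink: true,
            stop_new_unit: true,
            old_agent_state: OldAgentState::Running,
        },
        // Without a stop signal the old agent simply keeps waiting.
        OperatorSilent | NewAgentCrashLoop => Recovery {
            revert_symlink: false,
            stop_new_unit: false,
            old_agent_state: OldAgentState::CutoverReady,
        },
    }
}

/// True once the new agent has been silent for at least the timeout.
pub fn new_agent_timed_out(since_cutover: Duration) -> bool {
    since_cutover >= T_HEARTBEAT_TIMEOUT
}
```

Every branch leaves the old agent in either Running or Cutover-Ready, so no failure path ends with zero agents on the device.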

What this isn't

Not fleet-wide. Per-device. Fleet-wide canary / %-based rollouts are operator-side orchestration on top of this primitive. The operator would publish desired_version to a rolling subset of devices and watch heartbeats. Out of scope for v0.2: single-device upgrade is sufficient for a 100-Pi fleet, which is already more than the 12-month customer roadmap calls for.
Not blue/green of the entire OS. We swap one userspace binary. The OS, podman, the systemd unit text, the kernel — all unchanged. Out of scope.
Not a package manager. Versioned binaries land at fixed paths because we control them. apt / dpkg / OSTree are orthogonal and not in the loop.

Rationale

No version ever erased. Trivializes rollback (the previous binary is a ln -sfn away). Simplifies the failure tree: every "what if" branch resolves to "old still on disk". Disk cost on aarch64-musl is ~5–10 MB per version; at 12 versions / year that is roughly 60–120 MB per year, or 0.6–1.2 GB after a decade of upgrades. Small compared to Pi storage.
Symlink swap as cutover. POSIX-atomic. No daemon state. Cheap to revert. Compatible with systemd unit references that point at a stable path.
Old verifies new, then reports up. This is the load-bearing property: it places the verification at the agent (which has the only complete view of its own runtime state) but the commitment at the operator (which is the only thing safe to trust as the cluster-wide source of truth). Either side alone can fail safe; only consensus advances the upgrade.
Operator-driven stop, not agent self-stop. A self-stopping agent could decide to exit before the operator agrees, leaving the cluster blind. Forcing the stop through the operator means any disagreement keeps the old agent alive — which is the desired bias.
Drains in-flight work first. Mirrors K8s pod-shutdown semantics. A workload reconciling at the moment of swap finishes its current step, reports state, then queues. New agent picks up the queue once it's the active version. No observable flap on the workload.
Heartbeat-driven version reporting. The agent already publishes heartbeats; adding the version field is one line. No new transport.

Consequences

Pros:

Bounded blast radius per upgrade (one device).
Rollback is the same code path as upgrade — no special-case bug class.
Operator's view is monotonic: heartbeats with versions are immutable history; there's no "did the upgrade really happen" state.
Old agent never decides to exit on its own. The most dangerous failure mode in self-upgrading software (premature exit) is designed out.
Compatible with eventual fleet-wide rollouts (canary, %-based) which become operator-side orchestration on top of this primitive.

Cons:

Briefly runs two agents per device (Cutover-Ready window). Memory and connection-count both ~2× during that window. Acceptable for the upgrade duration (typically <60s).
Requires reliable connectivity between agent and operator to complete the handoff. A device whose NATS link fails mid-upgrade stays in Cutover-Ready until the link recovers.
Disk grows monotonically with version count. Bounded by human cleanup. We do not GC.
New NATS subjects, new heartbeat fields, new Device.status fields. Schema bump that operators-in-the-field need to handle (the operator must understand "old agent reporting no version field" as version: unknown, not crash).
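
The schema-bump concern in the last point can be handled by making the version field optional on the operator side. A minimal sketch follows; the type and function names are hypothetical, and treating an empty string as unknown is an assumption of this example rather than something the record specifies.

```rust
/// Version as seen by the operator after decoding a heartbeat.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReportedVersion {
    Known(String),
    Unknown,
}

/// Maps the optional heartbeat field to a value the operator can always store.
/// A missing field (pre-upgrade agent) becomes `Unknown` instead of an error.
pub fn reported_version(field: Option<&str>) -> ReportedVersion {
    match field.map(str::trim) {
        Some(v) if !v.is_empty() => ReportedVersion::Known(v.to_string()),
        // Assumption: blank values are treated the same as an absent field.
        _ => ReportedVersion::Unknown,
    }
}
```

An operator built this way can reflect `Unknown` into Device.status for older agents and keep reconciling.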

Alternatives considered

OS-package upgrade (apt / dpkg / OSTree). Pros: zero custom code, standard toolchain, GPG-signed. Cons: Loses the "agent verifies the new agent before swap" property. apt's restart hook flips the symlink and systemctl restarts; if the new binary is broken, the device is bricked until human intervention. Doesn't drain in-flight work. Doesn't know about NATS-managed pause states. Couples the upgrade schedule to the distro's repo, not to the cluster operator's intent. Rejected.
Pull-from-OCI-registry on each agent restart. Pros: same primitive as podman / kube node-image-rotation. Cons: Coupling to a registry the device must reach — many customer fleets are on private subnets without registry access. Would mean shipping a registry mirror per fleet. Adds a dependency for a problem we can solve with a signed binary on a CDN.
Two systemd units, blue/green at the unit level. fleet-agent-v0.1.1.service and fleet-agent-v0.1.2.service, ratchet via systemctl enable/disable. Pros: no symlink dance. Cons: duplicates a lot of unit-file content; harder to reason about what the "active" unit is (you have to ask systemd, not readlink); doesn't compose well with the ExecStart=/usr/local/bin/fleet-agent line we already ship. Symlink swap is the lighter primitive.
Self-stopping agent (no operator stop signal). New agent tells old agent "I'm up, you can go" via NATS. Pros: one fewer subject. Cons: The new agent is also the agent we're least sure of — putting it in charge of the old one's lifecycle inverts the trust model. If the new agent has a bug that causes it to announce ready prematurely, the cluster goes blind. The operator path is the conservative choice.
Operator-pushed binary (instead of agent-pulled). The operator sshes / executes a one-off command per device. Pros: operator controls timing precisely. Cons: Reintroduces SSH as a control plane (we just spent a month getting rid of it for the enrollment flow). Doesn't scale to fleets where most devices are NATted away from the operator.

Implementation milestones

(For a future implementer; not committed to a date here. Lives in the v0.2+ backlog.)

M1 — Versioned binary layout: builds produce fleet-agent-v<version> artifacts; install Score writes them to /usr/bin/fleet-agent-v<version> + creates /usr/local/bin/fleet-agent symlink. Existing tests cover the rest.
M2 — Version field in heartbeat + Device.status.current_version reflection on the operator side. No upgrade behavior yet.
M3 — desired_version field on the device-info KV + operator setter. No agent-side action yet.
M4 — Agent state machine, end to end, gated by a feature flag. Operator publishes desired_version → agent does the dance → operator sends stop signal → done. Includes failure-mode tests (download fail, smoke fail, heartbeat-timeout revert).
M5 — Remove the feature flag. Default-on.
M6 — Operator-side rollout strategies (canary, %-based) — only after M5 has been in production for 30 days against a real fleet.

Additional Notes

Binary signing + signature verification is in scope for the Staging step, but the choice of signing scheme (cosign / Rekor / minisign) is deferred until the M1 implementation. Whatever we pick must work on aarch64-musl Pi devices without additional system dependencies.
The N-versions-on-disk policy is "all of them, forever" per the constraint above. If disk pressure becomes real on some customer fleet, a manual GC tool can prune /usr/bin/fleet-agent-v* by date — never automatic, never as part of the upgrade itself.
See JG's Pour l'amour des compilateurs talk (Botpress Meetup, 2026-04-30) for the framing applied here: cardinality-matched types and operator-as-coordinator are the same idea, applied to one function and to one platform.
