harmony/docs/adr/022-fleet-agent-upgrade.md
Jean-Gabriel Gill-Couture 064fa1da0d docs: v0.2 roadmap + ADR-022 fleet agent upgrade procedure
Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push.
Replaces the open-ended chapter structure of v0_1_plan.md for the
period between the walking-skeleton merge and v0.1.0 in production.
Focus is locking the fleet module's public API surface so the
inevitable physical refactor (out of `harmony/modules/fleet/`,
into `fleet/harmony-fleet/`) is mechanical when we get to it.
Anchored in the principle from JG's *Pour l'amour des compilateurs*
talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure.
K8s rolling-update shape applied to one host: drain in-flight
work, stage versioned binary alongside old, smoke-test, atomic
symlink swap, both agents alive briefly, operator verifies new
agent's heartbeat then sends explicit stop signal to old, old
exits cleanly. No version is ever erased — N-history on disk is
the rollback target. Operator-driven cutover (not self-stopping)
so the most-trusted side decides the handoff. Implementation
deferred to post-v0.1 backlog; spec exists so anyone can build
it without reinventing the design.

ADR README index updated.
2026-05-06 22:51:14 -04:00

# Architecture Decision Record: Fleet Agent Upgrade Procedure
Initial Author: Jean-Gabriel Gill-Couture
Initial Date: 2026-05-06
Last Updated Date: 2026-05-06
## Status
Accepted (design); implementation deferred — see roadmap
`ROADMAP/fleet_platform/v0_2_plan.md`.
## Context
The v0.1 fleet agent ships as a single static aarch64-musl binary
sitting at `/usr/local/bin/fleet-agent`, started by a systemd
unit dropped at install time by `FleetDeviceSetupScore`. Every
managed device runs one. Today the only "upgrade procedure" is
`scp` + `systemctl restart` — fine for the bring-up phase, not
fine once paying customers run real workloads on the fleet.
Without a defined upgrade story we cannot ship a v0.1 agent into
the field. The contract a customer needs is:
1. New agent versions can be rolled out without operator-side
manual intervention per device.
2. Workloads currently reconciled on the device do not flap
(start/stop/start) during the upgrade.
3. A failed new version automatically reverts to the last
known-good version, on its own, without paging anyone.
4. The operator (the central one in the cluster, not the human)
sees what version each device is running, can drive a target
version per device, and observes upgrade progress.
The agent itself is the only process on-device with full context
on what's reconciling and what's healthy. Anything centralized
(Ansible-pushed, OS-package-managed) doesn't have that signal.
The agent must be the one driving its own swap, with the
operator coordinating but not executing.
## Decision
We adopt a **K8s rolling-update-shape upgrade**, single-host,
agent-driven, operator-coordinated. Old version stays alive until
new is verified healthy from the operator's vantage point; only
then does the operator signal old to exit. **No version is ever
erased from disk.** Symlinks select the active binary.
### On-disk layout
```
/usr/bin/fleet-agent-v0.1.1 ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.2 ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.3 ← versioned binary, immutable
/usr/local/bin/fleet-agent → symlink to current versioned binary
```
- Versioned binaries are the source of truth. They live forever
(history-preserving, no GC). Disk use is bounded by humans
cleaning up explicitly, not by the upgrade procedure.
- The systemd unit installed by `FleetDeviceSetupScore` references
`/usr/local/bin/fleet-agent`. Symlink swap is the cutover
primitive, atomic on POSIX via `rename(2)` (`renameat2` is the Linux-specific variant).
- Naming convention: exact crate version string, `v<MAJOR>.<MINOR>.<PATCH>`,
no build metadata in the path. Build metadata lives in the agent's
reported version string but not in the file path (otherwise you
can't predict the path from a version pin).
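
The naming rule above means a version pin alone determines the binary path. A minimal sketch of that mapping in Python, purely illustrative and not the agent's own code: the helper names are hypothetical, and the `+` separator for build metadata is an assumption borrowed from the SemVer convention.

```python
import re

# Directory and file prefix come from the on-disk layout above.
VERSIONED_DIR = "/usr/bin"
# Exactly v<MAJOR>.<MINOR>.<PATCH>, nothing else.
_VERSION_RE = re.compile(r"v\d+\.\d+\.\d+")


def strip_build_metadata(reported: str) -> str:
    """Drop a '+<metadata>' suffix (assumed SemVer-style) from a reported version."""
    return reported.split("+", 1)[0]


def binary_path(version_pin: str) -> str:
    """Map a version pin such as 'v0.1.2' to its versioned binary path."""
    if not _VERSION_RE.fullmatch(version_pin):
        raise ValueError(f"not a v<MAJOR>.<MINOR>.<PATCH> pin: {version_pin!r}")
    return f"{VERSIONED_DIR}/fleet-agent-{version_pin}"
```

Rejecting anything that is not a bare `v<MAJOR>.<MINOR>.<PATCH>` keeps the path predictable from the pin, which is the point of leaving build metadata out of the file name.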
### State machine on the agent side
```
Running ──[operator publishes desired_version != current]──▶ Draining
   ▲                                                            │
   │                                                            ▼
   │                                                         Staging
   │                                                            │
   │                                                            ▼
   │                                                        Verifying
   │                                                            │
   │  ┌──────────────────────[smoke fails]──────────────────────┤
   │  │                                                         │
   │  [revert: symlink → previous,                              ▼
   │   stay at current]                                   Cutover-Ready
   │                                                            │
   │  [Cutover-Ready persists ≥ T_OPERATOR_OBSERVE              │
   │   until operator publishes stop_signal]                    │
   │                                                            ▼
   └──────────────────────────────────────────────────────── Stopping
                                                              (exit)
```
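
The diagram can also be read as a transition table. The following is a minimal sketch in Python, for illustration only and not the agent's implementation. The event names are hypothetical labels for the edges, and the `staging_failed` and `heartbeat_timeout` edges are taken from the "Failure modes and rollback" section rather than from the diagram itself.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    DRAINING = "draining"
    STAGING = "staging"
    VERIFYING = "verifying"
    CUTOVER_READY = "cutover-ready"
    STOPPING = "stopping"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.RUNNING, "desired_version_changed"): State.DRAINING,
    (State.DRAINING, "drained"): State.STAGING,
    (State.STAGING, "staged"): State.VERIFYING,
    (State.STAGING, "staging_failed"): State.RUNNING,
    (State.VERIFYING, "smoke_ok"): State.CUTOVER_READY,
    (State.VERIFYING, "smoke_failed"): State.RUNNING,
    (State.CUTOVER_READY, "heartbeat_timeout"): State.RUNNING,
    (State.CUTOVER_READY, "stop_signal"): State.STOPPING,
}


def step(state: State, event: str) -> State:
    """Return the next state; an event with no matching edge leaves the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that no edge leads from `CUTOVER_READY` to `STOPPING` except `stop_signal`: the old agent has no way to talk itself into exiting.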
States in detail:
- **Running** — normal reconcile loop.
- **Draining** — refuses to start new podman services for new
desired-state writes. In-flight reconciles complete and report
their final state to the operator. Existing services stay
running. Heartbeat continues. State is published as part of the
agent's heartbeat (`agent_state: "draining"`).
- **Staging** — fetch new versioned binary URL (signed,
hash-pinned), verify, place at `/usr/bin/fleet-agent-v<new>`.
Set file mode and ownership. No other state mutation.
- **Verifying** — invoke the staged binary with `--self-test`. New
binary parses its config, opens NATS connection, validates JWT,
prints version + "ok", exits 0. **No state mutation.** Catches
obvious breakage (missing dependency, wrong arch, corrupt
download, broken config-schema migration) before swap.
- **Cutover-Ready** — staged binary is healthy. Old agent updates
the symlink atomically:
```
ln -sfn /usr/bin/fleet-agent-v0.1.2 /usr/local/bin/fleet-agent.new
mv -T /usr/local/bin/fleet-agent.new /usr/local/bin/fleet-agent
```
The old agent then runs `systemctl start fleet-agent-v0.1.2.service` (a
parallel transient service, not `systemctl restart` of itself).
Both old and new are now running. New publishes its first
heartbeat with `version=v0.1.2`. Operator sees two heartbeats
per device for a brief window.
- **Stopping** — operator publishes a stop signal to the old
agent's NATS subject. Old agent receives, gracefully exits.
systemd's `Restart=on-failure` does *not* trigger because the
exit is `success` (rc=0, code-path-driven). New agent is now
the only one running. systemd unit is reconfigured to point at
the *current* symlink target on its next restart, but that's
cosmetic — the symlink already does the job.
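
The hash-pinned part of the Staging step can be sketched as follows. This is an illustrative Python sketch under stated assumptions: SHA-256 as the pin algorithm, a `.partial` temp suffix, and mode `0o755` are all choices made for the example, and signature verification is left out because the signing scheme is not yet decided.

```python
import hashlib
import os


def stage_binary(payload: bytes, pinned_sha256: str, dest: str) -> None:
    """Write a downloaded binary to `dest` only if it matches the pinned hash."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != pinned_sha256.lower():
        # Nothing is written on mismatch, so a bad download leaves no trace.
        raise ValueError(f"hash mismatch: expected {pinned_sha256}, got {actual}")
    tmp = dest + ".partial"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.chmod(tmp, 0o755)
    # The file appears at its final path only once it is complete and executable.
    os.rename(tmp, dest)
```

Writing to a temporary name and renaming at the end means a half-written file never sits at `/usr/bin/fleet-agent-v<new>`, so the Verifying step cannot pick up a truncated binary.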
### Operator-side coordination
The operator is the only source of truth for "what version should
this device run". Two new fields per device, two new subjects.
**New on `Device` CR / KV `device-info`:**
- `current_version` — what the agent is running right now.
Reported in heartbeat; reflected to the CR.
- `desired_version` — what the operator wants the agent to run.
Set by operator-side logic (default: latest published; eventually
canary / %-based).
**New NATS subjects (per-device, scoped by callout permissions):**
- `device-cmd.<device_id>.upgrade-stop` — operator → old agent.
Payload: `{"reason": "...", "deadline_ms": ...}`. Sent only after
operator has observed a heartbeat from the new version with
`current_version == desired_version` AND `agent_state == "running"`.
- `device-state.<device_id>.upgrade` — agent → operator. Status
events: `staging`, `verifying`, `cutover-ready`, `failed`, `done`.
Drives `Device.status.upgrade.{phase, last_error, ...}`.
The operator only emits `upgrade-stop` after it has independently
verified the new agent is up. **Old agent does not stop itself
based on its own observations.** This is the load-bearing
property: an operator that has not yet seen the new version's
heartbeat never sends the stop signal, so the old agent cannot exit
on a false positive. Single-source-of-truth handoff.
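
The gating rule for `upgrade-stop` is small enough to write down. A minimal Python sketch, assuming the heartbeat is decoded into a dict carrying the `current_version` and `agent_state` fields named above; the function names and the JSON encoding of the payload are illustrative assumptions.

```python
import json


def should_send_stop(heartbeat: dict, desired_version: str) -> bool:
    """True only when the heartbeat shows the new agent running the desired version."""
    return (
        heartbeat.get("current_version") == desired_version
        and heartbeat.get("agent_state") == "running"
    )


def stop_message(device_id: str, reason: str, deadline_ms: int) -> tuple[str, bytes]:
    """Build the (subject, payload) pair for the upgrade-stop command."""
    subject = f"device-cmd.{device_id}.upgrade-stop"
    payload = json.dumps({"reason": reason, "deadline_ms": deadline_ms}).encode()
    return subject, payload
```

A heartbeat that lacks either field fails the check, so a missing or malformed report keeps the old agent alive.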
### Failure modes and rollback
- **Staging fails (download / hash mismatch):** Agent returns to
`Running` on the current version. Reports `phase: "failed"`, `last_error`. Operator
sees the failure; can fix the artifact + retry by re-publishing
the same desired_version (any change to desired_version
re-triggers the state machine).
- **Verifying fails (smoke test rc != 0):** Agent returns to
`Running` on the current version. Reports failure. Staged binary stays on disk for
inspection. Operator can collect, debug, ship a fixed version.
- **Cutover-ready, but new agent never publishes a heartbeat
with the new version within T_HEARTBEAT_TIMEOUT (suggested
60s):** Old agent reverts the symlink, stops the parallel
systemd transient service, transitions back to Running with
the old version. Reports `failed`. Same recovery path.
- **Operator never sends stop signal (e.g., operator-side
outage):** Old agent stays in Cutover-Ready indefinitely. Both
agents are running; only the new one is publishing as the
active one (the old one's writes are gated on its state). This
is expensive (2× resource use) but safe — the operator is the
authoritative coordinator and any other behavior would risk
losing both agents at once.
- **Both agents alive but new agent crashes:** systemd's
`Restart=on-failure` on the new agent's transient unit retries.
If it can't come back, the operator never sends the stop signal,
the old agent stays Cutover-Ready, and a human investigates.
The fleet keeps working on the old version — the rollback is
implicit.
- **Operator publishes an older `desired_version`:** Reverse
rollout. Same mechanism, just with old/new swapped. The "new"
binary is older, but the procedure is identical. The fact that
no version is ever GC'd is what makes this work.
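
The heartbeat-timeout revert and the reverse rollout both reduce to repointing the symlink. A minimal Python sketch of that primitive, for illustration only: `new_agent_seen` is a hypothetical callable standing in for "wait up to T_HEARTBEAT_TIMEOUT for the new version's heartbeat", and stopping the parallel systemd unit is left out.

```python
import os


def point_symlink(link: str, target: str) -> None:
    """Atomically repoint `link` at `target` (temp symlink, then rename)."""
    tmp = link + ".new"
    if os.path.lexists(tmp):
        os.unlink(tmp)  # clear a leftover from an interrupted swap
    os.symlink(target, tmp)
    # rename(2) replaces the old symlink in one step: readers see the old
    # target or the new one, never a missing path.
    os.replace(tmp, link)


def cutover_with_revert(link: str, new_target: str, new_agent_seen) -> bool:
    """Swap to `new_target`; restore the previous target if the new agent never reports."""
    previous = os.readlink(link)
    point_symlink(link, new_target)
    if new_agent_seen():
        return True
    point_symlink(link, previous)
    return False
```

Because the previous binary is never deleted, the revert is the same call as the upgrade with the arguments swapped.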
### What this isn't
- **Not fleet-wide.** Per-device. Fleet-wide canary / %-based
rollouts are operator-side orchestration **on top of** this
primitive. The operator would publish `desired_version` to a
rolling subset of devices and watch heartbeats. Out of scope
for v0.2 — single-device upgrade is sufficient for a 100-Pi
fleet, which is more than the 12-month customer roadmap calls for.
- **Not blue/green of the entire OS.** We swap one userspace
binary. The OS, podman, the systemd unit text, the kernel — all
unchanged. Out of scope.
- **Not a package manager.** Versioned binaries land at fixed
paths because we control them. apt / dpkg / OSTree are
orthogonal and not in the loop.
## Rationale
- **No version ever erased.** Trivializes rollback (the previous
binary is a `ln -sfn` away). Simplifies the failure tree:
every "what if" branch resolves to "old still on disk". Disk
cost on aarch64-musl is ~5-10 MB per version. At 12 versions
/ year, that's 60-120 MB per year, or roughly 0.6-1.2 GB after a
decade of upgrades. Small compared to Pi storage.
- **Symlink swap as cutover.** POSIX-atomic. No daemon state.
Cheap to revert. Compatible with systemd unit references that
point at a stable path.
- **Old verifies new, then reports up.** This is the load-bearing
property: it places the verification at the agent (which has
the only complete view of its own runtime state) but the
*commitment* at the operator (which is the only thing safe to
trust as the cluster-wide source of truth). Either side alone
can fail safe; only consensus advances the upgrade.
- **Operator-driven stop, not agent self-stop.** A self-stopping
agent could decide to exit before the operator agrees, leaving
the cluster blind. Forcing the stop through the operator means
any disagreement keeps the old agent alive — which is the
desired bias.
- **Drains in-flight work first.** Mirrors K8s pod-shutdown
semantics. A workload reconciling at the moment of swap
finishes its current step, reports state, then queues. New
agent picks up the queue once it's the active version. No
observable flap on the workload.
- **Heartbeat-driven version reporting.** The agent already
publishes heartbeats; adding the version field is one line.
No new transport.
## Consequences
**Pros:**
- Bounded blast radius per upgrade (one device).
- Rollback is the same code path as upgrade — no special-case
bug class.
- Operator's view is monotonic: heartbeats with versions are
immutable history; there's no "did the upgrade really happen"
state.
- Old agent never decides to exit on its own. The most dangerous
failure mode in self-upgrading software (premature exit) is
designed out.
- Compatible with eventual fleet-wide rollouts (canary, %-based)
which become operator-side orchestration on top of this
primitive.
**Cons:**
- Briefly runs two agents per device (Cutover-Ready window).
Memory and connection-count both ~2× during that window.
Acceptable for the upgrade duration (typically <60s).
- Requires reliable connectivity between agent and operator to
complete the handoff. A device whose NATS link fails mid-upgrade
stays in Cutover-Ready until link recovers.
- Disk grows monotonically with version count. Bounded by human
cleanup. We do not GC.
- New NATS subjects, new heartbeat fields, new `Device.status`
fields. Schema bump that operators-in-the-field need to handle
(the operator must understand "old agent reporting no version
field" as `version: unknown`, not crash).
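
The backward-compatibility requirement in the last point can be sketched as a tolerant read on the operator side. This is an illustrative Python sketch that assumes heartbeats are JSON objects and that the version travels in a `current_version` field; the function name is hypothetical.

```python
import json


def reported_version(raw_heartbeat: bytes) -> str:
    """Return the agent's version, or 'unknown' for heartbeats that predate the field."""
    heartbeat = json.loads(raw_heartbeat)
    version = heartbeat.get("current_version")
    # Missing, null, empty, or non-string values all degrade to 'unknown'.
    return version if isinstance(version, str) and version else "unknown"
```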
## Alternatives considered
1. **OS-package upgrade (apt / dpkg / OSTree).** *Pros:* zero
custom code, standard toolchain, GPG-signed.
*Cons:* Loses the "agent verifies the new agent before swap"
property. apt's restart hook flips the symlink and `systemctl
restart`s; if the new binary is broken, the device is bricked
until human intervention. Doesn't drain in-flight work. Doesn't
know about NATS-managed pause states. Couples the upgrade
schedule to the distro's repo, not to the cluster operator's
intent. Rejected.
2. **Pull-from-OCI-registry on each agent restart.** *Pros:* same
primitive as podman / kube node-image-rotation.
*Cons:* Coupling to a registry the device must reach — many
customer fleets are on private subnets without registry
access. Would mean shipping a registry mirror per fleet. Adds
a dependency for a problem we can solve with a signed binary
on a CDN.
3. **Two systemd units, blue/green at the unit level.**
`fleet-agent-v0.1.1.service` and `fleet-agent-v0.1.2.service`,
ratchet via systemctl enable/disable. *Pros:* no symlink dance.
*Cons:* duplicates a lot of unit-file content; harder to
reason about what the "active" unit is (you have to ask
systemd, not `readlink`); doesn't compose well with the
`ExecStart=/usr/local/bin/fleet-agent` line we already ship.
Symlink swap is the lighter primitive.
4. **Self-stopping agent (no operator stop signal).** New agent
tells old agent "I'm up, you can go" via NATS. *Pros:* one
fewer subject.
*Cons:* The new agent is also the agent we're least sure of
— putting it in charge of the old one's lifecycle inverts the
trust model. If the new agent has a bug that causes it to
announce ready prematurely, the cluster goes blind. The
operator path is the conservative choice.
5. **Operator-pushed binary (instead of agent-pulled).** The
operator sshes / executes a one-off command per device.
*Pros:* operator controls timing precisely.
*Cons:* Reintroduces SSH as a control plane (we just spent a
month getting rid of it for the enrollment flow). Doesn't
scale to fleets where most devices are NATted away from the
operator.
## Implementation milestones
(For a future implementer; not committed to a date here. Lives
in the v0.2+ backlog.)
1. **M1** — Versioned binary layout: builds produce
`fleet-agent-v<version>` artifacts; install Score writes them
to `/usr/bin/fleet-agent-v<version>` + creates
`/usr/local/bin/fleet-agent` symlink. Existing tests cover the
rest.
2. **M2** — Version field in heartbeat + `Device.status.current_version`
reflection on the operator side. No upgrade behavior yet.
3. **M3** — `desired_version` field on the device-info KV +
operator setter. No agent-side action yet.
4. **M4** — Agent state machine, end to end, gated by a feature
flag. Operator publishes desired_version → agent does the
dance → operator sends stop signal → done. Includes failure-mode
tests (download fail, smoke fail, heartbeat-timeout
revert).
5. **M5** — Remove the feature flag. Default-on.
6. **M6** — Operator-side rollout strategies (canary, %-based) —
only after M5 has been in production for 30 days against a
real fleet.
## Additional Notes
- Binary signing + signature verification is in scope for the
`Staging` step, but the choice of signing scheme (cosign / Rekor
/ minisign) is deferred until the M1 implementation. Whatever
we pick must work on aarch64-musl Pi devices without
additional system dependencies.
- The N-versions-on-disk policy is "all of them, forever" per
the constraint above. If disk pressure becomes real on some
customer fleet, a manual GC tool can prune `/usr/bin/fleet-agent-v*`
by date — never automatic, never as part of the upgrade
itself.
- See JG's *Pour l'amour des compilateurs* talk (Botpress
Meetup, 2026-04-30) for the framing applied here:
cardinality-matched types and operator-as-coordinator are the
same idea, applied to one function and to one platform.