Two design documents framing the next push. `ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push. Replaces the open-ended chapter structure of v0_1_plan.md for the period between the walking-skeleton merge and v0.1.0 in production. Focus is locking the fleet module's public API surface so the inevitable physical refactor (out of `harmony/modules/fleet/`, into `fleet/harmony-fleet/`) is mechanical when we get to it. Anchored in the principle from JG's *Pour l'amour des compilateurs* talk: design the brick before moving the brick. `docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure. K8s rolling-update shape applied to one host: drain in-flight work, stage versioned binary alongside old, smoke-test, atomic symlink swap, both agents alive briefly, operator verifies new agent's heartbeat then sends explicit stop signal to old, old exits cleanly. No version is ever erased — N-history on disk is the rollback target. Operator-driven cutover (not self-stopping) so the most-trusted side decides the handoff. Implementation deferred to post-v0.1 backlog; spec exists so anyone can build it without reinventing the design. ADR README index updated.
# Architecture Decision Record: Fleet Agent Upgrade Procedure

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-05-06

Last Updated Date: 2026-05-06

## Status

Accepted (design); implementation deferred. See the roadmap in
`ROADMAP/fleet_platform/v0_2_plan.md`.

## Context

The v0.1 fleet agent ships as a single static aarch64-musl binary
sitting at `/usr/local/bin/fleet-agent`, started by a systemd
unit dropped at install time by `FleetDeviceSetupScore`. Every
managed device runs one. Today the only "upgrade procedure" is
`scp` + `systemctl restart`: fine for the bring-up phase, not
fine once paying customers run real workloads on the fleet.

Without a defined upgrade story we cannot ship a v0.1 agent into
the field. The contract a customer needs is:

1. New agent versions can be rolled out without operator-side
   manual intervention per device.
2. Workloads currently reconciled on the device do not flap
   (start/stop/start) during the upgrade.
3. A failed new version automatically reverts to the last
   known-good version, on its own, without paging anyone.
4. The operator (the central one in the cluster, not the human)
   sees what version each device is running, can drive a target
   version per device, and observes upgrade progress.

The agent itself is the only process on-device with full context
on what's reconciling and what's healthy. Anything centralized
(Ansible-pushed, OS-package-managed) doesn't have that signal.
The agent must be the one driving its own swap, with the
operator coordinating but not executing.

## Decision

We adopt a **K8s rolling-update-shaped upgrade**: single-host,
agent-driven, operator-coordinated. The old version stays alive until
the new one is verified healthy from the operator's vantage point; only
then does the operator signal the old one to exit. **No version is ever
erased from disk.** Symlinks select the active binary.

### On-disk layout

```
/usr/bin/fleet-agent-v0.1.1     ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.2     ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.3     ← versioned binary, immutable
/usr/local/bin/fleet-agent      → symlink to current versioned binary
```

- Versioned binaries are the source of truth. They live forever
  (history-preserving, no GC). Disk use is bounded by humans
  cleaning up explicitly, not by the upgrade procedure.
- The systemd unit installed by `FleetDeviceSetupScore` references
  `/usr/local/bin/fleet-agent`. Symlink swap is the cutover
  primitive: create the new link under a temporary name, then
  rename it over the old one. The rename is atomic on POSIX
  (`rename(2)`; `renameat2` is a Linux-specific extension).
- Naming convention: exact crate version string, `v<MAJOR>.<MINOR>.<PATCH>`,
  no build metadata in the path. Build metadata lives in the agent's
  reported version string but not in the file path (otherwise you
  can't predict the path from a version pin).

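The two layout rules above (path predictable from a version pin, cutover by symlink rename) can be sketched as follows. This is an illustrative Python sketch, not the agent's code: the function names `binary_path` and `swap_symlink`, the `+...` build-metadata syntax, and the `.new` temporary suffix are assumptions made for the example.

```python
import os
import re
from pathlib import Path

# Assumed shape of "v<MAJOR>.<MINOR>.<PATCH>", optionally followed by
# "+build-metadata" that the agent reports but that never reaches the path.
_VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:\+[0-9A-Za-z.-]+)?$")


def binary_path(version: str, bin_dir: str = "/usr/bin") -> Path:
    """Map a version pin to its on-disk path, dropping build metadata."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a v<MAJOR>.<MINOR>.<PATCH> version: {version!r}")
    major, minor, patch = match.groups()
    return Path(bin_dir) / f"fleet-agent-v{major}.{minor}.{patch}"


def swap_symlink(target: Path, link: Path) -> None:
    """Point `link` at `target`: create a temporary symlink, rename it over."""
    tmp = link.with_name(link.name + ".new")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # leftover from an interrupted earlier swap
    os.symlink(target, tmp)
    # os.replace is a rename: a reader resolves either the old link or the
    # new one, never a missing path.
    os.replace(tmp, link)
```

Reverting is the same call with the previous version's path as `target`, which is why keeping every versioned binary on disk makes rollback cheap.
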
### State machine on the agent side

```
Running ──[operator publishes desired_version != current]──▶ Draining
   ▲                                                            │
   │                                                            │
   │                                                            ▼
   │                                                         Staging
   │                                                            │
   │                                                            ▼
   │                                                        Verifying
   │                                                            │
   │                                                            ▼
   │     ┌──────────────────────────────[smoke fails]───────────┤
   │     │                                                      │
   │  [revert: symlink → previous,                              ▼
   │   stay at current]                                   Cutover-Ready
   │                                                            │
   │     [Cutover-Ready persists ≥ T_OPERATOR_OBSERVE           │
   │      until operator publishes stop_signal]                 │
   │                                                            ▼
   └─────────────────────────────────────────────────────── Stopping
                                                                │
                                                                ▼
                                                             (exit)
```

States in detail:

- **Running**: normal reconcile loop.
- **Draining**: refuses to start new podman services for new
  desired-state writes. In-flight reconciles complete and report
  their final state to the operator. Existing services stay
  running. Heartbeat continues. State is published as part of the
  agent's heartbeat (`agent_state: "draining"`).
- **Staging**: fetch the new versioned binary from its URL (signed,
  hash-pinned), verify it, place it at `/usr/bin/fleet-agent-v<new>`.
  Set mode and ownership. No other state mutation.
- **Verifying**: invoke the staged binary with `--self-test`. The new
  binary parses its config, opens a NATS connection, validates its JWT,
  prints version + "ok", exits 0. **No state mutation.** Catches
  obvious breakage (missing dependency, wrong arch, corrupt
  download, broken config-schema migration) before the swap.
- **Cutover-Ready**: staged binary is healthy. Old agent updates
  the symlink atomically:
  ```
  ln -sfn /usr/bin/fleet-agent-v0.1.2 /usr/local/bin/fleet-agent.new
  mv -T /usr/local/bin/fleet-agent.new /usr/local/bin/fleet-agent
  ```
  The old agent then runs `systemctl start fleet-agent-v0.1.2.service` (a
  parallel transient service, not `systemctl restart` of itself).
  Both old and new are now running. The new agent publishes its first
  heartbeat with `version=v0.1.2`. The operator sees two heartbeats
  per device for a brief window.
- **Stopping**: operator publishes a stop signal to the old
  agent's NATS subject. The old agent receives it and exits gracefully.
  systemd's `Restart=on-failure` does *not* trigger because the
  exit is `success` (rc=0, code-path-driven). The new agent is now
  the only one running. The systemd unit is reconfigured to point at
  the *current* symlink target on its next restart, but that's
  cosmetic; the symlink already does the job.

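The states above, together with the failure edges this ADR describes (staging failure, self-test failure, heartbeat timeout), can be written down as a transition table. The sketch below is illustrative Python rather than the agent's implementation, and the event names are hypothetical labels chosen for the example.

```python
from enum import Enum


class AgentState(Enum):
    RUNNING = "running"
    DRAINING = "draining"
    STAGING = "staging"
    VERIFYING = "verifying"
    CUTOVER_READY = "cutover-ready"
    STOPPING = "stopping"


# (state, event) -> next state. Event names are hypothetical labels for
# the triggers described in the text.
TRANSITIONS = {
    (AgentState.RUNNING, "desired_version_changed"): AgentState.DRAINING,
    (AgentState.DRAINING, "in_flight_done"): AgentState.STAGING,
    (AgentState.STAGING, "binary_staged"): AgentState.VERIFYING,
    (AgentState.STAGING, "staging_failed"): AgentState.RUNNING,
    (AgentState.VERIFYING, "self_test_ok"): AgentState.CUTOVER_READY,
    (AgentState.VERIFYING, "self_test_failed"): AgentState.RUNNING,
    (AgentState.CUTOVER_READY, "heartbeat_timeout"): AgentState.RUNNING,
    (AgentState.CUTOVER_READY, "stop_signal"): AgentState.STOPPING,
}


def step(state: AgentState, event: str) -> AgentState:
    """Return the next state; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that the only edge out of `CUTOVER_READY` toward `STOPPING` is `stop_signal`: nothing the old agent observes on its own moves it to exit.
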
### Operator-side coordination

The operator is the only source of truth for "what version should
this device run". Two new fields per device, two new subjects.

**New on `Device` CR / KV `device-info`:**
- `current_version`: what the agent is running right now.
  Reported in heartbeat; reflected to the CR.
- `desired_version`: what the operator wants the agent to run.
  Set by operator-side logic (default: latest published; eventually
  canary / %-based).

**New NATS subjects (per-device, scoped by callout permissions):**
- `device-cmd.<device_id>.upgrade-stop`: operator → old agent.
  Payload: `{"reason": "...", "deadline_ms": ...}`. Sent only after
  the operator has observed a heartbeat from the new version with
  `current_version == desired_version` AND `agent_state == "running"`.
- `device-state.<device_id>.upgrade`: agent → operator. Status
  events: `staging`, `verifying`, `cutover-ready`, `failed`, `done`.
  Drives `Device.status.upgrade.{phase, last_error, ...}`.

The operator only emits `upgrade-stop` after it has independently
verified the new agent is up. **Old agent does not stop itself
based on its own observations.** This is the load-bearing
property: an operator that disagrees with the upgrade
("haven't seen new version's heartbeat") never sends
the stop signal. Single-source-of-truth handoff.

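A minimal sketch of the operator-side gate, in illustrative Python. It assumes the heartbeat is decoded into a dict carrying `current_version` and `agent_state` keys (the exact heartbeat schema is not fixed by this ADR), and the helper names `should_send_stop` and `stop_message` are hypothetical.

```python
import json


def should_send_stop(heartbeat: dict, desired_version: str) -> bool:
    """True only when the new agent reports the desired version and is running."""
    return (
        heartbeat.get("current_version") == desired_version
        and heartbeat.get("agent_state") == "running"
    )


def stop_message(device_id: str, reason: str, deadline_ms: int):
    """Build the (subject, payload) pair for the upgrade-stop command."""
    subject = f"device-cmd.{device_id}.upgrade-stop"
    payload = json.dumps({"reason": reason, "deadline_ms": deadline_ms}).encode()
    return subject, payload
```

A heartbeat that lacks either field fails the gate, so an unknown or half-upgraded agent never causes the old one to be stopped.
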
### Failure modes and rollback

- **Staging fails (download / hash mismatch):** Agent returns to
  `Running`. Reports `phase: "failed"`, `last_error`. Operator
  sees the failure; can fix the artifact and retry by re-publishing
  the same `desired_version` (any new write to `desired_version`
  re-triggers the state machine).
- **Verifying fails (smoke test rc != 0):** Agent returns to
  `Running`. Reports failure. Staged binary stays on disk for
  inspection. Operator can collect, debug, ship a fixed version.
- **Cutover-Ready, but new agent never publishes a heartbeat
  with the new version within T_HEARTBEAT_TIMEOUT (suggested
  60s):** Old agent reverts the symlink, stops the parallel
  systemd transient service, transitions back to Running with
  the old version. Reports `failed`. Same recovery path.
- **Operator never sends stop signal (e.g., operator-side
  outage):** Old agent stays in Cutover-Ready indefinitely. Both
  agents are running; only the new one is publishing as the
  active one (the old one's writes are gated on its state). This
  is expensive (2× resource use) but safe: the operator is the
  authoritative coordinator and any other behavior would risk
  losing both agents at once.
- **Both agents alive but new agent crashes:** systemd's
  `Restart=on-failure` on the new agent's transient unit retries.
  If it can't come back, the operator never sends the stop signal,
  the old agent stays Cutover-Ready, and a human investigates.
  The fleet keeps working on the old version; the rollback is
  implicit.
- **Operator publishes an older `desired_version`:** Reverse
  rollout. Same mechanism, just with old/new swapped. The "new"
  binary is older, but the procedure is identical. The fact that
  no version is ever GC'd is what makes this work.

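The first failure mode (hash mismatch during staging) might be handled along these lines. This is an illustrative Python sketch under stated assumptions: SHA-256 as the pinned hash, a hypothetical `staging_status` helper, and a status dict shaped like `Device.status.upgrade`. Signature verification is left out because the signing scheme is not yet chosen.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large binary is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def staging_status(path: str, pinned_sha256: str) -> dict:
    """Build the upgrade status event for a downloaded artifact."""
    try:
        actual = sha256_of(path)
    except OSError as exc:
        return {"phase": "failed", "last_error": f"download unreadable: {exc}"}
    if actual != pinned_sha256.lower():
        return {
            "phase": "failed",
            "last_error": f"hash mismatch: expected {pinned_sha256}, got {actual}",
        }
    # Hash matches the pin: the next step is the self-test.
    return {"phase": "verifying", "last_error": None}
```

On `failed` the agent would go back to its normal loop and leave the symlink untouched, matching the first bullet above.
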
### What this isn't

- **Not fleet-wide.** Per-device. Fleet-wide canary / %-based
  rollouts are operator-side orchestration **on top of** this
  primitive. The operator would publish `desired_version` to a
  rolling subset of devices and watch heartbeats. Out of scope
  for v0.2: single-device upgrade is sufficient for a 100-Pi
  fleet, which is more than the 12-month customer roadmap calls for.
- **Not blue/green of the entire OS.** We swap one userspace
  binary. The OS, podman, the systemd unit text, the kernel: all
  unchanged. Out of scope.
- **Not a package manager.** Versioned binaries land at fixed
  paths because we control them. apt / dpkg / OSTree are
  orthogonal and not in the loop.

## Rationale

- **No version ever erased.** Trivializes rollback (the previous
  binary is a `ln -sfn` away). Simplifies the failure tree:
  every "what if" branch resolves to "old still on disk". Disk
  cost on aarch64-musl is ~5–10 MB per version; at 12 versions
  / year, that's roughly 60–120 MB per year, or about 0.6–1.2 GB
  after a decade of upgrades. Still small compared to typical
  Pi storage.
- **Symlink swap as cutover.** POSIX-atomic. No daemon state.
  Cheap to revert. Compatible with systemd unit references that
  point at a stable path.
- **Old verifies new, then reports up.** This is the load-bearing
  property: it places the verification at the agent (which has
  the only complete view of its own runtime state) but the
  *commitment* at the operator (which is the only thing safe to
  trust as the cluster-wide source of truth). Either side alone
  can fail safe; only consensus advances the upgrade.
- **Operator-driven stop, not agent self-stop.** A self-stopping
  agent could decide to exit before the operator agrees, leaving
  the cluster blind. Forcing the stop through the operator means
  any disagreement keeps the old agent alive, which is the
  desired bias.
- **Drains in-flight work first.** Mirrors K8s pod-shutdown
  semantics. A workload reconciling at the moment of swap
  finishes its current step, reports state, then queues. The new
  agent picks up the queue once it's the active version. No
  observable flap on the workload.
- **Heartbeat-driven version reporting.** The agent already
  publishes heartbeats; adding the version field is one line.
  No new transport.

## Consequences

**Pros:**

- Bounded blast radius per upgrade (one device).
- Rollback is the same code path as upgrade: no special-case
  bug class.
- Operator's view is monotonic: heartbeats with versions are
  immutable history; there's no "did the upgrade really happen"
  state.
- Old agent never decides to exit on its own. The most dangerous
  failure mode in self-upgrading software (premature exit) is
  designed out.
- Compatible with eventual fleet-wide rollouts (canary, %-based)
  which become operator-side orchestration on top of this
  primitive.

**Cons:**

- Briefly runs two agents per device (Cutover-Ready window).
  Memory and connection-count both ~2× during that window.
  Acceptable for the upgrade duration (typically <60s).
- Requires reliable connectivity between agent and operator to
  complete the handoff. A device whose NATS link fails
  mid-upgrade stays in Cutover-Ready until the link recovers.
- Disk grows monotonically with version count. Bounded by human
  cleanup. We do not GC.
- New NATS subjects, new heartbeat fields, new `Device.status`
  fields. This is a schema bump that operators in the field need to
  handle: the operator must treat "old agent reporting no version
  field" as `version: unknown` rather than crash.

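The last point, tolerating heartbeats from agents that predate the version field, could look like this on the operator side. Illustrative Python only: the `Heartbeat` type, the `device_id` key, and defaulting a missing `agent_state` to `"unknown"` are assumptions for the example, not the actual heartbeat schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    device_id: str
    current_version: str
    agent_state: str


def parse_heartbeat(raw: bytes) -> Heartbeat:
    """Decode a heartbeat, mapping absent upgrade fields to "unknown"."""
    data = json.loads(raw)
    return Heartbeat(
        device_id=data["device_id"],
        # Pre-upgrade agents send no version field at all.
        current_version=data.get("current_version") or "unknown",
        agent_state=data.get("agent_state") or "unknown",
    )
```

Because `"unknown"` never equals a real `desired_version`, such a device simply shows up as not yet upgraded.
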
## Alternatives considered

1. **OS-package upgrade (apt / dpkg / OSTree).** *Pros:* zero
   custom code, standard toolchain, GPG-signed.
   *Cons:* Loses the "agent verifies the new agent before swap"
   property. apt's restart hook flips the symlink and `systemctl
   restart`s; if the new binary is broken, the device is bricked
   until human intervention. Doesn't drain in-flight work. Doesn't
   know about NATS-managed pause states. Couples the upgrade
   schedule to the distro's repo, not to the cluster operator's
   intent. Rejected.

2. **Pull-from-OCI-registry on each agent restart.** *Pros:* same
   primitive as podman / kube node-image-rotation.
   *Cons:* Coupling to a registry the device must reach; many
   customer fleets are on private subnets without registry
   access. Would mean shipping a registry mirror per fleet. Adds
   a dependency for a problem we can solve with a signed binary
   on a CDN.

3. **Two systemd units, blue/green at the unit level.**
   `fleet-agent-v0.1.1.service` and `fleet-agent-v0.1.2.service`,
   ratchet via systemctl enable/disable. *Pros:* no symlink dance.
   *Cons:* duplicates a lot of unit-file content; harder to
   reason about what the "active" unit is (you have to ask
   systemd, not `readlink`); doesn't compose well with the
   `ExecStart=/usr/local/bin/fleet-agent` line we already ship.
   Symlink swap is the lighter primitive.

4. **Self-stopping agent (no operator stop signal).** New agent
   tells old agent "I'm up, you can go" via NATS. *Pros:* one
   fewer subject.
   *Cons:* The new agent is also the agent we're least sure of;
   putting it in charge of the old one's lifecycle inverts the
   trust model. If the new agent has a bug that causes it to
   announce ready prematurely, the cluster goes blind. The
   operator path is the conservative choice.

5. **Operator-pushed binary (instead of agent-pulled).** The
   operator sshes / executes a one-off command per device.
   *Pros:* operator controls timing precisely.
   *Cons:* Reintroduces SSH as a control plane (we just spent a
   month getting rid of it for the enrollment flow). Doesn't
   scale to fleets where most devices are NATted away from the
   operator.

## Implementation milestones

(For a future implementer; not committed to a date here. Lives
in the v0.2+ backlog.)

1. **M1**: Versioned binary layout: builds produce
   `fleet-agent-v<version>` artifacts; install Score writes them
   to `/usr/bin/fleet-agent-v<version>` + creates
   `/usr/local/bin/fleet-agent` symlink. Existing tests cover the
   rest.
2. **M2**: Version field in heartbeat + `Device.status.current_version`
   reflection on the operator side. No upgrade behavior yet.
3. **M3**: `desired_version` field on the device-info KV +
   operator setter. No agent-side action yet.
4. **M4**: Agent state machine, end to end, gated by a feature
   flag. Operator publishes desired_version → agent does the
   dance → operator sends stop signal → done. Includes
   failure-mode tests (download fail, smoke fail, heartbeat-timeout
   revert).
5. **M5**: Remove the feature flag. Default-on.
6. **M6**: Operator-side rollout strategies (canary, %-based),
   only after M5 has been in production for 30 days against a
   real fleet.

## Additional Notes

- Binary signing + signature verification is in scope for the
  `Staging` step, but the choice of signing scheme (cosign / Rekor
  / minisign) is deferred until the M1 implementation. Whatever
  we pick must work on aarch64-musl Pi devices without
  additional system dependencies.
- The N-versions-on-disk policy is "all of them, forever" per
  the constraint above. If disk pressure becomes real on some
  customer fleet, a manual GC tool can prune `/usr/bin/fleet-agent-v*`
  by date: never automatic, never as part of the upgrade
  itself.
- See JG's *Pour l'amour des compilateurs* talk (Botpress
  Meetup, 2026-04-30) for the framing applied here:
  cardinality-matched types and operator-as-coordinator are the
  same idea, applied to one function and to one platform.
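
The manual GC tool mentioned above does not exist yet. As a rough idea of its selection step, here is an illustrative Python sketch; the function name `prune_candidates`, the use of file modification time as "date", and the dry-run-only behavior are all assumptions for the example.

```python
import os
import time
from pathlib import Path


def prune_candidates(bin_dir, active_link, older_than_days, now=None):
    """List versioned binaries older than the cutoff, never the active one.

    Dry run only: returns paths and deletes nothing.
    """
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    active = Path(os.path.realpath(active_link))
    candidates = []
    for path in sorted(Path(bin_dir).glob("fleet-agent-v*")):
        if Path(os.path.realpath(path)) == active:
            continue  # the symlink target is never a candidate
        if path.stat().st_mtime < cutoff:
            candidates.append(path)
    return candidates
```

Returning a list instead of deleting keeps the human in the loop, in line with the "never automatic" rule.
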