harmony/docs/adr/022-fleet-agent-upgrade.md

# Architecture Decision Record: Fleet Agent Upgrade Procedure

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-05-06

Last Updated Date: 2026-05-06

## Status

Accepted (design); implementation deferred — see roadmap
`ROADMAP/fleet_platform/v0_2_plan.md`.

## Context

The v0.1 fleet agent ships as a single static aarch64-musl binary
sitting at `/usr/local/bin/fleet-agent`, started by a systemd
unit dropped at install time by `FleetDeviceSetupScore`. Every
managed device runs one. Today the only "upgrade procedure" is
`scp` + `systemctl restart` — fine for the bring-up phase, not
fine once paying customers run real workloads on the fleet.

Without a defined upgrade story we cannot ship a v0.1 agent into
the field. The contract a customer needs is:

1. New agent versions can be rolled out without operator-side
   manual intervention per device.
2. Workloads currently reconciled on the device do not flap
   (start/stop/start) during the upgrade.
3. A failed new version automatically reverts to the last
   known-good version, on its own, without paging anyone.
4. The operator (the central one in the cluster, not the human)
   sees what version each device is running, can drive a target
   version per device, and observes upgrade progress.

The agent itself is the only process on-device with full context
on what's reconciling and what's healthy. Anything centralized
(Ansible-pushed, OS-package-managed) doesn't have that signal.
The agent must be the one driving its own swap, with the
operator coordinating but not executing.

## Decision

We adopt a **K8s rolling-update-shaped upgrade**: single-host,
agent-driven, operator-coordinated. Old version stays alive until
new is verified healthy from the operator's vantage point; only
then does the operator signal old to exit. **No version is ever
erased from disk.** Symlinks select the active binary.

### On-disk layout

```
/usr/bin/fleet-agent-v0.1.1            ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.2            ← versioned binary, immutable
/usr/bin/fleet-agent-v0.1.3            ← versioned binary, immutable
/usr/local/bin/fleet-agent             → symlink to current versioned binary
```

- Versioned binaries are the source of truth. They live forever
  (history-preserving, no GC). Disk use is bounded by humans
  cleaning up explicitly, not by the upgrade procedure.
- The systemd unit installed by `FleetDeviceSetupScore` references
  `/usr/local/bin/fleet-agent`. Symlink swap is the cutover
  primitive: creating a new symlink and renaming it over the old
  one is atomic on POSIX (`rename(2)`).
- Naming convention: exact crate version string, `v<MAJOR>.<MINOR>.<PATCH>`,
  no build metadata in the path. Build metadata lives in the agent's
  reported version string but not in the file path (otherwise you
  can't predict the path from a version pin).
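
The naming convention above can be sketched as a small path-derivation helper. This is a minimal illustration, not the agent's actual code: the function name `versioned_binary_path` and the constant `VERSIONED_DIR` are hypothetical, and the sketch assumes that anything other than a plain `MAJOR.MINOR.PATCH` core (for example a pre-release suffix) is rejected rather than mapped to a path.

```rust
use std::path::PathBuf;

/// Directory holding the immutable versioned binaries (hypothetical constant,
/// value taken from the on-disk layout in this ADR).
const VERSIONED_DIR: &str = "/usr/bin";

/// Maps a version pin such as "0.1.2", "v0.1.2" or "0.1.2+build.7" to
/// "/usr/bin/fleet-agent-v0.1.2". Returns None unless the core is exactly
/// three dot-separated numeric components.
fn versioned_binary_path(version: &str) -> Option<PathBuf> {
    // Build metadata (everything after '+') is part of the reported version
    // string but never part of the file path.
    let core = version.split('+').next()?;
    let core = core.strip_prefix('v').unwrap_or(core);

    let parts: Vec<&str> = core.split('.').collect();
    let numeric = |p: &&str| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit());
    if parts.len() != 3 || !parts.iter().all(numeric) {
        return None;
    }
    Some(PathBuf::from(format!("{}/fleet-agent-v{}", VERSIONED_DIR, core)))
}
```

Because the mapping ignores build metadata, two builds of the same crate version resolve to the same path, which is what makes the path predictable from a version pin.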

### State machine on the agent side

```
Running ──[operator publishes desired_version != current]──▶ Draining
   ▲                                                            │
   │                                                            │
   │                                                            ▼
   │                                                          Staging
   │                                                            │
   │                                                            ▼
   │                                                         Verifying
   │                                                            │
   │                                                            ▼
   │       ┌──────────────────────────────[smoke fails]────────┤
   │       │                                                   │
   │   [revert: symlink → previous,                            ▼
   │    stay at current]                                  Cutover-Ready
   │                                                            │
   │   [Cutover-Ready persists ≥ T_OPERATOR_OBSERVE              │
   │    until operator publishes stop_signal]                   │
   │                                                            ▼
   └────────────────────────────────────────────────────── Stopping
                                                                │
                                                                ▼
                                                              (exit)
```

States in detail:

- **Running** — normal reconcile loop.
- **Draining** — refuses to start new podman services for new
  desired-state writes. In-flight reconciles complete and report
  their final state to the operator. Existing services stay
  running. Heartbeat continues. State is published as part of the
  agent's heartbeat (`agent_state: "draining"`).
- **Staging** — fetch new versioned binary URL (signed,
  hash-pinned), verify, place at `/usr/bin/fleet-agent-v<new>`.
  Set file mode and ownership. No other state mutation.
- **Verifying** — invoke the staged binary with `--self-test`. New
  binary parses its config, opens NATS connection, validates JWT,
  prints version + "ok", exits 0. **No state mutation.** Catches
  obvious breakage (missing dependency, wrong arch, corrupt
  download, broken config-schema migration) before swap.
- **Cutover-Ready** — staged binary is healthy. Old agent updates
  the symlink atomically:
  ```
  ln -sfn /usr/bin/fleet-agent-v0.1.2 /usr/local/bin/fleet-agent.new
  mv -T /usr/local/bin/fleet-agent.new /usr/local/bin/fleet-agent
  ```
  Old agent then `systemctl start fleet-agent-v0.1.2.service` (a
  parallel transient service, not `systemctl restart` of itself).
  Both old and new are now running. New publishes its first
  heartbeat with `version=v0.1.2`. Operator sees two heartbeats
  per device for a brief window.
- **Stopping** — operator publishes a stop signal to the old
  agent's NATS subject. Old agent receives, gracefully exits.
  systemd's `Restart=on-failure` does *not* trigger because the
  exit is `success` (rc=0, code-path-driven). New agent is now
  the only one running. systemd unit is reconfigured to point at
  the *current* symlink target on its next restart, but that's
  cosmetic — the symlink already does the job.
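
The states above can be read as a pure transition function from (state, event) to next state. The following Rust sketch is one possible encoding under stated assumptions: the event names (`InFlightDrained`, `StagingOk`, and so on) are hypothetical labels for the triggers described in this ADR, and `Exited` is added only to make the terminal step explicit.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentState {
    Running,
    Draining,
    Staging,
    Verifying,
    CutoverReady,
    Stopping,
    Exited,
}

/// Hypothetical event names; the ADR describes the triggers, not these labels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    DesiredVersionChanged, // operator published desired_version != current
    InFlightDrained,       // in-flight reconciles finished and reported
    StagingOk,
    StagingFailed,
    SelfTestOk,
    SelfTestFailed,
    HeartbeatTimeout, // new agent silent for T_HEARTBEAT_TIMEOUT
    StopSignal,       // operator published upgrade-stop
    ExitComplete,
}

/// Returns the next state. Events that are not valid in the current state
/// leave it unchanged, so a stray stop signal cannot stop a Running agent.
fn next_state(state: AgentState, event: Event) -> AgentState {
    use AgentState::*;
    use Event::*;
    match (state, event) {
        (Running, DesiredVersionChanged) => Draining,
        (Draining, InFlightDrained) => Staging,
        (Staging, StagingOk) => Verifying,
        (Verifying, SelfTestOk) => CutoverReady,
        // Every failure before the operator's stop signal falls back to the
        // old version, which is still running.
        (Staging, StagingFailed)
        | (Verifying, SelfTestFailed)
        | (CutoverReady, HeartbeatTimeout) => Running,
        (CutoverReady, StopSignal) => Stopping,
        (Stopping, ExitComplete) => Exited,
        (unchanged, _) => unchanged,
    }
}
```

Keeping the function free of I/O means the whole table can be unit-tested without NATS or systemd; side effects (download, symlink swap, starting the new service) would hang off the state changes.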

### Operator-side coordination

The operator is the only source of truth for "what version should
this device run". Two new fields per device, two new subjects.

**New on `Device` CR / KV `device-info`:**
- `current_version` — what the agent is running right now.
  Reported in heartbeat; reflected to the CR.
- `desired_version` — what the operator wants the agent to run.
  Set by operator-side logic (default: latest published; eventually
  canary / %-based).

**New NATS subjects (per-device, scoped by callout permissions):**
- `device-cmd.<device_id>.upgrade-stop` — operator → old agent.
  Payload: `{"reason": "...", "deadline_ms": ...}`. Sent only after
  operator has observed a heartbeat from the new version with
  `current_version == desired_version` AND `agent_state == "running"`.
- `device-state.<device_id>.upgrade` — agent → operator. Status
  events: `staging`, `verifying`, `cutover-ready`, `failed`, `done`.
  Drives `Device.status.upgrade.{phase, last_error, ...}`.

The operator only emits `upgrade-stop` after it has independently
verified the new agent is up. **Old agent does not stop itself
based on its own observations.** This is the load-bearing
property: an operator that has not seen the new version's
heartbeat never sends the stop signal, so the old agent cannot
exit while the operator still disagrees that the upgrade
succeeded. Single-source-of-truth handoff.
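
The gating rule for `upgrade-stop` can be sketched as a predicate on the operator side. This is an illustration only: the `Heartbeat` struct and the function name `should_send_upgrade_stop` are hypothetical, and the sketch assumes the operator keeps the recent heartbeats for one device in a slice.

```rust
/// Hypothetical, trimmed-down view of a heartbeat as seen by the operator.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Heartbeat {
    current_version: String,
    agent_state: String,
}

/// True only when some heartbeat for this device reports the desired version
/// AND the running state. Heartbeats from the old agent never satisfy this,
/// because their version differs from `desired_version`.
fn should_send_upgrade_stop(desired_version: &str, heartbeats: &[Heartbeat]) -> bool {
    heartbeats
        .iter()
        .any(|hb| hb.current_version == desired_version && hb.agent_state == "running")
}
```

During the two-heartbeat window the old agent's reports are simply ignored by the predicate, so the stop signal depends only on what the new agent has proven.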

### Failure modes and rollback

- **Staging fails (download / hash mismatch):** Agent returns to
  `Running` on the old version. Reports `phase: "failed"`,
  `last_error`. Operator sees the failure; can fix the artifact +
  retry by re-publishing the same desired_version (any publish of
  desired_version, changed or not, re-triggers the state machine).
- **Verifying fails (smoke test rc != 0):** Agent returns to
  `Running` on the old version. Reports failure. Staged binary
  stays on disk for inspection. Operator can collect, debug, ship
  a fixed version.
- **Cutover-Ready, but new agent never publishes a heartbeat
  with the new version within T_HEARTBEAT_TIMEOUT (suggested
  60s):** Old agent reverts the symlink, stops the parallel
  systemd transient service, transitions back to Running with
  the old version. Reports `failed`. Same recovery path.
- **Operator never sends stop signal (e.g., operator-side
  outage):** Old agent stays in Cutover-Ready indefinitely. Both
  agents are running; only the new one is publishing as the
  active one (the old one's writes are gated on its state). This
  is expensive (2× resource use) but safe — the operator is the
  authoritative coordinator and any other behavior would risk
  losing both agents at once.
- **Both agents alive but new agent crashes:** systemd's
  `Restart=on-failure` on the new agent's transient unit retries.
  If it can't come back, the operator never sends the stop signal,
  the old agent stays Cutover-Ready, and a human investigates.
  The fleet keeps working on the old version — the rollback is
  implicit.
- **Operator publishes an older `desired_version`:** Reverse
  rollout. Same mechanism, just with old/new swapped. The "new"
  binary is older, but the procedure is identical. The fact that
  no version is ever GC'd is what makes this work.
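
The Cutover-Ready failure modes above reduce to a three-way decision made by the old agent. The sketch below is an assumption-laden illustration: `CutoverAction` and `cutover_action` are hypothetical names, the 60s value is the suggested timeout from this ADR, and how the old agent learns that the new agent has published a heartbeat is left open here and passed in as a plain boolean.

```rust
use std::time::Duration;

/// Suggested value from this ADR; not a committed constant.
const T_HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(60);

#[derive(Debug, PartialEq, Eq)]
enum CutoverAction {
    /// Stay in Cutover-Ready; both agents keep running.
    Wait,
    /// Revert the symlink, stop the new service, go back to Running.
    Revert,
    /// Operator sent upgrade-stop; exit with rc=0.
    Exit,
}

/// Decision taken by the old agent while in Cutover-Ready.
fn cutover_action(
    since_cutover: Duration,
    new_heartbeat_seen: bool,
    stop_signal_received: bool,
) -> CutoverAction {
    if stop_signal_received {
        return CutoverAction::Exit;
    }
    if !new_heartbeat_seen && since_cutover >= T_HEARTBEAT_TIMEOUT {
        return CutoverAction::Revert;
    }
    // Covers both "still within the timeout" and "new agent reported but the
    // operator has not sent the stop signal yet": the old agent never exits
    // on its own.
    CutoverAction::Wait
}
```

Note that no amount of elapsed time produces `Exit`: only the operator's signal does, which matches the indefinite Cutover-Ready behavior during an operator outage.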

### What this isn't

- **Not fleet-wide.** Per-device. Fleet-wide canary / %-based
  rollouts are operator-side orchestration **on top of** this
  primitive. The operator would publish `desired_version` to a
  rolling subset of devices and watch heartbeats. Out of scope
  for v0.2: single-device upgrade is sufficient for a 100-Pi
  fleet, which is larger than anything on the 12-month customer
  roadmap.
- **Not blue/green of the entire OS.** We swap one userspace
  binary. The OS, podman, the systemd unit text, the kernel — all
  unchanged. Out of scope.
- **Not a package manager.** Versioned binaries land at fixed
  paths because we control them. apt / dpkg / OSTree are
  orthogonal and not in the loop.

## Rationale

- **No version ever erased.** Trivializes rollback (the previous
  binary is a `ln -sfn` away). Simplifies the failure tree:
  every "what if" branch resolves to "old still on disk". Disk
  cost on aarch64-musl is ~5–10 MB per version. At 12 versions
  / year, that's roughly 60–120 MB per year, or 0.6–1.2 GB after
  a decade of upgrades. Still small compared to Pi storage, and
  reclaimable by manual cleanup.
- **Symlink swap as cutover.** POSIX-atomic. No daemon state.
  Cheap to revert. Compatible with systemd unit references that
  point at a stable path.
- **Old verifies new, then reports up.** This is the load-bearing
  property: it places the verification at the agent (which has
  the only complete view of its own runtime state) but the
  *commitment* at the operator (which is the only thing safe to
  trust as the cluster-wide source of truth). Either side alone
  can fail safe; only consensus advances the upgrade.
- **Operator-driven stop, not agent self-stop.** A self-stopping
  agent could decide to exit before the operator agrees, leaving
  the cluster blind. Forcing the stop through the operator means
  any disagreement keeps the old agent alive — which is the
  desired bias.
- **Drains in-flight work first.** Mirrors K8s pod-shutdown
  semantics. A workload reconciling at the moment of swap
  finishes its current step, reports state, then queues. New
  agent picks up the queue once it's the active version. No
  observable flap on the workload.
- **Heartbeat-driven version reporting.** The agent already
  publishes heartbeats; adding the version field is one line.
  No new transport.
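
The symlink-swap rationale can be illustrated in code. The following is a minimal Rust sketch of the same two steps as `ln -sfn` followed by `mv -T`, using only the standard library; the function name `swap_symlink` and the `.new` suffix handling are illustrative assumptions, not the agent's actual implementation.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Points `link` at `target` by creating "<link>.new" and renaming it over
/// `link`. The rename replaces the old symlink in one step, so a reader of
/// `link` sees either the old target or the new one, never a missing path.
fn swap_symlink(target: &Path, link: &Path) -> io::Result<()> {
    let mut tmp = link.as_os_str().to_owned();
    tmp.push(".new");
    let tmp = PathBuf::from(tmp);

    // Clear a stale temporary link left behind by an interrupted attempt.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    symlink(target, &tmp)?;
    // rename(2) acts on the symlink itself, not on the file it points to.
    fs::rename(&tmp, link)
}
```

Reverting is the same call with the previous versioned binary as `target`, which is why rollback needs no separate code path.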

## Consequences

**Pros:**

- Bounded blast radius per upgrade (one device).
- Rollback is the same code path as upgrade — no special-case
  bug class.
- Operator's view is monotonic: heartbeats with versions are
  immutable history; there's no "did the upgrade really happen"
  state.
- Old agent never decides to exit on its own. The most dangerous
  failure mode in self-upgrading software (premature exit) is
  designed out.
- Compatible with eventual fleet-wide rollouts (canary, %-based)
  which become operator-side orchestration on top of this
  primitive.

**Cons:**

- Briefly runs two agents per device (Cutover-Ready window).
  Memory and connection-count both ~2× during that window.
  Acceptable for the upgrade duration (expected to be <60s).
- Requires reliable connectivity between agent and operator to
  complete the handoff. A device whose NATS link fails
  mid-upgrade stays in Cutover-Ready until the link recovers.
- Disk grows monotonically with version count. Bounded by human
  cleanup. We do not GC.
- New NATS subjects, new heartbeat fields, new `Device.status`
  fields. Schema bump that operators-in-the-field need to handle
  (the operator must understand "old agent reporting no version
  field" as `version: unknown`, not crash).
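
The backward-compatibility requirement in the last point can be sketched as a total function on the operator side. This is a hedged illustration: `ReportedVersion`, `reported_version` and `status_label` are hypothetical names, and the sketch assumes the heartbeat decoder hands over the version field as an optional string.

```rust
/// Version as understood by the operator. `Unknown` covers agents that
/// predate the version field in the heartbeat.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ReportedVersion {
    Known(String),
    Unknown,
}

/// Never fails: a missing or blank field becomes `Unknown` instead of an
/// error, so a pre-upgrade agent cannot crash the operator.
fn reported_version(field: Option<&str>) -> ReportedVersion {
    match field.map(str::trim) {
        Some(v) if !v.is_empty() => ReportedVersion::Known(v.to_string()),
        _ => ReportedVersion::Unknown,
    }
}

/// Text suitable for a status field such as `current_version`.
fn status_label(version: &ReportedVersion) -> String {
    match version {
        ReportedVersion::Known(v) => v.clone(),
        ReportedVersion::Unknown => "unknown".to_string(),
    }
}
```

Modelling the absent field as an explicit variant, rather than an empty string, forces every consumer to decide what an unversioned agent means for it.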

## Alternatives considered

1. **OS-package upgrade (apt / dpkg / OSTree).** *Pros:* zero
   custom code, standard toolchain, GPG-signed.
   *Cons:* Loses the "agent verifies the new agent before swap"
   property. apt's restart hook flips the symlink and `systemctl
   restart`s; if the new binary is broken, the device is bricked
   until human intervention. Doesn't drain in-flight work. Doesn't
   know about NATS-managed pause states. Couples the upgrade
   schedule to the distro's repo, not to the cluster operator's
   intent. Rejected.

2. **Pull-from-OCI-registry on each agent restart.** *Pros:* same
   primitive as podman / kube node-image-rotation.
   *Cons:* Coupling to a registry the device must reach — many
   customer fleets are on private subnets without registry
   access. Would mean shipping a registry mirror per fleet. Adds
   a dependency for a problem we can solve with a signed binary
   on a CDN.

3. **Two systemd units, blue/green at the unit level.**
   `fleet-agent-v0.1.1.service` and `fleet-agent-v0.1.2.service`,
   ratchet via systemctl enable/disable. *Pros:* no symlink dance.
   *Cons:* duplicates a lot of unit-file content; harder to
   reason about what the "active" unit is (you have to ask
   systemd, not `readlink`); doesn't compose well with the
   `ExecStart=/usr/local/bin/fleet-agent` line we already ship.
   Symlink swap is the lighter primitive.

4. **Self-stopping agent (no operator stop signal).** New agent
   tells old agent "I'm up, you can go" via NATS. *Pros:* one
   fewer subject.
   *Cons:* The new agent is also the agent we're least sure of
   — putting it in charge of the old one's lifecycle inverts the
   trust model. If the new agent has a bug that causes it to
   announce ready prematurely, the cluster goes blind. The
   operator path is the conservative choice.

5. **Operator-pushed binary (instead of agent-pulled).** The
   operator sshes / executes a one-off command per device.
   *Pros:* operator controls timing precisely.
   *Cons:* Reintroduces SSH as a control plane (we just spent a
   month getting rid of it for the enrollment flow). Doesn't
   scale to fleets where most devices are NATted away from the
   operator.

## Implementation milestones

(For a future implementer; not committed to a date here. Lives
in the v0.2+ backlog.)

1. **M1** — Versioned binary layout: builds produce
   `fleet-agent-v<version>` artifacts; install Score writes them
   to `/usr/bin/fleet-agent-v<version>` + creates
   `/usr/local/bin/fleet-agent` symlink. Existing tests cover the
   rest.
2. **M2** — Version field in heartbeat + `Device.status.current_version`
   reflection on the operator side. No upgrade behavior yet.
3. **M3** — `desired_version` field on the device-info KV +
   operator setter. No agent-side action yet.
4. **M4** — Agent state machine, end to end, gated by a feature
   flag. Operator publishes desired_version → agent does the
   dance → operator sends stop signal → done. Includes
   failure-mode tests (download fail, smoke fail,
   heartbeat-timeout revert).
5. **M5** — Remove the feature flag. Default-on.
6. **M6** — Operator-side rollout strategies (canary, %-based) —
   only after M5 has been in production for 30 days against a
   real fleet.

## Additional Notes

- Binary signing + signature verification is in scope for the
  `Staging` step, but the choice of signing scheme (cosign,
  optionally with the Rekor transparency log, or minisign) is
  deferred until the M1 implementation. Whatever
  we pick must work on aarch64-musl Pi devices without
  additional system dependencies.
- The N-versions-on-disk policy is "all of them, forever" per
  the constraint above. If disk pressure becomes real on some
  customer fleet, a manual GC tool can prune `/usr/bin/fleet-agent-v*`
  by date — never automatic, never as part of the upgrade
  itself.
- See JG's *Pour l'amour des compilateurs* talk (Botpress
  Meetup, 2026-04-30) for the framing applied here:
  cardinality-matched types and operator-as-coordinator are the
  same idea, applied to one function and to one platform.