harmony/docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md
Jean-Gabriel Gill-Couture 6eb4b94efd docs(fleet): add recovered v0.3 plan + upgrade/rollback ADR draft
Planning artifacts captured on the branch so they persist; placement to
be sorted post-demo.
2026-06-02 10:00:00 -04:00

# ADR-0042: Fleet IoT Device System Upgrade and Rollback
- **Status:** Accepted
- **Date:** 2026-05-24
- **Component:** Harmony — IoT Fleet Management
- **Supersedes:** —
- **Superseded by:** —
## Context
Harmony's IoT fleet management product manages agents running on customer-owned Raspberry Pi devices under Raspberry Pi OS (Debian-based). System updates are performed with the stock package manager (`apt update && apt full-upgrade`) followed by a reboot. We require the ability to **automatically roll back a failed OS upgrade** so that a device which does not return to a healthy, control-plane-connected state is restored to its pre-upgrade condition without a truck roll.
The agent already has a solved, self-contained upgrade-with-rollback flow for itself: it runs as a privileged Podman container that autostarts on boot. On agent upgrade we suspend management activity and set an `upgrading` flag in the control-plane database; the new agent version clears the flag once it connects, the old agent shuts down cleanly, and operations resume. **OS upgrades are the open problem** addressed by this ADR.
### Constraints
These constraints are firm and shaped the decision:
1. **Drop-in agent on a customer's existing installation.** We must minimize changes to the customer's installation flow. Our agent should be drop-in functionality, not a re-platforming of the customer's device.
2. **No distro change.** Customers run custom out-of-tree kernel modules. We cannot switch distributions, and we must survive kernel upgrades that rebuild those modules.
3. **Filesystem-layer change is acceptable; partition-layout change is not.** Owning the partition table and bootloader configuration (as A/B schemes require) is a more invasive intrusion into a customer's installation than changing the filesystem/volume layer underneath an unchanged `root`. Our preference ordering is explicitly: *filesystem/volume layer change ≪ partition layout change.*
4. **Controlling the device image is a last resort.** We can issue imaging recommendations to customers, but we do not want to depend on shipping our own image.
5. **Rolling upgrades with canaries.** Fleet upgrades proceed device-by-device with a small number of canary devices first. The rollout strategy is orthogonal to this ADR, but the design assumes it.
### The core technical problem
The rollback trigger has two distinct failure modes, and a design that handles only the first is insufficient:
- **Soft failure** — the device boots, the agent starts, but cannot reach the control plane (or the new agent crashes). Userspace is alive, so a userspace watchdog can perform the rollback.
- **Hard failure** — the upgrade produces a `root` that does not boot at all: broken initramfs, an incompatible kernel/module combination, a DKMS rebuild failure, or a misconfigured root device. Userspace is never reached, so **a userspace watchdog never runs**. Something earlier in the boot chain must own the rollback decision.
Customers running custom kernel modules are precisely the population most likely to hit the hard-failure case during a kernel bump, so hard-failure coverage is mandatory, not optional.
## Decision
We will implement OS upgrade rollback using **LVM thin snapshots of an unchanged ext4 root**, combined with a **two-tier watchdog**: an initramfs-level boot-attempt counter for hard failures and a userspace control-plane-check-in timer for soft failures.
### Why LVM thin
- **Invisible to the package manager, the kernel package, and customer modules.** Root remains ext4; LVM (device-mapper) sits below the filesystem. `apt`, DKMS, `update-initramfs`, and the customer's build scripts neither know nor care that ext4 lives on a logical volume. This is the maximally non-invasive snapshot layer.
- **In-tree.** Device-mapper is in the mainline kernel. Unlike an out-of-tree module, the snapshot mechanism itself cannot be broken by the kernel upgrade it is meant to protect against.
- **Thin snapshots specifically:** cheap to create, no pre-sized CoW area to overflow, no classic-snapshot write-performance degradation.
- **Honors the constraint ordering.** Converting the existing root partition to PV/VG/LV is a one-time filesystem/volume-layer operation, not a repartition. After conversion, every upgrade is snapshot/merge with no further structural change.
### Boot-time decision authority
The rollback trigger (boot-attempt counter) and rollback markers live in **initramfs and on the FAT `/boot` firmware partition** — outside the LVM volume and outside the snapshot. This is the single decision that separates "survives a kernel that will not boot" from "only survives an agent that will not connect." Operational health-of-agent state remains the control-plane DB's responsibility; the boot-attempt counter on `/boot` drives the unattended rollback decision.
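
As an illustration of the marker handling described above, the following is a minimal sketch of reading and writing `key=value` markers on the firmware partition. The file name `harmony-upgrade.env` and both function names are hypothetical; the ADR only states that the markers live on `/boot`. The write goes through a temporary file and a rename to reduce the window for a torn write, although a rename on FAT is not guaranteed to be atomic across power loss.

```python
import os
import tempfile

# Hypothetical marker file name; the ADR only says the markers live on /boot.
MARKER_FILE = "harmony-upgrade.env"


def read_markers(boot_dir: str) -> dict[str, str]:
    """Return key=value markers, or an empty dict if no marker file exists."""
    path = os.path.join(boot_dir, MARKER_FILE)
    try:
        with open(path, encoding="ascii") as fh:
            lines = fh.read().splitlines()
    except FileNotFoundError:
        return {}
    markers = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines instead of failing the boot
            markers[key.strip()] = value.strip()
    return markers


def write_markers(boot_dir: str, markers: dict[str, str]) -> None:
    """Write markers to a temp file, flush it to the medium, then rename it."""
    path = os.path.join(boot_dir, MARKER_FILE)
    fd, tmp = tempfile.mkstemp(dir=boot_dir, prefix=".markers-")
    with os.fdopen(fd, "w", encoding="ascii") as fh:
        for key in sorted(markers):
            fh.write(f"{key}={markers[key]}\n")
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)
```

The real initramfs hook would most likely be a shell script; this Python version only models the file format and the write ordering.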
## Architecture
### One-time provisioning (ideally not on a live, in-service device)
1. Convert the existing root partition to PV → VG (`vg0`) → LV (`vg0/root`), preserving ext4.
2. Install the initramfs `local-top` boot-attempt-counter hook.
3. Regenerate the initramfs with LVM support and the hook included.
4. Update `cmdline.txt` to `root=/dev/mapper/vg0-root` (prefer by-UUID where available).
5. Configure the BCM2835 hardware watchdog.
This conversion touches the initramfs and the kernel command line, so it is a careful, scripted operation best run at provisioning time rather than live on an in-service device.
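
The `cmdline.txt` step lends itself to a small, idempotent rewrite. The sketch below assumes the file holds a single space-separated kernel command line and that no argument contains quoted whitespace; the function name is hypothetical, and the `PARTUUID` value in the usage note is purely illustrative.

```python
def set_root_device(cmdline: str, root: str) -> str:
    """Replace (or append) the root= argument in a kernel command line."""
    args = cmdline.split()
    replaced = False
    for i, arg in enumerate(args):
        if arg.startswith("root="):
            args[i] = f"root={root}"
            replaced = True
    if not replaced:
        # No root= argument present: add one rather than leaving it unset.
        args.append(f"root={root}")
    return " ".join(args)
```

For example, `set_root_device("console=tty1 root=PARTUUID=abcd-02 rootwait", "/dev/mapper/vg0-root")` yields `console=tty1 root=/dev/mapper/vg0-root rootwait`, and applying it a second time leaves the line unchanged.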
### Upgrade flow (every upgrade, fully agent-driven)
1. Suspend management activity; set `upgrade-pending` in the control-plane DB.
2. `lvcreate` a thin snapshot `vg0/root_preupgrade`.
3. Write `bootcount=0` and `expected-good=false` markers to `/boot`.
4. `apt update && apt full-upgrade` (initramfs is regenerated here for the new kernel + modules).
5. `reboot`.
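
The five steps above can be sketched as a single ordered routine. This is a hedged illustration, not the agent's actual code: `run_upgrade`, `run_checked`, and the two injected callables (`set_upgrade_pending`, `write_markers`) are hypothetical names, and `apt-get dist-upgrade` is used as the script-oriented equivalent of `apt full-upgrade`. It assumes `vg0/root` is already a thin volume, so `lvcreate --snapshot` without a size creates a thin snapshot.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(argv: Sequence[str]) -> None:
    """Default runner: raise CalledProcessError on a non-zero exit status."""
    subprocess.run(list(argv), check=True)


def run_upgrade(
    set_upgrade_pending: Callable[[], None],
    write_markers: Callable[[dict[str, str]], None],
    run: Runner = run_checked,
    vg: str = "vg0",
    origin: str = "root",
    snapshot: str = "root_preupgrade",
) -> None:
    """Run the upgrade steps in order; any exception aborts before reboot."""
    # 1. Suspend management and flag the upgrade in the control-plane DB.
    set_upgrade_pending()
    # 2. Snapshot the root LV before anything on it changes.
    run(["lvcreate", "--snapshot", "--name", snapshot, f"{vg}/{origin}"])
    # 3. Arm the boot-attempt counter on the firmware partition.
    write_markers({"bootcount": "0", "expected-good": "false"})
    # 4. Upgrade packages; the initramfs is regenerated as part of this step.
    run(["apt-get", "update"])
    run(["apt-get", "--yes", "dist-upgrade"])
    # 5. Reboot into the upgraded system.
    run(["systemctl", "reboot"])
```

Injecting the runner keeps the ordering testable without touching a real volume group. Because the snapshot is taken before the markers are armed, a failed `lvcreate` never leaves the device with an armed counter and nothing to roll back to.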
### Post-boot resolution
- **initramfs (hard failure):** increment `bootcount` on `/boot`; if `bootcount > N` (target N = 2–3), run `lvconvert --merge vg0/root_preupgrade`, drop the marker, and reboot into the restored root.
- **userspace agent (soft failure):** the new agent must achieve a successful **control-plane check-in** within the soft timeout. On success it resets `bootcount`, marks the snapshot committed (`lvremove` the pre-upgrade snapshot), clears `upgrade-pending`, and resumes operations. If the timer fires first, the userspace watchdog performs `lvconvert --merge` + reboot.
- **hardware watchdog:** catches total hangs that stall even early boot; the reset returns control to the initramfs bootcount logic, which eventually triggers the merge.
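
The hard-failure decision reduces to a small state transition on the marker set. The sketch below models it in Python for clarity; the real hook would run as a shell script inside the initramfs. `BootAction` and `on_boot_attempt` are hypothetical names, and treating an empty marker set as "no upgrade on probation" is an assumption, not something the ADR specifies.

```python
from enum import Enum


class BootAction(Enum):
    CONTINUE = "continue"  # proceed to mount root normally
    ROLLBACK = "rollback"  # merge the pre-upgrade snapshot, then reboot


def on_boot_attempt(
    markers: dict[str, str], threshold: int
) -> tuple[dict[str, str], BootAction]:
    """Model one pass of the boot-attempt check.

    Returns the updated markers (to be written back to /boot) and the action.
    """
    if not markers:
        # Assumption: no markers means no upgrade is on probation.
        return markers, BootAction.CONTINUE
    count = int(markers.get("bootcount", "0")) + 1
    updated = {**markers, "bootcount": str(count)}
    if count > threshold:
        return updated, BootAction.ROLLBACK
    return updated, BootAction.CONTINUE
```

With `threshold=2`, two boot attempts are allowed to proceed and the third triggers the rollback. The incremented counter has to be written back to `/boot` before the boot continues, otherwise a hang later in boot would never advance it.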
### Key parameters and rules
- **Soft timeout** is measured against **successful control-plane check-in, not mere agent process start.** A privileged Podman container can start yet still be unable to reach us; in that state we prefer rollback over a bricked-but-running device. Initial value: 10 minutes.
- **Boot-attempt threshold** is kept low (2–3) so a hard-failing device gives up in minutes rather than looping for half an hour.
- **Snapshot merge semantics:** `lvconvert --merge` reverts the entire root LV to the snapshot point. The merge of an in-use origin completes on the next reactivation/reboot, so the pattern is *mark-for-merge → reboot → merge completes during activation.*
- **Data written between upgrade and rollback is discarded** by the merge. Anything the new agent must persist across a rollback has to live outside the snapshot — on a separate LV or in the control-plane DB — or it will be reverted with the rest of root.
- **Marker placement is deliberate:** boot-attempt counter on `/boot` (FAT, outside LVM, survives rollback, drives the decision); operational state in the control-plane DB.
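
The soft-timeout rule (gate on a successful check-in, not on process start) can be sketched as a polling loop with an injected clock. The function name, the 15-second poll interval, and the `"commit"`/`"rollback"` return values are hypothetical; only the 10-minute initial timeout comes from this ADR.

```python
import time
from typing import Callable


def await_check_in(
    check_in: Callable[[], bool],
    timeout_s: float = 600.0,   # initial soft timeout from this ADR: 10 minutes
    interval_s: float = 15.0,   # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "commit" after a successful check-in, else "rollback" on timeout."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check_in():
                return "commit"
        except OSError:
            pass  # network errors count as "not yet", not as a crash
        remaining = deadline - clock()
        if remaining <= 0:
            return "rollback"
        sleep(min(interval_s, remaining))
```

Using a monotonic clock means a wall-clock jump after boot (for example, an NTP correction) neither shortens nor extends the probation window. On `"commit"` the caller would remove the snapshot and clear the markers; on `"rollback"` it would mark the snapshot for merge and reboot.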
## Alternatives considered
### ZFS on root (with snapshots / boot environments) — Rejected
ZFS offers superior rollback ergonomics: atomic, fully consistent `zfs rollback`, and boot environments that provide an elegant A/B-without-partitions story. Rejected for this use case because:
- **Out-of-tree DKMS module.** ZFS rebuilds against every new kernel. A failed build or a non-loading module means **root will not mount** — coupling the snapshot mechanism to the exact event (a bad kernel upgrade) it is meant to protect against. This circular dependency is disqualifying for a tool whose job is surviving bad kernel upgrades. LVM/device-mapper, being in-tree, has no such dependency.
- **Higher customer-visible footprint.** Root becomes ZFS; `fstab`, `root=ZFS=`, initramfs hooks, and the customer's existing monitoring/backup scripts are all affected. "ext4-on-LV" is invisible by comparison.
- **More DKMS surface.** Both ZFS and the customer's custom modules would share the kernel-bump rebuild path.
- **ARC/memory tuning** is a real operational tax on 2–4 GB Pi models.
ZFS remains the technically superior rollback model and would be reconsidered if we independently wanted ZFS for data-integrity or `send`/`receive` reasons.
### Partition-level A/B (RAUC / Mender / swupdate, Pi `tryboot`) — Rejected
The embedded-industry-standard answer: dual root partitions, image-based deployment, bootloader-owned A/B flip with a boot-attempt counter (on Pi, via the `tryboot` one-shot mechanism). The bootloader, not the agent, owns the rollback decision, which cleanly solves the hard-failure case. Rejected because:
- Requires **owning the partition table and bootloader configuration** — the most invasive intrusion into a customer's installation, violating constraint 3.
- Cannot be retrofitted onto already-deployed stock single-partition devices without re-imaging, which conflicts with the drop-in goal (constraint 1) and the "image control is last resort" goal (constraint 4).
Worth revisiting only for devices we image ourselves from the start.
### Best-effort package-state rollback on a single ext4 partition — Rejected
Capture `dpkg --get-selections` plus locally cached `.deb` files, then roll back via `apt install pkg=oldversion` against the cache. Rejected as a primary mechanism because it restores *packages*, not the *filesystem*: it does not cover config-file changes, maintainer-script state mutations, or kernel/module mismatches — exactly where these upgrades actually break. Additionally, on a single partition the recovery logic lives on the same root that the failed upgrade may have corrupted. Retained only as a possible degraded fallback for fleets where LVM conversion is impossible.
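
To make the limits of this fallback concrete, here is a hedged sketch of how the downgrade command could be derived. Note that `dpkg --get-selections` records selection state but not versions, so the sketch assumes a separate version listing captured with `dpkg-query -W -f='${Package} ${Version}\n'` before and after the upgrade. The function names and the package names in the usage note are hypothetical.

```python
def parse_versions(dpkg_query_output: str) -> dict[str, str]:
    """Parse lines of '<package> <version>' into a mapping."""
    versions = {}
    for line in dpkg_query_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def downgrade_command(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Build an apt-get command that pins changed packages back to `before`.

    Packages newly installed by the upgrade are not removed, and nothing
    outside the package database (config files, maintainer-script side
    effects) is restored.
    """
    pins = [
        f"{pkg}={ver}"
        for pkg, ver in sorted(before.items())
        if after.get(pkg) != ver
    ]
    if not pins:
        return []
    return ["apt-get", "install", "--yes", "--allow-downgrades", *pins]
```

For a pre-upgrade listing of `libfoo 1.0-1` and a post-upgrade listing of `libfoo 1.1-1`, the result is `apt-get install --yes --allow-downgrades libfoo=1.0-1`, which only succeeds if the old `.deb` is still available in the local cache.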
## Consequences
### Positive
- True filesystem-level rollback that survives both soft and hard failures, including unbootable roots from bad kernel/module combinations.
- Snapshot layer is invisible to `apt`, DKMS, and customer modules; no distro change; root stays ext4.
- Honors the constraint ordering — a one-time volume-layer conversion instead of repartitioning.
- The rollback decision-maker (initramfs + `/boot` bootcount) is independent of the rootfs it protects.
### Negative / costs
- **One-time LVM conversion is a real, careful operation** touching initramfs and `cmdline.txt`; best done at provisioning, not live on an in-service device. This is the largest remaining footprint on the customer's installation.
- We must build and maintain a **custom initramfs `local-top` hook** and keep it correct across Raspberry Pi OS initramfs-tools changes.
- `lvconvert --merge` **discards data written during the probation window**; any must-persist state has to be placed outside the snapshot by design.
- Devices where LVM conversion is genuinely impossible fall back to the weaker package-state approach (degraded coverage).
### Risks to watch
- Initramfs regeneration during `apt full-upgrade` must reliably include the LVM bits and our hook for the new kernel; a regression here is itself a hard-failure source.
- Thin-pool capacity must be sized so a snapshot plus upgrade churn cannot exhaust the pool.
- Hardware-watchdog timeout and boot-attempt threshold need field tuning so total hangs are caught without prematurely rolling back a slow-but-healthy boot.
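
The thin-pool risk suggests a pre-flight check before each snapshot. The sketch below parses the output of `lvs --noheadings -o data_percent,metadata_percent vg0/pool` and refuses to proceed above a usage threshold. The pool name `vg0/pool`, the function names, and the 70% defaults are assumptions for illustration; real thresholds need field data on how much an upgrade actually writes. It also assumes the command runs under a locale that uses `.` as the decimal separator (for example `LC_ALL=C`).

```python
def pool_usage(lvs_output: str) -> tuple[float, float]:
    """Parse 'data_percent metadata_percent' from one line of lvs output."""
    data, meta = lvs_output.split()
    return float(data), float(meta)


def safe_to_snapshot(
    lvs_output: str,
    max_data_pct: float = 70.0,  # assumed threshold, to be tuned
    max_meta_pct: float = 70.0,  # assumed threshold, to be tuned
) -> bool:
    """Return True only if both data and metadata usage leave headroom."""
    data, meta = pool_usage(lvs_output)
    return data <= max_data_pct and meta <= max_meta_pct
```

Metadata usage is checked alongside data usage because a thin pool can fail on either one filling up.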
## Follow-up work
- Implement and review the initramfs `local-top` boot-attempt-counter hook (reads/increments/resets the counter on FAT `/boot`, conditionally fires `lvconvert --merge`).
- Implement the userspace systemd timer / agent thread for the soft-failure path (control-plane check-in gate, snapshot commit, clean rollback).
- Define and document the scripted, idempotent provisioning conversion (partition → PV/VG/LV, initramfs regen, `cmdline.txt` update, watchdog config).
- Validate the full matrix on canary hardware: clean upgrade, soft failure (no check-in), hard failure (unbootable kernel), and total hang.
- Specify which agent-local state, if any, must live outside the snapshot, and where.