
# ADR-0042: Fleet IoT Device System Upgrade and Rollback

- **Status:** Accepted
- **Date:** 2026-05-24
- **Component:** Harmony — IoT Fleet Management
- **Supersedes:** —
- **Superseded by:** —

## Context

Harmony's IoT fleet management product manages agents running on customer-owned Raspberry Pi devices under Raspberry Pi OS (Debian-based). System updates are performed with the stock package manager (`apt update && apt full-upgrade`) followed by a reboot. We require the ability to **automatically roll back a failed OS upgrade** so that a device which does not return to a healthy, control-plane-connected state is restored to its pre-upgrade condition without a truck roll.

The agent already has a solved, self-contained upgrade-with-rollback flow for itself: it runs as a privileged Podman container that autostarts on boot. During an agent upgrade we suspend management activity and set an `upgrading` flag in the control-plane database. The new agent version clears the flag once it connects, the old agent then shuts down cleanly, and operations resume. **OS upgrades are the open problem** addressed by this ADR.

### Constraints

These constraints are firm and shaped the decision:

1. **Drop-in agent on a customer's existing installation.** We must minimize changes to the customer's installation flow. Our agent should be drop-in functionality, not a re-platforming of the customer's device.
2. **No distro change.** Customers run custom out-of-tree kernel modules. We cannot switch distributions, and we must survive kernel upgrades that rebuild those modules.
3. **Filesystem-layer change is acceptable; partition-layout change is not.** Owning the partition table and bootloader configuration (as A/B schemes require) is a more invasive intrusion into a customer's installation than changing the filesystem/volume layer underneath an unchanged `root`. Our preference ordering is explicitly: *filesystem/volume layer change ≪ partition layout change.*
4. **Controlling the device image is a last resort.** We can issue imaging recommendations to customers, but we do not want to depend on shipping our own image.
5. **Rolling upgrades with canaries.** Fleet upgrades proceed device-by-device, starting with a small number of canary devices. The rollout strategy itself is outside the scope of this ADR, but the design assumes it.

### The core technical problem

The rollback trigger has two distinct failure modes, and a design that handles only the first is insufficient:

- **Soft failure** — the device boots, the agent starts, but cannot reach the control plane (or the new agent crashes). Userspace is alive, so a userspace watchdog can perform the rollback.
- **Hard failure** — the upgrade produces a `root` that does not boot at all: broken initramfs, an incompatible kernel/module combination, a DKMS rebuild failure, or a misconfigured root device. Userspace is never reached, so **a userspace watchdog never runs**. Something earlier in the boot chain must own the rollback decision.

Customers running custom kernel modules are precisely the population most likely to hit the hard-failure case during a kernel bump, so hard-failure coverage is mandatory, not optional.

## Decision

We will implement OS upgrade rollback using **LVM thin snapshots of an unchanged ext4 root**, combined with a **two-tier watchdog**: an initramfs-level boot-attempt counter for hard failures and a userspace control-plane-check-in timer for soft failures.

### Why LVM thin

- **Invisible to the package manager, the kernel package, and customer modules.** Root remains ext4; LVM (device-mapper) sits below the filesystem. `apt`, DKMS, `update-initramfs`, and the customer's build scripts neither know nor care that ext4 lives on a logical volume. This is the maximally non-invasive snapshot layer.
- **In-tree.** Device-mapper is in the mainline kernel. Unlike an out-of-tree module, the snapshot mechanism itself cannot be broken by the kernel upgrade it is meant to protect against.
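- **Thin snapshots specifically:** cheap to create, no per-snapshot pre-sized CoW area to overflow (space comes from the shared thin pool, which must itself be sized adequately), and none of the write-performance degradation of classic LVM snapshots.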
- **Honors the constraint ordering.** Converting the existing root partition to PV/VG/LV is a one-time filesystem/volume-layer operation, not a repartition. After conversion, every upgrade is snapshot/merge with no further structural change.

### Boot-time decision authority

The rollback trigger (boot-attempt counter) and rollback markers live in **initramfs and on the FAT `/boot` firmware partition** — outside the LVM volume and outside the snapshot. This is the single decision that separates "survives a kernel that will not boot" from "only survives an agent that will not connect." Operational health-of-agent state remains the control-plane DB's responsibility; the boot-attempt counter on `/boot` drives the unattended rollback decision.
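
As an illustration of what the markers on `/boot` could look like, the following is a minimal sketch that assumes a single `key=value` text file. The file name `harmony-upgrade.env` and the helper names are hypothetical, not part of this decision; the real initramfs hook would most likely be a shell script, so this only models the format.

```python
import os
from pathlib import Path


def parse_markers(text: str) -> dict[str, str]:
    """Parse key=value lines, ignoring blanks, comments and malformed lines."""
    markers: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        markers[key.strip()] = value.strip()
    return markers


def render_markers(markers: dict[str, str]) -> str:
    """Render markers in a stable (sorted) order, one per line."""
    return "".join(f"{key}={value}\n" for key, value in sorted(markers.items()))


def write_markers(path: Path, markers: dict[str, str]) -> None:
    """Write via a temporary file and rename.

    Assumption: rename on FAT is not guaranteed to be power-fail atomic,
    so a reader should treat a missing or unparsable file conservatively.
    """
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as fh:
        fh.write(render_markers(markers))
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)
```

Keeping the format this simple means the same file can be read from an initramfs shell script without extra tooling.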

## Architecture

### One-time provisioning (ideally not on a live, in-service device)

1. Convert the existing root partition to PV → VG (`vg0`) → thin pool → thin LV (`vg0/root`), preserving ext4. The root LV must be thin-provisioned, because thin snapshots can only be taken of a thin volume.
2. Install the custom initramfs `local-top` boot-attempt-counter hook.
3. Regenerate the initramfs with LVM support and the hook included.
4. Update `cmdline.txt` to `root=/dev/mapper/vg0-root` (prefer by-UUID where available).
5. Configure the BCM2835 hardware watchdog.

This conversion touches the initramfs and the kernel command line, so it is a careful, scripted operation best run at provisioning time rather than live on an in-service device.
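
The `cmdline.txt` step is the easiest part to make idempotent. A minimal sketch, assuming the file is a single line of space-separated kernel parameters; the function name `set_root_device` is hypothetical:

```python
def set_root_device(cmdline: str, root: str = "/dev/mapper/vg0-root") -> str:
    """Return cmdline with exactly one root= parameter pointing at `root`.

    Only tokens starting with "root=" are touched, so rootfstype=, rootwait
    and similar parameters are preserved. Running it twice gives the same result.
    """
    out: list[str] = []
    replaced = False
    for token in cmdline.split():
        if token.startswith("root="):
            if not replaced:
                out.append(f"root={root}")
                replaced = True
            continue  # drop any duplicate root= tokens
        out.append(token)
    if not replaced:
        out.append(f"root={root}")
    return " ".join(out) + "\n"
```

For example, `console=tty1 root=PARTUUID=abcd-02 rootfstype=ext4 rootwait` keeps every parameter except `root=`, which is rewritten in place.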

### Upgrade flow (every upgrade, fully agent-driven)

1. Suspend management activity; set `upgrade-pending` in the control-plane DB.
2. `lvcreate` a thin snapshot `vg0/root_preupgrade`.
3. Write `bootcount=0` and `expected-good=false` markers to `/boot`.
4. `apt update && apt full-upgrade` (initramfs is regenerated here for the new kernel + modules).
5. `reboot`.
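
The ordering of these steps matters: the snapshot must exist before any package is touched. The sketch below models the flow with injected callables so it can be exercised without root privileges. The names `run_upgrade`, `set_pending`, and `write_markers` are hypothetical, and how to handle an `apt` failure mid-upgrade is not specified by this ADR; the sketch simply stops and reports.

```python
from typing import Callable, Sequence

# A runner executes one command and returns its exit status.
Runner = Callable[[Sequence[str]], int]

SNAPSHOT_CMD = ["lvcreate", "-s", "-n", "root_preupgrade", "vg0/root"]
UPGRADE_CMDS = [["apt", "update"], ["apt", "-y", "full-upgrade"]]


def run_upgrade(
    run: Runner,
    set_pending: Callable[[], None],
    write_markers: Callable[[dict], None],
) -> bool:
    """Drive the upgrade flow; return True only if the reboot was requested."""
    set_pending()  # step 1: upgrade-pending in the control-plane DB
    if run(SNAPSHOT_CMD) != 0:
        return False  # step 2 failed: no snapshot, so do not touch packages
    write_markers({"bootcount": "0", "expected-good": "false"})  # step 3
    for cmd in UPGRADE_CMDS:  # step 4
        if run(cmd) != 0:
            return False  # assumption: the caller decides how to recover
    return run(["reboot"]) == 0  # step 5
```

In production the runner could be a thin wrapper around `subprocess.run(cmd).returncode`.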

### Post-boot resolution

- **initramfs (hard failure):** increment `bootcount` on `/boot`; if `bootcount > N` (target N = 2–3), run `lvconvert --merge vg0/root_preupgrade`, clear the markers on `/boot`, and reboot into the restored root.
- **userspace agent (soft failure):** the new agent must achieve a successful **control-plane check-in** within the soft timeout. On success it resets `bootcount`, marks the snapshot committed (`lvremove` the pre-upgrade snapshot), clears `upgrade-pending`, and resumes operations. If the timer fires first, the agent runs `lvconvert --merge vg0/root_preupgrade` and reboots.
- **hardware watchdog:** catches total hangs that stall even early boot; the reset returns control to the initramfs bootcount logic, which eventually triggers the merge.
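
The hard-failure rule can be modeled as a pure function over the marker state. This is a sketch of the decision logic only: the actual hook runs in initramfs and would likely be shell. It assumes that a missing `bootcount` marker means no upgrade is in flight, which this ADR does not state explicitly; `BootAction` and `on_early_boot` are hypothetical names.

```python
from enum import Enum


class BootAction(Enum):
    CONTINUE = "continue"  # proceed to mount root and boot normally
    ROLLBACK = "rollback"  # caller runs lvconvert --merge, then reboots


def on_early_boot(markers: dict, threshold: int = 3) -> tuple[BootAction, dict]:
    """Return the action for this boot and the markers to write back."""
    if "bootcount" not in markers:
        # Assumption: no marker means no upgrade is pending, so do nothing.
        return BootAction.CONTINUE, markers
    count = int(markers["bootcount"]) + 1
    if count > threshold:
        # Markers are cleared so the restored root boots without counting.
        return BootAction.ROLLBACK, {}
    return BootAction.CONTINUE, {**markers, "bootcount": str(count)}
```

With `threshold=3`, three boot attempts are allowed to proceed and the fourth pass through the hook triggers the merge.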

### Key parameters and rules

- **Soft timeout** is measured against **successful control-plane check-in, not mere agent process start.** A privileged Podman container can start yet still be unable to reach us; in that state we prefer rollback over a bricked-but-running device. Initial value: 10 minutes.
- **Boot-attempt threshold** is kept low (2–3) so a hard-failing device gives up in minutes rather than looping for half an hour.
- **Snapshot merge semantics:** `lvconvert --merge` reverts the entire root LV to the snapshot point. The merge of an in-use origin completes on the next reactivation/reboot, so the pattern is *mark-for-merge → reboot → merge completes during activation.*
- **Data written between upgrade and rollback is discarded** by the merge. Anything the new agent must persist across a rollback has to live outside the snapshot — on a separate LV or in the control-plane DB — or it will be reverted with the rest of root.
- **Marker placement is deliberate:** boot-attempt counter on `/boot` (FAT, outside LVM, survives rollback, drives the decision); operational state in the control-plane DB.
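
The soft-timeout rule (gate on check-in, not on process start) could be sketched as follows. The function name `await_check_in`, the polling interval, and the injectable clock are assumptions made for illustration; only the 10-minute initial timeout comes from this ADR.

```python
import time
from typing import Callable


def await_check_in(
    check_in: Callable[[], bool],
    timeout_s: float = 600.0,  # initial soft timeout: 10 minutes
    poll_s: float = 5.0,       # hypothetical polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "commit" on a successful check-in, "rollback" on timeout.

    "commit" means: reset bootcount, lvremove the snapshot, clear
    upgrade-pending. "rollback" means: lvconvert --merge, then reboot.
    """
    deadline = clock() + timeout_s
    while True:
        if check_in():
            return "commit"
        if clock() >= deadline:
            return "rollback"
        sleep(poll_s)
```

Using a monotonic clock avoids a spurious timeout or extension if the wall clock jumps when the device syncs time after boot.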

## Alternatives considered

### ZFS on root (with snapshots / boot environments) — Rejected

ZFS offers superior rollback ergonomics: atomic, fully consistent `zfs rollback`, and boot environments that provide an elegant A/B-without-partitions story. Rejected for this use case because:

- **Out-of-tree DKMS module.** ZFS rebuilds against every new kernel. A failed build or a non-loading module means **root will not mount** — coupling the snapshot mechanism to the exact event (a bad kernel upgrade) it is meant to protect against. This circular dependency is disqualifying for a tool whose job is surviving bad kernel upgrades. LVM/device-mapper, being in-tree, has no such dependency.
- **Higher customer-visible footprint.** Root becomes ZFS; `fstab`, `root=ZFS=`, initramfs hooks, and the customer's existing monitoring/backup scripts are all affected. "ext4-on-LV" is invisible by comparison.
- **More DKMS surface.** Both ZFS and the customer's custom modules would share the kernel-bump rebuild path.
- **ARC/memory tuning** is a real operational tax on 2–4 GB Pi models.

ZFS remains the technically superior rollback model and would be reconsidered if we independently wanted ZFS for data-integrity or `send`/`receive` reasons.

### Partition-level A/B (RAUC / Mender / swupdate, Pi `tryboot`) — Rejected

The embedded-industry-standard answer: dual root partitions, image-based deployment, bootloader-owned A/B flip with a boot-attempt counter (on Pi, via the `tryboot` one-shot mechanism). The bootloader, not the agent, owns the rollback decision, which cleanly solves the hard-failure case. Rejected because:

- Requires **owning the partition table and bootloader configuration** — the most invasive intrusion into a customer's installation, violating constraint 3.
- Cannot be retrofitted onto already-deployed stock single-partition devices without re-imaging, which conflicts with the drop-in goal (constraint 1) and the "image control is last resort" goal (constraint 4).

Worth revisiting only for devices we image ourselves from the start.

### Best-effort package-state rollback on a single ext4 partition — Rejected

Capture `dpkg --get-selections` plus locally cached `.deb` files, then roll back via `apt install pkg=oldversion` against the cache. Rejected as a primary mechanism because it restores *packages*, not the *filesystem*: it does not cover config-file changes, maintainer-script state mutations, or kernel/module mismatches — exactly where these upgrades actually break. Additionally, on a single partition the recovery logic lives on the same root that the failed upgrade may have corrupted. Retained only as a possible degraded fallback for fleets where LVM conversion is impossible.
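
For the degraded fallback, a sketch of how the downgrade command might be derived. Note that `dpkg --get-selections` does not record versions, so this example swaps in the tab-separated `package<TAB>version` output of `dpkg-query -W`, captured before and after the upgrade. The helper names are hypothetical, and packages newly installed by the upgrade are deliberately not removed here.

```python
def parse_versions(text: str) -> dict[str, str]:
    """Parse `dpkg-query -W` output: one 'package<TAB>version' per line."""
    versions: dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1]:
            versions[parts[0]] = parts[1]
    return versions


def downgrade_command(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Build an apt command that pins every changed package to its old version.

    Returns an empty list when nothing changed.
    """
    pins = [
        f"{pkg}={version}"
        for pkg, version in sorted(before.items())
        if after.get(pkg) != version
    ]
    if not pins:
        return []
    return ["apt", "install", "-y", "--allow-downgrades", *pins]
```

Even when this command succeeds, it restores package versions only, which is the limitation described above.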

## Consequences

### Positive

- True filesystem-level rollback that survives both soft and hard failures, including unbootable roots from bad kernel/module combinations.
- Snapshot layer is invisible to `apt`, DKMS, and customer modules; no distro change; root stays ext4.
- Honors the constraint ordering — a one-time volume-layer conversion instead of repartitioning.
- The rollback decision-maker (initramfs + `/boot` bootcount) is independent of the rootfs it protects.

### Negative / costs

- **One-time LVM conversion is a real, careful operation** touching initramfs and `cmdline.txt`; best done at provisioning, not live on an in-service device. This is the largest remaining footprint on the customer's installation.
- We must build and maintain a **custom initramfs `local-top` hook** and keep it correct across Raspberry Pi OS initramfs-tools changes.
- `lvconvert --merge` **discards data written during the probation window**; any must-persist state has to be placed outside the snapshot by design.
- Devices where LVM conversion is genuinely impossible fall back to the weaker package-state approach (degraded coverage).

### Risks to watch

- Initramfs regeneration during `apt full-upgrade` must reliably include the LVM bits and our hook for the new kernel; a regression here is itself a hard-failure source.
- Thin-pool capacity must be sized so a snapshot plus upgrade churn cannot exhaust the pool.
- Hardware-watchdog timeout and boot-attempt threshold need field tuning so total hangs are caught without prematurely rolling back a slow-but-healthy boot.
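
The thin-pool risk lends itself to a pre-upgrade guard. A minimal sketch, assuming the pool is named `vg0/pool` and its usage is read with `lvs --noheadings -o data_percent,metadata_percent vg0/pool`; the 70% ceiling and the function names are hypothetical placeholders that would need field tuning.

```python
def parse_pool_usage(lvs_output: str) -> tuple[float, float]:
    """Parse '  42.13  11.02' into (data_percent, metadata_percent)."""
    fields = lvs_output.split()
    if len(fields) != 2:
        raise ValueError(f"unexpected lvs output: {lvs_output!r}")
    return float(fields[0]), float(fields[1])


def pool_has_headroom(data_pct: float, meta_pct: float, max_pct: float = 70.0) -> bool:
    """Refuse to snapshot when either data or metadata usage is above max_pct.

    Assumption: 70% is an illustrative ceiling, not a measured requirement.
    """
    return data_pct <= max_pct and meta_pct <= max_pct
```

An agent could run this check before `lvcreate` and skip the upgrade, reporting to the control plane, when it fails.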

## Follow-up work

- Implement and review the initramfs `local-top` boot-attempt-counter hook (reads/increments/resets the counter on FAT `/boot`, conditionally fires `lvconvert --merge`).
- Implement the userspace systemd timer / agent thread for the soft-failure path (control-plane check-in gate, snapshot commit, clean rollback).
- Define and document the scripted, idempotent provisioning conversion (partition → PV/VG/LV, initramfs regen, `cmdline.txt` update, watchdog config).
- Validate the full matrix on canary hardware: clean upgrade, soft failure (no check-in), hard failure (unbootable kernel), and total hang.
- Specify which agent-local state, if any, must live outside the snapshot, and where.