harmony/ROADMAP/fleet_platform/v0_3_plan_recovered.md
Jean-Gabriel Gill-Couture e075872333 docs(roadmap): bring agent+system upgrade auto-rollback into scope; reconcile with ADR-022 & ADR-0042
- Agent upgrade (#4): ADR-022 is authoritative and already includes auto-rollback (symlink revert, old binary never GC'd, operator-coordinated). Supersede the simpler no-rollback cut. Flag the systemd-binary vs podman-container runtime contradiction between ADR-022 and ADR-0042.
- System upgrade (#7): rollback no longer deferred to v0.4 — full ADR-0042 (LVM thin snapshot + two-tier watchdog: initramfs bootcount hard-fail, control-plane check-in soft-fail) is in scope. Promote ADR out of drafts + renumber.
- Fix category error: deployment roll-forward-only does NOT apply to agent/system upgrades (different concern; both auto-rollback).
2026-06-03 05:49:45 -04:00


Fleet Platform v0.3 — last-mile plan

Authoritative plan for the last mile before the fleet ships to a real customer. Picks up where v0_2_plan.md left the chapter structure. Written 2026-05-24, after feat/iot-walking-skeleton (#264) merged and feat/smoke-test-contract landed the Phase 0 smoke companion.

The frame:

  • v0.1 proved the shape.
  • v0.2 locked the brick design.
  • v0.3 makes the brick safe to hand to a customer running production workloads on Pis in their basement.

State coming in

  • IoT walking skeleton merged. Operator + agent + NATS + Zitadel + auth callout running end-to-end against an OKD staging cluster.
  • Smoke-test contract Phase 0 merged (feat/smoke-test-contract).
    • Probe / SmokeSuite / SmokeTest companion + deploy_with_smoke in harmony-fleet-deploy/src/companion/smoke/.
    • One concrete probe today: TcpReachable.
    • No fleet Score wired to a real smoke test yet — Phase 1 is in this roadmap.
  • Agent runs as a systemd user unit on devices (see harmony/src/modules/fleet/setup_score.rs:263-283).
    • No on-device containerized agent path.
    • The Dockerfile in fleet/harmony-fleet-agent/Dockerfile is k8s-only today.
  • Dashboard has no role enforcement — security gap.
    • Maud/htmx frontend at fleet/harmony-fleet-operator/src/frontend/server.rs.
    • Verifies Zitadel JWT signature + expiry only.
    • JwksCache::verify (harmony_zitadel_auth/src/jwks.rs:74) extracts sub/exp/email/name/nonce — no roles.
    • VerifiedSession has no roles field.
    • Any logged-in Zitadel user gets full dashboard access. Fix immediately (Chapter 1).
  • NATS callout already has the role-extraction logic we need.
    • ZitadelValidator::extract_roles at nats/callout/src/zitadel.rs:203.
    • Handles both array shape (["fleet-admin"]) and Zitadel's object-map shape ({"fleet-admin": {org_id: org_name}}).
    • roles::resolve maps role names to ResolvedRole::Admin/::Device with admin-wins precedence (a token carrying both roles resolves to Admin).
    • Chapter 1 reuses the extractor, not the role-to-NATS-permission half.
  • System upgrade ADR drafted at docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md.
    • Header says Accepted 2026-05-24 but lives under drafts/.
    • Authoritative status: approach agreed; the rollback half, originally deferred, is back in v0.3 scope as of 2026-06-03 (Chapter 7).

Customer constraints baked into this plan

  • Deployments are roll-forward only. No auto-rollback when a new Deployment (customer app) version fails. Dashboard surfaces the failure; customer edits the spec and rolls forward. Customer ask; may change later.
  • Agent and system upgrades DO auto-rollback (updated 2026-06-03). This is a different concern from the deployment rule above and must not be conflated: a broken agent or OS upgrade has no customer "edit the spec and roll forward" path — an unreachable/bricked device needs to self-heal. ADR-022 (agent) and ADR-0042 (system) both already design the rollback; both are now in v0.3 scope. See Chapters 4 & 7.
  • Secrets need Zitadel + OpenBao. No plaintext-env-var shortcut. harmony_secret + OpenBao work is on the critical path for any Deployment that needs credentials.

Feature checklist

Status legend: ✅ shipped · 🟡 in flight · 🔴 not started · ⏸ deferred (target version in note).

| # | Feature | Status | Owner / branch | Notes |
|---|---|---|---|---|
| 1 | Dashboard role enforcement (fleet-admin required) | 🔴 | next branch | Reuse ZitadelValidator::extract_roles. Do this right now: security gap. |
| 2 | Operator restart / aggregator cold-rebuild | 🔴 | next branch | More critical than smoke wiring; ship before any customer. |
| 3 | Deployment getLogs companion + dashboard log view | 🟡 | feat/fleet-device-exec-logs | Basic cut shipped: one-shot podman logs tail (Refresh button) + remote exec + per-deployment selector on the device Logs tab. Companion-trait refactor + live streaming still owed (→ #13/v0.4). |
| 4 | Agent self-upgrade + auto-rollback (ADR-022) | 🔴 | new branch | ADR-022 is the accepted design and already includes auto-rollback (symlink swap, old binary never GC'd, auto-revert on smoke-fail/heartbeat-timeout). Supersedes the simpler no-rollback cut sketched in Ch4. Reconcile runtime model first (see Ch4 note). |
| 5 | Graceful deployment upgrade (roll-forward only) | 🔴 | new branch | SIGTERM → grace → SIGKILL fallback → start new. No rollback. |
| 6 | Init containers in PodmanV0Score | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
| 7 | System upgrade + auto-rollback (ADR-0042) | 🔴 | new branch | Now in scope WITH rollback (was deferred to v0.4): LVM thin snapshot + two-tier watchdog (initramfs bootcount = hard-fail, control-plane check-in = soft-fail). Promote ADR out of drafts/ + renumber to next in sequence. |
| 8 | Secrets via Zitadel + OpenBao for Deployments | 🔴 | blocked on machine-identity br. | Design locked (Chapter 8): agent fetches via harmony_config, scoped by a per-deployment Zitadel claim OpenBao reads. Heaviest item; depends on the machine-identity/SSO branch. |
| 9 | Agent time-drift verification | 🔴 | new branch | Periodic NTP check; refuse JWT operations if skewed. |
| 10 | Phase 1 smoke wiring (HTTP / K8sPodReady / NatsKv probes) | 🔴 | new branch | After required features land. Not a functional blocker. |
| 11 | CI yaml minimization (logic into harmony-ci scripts) | ⏸ v0.4 | longer-term | Yaml stays for discovery + parallel viz; scripts move. |
| 12 | NATS callout CI hardening | ✅ | low-churn crate | Already covered by workspace cargo test. Run ignored tests when CI has podman + NATS image. |
| 13 | Application log streaming through NATS | ⏸ v0.4 | follow-on to #3 | #3 is the synchronous getLogs; this is the live tail. |
| 14 | Device blacklist enforcement + un-blacklist | 🔴 | new branch | Today blacklist is a cosmetic label. Chapter 14: Zitadel deactivate + un-blacklist; no true instant NATS kill (see chapter). |
| 15 | Container auto-start after reboot (bug) | ✅ | feat/fleet-device-exec-logs | Agent used watch (DeliverPolicy::New) → never replayed desired-state on restart. Fixed: watch_with_history. |
| 16 | Containers auto-start while offline at boot | ⏸ v0.4 | follow-on to #15 | #15 covers the online case (agent re-reconciles). Offline-boot resilience would need user podman-restart.service + restart=always; defer. |

Sequencing

| Order | Item | Why |
|---|---|---|
| 1 | #1 Dashboard role enforcement | Security gap, do right now. |
| 2 | #2 Operator restart recovery | More critical than smoke wiring. Customer can't tolerate "operator restarted, state unknown." |
| 3 | #3 Log forwarding companion | Turns the dashboard from a toy into a thing customers actually use. |
| 4 | #4 Agent self-upgrade | Parallel-safe with #2/#3: different code paths. |
| 5 | #5 + #6 Graceful upgrade + init containers | Paired Deployment-layer features; ship together. |
| 6 | #9 Time-drift verification | Small, isolated; slot between heavier items. |
| 7 | #7 System upgrade | Builds on agent-upgrade pattern from #4, so #4 lands first. |
| 8 | #10 Phase 1 smoke wiring | After required features so probes verify real customer-facing surfaces. |
| 9 | #8 Secrets | Blocks any customer Deployment that needs credentials. Promote if first customer needs them. |
| 10 | #11 / #12 CI | Opportunistic, doesn't block customer. |

Chapter 1 — Dashboard role enforcement (#1)

Goal: every dashboard page requires a valid Zitadel session and a fleet-admin role on the token.

  • Users without the role get a 403 with a clear message.
  • Users without a session get the existing login redirect.

Current state

  • JWKS verify only extracts identity claims. JwksCache::verify (harmony_zitadel_auth/src/jwks.rs:74) parses the JWT and returns a VerifiedSession with sub/exp/email/name/nonce. Roles not extracted.
  • VerifiedSession has no roles field (harmony_zitadel_auth/src/session.rs:5).
  • Middleware checks JWT validity only: require_auth (fleet/harmony-fleet-operator/src/frontend/server.rs:136-157) admits every authenticated user to every page.
  • Role extraction logic already exists and is correct in the callout: ZitadelValidator::extract_roles at nats/callout/src/zitadel.rs:203. Handles both shapes:
    • array — ["fleet-admin"]
    • object-map — {"fleet-admin": {org_id: org_name}}

Plan

  1. Extract a shared role-extraction helper into harmony_zitadel_auth so dashboard and callout import from one place. Callout keeps its API but its body delegates.
  2. Extend VerifiedSession with roles: Vec<String>.
  3. Extend the JWKS Claims decode struct to capture the configured roles claim. Pull the claim name from existing callout config so the two systems agree (Zitadel ships urn:zitadel:iam:org:project:roles or similar).
  4. Add require_role(role: &'static str) middleware to the dashboard. Compose with require_auth. Use on every Router::route(..., post|get(...).layer(...)).
  5. 403 response renders a maud page — "fleet-admin role required; ask your administrator." Not a JSON error; dashboard is human-facing.
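
A minimal sketch of the shared helper from step 1, assuming the roles claim has already been decoded from the JWT. RolesClaim and has_role are hypothetical names used only for illustration; the real helper would match on whatever JSON value the Claims struct captures, and its final signature is not settled here.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the decoded roles claim.
#[derive(Debug, Clone, PartialEq)]
pub enum RolesClaim {
    /// Array shape: ["fleet-admin"]
    Array(Vec<String>),
    /// Object-map shape: {"fleet-admin": {org_id: org_name}}
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
    /// The token carries no roles claim at all.
    Missing,
}

/// Collapses either claim shape into a sorted, de-duplicated role list.
pub fn extract_roles(claim: &RolesClaim) -> Vec<String> {
    let mut roles: Vec<String> = match claim {
        RolesClaim::Array(names) => names.clone(),
        // Only the keys are role names; the org map is ignored here.
        RolesClaim::ObjectMap(by_role) => by_role.keys().cloned().collect(),
        RolesClaim::Missing => Vec::new(),
    };
    roles.sort();
    roles.dedup();
    roles
}

/// Exact-match check, so "device" never satisfies "fleet-admin".
pub fn has_role(roles: &[String], wanted: &str) -> bool {
    roles.iter().any(|role| role == wanted)
}
```

Both shapes resolve to the same Vec<String>, which is what lets VerifiedSession carry a plain roles field regardless of how Zitadel encoded the claim.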

Tests

Security code — heavy unit tests are non-negotiable.

  • Array-shape claim → fleet-admin in session. JWT with array-shape role claim.
  • Object-map shape → identical resolution. Same role, Zitadel's other claim shape.
  • No role claim → empty roles. Token with no roles claim.
  • Wrong role doesn't elevate. JWT with only device role does NOT carry fleet-admin.
  • No session → 401/redirect.
  • Session but no fleet-admin → 403.
  • Session + fleet-admin → 200.
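
The three HTTP-level cases above reduce to one small decision. A sketch, assuming a Session type that stands in for VerifiedSession once it has the roles field; GateOutcome and gate are hypothetical names, and the real middleware would map these outcomes onto responses.

```rust
/// What the dashboard does with a request. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum GateOutcome {
    /// No valid session: existing login redirect.
    RedirectToLogin,
    /// Valid session without the role: 403 page.
    Forbidden,
    /// Valid session with the role: handler runs.
    Allow,
}

/// Stand-in for VerifiedSession with the planned roles field.
pub struct Session {
    pub roles: Vec<String>,
}

/// Decides the outcome for one request.
pub fn gate(session: Option<&Session>, required_role: &str) -> GateOutcome {
    match session {
        None => GateOutcome::RedirectToLogin,
        Some(s) if s.roles.iter().any(|role| role == required_role) => GateOutcome::Allow,
        Some(_) => GateOutcome::Forbidden,
    }
}
```

Keeping the decision a pure function makes the three cases unit-testable without spinning up the router.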

Done when

  • Branch merged.
  • All dashboard handlers gated by require_role("fleet-admin").
  • Every test green.
  • Manual smoke against staging Zitadel: user without role sees 403.

Follow-ups (post-demo — shipped a working-but-imperfect cut)

Gate works in staging, but on a temporary footing. Clean these up after the demo:

  1. Get roles into the id_token via scope, not the Zitadel app checkbox. Today it works only because the app has "User Roles Inside ID Token" toggled on — out-of-band IdP config, invisible to our code, easy to miss on a new env (cost us a debug cycle: roles=[] despite the role being granted). The OIDC-idiomatic fix is to request urn:zitadel:iam:org:project:roles in ZitadelAuthConfig.scope ("when requested" per the Zitadel claims matrix), then turn the checkbox back off. Keeps our stateless id_token-as-session design; the dependency travels with the deploy. (UserInfo-endpoint / access-token authz is the heavier vendor-agnostic alternative — not worth it for a first-party UI.)
  2. Fix the SSO doc. docs/guides/operator-dashboard-sso.md step 1b wrongly says enable "Assert Roles on Authentication" — that's the userinfo setting and does not put roles in the id_token. Replace with the scope-request (and/or "User Roles Inside ID Token").
  3. Unify the two role extractors (DRY debt — diverged from Plan #1 above). We now have harmony_zitadel_auth::extract_zitadel_roles and nats/callout ZitadelValidator::extract_roles doing the same job. Worse, the dashboard one only handles the object-map claim shape; the callout one also handles the array shape. Extract one shared helper (handling both shapes + both aggregated/project-scoped claim names) and have both import it, as Chapter 1 Plan originally intended.
  4. require_role is inlined, not composable. The gate lives inside require_auth (one trust boundary, fine for one role). If a second role/permission ever appears, lift it to a composable require_role(..) layer as Plan #4 intended — not before (YAGNI).

Chapter 2 — Operator restart + aggregator recovery (#2)

Goal: the operator pod can be killed, upgraded, or rescheduled at any time and the system converges back to correct state from NATS KV alone. No "unknown state" window visible to customers.

Current state

  • Aggregator rebuilds from scratch on startup. fleet_aggregator.rs (833 LOC, in harmony-fleet-operator/src/) watches the KV buckets to materialize state. JG confirmed: "rebuilt from scratch, yes."
  • Failure modes not exercised yet:
    • Partial KV — device offline during operator reset, never re-published its info.
    • Two operator pods racing during a rolling deploy of the operator.
    • NATS stream loss between operator restart and rebuild completing.
    • Stale KV — Deployment CR deleted in kube while operator was down.

Plan

Scenario-driven. Enumerate failure shapes, then handle one at a time. Discipline: each scenario gets a regression test in harmony-fleet-e2e, then the fix.

  1. Scenario inventory pass. Write docs/fleet-operator-recovery-scenarios.md listing every failure shape we can think of. Cross-reference smoke-a* tests to identify what's already covered.
  2. Cold-start rebuild as the baseline. Confirm + test that kubectl delete pod of the operator and waiting for the replacement converges to pre-kill aggregate in < 30s. Gate on convergence time at N device count.
  3. Stale-KV reconciliation. Define the rule for "KV says device D has Deployment X, but Deployment X no longer exists in kube." Operator cleans up; agents observe the deletion.
  4. Leader election decision. Ship with leader election (one writer at a time) or design for idempotent multi-writer? Score-Topology-Interpret leans idempotent; confirm + assert operator writes are byte-deterministic.
  5. Liveness signaling for the dashboard. Surface "operator converged" / "operator recovering" as states the frontend renders. Customer sees a loading banner, not a blank dashboard, during rebuild.
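
The stale-KV rule in step 3 is a set difference. A sketch under stated assumptions: Assignment is a hypothetical flattening of "KV says device D has Deployment X", and the set of live Deployment names is assumed to come from a kube list call that is not shown.

```rust
use std::collections::BTreeSet;

/// One KV record: device D is assigned Deployment X. Hypothetical shape.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Assignment {
    pub device_id: String,
    pub deployment: String,
}

/// Returns the KV assignments whose Deployment no longer exists in kube.
/// The result is sorted so repeated runs produce identical cleanup order.
pub fn stale_assignments(
    kv: &[Assignment],
    live_deployments: &BTreeSet<String>,
) -> Vec<Assignment> {
    let mut stale: Vec<Assignment> = kv
        .iter()
        .filter(|a| !live_deployments.contains(&a.deployment))
        .cloned()
        .collect();
    stale.sort();
    stale.dedup();
    stale
}
```

Sorting the output is a small nod to step 4: if cleanup writes are derived from a deterministic list, two operator pods computing the same input produce the same writes.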

Open questions

  • Warm-restart snapshot? Keep a per-operator-pod "last known aggregate" snapshot in a KV bucket so warm restarts skip cold rebuild? Probably yes for >1000-device fleets; adds an invalidation problem.
  • One pod or active/passive? Customer's fleet size answers this. Ask before starting.

Done when

  • Scenario inventory exists.
  • Each scenario has a regression test, all green.
  • Manual chaos: kill operator pod during high write load → convergence + dashboard liveness banner observed.

Chapter 3 — Application log forwarding companion (#3)

Goal: when a customer's Deployment is misbehaving on a Pi in the field, the dashboard shows last-N-lines of container logs without anyone SSH-ing the device.

Design

  • Logs attach as a Score companion — same pattern as the smoke-test contract.
  • The companion is optional — Scores without one render "this deployment doesn't expose logs". Acceptable.
  • Sync getLogs ships in v0.3 as the minimum useful UX; live tail (streaming) waits for v0.4.

Shape:

```rust
// new in harmony-fleet-deploy/src/companion/logs/
pub trait LogQuery<T: Topology>: Send + Sync {
    type Score: Score<T>;
    async fn last_lines(
        &self,
        score: &Self::Score,
        topology: &T,
        n: usize,
    ) -> Result<LogChunk, LogQueryError>;
}

pub struct LogChunk {
    pub source: ProbeName, // reuse the validated newtype
    pub captured_at: chrono::DateTime<chrono::Utc>,
    pub lines: Vec<String>,
    pub truncated: bool,
}
```

For PodmanV0Score:

  • Transport: NATS request/reply. Subject device-commands.<device_id>.logs.<deployment>.
  • Agent side: runs podman logs --tail N <container> and replies with a LogChunk.
  • Dashboard side: one async call from the logs handler.
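
A sketch of the two pure pieces on either side of that transport: building the request subject and trimming to the last N lines. logs_subject, Tail, and tail are hypothetical names; the token check relies on NATS treating '.' as the token separator and '*' / '>' as wildcards. Reading truncated as "older lines existed and were cut" is an assumption about LogChunk, not something this plan defines.

```rust
/// Builds `device-commands.<device_id>.logs.<deployment>`.
/// Rejects tokens that would change the shape of the subject.
pub fn logs_subject(device_id: &str, deployment: &str) -> Result<String, String> {
    for token in [device_id, deployment].iter() {
        let bad = token.is_empty()
            || token
                .chars()
                .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace());
        if bad {
            return Err(format!("invalid subject token: {:?}", token));
        }
    }
    Ok(format!("device-commands.{}.logs.{}", device_id, deployment))
}

/// Simplified stand-in for LogChunk: just the lines and the truncation flag.
pub struct Tail {
    pub lines: Vec<String>,
    pub truncated: bool,
}

/// Keeps the last `n` lines; `truncated` is true when older lines were dropped.
pub fn tail(all: &[String], n: usize) -> Tail {
    let start = all.len().saturating_sub(n);
    Tail {
        lines: all[start..].to_vec(),
        truncated: start > 0,
    }
}
```

Validating tokens before formatting keeps a deployment named like `a.b` from silently addressing a different subject.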

Plan

  1. Define LogQuery companion trait in a new harmony-fleet-deploy/src/companion/logs/ module.
  2. PodmanLogQuery implementing LogQuery<…> for PodmanV0Score.
  3. Agent-side command handler — extend the existing request/reply command dispatcher.
  4. Dashboard handler at /deployments/<name>/devices/<id>/logs?lines=N returning rendered maud.
  5. Tests: unit on PodmanLogQuery; integration in harmony-fleet-e2e driving end-to-end.

Done when

  • Customer clicks "View logs" on the dashboard.
  • Sees the last 200 lines.
  • Call returns in < 2s on a 3-device fleet.

Chapter 4 — Agent self-upgrade + auto-rollback (#4)

Reconciliation (2026-06-03). ADR-022 (docs/adr/022-fleet-agent-upgrade.md, Accepted-design) is authoritative — build that, not the simpler cut below. ADR-022 already delivers the auto-rollback JG wants: versioned binaries (/usr/bin/fleet-agent-v<ver>, never GC'd) + atomic symlink swap, a Verifying step (--self-test) before cutover, and auto-revert when the staged binary fails smoke or the new agent misses its heartbeat window — the old version is one ln -sfn away and the operator (not the agent) owns the stop signal. The "no auto-rollback in v0.3" line in the old draft below was a category error: it borrowed the deployment roll-forward-only rule (a customer ask about app versions), which does not apply to agent upgrades. Rollback here is wanted and already designed.

Two divergences to resolve before implementing:

  1. Runtime model. ADR-022 + this chapter assume a systemd binary (symlink swap). ADR-0042 (system upgrade) states the agent "runs as a privileged Podman container that autostarts on boot." These contradict. Pick one and fix the loser. (Recommend: stay systemd-binary for the agent — ADR-022's symlink/verify/revert is simpler and the container path isn't built. Update ADR-0042's premise.)
  2. Protocol. ADR-022's state machine (Running→Draining→Staging→Verifying→Cutover-Ready→Stopping) supersedes the marker-phase sketch below. Keep the "marker in NATS, no-NATS-no-upgrade" idea; drop the systemctl restart self-swap in favor of ADR-022's parallel-service + operator-stop handoff.
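
The happy path of that state machine, as a sketch. Only the six state names come from this document; UpgradeState, next, and happy_path are hypothetical names, and the failure and revert edges are deliberately not modelled because this plan does not spell them out (ADR-022 is the source for those).

```rust
/// The six states named in the ADR-022 state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpgradeState {
    Running,
    Draining,
    Staging,
    Verifying,
    CutoverReady,
    Stopping,
}

impl UpgradeState {
    /// Next state on the happy path; None once the old agent is Stopping.
    pub fn next(self) -> Option<UpgradeState> {
        match self {
            UpgradeState::Running => Some(UpgradeState::Draining),
            UpgradeState::Draining => Some(UpgradeState::Staging),
            UpgradeState::Staging => Some(UpgradeState::Verifying),
            UpgradeState::Verifying => Some(UpgradeState::CutoverReady),
            UpgradeState::CutoverReady => Some(UpgradeState::Stopping),
            UpgradeState::Stopping => None,
        }
    }
}

/// Walks the happy path from Running to Stopping.
pub fn happy_path() -> Vec<UpgradeState> {
    let mut path = vec![UpgradeState::Running];
    let mut current = UpgradeState::Running;
    while let Some(next) = current.next() {
        path.push(next);
        current = next;
    }
    path
}
```

An exhaustive match means adding a state later forces every transition site to be revisited at compile time.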

Goal: the agent can upgrade itself in place, auto-reverting to the last known-good version on failure. If NATS is unavailable, the upgrade does not start. The operator sees every step.

Design (per JG's direction)

  • Stay on systemd for v0.3. Switching the agent runtime to podman is its own risk; defer until self-upgrade protocol matures.
  • Upgrade marker lives in NATS, not on disk. New KV bucket agent-upgrade keyed by device_id, carrying start_timestamp, invoker_version, target_version, phase.
  • No NATS → no upgrade. Feature, not limitation: operator can't observe an upgrade it can't see, so refusing without NATS prevents silent half-upgrades.

Protocol

  1. Operator writes Requested. agent-upgrade/<device_id> with phase: Requested, target_version: vX.
  2. Old agent observes + writes Suspending. Verifies NATS liveness with a round-trip first.
  3. Old agent suspends + writes Suspended. Reconcile loop paused; heartbeat continues so the operator knows it's alive.
  4. Old agent fetches new binary + writes Fetched. Mechanism TBD (see open questions). target_path: /usr/local/bin/fleet-agent.new.
  5. Old agent launches new binary as a separate process + writes NewLaunched. Not via systemd unit update yet. Includes new_pid: N.
  6. New agent self-checks + writes NewHealthy. Connects to NATS, verifies permissions, one-shot smoke (KV read, command channel echo).
  7. Old agent writes HandingOff and exits. Tells systemd to swap the binary: systemctl daemon-reload + systemctl restart fleet-agent.service with the new binary in place.
  8. Systemd starts the unit pointing at the new binary. Final state phase: Complete, completed_at: T.

On stall (configurable, default 5 min):

  • Marker writes phase: Failed with last successful step.
  • Operator surfaces this on the dashboard.
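  • Customer / operator intervenes manually: no auto-rollback in v0.3, consistent with the deployment roll-forward-only rule. (Superseded by the 2026-06-03 reconciliation at the top of this chapter: ADR-022's auto-revert applies.)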
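
The stall rule can be sketched as a pure check over the marker. This models the marker-phase sketch above (which ADR-022's state machine supersedes), so treat it as illustration only. Verdict and judge are hypothetical names, and measuring the stall from the last phase write rather than from start_timestamp is an assumption; the plan only says "configurable, default 5 min".

```rust
use std::time::Duration;

/// Default stall timeout from the plan: 5 minutes.
pub const DEFAULT_STALL_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// Marker phases from the protocol sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Requested,
    Suspending,
    Suspended,
    Fetched,
    NewLaunched,
    NewHealthy,
    HandingOff,
    Complete,
}

/// What an observer concludes from the marker. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    InProgress,
    Complete,
    /// Stalled: records the last phase that was reached.
    Failed { last_successful: Phase },
}

/// `since_last_write` is the time elapsed since the marker last changed phase.
pub fn judge(phase: Phase, since_last_write: Duration, timeout: Duration) -> Verdict {
    if phase == Phase::Complete {
        Verdict::Complete
    } else if since_last_write > timeout {
        Verdict::Failed { last_successful: phase }
    } else {
        Verdict::InProgress
    }
}
```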

Open questions

  • Q1.1 Binary distribution. Gitea release asset? Signed OCI artifact? Existing arm-agents.yaml uploads aarch64 binaries to releases — start with that.
  • Q1.2 Verification. Hash signature? GPG? Minimum: SHA-256 pinned in the upgrade-request payload.
  • Q1.3 Atomic systemd swap. systemctl restart is not atomic across binary-on-disk and process. Acceptable? Or systemd-run --transient shim?
  • Q1.4 Cross-arch. Fetch URL has to know the device's arch. KV device-info already carries this; confirm the agent reads its own arch correctly.

Done when

  • Branch contains the protocol implementation + e2e test driving v0.3.0 → v0.3.1 upgrade against a libvirt VM.
  • Operator sees every phase.
  • Failure path tested: deliberately corrupt the new binary → marker reads Failed, old agent stays running.

Chapter 5 — Graceful deployment upgrade, roll-forward only (#5)

Goal: upgrading a Deployment's image/config replaces the old container without dropping traffic mid-request. If the new container won't start, the customer sees the failure clearly and fixes the spec.

Design

Extend PodmanV0Score with a lifecycle block:

```rust
pub struct PodmanV0Score {
    // ... existing fields ...
    pub lifecycle: Option<LifecyclePolicy>,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,       // SIGTERM (default), SIGINT, SIGUSR1
    pub grace_period: Duration,        // default 30s
    pub sigkill_fallback: bool,        // default true
}
```

Agent's reconcile when image/config changes:

  1. Write Upgrading phase. New DeploymentState::Phase::Upgrading variant. Dashboard shows the in-progress upgrade.
  2. Send stop_signal to the old container.
  3. Wait up to grace_period for clean exit.
  4. SIGKILL fallback if still running and sigkill_fallback.
  5. Start new container.
  6. On startup failure: write Failed and stop. Image pull error, exec error, crash within 5s. No revert to old image.
  7. On success: write Running. Optionally gated by a Phase-1 smoke test (Chapter 10) when that lands.
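
Steps 2 to 4 boil down to one decision once the container's exit behavior is known. A sketch that mirrors the LifecyclePolicy defaults from the Design block; StopOutcome and stop_outcome are hypothetical names, and what the agent does in the StillRunning case (fallback disabled, container ignores the signal) is left open because the plan does not say.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopSignal {
    Sigterm,
    Sigint,
    Sigusr1,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,
    pub grace_period: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    /// Defaults from the plan: SIGTERM, 30s grace, SIGKILL fallback on.
    fn default() -> Self {
        LifecyclePolicy {
            stop_signal: StopSignal::Sigterm,
            grace_period: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

/// How the old container ended up. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum StopOutcome {
    ExitedCleanly,
    Killed,
    StillRunning,
}

/// `exits_after`: how long the container takes to exit after the stop
/// signal, or None if it never exits on its own.
pub fn stop_outcome(policy: &LifecyclePolicy, exits_after: Option<Duration>) -> StopOutcome {
    match exits_after {
        Some(t) if t <= policy.grace_period => StopOutcome::ExitedCleanly,
        _ if policy.sigkill_fallback => StopOutcome::Killed,
        _ => StopOutcome::StillRunning,
    }
}
```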

Explicit non-goals

  • No auto-rollback. Customer-asked constraint. Step 6 firing → dashboard shows "Deployment failed; previous version stopped" and the customer edits the spec.
  • No "stale + new" window. Single container per Deployment per device; short downtime during cutover is accepted.

Done when

  • Upgrade test in harmony-fleet-e2e walks v1 → v2 → v3 image swap with controlled failures.
  • Dashboard reflects every step.

Chapter 6 — Init containers (#6)

Goal: customer can declare init containers that run to completion before the main container starts. Mirror Kubernetes semantics so customer mental model transfers.

Design

Extend PodmanV0Score with init_containers: Vec<InitContainer>:

  • Ordered — declaration order = run order.
  • Run-to-completion — each one must exit zero before the next starts.
  • Fail-the-Deployment on init failure — non-zero exit or timeout exceeded.

```rust
pub struct InitContainer {
    pub name: String,
    pub image: String,
    pub args: Vec<String>,
    pub env: Vec<EnvVar>,
    pub volumes: Vec<VolumeMount>,
    pub timeout: Duration, // default 5 min
}
```
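
The three rules (ordered, run-to-completion, fail on first error) can be sketched with the container runtime abstracted behind a closure. InitContainer is trimmed to the two fields the loop needs; InitResult and run_init_containers are hypothetical names, and the closure stands in for whatever actually launches the container and waits on it.

```rust
use std::time::Duration;

/// Trimmed-down InitContainer: only what the ordering loop needs.
pub struct InitContainer {
    pub name: String,
    pub timeout: Duration,
}

/// How one init container finished. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum InitResult {
    Exited(i32),
    TimedOut,
}

/// Runs init containers in declaration order and stops at the first failure.
/// Returns Ok(()) when all exited zero, else a reason naming the culprit.
pub fn run_init_containers<F>(inits: &[InitContainer], mut run: F) -> Result<(), String>
where
    F: FnMut(&InitContainer) -> InitResult,
{
    for init in inits {
        match run(init) {
            InitResult::Exited(0) => continue,
            InitResult::Exited(code) => {
                return Err(format!(
                    "init container {} exited with code {}",
                    init.name, code
                ));
            }
            InitResult::TimedOut => {
                return Err(format!(
                    "init container {} exceeded its timeout of {}s",
                    init.name,
                    init.timeout.as_secs()
                ));
            }
        }
    }
    Ok(())
}
```

Because later containers never start after a failure, a retry re-runs the earlier ones too, which is exactly why the idempotency contract matters.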

Customer contract (document loudly)

Init containers must be idempotent. They run on every reconcile that requires a fresh main container — power-cycle recovery, graceful upgrade, etc.

  • Customer-side migration scripts that aren't idempotent will misbehave.
  • Document with examples.
  • Add a Score-builder lint that warns on common non-idempotent patterns (e.g. INSERT without ON CONFLICT).

Done when

  • harmony-fleet-e2e deploys a Deployment with one init container (mkdir -p /data && touch /data/initialized) followed by a main container that asserts the file exists.
  • Two-step ordering sequence tested.

Chapter 7 — System upgrade + auto-rollback (#7)

Reconciliation (2026-06-03). Rollback is now in scope (JG: agent and system upgrade both auto-rollback). The previous "rollback deferred to v0.4" stance is dropped. ADR-0042 (docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md, Accepted) is authoritative and the rollback is its core. Housekeeping: promote the ADR out of drafts/ and renumber it into the real sequence (it's filed as "0042"; next free is 025), and fix its premise that the agent is a Podman container (see Ch4 divergence #1).

Goal: the device can apt full-upgrade its base OS without bricking — and a device that fails to return to a healthy, control-plane-connected state rolls back automatically, no truck roll. Covers both failure modes per the ADR: soft (boots, agent runs, can't reach control plane → userspace timer merges the snapshot) and hard (root won't boot at all → initramfs bootcount hook merges the snapshot).

Scope (the full ADR, including the rollback half)

  • One-time provisioning conversion (partition → PV/VG/LV preserving ext4, initramfs regen with LVM + hook, cmdline.txt root=/dev/mapper/vg0-root, BCM2835 watchdog). Scripted + idempotent; run at provisioning, not live.
  • Per-upgrade flow: set upgrade-pending, lvcreate thin snapshot vg0/root_preupgrade, write bootcount=0/expected-good=false to /boot, apt full-upgrade, reboot.
  • Initramfs local-top boot-attempt hook (hard-fail rollback): increment bootcount on FAT /boot; bootcount > N (N=2-3) → lvconvert --merge vg0/root_preupgrade + reboot. This is the piece that survives an unbootable kernel, so it is mandatory for customers running out-of-tree modules.
  • Userspace check-in timer (soft-fail rollback): new agent must achieve a successful control-plane check-in within the soft timeout (10 min); success → reset bootcount, lvremove snapshot, clear upgrade-pending; timeout → lvconvert --merge + reboot.
  • Hardware watchdog catches total hangs → reset → initramfs bootcount path.
  • Canary matrix: clean upgrade, soft-fail (no check-in), hard-fail (unbootable kernel), total hang.
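
The two rollback tiers are each a small decision. The sketch below models only the decisions, in Rust for consistency with the rest of this plan; the real hard-fail hook would run inside the initramfs, and BootAction, ProbationAction, initramfs_check, and userspace_check are hypothetical names. Gating the bootcount increment on upgrade-pending is an assumption, and the attempt threshold is passed in rather than fixed.

```rust
use std::time::Duration;

/// Soft timeout from the plan: 10 minutes.
pub const SOFT_TIMEOUT: Duration = Duration::from_secs(10 * 60);

#[derive(Debug, PartialEq, Eq)]
pub enum BootAction {
    BootNormally,
    /// lvconvert --merge vg0/root_preupgrade, then reboot.
    MergeSnapshotAndReboot,
}

/// Hard-fail tier: increment the counter, then compare against the limit.
/// Returns the counter value to persist and the action to take.
pub fn initramfs_check(upgrade_pending: bool, bootcount: u32, max_attempts: u32) -> (u32, BootAction) {
    if !upgrade_pending {
        return (bootcount, BootAction::BootNormally);
    }
    let attempts = bootcount.saturating_add(1);
    if attempts > max_attempts {
        (attempts, BootAction::MergeSnapshotAndReboot)
    } else {
        (attempts, BootAction::BootNormally)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProbationAction {
    /// Reset bootcount, lvremove the snapshot, clear upgrade-pending.
    Commit,
    KeepWaiting,
    /// lvconvert --merge, then reboot.
    MergeSnapshotAndReboot,
}

/// Soft-fail tier: a successful control-plane check-in commits the upgrade;
/// running past the timeout without one rolls back.
pub fn userspace_check(checked_in: bool, elapsed: Duration, soft_timeout: Duration) -> ProbationAction {
    if checked_in {
        ProbationAction::Commit
    } else if elapsed > soft_timeout {
        ProbationAction::MergeSnapshotAndReboot
    } else {
        ProbationAction::KeepWaiting
    }
}
```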

Hard constraints carried from the ADR

  • lvconvert --merge discards everything written during the probation window — any must-persist agent state lives outside the snapshot (separate LV or control-plane DB). Specify exactly what, and where.
  • Thin-pool sizing must guarantee snapshot + upgrade churn can't exhaust the pool.

Done when

  • Canary Pi successfully upgrades from a known-good base image to a later one.
  • Snapshot exists after the upgrade reboot and is removed once the control-plane check-in succeeds.
  • No customer-visible regression.
  • Per "Full Verification Before Done" rule: green on both aarch64 and x86_64 device classes.

Chapter 8 — Secrets via Zitadel + OpenBao (#8)

Goal: a Deployment can reference a secret by name and the device's container receives the value at apply time — without the secret ever sitting in NATS KV, and scoped so a device can read only the secrets for the deployments it actually runs.

Decision (2026-06-03) — agent-fetch, identity-scoped

The agent fetches secrets directly from OpenBao via harmony_config. The score carries a reference (valueFrom), never a literal. Scoping rides on Zitadel machine identity: when a device gains a deployment it gets a custom Zitadel claim for that deployment; OpenBao reads the claim and grants access to that deployment's secrets only. New deployment ⇒ the device's token/permissions must be renewed before the fetch succeeds (not ideal, but the best design we landed on). The admin updates the secret in OpenBao; agents either refresh periodically or the admin restarts the related deployment to pull the new value.

Depends on the machine-identity / Zitadel-SSO-with-automatic-permissions branch (in flight elsewhere) — Chapter 8 is blocked on it landing.

Rejected alternative (simpler security-wise, worse operationally): write the secret to NATS encrypted with the device pubkey. It's a caching layer — needs the whole write/refresh/restart machinery anyway and will cause sync issues eventually. Not chosen.

Shape (subject to the identity branch's API)

  • EnvVar gains a valueFrom variant — a reference { secret: <name>, key: <field> }, resolved against the deployment-scoped OpenBao path. Inline value literals stay supported.
  • Agent-side harmony_config client built from the device's Zitadel identity (the keyfile it already holds), fetching only its permitted paths.
  • Resolution at apply time, in the reconciler before ensure_service_running — a fetch failure fails the deployment with a clear Phase::Failed reason ("secret X not readable: permission / not renewed"), never a silent empty env.
  • Refresh: periodic re-fetch on the 30s tick (re-resolve refs; restart container only if a value changed), plus admin-triggered deployment restart for immediacy.
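
A sketch of apply-time resolution with the OpenBao fetch abstracted behind a closure. EnvValue and resolve_env are hypothetical names, and the closure stands in for the harmony_config client, whose API depends on the identity branch. The point illustrated is the failure rule: one unreadable secret fails the whole resolution instead of producing an empty variable.

```rust
/// Either an inline literal or a reference into the secret store.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EnvValue {
    Literal(String),
    /// The valueFrom variant: { secret: <name>, key: <field> }.
    SecretRef { secret: String, key: String },
}

pub struct EnvVar {
    pub name: String,
    pub value: EnvValue,
}

/// Resolves every variable to a concrete (name, value) pair.
/// `fetch(secret, key)` is a stand-in for the real secret client.
pub fn resolve_env<F>(vars: &[EnvVar], mut fetch: F) -> Result<Vec<(String, String)>, String>
where
    F: FnMut(&str, &str) -> Result<String, String>,
{
    let mut resolved = Vec::with_capacity(vars.len());
    for var in vars {
        let value = match &var.value {
            EnvValue::Literal(v) => v.clone(),
            EnvValue::SecretRef { secret, key } => fetch(secret.as_str(), key.as_str())
                .map_err(|reason| format!("secret {} not readable: {}", secret, reason))?,
        };
        resolved.push((var.name.clone(), value));
    }
    Ok(resolved)
}
```

The Err string is what would feed the Phase::Failed reason; the periodic refresh can call the same function and compare the result against the previous one to decide whether a restart is needed.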

Non-goals (scope discipline)

  • No general templating / file-mount secrets / CSI-driver story — one ref shape, env only.
  • No secret material in NATS KV (that's the rejected design).
  • No rotation automation beyond "periodic refresh or restart."

Customer-facing until this lands

"Your first Deployments should use inline environment variables only; credential injection arrives with the secrets chapter."


Chapter 9 — Agent time-drift verification (#9)

Goal: agent refuses to operate (or warns loudly) when its clock is skewed enough to break JWT validation.

Design

  • Startup NTP-style query against a configurable server list (default: time.cloudflare.com, pool.ntp.org).
  • Refuse to start on |drift| > 30s. Typical JWT skew tolerance — past it, every NATS callout request fails with a cryptic exp invalid.
  • Periodic re-check every 6 hours. Mid-run drift past threshold → agent publishes a DeviceInfo health flag, dashboard surfaces it.
  • Specific customer-facing error message: "system clock skew is 14m32s; JWT validation will fail. Enable systemd-timesyncd or chrony."
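
The threshold check and the error message can be sketched without the NTP query itself. check_clock, drift, and format_skew are hypothetical names; the reference time is assumed to come from whichever configured server answered, and timestamps are plain seconds since the Unix epoch.

```rust
use std::time::Duration;

/// Threshold from the plan: refuse to start past 30 seconds of drift.
pub const MAX_DRIFT: Duration = Duration::from_secs(30);

/// Absolute difference between local and reference clocks.
pub fn drift(local_unix: u64, reference_unix: u64) -> Duration {
    let secs = if local_unix > reference_unix {
        local_unix - reference_unix
    } else {
        reference_unix - local_unix
    };
    Duration::from_secs(secs)
}

/// Renders a skew as e.g. "14m32s", or "45s" when under a minute.
pub fn format_skew(skew: Duration) -> String {
    let secs = skew.as_secs();
    if secs < 60 {
        format!("{}s", secs)
    } else {
        format!("{}m{}s", secs / 60, secs % 60)
    }
}

/// Ok when the clock is usable for JWT validation, else the customer-facing error.
pub fn check_clock(local_unix: u64, reference_unix: u64, max: Duration) -> Result<(), String> {
    let skew = drift(local_unix, reference_unix);
    if skew > max {
        Err(format!(
            "system clock skew is {}; JWT validation will fail. Enable systemd-timesyncd or chrony.",
            format_skew(skew)
        ))
    } else {
        Ok(())
    }
}
```

The same check_clock call serves both the startup gate and the 6-hour re-check; only the reaction differs (refuse to start versus publish a health flag).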

Done when

  • Test in harmony-fleet-e2e runs against a libvirt VM with clock forced 5 minutes off.
  • Agent refuses to start with the expected error message.
  • Recovery: fix the clock → agent comes up clean.

Chapter 10 — Phase 1 smoke wiring (#10)

Goal: real fleet Scores carry real smoke tests. The Phase 0 contract becomes load-bearing.

Scope

  • HttpHealthy probe — GET a URL, expect 2xx, optional response-body-contains assertion.
  • K8sPodReady probe — kube client lookup for pod readiness condition.
  • NatsKvKeyExists probe — KV bucket + key, optional value-deserializes-to-T assertion.
  • FleetOperatorSmokeTest — pairs with FleetOperatorScore. Operator pod ready + /healthz returns 200 + can write to device-info KV.
  • FleetAgentSmokeTest — pairs with FleetAgentScore. Agent pod ready + heartbeat published to KV within 30s.
  • HarmonyEvent::SmokeStage{Started,Finished,Skipped} variants (additive) so the dashboard can render the live pipeline.
  • Dashboard pipeline view — maud renderer subscribing to instrumentation events.

Sequencing within this chapter (strict order)

  1. HarmonyEvent variants — one-line additive change to harmony/src/domain/instrumentation.rs.
  2. Probes one at a time — HTTP, K8sPodReady, NatsKvKeyExists. Each: unit tests + an integration test against the staging cluster.
  3. FleetOperatorSmokeTest composing the above.
  4. FleetAgentSmokeTest.
  5. Dashboard renderer last — once the events are flowing, UI is mostly maud + htmx polling.

Done when

  • deploy_with_smoke(FleetOperatorScore, FleetOperatorSmokeTest, ...) returns successfully against staging.
  • Dashboard shows the live pipeline.
  • Deliberate breakage (point the operator's helm chart at a bad image) → smoke fails visibly, failing probe named on dashboard.

Chapter 11 — CI yaml minimization (#11, longer-term)

Pulled out of the chapter-by-chapter v0.3 work.

  • Frame: workflow yaml files in .gitea/workflows/ (4 files, ~235 LOC) should hold only what Gitea Actions needs for job discovery + parallel viz. Job bodies are one-line calls into portable scripts.

Direction

  • Build out a harmony-ci Rust CLI crate. Commands like harmony-ci build composer-linux, harmony-ci publish operator-image, harmony-ci check.
  • Each workflow yaml job becomes run: cargo run -p harmony-ci -- <command>.
  • Scripts must run identically from a developer's laptop.

Not in v0.3

  • Multi-day effort; doesn't block the customer.
  • Slot when bandwidth allows.
  • Opportunistically convert when touching a workflow file for other reasons.

Chapter 12 — NATS callout CI hardening (#12, minimal)

  • nats/callout is a low-churn crate that works today.
  • Workspace-wide cargo test in .gitea/workflows/check.yml covers the non-ignored tests.
  • Four #[ignore]'d integration tests in nats/integration-test-callout/tests/callout_e2e.rs need podman + a NATS image pull in the runner.

Direction

  • Don't add CI infra in v0.3 just to run these.
  • When a runner with podman + image pull exists for other reasons (e2e harness, system upgrade test matrix), add the callout integration tests to it.
  • Until then: keep current workspace-wide coverage.

Chapter 14 — Device blacklist enforcement + un-blacklist (#14)

Goal: blacklisting a device actually locks it out of the fleet, and the action is reversible. Today blacklist_device only patches a cosmetic k8s label (fleet.nationtech.io/blacklisted) — nothing enforces it; the device keeps its NATS connection and credentials.

Decision (2026-06-03) — Zitadel-deactivate, honest about "immediate"

Core NATS has no force-disconnect, and the callout issues 1-hour user JWTs with no revocation check. So "kill the connection now" is not free. The chosen v0.3 cut:

  1. Deactivate the Zitadel machine user (device-<id>). New management API call (not yet implemented; see research: no deactivate/lock/delete wired today). After this, the device cannot re-authenticate: its next NATS callout (≤1h, on token renewal/reconnect) fails and it drops off. The agent stays connected until then.
  2. Keep the k8s label for dashboard state + un-blacklist.
  3. Confirmation UX (load-bearing): the blacklist confirm dialog must tell the operator plainly: "The device's access is revoked, but its current NATS connection persists until its token renews (≤1h). For immediate effect, restart the device, or restart NATS — note restarting NATS disconnects every device." No silent false "it's gone now."
  4. Un-blacklist = reactivate the Zitadel user + clear the label. Add FleetService::unblacklist_device (no reverse exists today — kubectl only) and a dashboard button on blacklisted devices.
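The four steps can be modeled in memory to make the intended state transitions explicit. This is a toy sketch under stated assumptions: `Identity` and `Labels` are hypothetical stand-ins for the Zitadel machine-user state and the k8s label, `callout_allows` stands in for the decision the auth callout makes on the next (re)connect, and none of the signatures are taken from the real `FleetService`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for Zitadel machine-user state.
#[derive(Default)]
struct Identity {
    deactivated: HashSet<String>,
}

/// Hypothetical stand-in for the fleet.nationtech.io/blacklisted label.
#[derive(Default)]
struct Labels {
    blacklisted: HashSet<String>,
}

#[derive(Default)]
struct FleetService {
    identity: Identity,
    labels: Labels,
}

impl FleetService {
    fn machine_user(id: &str) -> String {
        format!("device-{id}")
    }

    /// Deactivate the machine user, then record the label for the dashboard.
    fn blacklist_device(&mut self, id: &str) {
        self.identity.deactivated.insert(Self::machine_user(id));
        self.labels.blacklisted.insert(id.to_string());
    }

    /// Reverse of blacklist: reactivate the user and clear the label.
    fn unblacklist_device(&mut self, id: &str) {
        self.identity.deactivated.remove(&Self::machine_user(id));
        self.labels.blacklisted.remove(id);
    }

    /// What the callout would decide on the device's next (re)connect.
    /// An already-open connection is not affected by this check.
    fn callout_allows(&self, id: &str) -> bool {
        !self.identity.deactivated.contains(&Self::machine_user(id))
    }
}

fn main() {
    let mut fleet = FleetService::default();
    fleet.blacklist_device("pi-42");
    println!("reconnect allowed: {}", fleet.callout_allows("pi-42"));
    fleet.unblacklist_device("pi-42");
    println!("reconnect allowed: {}", fleet.callout_allows("pi-42"));
}
```

The model deliberately has no notion of "disconnect now": enforcement happens only at the next authentication, which is the caveat the confirm dialog has to spell out.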

Explicitly NOT in this cut (and why)

  • True instant kill of a hostile device. Would need NATS-level forced disconnect (monitoring + custom mechanism) — large scope, deferred. The veterinary use case is decommission/quarantine, not adversarial, so deactivate-then-reauth-fails is acceptable.
  • Shortening the JWT TTL. Considered (bounds exposure) but rejected for now — it's a callout-wide change affecting every device's churn; revisit if exposure window matters.

Done when

  • Blacklist deactivates the Zitadel user; a fresh enroll/reconnect by that device is refused by the callout (tested).
  • Un-blacklist reactivates + clears the label; device reconnects.
  • Confirm dialog states the persistence caveat verbatim.

Chapter 15 — Container auto-start after reboot (#15, shipped)

Bug, not a feature. The agent watched desired-state with bucket.watch() (DeliverPolicy::New) — only future Puts, never a replay of existing keys. On any restart (incl. device reboot) the reconciler's in-memory cache started empty and the 30s ground-truth tick had nothing to reconcile, so reboot-stopped containers never came back unless the operator rewrote the KV. Fixed by switching to watch_with_history() (DeliverPolicy::LastPerSubject): the agent replays current desired-state on startup and the idempotent apply path restarts the containers. The user podman.socket is already enable --now + lingered, so it survives reboot.
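The difference between the two delivery policies can be shown with a toy model. This is not the NATS client API: `Bucket`, `initial_delivery`, and `cache_after_restart` are hypothetical names, and the model only captures what a freshly created watcher receives before any future Put arrives.

```rust
use std::collections::BTreeMap;

/// Toy mirror of the two delivery policies discussed above.
#[derive(Clone, Copy)]
enum DeliverPolicy {
    New,
    LastPerSubject,
}

/// Toy KV bucket holding the current desired-state per key.
struct Bucket {
    entries: BTreeMap<String, String>,
}

impl Bucket {
    /// Entries a watcher created now receives before any future Put.
    fn initial_delivery(&self, policy: DeliverPolicy) -> Vec<(String, String)> {
        match policy {
            // Only future writes are delivered, so nothing arrives up front.
            DeliverPolicy::New => Vec::new(),
            // The latest value of every existing key is replayed first.
            DeliverPolicy::LastPerSubject => self
                .entries
                .iter()
                .map(|(k, v)| (k.clone(), v.clone()))
                .collect(),
        }
    }
}

/// The reconciler's in-memory cache right after an agent restart.
fn cache_after_restart(bucket: &Bucket, policy: DeliverPolicy) -> BTreeMap<String, String> {
    bucket.initial_delivery(policy).into_iter().collect()
}

fn main() {
    let mut entries = BTreeMap::new();
    entries.insert("deploy.web".to_string(), "running".to_string());
    let bucket = Bucket { entries };

    let before_fix = cache_after_restart(&bucket, DeliverPolicy::New);
    let after_fix = cache_after_restart(&bucket, DeliverPolicy::LastPerSubject);
    println!("cache entries with New: {}", before_fix.len());
    println!("cache entries with LastPerSubject: {}", after_fix.len());
}
```

With an empty cache the periodic ground-truth tick has nothing to compare against, which is why the replay on startup is what brings the containers back.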

Offline-boot resilience (containers up before NATS is reachable) is not covered — see #16; would need podman-restart.service + restart=always.


Out of scope for v0.3 (deferred deliberately)

| Item | Target | Why deferred |
| --- | --- | --- |
| Deployment-level auto-rollback | maybe never | Customer asked for roll-forward only. |
| System-upgrade LVM-snapshot rollback half | v0.4 | Push to prod first; widen scope after. |
| Live log tailing (streaming) | v0.4 | Chapter 3 ships sync getLogs; live tail builds on it. |
| Deployment dependencies (cross-deploy ordering) | TBD | Init containers cover the common case; wait for customer ask. |
| Secrets via Zitadel + OpenBao | v0.3.x | Blocked on harmony_secret work. |
| Containerized agent (podman instead of systemd) | v0.4+ | Self-upgrade protocol matures first on systemd. |
| Operator HA (active/active or active/passive) | TBD | One pod sufficient for v0.3; scale-out when fleet size demands. |
| Multi-tenant fleet isolation tests | v0.4 | Callout permissions cover the mechanism; cross-tenant smoke later. |

Open questions

These don't block starting v0.3 work but need resolution before the relevant chapter completes.

  • Q1 (Chapter 4): Binary distribution mechanism for agent upgrades. Gitea releases vs OCI artifacts vs something else.
  • Q2 (Chapter 2): Snapshot the aggregate to KV? Faster recovery vs invalidation complexity.
  • Q3 (Chapter 7): Canary test matrix? Concretely: which Pi models, which base images, which apt sources.
  • Q4 (Chapters 5 + 10): Sequencing of Chapter 10 vs Chapter 5. Both benefit from smoke; right answer might be to ship Phase 1 smoke during Chapter 5 so upgrade gates on it. Decide when starting Chapter 5.
  • Q5 (cross-cutting): One operator pod or active/passive? Customer's fleet size answers this; ask before Chapter 2 starts.

When v0.3 is done

  • All chapters 1–10 merged.
  • A real customer Deployment runs on a real Pi in a real basement.
  • The dashboard shows live status and logs.
  • An agent upgrade has been driven through the full protocol successfully (and a failure path tested).
  • A system upgrade has been driven through the full protocol on a canary.

v0.4 picks up the deferred items in priority order.