Fleet Platform v0.3 — last-mile plan
Authoritative plan for the last mile before the fleet ships to a real
customer. Picks up where v0_2_plan.md left the chapter structure.
Written 2026-05-24, after feat/iot-walking-skeleton (#264) merged
and feat/smoke-test-contract landed the Phase 0 smoke companion.
The frame:
- v0.1 proved the shape.
- v0.2 locked the brick design.
- v0.3 makes the brick safe to hand to a customer running production workloads on Pis in their basement.
State coming in
- IoT walking skeleton merged. Operator + agent + NATS + Zitadel + auth callout running end-to-end against an OKD staging cluster.
- Smoke-test contract Phase 0 merged (feat/smoke-test-contract).
  - Probe / SmokeSuite / SmokeTest companion + deploy_with_smoke in harmony-fleet-deploy/src/companion/smoke/.
  - One concrete probe today: TcpReachable.
  - No fleet Score wired to a real smoke test yet; Phase 1 is in this roadmap.
- Agent runs as a systemd user unit on devices (see harmony/src/modules/fleet/setup_score.rs:263–283).
  - No on-device containerized agent path.
  - The Dockerfile at fleet/harmony-fleet-agent/Dockerfile is k8s-only today.
- Dashboard has no role enforcement: security gap.
  - Maud/htmx frontend at fleet/harmony-fleet-operator/src/frontend/server.rs.
  - Verifies Zitadel JWT signature + expiry only. JwksCache::verify (harmony_zitadel_auth/src/jwks.rs:74) extracts sub / exp / email / name / nonce; no roles. VerifiedSession has no roles field.
  - Any logged-in Zitadel user gets full dashboard access. Fix immediately (Chapter 1).
- NATS callout already has the role-extraction logic we need.
  - ZitadelValidator::extract_roles at nats/callout/src/zitadel.rs:203.
  - Handles both array shape (["fleet-admin"]) and Zitadel's object-map shape ({"fleet-admin": {org_id: org_name}}).
  - roles::resolve maps role names to ResolvedRole::Admin / ::Device with admin-wins privilege escalation.
  - Chapter 1 reuses the extractor, not the role-to-NATS-permission half.
- System upgrade ADR drafted at docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md.
  - Header says Accepted 2026-05-24 but lives under drafts/.
  - Authoritative status (updated 2026-06-03): approach agreed; the rollback half, originally deferred, is now in v0.3 scope (Chapter 7).
Customer constraints baked into this plan
- Deployments are roll-forward only. No auto-rollback when a new Deployment (customer app) version fails. Dashboard surfaces the failure; customer edits the spec and rolls forward. Customer ask; may change later.
- Agent and system upgrades DO auto-rollback (updated 2026-06-03). This is a different concern from the deployment rule above and must not be conflated: a broken agent or OS upgrade has no customer "edit the spec and roll forward" path — an unreachable/bricked device needs to self-heal. ADR-022 (agent) and ADR-0042 (system) both already design the rollback; both are now in v0.3 scope. See Chapters 4 & 7.
- Secrets need Zitadel + OpenBao. No plaintext-env-var shortcut. harmony_secret + OpenBao work is on the critical path for any Deployment that needs credentials.
Feature checklist
Status legend: ✅ shipped · 🟡 in flight · 🔴 not started · ⏸ deferred (target version in note).
| # | Feature | Status | Owner / branch | Notes |
|---|---|---|---|---|
| 1 | Dashboard role enforcement (fleet-admin required) | 🔴 | next branch | Reuse ZitadelValidator::extract_roles. Do this right now: security gap. |
| 2 | Operator restart / aggregator cold-rebuild | 🔴 | next branch | More critical than smoke wiring; ship before any customer. |
| 3 | Deployment getLogs companion + dashboard log view | 🟡 | feat/fleet-device-exec-logs | Basic cut shipped: one-shot podman logs tail (Refresh button) + remote exec + per-deployment selector on the device Logs tab. Companion-trait refactor + live streaming still owed (→ #13/v0.4). |
| 4 | Agent self-upgrade + auto-rollback (ADR-022) | 🔴 | new branch | ADR-022 is the accepted design and already includes auto-rollback (symlink swap, old binary never GC'd, auto-revert on smoke-fail/heartbeat-timeout). Supersedes the simpler no-rollback cut sketched in Ch4. Reconcile runtime model first (see Ch4 note). |
| 5 | Graceful deployment upgrade (roll-forward only) | 🔴 | new branch | SIGTERM → grace → SIGKILL fallback → start new. No rollback. |
| 6 | Init containers in PodmanV0Score | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
| 7 | System upgrade + auto-rollback (ADR-0042) | 🔴 | new branch | Now in scope WITH rollback (was deferred to v0.4): LVM thin snapshot + two-tier watchdog (initramfs bootcount = hard-fail, control-plane check-in = soft-fail). Promote ADR out of drafts/ + renumber to next in sequence. |
| 8 | Secrets via Zitadel + OpenBao for Deployments | 🔴 | blocked on machine-identity br. | Design locked (Chapter 8): agent fetches via harmony_config, scoped by a per-deployment Zitadel claim OpenBao reads. Heaviest item; depends on the machine-identity/SSO branch. |
| 9 | Agent time-drift verification | 🔴 | new branch | Periodic NTP check; refuse JWT operations if skewed. |
| 10 | Phase 1 smoke wiring (HTTP / K8sPodReady / NatsKv probes) | 🔴 | new branch | After required features land. Not a functional blocker. |
| 11 | CI yaml minimization (logic into harmony-ci scripts) | ⏸ v0.4 | longer-term | Yaml stays for discovery + parallel viz; scripts move. |
| 12 | NATS callout CI hardening | ⏸ | low-churn crate | Already covered by workspace cargo test. Run ignored tests when CI has podman + NATS image. |
| 13 | Application log streaming through NATS | ⏸ v0.4 | follow-on to #3 | #3 is the synchronous getLogs; this is the live tail. |
| 14 | Device blacklist enforcement + un-blacklist | 🔴 | new branch | Today blacklist is a cosmetic label. Chapter 14: Zitadel deactivate + un-blacklist; no true instant NATS kill (see chapter). |
| 15 | Container auto-start after reboot (bug) | ✅ | feat/fleet-device-exec-logs | Agent used watch (DeliverPolicy::New) → never replayed desired-state on restart. Fixed: watch_with_history. |
| 16 | Containers auto-start while offline at boot | ⏸ v0.4 | follow-on to #15 | #15 covers the online case (agent re-reconciles). Offline-boot resilience would need user podman-restart.service + restart=always; defer. |
Sequencing
| Order | Item | Why |
|---|---|---|
| 1 | #1 Dashboard role enforcement | Security gap, do right now. |
| 2 | #2 Operator restart recovery | More critical than smoke wiring. Customer can't tolerate "operator restarted, state unknown." |
| 3 | #3 Log forwarding companion | Turns the dashboard from a toy into a thing customers actually use. |
| 4 | #4 Agent self-upgrade | Parallel-safe with #2/#3 — different code paths. |
| 5 | #5 + #6 Graceful upgrade + init containers | Paired Deployment-layer features; ship together. |
| 6 | #9 Time-drift verification | Small, isolated; slot between heavier items. |
| 7 | #7 System upgrade | Builds on agent-upgrade pattern from #4 — #4 lands first. |
| 8 | #10 Phase 1 smoke wiring | After required features so probes verify real customer-facing surfaces. |
| 9 | #8 Secrets | Blocks any customer Deployment that needs credentials. Promote if first customer needs them. |
| 10 | #11 / #12 CI | Opportunistic, doesn't block customer. |
Chapter 1 — Dashboard role enforcement (#1)
Goal: every dashboard page requires a valid Zitadel session and a fleet-admin role on the token.
- Users without the role get a 403 with a clear message.
- Users without a session get the existing login redirect.
Current state
- JWKS verify only extracts identity claims. JwksCache::verify (harmony_zitadel_auth/src/jwks.rs:74) parses the JWT and returns a VerifiedSession with sub / exp / email / name / nonce. Roles not extracted.
- VerifiedSession has no roles field (harmony_zitadel_auth/src/session.rs:5).
- Middleware checks JWT validity only. require_auth (fleet/harmony-fleet-operator/src/frontend/server.rs:136–157). Every authenticated user gets all pages.
- Role extraction logic already exists and is correct in the callout: ZitadelValidator::extract_roles at nats/callout/src/zitadel.rs:203. Handles both shapes:
  - array: ["fleet-admin"]
  - object-map: {"fleet-admin": {org_id: org_name}}
Plan
1. Extract a shared role-extraction helper into harmony_zitadel_auth so dashboard and callout import from one place. Callout keeps its API but its body delegates.
2. Extend VerifiedSession with roles: Vec<String>.
3. Extend the JWKS Claims decode struct to capture the configured roles claim. Pull the claim name from existing callout config so the two systems agree (Zitadel ships urn:zitadel:iam:org:project:roles or similar).
4. Add require_role(role: &'static str) middleware to the dashboard. Compose with require_auth. Use on every Router::route(..., post|get(...).layer(...)).
5. 403 response renders a maud page: "fleet-admin role required; ask your administrator." Not a JSON error; dashboard is human-facing.
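A minimal sketch of the shared helper from Plan step 1, assuming the roles claim has already been decoded into one of the two shapes. RolesClaim and this extract_roles signature are hypothetical stand-ins for illustration, not the existing ZitadelValidator::extract_roles API.

```rust
use std::collections::BTreeMap;

/// Hypothetical decoded form of the roles claim (assumption: the real helper
/// would decode this from the JWT payload).
#[derive(Debug, Clone)]
pub enum RolesClaim {
    /// Array shape: ["fleet-admin"]
    Array(Vec<String>),
    /// Object-map shape: {"fleet-admin": {org_id: org_name}}
    Map(BTreeMap<String, BTreeMap<String, String>>),
}

/// Returns the role names for either claim shape, sorted and de-duplicated.
/// A missing claim yields an empty list, never an error.
pub fn extract_roles(claim: Option<&RolesClaim>) -> Vec<String> {
    let mut roles: Vec<String> = match claim {
        None => Vec::new(),
        Some(RolesClaim::Array(names)) => names.clone(),
        // Only the map keys are role names; the org mapping is ignored here.
        Some(RolesClaim::Map(by_role)) => by_role.keys().cloned().collect(),
    };
    roles.sort();
    roles.dedup();
    roles
}
```

Normalizing both shapes to one sorted list lets the dashboard and the callout compare results directly, which is what the "identical resolution" test below relies on.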
Tests
Security code — heavy unit tests are non-negotiable.
- Array-shape claim → fleet-admin in session. JWT with array-shape role claim.
- Object-map shape → identical resolution. Same role, Zitadel's other claim shape.
- No role claim → empty roles. Token with no roles claim.
- Wrong role doesn't elevate. JWT with only device role does NOT carry fleet-admin.
- No session → 401/redirect.
- Session but no fleet-admin → 403.
- Session + fleet-admin → 200.
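The last three cases reduce to one decision function. The sketch below is framework-free; Session and Gate are hypothetical names, and the real middleware would map each variant to the login redirect, the maud 403 page, or the wrapped handler.

```rust
/// Hypothetical session shape: the plan adds `roles` to VerifiedSession.
pub struct Session {
    pub sub: String,
    pub roles: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    /// No session: fall back to the existing login redirect.
    RedirectToLogin,
    /// Valid session without the required role: render the 403 page.
    Forbidden,
    /// Valid session carrying the required role: run the handler.
    Allow,
}

/// Decides the outcome for one request. Role names are compared exactly,
/// so holding a different role (e.g. "device") never elevates.
pub fn gate(session: Option<&Session>, required_role: &str) -> Gate {
    match session {
        None => Gate::RedirectToLogin,
        Some(s) if s.roles.iter().any(|r| r == required_role) => Gate::Allow,
        Some(_) => Gate::Forbidden,
    }
}
```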
Done when
- Branch merged.
- All dashboard handlers gated by require_role("fleet-admin").
- Every test green.
- Manual smoke against staging Zitadel: user without role sees 403.
Follow-ups (post-demo — shipped a working-but-imperfect cut)
Gate works in staging, but on a temporary footing. Clean these up after the demo:
- Get roles into the id_token via scope, not the Zitadel app checkbox. Today it works only because the app has "User Roles Inside ID Token" toggled on: out-of-band IdP config, invisible to our code, easy to miss on a new env (cost us a debug cycle: roles=[] despite the role being granted). The OIDC-idiomatic fix is to request urn:zitadel:iam:org:project:roles in ZitadelAuthConfig.scope ("when requested" per the Zitadel claims matrix), then turn the checkbox back off. Keeps our stateless id_token-as-session design; the dependency travels with the deploy. (UserInfo-endpoint / access-token authz is the heavier vendor-agnostic alternative; not worth it for a first-party UI.)
- Fix the SSO doc. docs/guides/operator-dashboard-sso.md step 1b wrongly says enable "Assert Roles on Authentication"; that's the userinfo setting and does not put roles in the id_token. Replace with the scope-request (and/or "User Roles Inside ID Token").
- Unify the two role extractors (DRY debt; diverged from Plan #1 above). We now have harmony_zitadel_auth::extract_zitadel_roles and nats/callout ZitadelValidator::extract_roles doing the same job. Worse, the dashboard one only handles the object-map claim shape; the callout one also handles the array shape. Extract one shared helper (handling both shapes + both aggregated/project-scoped claim names) and have both import it, as Chapter 1 Plan originally intended.
- require_role is inlined, not composable. The gate lives inside require_auth (one trust boundary, fine for one role). If a second role/permission ever appears, lift it to a composable require_role(..) layer as Plan #4 intended, not before (YAGNI).
Chapter 2 — Operator restart + aggregator recovery (#2)
Goal: the operator pod can be killed, upgraded, or rescheduled at any time and the system converges back to correct state from NATS KV alone. No "unknown state" window visible to customers.
Current state
- Aggregator rebuilds from scratch on startup. fleet_aggregator.rs (833 LOC, in harmony-fleet-operator/src/) watches the KV buckets to materialize state. JG confirmed: "rebuilt from scratch, yes."
- Failure modes not exercised yet:
  - Partial KV: device offline during operator reset, never re-published its info.
  - Two operator pods racing during a rolling deploy of the operator.
  - NATS stream loss between operator restart and rebuild completing.
  - Stale KV: Deployment CR deleted in kube while operator was down.
Plan
Scenario-driven. Enumerate failure shapes, then handle one at a time. Discipline: each scenario gets a regression test in harmony-fleet-e2e, then the fix.
- Scenario inventory pass. Write docs/fleet-operator-recovery-scenarios.md listing every failure shape we can think of. Cross-reference smoke-a* tests to identify what's already covered.
- Cold-start rebuild as the baseline. Confirm + test that running kubectl delete pod on the operator and waiting for the replacement converges to the pre-kill aggregate in < 30s. Gate on convergence time at N device count.
- Stale-KV reconciliation. Define the rule for "KV says device D has Deployment X, but Deployment X no longer exists in kube." Operator cleans up; agents observe the deletion.
- Leader election decision. Ship with leader election (one writer at a time) or design for idempotent multi-writer? Score-Topology-Interpret leans idempotent; confirm + assert operator writes are byte-deterministic.
- Liveness signaling for the dashboard. Surface "operator converged" / "operator recovering" as states the frontend renders. Customer sees a loading banner, not a blank dashboard, during rebuild.
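One possible shape for the stale-KV rule, as a sketch only. KvAssignments and stale_entries are hypothetical names; the real aggregator's types are not shown in this plan. Ordered collections keep the output deterministic, which fits the idempotent multi-writer leaning.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical KV view: device id -> Deployment names the KV says it runs.
pub type KvAssignments = BTreeMap<String, BTreeSet<String>>;

/// Returns every (device, deployment) pair that is present in KV but whose
/// Deployment no longer exists in kube. The operator would delete these keys
/// so agents observe the removal.
pub fn stale_entries(kv: &KvAssignments, live: &BTreeSet<String>) -> Vec<(String, String)> {
    let mut stale = Vec::new();
    for (device, deployments) in kv {
        // BTreeSet::difference yields entries in `deployments` not in `live`.
        for deployment in deployments.difference(live) {
            stale.push((device.clone(), deployment.clone()));
        }
    }
    stale
}
```

Because the function is pure and its output order is fixed, two operator pods running it against the same inputs would compute the same cleanup set.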
Open questions
- Warm-restart snapshot? Keep a per-operator-pod "last known aggregate" snapshot in a KV bucket so warm restarts skip cold rebuild? Probably yes for >1000-device fleets; adds an invalidation problem.
- One pod or active/passive? Customer's fleet size answers this. Ask before starting.
Done when
- Scenario inventory exists.
- Each scenario has a regression test, all green.
- Manual chaos: kill operator pod during high write load → convergence + dashboard liveness banner observed.
Chapter 3 — Application log forwarding companion (#3)
Goal: when a customer's Deployment is misbehaving on a Pi in the field, the dashboard shows last-N-lines of container logs without anyone SSH-ing the device.
Design
- Logs attach as a Score companion — same pattern as the smoke-test contract.
- The companion is optional — Scores without one render "this deployment doesn't expose logs". Acceptable.
- Sync getLogs ships in v0.3 as the minimum useful UX; live tail (streaming) waits for v0.4.
Shape:
// new in harmony-fleet-deploy/src/companion/logs/
pub trait LogQuery<T: Topology>: Send + Sync {
type Score: Score<T>;
async fn last_lines(
&self,
score: &Self::Score,
topology: &T,
n: usize,
) -> Result<LogChunk, LogQueryError>;
}
pub struct LogChunk {
pub source: ProbeName, // reuse the validated newtype
pub captured_at: chrono::DateTime<chrono::Utc>,
pub lines: Vec<String>,
pub truncated: bool,
}
For PodmanV0Score:
- Transport: NATS request/reply. Subject device-commands.<device_id>.logs.<deployment>.
- Agent side: runs podman logs --tail N <container> and replies with a LogChunk.
- Dashboard side: one async call from the logs handler.
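A sketch of the two pure pieces on this path: building the request subject and trimming captured output to the last N lines. Both function names are hypothetical. The token check reflects NATS subject rules ('.' separates tokens, '*' and '>' are wildcards); treating truncated as "older lines were dropped" is an assumption about what LogChunk.truncated means.

```rust
/// A subject token must be non-empty and free of separators, wildcards,
/// and whitespace, otherwise the subject would route somewhere unintended.
fn is_valid_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Builds device-commands.<device_id>.logs.<deployment>, or None when either
/// part cannot be used as a single subject token.
pub fn logs_subject(device_id: &str, deployment: &str) -> Option<String> {
    if is_valid_token(device_id) && is_valid_token(deployment) {
        Some(format!("device-commands.{device_id}.logs.{deployment}"))
    } else {
        None
    }
}

/// Keeps the last `n` lines of `raw`. The bool is true when older lines
/// were dropped to fit the limit.
pub fn tail_lines(raw: &str, n: usize) -> (Vec<String>, bool) {
    let all: Vec<&str> = raw.lines().collect();
    let truncated = all.len() > n;
    let start = all.len().saturating_sub(n);
    let kept = all[start..].iter().map(|line| line.to_string()).collect();
    (kept, truncated)
}
```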
Plan
- Define LogQuery companion trait in a new harmony-fleet-deploy/src/companion/logs/ module.
- PodmanLogQuery implementing LogQuery<…> for PodmanV0Score.
- Agent-side command handler: extend the existing request/reply command dispatcher.
- Dashboard handler at /deployments/<name>/devices/<id>/logs?lines=N returning rendered maud.
- Tests: unit on PodmanLogQuery; integration in harmony-fleet-e2e driving end-to-end.
Done when
- Customer clicks "View logs" on the dashboard.
- Sees the last 200 lines.
- Call returns in < 2s on a 3-device fleet.
Chapter 4 — Agent self-upgrade + auto-rollback (#4)
Reconciliation (2026-06-03). ADR-022 (docs/adr/022-fleet-agent-upgrade.md, Accepted-design) is authoritative: build that, not the simpler cut below. ADR-022 already delivers the auto-rollback JG wants: versioned binaries (/usr/bin/fleet-agent-v<ver>, never GC'd) + atomic symlink swap, a Verifying step (--self-test) before cutover, and auto-revert when the staged binary fails smoke or the new agent misses its heartbeat window. The old version is one ln -sfn away and the operator (not the agent) owns the stop signal. The "no auto-rollback in v0.3" line in the old draft below was a category error: it borrowed the deployment roll-forward-only rule (a customer ask about app versions), which does not apply to agent upgrades. Rollback here is wanted and already designed.
Two divergences to resolve before implementing:
1. Runtime model. ADR-022 + this chapter assume a systemd binary (symlink swap). ADR-0042 (system upgrade) states the agent "runs as a privileged Podman container that autostarts on boot." These contradict. Pick one and fix the loser. (Recommend: stay systemd-binary for the agent. ADR-022's symlink/verify/revert is simpler and the container path isn't built. Update ADR-0042's premise.)
2. Protocol. ADR-022's state machine (Running→Draining→Staging→Verifying→Cutover-Ready→Stopping) supersedes the marker-phase sketch below. Keep the "marker in NATS, no-NATS-no-upgrade" idea; drop the systemctl restart self-swap in favor of ADR-022's parallel-service + operator-stop handoff.
Goal: the agent can upgrade itself in place, auto-reverting to the last known-good version on failure. If NATS is unavailable, the upgrade does not start. The operator sees every step.
Design (per JG's direction)
- Stay on systemd for v0.3. Switching the agent runtime to podman is its own risk; defer until self-upgrade protocol matures.
- Upgrade marker lives in NATS, not on disk. New KV bucket agent-upgrade keyed by device_id, carrying start_timestamp, invoker_version, target_version, phase.
- No NATS → no upgrade. Feature, not limitation: operator can't observe an upgrade it can't see, so refusing without NATS prevents silent half-upgrades.
Protocol
1. Operator writes Requested. agent-upgrade/<device_id> with phase: Requested, target_version: vX.
2. Old agent observes + writes Suspending. Verifies NATS liveness with a round-trip first.
3. Old agent suspends + writes Suspended. Reconcile loop paused; heartbeat continues so the operator knows it's alive.
4. Old agent fetches new binary + writes Fetched. Mechanism TBD (see open questions). target_path: /usr/local/bin/fleet-agent.new.
5. Old agent launches new binary as a separate process + writes NewLaunched. Not via systemd unit update yet. Includes new_pid: N.
6. New agent self-checks + writes NewHealthy. Connects to NATS, verifies permissions, one-shot smoke (KV read, command channel echo).
7. Old agent writes HandingOff and exits. Tells systemd to swap the binary: systemctl daemon-reload + systemctl restart fleet-agent.service with the new binary in place.
8. Systemd starts the unit pointing at the new binary. Final state phase: Complete, completed_at: T.
On stall (configurable, default 5 min):
- Marker writes
phase: Failedwith last successful step. - Operator surfaces this on the dashboard.
- Customer / operator intervenes manually — no auto-rollback in v0.3, consistent with the deployment roll-forward-only rule.
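The stall rule can be sketched as a pure check over the marker. Phase names follow the draft protocol above (which ADR-022's state machine supersedes, so treat them as placeholders). MarkerStatus and evaluate are hypothetical, and measuring the stall from the last phase write rather than from start_timestamp is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Requested,
    Suspending,
    Suspended,
    Fetched,
    NewLaunched,
    NewHealthy,
    HandingOff,
    Complete,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MarkerStatus {
    InProgress(Phase),
    Complete,
    /// Written on stall; carries the last step that did succeed.
    Failed { last_successful: Phase },
}

/// Default stall window from the plan: 5 minutes, configurable.
pub const DEFAULT_STALL: Duration = Duration::from_secs(5 * 60);

/// Classifies a marker given how long ago its phase last advanced.
pub fn evaluate(phase: Phase, since_last_write: Duration, stall_after: Duration) -> MarkerStatus {
    if phase == Phase::Complete {
        return MarkerStatus::Complete;
    }
    if since_last_write >= stall_after {
        MarkerStatus::Failed { last_successful: phase }
    } else {
        MarkerStatus::InProgress(phase)
    }
}
```

Keeping the last successful step inside Failed gives the dashboard something concrete to show, and under ADR-022 it is also the point from which an auto-revert would start.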
Open questions
- Q1.1 Binary distribution. Gitea release asset? Signed OCI artifact? Existing arm-agents.yaml uploads aarch64 binaries to releases; start with that.
- Q1.2 Verification. Hash signature? GPG? Minimum: SHA-256 pinned in the upgrade-request payload.
- Q1.3 Atomic systemd swap. systemctl restart is not atomic across binary-on-disk and process. Acceptable? Or systemd-run --transient shim?
- Q1.4 Cross-arch. Fetch URL has to know the device's arch. KV device-info already carries this; confirm the agent reads its own arch correctly.
Done when
- Branch contains the protocol implementation + e2e test driving v0.3.0 → v0.3.1 upgrade against a libvirt VM.
- Operator sees every phase.
- Failure path tested: deliberately corrupt the new binary → marker reads Failed, old agent stays running.
Chapter 5 — Graceful deployment upgrade, roll-forward only (#5)
Goal: upgrading a Deployment's image/config replaces the old container without dropping traffic mid-request. If the new container won't start, the customer sees the failure clearly and fixes the spec.
Design
Extend PodmanV0Score with a lifecycle block:
pub struct PodmanV0Score {
// ... existing fields ...
pub lifecycle: Option<LifecyclePolicy>,
}
pub struct LifecyclePolicy {
pub stop_signal: StopSignal, // SIGTERM (default), SIGINT, SIGUSR1
pub grace_period: Duration, // default 30s
pub sigkill_fallback: bool, // default true
}
Agent's reconcile when image/config changes:
1. Write Upgrading phase. New DeploymentState::Phase::Upgrading variant. Dashboard shows the in-progress upgrade.
2. Send stop_signal to the old container.
3. Wait up to grace_period for clean exit.
4. SIGKILL fallback if still running and sigkill_fallback.
5. Start new container.
6. On startup failure: write Failed and stop. Image pull error, exec error, crash within 5s. No revert to old image.
7. On success: write Running. Optionally gated by a Phase-1 smoke test (Chapter 10) when that lands.
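Steps 2 to 4 can be sketched against a small runtime seam so the ordering is unit-testable without podman. ContainerRuntime, StopOutcome, and ScriptedRuntime are hypothetical names introduced for illustration; the struct fields mirror LifecyclePolicy above.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopSignal {
    Sigterm,
    Sigint,
    Sigusr1,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,
    pub grace_period: Duration,
    pub sigkill_fallback: bool,
}

/// Hypothetical seam over the container runtime (not a real podman API).
pub trait ContainerRuntime {
    fn send_signal(&mut self, signal: StopSignal);
    /// Waits up to `timeout`; returns true if the container exited in time.
    fn wait_for_exit(&mut self, timeout: Duration) -> bool;
    fn kill(&mut self);
}

#[derive(Debug, PartialEq, Eq)]
pub enum StopOutcome {
    ExitedCleanly,
    Killed,
    /// Grace period elapsed and the policy forbids SIGKILL.
    StillRunning,
}

/// Signal, wait for the grace period, then kill only if the policy allows it.
pub fn stop_old_container<R: ContainerRuntime>(
    runtime: &mut R,
    policy: &LifecyclePolicy,
) -> StopOutcome {
    runtime.send_signal(policy.stop_signal);
    if runtime.wait_for_exit(policy.grace_period) {
        return StopOutcome::ExitedCleanly;
    }
    if policy.sigkill_fallback {
        runtime.kill();
        StopOutcome::Killed
    } else {
        StopOutcome::StillRunning
    }
}

/// Test double: records calls and exits (or not) as scripted.
pub struct ScriptedRuntime {
    pub exits_on_signal: bool,
    pub calls: Vec<&'static str>,
}

impl ContainerRuntime for ScriptedRuntime {
    fn send_signal(&mut self, _signal: StopSignal) {
        self.calls.push("signal");
    }
    fn wait_for_exit(&mut self, _timeout: Duration) -> bool {
        self.calls.push("wait");
        self.exits_on_signal
    }
    fn kill(&mut self) {
        self.calls.push("kill");
    }
}
```

The StillRunning outcome is a design question the plan leaves open: with sigkill_fallback off, the agent has to decide whether starting the new container is still safe.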
Explicit non-goals
- No auto-rollback. Customer-asked constraint. Step 6 firing → dashboard shows "Deployment failed; previous version stopped" and the customer edits the spec.
- No "stale + new" window. Single container per Deployment per device; short downtime during cutover is accepted.
Done when
- Upgrade test in harmony-fleet-e2e walks v1 → v2 → v3 image swap with controlled failures.
- Dashboard reflects every step.
Chapter 6 — Init containers (#6)
Goal: customer can declare init containers that run to completion before the main container starts. Mirror Kubernetes semantics so customer mental model transfers.
Design
Extend PodmanV0Score with init_containers: Vec<InitContainer>:
- Ordered — declaration order = run order.
- Run-to-completion — each one must exit zero before the next starts.
- Fail-the-Deployment on init failure — non-zero exit or timeout exceeded.
pub struct InitContainer {
pub name: String,
pub image: String,
pub args: Vec<String>,
pub env: Vec<EnvVar>,
pub volumes: Vec<VolumeMount>,
pub timeout: Duration, // default 5 min
}
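The ordering and fail-fast rules can be sketched with the container launch abstracted behind a closure. InitStep, InitResult, and run_init are hypothetical names; InitStep is a cut-down stand-in for InitContainer with only the fields the loop needs.

```rust
use std::time::Duration;

/// Cut-down stand-in for InitContainer (name + timeout only).
pub struct InitStep {
    pub name: String,
    pub timeout: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum InitResult {
    Exited(i32),
    TimedOut,
}

/// Runs init steps in declaration order. The first non-zero exit or timeout
/// fails the Deployment; later steps and the main container never start.
pub fn run_init<F>(steps: &[InitStep], mut run: F) -> Result<(), String>
where
    F: FnMut(&InitStep) -> InitResult,
{
    for step in steps {
        match run(step) {
            InitResult::Exited(0) => continue,
            InitResult::Exited(code) => {
                return Err(format!("init container {} exited with code {}", step.name, code));
            }
            InitResult::TimedOut => {
                return Err(format!(
                    "init container {} timed out after {}s",
                    step.name,
                    step.timeout.as_secs()
                ));
            }
        }
    }
    Ok(())
}
```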
Customer contract (document loudly)
Init containers must be idempotent. They run on every reconcile that requires a fresh main container — power-cycle recovery, graceful upgrade, etc.
- Customer-side migration scripts that aren't idempotent will misbehave.
- Document with examples.
- Add a Score-builder lint that warns on common non-idempotent patterns (e.g. INSERT without ON CONFLICT).
Done when
- harmony-fleet-e2e deploys a Deployment with one init container (mkdir -p /data && touch /data/initialized) followed by a main container that asserts the file exists.
- Two-step ordering sequence tested.
Chapter 7 — System upgrade + auto-rollback (#7)
Reconciliation (2026-06-03). Rollback is now in scope (JG: agent and system upgrade both auto-rollback). The previous "rollback deferred to v0.4" stance is dropped. ADR-0042 (docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md, Accepted) is authoritative and the rollback is its core. Housekeeping: promote the ADR out of drafts/ and renumber it into the real sequence (it's filed as "0042"; next free is 025), and fix its premise that the agent is a Podman container (see Ch4 divergence #1).
Goal: the device can apt full-upgrade its base OS without bricking — and
a device that fails to return to a healthy, control-plane-connected state
rolls back automatically, no truck roll. Covers both failure modes per the
ADR: soft (boots, agent runs, can't reach control plane → userspace timer
merges the snapshot) and hard (root won't boot at all → initramfs bootcount
hook merges the snapshot).
Scope (the full ADR, including the rollback half)
- One-time provisioning conversion (partition → PV/VG/LV preserving ext4, initramfs regen with LVM + hook, cmdline.txt → root=/dev/mapper/vg0-root, BCM2835 watchdog). Scripted + idempotent; run at provisioning, not live.
- Per-upgrade flow: set upgrade-pending, lvcreate thin snapshot vg0/root_preupgrade, write bootcount=0 / expected-good=false to /boot, apt full-upgrade, reboot.
- Initramfs local-top boot-attempt hook (hard-fail rollback): increment bootcount on FAT /boot; bootcount > N (N=2–3) → lvconvert --merge vg0/root_preupgrade + reboot. This is the piece that survives an unbootable kernel; mandatory for customers running out-of-tree modules.
- Userspace check-in timer (soft-fail rollback): new agent must achieve a successful control-plane check-in within the soft timeout (10 min); success → reset bootcount, lvremove snapshot, clear upgrade-pending; timeout → lvconvert --merge + reboot.
- Hardware watchdog catches total hangs → reset → initramfs bootcount path.
- Canary matrix: clean upgrade, soft-fail (no check-in), hard-fail (unbootable kernel), total hang.
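The two watchdog tiers reduce to two small decisions. The sketch below models them as pure Rust functions purely for illustration: the real hard-fail tier runs as an initramfs hook, and all names here are hypothetical.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum BootAction {
    /// No upgrade in flight: boot without touching the counter.
    BootNormally,
    /// Persist the new count to /boot and continue booting.
    BootAndRecord(u32),
    /// Too many attempts: lvconvert --merge the snapshot, then reboot.
    MergeSnapshotAndReboot,
}

/// Hard-fail tier. Increments the counter first, then compares against N,
/// matching "increment bootcount; bootcount > N → merge".
pub fn boot_attempt(upgrade_pending: bool, bootcount: u32, max_attempts: u32) -> BootAction {
    if !upgrade_pending {
        return BootAction::BootNormally;
    }
    let next = bootcount.saturating_add(1);
    if next > max_attempts {
        BootAction::MergeSnapshotAndReboot
    } else {
        BootAction::BootAndRecord(next)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProbationAction {
    KeepWaiting,
    /// Reset bootcount, lvremove the snapshot, clear upgrade-pending.
    Commit,
    MergeSnapshotAndReboot,
}

/// Soft-fail tier. A successful control-plane check-in commits the upgrade;
/// reaching the soft timeout without one rolls back.
pub fn probation(checked_in: bool, since_boot: Duration, soft_timeout: Duration) -> ProbationAction {
    if checked_in {
        ProbationAction::Commit
    } else if since_boot >= soft_timeout {
        ProbationAction::MergeSnapshotAndReboot
    } else {
        ProbationAction::KeepWaiting
    }
}
```

With N = 2, a device that never reaches userspace gets two boot attempts and merges on the third pass through the hook.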
Hard constraints carried from the ADR
- lvconvert --merge discards everything written during the probation window, so any must-persist agent state lives outside the snapshot (separate LV or control-plane DB). Specify exactly what, and where.
- Thin-pool sizing must guarantee snapshot + upgrade churn can't exhaust the pool.
Done when
- Canary Pi successfully upgrades from a known-good base image to a later one.
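- Snapshot exists through the probation window and is removed (lvremove) once check-in succeeds.
- Each failure case in the canary matrix (soft-fail, hard-fail, total hang) rolls back automatically.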
- No customer-visible regression.
- Per "Full Verification Before Done" rule: green on both aarch64 and x86_64 device classes.
Chapter 8 — Secrets via Zitadel + OpenBao (#8)
Goal: a Deployment can reference a secret by name and the device's container receives the value at apply time — without the secret ever sitting in NATS KV, and scoped so a device can read only the secrets for the deployments it actually runs.
Decision (2026-06-03) — agent-fetch, identity-scoped
The agent fetches secrets directly from OpenBao via harmony_config.
The score carries a reference (valueFrom), never a literal. Scoping
rides on Zitadel machine identity: when a device gains a deployment it
gets a custom Zitadel claim for that deployment; OpenBao reads the claim
and grants access to that deployment's secrets only. New deployment ⇒
the device's token/permissions must be renewed before the fetch
succeeds (not ideal, but the best design we landed on). The admin updates
the secret in OpenBao; agents either refresh periodically or the
admin restarts the related deployment to pull the new value.
Depends on the machine-identity / Zitadel-SSO-with-automatic-permissions branch (in flight elsewhere) — Chapter 8 is blocked on it landing.
Rejected alternative (simpler security-wise, worse operationally): write the secret to NATS encrypted with the device pubkey. It's a caching layer — needs the whole write/refresh/restart machinery anyway and will cause sync issues eventually. Not chosen.
Shape (subject to the identity branch's API)
- EnvVar gains a valueFrom variant: a reference { secret: <name>, key: <field> }, resolved against the deployment-scoped OpenBao path. Inline value literals stay supported.
- Agent-side harmony_config client built from the device's Zitadel identity (the keyfile it already holds), fetching only its permitted paths.
- Resolution at apply time, in the reconciler before ensure_service_running. A fetch failure fails the deployment with a clear Phase::Failed reason ("secret X not readable: permission / not renewed"), never a silent empty env.
- Refresh: periodic re-fetch on the 30s tick (re-resolve refs; restart container only if a value changed), plus admin-triggered deployment restart for immediacy.
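A sketch of apply-time resolution and the restart-only-on-change rule, with the OpenBao fetch abstracted behind a closure. EnvValue, resolve_env, and needs_restart are hypothetical names; the real types depend on the identity branch's API.

```rust
use std::collections::BTreeMap;

pub enum EnvValue {
    /// Inline literal; stays supported.
    Value(String),
    /// Reference resolved against the deployment-scoped secret path.
    ValueFrom { secret: String, key: String },
}

pub struct EnvVar {
    pub name: String,
    pub value: EnvValue,
}

/// Resolves every variable. `fetch(secret, key)` stands in for the agent's
/// secret client. Any fetch error aborts resolution with a reason naming the
/// secret, so the container never starts with a silently empty value.
pub fn resolve_env<F>(vars: &[EnvVar], mut fetch: F) -> Result<BTreeMap<String, String>, String>
where
    F: FnMut(&str, &str) -> Result<String, String>,
{
    let mut resolved = BTreeMap::new();
    for var in vars {
        let value = match &var.value {
            EnvValue::Value(literal) => literal.clone(),
            EnvValue::ValueFrom { secret, key } => fetch(secret.as_str(), key.as_str())
                .map_err(|why| format!("secret {secret} not readable: {why}"))?,
        };
        resolved.insert(var.name.clone(), value);
    }
    Ok(resolved)
}

/// Periodic refresh: restart the container only if some value changed.
pub fn needs_restart(running: &BTreeMap<String, String>, fresh: &BTreeMap<String, String>) -> bool {
    running != fresh
}
```

Comparing whole resolved maps keeps the 30s tick cheap to reason about: an unchanged fetch is a no-op, and any difference triggers exactly one restart.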
Non-goals (scope discipline)
- No general templating / file-mount secrets / CSI-driver story — one ref shape, env only.
- No secret material in NATS KV (that's the rejected design).
- No rotation automation beyond "periodic refresh or restart."
Customer-facing until this lands
"Your first Deployments should use inline environment variables only; credential injection arrives with the secrets chapter."
Chapter 9 — Agent time-drift verification (#9)
Goal: agent refuses to operate (or warns loudly) when its clock is skewed enough to break JWT validation.
Design
- Startup NTP-style query against a configurable server list (default: time.cloudflare.com, pool.ntp.org).
- Refuse to start on |drift| > 30s. Typical JWT skew tolerance; past it, every NATS callout request fails with a cryptic exp invalid.
- Periodic re-check every 6 hours. Mid-run drift past threshold → agent publishes a DeviceInfo health flag, dashboard surfaces it.
- Specific customer-facing error message: "system clock skew is 14m32s; JWT validation will fail. Enable systemd-timesyncd or chrony."
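The threshold check and the error message can be sketched without any network code, assuming the NTP-style query has already produced a signed offset in seconds. check_drift and format_skew are hypothetical names.

```rust
/// Threshold from the plan: refuse to start past 30 seconds of drift.
pub const MAX_DRIFT_SECS: u64 = 30;

/// Formats a skew in seconds as minutes and seconds, e.g. 872 -> "14m32s".
pub fn format_skew(secs: u64) -> String {
    format!("{}m{:02}s", secs / 60, secs % 60)
}

/// `offset_secs` is local clock minus reference clock; the sign does not
/// matter for JWT validation, so only the magnitude is compared.
pub fn check_drift(offset_secs: i64, max_drift_secs: u64) -> Result<(), String> {
    let skew = offset_secs.unsigned_abs();
    if skew > max_drift_secs {
        Err(format!(
            "system clock skew is {}; JWT validation will fail. Enable systemd-timesyncd or chrony.",
            format_skew(skew)
        ))
    } else {
        Ok(())
    }
}
```

The same function would serve the 6-hour re-check: an Err mid-run becomes the DeviceInfo health flag instead of a refusal to start.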
Done when
- Test in harmony-fleet-e2e runs against a libvirt VM with clock forced 5 minutes off.
- Agent refuses to start with the expected error message.
- Recovery: fix the clock → agent comes up clean.
Chapter 10 — Phase 1 smoke wiring (#10)
Goal: real fleet Scores carry real smoke tests. The Phase 0 contract becomes load-bearing.
Scope
- HttpHealthy probe: GET a URL, expect 2xx, optional response-body-contains assertion.
- K8sPodReady probe: kube client lookup for pod readiness condition.
- NatsKvKeyExists probe: KV bucket + key, optional value-deserializes-to-T assertion.
- FleetOperatorSmokeTest: pairs with FleetOperatorScore. Operator pod ready + /healthz returns 200 + can write to device-info KV.
- FleetAgentSmokeTest: pairs with FleetAgentScore. Agent pod ready + heartbeat published to KV within 30s.
- HarmonyEvent::SmokeStage{Started,Finished,Skipped} variants (additive) so the dashboard can render the live pipeline.
- Dashboard pipeline view: maud renderer subscribing to instrumentation events.
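The pass/fail rule of the HttpHealthy probe can be sketched apart from the HTTP client, which keeps it unit-testable. HttpExpectation and http_healthy are hypothetical names, not part of the existing Probe companion.

```rust
/// Hypothetical probe configuration: the optional body assertion.
pub struct HttpExpectation {
    pub body_contains: Option<String>,
}

/// Evaluates an already-received response: the status must be 2xx and, if
/// configured, the body must contain the expected substring. The error text
/// names the failing condition so the dashboard can show it.
pub fn http_healthy(status: u16, body: &str, expect: &HttpExpectation) -> Result<(), String> {
    if !(200..300).contains(&status) {
        return Err(format!("expected 2xx, got {status}"));
    }
    match &expect.body_contains {
        Some(needle) if !body.contains(needle.as_str()) => {
            Err(format!("response body does not contain {needle:?}"))
        }
        _ => Ok(()),
    }
}
```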
Sequencing within this chapter (strict order)
1. HarmonyEvent variants: one-line additive change to harmony/src/domain/instrumentation.rs.
2. Probes one at a time: HTTP, K8sPodReady, NatsKvKeyExists. Each: unit tests + an integration test against the staging cluster.
3. FleetOperatorSmokeTest composing the above.
4. FleetAgentSmokeTest.
5. Dashboard renderer last: once the events are flowing, UI is mostly maud + htmx polling.
Done when
- deploy_with_smoke(FleetOperatorScore, FleetOperatorSmokeTest, ...) returns successfully against staging.
- Dashboard shows the live pipeline.
- Deliberate breakage (point the operator's helm chart at a bad image) → smoke fails visibly, failing probe named on dashboard.
Chapter 11 — CI yaml minimization (#11, longer-term)
Pulled out of the chapter-by-chapter v0.3 work.
- Frame: workflow yaml files in .gitea/workflows/ (4 files, ~235 LOC) should hold only what Gitea Actions needs for job discovery + parallel viz. Job bodies are one-line calls into portable scripts.
Direction
- Build out a harmony-ci Rust CLI crate. Commands like harmony-ci build composer-linux, harmony-ci publish operator-image, harmony-ci check.
- Each workflow yaml job becomes run: cargo run -p harmony-ci -- <command>.
- Scripts must run identically from a developer's laptop.
Not in v0.3
- Multi-day effort; doesn't block the customer.
- Slot when bandwidth allows.
- Opportunistically convert when touching a workflow file for other reasons.
Chapter 12 — NATS callout CI hardening (#12, minimal)
- nats/callout is a low-churn crate that works today.
- Workspace-wide cargo test in .gitea/workflows/check.yml covers the non-ignored tests.
- Four #[ignore]'d integration tests in nats/integration-test-callout/tests/callout_e2e.rs need podman + a NATS image pull in the runner.
Direction
- Don't add CI infra in v0.3 just to run these.
- When a runner with podman + image pull exists for other reasons (e2e harness, system upgrade test matrix), add the callout integration tests to it.
- Until then: keep current workspace-wide coverage.
Chapter 14 — Device blacklist enforcement + un-blacklist (#14)
Goal: blacklisting a device actually locks it out of the fleet, and
the action is reversible. Today blacklist_device only patches a
cosmetic k8s label (fleet.nationtech.io/blacklisted) — nothing
enforces it; the device keeps its NATS connection and credentials.
Decision (2026-06-03) — Zitadel-deactivate, honest about "immediate"
Core NATS has no force-disconnect, and the callout issues 1-hour user JWTs with no revocation check. So "kill the connection now" is not free. The chosen v0.3 cut:
- Deactivate the Zitadel machine user (device-<id>). New management-API call (not yet implemented; see research: no deactivate/lock/delete wired today). After this, the device cannot re-authenticate: its next NATS callout (≤1h, on token renewal/reconnect) fails and it drops off. The agent stays connected until then.
- Keep the k8s label for dashboard state + un-blacklist.
- Confirmation UX (load-bearing): the blacklist confirm dialog must tell the operator plainly: "The device's access is revoked, but its current NATS connection persists until its token renews (≤1h). For immediate effect, restart the device, or restart NATS — note restarting NATS disconnects every device." No silent false "it's gone now."
- Un-blacklist = reactivate the Zitadel user + clear the label. Add FleetService::unblacklist_device (no reverse exists today; kubectl only) and a dashboard button on blacklisted devices.
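The timing behaviour of this decision can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the implementation: `Fleet`, `callout_authorize`, and `is_connected` are hypothetical names, the Zitadel user and k8s label are reduced to sets, and an existing connection is modelled as lasting exactly until its 1-hour token expires.

```rust
use std::collections::{HashMap, HashSet};

/// 1-hour user JWTs, as issued by the callout today.
const JWT_TTL_SECS: u64 = 3600;

/// Hypothetical in-memory stand-in for Zitadel + the k8s label + NATS.
#[derive(Default)]
struct Fleet {
    deactivated_users: HashSet<String>, // deactivated Zitadel machine users
    blacklisted_label: HashSet<String>, // devices carrying the blacklist label
    token_expiry: HashMap<String, u64>, // device id -> expiry of current JWT
}

fn machine_user(device_id: &str) -> String {
    format!("device-{device_id}")
}

impl Fleet {
    /// Callout decision: refuse deactivated users, otherwise issue a token
    /// and return its expiry time.
    fn callout_authorize(&mut self, device_id: &str, now: u64) -> Result<u64, String> {
        if self.deactivated_users.contains(&machine_user(device_id)) {
            return Err(format!("{} is deactivated", machine_user(device_id)));
        }
        let expiry = now + JWT_TTL_SECS;
        self.token_expiry.insert(device_id.to_string(), expiry);
        Ok(expiry)
    }

    /// Deactivate the machine user and set the label. Does NOT touch the
    /// existing connection: there is no force-disconnect in this cut.
    fn blacklist_device(&mut self, device_id: &str) {
        self.deactivated_users.insert(machine_user(device_id));
        self.blacklisted_label.insert(device_id.to_string());
    }

    /// Reverse of `blacklist_device`: reactivate the user, clear the label.
    fn unblacklist_device(&mut self, device_id: &str) {
        self.deactivated_users.remove(&machine_user(device_id));
        self.blacklisted_label.remove(device_id);
    }

    /// Modelled as: connected while the last issued token is unexpired.
    fn is_connected(&self, device_id: &str, now: u64) -> bool {
        self.token_expiry.get(device_id).map_or(false, |exp| now < *exp)
    }
}
```

In this model a device authorized at t=0 and blacklisted right after is still connected at t=1800, is refused when it tries to renew at t=3600, and can authorize again once un-blacklisted. That is the window the confirm dialog has to spell out.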
Explicitly NOT in this cut (and why)
- True instant kill of a hostile device. Would need NATS-level forced disconnect (monitoring + custom mechanism) — large scope, deferred. The veterinary use case is decommission/quarantine, not adversarial, so deactivate-then-reauth-fails is acceptable.
- Shortening the JWT TTL. Considered (bounds exposure) but rejected for now — it's a callout-wide change affecting every device's churn; revisit if exposure window matters.
Done when
- Blacklist deactivates the Zitadel user; a fresh enroll/reconnect by that device is refused by the callout (tested).
- Un-blacklist reactivates + clears the label; device reconnects.
- Confirm dialog states the persistence caveat verbatim.
Chapter 15 — Container auto-start after reboot (#15, ✅ shipped)
Bug, not a feature. The agent watched desired-state with
bucket.watch() (DeliverPolicy::New) — only future Puts, never a replay
of existing keys. On any restart (incl. device reboot) the reconciler's
in-memory cache started empty and the 30s ground-truth tick had nothing to
reconcile, so reboot-stopped containers never came back unless the operator
rewrote the KV. Fixed by switching to watch_with_history()
(DeliverPolicy::LastPerSubject): the agent replays current desired-state
on startup and the idempotent apply path restarts the containers. The user
podman.socket is already enable --now + lingered, so it survives reboot.
Offline-boot resilience (containers up before NATS is reachable) is not
covered — see #16; would need podman-restart.service + restart=always.
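The difference between the two delivery policies can be shown with a simplified model. This sketch does not call the NATS client; `Put`, `build_cache`, and the local `DeliverPolicy` enum are hypothetical stand-ins that only mimic which KV entries a freshly started watcher would see.

```rust
use std::collections::BTreeMap;

/// Simplified model of the two delivery policies named above.
#[derive(Clone, Copy)]
enum DeliverPolicy {
    New,
    LastPerSubject,
}

/// One KV put, with the bucket sequence number it was written at.
struct Put {
    seq: u64,
    key: &'static str,
    value: &'static str,
}

/// Builds the desired-state cache a watcher would hold if it subscribed
/// when the bucket's last sequence was `subscribed_at`.
fn build_cache(
    log: &[Put],
    subscribed_at: u64,
    policy: DeliverPolicy,
) -> BTreeMap<&'static str, &'static str> {
    let mut cache = BTreeMap::new();
    for put in log {
        let delivered = match policy {
            // Only puts made after the subscription are delivered.
            DeliverPolicy::New => put.seq > subscribed_at,
            // The latest value per key is delivered, then new puts.
            // Replaying the log in order and overwriting per key leaves
            // the cache in the same final state.
            DeliverPolicy::LastPerSubject => true,
        };
        if delivered {
            cache.insert(put.key, put.value);
        }
    }
    cache
}
```

An agent that restarts after the last put sees an empty cache under `New`, which is the bug described above; under `LastPerSubject` it sees the current value for every key and the idempotent apply path has something to reconcile.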
Out of scope for v0.3 (deferred deliberately)
| Item | Target | Why deferred |
|---|---|---|
| Deployment-level auto-rollback | maybe never | Customer asked for roll-forward only. |
| System-upgrade LVM-snapshot rollback half | v0.4 | Push to prod first; widen scope after. |
| Live log tailing (streaming) | v0.4 | Chapter 3 ships sync getLogs; live tail builds on it. |
| Deployment dependencies (cross-deploy ordering) | TBD | Init containers cover the common case; wait for customer ask. |
| Secrets via Zitadel + OpenBao | v0.3.x | Blocked on harmony_secret work. |
| Containerized agent (podman instead of systemd) | v0.4+ | Self-upgrade protocol matures first on systemd. |
| Operator HA (active/active or active/passive) | TBD | One pod sufficient for v0.3; scale-out when fleet size demands. |
| Multi-tenant fleet isolation tests | v0.4 | Callout permissions cover the mechanism; cross-tenant smoke later. |
Open questions
These don't block starting v0.3 work but need resolution before the relevant chapter completes.
- Q1 (Chapter 4): Binary distribution mechanism for agent upgrades. Gitea releases vs OCI artifacts vs something else.
- Q2 (Chapter 2): Snapshot the aggregate to KV? Faster recovery vs invalidation complexity.
- Q3 (Chapter 7): Canary test matrix? Concretely: which Pi models, which base images, which apt sources.
- Q4 (Chapters 5 + 10): Sequencing of Chapter 10 vs Chapter 5. Both benefit from smoke; right answer might be to ship Phase 1 smoke during Chapter 5 so upgrade gates on it. Decide when starting Chapter 5.
- Q5 (cross-cutting): One operator pod or active/passive? Customer's fleet size answers this; ask before Chapter 2 starts.
When v0.3 is done
- All chapters 1–10 merged.
- A real customer Deployment runs on a real Pi in a real basement.
- The dashboard shows live status and logs.
- An agent upgrade has been driven through the full protocol successfully (and a failure path tested).
- A system upgrade has been driven through the full protocol on a canary.
v0.4 picks up the deferred items in priority order.