Single dense authoritative plan superseding v0_3_plan.md and v0_3_plan_recovered.md: status table, sequencing, ADR map, and the four open decisions that gate work (agent runtime model, secret scoping axis, operator topology, binary distribution). Reflects shipped state (role gate, exec+logs+selector, stale health, auto-start reboot fix) and the in-scope agent+system upgrade auto-rollback per ADR-022 / ADR-0042.
Fleet Platform v0.3 — consolidated plan
Authoritative go-forward plan, post-demo (2026-06-03). Supersedes
v0_3_plan.md and v0_3_plan_recovered.md for planning (kept for history).
Frame: v0.1 proved the shape · v0.2 locked the brick · v0.3 makes the brick
safe to hand a customer running production workloads on Pis.
Legend: ✅ shipped · 🟡 partial · 🔴 not started · ⏸ deferred (version in note).
Status
| # | Feature | St | Governs / branch | Note |
|---|---|---|---|---|
| 1 | Dashboard role gate (fleet-admin) | 🟡 | this branch | Gate works; 4 follow-ups owed (see Ch1). |
| 2 | Operator restart + aggregator recovery | 🔴 | new branch | Converge from NATS KV alone; no customer-visible "unknown state". |
| 3 | Deployment logs + remote exec | 🟡 | this branch | One-shot podman logs tail + exec + per-deployment selector shipped. Companion-trait refactor + live streaming → v0.4. |
| 4 | Agent self-upgrade + auto-rollback | 🔴 | ADR-022 | ADR is authoritative + already rolls back. Build it; drop the old no-rollback cut. |
| 5 | Graceful deployment upgrade (roll-forward) | 🔴 | new branch | SIGTERM→grace→SIGKILL→start-new. App versions roll forward only (customer ask). |
| 6 | Init containers in PodmanV0Score | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
| 7 | System (OS) upgrade + auto-rollback | 🔴 | ADR-0042→025 | LVM thin snapshot + two-tier watchdog. Rollback now in scope. |
| 8 | Deployment secrets from OpenBao | 🔴 | blocked: machine-identity branch | Agent-fetch via harmony_config. Scoping axis undecided (Ch8). |
| 9 | Agent time-drift guard | 🔴 | new branch | NTP check; refuse JWT ops if \|skew\| > 30s. |
| 10 | Phase-1 smoke wiring | 🔴 | new branch | HTTP / K8sPodReady / NatsKv probes on real Scores. |
| 14 | Blacklist enforcement + un-blacklist | 🔴 | new branch | Today cosmetic. Zitadel-deactivate; no instant NATS kill (Ch14). |
| 15 | Container auto-start after reboot | ✅ | this branch | watch→watch_with_history; agent replays desired-state on boot. |
| 16 | Container auto-start while offline at boot | ⏸ v0.4 | follow-on #15 | Needs user podman-restart.service + restart=always. |
| 11/12 | CI yaml minimization · callout CI | ⏸ v0.4 | — | Opportunistic; doesn't block customer. |
| 13 | App log streaming through NATS | ⏸ v0.4 | follow-on #3 | Live tail; #3 ships the sync getLogs. |
Shipped foundation (this branch, not re-listed above): real dashboard data
(CQRS — operator writes CRs, UI reads them), SSO via ConfigClient, TLS
ingress on fleet-stg.<domain>, stale-vs-healthy deployment health, dev
build/deploy + enroll loops.
Sequencing
- #2 operator recovery — customer can't tolerate "operator restarted, state unknown".
- #14 blacklist enforcement — smallest of the heavy items; closes a real security gap.
- #6 init containers — self-contained, design locked.
- #4 agent upgrade — gated on Decision D1. Parallel-safe with #2/#3.
- #5 graceful deployment upgrade — pairs naturally with #6.
- #9 time-drift — small, slot between heavier items.
- #7 system upgrade — gated on D1; builds on #4's pattern.
- #10 smoke wiring — after the above, so probes cover real surfaces.
- #8 secrets — gated on D2 + machine-identity branch; promote when a customer needs credentials.
Open decisions (block the gated items)
- D1 — Agent runtime model. ADR-022 (+#4/#7 plans) assume a systemd binary (symlink swap); ADR-0042 asserts a privileged podman container. Contradiction. Recommend systemd-binary (simpler, already built/planned); fix ADR-0042's premise. Blocks #4, #7.
- D2 — Secret scoping axis. Per-deployment Zitadel claim (single secret copy, but token re-login on every new deployment) vs device-stable policy templating (no re-login, but per-device secret copies) vs operator-minted response-wrapped tokens (no claim race; operator scopes at mint). Also confirm every device can reach OpenBao — a single fact that can invalidate agent-fetch. Blocks #8.
- D3 — Operator topology. One pod vs active/passive. Fleet size answers it. Blocks #2's HA shape.
- D4 — Agent binary distribution. Gitea release asset vs signed OCI vs CDN; SHA-256 pin minimum. Blocks #4 staging step.
ADR map
| ADR | Scope |
|---|---|
| 016 | Agent + NATS JetStream mesh. |
| 020 | Zitadel + OpenBao unified config/secrets — substrate for #8. |
| 022 | Agent self-upgrade (symlink swap, verify, auto-revert). #4. |
| 023 | Deploy architecture (Scores, not handrolled manifests). |
| 024 | Fleet capability decomposition. |
| 0042 → 025 | System OS upgrade + LVM-snapshot rollback. #7. Promote from drafts/ + renumber. |
Chapters (pending work only)
Ch1 — Role gate follow-ups (🟡)
Gate enforces fleet-admin. Owed: (1) request roles via scope, not the Zitadel "User Roles Inside ID Token" checkbox (out-of-band, bit us once); (2) fix docs/guides/operator-dashboard-sso.md step 1b (names the wrong setting); (3) unify the two role extractors — harmony_zitadel_auth::extract_zitadel_roles (object-map only) and callout ZitadelValidator::extract_roles (array + object-map) — into one shared helper; (4) lift require_role to a composable layer only if a second role appears (YAGNI).
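Follow-up (3) could take roughly the shape below. This is a minimal sketch, not the code in either crate: `RolesClaim`, `extract_roles`, and `has_role` are hypothetical names, and the claim is assumed to be already decoded into one of the two shapes mentioned above (array or object-map with role names as keys).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, already-decoded shape of the roles claim.
#[derive(Debug, Clone)]
pub enum RolesClaim {
    /// Array form, e.g. ["fleet-admin", "viewer"].
    Array(Vec<String>),
    /// Object-map form: role names are the keys; the values are ignored here.
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
}

/// One shared helper that accepts both shapes and returns a sorted role set.
pub fn extract_roles(claim: Option<&RolesClaim>) -> BTreeSet<String> {
    match claim {
        // A missing claim means no roles, never an implicit grant.
        None => BTreeSet::new(),
        Some(RolesClaim::Array(roles)) => roles.iter().cloned().collect(),
        Some(RolesClaim::ObjectMap(map)) => map.keys().cloned().collect(),
    }
}

/// Illustrative gate check built on the shared helper.
pub fn has_role(claim: Option<&RolesClaim>, role: &str) -> bool {
    extract_roles(claim).contains(role)
}
```

With a single helper, the dashboard gate and the callout validator cannot disagree about which token shapes carry fleet-admin.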
Ch2 — Operator restart + aggregator recovery (🔴)
Goal: kill/upgrade/reschedule the operator at any time; converge from NATS KV alone, no customer-visible unknown-state window.
Aggregator already cold-rebuilds from KV watches. Scenario-driven: enumerate failure shapes in docs/fleet-operator-recovery-scenarios.md, one regression test in harmony-fleet-e2e per shape, then fix. Cover: partial KV (device offline during reset), two operators racing on rolling deploy, stale KV (CR deleted while operator down). Writes must be byte-deterministic (idempotent multi-writer) or add leader election (→ D3). Surface "recovering/converged" to the UI so the customer sees a banner, not a blank.
Done: scenario doc + green regression per scenario + chaos kill under write load converges < 30s with liveness banner.
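The byte-deterministic write requirement can be illustrated with a small sketch. It assumes a hypothetical aggregate of device id to phase and a made-up line-based encoding; the real wire types live elsewhere and are not shown in this plan. The point is only that a sorted map plus a fixed encoding makes two racing operators produce identical bytes.

```rust
use std::collections::BTreeMap;

/// Hypothetical aggregate: device id -> phase. BTreeMap iterates in key
/// order, so the output does not depend on insertion order.
pub fn render_status(phases: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = String::new();
    for (device, phase) in phases {
        out.push_str(device);
        out.push('=');
        out.push_str(phase);
        out.push('\n');
    }
    out.into_bytes()
}

/// Skip the write when the stored bytes already match, so a second operator
/// converging on the same state produces no churn.
pub fn needs_write(current: Option<&[u8]>, desired: &[u8]) -> bool {
    current != Some(desired)
}
```

If the real encoding cannot be made order-independent this way, that is the signal to fall back to leader election (D3).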
Ch3 — Logs + exec (🟡)
Shipped: agent runs podman logs --tail N <container> and one-shot exec over NATS request/reply; dashboard tail + Refresh + per-deployment selector. Owed: extract a LogQuery<T> Score companion (mirror the smoke contract) so logs attach declaratively; live streaming → v0.4 (#13).
Ch4 — Agent self-upgrade + auto-rollback (🔴) — D1
Build ADR-022 verbatim: versioned binaries at /usr/bin/fleet-agent-v<ver> (never GC'd) + atomic symlink swap; state machine Running→Draining→Staging→Verifying→Cutover-Ready→Stopping; --self-test before cutover; operator owns the stop signal (agent never self-stops). Rollback is the same code path: smoke-fail or new-version heartbeat-timeout → revert symlink, stay on old. Marker in NATS (agent-upgrade/<device_id>); no NATS → no upgrade. Heartbeat carries current_version; operator sets desired_version.
Done: e2e drives vX→vX+1 against a libvirt VM + a corrupt-binary run proves auto-revert; operator sees every phase.
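A compact way to picture the phase sequence and the shared rollback path is a pure transition function. The sketch below is an assumption-laden illustration, not ADR-022's code: `Event`, `Outcome`, and `step` are hypothetical names, and treating a failure signal while in Running as a no-op is this sketch's choice.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase { Running, Draining, Staging, Verifying, CutoverReady, Stopping }

/// Hypothetical inputs to the state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event { StepOk, SelfTestFailed, HeartbeatTimeout }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// Move to (or stay in) the given phase.
    Advance(Phase),
    /// Point the symlink back at the old binary and stay on the old version.
    RevertSymlink,
    /// The operator's stop signal was honoured; the new version takes over.
    OldAgentStopped,
}

pub fn step(phase: Phase, event: Event) -> Outcome {
    use Phase::*;
    match (phase, event) {
        // No upgrade in flight: nothing to revert (assumption of this sketch).
        (Running, Event::SelfTestFailed | Event::HeartbeatTimeout) => Outcome::Advance(Running),
        // Any failure during an upgrade takes the single rollback path.
        (_, Event::SelfTestFailed | Event::HeartbeatTimeout) => Outcome::RevertSymlink,
        (Running, Event::StepOk) => Outcome::Advance(Draining),
        (Draining, Event::StepOk) => Outcome::Advance(Staging),
        (Staging, Event::StepOk) => Outcome::Advance(Verifying),
        (Verifying, Event::StepOk) => Outcome::Advance(CutoverReady),
        (CutoverReady, Event::StepOk) => Outcome::Advance(Stopping),
        (Stopping, Event::StepOk) => Outcome::OldAgentStopped,
    }
}
```

Keeping the function pure makes each phase easy to assert on in the e2e run, since the operator is expected to observe every one of them.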
Ch5 — Graceful deployment upgrade, roll-forward only (🔴)
Add lifecycle: Option<LifecyclePolicy> (stop_signal=SIGTERM, grace=30s, sigkill_fallback=true) to PodmanV0Score. On image/config change: Upgrading phase → signal old → wait grace → SIGKILL → start new → Running, or Failed (no revert — app versions roll forward only; customer edits spec). Single container per deployment per device; brief cutover downtime accepted.
Done: e2e v1→v2→v3 with controlled failures; dashboard reflects each step.
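The cutover order can be sketched as a pure planning function. The field names and defaults of `LifecyclePolicy` come from the text above; `Step`, `Phase`, and `plan_cutover` are hypothetical, and failing the deployment when the old container outlives the grace period with `sigkill_fallback` disabled is an assumption of this sketch.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LifecyclePolicy {
    pub stop_signal: String,
    pub grace: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    fn default() -> Self {
        Self {
            stop_signal: "SIGTERM".to_string(),
            grace: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

/// Hypothetical cutover steps, in execution order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Step { Signal(String), WaitUpTo(Duration), Kill, StartNew }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase { Running, Failed }

/// Returns the steps taken and the resulting phase. There is no revert
/// branch: app versions roll forward only.
pub fn plan_cutover(
    policy: &LifecyclePolicy,
    old_exited_in_grace: bool,
    new_started: bool,
) -> (Vec<Step>, Phase) {
    let mut steps = vec![
        Step::Signal(policy.stop_signal.clone()),
        Step::WaitUpTo(policy.grace),
    ];
    if !old_exited_in_grace {
        if !policy.sigkill_fallback {
            // Assumption: without a fallback the old container cannot be replaced.
            return (steps, Phase::Failed);
        }
        steps.push(Step::Kill);
    }
    steps.push(Step::StartNew);
    let phase = if new_started { Phase::Running } else { Phase::Failed };
    (steps, phase)
}
```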
Ch6 — Init containers (🔴)
init_containers: Vec<InitContainer> on PodmanV0Score — ordered, run-to-completion, non-zero/timeout fails the deployment. K8s-shaped mental model. Contract (document loudly): must be idempotent — they rerun on every fresh-main-container reconcile (reboot, upgrade). Score-builder lint warns on common non-idempotent patterns.
Done: e2e — init mkdir -p /data && touch /data/init then main asserts the file; two-step ordering tested.
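The ordered, run-to-completion contract can be sketched with the container runtime abstracted behind a closure. The `InitContainer` fields, `InitOutcome`, and `run_init_containers` are hypothetical; the real struct on PodmanV0Score may carry more (command, timeout, mounts).

```rust
/// Hypothetical minimal shape; only what the sketch needs.
#[derive(Debug, Clone)]
pub struct InitContainer {
    pub name: String,
    pub image: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InitOutcome {
    Exited(i32),
    TimedOut,
}

/// Runs init containers strictly in order. The first non-zero exit or timeout
/// fails the deployment and later init containers are never started.
pub fn run_init_containers<F>(inits: &[InitContainer], mut run: F) -> Result<(), String>
where
    F: FnMut(&InitContainer) -> InitOutcome,
{
    for init in inits {
        match run(init) {
            InitOutcome::Exited(0) => continue,
            InitOutcome::Exited(code) => {
                return Err(format!("init container {} exited with code {}", init.name, code));
            }
            InitOutcome::TimedOut => {
                return Err(format!("init container {} timed out", init.name));
            }
        }
    }
    Ok(())
}
```

Because the whole list reruns on every fresh-main-container reconcile, the closure passed as `run` may be invoked many times per container over a device's life, which is why the idempotency contract matters.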
Ch7 — System OS upgrade + auto-rollback (🔴) — D1
Build ADR-0042 in full (rollback included). Covers soft fail (boots, can't reach control plane → userspace timer merges snapshot) and hard fail (unbootable root → initramfs bootcount hook merges snapshot) — hard-fail coverage is mandatory for customers running out-of-tree kernel modules.
- One-time provisioning (scripted, idempotent, off-line-of-service): root partition → PV/VG/LV (ext4 unchanged), initramfs + LVM + local-top hook, cmdline.txt → root=/dev/mapper/vg0-root, BCM2835 watchdog.
- Per-upgrade: upgrade-pending flag → lvcreate thin vg0/root_preupgrade → bootcount=0 on FAT /boot → apt full-upgrade → reboot.
- Resolve: initramfs increments bootcount, > N (=2–3) → lvconvert --merge + reboot; userspace check-in (10 min) success → reset + lvremove, else merge + reboot; hardware watchdog catches hangs.
- Hard constraints: merge discards everything written in the probation window, so must-persist state lives outside the snapshot (separate LV or control-plane DB); size the thin pool so snapshot + churn can't exhaust it.

Done: canary matrix green: clean upgrade, soft-fail, hard-fail (unbootable kernel), total hang.
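The two-tier resolve logic can be written down as two pure decisions, one per tier. This is an illustrative model only: the real hard-fail tier runs as an initramfs hook, not Rust, and `Action`, `initramfs_decision`, and `userspace_decision` are hypothetical names. The threshold is a parameter because the plan leaves N at 2 to 3.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    BootNormally,
    /// lvconvert --merge of the pre-upgrade snapshot, then reboot.
    MergeSnapshotAndReboot,
    /// Upgrade accepted: reset bootcount and lvremove the snapshot.
    ResetAndRemoveSnapshot,
}

/// Hard-fail tier (initramfs): increment bootcount while an upgrade is
/// pending and roll back once it exceeds `max_boots`.
/// Returns the new bootcount and the action to take.
pub fn initramfs_decision(upgrade_pending: bool, bootcount: u32, max_boots: u32) -> (u32, Action) {
    if !upgrade_pending {
        return (bootcount, Action::BootNormally);
    }
    let next = bootcount.saturating_add(1);
    if next > max_boots {
        (next, Action::MergeSnapshotAndReboot)
    } else {
        (next, Action::BootNormally)
    }
}

/// Soft-fail tier (userspace timer): the system booted, so the only question
/// is whether it checked in with the control plane inside the window.
pub fn userspace_decision(checked_in_within_window: bool) -> Action {
    if checked_in_within_window {
        Action::ResetAndRemoveSnapshot
    } else {
        Action::MergeSnapshotAndReboot
    }
}
```

A total hang reaches neither function; that case is left to the hardware watchdog, which forces a reboot and therefore another bootcount increment.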
Ch8 — Deployment secrets from OpenBao (🔴) — D2, blocked on machine-identity branch
Goal: a Deployment references a secret by name; the device's container gets the value at apply time; the secret never sits in NATS KV; a device reads only its deployments' secrets.
Direction (chosen): agent fetches from OpenBao via harmony_config, authenticating with its existing Zitadel identity. EnvVar gains a valueFrom ref; inline literals stay. Resolution in the reconciler before ensure_service_running; fetch failure → legible Phase::Failed, never silent empty env. Refresh: periodic re-resolve on the tick (restart container only on change) + admin-triggered deployment restart.
Synergy: Zitadel-deactivate (Ch14) kills NATS and OpenBao access at once.
Rejected: secret encrypted-per-device in NATS KV — re-encrypt/fan-out storm on rotation, sync drift.
Unresolved (D2): scoping axis — per-deployment claim (re-login on every new deployment; note lease-renew ≠ re-login) vs device-stable policy templating vs operator-minted response-wrapped token. Multi-deployment scoping likely needs OIDC-groups→policies, not array-claim path templating. Confirm device→OpenBao reachability first.
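Independent of how D2 lands, the resolution step in the reconciler might look like the sketch below. Everything here is hypothetical: `EnvValue`, `SecretSource`, and `resolve_env` are illustrative names, the in-memory source stands in for the OpenBao fetch through harmony_config, and no real client API is shown.

```rust
use std::collections::BTreeMap;

/// Inline literals stay; a secret is referenced by name only.
#[derive(Debug, Clone)]
pub enum EnvValue {
    Literal(String),
    SecretRef(String),
}

#[derive(Debug, Clone)]
pub struct EnvVar {
    pub name: String,
    pub value: EnvValue,
}

/// Stand-in for the agent-side fetch; the real one would call OpenBao.
pub trait SecretSource {
    fn fetch(&self, secret_name: &str) -> Result<String, String>;
}

/// In-memory fake, for illustration and tests only.
pub struct InMemorySource(pub BTreeMap<String, String>);

impl SecretSource for InMemorySource {
    fn fetch(&self, secret_name: &str) -> Result<String, String> {
        self.0
            .get(secret_name)
            .cloned()
            .ok_or_else(|| format!("secret {} not found", secret_name))
    }
}

/// Resolves every variable before the container starts. Any fetch failure
/// aborts with a readable error; an empty value is never substituted.
pub fn resolve_env(
    vars: &[EnvVar],
    source: &dyn SecretSource,
) -> Result<BTreeMap<String, String>, String> {
    let mut resolved = BTreeMap::new();
    for var in vars {
        let value = match &var.value {
            EnvValue::Literal(v) => v.clone(),
            EnvValue::SecretRef(name) => source
                .fetch(name)
                .map_err(|e| format!("env {}: {}", var.name, e))?,
        };
        resolved.insert(var.name.clone(), value);
    }
    Ok(resolved)
}
```

The `Err` branch is what the reconciler would surface as Phase::Failed. Comparing the resolved map between ticks is one possible way to implement "restart container only on change".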
Ch9 — Agent time-drift guard (🔴)
Startup + 6-hourly NTP-style check; refuse to start on |drift| > 30s with a specific message ("clock skew Xs; JWT validation will fail; enable systemd-timesyncd/chrony"); mid-run drift → DeviceInfo health flag → dashboard.
Done: e2e VM with clock forced 5 min off refuses start; fixing the clock recovers clean.
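The guard itself reduces to a signed comparison against the 30s threshold. In this sketch, `check_drift` and `MAX_DRIFT_SECS` are hypothetical names and both timestamps are passed in as Unix seconds; how the reference time is obtained (the NTP-style query) is deliberately left out.

```rust
/// Threshold from the plan: refuse when |drift| exceeds 30 seconds.
pub const MAX_DRIFT_SECS: i64 = 30;

/// Compares the local clock with a trusted reference, both in Unix seconds.
/// A positive drift means the local clock is ahead of the reference.
pub fn check_drift(local_unix_secs: i64, reference_unix_secs: i64) -> Result<(), String> {
    let drift = local_unix_secs - reference_unix_secs;
    if drift.abs() > MAX_DRIFT_SECS {
        Err(format!(
            "clock skew {}s; JWT validation will fail; enable systemd-timesyncd/chrony",
            drift
        ))
    } else {
        Ok(())
    }
}
```

At startup an `Err` would abort the agent with that message. For the 6-hourly check the same result could set the DeviceInfo health flag instead of stopping a running agent.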
Ch10 — Phase-1 smoke wiring (🔴)
Make the Phase-0 smoke contract load-bearing. Add HttpHealthy, K8sPodReady, NatsKvKeyExists probes; FleetOperatorSmokeTest + FleetAgentSmokeTest; additive HarmonyEvent::SmokeStage{…}; dashboard pipeline view. Strict order: events → probes (one at a time) → operator suite → agent suite → renderer.
Done: deploy_with_smoke(FleetOperatorScore, …) green vs staging; deliberate bad image fails visibly with the failing probe named.
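The "failing probe named" requirement suggests a runner along these lines. This is a sketch under assumptions: the `Probe` trait, `FixedProbe`, and `run_probes` are hypothetical and are not the Phase-0 smoke contract's actual types; real probes would perform HTTP, Kubernetes, or NATS KV checks instead of returning a canned result.

```rust
/// Hypothetical probe interface: a stable name plus a check.
pub trait Probe {
    fn name(&self) -> &str;
    fn check(&self) -> Result<(), String>;
}

/// Canned probe for illustration and tests.
pub struct FixedProbe {
    pub name: &'static str,
    pub result: Result<(), String>,
}

impl Probe for FixedProbe {
    fn name(&self) -> &str {
        self.name
    }
    fn check(&self) -> Result<(), String> {
        self.result.clone()
    }
}

/// Runs probes one at a time, in order, and stops at the first failure so
/// the error always names exactly one failing probe.
pub fn run_probes(probes: &[&dyn Probe]) -> Result<(), String> {
    for probe in probes {
        probe
            .check()
            .map_err(|reason| format!("probe {} failed: {}", probe.name(), reason))?;
    }
    Ok(())
}
```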
Ch14 — Blacklist enforcement + un-blacklist (🔴)
Today blacklist_device only patches a cosmetic k8s label. Decision: (1) deactivate the Zitadel machine user (device-<id>) — a new management-API call (none wired today); device cannot re-auth, drops at next callout (≤ JWT TTL, currently 1h). (2) Keep the label for state + reversal. (3) Confirmation UX is load-bearing: state plainly that the current NATS connection persists until token renewal (≤1h), and immediate effect needs a device restart or a NATS restart (which disconnects all devices). No false "it's gone now." (4) Un-blacklist = reactivate user + clear label (FleetService::unblacklist_device + button).
Not in scope: true instant kill of a hostile device (needs NATS-level forced disconnect — large); shortening JWT TTL (callout-wide; revisit if exposure matters). Use case is decommission/quarantine, not adversarial.
Done: blacklist → fresh enroll/reconnect refused; un-blacklist → reconnects; confirm dialog states the caveat verbatim.
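The caveat in the confirmation dialog comes down to simple arithmetic on token age. The sketch below assumes the existing NATS session ends no later than token expiry, as described above; `JWT_TTL` mirrors the current 1h value and `worst_case_access` is a hypothetical helper, not existing code.

```rust
use std::time::Duration;

/// Current callout JWT TTL per the plan (1 hour).
pub const JWT_TTL: Duration = Duration::from_secs(60 * 60);

/// Upper bound on how long an already-connected device keeps access after
/// its Zitadel machine user is deactivated, given the age of its token.
pub fn worst_case_access(token_age: Duration, ttl: Duration) -> Duration {
    // Saturates at zero for tokens that are already past their TTL.
    ttl.saturating_sub(token_age)
}
```

For a token issued 20 minutes before the blacklist action, the bound is 40 minutes. When the token age is unknown, the dialog can only promise the full TTL.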
Out of scope (deferred deliberately)
| Item | Target | Why |
|---|---|---|
| Deployment auto-rollback (app versions) | maybe never | Customer asked roll-forward only. (≠ agent/system rollback, which is in scope.) |
| Live log streaming | v0.4 | #3 ships sync getLogs. |
| Offline-boot container autostart | v0.4 | #15 covers the online case. |
| Cross-deployment ordering | TBD | Init containers cover the common case. |
| Containerized agent (vs systemd) | v0.4+ | Settle D1 + mature self-upgrade on systemd first. |
| Operator HA | TBD (D3) | One pod sufficient until fleet size demands. |
Principles (carried forward)
No YAML in framework paths · Scores describe desired state, topologies expose capabilities · cross-boundary wire types in harmony-reconciler-contracts · never ship untested code (real e2e before "done") · prove upstream claims before blaming upstream · thiserror in libs, anyhow only at binary glue · minimal/DRY/no-bloat — the slimmest correct solution wins.
When v0.3 is done
Chapters 1–10 + 14 merged · a real customer Deployment runs on a real Pi · dashboard shows live status + logs · an agent upgrade and a system upgrade have each been driven through the full protocol including a proven auto-rollback on the failure path. v0.4 picks up the deferred list.