Jean-Gabriel Gill-Couture dfcafcfc1f docs(roadmap): consolidated v0.3 fleet plan (post-demo)

Single dense authoritative plan superseding v0_3_plan.md and
v0_3_plan_recovered.md: status table, sequencing, ADR map, and the four
open decisions that gate work (agent runtime model, secret scoping axis,
operator topology, binary distribution). Reflects shipped state (role
gate, exec+logs+selector, stale health, auto-start reboot fix) and the
in-scope agent+system upgrade auto-rollback per ADR-022 / ADR-0042.

2026-06-03 14:54:19 -04:00

Fleet Platform v0.3 — consolidated plan

Authoritative go-forward plan, post-demo (2026-06-03). Supersedes v0_3_plan.md and v0_3_plan_recovered.md for planning (kept for history). Frame: v0.1 proved the shape · v0.2 locked the brick · v0.3 makes the brick safe to hand a customer running production workloads on Pis.

Legend: ✅ shipped · 🟡 partial · 🔴 not started · ⏸ deferred (version in note).

Status

#	Feature	St	Governs / branch	Note
1	Dashboard role gate (`fleet-admin`)	🟡	this branch	Gate works; 4 follow-ups owed (see Ch1).
2	Operator restart + aggregator recovery	🔴	new branch	Converge from NATS KV alone; no customer-visible "unknown state".
3	Deployment logs + remote exec	🟡	this branch	One-shot `podman logs` tail + exec + per-deployment selector shipped. Companion-trait refactor + live streaming → v0.4.
4	Agent self-upgrade + auto-rollback	🔴	ADR-022	ADR-022 is authoritative and already specifies rollback. Build it as written; drop the old no-rollback cut.
5	Graceful deployment upgrade (roll-forward)	🔴	new branch	SIGTERM→grace→SIGKILL→start-new. App versions roll forward only (customer ask).
6	Init containers in `PodmanV0Score`	🔴	new branch	Ordered, run-to-completion, customer guarantees idempotency.
7	System (OS) upgrade + auto-rollback	🔴	ADR-0042→025	LVM thin snapshot + two-tier watchdog. Rollback now in scope.
8	Deployment secrets from OpenBao	🔴	blocked: machine-identity branch	Agent-fetch via `harmony_config`. Scoping axis undecided (Ch8).
9	Agent time-drift guard	🔴	new branch	NTP check; refuse JWT ops if |skew| > 30s.
10	Phase-1 smoke wiring	🔴	new branch	HTTP / K8sPodReady / NatsKv probes on real Scores.
14	Blacklist enforcement + un-blacklist	🔴	new branch	Today cosmetic. Zitadel-deactivate; no instant NATS kill (Ch14).
15	Container auto-start after reboot	✅	this branch	`watch`→`watch_with_history`; agent replays desired-state on boot.
16	Container auto-start while offline at boot	⏸ v0.4	follow-on #15	Needs user `podman-restart.service` + `restart=always`.
11/12	CI yaml minimization · callout CI	⏸ v0.4	—	Opportunistic; doesn't block customer.
13	App log streaming through NATS	⏸ v0.4	follow-on #3	Live tail; #3 ships the sync getLogs.

Shipped foundation (this branch, not re-listed above): real dashboard data (CQRS — operator writes CRs, UI reads them), SSO via ConfigClient, TLS ingress on fleet-stg.<domain>, stale-vs-healthy deployment health, dev build/deploy + enroll loops.

Sequencing

#2 operator recovery — customer can't tolerate "operator restarted, state unknown".
#14 blacklist enforcement — smallest of the heavy items; closes a real security gap.
#6 init containers — self-contained, design locked.
#4 agent upgrade — gated on Decision D1. Parallel-safe with #2/#3.
#5 graceful deployment upgrade — pairs naturally with #6.
#9 time-drift — small, slot between heavier items.
#7 system upgrade — gated on D1; builds on #4's pattern.
#10 smoke wiring — after the above, so probes cover real surfaces.
#8 secrets — gated on D2 + machine-identity branch; promote when a customer needs credentials.

Open decisions (block the gated items)

D1 — Agent runtime model. ADR-022 (+#4/#7 plans) assume a systemd binary (symlink swap); ADR-0042 asserts a privileged podman container. Contradiction. Recommend systemd-binary (simpler, already built/planned); fix ADR-0042's premise. Blocks #4, #7.
D2 — Secret scoping axis. Per-deployment Zitadel claim (single secret copy, but token re-login on every new deployment) vs device-stable policy templating (no re-login, but per-device secret copies) vs operator-minted response-wrapped tokens (no claim race; operator scopes at mint). Also confirm every device can reach OpenBao — a single fact that can invalidate agent-fetch. Blocks #8.
D3 — Operator topology. One pod vs active/passive. Fleet size answers it. Blocks #2's HA shape.
D4 — Agent binary distribution. Gitea release asset vs signed OCI vs CDN; SHA-256 pin minimum. Blocks #4 staging step.

ADR map

ADR	Scope
016	Agent + NATS JetStream mesh.
020	Zitadel + OpenBao unified config/secrets — substrate for #8.
022	Agent self-upgrade (symlink swap, verify, auto-revert). #4.
023	Deploy architecture (Scores, not handrolled manifests).
024	Fleet capability decomposition.
0042 → 025	System OS upgrade + LVM-snapshot rollback. #7. Promote from `drafts/` + renumber.

Chapters (pending work only)

Ch1 — Role gate follow-ups (🟡)

Gate enforces fleet-admin. Owed: (1) request roles via scope, not the Zitadel "User Roles Inside ID Token" checkbox (out-of-band, bit us once); (2) fix docs/guides/operator-dashboard-sso.md step 1b (names the wrong setting); (3) unify the two role extractors — harmony_zitadel_auth::extract_zitadel_roles (object-map only) and callout ZitadelValidator::extract_roles (array + object-map) — into one shared helper; (4) lift require_role to a composable layer only if a second role appears (YAGNI).
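
Follow-up (3) could take roughly the following shape. This is a minimal sketch, not the existing code: `RolesClaim`, `extract_roles` and `has_role` are hypothetical names, and the enum stands in for whatever JSON value type the two real extractors parse. It only shows one helper accepting both the array and the object-map shape.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// The two claim shapes named in the plan. Hypothetical stand-in for the
/// JSON value the real extractors receive.
#[derive(Debug, Clone)]
pub enum RolesClaim {
    /// Flat list of role names.
    Array(Vec<String>),
    /// Role name mapped to per-organisation details (details unused here).
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
}

/// Single shared helper: normalise either shape into a set of role names.
pub fn extract_roles(claim: &RolesClaim) -> BTreeSet<String> {
    match claim {
        RolesClaim::Array(names) => names.iter().cloned().collect(),
        RolesClaim::ObjectMap(map) => map.keys().cloned().collect(),
    }
}

/// Gate check built on the shared helper.
pub fn has_role(claim: &RolesClaim, required: &str) -> bool {
    extract_roles(claim).contains(required)
}
```

With one helper, the dashboard gate and the callout validator would agree by construction on which tokens carry `fleet-admin`.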

Ch2 — Operator restart + aggregator recovery (🔴)

Goal: kill/upgrade/reschedule the operator at any time; converge from NATS KV alone, no customer-visible unknown-state window. Aggregator already cold-rebuilds from KV watches. Scenario-driven: enumerate failure shapes in docs/fleet-operator-recovery-scenarios.md, one regression test in harmony-fleet-e2e per shape, then fix. Cover: partial KV (device offline during reset), two operators racing on rolling deploy, stale KV (CR deleted while operator down). Writes must be byte-deterministic (idempotent multi-writer) or add leader election (→ D3). Surface "recovering/converged" to the UI so the customer sees a banner, not a blank. Done: scenario doc + green regression per scenario + chaos kill under write load converges < 30s with liveness banner.
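
The "byte-deterministic" requirement can be illustrated with a toy encoder. This is a sketch under stated assumptions: the `key=value` line format, `encode_status` and `needs_write` are invented for illustration and are not the operator's wire format. The point is only that a sorted map yields identical bytes regardless of which operator built the value or in what order.

```rust
use std::collections::BTreeMap;

/// Toy encoding (illustration only): one `key=value` line per field.
/// BTreeMap iterates in key order, so the output does not depend on
/// insertion order.
pub fn encode_status(fields: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = String::new();
    for (key, value) in fields {
        out.push_str(key);
        out.push('=');
        out.push_str(value);
        out.push('\n');
    }
    out.into_bytes()
}

/// A write is needed only when the stored bytes differ from the desired
/// bytes, so two operators converging on the same state do not fight.
pub fn needs_write(stored: Option<&[u8]>, desired: &[u8]) -> bool {
    stored != Some(desired)
}
```

If the real serializer cannot guarantee this property, leader election (D3) remains the fallback the chapter names.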

Ch3 — Logs + exec (🟡)

Shipped: agent runs podman logs --tail N <container> and one-shot exec over NATS request/reply; dashboard tail + Refresh + per-deployment selector. Owed: extract a LogQuery<T> Score companion (mirror the smoke contract) so logs attach declaratively; live streaming → v0.4 (#13).

Ch4 — Agent self-upgrade + auto-rollback (🔴) — D1

Build ADR-022 verbatim: versioned binaries at /usr/bin/fleet-agent-v<ver> (never GC'd) + atomic symlink swap; state machine Running→Draining→Staging→Verifying→Cutover-Ready→Stopping; --self-test before cutover; operator owns the stop signal (agent never self-stops). Rollback is the same code path: smoke-fail or new-version heartbeat-timeout → revert symlink, stay on old. Marker in NATS (agent-upgrade/<device_id>); no NATS → no upgrade. Heartbeat carries current_version; operator sets desired_version. Done: e2e drives vX→vX+1 against a libvirt VM + a corrupt-binary run proves auto-revert; operator sees every phase.
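
The phase sequence above can be sketched as a pure transition function. The phase names come from the plan; the `Event` variants, the `Outcome` type and the exact points where rollback fires are assumptions made for illustration, and ADR-022 remains the authority on the real triggers.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Running,
    Draining,
    Staging,
    Verifying,
    CutoverReady,
    Stopping,
}

/// Hypothetical events; the real triggers are defined by ADR-022.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    DesiredVersionChanged,
    Drained,
    Staged,
    SelfTestPassed,
    SelfTestFailed,
    OperatorStop,
    HeartbeatTimeout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Advance(Phase),
    /// Revert the symlink and stay on the old version.
    Rollback,
    Ignored,
}

pub fn step(phase: Phase, event: Event) -> Outcome {
    use Event::*;
    use Phase::*;
    match (phase, event) {
        (Running, DesiredVersionChanged) => Outcome::Advance(Draining),
        (Draining, Drained) => Outcome::Advance(Staging),
        (Staging, Staged) => Outcome::Advance(Verifying),
        (Verifying, SelfTestPassed) => Outcome::Advance(CutoverReady),
        (Verifying, SelfTestFailed) => Outcome::Rollback,
        // Only the operator's stop signal moves the agent to Stopping.
        (CutoverReady, OperatorStop) => Outcome::Advance(Stopping),
        // Assumption: a missing new-version heartbeat is observed here.
        (Stopping, HeartbeatTimeout) => Outcome::Rollback,
        _ => Outcome::Ignored,
    }
}
```

Keeping the function free of I/O makes every transition, including both rollback paths, testable without a VM; the e2e run then only has to prove the wiring.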

Ch5 — Graceful deployment upgrade, roll-forward only (🔴)

Add lifecycle: Option<LifecyclePolicy> (stop_signal=SIGTERM, grace=30s, sigkill_fallback=true) to PodmanV0Score. On image/config change: Upgrading phase → signal old → wait grace → SIGKILL → start new → Running, or Failed (no revert — app versions roll forward only; customer edits spec). Single container per deployment per device; brief cutover downtime accepted. Done: e2e v1→v2→v3 with controlled failures; dashboard reflects each step.
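
A possible shape for the policy and the decision taken once the grace period ends, as a sketch only. The field names and defaults are taken from the paragraph above; `AfterGrace`, `after_grace` and the behaviour when `sigkill_fallback` is false are assumptions the plan does not spell out.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LifecyclePolicy {
    pub stop_signal: String,
    pub grace: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    /// Defaults as listed in the plan: SIGTERM, 30 s grace, SIGKILL fallback.
    fn default() -> Self {
        Self {
            stop_signal: "SIGTERM".to_string(),
            grace: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

/// What the reconciler does once `grace` has elapsed (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AfterGrace {
    StartNew,
    KillThenStartNew,
    /// Assumption: with no fallback and a container that will not exit,
    /// the deployment goes to Failed rather than waiting forever.
    Failed,
}

pub fn after_grace(policy: &LifecyclePolicy, old_still_running: bool) -> AfterGrace {
    match (old_still_running, policy.sigkill_fallback) {
        (false, _) => AfterGrace::StartNew,
        (true, true) => AfterGrace::KillThenStartNew,
        (true, false) => AfterGrace::Failed,
    }
}
```

There is deliberately no "revert to previous image" arm, matching the roll-forward-only rule.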

Ch6 — Init containers (🔴)

init_containers: Vec<InitContainer> on PodmanV0Score — ordered, run-to-completion, non-zero/timeout fails the deployment. K8s-shaped mental model. Contract (document loudly): must be idempotent — they rerun on every fresh-main-container reconcile (reboot, upgrade). Score-builder lint warns on common non-idempotent patterns. Done: e2e — init mkdir -p /data && touch /data/init then main asserts the file; two-step ordering tested.
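
The ordering and fail-fast rule can be sketched independently of podman. The runner closure is a hypothetical seam standing in for the real container invocation, and the struct fields are illustrative rather than the final `InitContainer` definition.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InitContainer {
    pub name: String,
    pub command: Vec<String>,
}

/// Result of running one init container to completion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunResult {
    Exited(i32),
    TimedOut,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InitOutcome {
    AllCompleted,
    Failed { name: String, result: RunResult },
}

/// Runs init containers in declaration order and stops at the first
/// non-zero exit or timeout. `run` is a hypothetical hook for the real
/// container runtime call.
pub fn run_init_containers<F>(inits: &[InitContainer], mut run: F) -> InitOutcome
where
    F: FnMut(&InitContainer) -> RunResult,
{
    for init in inits {
        match run(init) {
            RunResult::Exited(0) => continue,
            result => {
                return InitOutcome::Failed {
                    name: init.name.clone(),
                    result,
                }
            }
        }
    }
    InitOutcome::AllCompleted
}
```

Because this loop reruns on every fresh-main-container reconcile, each command must tolerate being executed again, which is why the e2e example uses `mkdir -p` and `touch`.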

Ch7 — System OS upgrade + auto-rollback (🔴) — D1

Build ADR-0042 in full (rollback included). Covers soft fail (boots, can't reach control plane → userspace timer merges snapshot) and hard fail (unbootable root → initramfs bootcount hook merges snapshot) — hard-fail coverage is mandatory for customers running out-of-tree kernel modules.

One-time provisioning (scripted, idempotent, off-line-of-service): root partition → PV/VG/LV (ext4 unchanged), initramfs + LVM + local-top hook, cmdline.txt → root=/dev/mapper/vg0-root, BCM2835 watchdog.
Per-upgrade: upgrade-pending flag → lvcreate thin vg0/root_preupgrade → bootcount=0 on FAT /boot → apt full-upgrade → reboot.
Resolve: initramfs increments bootcount on every boot; once bootcount exceeds N (N = 2–3) → lvconvert --merge + reboot. Userspace check-in within 10 min: success → reset + lvremove the snapshot; otherwise merge + reboot. The hardware watchdog catches hangs.
Hard constraints: merge discards everything written in the probation window — must-persist state lives outside the snapshot (separate LV or control-plane DB); size the thin pool so snapshot + churn can't exhaust it. Done: canary matrix green — clean upgrade, soft-fail, hard-fail (unbootable kernel), total hang.
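
The two-tier resolve logic above can be written down as two pure decisions. This is a sketch: the real hard-fail tier is an initramfs hook rather than Rust, N is fixed at 3 here as one value from the plan's 2 to 3 range, and all type and function names are hypothetical.

```rust
use std::time::Duration;

/// Assumption: N = 3, picked from the plan's "2-3" range.
pub const MAX_BOOT_ATTEMPTS: u32 = 3;
/// Userspace check-in deadline from the plan (10 min).
pub const CHECK_IN_DEADLINE: Duration = Duration::from_secs(10 * 60);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EarlyBoot {
    Continue { bootcount: u32 },
    MergeAndReboot,
}

/// Hard-fail tier: increment bootcount while an upgrade is pending and
/// merge the snapshot once the count exceeds N.
pub fn initramfs_hook(upgrade_pending: bool, bootcount: u32) -> EarlyBoot {
    if !upgrade_pending {
        return EarlyBoot::Continue { bootcount };
    }
    let next = bootcount.saturating_add(1);
    if next > MAX_BOOT_ATTEMPTS {
        EarlyBoot::MergeAndReboot
    } else {
        EarlyBoot::Continue { bootcount: next }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Probation {
    KeepWaiting,
    /// Reset the counter and remove the snapshot.
    Commit,
    MergeAndReboot,
}

/// Soft-fail tier: commit on a successful control-plane check-in,
/// merge once the deadline passes without one.
pub fn userspace_check(checked_in: bool, since_boot: Duration) -> Probation {
    if checked_in {
        Probation::Commit
    } else if since_boot >= CHECK_IN_DEADLINE {
        Probation::MergeAndReboot
    } else {
        Probation::KeepWaiting
    }
}
```

The canary matrix maps onto these branches: a clean upgrade reaches `Commit`, a soft fail reaches the userspace `MergeAndReboot`, and an unbootable kernel exhausts the boot counter. A total hang never reaches either function and relies on the hardware watchdog to force the next boot.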

Ch8 — Deployment secrets from OpenBao (🔴) — D2, blocked on machine-identity branch

Goal: a Deployment references a secret by name; the device's container gets the value at apply time; the secret never sits in NATS KV; a device reads only its deployments' secrets. Direction (chosen): agent fetches from OpenBao via harmony_config, authenticating with its existing Zitadel identity. EnvVar gains a valueFrom ref; inline literals stay. Resolution in the reconciler before ensure_service_running; fetch failure → legible Phase::Failed, never silent empty env. Refresh: periodic re-resolve on the tick (restart container only on change) + admin-triggered deployment restart. Synergy: Zitadel-deactivate (Ch14) kills NATS and OpenBao access at once. Rejected: secret encrypted-per-device in NATS KV — re-encrypt/fan-out storm on rotation, sync drift. Unresolved (D2): scoping axis — per-deployment claim (re-login on every new deployment; note lease-renew ≠ re-login) vs device-stable policy templating vs operator-minted response-wrapped token. Multi-deployment scoping likely needs OIDC-groups→policies, not array-claim path templating. Confirm device→OpenBao reachability first.
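
The resolution step could look roughly like this. It is a sketch, not the `harmony_config` API: `EnvValue`, `ResolveError`, `resolve_env` and the `fetch` closure are hypothetical, and the error type is hand-written only to keep the example free of external crates. It shows the one rule the chapter insists on, that a failed fetch aborts resolution instead of producing an empty variable.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EnvValue {
    /// Inline literal, unchanged from today.
    Literal(String),
    /// Hypothetical model of a `valueFrom` reference to a named secret.
    SecretRef(String),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EnvVar {
    pub name: String,
    pub value: EnvValue,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ResolveError {
    pub env_name: String,
    pub secret_name: String,
    pub cause: String,
}

impl fmt::Display for ResolveError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "env {}: secret {} could not be fetched: {}",
            self.env_name, self.secret_name, self.cause
        )
    }
}

impl std::error::Error for ResolveError {}

/// Resolves every variable or fails as a whole. `fetch` stands in for the
/// OpenBao lookup; a failure is surfaced, never replaced by "".
pub fn resolve_env<F>(vars: &[EnvVar], fetch: F) -> Result<Vec<(String, String)>, ResolveError>
where
    F: Fn(&str) -> Result<String, String>,
{
    let mut resolved = Vec::with_capacity(vars.len());
    for var in vars {
        let value = match &var.value {
            EnvValue::Literal(v) => v.clone(),
            EnvValue::SecretRef(secret) => fetch(secret.as_str()).map_err(|cause| ResolveError {
                env_name: var.name.clone(),
                secret_name: secret.clone(),
                cause,
            })?,
        };
        resolved.push((var.name.clone(), value));
    }
    Ok(resolved)
}
```

The reconciler would map an `Err` to `Phase::Failed` with the error text. The periodic refresh could compare the resolved list with the previous one and restart the container only when they differ.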

Ch9 — Agent time-drift guard (🔴)

Startup + 6-hourly NTP-style check; refuse to start on |drift| > 30s with a specific message ("clock skew Xs; JWT validation will fail; enable systemd-timesyncd/chrony"); mid-run drift → DeviceInfo health flag → dashboard. Done: e2e VM with clock forced 5 min off refuses start; fixing the clock recovers clean.
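
The threshold check itself is small enough to sketch. How the offset is measured is out of scope here; `check_drift`, `DriftVerdict` and the millisecond input are hypothetical, while the 30 s limit and the message text follow the paragraph above.

```rust
use std::time::Duration;

/// Maximum tolerated clock skew, from the plan.
const MAX_SKEW: Duration = Duration::from_secs(30);

#[derive(Debug, PartialEq, Eq)]
pub enum DriftVerdict {
    Ok,
    Refuse { message: String },
}

/// `offset_ms` is the local clock minus the reference clock, in
/// milliseconds (hypothetical input). The sign is irrelevant: only the
/// magnitude is compared with the limit.
pub fn check_drift(offset_ms: i64) -> DriftVerdict {
    let skew = Duration::from_millis(offset_ms.unsigned_abs());
    if skew > MAX_SKEW {
        DriftVerdict::Refuse {
            message: format!(
                "clock skew {}s; JWT validation will fail; enable systemd-timesyncd/chrony",
                skew.as_secs()
            ),
        }
    } else {
        DriftVerdict::Ok
    }
}
```

At startup a `Refuse` would abort the agent with the message; on the 6-hourly check the same verdict would instead set the `DeviceInfo` health flag.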

Ch10 — Phase-1 smoke wiring (🔴)

Make the Phase-0 smoke contract load-bearing. Add HttpHealthy, K8sPodReady, NatsKvKeyExists probes; FleetOperatorSmokeTest + FleetAgentSmokeTest; additive HarmonyEvent::SmokeStage{…}; dashboard pipeline view. Strict order: events → probes (one at a time) → operator suite → agent suite → renderer. Done: deploy_with_smoke(FleetOperatorScore, …) green vs staging; deliberate bad image fails visibly with the failing probe named.
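
One way to get "fails visibly with the failing probe named" is to run probes in order and carry the probe name in the error. The sketch below assumes a synchronous `Probe` trait; the trait, `SmokeFailure`, `run_probes` and `FixedProbe` are illustrative names and not the existing smoke contract.

```rust
/// Hypothetical probe interface; the real contract may be async.
pub trait Probe {
    fn name(&self) -> &str;
    fn check(&self) -> Result<(), String>;
}

#[derive(Debug, PartialEq, Eq)]
pub struct SmokeFailure {
    pub probe: String,
    pub reason: String,
}

/// Runs probes in order and stops at the first failure, naming it.
pub fn run_probes(probes: &[&dyn Probe]) -> Result<(), SmokeFailure> {
    for probe in probes {
        probe.check().map_err(|reason| SmokeFailure {
            probe: probe.name().to_string(),
            reason,
        })?;
    }
    Ok(())
}

/// Stand-in probe with a canned result, for illustration and tests only.
pub struct FixedProbe {
    pub name: &'static str,
    pub result: Result<(), String>,
}

impl Probe for FixedProbe {
    fn name(&self) -> &str {
        self.name
    }

    fn check(&self) -> Result<(), String> {
        self.result.clone()
    }
}
```

A real `HttpHealthy` or `NatsKvKeyExists` probe would implement the same trait, and the `SmokeFailure` fields give the dashboard pipeline view something concrete to render.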

Ch14 — Blacklist enforcement + un-blacklist (🔴)

Today blacklist_device only patches a cosmetic k8s label. Decision: (1) deactivate the Zitadel machine user (device-<id>) — a new management-API call (none wired today); device cannot re-auth, drops at next callout (≤ JWT TTL, currently 1h). (2) Keep the label for state + reversal. (3) Confirmation UX is load-bearing: state plainly that the current NATS connection persists until token renewal (≤1h), and immediate effect needs a device restart or a NATS restart (which disconnects all devices). No false "it's gone now." (4) Un-blacklist = reactivate user + clear label (FleetService::unblacklist_device + button). Not in scope: true instant kill of a hostile device (needs NATS-level forced disconnect — large); shortening JWT TTL (callout-wide; revisit if exposure matters). Use case is decommission/quarantine, not adversarial. Done: blacklist → fresh enroll/reconnect refused; un-blacklist → reconnects; confirm dialog states the caveat verbatim.

Out of scope (deferred deliberately)

Item	Target	Why
Deployment auto-rollback (app versions)	maybe never	Customer asked roll-forward only. (≠ agent/system rollback, which is in scope.)
Live log streaming	v0.4	#3 ships sync getLogs.
Offline-boot container autostart	v0.4	#15 covers the online case.
Cross-deployment ordering	TBD	Init containers cover the common case.
Containerized agent (vs systemd)	v0.4+	Settle D1 + mature self-upgrade on systemd first.
Operator HA	TBD (D3)	One pod sufficient until fleet size demands.

Principles (carried forward)

No YAML in framework paths · Scores describe desired state, topologies expose capabilities · cross-boundary wire types in harmony-reconciler-contracts · never ship untested code (real e2e before "done") · prove upstream claims before blaming upstream · thiserror in libs, anyhow only at binary glue · minimal/DRY/no-bloat — the slimmest correct solution wins.

When v0.3 is done

Chapters 1–10 + 14 merged · a real customer Deployment runs on a real Pi · dashboard shows live status + logs · an agent upgrade and a system upgrade have each been driven through the full protocol including a proven auto-rollback on the failure path. v0.4 picks up the deferred list.
