harmony/ROADMAP/fleet_platform/v0_3_plan_consolidated.md
Jean-Gabriel Gill-Couture dfcafcfc1f docs(roadmap): consolidated v0.3 fleet plan (post-demo)
Single dense authoritative plan superseding v0_3_plan.md and
v0_3_plan_recovered.md: status table, sequencing, ADR map, and the four
open decisions that gate work (agent runtime model, secret scoping axis,
operator topology, binary distribution). Reflects shipped state (role
gate, exec+logs+selector, stale health, auto-start reboot fix) and the
in-scope agent+system upgrade auto-rollback per ADR-022 / ADR-0042.
2026-06-03 14:54:19 -04:00


Fleet Platform v0.3 — consolidated plan

Authoritative go-forward plan, post-demo (2026-06-03). Supersedes v0_3_plan.md and v0_3_plan_recovered.md for planning (kept for history). Frame: v0.1 proved the shape · v0.2 locked the brick · v0.3 makes the brick safe to hand a customer running production workloads on Pis.

Legend: ✅ shipped · 🟡 partial · 🔴 not started · ⏸ deferred (version in note).

Status

| # | Feature | St | Governs / branch | Note |
|---|---|---|---|---|
| 1 | Dashboard role gate (fleet-admin) | 🟡 | this branch | Gate works; 4 follow-ups owed (see Ch1). |
| 2 | Operator restart + aggregator recovery | 🔴 | new branch | Converge from NATS KV alone; no customer-visible "unknown state". |
| 3 | Deployment logs + remote exec | 🟡 | this branch | One-shot podman logs tail + exec + per-deployment selector shipped. Companion-trait refactor + live streaming → v0.4. |
| 4 | Agent self-upgrade + auto-rollback | 🔴 | ADR-022 | ADR is authoritative + already rolls back. Build it; drop the old no-rollback cut. |
| 5 | Graceful deployment upgrade (roll-forward) | 🔴 | new branch | SIGTERM→grace→SIGKILL→start-new. App versions roll forward only (customer ask). |
| 6 | Init containers in PodmanV0Score | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
| 7 | System (OS) upgrade + auto-rollback | 🔴 | ADR-0042→025 | LVM thin snapshot + two-tier watchdog. Rollback now in scope. |
| 8 | Deployment secrets from OpenBao | 🔴 | blocked: machine-identity branch | Agent-fetch via harmony_config. Scoping axis undecided (Ch8). |
| 9 | Agent time-drift guard | 🔴 | new branch | NTP check; refuse JWT ops if \|skew\| > 30s. |
| 10 | Phase-1 smoke wiring | 🔴 | new branch | HTTP / K8sPodReady / NatsKv probes on real Scores. |
| 14 | Blacklist enforcement + un-blacklist | 🔴 | new branch | Today cosmetic. Zitadel-deactivate; no instant NATS kill (Ch14). |
| 15 | Container auto-start after reboot | ✅ | this branch | watch → watch_with_history; agent replays desired-state on boot. |
| 16 | Container auto-start while offline at boot | ⏸ | v0.4 follow-on #15 | Needs user podman-restart.service + restart=always. |
| 11/12 | CI yaml minimization · callout CI | ⏸ | v0.4 | Opportunistic; doesn't block customer. |
| 13 | App log streaming through NATS | ⏸ | v0.4 follow-on #3 | Live tail; #3 ships the sync getLogs. |

Shipped foundation (this branch, not re-listed above): real dashboard data (CQRS — operator writes CRs, UI reads them), SSO via ConfigClient, TLS ingress on fleet-stg.<domain>, stale-vs-healthy deployment health, dev build/deploy + enroll loops.

Sequencing

  1. #2 operator recovery — customer can't tolerate "operator restarted, state unknown".
  2. #14 blacklist enforcement — smallest of the heavy items; closes a real security gap.
  3. #6 init containers — self-contained, design locked.
  4. #4 agent upgrade (gated on Decision D1). Parallel-safe with #2/#3.
  5. #5 graceful deployment upgrade — pairs naturally with #6.
  6. #9 time-drift — small, slot between heavier items.
  7. #7 system upgrade (gated on D1); builds on #4's pattern.
  8. #10 smoke wiring — after the above, so probes cover real surfaces.
  9. #8 secrets (gated on D2 + machine-identity branch); promote when a customer needs credentials.

Open decisions (block the gated items)

  • D1 — Agent runtime model. ADR-022 (+#4/#7 plans) assume a systemd binary (symlink swap); ADR-0042 asserts a privileged podman container. Contradiction. Recommend systemd-binary (simpler, already built/planned); fix ADR-0042's premise. Blocks #4, #7.
  • D2 — Secret scoping axis. Per-deployment Zitadel claim (single secret copy, but token re-login on every new deployment) vs device-stable policy templating (no re-login, but per-device secret copies) vs operator-minted response-wrapped tokens (no claim race; operator scopes at mint). Also confirm every device can reach OpenBao — a single fact that can invalidate agent-fetch. Blocks #8.
  • D3 — Operator topology. One pod vs active/passive. Fleet size answers it. Blocks #2's HA shape.
  • D4 — Agent binary distribution. Gitea release asset vs signed OCI vs CDN; SHA-256 pin minimum. Blocks #4 staging step.

ADR map

| ADR | Scope |
|---|---|
| 016 | Agent + NATS JetStream mesh. |
| 020 | Zitadel + OpenBao unified config/secrets; substrate for #8. |
| 022 | Agent self-upgrade (symlink swap, verify, auto-revert). #4. |
| 023 | Deploy architecture (Scores, not handrolled manifests). |
| 024 | Fleet capability decomposition. |
| 0042 → 025 | System OS upgrade + LVM-snapshot rollback. #7. Promote from drafts/ + renumber. |

Chapters (pending work only)

Ch1 — Role gate follow-ups (🟡)

Gate enforces fleet-admin. Owed: (1) request roles via scope, not the Zitadel "User Roles Inside ID Token" checkbox (out-of-band, bit us once); (2) fix docs/guides/operator-dashboard-sso.md step 1b (names the wrong setting); (3) unify the two role extractors — harmony_zitadel_auth::extract_zitadel_roles (object-map only) and callout ZitadelValidator::extract_roles (array + object-map) — into one shared helper; (4) lift require_role to a composable layer only if a second role appears (YAGNI).
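
As an illustration of follow-up (3), a single helper can accept both claim shapes that the two extractors handle today. This is a minimal sketch, not the existing code: `RolesClaim`, `extract_roles`, and `has_role` are hypothetical names, and the claim is modelled with a plain enum instead of a JSON value so the example needs only the standard library.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of the two role-claim shapes: a flat array of role
/// names, or an object map keyed by role name.
pub enum RolesClaim {
    Array(Vec<String>),
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
}

/// Returns the role names from either shape, sorted and de-duplicated.
pub fn extract_roles(claim: &RolesClaim) -> Vec<String> {
    let mut roles: Vec<String> = match claim {
        RolesClaim::Array(items) => items.clone(),
        // For the object-map shape only the keys are role names.
        RolesClaim::ObjectMap(map) => map.keys().cloned().collect(),
    };
    roles.sort();
    roles.dedup();
    roles
}

/// True when `required` is among the roles carried by the claim.
pub fn has_role(claim: &RolesClaim, required: &str) -> bool {
    extract_roles(claim).iter().any(|role| role == required)
}
```

With one helper, the dashboard gate and the callout validator would answer `has_role(&claim, "fleet-admin")` identically regardless of which shape the token carries.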

Ch2 — Operator restart + aggregator recovery (🔴)

Goal: kill/upgrade/reschedule the operator at any time; converge from NATS KV alone, no customer-visible unknown-state window. Aggregator already cold-rebuilds from KV watches. Scenario-driven: enumerate failure shapes in docs/fleet-operator-recovery-scenarios.md, one regression test in harmony-fleet-e2e per shape, then fix. Cover: partial KV (device offline during reset), two operators racing on rolling deploy, stale KV (CR deleted while operator down). Writes must be byte-deterministic (idempotent multi-writer) or add leader election (→ D3). Surface "recovering/converged" to the UI so the customer sees a banner, not a blank. Done: scenario doc + green regression per scenario + chaos kill under write load converges < 30s with liveness banner.
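
To make the "byte-deterministic" requirement concrete: two operators that compute the same logical state must emit identical bytes, so a racing double write is a no-op. The sketch below assumes a hypothetical `encode_status` helper and a toy `key=value` line format (not the real wire type); the point is only that an ordered map removes insertion order from the output.

```rust
use std::collections::BTreeMap;

/// Hypothetical encoder: renders a status map as `key=value` lines.
/// `BTreeMap` iterates in sorted key order, so the order in which fields
/// were inserted never leaks into the output bytes.
/// Assumption: keys and values contain no '=' or newline characters.
pub fn encode_status(fields: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = String::new();
    for (key, value) in fields {
        out.push_str(key);
        out.push('=');
        out.push_str(value);
        out.push('\n');
    }
    out.into_bytes()
}
```

A regression test per scenario can then compare the bytes written by two independently started operators instead of comparing parsed structures.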

Ch3 — Logs + exec (🟡)

Shipped: agent runs podman logs --tail N <container> and one-shot exec over NATS request/reply; dashboard tail + Refresh + per-deployment selector. Owed: extract a LogQuery<T> Score companion (mirror the smoke contract) so logs attach declaratively; live streaming → v0.4 (#13).
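
For reference, the one-shot tail described above amounts to shelling out to `podman logs --tail N <container>`. The following is a rough sketch only: `podman_logs_args` and `tail_logs` are hypothetical names, not the agent's actual functions, and error handling is reduced to `io::Result`.

```rust
use std::io;
use std::process::Command;

/// Builds the argument list for `podman logs --tail N <container>`.
pub fn podman_logs_args(container: &str, tail: u32) -> Vec<String> {
    vec![
        "logs".to_string(),
        "--tail".to_string(),
        tail.to_string(),
        container.to_string(),
    ]
}

/// Runs podman once and returns the captured output as text.
/// The container's stderr stream is appended after its stdout stream.
pub fn tail_logs(container: &str, tail: u32) -> io::Result<String> {
    let output = Command::new("podman")
        .args(podman_logs_args(container, tail))
        .output()?;
    let mut text = String::from_utf8_lossy(&output.stdout).into_owned();
    text.push_str(&String::from_utf8_lossy(&output.stderr));
    Ok(text)
}
```

Keeping argument construction separate from execution makes the command shape testable without podman installed.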

Ch4 — Agent self-upgrade + auto-rollback (🔴) — D1

Build ADR-022 verbatim: versioned binaries at /usr/bin/fleet-agent-v<ver> (never GC'd) + atomic symlink swap; state machine Running→Draining→Staging→Verifying→Cutover-Ready→Stopping; --self-test before cutover; operator owns the stop signal (agent never self-stops). Rollback is the same code path: smoke-fail or new-version heartbeat-timeout → revert symlink, stay on old. Marker in NATS (agent-upgrade/<device_id>); no NATS → no upgrade. Heartbeat carries current_version; operator sets desired_version. Done: e2e drives vX→vX+1 against a libvirt VM + a corrupt-binary run proves auto-revert; operator sees every phase.
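
The phase sequence and the "rollback is the same code path" rule can be sketched as a small transition function. The phase names come from the list above; the `UpgradeEvent` type and the choice to model every failure as a return to `Running` on the old version are assumptions made for illustration, and ADR-022 remains the authority on the real transitions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpgradePhase {
    Running,
    Draining,
    Staging,
    Verifying,
    CutoverReady,
    Stopping,
}

/// Hypothetical event type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpgradeEvent {
    /// The current phase completed successfully.
    StepOk,
    /// Self-test or smoke failure, or a new-version heartbeat timeout.
    Failed,
}

/// Returns the next phase. Any failure maps back to `Running`, which here
/// stands for "symlink reverted, old version stays in charge".
pub fn next_phase(phase: UpgradePhase, event: UpgradeEvent) -> UpgradePhase {
    use UpgradePhase::*;
    match (phase, event) {
        (_, UpgradeEvent::Failed) => Running,
        (Running, UpgradeEvent::StepOk) => Draining,
        (Draining, UpgradeEvent::StepOk) => Staging,
        (Staging, UpgradeEvent::StepOk) => Verifying,
        (Verifying, UpgradeEvent::StepOk) => CutoverReady,
        (CutoverReady, UpgradeEvent::StepOk) => Stopping,
        // Terminal for the old process: the operator owns the stop signal.
        (Stopping, UpgradeEvent::StepOk) => Stopping,
    }
}
```

Because the match is exhaustive, adding a phase later forces every transition to be reconsidered at compile time.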

Ch5 — Graceful deployment upgrade, roll-forward only (🔴)

Add lifecycle: Option<LifecyclePolicy> (stop_signal=SIGTERM, grace=30s, sigkill_fallback=true) to PodmanV0Score. On image/config change: Upgrading phase → signal old → wait grace → SIGKILL → start new → Running, or Failed (no revert — app versions roll forward only; customer edits spec). Single container per deployment per device; brief cutover downtime accepted. Done: e2e v1→v2→v3 with controlled failures; dashboard reflects each step.
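
A possible shape for the policy and the stop sequence it implies is shown below. The field names and defaults are taken from the text above; the `StopStep` enum and `stop_plan` function are hypothetical, and the field types (a `String` signal name, a `Duration` grace) are guesses rather than the final `PodmanV0Score` schema.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LifecyclePolicy {
    pub stop_signal: String,
    pub grace: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    /// Defaults stated in the plan: SIGTERM, 30s grace, SIGKILL fallback.
    fn default() -> Self {
        Self {
            stop_signal: "SIGTERM".to_string(),
            grace: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

/// Hypothetical step list for stopping the old container.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StopStep {
    Signal(String),
    Wait(Duration),
    /// Only executed if the container is still alive after the wait.
    Kill,
}

/// Plans the stop sequence that precedes starting the new container.
pub fn stop_plan(policy: &LifecyclePolicy) -> Vec<StopStep> {
    let mut steps = vec![
        StopStep::Signal(policy.stop_signal.clone()),
        StopStep::Wait(policy.grace),
    ];
    if policy.sigkill_fallback {
        steps.push(StopStep::Kill);
    }
    steps
}
```

Since the field is `Option<LifecyclePolicy>`, a reconciler could use `score.lifecycle.clone().unwrap_or_default()` so that existing Scores keep working unchanged.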

Ch6 — Init containers (🔴)

init_containers: Vec<InitContainer> on PodmanV0Score — ordered, run-to-completion, non-zero/timeout fails the deployment. K8s-shaped mental model. Contract (document loudly): must be idempotent — they rerun on every fresh-main-container reconcile (reboot, upgrade). Score-builder lint warns on common non-idempotent patterns. Done: e2e — init mkdir -p /data && touch /data/init then main asserts the file; two-step ordering tested.
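
The ordering and fail-fast semantics can be sketched as follows. `InitContainer` here carries only illustrative fields, and `InitOutcome`, `run_init_containers`, and the `run_one` callback (a stand-in for the real podman invocation) are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InitContainer {
    pub name: String,
    pub image: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InitOutcome {
    Succeeded,
    NonZeroExit(i32),
    TimedOut,
}

/// Runs init containers strictly in order, each to completion.
/// Stops at the first failure and reports which container failed and why;
/// later init containers and the main container are never started.
pub fn run_init_containers<F>(
    inits: &[InitContainer],
    mut run_one: F,
) -> Result<(), (String, InitOutcome)>
where
    F: FnMut(&InitContainer) -> InitOutcome,
{
    for init in inits {
        match run_one(init) {
            InitOutcome::Succeeded => continue,
            failure => return Err((init.name.clone(), failure)),
        }
    }
    Ok(())
}
```

Because this function is re-entered on every fresh-main-container reconcile, each init step must tolerate being run again, which is why `mkdir -p` style commands are the expected pattern.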

Ch7 — System OS upgrade + auto-rollback (🔴) — D1

Build ADR-0042 in full (rollback included). Covers soft fail (boots, can't reach control plane → userspace timer merges snapshot) and hard fail (unbootable root → initramfs bootcount hook merges snapshot) — hard-fail coverage is mandatory for customers running out-of-tree kernel modules.

  • One-time provisioning (scripted, idempotent, off-line-of-service): root partition → PV/VG/LV (ext4 unchanged), initramfs + LVM + local-top hook, cmdline.txt → root=/dev/mapper/vg0-root, BCM2835 watchdog.
  • Per-upgrade: upgrade-pending flag → lvcreate thin vg0/root_preupgrade → bootcount=0 on FAT /boot → apt full-upgrade → reboot.
  • Resolve: initramfs increments bootcount, > N (= 2–3) → lvconvert --merge + reboot; userspace check-in (10 min) success → reset + lvremove, else merge + reboot; hardware watchdog catches hangs.
  • Hard constraints: merge discards everything written in the probation window — must-persist state lives outside the snapshot (separate LV or control-plane DB); size the thin pool so snapshot + churn can't exhaust it. Done: canary matrix green — clean upgrade, soft-fail, hard-fail (unbootable kernel), total hang.
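
The two-tier resolve logic above reduces to two small decisions, sketched here as pure functions. The names `initramfs_decide`, `userspace_decide`, `BootAction`, and `CheckInAction` are hypothetical, and the real hooks would be shell in the initramfs and a userspace timer; this only pins down the intended truth table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BootAction {
    /// Too many boot attempts: merge the pre-upgrade snapshot and reboot.
    MergeSnapshotAndReboot,
    /// Still within probation: persist the new count and continue booting.
    ContinueBoot { new_bootcount: u32 },
}

/// Hard-fail tier (initramfs): increment bootcount, then compare with N.
pub fn initramfs_decide(bootcount: u32, max_attempts: u32) -> BootAction {
    let new_bootcount = bootcount.saturating_add(1);
    if new_bootcount > max_attempts {
        BootAction::MergeSnapshotAndReboot
    } else {
        BootAction::ContinueBoot { new_bootcount }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckInAction {
    /// Upgrade accepted: reset bootcount and remove the snapshot.
    CommitUpgrade,
    /// Soft fail: merge the snapshot and reboot into the old root.
    MergeSnapshotAndReboot,
}

/// Soft-fail tier (userspace): did the device check in with the control
/// plane before the probation window (10 min in the plan) expired?
pub fn userspace_decide(checked_in_within_window: bool) -> CheckInAction {
    if checked_in_within_window {
        CheckInAction::CommitUpgrade
    } else {
        CheckInAction::MergeSnapshotAndReboot
    }
}
```

With N = 2, a device gets two boot attempts on the new root; the third increment crosses the threshold and triggers the merge.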

Ch8 — Deployment secrets from OpenBao (🔴) — D2, blocked on machine-identity branch

Goal: a Deployment references a secret by name; the device's container gets the value at apply time; the secret never sits in NATS KV; a device reads only its deployments' secrets. Direction (chosen): agent fetches from OpenBao via harmony_config, authenticating with its existing Zitadel identity. EnvVar gains a valueFrom ref; inline literals stay. Resolution in the reconciler before ensure_service_running; fetch failure → legible Phase::Failed, never silent empty env. Refresh: periodic re-resolve on the tick (restart container only on change) + admin-triggered deployment restart. Synergy: Zitadel-deactivate (Ch14) kills NATS and OpenBao access at once. Rejected: secret encrypted-per-device in NATS KV — re-encrypt/fan-out storm on rotation, sync drift. Unresolved (D2): scoping axis — per-deployment claim (re-login on every new deployment; note lease-renew ≠ re-login) vs device-stable policy templating vs operator-minted response-wrapped token. Multi-deployment scoping likely needs OIDC-groups→policies, not array-claim path templating. Confirm device→OpenBao reachability first.
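
The "fetch failure → legible failure, never silent empty env" rule can be illustrated with a resolver that either returns every variable or returns a named error. This is a sketch under assumptions: `EnvValue`, `ResolveError`, and `resolve_env` are hypothetical, and the `fetch` callback stands in for the OpenBao lookup via harmony_config.

```rust
/// Hypothetical value model: an inline literal or a reference by secret name.
pub enum EnvValue {
    Literal(String),
    SecretRef(String),
}

pub struct EnvVar {
    pub name: String,
    pub value: EnvValue,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    /// Names both the env var and the secret so the failure is legible.
    SecretUnavailable { env: String, secret: String },
}

/// Resolves all variables before the container is started.
/// A missing secret aborts the whole resolution; it is never replaced by
/// an empty string.
pub fn resolve_env<F>(
    vars: &[EnvVar],
    mut fetch: F,
) -> Result<Vec<(String, String)>, ResolveError>
where
    F: FnMut(&str) -> Option<String>,
{
    let mut resolved = Vec::with_capacity(vars.len());
    for var in vars {
        let value = match &var.value {
            EnvValue::Literal(literal) => literal.clone(),
            EnvValue::SecretRef(secret) => fetch(secret.as_str()).ok_or_else(|| {
                ResolveError::SecretUnavailable {
                    env: var.name.clone(),
                    secret: secret.clone(),
                }
            })?,
        };
        resolved.push((var.name.clone(), value));
    }
    Ok(resolved)
}
```

The reconciler would map `Err(ResolveError::SecretUnavailable { .. })` to the failed phase with the error text attached, and only call `ensure_service_running` on `Ok`.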

Ch9 — Agent time-drift guard (🔴)

Startup + 6-hourly NTP-style check; refuse to start on |drift| > 30s with a specific message ("clock skew Xs; JWT validation will fail; enable systemd-timesyncd/chrony"); mid-run drift → DeviceInfo health flag → dashboard. Done: e2e VM with clock forced 5 min off refuses start; fixing the clock recovers clean.
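
The threshold check itself is small; a sketch follows. `check_drift` and `MAX_DRIFT` are hypothetical names, and how the reference time is obtained (NTP query, control-plane timestamp) is deliberately left out.

```rust
use std::time::Duration;

/// Maximum tolerated clock skew from the plan.
pub const MAX_DRIFT: Duration = Duration::from_secs(30);

/// Compares the local clock with a trusted reference, both given as
/// seconds since the Unix epoch. Returns the operator-facing message
/// when the absolute skew exceeds `MAX_DRIFT`.
pub fn check_drift(local_unix_secs: i64, reference_unix_secs: i64) -> Result<(), String> {
    let skew = local_unix_secs.abs_diff(reference_unix_secs);
    if skew > MAX_DRIFT.as_secs() {
        Err(format!(
            "clock skew {}s; JWT validation will fail; enable systemd-timesyncd/chrony",
            skew
        ))
    } else {
        Ok(())
    }
}
```

At startup an `Err` would abort the agent with that message; during the 6-hourly check the same result would instead set the health flag reported to the dashboard.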

Ch10 — Phase-1 smoke wiring (🔴)

Make the Phase-0 smoke contract load-bearing. Add HttpHealthy, K8sPodReady, NatsKvKeyExists probes; FleetOperatorSmokeTest + FleetAgentSmokeTest; additive HarmonyEvent::SmokeStage{…}; dashboard pipeline view. Strict order: events → probes (one at a time) → operator suite → agent suite → renderer. Done: deploy_with_smoke(FleetOperatorScore, …) green vs staging; deliberate bad image fails visibly with the failing probe named.
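
The "failing probe named" requirement suggests a probe interface along these lines. This is not the Phase-0 contract itself: the `Probe` trait, `run_probes`, and the `StaticProbe` test double are hypothetical, and real probes such as HttpHealthy would perform I/O inside `check`.

```rust
/// Hypothetical probe interface for this sketch.
pub trait Probe {
    fn name(&self) -> &str;
    fn check(&self) -> Result<(), String>;
}

/// Runs probes one at a time, in order, and stops at the first failure.
/// The error message names the probe that failed.
pub fn run_probes(probes: &[&dyn Probe]) -> Result<(), String> {
    for probe in probes {
        probe
            .check()
            .map_err(|reason| format!("probe {} failed: {}", probe.name(), reason))?;
    }
    Ok(())
}

/// Test double with a fixed result, used only for illustration.
pub struct StaticProbe {
    pub name: &'static str,
    pub healthy: bool,
}

impl Probe for StaticProbe {
    fn name(&self) -> &str {
        self.name
    }

    fn check(&self) -> Result<(), String> {
        if self.healthy {
            Ok(())
        } else {
            Err("unhealthy".to_string())
        }
    }
}
```

Running probes sequentially mirrors the "one at a time" ordering above and keeps the failure report unambiguous.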

Ch14 — Blacklist enforcement + un-blacklist (🔴)

Today blacklist_device only patches a cosmetic k8s label. Decision: (1) deactivate the Zitadel machine user (device-<id>) — a new management-API call (none wired today); device cannot re-auth, drops at next callout (≤ JWT TTL, currently 1h). (2) Keep the label for state + reversal. (3) Confirmation UX is load-bearing: state plainly that the current NATS connection persists until token renewal (≤1h), and immediate effect needs a device restart or a NATS restart (which disconnects all devices). No false "it's gone now." (4) Un-blacklist = reactivate user + clear label (FleetService::unblacklist_device + button). Not in scope: true instant kill of a hostile device (needs NATS-level forced disconnect — large); shortening JWT TTL (callout-wide; revisit if exposure matters). Use case is decommission/quarantine, not adversarial. Done: blacklist → fresh enroll/reconnect refused; un-blacklist → reconnects; confirm dialog states the caveat verbatim.


Out of scope (deferred deliberately)

| Item | Target | Why |
|---|---|---|
| Deployment auto-rollback (app versions) | maybe never | Customer asked roll-forward only. (≠ agent/system rollback, which is in scope.) |
| Live log streaming | v0.4 | #3 ships sync getLogs. |
| Offline-boot container autostart | v0.4 | #15 covers the online case. |
| Cross-deployment ordering | TBD | Init containers cover the common case. |
| Containerized agent (vs systemd) | v0.4+ | Settle D1 + mature self-upgrade on systemd first. |
| Operator HA | TBD (D3) | One pod sufficient until fleet size demands. |

Principles (carried forward)

No YAML in framework paths · Scores describe desired state, topologies expose capabilities · cross-boundary wire types in harmony-reconciler-contracts · never ship untested code (real e2e before "done") · prove upstream claims before blaming upstream · thiserror in libs, anyhow only at binary glue · minimal/DRY/no-bloat — the slimmest correct solution wins.

When v0.3 is done

Chapters 1–10 + 14 merged · a real customer Deployment runs on a real Pi · dashboard shows live status + logs · an agent upgrade and a system upgrade have each been driven through the full protocol including a proven auto-rollback on the failure path. v0.4 picks up the deferred list.