Authoritative plan for the last mile before fleet ships to a real
customer. Picks up where v0_2_plan.md left the chapter structure.
Twelve chapters, organized in execution order:
1. Dashboard role enforcement (security gap, do right now)
2. Operator restart + aggregator recovery (more critical than smoke)
3. Application log forwarding companion (dashboard utility)
4. Agent self-upgrade, NATS-coordinated, systemd-resident
5. Graceful deployment upgrade (roll-forward only — customer ask)
6. Init containers in PodmanV0Score
7. System upgrade, rollback deferred to v0.4
8. Secrets via Zitadel + OpenBao (blocked on harmony_secret work)
9. Agent time-drift verification
10. Phase 1 smoke wiring
11. CI yaml minimization (longer-term)
12. NATS callout CI hardening (minimal)
Customer constraints baked in: deployments are roll-forward only
(no auto-rollback on Deployment failure); system rollback half of
the upgrade ADR is deferred to v0.4 (snapshot is created but not
used for revert in v0.3); secrets must go through Zitadel + OpenBao
(no plaintext shortcut).
Includes:
- feature checklist as a status table (14 items),
- sequencing table with ordering rationale,
- per-chapter goal / current state with file:line citations /
plan / open questions / "done when",
- out-of-scope table with target version + reason,
- cross-cutting open questions Q1–Q5.
Format follows the user's "tables over prose" preference: every
multi-item section is either a table or bold-led bullets with
nested supporting detail. Scannable at three depths (30-second
scroll for bold leads, 2-minute read for nested detail, deep read
with code where it matters).
Fleet Platform v0.3 — last-mile plan
Authoritative plan for the last mile before the fleet ships to a real
customer. Picks up where v0_2_plan.md left the chapter structure.
Written 2026-05-24, after feat/iot-walking-skeleton (#264) merged
and feat/smoke-test-contract landed the Phase 0 smoke companion.
The frame:
- v0.1 proved the shape.
- v0.2 locked the brick design.
- v0.3 makes the brick safe to hand to a customer running production workloads on Pis in their basement.
State coming in
- IoT walking skeleton merged. Operator + agent + NATS + Zitadel + auth callout running end-to-end against an OKD staging cluster.
- Smoke-test contract Phase 0 merged (`feat/smoke-test-contract`).
  - `Probe` / `SmokeSuite` / `SmokeTest` companion + `deploy_with_smoke` in `harmony-fleet-deploy/src/companion/smoke/`.
  - One concrete probe today: `TcpReachable`.
  - No fleet Score wired to a real smoke test yet; Phase 1 is in this roadmap.
- Agent runs as a systemd user unit on devices (see `harmony/src/modules/fleet/setup_score.rs:263–283`).
  - No on-device containerized agent path.
  - The Dockerfile at `fleet/harmony-fleet-agent/Dockerfile` is k8s-only today.
- Dashboard has no role enforcement (security gap).
  - Maud/htmx frontend at `fleet/harmony-fleet-operator/src/frontend/server.rs`.
  - Verifies Zitadel JWT signature + expiry only. `JwksCache::verify` (`harmony_zitadel_auth/src/jwks.rs:74`) extracts `sub` / `exp` / `email` / `name` / `nonce`, no roles.
  - `VerifiedSession` has no `roles` field.
  - Any logged-in Zitadel user gets full dashboard access. Fix immediately (Chapter 1).
- NATS callout already has the role-extraction logic we need.
  - `ZitadelValidator::extract_roles` at `nats/callout/src/zitadel.rs:203`.
  - Handles both array shape (`["fleet-admin"]`) and Zitadel's object-map shape (`{"fleet-admin": {org_id: org_name}}`).
  - `roles::resolve` maps role names to `ResolvedRole::Admin` / `::Device` with admin-wins precedence.
  - Chapter 1 reuses the extractor, not the role-to-NATS-permission half.
- System upgrade ADR drafted at `docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md`.
  - Header says Accepted 2026-05-24 but lives under `drafts/`.
  - Authoritative status: approach agreed, rollback half deferred (Chapter 7).
Customer constraints baked into this plan
- Deployments are roll-forward only. No auto-rollback when a new Deployment version fails. Dashboard surfaces the failure; customer edits the spec and rolls forward. Customer ask; may change later, not in v0.3.
- System rollback is deferred to v0.4. v0.3 implements upgrade per the ADR; the LVM-snapshot rollback half waits until we've shipped something to production.
- Secrets need Zitadel + OpenBao. No plaintext-env-var shortcut. `harmony_secret` + OpenBao work is on the critical path for any Deployment that needs credentials.
Feature checklist
Status legend: ✅ shipped · 🟡 in flight · 🔴 not started · ⏸ deferred (target version in note).
| # | Feature | Status | Owner / branch | Notes |
|---|---|---|---|---|
| 1 | Dashboard role enforcement (`fleet-admin` required) | 🔴 | next branch | Reuse `ZitadelValidator::extract_roles`. Do this right now (security gap). |
| 2 | Operator restart / aggregator cold-rebuild | 🔴 | next branch | More critical than smoke wiring; ship before any customer. |
| 3 | Deployment `getLogs` companion + dashboard log view | 🔴 | next branch | "Makes dashboard useful rather than a toy." Score companion shape. |
| 4 | Agent self-upgrade (NATS-coordinated, systemd-resident) | 🔴 | new branch | Marker lives in NATS, not on disk. Systemd stays. |
| 5 | Graceful deployment upgrade (roll-forward only) | 🔴 | new branch | SIGTERM → grace → SIGKILL fallback → start new. No rollback. |
| 6 | Init containers in `PodmanV0Score` | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
| 7 | System upgrade (no rollback yet) | 🔴 | new branch | Per drafted ADR, minus the LVM-snapshot rollback half. |
| 8 | Secrets via Zitadel + OpenBao for Deployments | ⏸ v0.3+ | blocked on `harmony_secret` | Required for production but not blocking the first customer. |
| 9 | Agent time-drift verification | 🔴 | new branch | Periodic NTP check; refuse JWT operations if skewed. |
| 10 | Phase 1 smoke wiring (HTTP / K8sPodReady / NatsKv probes) | 🔴 | new branch | After required features land. Not a functional blocker. |
| 11 | CI yaml minimization (logic into `harmony-ci` scripts) | ⏸ v0.4 | longer-term | Yaml stays for discovery + parallel viz; scripts move. |
| 12 | NATS callout CI hardening | ⏸ | low-churn crate | Already covered by workspace `cargo test`. Run ignored tests when CI has podman + NATS image. |
| 13 | Application log streaming through NATS | ⏸ v0.4 | follow-on to #3 | #3 is the synchronous `getLogs`; this is the live tail. |
| 14 | Deployment dependencies (`after: [...]`) | ⏸ | not chosen | Init containers (#6) cover the in-deployment case; defer until customers ask. |
Sequencing
| Order | Item | Why |
|---|---|---|
| 1 | #1 Dashboard role enforcement | Security gap, do right now. |
| 2 | #2 Operator restart recovery | More critical than smoke wiring. Customer can't tolerate "operator restarted, state unknown." |
| 3 | #3 Log forwarding companion | Turns the dashboard from a toy into a thing customers actually use. |
| 4 | #4 Agent self-upgrade | Parallel-safe with #2/#3 — different code paths. |
| 5 | #5 + #6 Graceful upgrade + init containers | Paired Deployment-layer features; ship together. |
| 6 | #9 Time-drift verification | Small, isolated; slot between heavier items. |
| 7 | #7 System upgrade | Builds on agent-upgrade pattern from #4 — #4 lands first. |
| 8 | #10 Phase 1 smoke wiring | After required features so probes verify real customer-facing surfaces. |
| 9 | #8 Secrets | Blocks any customer Deployment that needs credentials. Promote if first customer needs them. |
| 10 | #11 / #12 CI | Opportunistic, doesn't block customer. |
Chapter 1 — Dashboard role enforcement (#1)
Goal: every dashboard page requires a valid Zitadel session and a fleet-admin role on the token.
- Users without the role get a 403 with a clear message.
- Users without a session get the existing login redirect.
Current state
- JWKS verify only extracts identity claims. `JwksCache::verify` (`harmony_zitadel_auth/src/jwks.rs:74`) parses the JWT and returns a `VerifiedSession` with `sub` / `exp` / `email` / `name` / `nonce`. Roles are not extracted.
- `VerifiedSession` has no `roles` field (`harmony_zitadel_auth/src/session.rs:5`).
- Middleware checks JWT validity only. `require_auth` (`fleet/harmony-fleet-operator/src/frontend/server.rs:136–157`). Every authenticated user gets all pages.
- Role extraction logic already exists and is correct in the callout: `ZitadelValidator::extract_roles` at `nats/callout/src/zitadel.rs:203`. Handles both shapes:
  - array: `["fleet-admin"]`
  - object-map: `{"fleet-admin": {org_id: org_name}}`
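A minimal sketch of what "handles both shapes" means. This is not the callout's actual code: `RolesClaim` is a hypothetical stand-in for the decoded claim value, and the sort/de-dup step is an assumption made so both shapes compare equal.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the decoded roles claim. Real code would decode
/// this from the JWT payload rather than build it by hand.
#[derive(Debug, Clone, PartialEq)]
pub enum RolesClaim {
    /// Claim absent from the token.
    Missing,
    /// Array shape: ["fleet-admin"]
    Array(Vec<String>),
    /// Object-map shape: {"fleet-admin": {org_id: org_name}}
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
}

/// Flatten either claim shape into a sorted, de-duplicated list of role names.
pub fn extract_roles(claim: &RolesClaim) -> Vec<String> {
    let mut roles: Vec<String> = match claim {
        RolesClaim::Missing => Vec::new(),
        RolesClaim::Array(names) => names.clone(),
        // Only the outer keys are role names; the inner org map is ignored.
        RolesClaim::ObjectMap(map) => map.keys().cloned().collect(),
    };
    roles.sort();
    roles.dedup();
    roles
}
```

Whatever form the shared helper takes, the property to preserve is that the array and object-map shapes resolve to the same role list.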
Plan
- Extract a shared role-extraction helper into `harmony_zitadel_auth` so dashboard and callout import from one place. Callout keeps its API but its body delegates.
- Extend `VerifiedSession` with `roles: Vec<String>`.
- Extend the JWKS `Claims` decode struct to capture the configured roles claim. Pull the claim name from existing callout config so the two systems agree (Zitadel ships `urn:zitadel:iam:org:project:roles` or similar).
- Add `require_role(role: &'static str)` middleware to the dashboard. Compose with `require_auth`. Use on every `Router::route(..., post|get(...).layer(...))`.
- 403 response renders a maud page: "fleet-admin role required; ask your administrator." Not a JSON error; dashboard is human-facing.
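The decision the composed `require_auth` + `require_role` middleware has to make can be sketched as a pure function. The `Access` enum and `check_access` are hypothetical names, and the `VerifiedSession` here is a cut-down mirror of the extended struct, not the real one.

```rust
/// Hypothetical mirror of the extended session; only the fields used here.
pub struct VerifiedSession {
    pub sub: String,
    pub roles: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Access {
    /// No valid session: fall through to the existing login redirect.
    LoginRedirect,
    /// Valid session, required role missing: render the 403 page.
    Forbidden,
    /// Valid session carrying the required role.
    Granted,
}

/// Map (session, required role) to one of the three outcomes.
pub fn check_access(session: Option<&VerifiedSession>, required_role: &str) -> Access {
    match session {
        None => Access::LoginRedirect,
        Some(s) if s.roles.iter().any(|r| r == required_role) => Access::Granted,
        Some(_) => Access::Forbidden,
    }
}
```

Keeping the decision separate from the HTTP layer lets the redirect/403/200 cases be unit-tested without spinning up a router.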
Tests
Security code — heavy unit tests are non-negotiable.
- Array-shape claim → `fleet-admin` in session. JWT with array-shape role claim.
- Object-map shape → identical resolution. Same role, Zitadel's other claim shape.
- No role claim → empty roles. Token with no `roles` claim.
- Wrong role doesn't elevate. JWT with only `device` role does NOT carry `fleet-admin`.
- No session → 401/redirect.
- Session but no `fleet-admin` → 403.
- Session + `fleet-admin` → 200.
Done when
- Branch merged.
- All dashboard handlers gated by `require_role("fleet-admin")`.
- Every test green.
- Manual smoke against staging Zitadel: user without role sees 403.
Chapter 2 — Operator restart + aggregator recovery (#2)
Goal: the operator pod can be killed, upgraded, or rescheduled at any time and the system converges back to correct state from NATS KV alone. No "unknown state" window visible to customers.
Current state
- Aggregator rebuilds from scratch on startup. `fleet_aggregator.rs` (833 LOC, in `harmony-fleet-operator/src/`) watches the KV buckets to materialize state. JG confirmed: "rebuilt from scratch, yes."
- Failure modes not exercised yet:
  - Partial KV: device offline during operator reset, never re-published its info.
  - Two operator pods racing during a rolling deploy of the operator.
  - NATS stream loss between operator restart and rebuild completing.
  - Stale KV: Deployment CR deleted in kube while operator was down.
Plan
Scenario-driven. Enumerate failure shapes, then handle one at a time. Discipline: each scenario gets a regression test in harmony-fleet-e2e, then the fix.
- Scenario inventory pass. Write `docs/fleet-operator-recovery-scenarios.md` listing every failure shape we can think of. Cross-reference smoke-a* tests to identify what's already covered.
- Cold-start rebuild as the baseline. Confirm + test that `kubectl delete pod` of the operator, followed by waiting for the replacement, converges to the pre-kill aggregate in < 30s. Gate on convergence time at N device count.
- Stale-KV reconciliation. Define the rule for "KV says device D has Deployment X, but Deployment X no longer exists in kube." Operator cleans up; agents observe the deletion.
- Leader election decision. Ship with leader election (one writer at a time) or design for idempotent multi-writer? Score-Topology-Interpret leans idempotent; confirm + assert operator writes are byte-deterministic.
- Liveness signaling for the dashboard. Surface "operator converged" / "operator recovering" as states the frontend renders. Customer sees a loading banner, not a blank dashboard, during rebuild.
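One way to read the stale-KV rule is as a set difference computed on startup. A sketch under assumptions: `KvAssignment` is a hypothetical shape (the real key layout lives in the operator), and the live Deployment names are assumed to come from a kube list call made before the comparison.

```rust
use std::collections::BTreeSet;

/// One KV entry meaning "device `device_id` runs Deployment `deployment`".
/// Hypothetical shape for illustration only.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct KvAssignment {
    pub device_id: String,
    pub deployment: String,
}

/// Return the KV entries whose Deployment no longer exists in kube.
/// Pure function of its inputs, so two operator pods computing it from the
/// same snapshot reach the same answer.
pub fn stale_assignments(
    kv: &[KvAssignment],
    live_deployments: &BTreeSet<String>,
) -> Vec<KvAssignment> {
    kv.iter()
        .filter(|a| !live_deployments.contains(&a.deployment))
        .cloned()
        .collect()
}
```

Because the result depends only on the two snapshots, deleting the returned entries is safe to repeat, which fits the idempotent multi-writer leaning noted above.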
Open questions
- Warm-restart snapshot? Keep a per-operator-pod "last known aggregate" snapshot in a KV bucket so warm restarts skip cold rebuild? Probably yes for >1000-device fleets; adds an invalidation problem.
- One pod or active/passive? Customer's fleet size answers this. Ask before starting.
Done when
- Scenario inventory exists.
- Each scenario has a regression test, all green.
- Manual chaos: kill operator pod during high write load → convergence + dashboard liveness banner observed.
Chapter 3 — Application log forwarding companion (#3)
Goal: when a customer's Deployment is misbehaving on a Pi in the field, the dashboard shows last-N-lines of container logs without anyone SSH-ing the device.
Design
- Logs attach as a Score companion — same pattern as the smoke-test contract.
- The companion is optional — Scores without one render "this deployment doesn't expose logs". Acceptable.
- Sync `getLogs` ships in v0.3 as the minimum useful UX; live tail (streaming) waits for v0.4.
Shape:
```rust
// new in harmony-fleet-deploy/src/companion/logs/

/// Companion trait: fetch recent log lines for a deployed Score.
pub trait LogQuery<T: Topology>: Send + Sync {
    type Score: Score<T>;

    /// Return the last `n` log lines for `score` on `topology`.
    async fn last_lines(
        &self,
        score: &Self::Score,
        topology: &T,
        n: usize,
    ) -> Result<LogChunk, LogQueryError>;
}

/// One synchronous snapshot of log output.
pub struct LogChunk {
    pub source: ProbeName, // reuse the validated newtype
    pub captured_at: chrono::DateTime<chrono::Utc>,
    pub lines: Vec<String>,
    pub truncated: bool,
}
```
For PodmanV0Score:
- Transport: NATS request/reply. Subject `device-commands.<device_id>.logs.<deployment>`.
- Agent side: runs `podman logs --tail N <container>` and replies with a `LogChunk`.
- Dashboard side: one async call from the logs handler.
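A sketch of two small helpers this implies. Both names are hypothetical. The token check reflects NATS subject rules (`.` separates tokens, `*` and `>` are wildcards, whitespace is not allowed). Treating `truncated` as "older lines were dropped" is an assumption; the plan does not define the flag.

```rust
/// True if `token` is usable as a single NATS subject token.
fn valid_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Build `device-commands.<device_id>.logs.<deployment>`, or `None` if either
/// part would corrupt the subject hierarchy.
pub fn logs_subject(device_id: &str, deployment: &str) -> Option<String> {
    if valid_token(device_id) && valid_token(deployment) {
        Some(format!("device-commands.{device_id}.logs.{deployment}"))
    } else {
        None
    }
}

/// Keep only the last `n` lines; the bool reports whether older lines were
/// dropped (assumed meaning of `LogChunk::truncated`).
pub fn tail_lines(all: &[String], n: usize) -> (Vec<String>, bool) {
    let start = all.len().saturating_sub(n);
    (all[start..].to_vec(), start > 0)
}
```

Validating the tokens before formatting keeps a device ID containing a dot from silently addressing a different subject.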
Plan
- Define `LogQuery` companion trait in a new `harmony-fleet-deploy/src/companion/logs/` module.
- `PodmanLogQuery` implementing `LogQuery<…>` for `PodmanV0Score`.
- Agent-side command handler: extend the existing request/reply command dispatcher.
- Dashboard handler at `/deployments/<name>/devices/<id>/logs?lines=N` returning rendered maud.
- Tests: unit on `PodmanLogQuery`; integration in `harmony-fleet-e2e` driving end-to-end.
Done when
- Customer clicks "View logs" on the dashboard.
- Sees the last 200 lines.
- Call returns in < 2s on a 3-device fleet.
Chapter 4 — Agent self-upgrade, NATS-coordinated (#4)
Goal: the agent can upgrade itself in place. If NATS is unavailable, the upgrade does not start. The operator sees every step.
Design (per JG's direction)
- Stay on systemd for v0.3. Switching the agent runtime to podman is its own risk; defer until self-upgrade protocol matures.
- Upgrade marker lives in NATS, not on disk. New KV bucket `agent-upgrade` keyed by `device_id`, carrying `start_timestamp`, `invoker_version`, `target_version`, `phase`.
- No NATS → no upgrade. Feature, not limitation: operator can't observe an upgrade it can't see, so refusing without NATS prevents silent half-upgrades.
Protocol
1. Operator writes `Requested`. `agent-upgrade/<device_id>` with `phase: Requested, target_version: vX`.
2. Old agent observes + writes `Suspending`. Verifies NATS liveness with a round-trip first.
3. Old agent suspends + writes `Suspended`. Reconcile loop paused; heartbeat continues so the operator knows it's alive.
4. Old agent fetches new binary + writes `Fetched`. Mechanism TBD (see open questions). `target_path: /usr/local/bin/fleet-agent.new`.
5. Old agent launches new binary as a separate process + writes `NewLaunched`. Not via systemd unit update yet. Includes `new_pid: N`.
6. New agent self-checks + writes `NewHealthy`. Connects to NATS, verifies permissions, one-shot smoke (KV read, command channel echo).
7. Old agent writes `HandingOff` and exits. Tells systemd to swap the binary: `systemctl daemon-reload` + `systemctl restart fleet-agent.service` with the new binary in place.
8. Systemd starts the unit pointing at the new binary. Final state `phase: Complete, completed_at: T`.
On stall (configurable, default 5 min):
- Marker writes `phase: Failed` with last successful step.
- Operator surfaces this on the dashboard.
- Customer / operator intervenes manually; no auto-rollback in v0.3, consistent with the deployment roll-forward-only rule.
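The phase sequence and the stall rule can be sketched as a small state machine. Type and function names are hypothetical. Where stall detection runs (operator, agent, or both) is not decided above, so `evaluate` only encodes the rule itself.

```rust
use std::time::Duration;

/// Phases in the order the marker is expected to move through them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Requested,
    Suspending,
    Suspended,
    Fetched,
    NewLaunched,
    NewHealthy,
    HandingOff,
    Complete,
}

impl Step {
    /// The only legal successor of each step; `Complete` is terminal.
    pub fn next(self) -> Option<Step> {
        use Step::*;
        match self {
            Requested => Some(Suspending),
            Suspending => Some(Suspended),
            Suspended => Some(Fetched),
            Fetched => Some(NewLaunched),
            NewLaunched => Some(NewHealthy),
            NewHealthy => Some(HandingOff),
            HandingOff => Some(Complete),
            Complete => None,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    InProgress(Step),
    Failed { last_successful: Step },
}

/// Default from the plan: 5 minutes without progress counts as a stall.
pub const DEFAULT_STALL_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// What the marker should read, given the last step written and how long ago.
pub fn evaluate(current: Step, since_last_write: Duration, stall_timeout: Duration) -> Phase {
    if current != Step::Complete && since_last_write > stall_timeout {
        Phase::Failed { last_successful: current }
    } else {
        Phase::InProgress(current)
    }
}
```

A strictly linear `next` makes out-of-order writes easy to reject, and `Failed` carries the last successful step so the dashboard can show where the upgrade stopped.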
Open questions
- Q1.1 Binary distribution. Gitea release asset? Signed OCI artifact? Existing `arm-agents.yaml` uploads aarch64 binaries to releases; start with that.
- Q1.2 Verification. Hash signature? GPG? Minimum: SHA-256 pinned in the upgrade-request payload.
- Q1.3 Atomic systemd swap. `systemctl restart` is not atomic across binary-on-disk and process. Acceptable? Or a transient-unit shim via `systemd-run`?
- Q1.4 Cross-arch. Fetch URL has to know the device's arch. KV `device-info` already carries this; confirm the agent reads its own arch correctly.
Done when
- Branch contains the protocol implementation + e2e test driving v0.3.0 → v0.3.1 upgrade against a libvirt VM.
- Operator sees every phase.
- Failure path tested: deliberately corrupt the new binary → marker reads `Failed`, old agent stays running.
Chapter 5 — Graceful deployment upgrade, roll-forward only (#5)
Goal: upgrading a Deployment's image/config replaces the old container without dropping traffic mid-request. If the new container won't start, the customer sees the failure clearly and fixes the spec.
Design
Extend PodmanV0Score with a lifecycle block:
```rust
use std::time::Duration;

pub struct PodmanV0Score {
    // ... existing fields ...
    pub lifecycle: Option<LifecyclePolicy>,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal, // SIGTERM (default), SIGINT, SIGUSR1
    pub grace_period: Duration,  // default 30s
    pub sigkill_fallback: bool,  // default true
}
```
Agent's reconcile when image/config changes:
1. Write `Upgrading` phase. New `DeploymentState::Phase::Upgrading` variant. Dashboard shows the in-progress upgrade.
2. Send `stop_signal` to the old container.
3. Wait up to `grace_period` for clean exit.
4. SIGKILL fallback if still running and `sigkill_fallback`.
5. Start new container.
6. On startup failure: write `Failed` and stop. Image pull error, exec error, crash within 5s. No revert to old image.
7. On success: write `Running`. Optionally gated by a Phase-1 smoke test (Chapter 10) when that lands.
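The stop half of the reconcile (steps 2 to 4) reduces to one decision. A sketch with hypothetical names: `StopOutcome` and `stop_outcome` are illustrative, and the `StillRunning` case (grace period expired with `sigkill_fallback` off) is an assumption, since the plan does not say what happens then.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopSignal {
    Sigterm,
    Sigint,
    Sigusr1,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,
    pub grace_period: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    /// Defaults as listed in the plan: SIGTERM, 30s grace, SIGKILL fallback on.
    fn default() -> Self {
        LifecyclePolicy {
            stop_signal: StopSignal::Sigterm,
            grace_period: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum StopOutcome {
    /// Old container exited within the grace period.
    ExitedCleanly,
    /// Grace period expired and the fallback applies: send SIGKILL.
    Killed,
    /// Grace period expired and the fallback is disabled (behaviour assumed).
    StillRunning,
}

/// `exited_after` is how long the old container took to exit after
/// `stop_signal`; `None` means it never exited on its own.
pub fn stop_outcome(policy: &LifecyclePolicy, exited_after: Option<Duration>) -> StopOutcome {
    match exited_after {
        Some(t) if t <= policy.grace_period => StopOutcome::ExitedCleanly,
        _ if policy.sigkill_fallback => StopOutcome::Killed,
        _ => StopOutcome::StillRunning,
    }
}
```

The `StillRunning` branch is where the design still needs an answer: with one container per Deployment per device, the new container presumably cannot start until the old one is gone.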
Explicit non-goals
- No auto-rollback. Customer-asked constraint. Step 6 firing → dashboard shows "Deployment failed; previous version stopped" and the customer edits the spec.
- No "stale + new" window. Single container per Deployment per device; short downtime during cutover is accepted.
Done when
- Upgrade test in `harmony-fleet-e2e` walks v1 → v2 → v3 image swap with controlled failures.
- Dashboard reflects every step.
Chapter 6 — Init containers (#6)
Goal: customer can declare init containers that run to completion before the main container starts. Mirror Kubernetes semantics so customer mental model transfers.
Design
Extend PodmanV0Score with init_containers: Vec<InitContainer>:
- Ordered — declaration order = run order.
- Run-to-completion — each one must exit zero before the next starts.
- Fail-the-Deployment on init failure — non-zero exit or timeout exceeded.
```rust
use std::time::Duration;

pub struct InitContainer {
    pub name: String,
    pub image: String,
    pub args: Vec<String>,
    pub env: Vec<EnvVar>,
    pub volumes: Vec<VolumeMount>,
    pub timeout: Duration, // default 5 min
}
```
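The ordered, run-to-completion, fail-fast semantics can be sketched independently of podman. `InitSpec`, `InitResult`, `InitError`, and the `run_one` hook are hypothetical; in the agent, `run_one` would start the container and enforce `timeout`.

```rust
use std::time::Duration;

/// Cut-down stand-in for `InitContainer`: only what the ordering logic needs.
pub struct InitSpec {
    pub name: String,
    pub timeout: Duration,
}

/// What running one init container produced.
#[derive(Debug, PartialEq, Eq)]
pub enum InitResult {
    Exited(i32),
    TimedOut,
}

#[derive(Debug, PartialEq, Eq)]
pub enum InitError {
    NonZeroExit { name: String, code: i32 },
    Timeout { name: String },
}

/// Run init containers in declaration order and stop at the first failure.
/// The main container may start only when this returns `Ok(())`.
pub fn run_init_sequence<F>(specs: &[InitSpec], mut run_one: F) -> Result<(), InitError>
where
    F: FnMut(&InitSpec) -> InitResult,
{
    for spec in specs {
        match run_one(spec) {
            InitResult::Exited(0) => continue,
            InitResult::Exited(code) => {
                return Err(InitError::NonZeroExit { name: spec.name.clone(), code })
            }
            InitResult::TimedOut => return Err(InitError::Timeout { name: spec.name.clone() }),
        }
    }
    Ok(())
}
```

Injecting the runner as a closure keeps the ordering rule unit-testable; the e2e test then only has to prove the real runner is wired in.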
Customer contract (document loudly)
Init containers must be idempotent. They run on every reconcile that requires a fresh main container — power-cycle recovery, graceful upgrade, etc.
- Customer-side migration scripts that aren't idempotent will misbehave.
- Document with examples.
- Add a Score-builder lint that warns on common non-idempotent patterns (e.g. `INSERT` without `ON CONFLICT`).
Done when
- `harmony-fleet-e2e` deploys a Deployment with one init container (`mkdir -p /data && touch /data/initialized`) followed by a main container that asserts the file exists.
- Two-step ordering sequence tested.
Chapter 7 — System upgrade, rollback deferred (#7)
Goal: the device can apt-upgrade its base OS without bricking. Implements the upgrade flow per the drafted ADR; the LVM-snapshot rollback half is deferred to v0.4.
What ships in v0.3
- Pre-upgrade snapshot creation (LVM thin snapshot of root LV). Created but not used for revert in v0.3.
- Boot-attempt counter on FAT `/boot` partition (per ADR design).
- Userspace control-plane check-in timer.
- Idempotent provisioning conversion script (partition → PV/VG/LV, initramfs regen, cmdline.txt update, watchdog config).
- Canary hardware test of the upgrade-succeeds path.
What's explicitly NOT in v0.3
- Initramfs `local-top` boot-attempt hook that triggers rollback.
- Userspace soft-failure path that merge-reverts the snapshot.
- Any rollback wiring.
The snapshot exists so v0.4 can flip on the rollback half without re-provisioning devices.
Done when
- Canary Pi successfully upgrades from a known-good base image to a later one.
- Snapshot exists post-upgrade.
- No customer-visible regression.
- Per "Full Verification Before Done" rule: green on both aarch64 and x86_64 device classes.
Chapter 8 — Secrets via Zitadel + OpenBao (#8, deferred)
- Lands when `harmony_secret` is ready.
- Out of scope for v0.3 chapter-by-chapter work, but required before any production customer deploys an app that needs credentials.
- Track as a separate item. Surface to the customer as: "your first Deployments should use environment variables only until v0.3.x."
Chapter 9 — Agent time-drift verification (#9)
Goal: agent refuses to operate (or warns loudly) when its clock is skewed enough to break JWT validation.
Design
- Startup NTP-style query against a configurable server list (default: `time.cloudflare.com`, `pool.ntp.org`).
- Refuse to start on |drift| > 30s. Typical JWT skew tolerance; past it, every NATS callout request fails with a cryptic `exp invalid`.
- Periodic re-check every 6 hours. Mid-run drift past threshold → agent publishes a `DeviceInfo` health flag, dashboard surfaces it.
- Specific customer-facing error message: "system clock skew is 14m32s; JWT validation will fail. Enable `systemd-timesyncd` or `chrony`."
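The threshold check and the error message can be sketched without the network part. Function names are hypothetical, and the reference time is assumed to come from whatever NTP-style query the agent ends up using.

```rust
/// Threshold from the plan: refuse to start when |drift| exceeds 30 seconds.
pub const MAX_DRIFT_SECS: u64 = 30;

/// Render a skew as `14m32s` (or `45s` when under a minute); sign is dropped.
pub fn format_skew(drift_secs: i64) -> String {
    let abs = drift_secs.unsigned_abs();
    let (minutes, seconds) = (abs / 60, abs % 60);
    if minutes > 0 {
        format!("{minutes}m{seconds}s")
    } else {
        format!("{seconds}s")
    }
}

/// Compare local and reference Unix timestamps (seconds).
/// Returns the customer-facing message when the skew is past the threshold.
pub fn check_drift(local_unix: i64, reference_unix: i64) -> Result<(), String> {
    let drift = local_unix.saturating_sub(reference_unix);
    if drift.unsigned_abs() > MAX_DRIFT_SECS {
        Err(format!(
            "system clock skew is {}; JWT validation will fail. Enable systemd-timesyncd or chrony.",
            format_skew(drift)
        ))
    } else {
        Ok(())
    }
}
```

The same `check_drift` can back both the startup gate and the 6-hour re-check; only the reaction differs (refuse to start versus publish a health flag).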
Done when
- Test in `harmony-fleet-e2e` runs against a libvirt VM with clock forced 5 minutes off.
- Agent refuses to start with the expected error message.
- Recovery: fix the clock → agent comes up clean.
Chapter 10 — Phase 1 smoke wiring (#10)
Goal: real fleet Scores carry real smoke tests. The Phase 0 contract becomes load-bearing.
Scope
- `HttpHealthy` probe: GET a URL, expect 2xx, optional response-body-contains assertion.
- `K8sPodReady` probe: kube client lookup for pod readiness condition.
- `NatsKvKeyExists` probe: KV bucket + key, optional value-deserializes-to-T assertion.
- `FleetOperatorSmokeTest`: pairs with `FleetOperatorScore`. Operator pod ready + `/healthz` returns 200 + can write to `device-info` KV.
- `FleetAgentSmokeTest`: pairs with `FleetAgentScore`. Agent pod ready + heartbeat published to KV within 30s.
- `HarmonyEvent::SmokeStage{Started,Finished,Skipped}` variants (additive) so the dashboard can render the live pipeline.
- Dashboard pipeline view: maud renderer subscribing to instrumentation events.
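The pass/fail rule for `HttpHealthy` can be sketched separately from the HTTP client. `ProbeVerdict`, `HttpExpectation`, and `evaluate_http` are hypothetical names and not part of the Phase 0 `Probe` contract; the real probe would perform the GET and feed status and body into something like this.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeVerdict {
    Pass,
    /// Carries a human-readable reason for the dashboard.
    Fail(String),
}

pub struct HttpExpectation<'a> {
    /// Optional substring the response body must contain.
    pub body_contains: Option<&'a str>,
}

/// Apply "expect 2xx, optionally body contains X" to an already-fetched response.
pub fn evaluate_http(status: u16, body: &str, expect: &HttpExpectation<'_>) -> ProbeVerdict {
    if !(200..300).contains(&status) {
        return ProbeVerdict::Fail(format!("expected 2xx, got {status}"));
    }
    match expect.body_contains {
        Some(needle) if !body.contains(needle) => {
            ProbeVerdict::Fail(format!("response body missing {needle:?}"))
        }
        _ => ProbeVerdict::Pass,
    }
}
```

Returning the reason as data rather than logging it is what lets the dashboard name the failing probe and why.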
Sequencing within this chapter (strict order)
1. `HarmonyEvent` variants: one-line additive change to `harmony/src/domain/instrumentation.rs`.
2. Probes one at a time: HTTP, K8sPodReady, NatsKvKeyExists. Each: unit tests + an integration test against the staging cluster.
3. `FleetOperatorSmokeTest` composing the above.
4. `FleetAgentSmokeTest`.
5. Dashboard renderer last: once the events are flowing, UI is mostly maud + htmx polling.
Done when
- `deploy_with_smoke(FleetOperatorScore, FleetOperatorSmokeTest, ...)` returns successfully against staging.
- Dashboard shows the live pipeline.
- Deliberate breakage (point the operator's helm chart at a bad image) → smoke fails visibly, failing probe named on dashboard.
Chapter 11 — CI yaml minimization (#11, longer-term)
Pulled out of the chapter-by-chapter v0.3 work.
- Frame: workflow yaml files in `.gitea/workflows/` (4 files, ~235 LOC) should hold only what Gitea Actions needs for job discovery + parallel viz. Job bodies are one-line calls into portable scripts.
Direction
- Build out a `harmony-ci` Rust CLI crate. Commands like `harmony-ci build composer-linux`, `harmony-ci publish operator-image`, `harmony-ci check`.
- Each workflow yaml job becomes `run: cargo run -p harmony-ci -- <command>`.
run: cargo run -p harmony-ci -- <command>. - Scripts must run identically from a developer's laptop.
Not in v0.3
- Multi-day effort; doesn't block the customer.
- Slot when bandwidth allows.
- Opportunistically convert when touching a workflow file for other reasons.
Chapter 12 — NATS callout CI hardening (#12, minimal)
- `nats/callout` is a low-churn crate that works today.
- Workspace-wide `cargo test` in `.gitea/workflows/check.yml` covers the non-ignored tests.
- Four `#[ignore]`'d integration tests in `nats/integration-test-callout/tests/callout_e2e.rs` need podman + a NATS image pull in the runner.
Direction
- Don't add CI infra in v0.3 just to run these.
- When a runner with podman + image pull exists for other reasons (e2e harness, system upgrade test matrix), add the callout integration tests to it.
- Until then: keep current workspace-wide coverage.
Out of scope for v0.3 (deferred deliberately)
| Item | Target | Why deferred |
|---|---|---|
| Deployment-level auto-rollback | maybe never | Customer asked for roll-forward only. |
| System-upgrade LVM-snapshot rollback half | v0.4 | Push to prod first; widen scope after. |
| Live log tailing (streaming) | v0.4 | Chapter 3 ships sync getLogs; live tail builds on it. |
| Deployment dependencies (cross-deploy ordering) | TBD | Init containers cover the common case; wait for customer ask. |
| Secrets via Zitadel + OpenBao | v0.3.x | Blocked on harmony_secret work. |
| Containerized agent (podman instead of systemd) | v0.4+ | Self-upgrade protocol matures first on systemd. |
| Operator HA (active/active or active/passive) | TBD | One pod sufficient for v0.3; scale-out when fleet size demands. |
| Multi-tenant fleet isolation tests | v0.4 | Callout permissions cover the mechanism; cross-tenant smoke later. |
Open questions
These don't block starting v0.3 work but need resolution before the relevant chapter completes.
- Q1 (Chapter 4): Binary distribution mechanism for agent upgrades. Gitea releases vs OCI artifacts vs something else.
- Q2 (Chapter 2): Snapshot the aggregate to KV? Faster recovery vs invalidation complexity.
- Q3 (Chapter 7): Canary test matrix? Concretely: which Pi models, which base images, which apt sources.
- Q4 (Chapters 5 + 10): Sequencing of Chapter 10 vs Chapter 5. Both benefit from smoke; right answer might be to ship Phase 1 smoke during Chapter 5 so upgrade gates on it. Decide when starting Chapter 5.
- Q5 (cross-cutting): One operator pod or active/passive? Customer's fleet size answers this; ask before Chapter 2 starts.
When v0.3 is done
- All chapters 1–10 merged.
- A real customer Deployment runs on a real Pi in a real basement.
- The dashboard shows live status and logs.
- An agent upgrade has been driven through the full protocol successfully (and a failure path tested).
- A system upgrade has been driven through the full protocol on a canary.
v0.4 picks up the deferred items in priority order.