Jean-Gabriel Gill-Couture 9deebab1ff

Authoritative plan for the last mile before fleet ships to a real
customer. Picks up where v0_2_plan.md left the chapter structure.

Twelve chapters, organized in execution order:

  1. Dashboard role enforcement (security gap, do right now)
  2. Operator restart + aggregator recovery (more critical than smoke)
  3. Application log forwarding companion (dashboard utility)
  4. Agent self-upgrade, NATS-coordinated, systemd-resident
  5. Graceful deployment upgrade (roll-forward only — customer ask)
  6. Init containers in PodmanV0Score
  7. System upgrade, rollback deferred to v0.4
  8. Secrets via Zitadel + OpenBao (blocked on harmony_secret work)
  9. Agent time-drift verification
  10. Phase 1 smoke wiring
  11. CI yaml minimization (longer-term)
  12. NATS callout CI hardening (minimal)

Customer constraints baked in: deployments are roll-forward only
(no auto-rollback on Deployment failure); system rollback half of
the upgrade ADR is deferred to v0.4 (snapshot is created but not
used for revert in v0.3); secrets must go through Zitadel + OpenBao
(no plaintext shortcut).

Includes:
  - feature checklist as a status table (14 items),
  - sequencing table with ordering rationale,
  - per-chapter goal / current state with file:line citations /
    plan / open questions / "done when",
  - out-of-scope table with target version + reason,
  - cross-cutting open questions Q1–Q5.

Format follows the user's "tables over prose" preference: every
multi-item section is either a table or bold-led bullets with
nested supporting detail. Scannable at three depths (30-second
scroll for bold leads, 2-minute read for nested detail, deep read
with code where it matters).

2026-05-24 10:54:08 -04:00

Fleet Platform v0.3 — last-mile plan

Authoritative plan for the last mile before the fleet ships to a real customer. Picks up where v0_2_plan.md left the chapter structure. Written 2026-05-24, after feat/iot-walking-skeleton (#264) merged and feat/smoke-test-contract landed the Phase 0 smoke companion.

The frame:

v0.1 proved the shape.
v0.2 locked the brick design.
v0.3 makes the brick safe to hand to a customer running production workloads on Pis in their basement.

State coming in

IoT walking skeleton merged. Operator + agent + NATS + Zitadel + auth callout running end-to-end against an OKD staging cluster.
Smoke-test contract Phase 0 merged (feat/smoke-test-contract).
- Probe / SmokeSuite / SmokeTest companion + deploy_with_smoke in harmony-fleet-deploy/src/companion/smoke/.
- One concrete probe today: TcpReachable.
- No fleet Score wired to a real smoke test yet — Phase 1 is in this roadmap.
Agent runs as a systemd user unit on devices (see harmony/src/modules/fleet/setup_score.rs:263–283).
- No on-device containerized agent path.
- The Dockerfile in fleet/harmony-fleet-agent/Dockerfile is k8s-only today.
Dashboard has no role enforcement — security gap.
- Maud/htmx frontend at fleet/harmony-fleet-operator/src/frontend/server.rs.
- Verifies Zitadel JWT signature + expiry only.
- JwksCache::verify (harmony_zitadel_auth/src/jwks.rs:74) extracts sub/exp/email/name/nonce — no roles.
- VerifiedSession has no roles field.
- Any logged-in Zitadel user gets full dashboard access. Fix immediately (Chapter 1).
NATS callout already has the role-extraction logic we need.
- ZitadelValidator::extract_roles at nats/callout/src/zitadel.rs:203.
- Handles both array shape (["fleet-admin"]) and Zitadel's object-map shape ({"fleet-admin": {org_id: org_name}}).
- roles::resolve maps role names to ResolvedRole::Admin/::Device with admin-wins privilege escalation.
- Chapter 1 reuses the extractor, not the role-to-NATS-permission half.
System upgrade ADR drafted at docs/adr/drafts/Fleet-IoT-Device-System-Upgrade-With-Rollback.md.
- Header says Accepted 2026-05-24 but lives under drafts/.
- Authoritative status: approach agreed, rollback half deferred (Chapter 7).

Customer constraints baked into this plan

Deployments are roll-forward only. No auto-rollback when a new Deployment version fails. Dashboard surfaces the failure; customer edits the spec and rolls forward. Customer ask; may change later, not in v0.3.
System rollback is deferred to v0.4. v0.3 implements upgrade per the ADR; the LVM-snapshot rollback half waits until we've shipped something to production.
Secrets need Zitadel + OpenBao. No plaintext-env-var shortcut. harmony_secret + OpenBao work is on the critical path for any Deployment that needs credentials.

Feature checklist

Status legend: ✅ shipped · 🟡 in flight · 🔴 not started · ⏸ deferred (target version in note).

| # | Feature | Status | Owner / branch | Notes |
|---|---------|--------|----------------|-------|
| 1 | Dashboard role enforcement (`fleet-admin` required) | 🔴 | next branch | Reuse `ZitadelValidator::extract_roles`. Do this right now: security gap. |
| 2 | Operator restart / aggregator cold-rebuild | 🔴 | next branch | More critical than smoke wiring; ship before any customer. |
| 3 | Deployment `getLogs` companion + dashboard log view | 🔴 | next branch | "Makes dashboard useful rather than a toy." Score companion shape. |
| 4 | Agent self-upgrade (NATS-coordinated, systemd-resident) | 🔴 | new branch | Marker lives in NATS, not on disk. Systemd stays. |
| 5 | Graceful deployment upgrade (roll-forward only) | 🔴 | new branch | SIGTERM → grace → SIGKILL fallback → start new. No rollback. |
| 6 | Init containers in `PodmanV0Score` | 🔴 | new branch | Ordered, run-to-completion, customer guarantees idempotency. |
| 7 | System upgrade (no rollback yet) | 🔴 | new branch | Per drafted ADR, minus the LVM-snapshot rollback half. |
| 8 | Secrets via Zitadel + OpenBao for Deployments | ⏸ v0.3+ | blocked on `harmony_secret` | Required for production but not blocking the first customer. |
| 9 | Agent time-drift verification | 🔴 | new branch | Periodic NTP check; refuse JWT operations if skewed. |
| 10 | Phase 1 smoke wiring (HTTP / K8sPodReady / NatsKv probes) | 🔴 | new branch | After required features land. Not a functional blocker. |
| 11 | CI yaml minimization (logic into `harmony-ci` scripts) | ⏸ v0.4 | longer-term | Yaml stays for discovery + parallel viz; scripts move. |
| 12 | NATS callout CI hardening | ⏸ | low-churn crate | Already covered by workspace `cargo test`. Run ignored tests when CI has podman + NATS image. |
| 13 | Application log streaming through NATS | ⏸ v0.4 | follow-on to #3 | #3 is the synchronous `getLogs`; this is the live tail. |
| 14 | Deployment dependencies (`after: [...]`) | ⏸ | not chosen | Init containers (#6) cover the in-deployment case; defer until customers ask. |

Sequencing

| Order | Item | Why |
|-------|------|-----|
| 1 | #1 Dashboard role enforcement | Security gap, do right now. |
| 2 | #2 Operator restart recovery | More critical than smoke wiring. Customer can't tolerate "operator restarted, state unknown." |
| 3 | #3 Log forwarding companion | Turns the dashboard from a toy into a thing customers actually use. |
| 4 | #4 Agent self-upgrade | Parallel-safe with #2/#3; different code paths. |
| 5 | #5 + #6 Graceful upgrade + init containers | Paired Deployment-layer features; ship together. |
| 6 | #9 Time-drift verification | Small, isolated; slot between heavier items. |
| 7 | #7 System upgrade | Builds on agent-upgrade pattern from #4; #4 lands first. |
| 8 | #10 Phase 1 smoke wiring | After required features so probes verify real customer-facing surfaces. |
| 9 | #8 Secrets | Blocks any customer Deployment that needs credentials. Promote if first customer needs them. |
| 10 | #11 / #12 CI | Opportunistic, doesn't block customer. |

Chapter 1 — Dashboard role enforcement (#1)

Goal: every dashboard page requires a valid Zitadel session and a fleet-admin role on the token.

Users without the role get a 403 with a clear message.
Users without a session get the existing login redirect.

Current state

JWKS verify only extracts identity claims. JwksCache::verify (harmony_zitadel_auth/src/jwks.rs:74) parses the JWT and returns a VerifiedSession with sub/exp/email/name/nonce. Roles not extracted.
VerifiedSession has no roles field (harmony_zitadel_auth/src/session.rs:5).
Middleware checks JWT validity only. require_auth (fleet/harmony-fleet-operator/src/frontend/server.rs:136–157). Every authenticated user gets all pages.
Role extraction logic already exists and is correct in the callout: ZitadelValidator::extract_roles at nats/callout/src/zitadel.rs:203. Handles both shapes:
- array — ["fleet-admin"]
- object-map — {"fleet-admin": {org_id: org_name}}
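
To make the two claim shapes concrete, here is a minimal sketch of normalizing both into one role list. It is not the code at nats/callout/src/zitadel.rs: `RolesClaim` is a hypothetical stand-in for the decoded JSON claim, and this `extract_roles` is a simplified illustration that only shares the name.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the decoded roles claim. The real extractor
/// works on the JWT's JSON value, which is not modeled here.
#[derive(Debug, Clone)]
pub enum RolesClaim {
    /// Array shape: ["fleet-admin"]
    Array(Vec<String>),
    /// Object-map shape: {"fleet-admin": {org_id: org_name}}
    ObjectMap(BTreeMap<String, BTreeMap<String, String>>),
}

/// Normalizes either claim shape into a sorted, de-duplicated list of role
/// names. A missing claim yields an empty list rather than an error.
pub fn extract_roles(claim: Option<&RolesClaim>) -> Vec<String> {
    let mut roles: Vec<String> = match claim {
        None => Vec::new(),
        Some(RolesClaim::Array(names)) => names.clone(),
        // Only the keys are role names; the nested org map is ignored.
        Some(RolesClaim::ObjectMap(map)) => map.keys().cloned().collect(),
    };
    roles.sort();
    roles.dedup();
    roles
}
```

Sorting and de-duplicating means both shapes resolve to an identical value for the same role set, which is the property the "identical resolution" test below relies on.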

Plan

Extract a shared role-extraction helper into harmony_zitadel_auth so dashboard and callout import from one place. Callout keeps its API but its body delegates.
Extend VerifiedSession with roles: Vec<String>.
Extend the JWKS Claims decode struct to capture the configured roles claim. Pull the claim name from existing callout config so the two systems agree (Zitadel ships urn:zitadel:iam:org:project:roles or similar).
Add require_role(role: &'static str) middleware to the dashboard. Compose it with require_auth and layer it onto every route registered through Router::route(...), whether the handler is get(...) or post(...).
403 response renders a maud page — "fleet-admin role required; ask your administrator." Not a JSON error; dashboard is human-facing.

Tests

Security code — heavy unit tests are non-negotiable.

Array-shape claim → fleet-admin in session. JWT with array-shape role claim.
Object-map shape → identical resolution. Same role, Zitadel's other claim shape.
No role claim → empty roles. Token with no roles claim.
Wrong role doesn't elevate. JWT with only device role does NOT carry fleet-admin.
No session → 401/redirect.
Session but no fleet-admin → 403.
Session + fleet-admin → 200.
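
The last three cases reduce to one pure decision function, which is easy to unit-test apart from the HTTP layer. A minimal sketch, assuming a `VerifiedSession` that already carries the proposed `roles` field; the `Access` enum and `check_access` are hypothetical names, not existing dashboard code.

```rust
/// Hypothetical session type; `roles` is the field the plan proposes adding.
#[derive(Debug, Clone)]
pub struct VerifiedSession {
    pub sub: String,
    pub roles: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Access {
    /// No valid session: fall through to the existing login redirect.
    LoginRedirect,
    /// Valid session without the required role: render the 403 page.
    Forbidden,
    /// Valid session carrying the required role.
    Allowed,
}

/// Decides the outcome for one request. Role names are compared exactly,
/// so a session holding only another role (e.g. "device") is never elevated.
pub fn check_access(session: Option<&VerifiedSession>, required_role: &str) -> Access {
    match session {
        None => Access::LoginRedirect,
        Some(s) if s.roles.iter().any(|r| r == required_role) => Access::Allowed,
        Some(_) => Access::Forbidden,
    }
}
```

Keeping the decision separate from the middleware means the 401/403/200 matrix can be tested without standing up a router or a Zitadel instance.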

Done when

Branch merged.
All dashboard handlers gated by require_role("fleet-admin").
Every test green.
Manual smoke against staging Zitadel: user without role sees 403.

Chapter 2 — Operator restart + aggregator recovery (#2)

Goal: the operator pod can be killed, upgraded, or rescheduled at any time and the system converges back to correct state from NATS KV alone. No "unknown state" window visible to customers.

Current state

Aggregator rebuilds from scratch on startup. fleet_aggregator.rs (833 LOC, in harmony-fleet-operator/src/) watches the KV buckets to materialize state. JG confirmed: "rebuilt from scratch, yes."
Failure modes not exercised yet:
- Partial KV — device offline during operator reset, never re-published its info.
- Two operator pods racing during a rolling deploy of the operator.
- NATS stream loss between operator restart and rebuild completing.
- Stale KV — Deployment CR deleted in kube while operator was down.

Plan

Scenario-driven. Enumerate failure shapes, then handle one at a time. Discipline: each scenario gets a regression test in harmony-fleet-e2e, then the fix.

Scenario inventory pass. Write docs/fleet-operator-recovery-scenarios.md listing every failure shape we can think of. Cross-reference smoke-a* tests to identify what's already covered.
Cold-start rebuild as the baseline. Confirm and test that after kubectl delete pod on the operator, the replacement pod converges to the pre-kill aggregate in < 30s. Gate on convergence time at a fixed device count N.
Stale-KV reconciliation. Define the rule for "KV says device D has Deployment X, but Deployment X no longer exists in kube." Operator cleans up; agents observe the deletion.
Leader election decision. Ship with leader election (one writer at a time) or design for idempotent multi-writer? Score-Topology-Interpret leans idempotent; confirm + assert operator writes are byte-deterministic.
Liveness signaling for the dashboard. Surface "operator converged" / "operator recovering" as states the frontend renders. Customer sees a loading banner, not a blank dashboard, during rebuild.
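
The stale-KV rule can be sketched as a set difference between what KV claims and what kube still has. This is an illustration under assumptions: `KvKey` and its two fields are a hypothetical key layout, not the actual bucket schema, and the real reconciliation would also need to issue the deletes.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Key of a per-device deployment record in KV (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct KvKey {
    pub device_id: String,
    pub deployment: String,
}

/// Returns the KV keys whose Deployment no longer exists in kube.
/// BTreeMap iterates in key order, so repeated runs over the same inputs
/// produce the same cleanup list in the same order.
pub fn stale_keys<V>(kv: &BTreeMap<KvKey, V>, live_deployments: &BTreeSet<String>) -> Vec<KvKey> {
    kv.keys()
        .filter(|k| !live_deployments.contains(&k.deployment))
        .cloned()
        .collect()
}
```

Using ordered collections here is deliberate: a deterministic cleanup order is one small step toward the idempotent multi-writer option, because two operator pods computing the same inputs arrive at the same list.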

Open questions

Warm-restart snapshot? Keep a per-operator-pod "last known aggregate" snapshot in a KV bucket so warm restarts skip cold rebuild? Probably yes for >1000-device fleets; adds an invalidation problem.
One pod or active/passive? Customer's fleet size answers this. Ask before starting.

Done when

Scenario inventory exists.
Each scenario has a regression test, all green.
Manual chaos: kill operator pod during high write load → convergence + dashboard liveness banner observed.

Chapter 3 — Application log forwarding companion (#3)

Goal: when a customer's Deployment is misbehaving on a Pi in the field, the dashboard shows last-N-lines of container logs without anyone SSH-ing the device.

Design

Logs attach as a Score companion — same pattern as the smoke-test contract.
The companion is optional — Scores without one render "this deployment doesn't expose logs". Acceptable.
Sync getLogs ships in v0.3 as the minimum useful UX; live tail (streaming) waits for v0.4.

Shape:

// new in harmony-fleet-deploy/src/companion/logs/
pub trait LogQuery<T: Topology>: Send + Sync {
    type Score: Score<T>;
    async fn last_lines(
        &self,
        score: &Self::Score,
        topology: &T,
        n: usize,
    ) -> Result<LogChunk, LogQueryError>;
}

pub struct LogChunk {
    pub source: ProbeName, // reuse the validated newtype
    pub captured_at: chrono::DateTime<chrono::Utc>,
    pub lines: Vec<String>,
    pub truncated: bool,
}

For PodmanV0Score:

Transport: NATS request/reply. Subject device-commands.<device_id>.logs.<deployment>.
Agent side: runs podman logs --tail N <container> and replies with a LogChunk.
Dashboard side: one async call from the logs handler.
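
Two small pieces of this path are worth pinning down early: building the request subject safely, and computing the `truncated` flag. The sketch below is an assumption-laden illustration; `logs_subject`, `tail_lines`, and `SubjectError` are hypothetical names. The validation reflects that NATS treats `.` as the subject token separator and `*` / `>` as wildcards, so none of them may appear inside a device id or deployment name.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken,
    InvalidChar(char),
}

/// Rejects values that would change the subject's structure.
fn validate_token(token: &str) -> Result<(), SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken);
    }
    match token
        .chars()
        .find(|&c| matches!(c, '.' | '*' | '>') || c.is_whitespace())
    {
        Some(c) => Err(SubjectError::InvalidChar(c)),
        None => Ok(()),
    }
}

/// Builds device-commands.<device_id>.logs.<deployment>.
pub fn logs_subject(device_id: &str, deployment: &str) -> Result<String, SubjectError> {
    validate_token(device_id)?;
    validate_token(deployment)?;
    Ok(format!("device-commands.{device_id}.logs.{deployment}"))
}

/// Keeps the last `n` lines; the flag reports whether older lines were dropped.
pub fn tail_lines(lines: &[String], n: usize) -> (Vec<String>, bool) {
    let start = lines.len().saturating_sub(n);
    (lines[start..].to_vec(), start > 0)
}
```

Validating on the dashboard side before publishing keeps a malformed deployment name from turning one targeted request into a wildcard or a different subject.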

Plan

Define LogQuery companion trait in a new harmony-fleet-deploy/src/companion/logs/ module.
PodmanLogQuery implementing LogQuery<…> for PodmanV0Score.
Agent-side command handler — extend the existing request/reply command dispatcher.
Dashboard handler at /deployments/<name>/devices/<id>/logs?lines=N returning rendered maud.
Tests: unit on PodmanLogQuery; integration in harmony-fleet-e2e driving end-to-end.

Done when

Customer clicks "View logs" on the dashboard.
Sees the last 200 lines.
Call returns in < 2s on a 3-device fleet.

Chapter 4 — Agent self-upgrade, NATS-coordinated (#4)

Goal: the agent can upgrade itself in place. If NATS is unavailable, the upgrade does not start. The operator sees every step.

Design (per JG's direction)

Stay on systemd for v0.3. Switching the agent runtime to podman is its own risk; defer until self-upgrade protocol matures.
Upgrade marker lives in NATS, not on disk. New KV bucket agent-upgrade keyed by device_id, carrying start_timestamp, invoker_version, target_version, phase.
No NATS → no upgrade. This is a feature, not a limitation: the operator cannot track an upgrade whose marker it cannot read, so refusing to start without NATS prevents silent half-upgrades.

Protocol

Operator writes Requested. agent-upgrade/<device_id> with phase: Requested, target_version: vX.
Old agent observes + writes Suspending. Verifies NATS liveness with a round-trip first.
Old agent suspends + writes Suspended. Reconcile loop paused; heartbeat continues so the operator knows it's alive.
Old agent fetches new binary + writes Fetched. Mechanism TBD (see open questions). target_path: /usr/local/bin/fleet-agent.new.
Old agent launches new binary as a separate process + writes NewLaunched. Not via systemd unit update yet. Includes new_pid: N.
New agent self-checks + writes NewHealthy. Connects to NATS, verifies permissions, one-shot smoke (KV read, command channel echo).
Old agent writes HandingOff and exits. Tells systemd to swap the binary: systemctl daemon-reload + systemctl restart fleet-agent.service with the new binary in place.
Systemd starts the unit pointing at the new binary. Final state phase: Complete, completed_at: T.

On stall (configurable, default 5 min):

Marker writes phase: Failed with last successful step.
Operator surfaces this on the dashboard.
Customer / operator intervenes manually — no auto-rollback in v0.3, consistent with the deployment roll-forward-only rule.
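
The protocol and its stall rule can be modeled as a linear state machine with one escape hatch to Failed. This is a sketch of the phase ordering only; the `Phase` enum, `can_transition_to`, and `is_stalled` are hypothetical names, and who performs the Failed write on a stall is left open here just as it is above.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Requested,
    Suspending,
    Suspended,
    Fetched,
    NewLaunched,
    NewHealthy,
    HandingOff,
    Complete,
    Failed,
}

impl Phase {
    /// The single legal successor on the happy path; terminal phases have none.
    pub fn next(self) -> Option<Phase> {
        use Phase::*;
        match self {
            Requested => Some(Suspending),
            Suspending => Some(Suspended),
            Suspended => Some(Fetched),
            Fetched => Some(NewLaunched),
            NewLaunched => Some(NewHealthy),
            NewHealthy => Some(HandingOff),
            HandingOff => Some(Complete),
            Complete | Failed => None,
        }
    }

    /// A write is accepted if it is the next happy-path step, or a move to
    /// Failed from any non-terminal phase.
    pub fn can_transition_to(self, target: Phase) -> bool {
        self.next() == Some(target) || (target == Phase::Failed && self.next().is_some())
    }
}

/// Default stall timeout from the plan (5 minutes).
pub const DEFAULT_STALL_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// True when a non-terminal phase has sat unchanged longer than `timeout`.
pub fn is_stalled(phase: Phase, since_last_write: Duration, timeout: Duration) -> bool {
    phase.next().is_some() && since_last_write > timeout
}
```

Rejecting out-of-order writes at the marker level makes a skipped step visible as an error instead of a silently inconsistent upgrade record.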

Open questions

Q1.1 Binary distribution. Gitea release asset? Signed OCI artifact? Existing arm-agents.yaml uploads aarch64 binaries to releases — start with that.
Q1.2 Verification. Hash signature? GPG? Minimum: SHA-256 pinned in the upgrade-request payload.
Q1.3 Atomic systemd swap. systemctl restart is not atomic across binary-on-disk and process. Acceptable? Or systemd-run --transient shim?
Q1.4 Cross-arch. Fetch URL has to know the device's arch. KV device-info already carries this; confirm the agent reads its own arch correctly.

Done when

Branch contains the protocol implementation + e2e test driving v0.3.0 → v0.3.1 upgrade against a libvirt VM.
Operator sees every phase.
Failure path tested: deliberately corrupt the new binary → marker reads Failed, old agent stays running.

Chapter 5 — Graceful deployment upgrade, roll-forward only (#5)

Goal: upgrading a Deployment's image/config replaces the old container without dropping traffic mid-request. If the new container won't start, the customer sees the failure clearly and fixes the spec.

Design

Extend PodmanV0Score with a lifecycle block:

pub struct PodmanV0Score {
    // ... existing fields ...
    pub lifecycle: Option<LifecyclePolicy>,
}

pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,       // SIGTERM (default), SIGINT, SIGUSR1
    pub grace_period: Duration,        // default 30s
    pub sigkill_fallback: bool,        // default true
}

Agent's reconcile when image/config changes:

Write Upgrading phase. New DeploymentState::Phase::Upgrading variant. Dashboard shows the in-progress upgrade.
Send stop_signal to the old container.
Wait up to grace_period for clean exit.
SIGKILL fallback if still running and sigkill_fallback.
Start new container.
On startup failure: write Failed and stop. Image pull error, exec error, crash within 5s. No revert to old image.
On success: write Running. Optionally gated by a Phase-1 smoke test (Chapter 10) when that lands.
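
The reconcile steps above can be sketched against a small trait so the ordering is testable without podman. Everything here is illustrative: the `Runtime` trait, `ScriptedRuntime` test double, and this trimmed `Phase` enum are hypothetical, and the branch for `sigkill_fallback == false` (fail the upgrade if the container ignores the stop signal) is an assumption the plan does not spell out.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopSignal { Sigterm, Sigint, Sigusr1 }

#[derive(Debug, Clone)]
pub struct LifecyclePolicy {
    pub stop_signal: StopSignal,
    pub grace_period: Duration,
    pub sigkill_fallback: bool,
}

impl Default for LifecyclePolicy {
    // Defaults taken from the field comments in the plan.
    fn default() -> Self {
        Self {
            stop_signal: StopSignal::Sigterm,
            grace_period: Duration::from_secs(30),
            sigkill_fallback: true,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase { Upgrading, Running, Failed }

/// Hypothetical seam over the container engine.
pub trait Runtime {
    fn signal(&mut self, signal: StopSignal);
    /// Returns true if the old container exited within `grace`.
    fn wait_exit(&mut self, grace: Duration) -> bool;
    fn kill(&mut self);
    /// Err covers image pull errors, exec errors, and early crashes.
    fn start_new(&mut self) -> Result<(), String>;
}

/// Runs the roll-forward sequence and returns the phases written, in order.
pub fn upgrade(rt: &mut impl Runtime, policy: &LifecyclePolicy) -> Vec<Phase> {
    let mut phases = vec![Phase::Upgrading];
    rt.signal(policy.stop_signal);
    if !rt.wait_exit(policy.grace_period) {
        if policy.sigkill_fallback {
            rt.kill();
        } else {
            // Assumption: without the fallback, a container that ignores
            // the stop signal fails the upgrade before the new one starts.
            phases.push(Phase::Failed);
            return phases;
        }
    }
    // No revert on failure: the old container is already gone.
    phases.push(match rt.start_new() {
        Ok(()) => Phase::Running,
        Err(_) => Phase::Failed,
    });
    phases
}

/// Scripted test double: records calls and returns canned results.
pub struct ScriptedRuntime {
    pub exits_in_grace: bool,
    pub new_starts: bool,
    pub calls: Vec<&'static str>,
}

impl Runtime for ScriptedRuntime {
    fn signal(&mut self, _signal: StopSignal) { self.calls.push("signal"); }
    fn wait_exit(&mut self, _grace: Duration) -> bool {
        self.calls.push("wait_exit");
        self.exits_in_grace
    }
    fn kill(&mut self) { self.calls.push("kill"); }
    fn start_new(&mut self) -> Result<(), String> {
        self.calls.push("start_new");
        if self.new_starts { Ok(()) } else { Err("image pull failed".to_string()) }
    }
}
```

Returning the list of written phases keeps the sketch honest about what the dashboard would see: every run starts with Upgrading and ends in exactly one of Running or Failed.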

Explicit non-goals

No auto-rollback. Customer-asked constraint. When the startup-failure step fires, the dashboard shows "Deployment failed; previous version stopped" and the customer edits the spec.
No "stale + new" window. Single container per Deployment per device; short downtime during cutover is accepted.

Done when

Upgrade test in harmony-fleet-e2e walks v1 → v2 → v3 image swap with controlled failures.
Dashboard reflects every step.

Chapter 6 — Init containers (#6)

Goal: customer can declare init containers that run to completion before the main container starts. Mirror Kubernetes semantics so customer mental model transfers.

Design

Extend PodmanV0Score with init_containers: Vec<InitContainer>:

Ordered — declaration order = run order.
Run-to-completion — each one must exit zero before the next starts.
Fail-the-Deployment on init failure — non-zero exit or timeout exceeded.

pub struct InitContainer {
    pub name: String,
    pub image: String,
    pub args: Vec<String>,
    pub env: Vec<EnvVar>,
    pub volumes: Vec<VolumeMount>,
    pub timeout: Duration, // default 5 min
}
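
The three rules (ordered, run-to-completion, fail on first error) fit in one loop. A minimal sketch, assuming the actual container run is injected as a closure; `InitSpec`, `InitOutcome`, `InitFailure`, and `run_init_containers` are hypothetical names, and `InitSpec` carries only the fields the loop needs.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InitOutcome {
    Exited(i32),
    TimedOut,
}

#[derive(Debug, PartialEq, Eq)]
pub struct InitFailure {
    pub name: String,
    pub outcome: InitOutcome,
}

/// Minimal slice of InitContainer: just what the ordering loop needs.
pub struct InitSpec {
    pub name: String,
    pub timeout: Duration,
}

/// Runs init containers in declaration order. Stops at the first one that
/// exits non-zero or times out; later containers are never started.
pub fn run_init_containers<F>(specs: &[InitSpec], mut run: F) -> Result<(), InitFailure>
where
    F: FnMut(&InitSpec) -> InitOutcome,
{
    for spec in specs {
        match run(spec) {
            InitOutcome::Exited(0) => continue,
            outcome => {
                return Err(InitFailure {
                    name: spec.name.clone(),
                    outcome,
                })
            }
        }
    }
    Ok(())
}
```

The caller would map an `Err` to a failed Deployment and an `Ok` to starting the main container.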

Customer contract (document loudly)

Init containers must be idempotent. They run on every reconcile that requires a fresh main container — power-cycle recovery, graceful upgrade, etc.

Customer-side migration scripts that aren't idempotent will misbehave.
Document with examples.
Add a Score-builder lint that warns on common non-idempotent patterns (e.g. INSERT without ON CONFLICT).

Done when

harmony-fleet-e2e deploys a Deployment with one init container (mkdir -p /data && touch /data/initialized) followed by a main container that asserts the file exists.
Two-step ordering sequence tested.

Chapter 7 — System upgrade, rollback deferred (#7)

Goal: the device can apt-upgrade its base OS without bricking. Implements the upgrade flow per the drafted ADR; the LVM-snapshot rollback half is deferred to v0.4.

What ships in v0.3

Pre-upgrade snapshot creation (LVM thin snapshot of root LV). Created but not used for revert in v0.3.
Boot-attempt counter on FAT /boot partition (per ADR design).
Userspace control-plane check-in timer.
Idempotent provisioning conversion script (partition → PV/VG/LV, initramfs regen, cmdline.txt update, watchdog config).
Canary hardware test of the upgrade-succeeds path.

What's explicitly NOT in v0.3

Initramfs local-top boot-attempt hook that triggers rollback.
Userspace soft-failure path that merge-reverts the snapshot.
Any rollback wiring.

The snapshot exists so v0.4 can flip on the rollback half without re-provisioning devices.

Done when

Canary Pi successfully upgrades from a known-good base image to a later one.
Snapshot exists post-upgrade.
No customer-visible regression.
Per "Full Verification Before Done" rule: green on both aarch64 and x86_64 device classes.

Chapter 8 — Secrets via Zitadel + OpenBao (#8, deferred)

Lands when harmony_secret is ready.
Out of scope for v0.3 chapter-by-chapter work, but required before any production customer deploys an app that needs credentials.
Track as a separate item. Surface to the customer as: "your first Deployments should use environment variables only until v0.3.x."

Chapter 9 — Agent time-drift verification (#9)

Goal: agent refuses to operate (or warns loudly) when its clock is skewed enough to break JWT validation.

Design

Startup NTP-style query against a configurable server list (default: time.cloudflare.com, pool.ntp.org).
Refuse to start on |drift| > 30s. Typical JWT skew tolerance — past it, every NATS callout request fails with a cryptic exp invalid.
Periodic re-check every 6 hours. Mid-run drift past threshold → agent publishes a DeviceInfo health flag, dashboard surfaces it.
Specific customer-facing error message: "system clock skew is 14m32s; JWT validation will fail. Enable systemd-timesyncd or chrony."
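
The threshold check and the error message can be sketched independently of how the reference time is obtained. This assumes the NTP-style query has already produced a `SystemTime`; `drift`, `check_clock`, and `MAX_DRIFT` are hypothetical names.

```rust
use std::time::{Duration, SystemTime};

/// Startup threshold from the plan: refuse to start past 30 seconds of drift.
pub const MAX_DRIFT: Duration = Duration::from_secs(30);

/// Absolute difference between the local clock and a reference time,
/// regardless of which one is ahead.
pub fn drift(local: SystemTime, reference: SystemTime) -> Duration {
    match local.duration_since(reference) {
        Ok(ahead) => ahead,
        Err(behind) => behind.duration(),
    }
}

/// Formats a skew as e.g. "14m32s", or "45s" when under a minute.
fn format_skew(d: Duration) -> String {
    let secs = d.as_secs();
    if secs >= 60 {
        format!("{}m{}s", secs / 60, secs % 60)
    } else {
        format!("{}s", secs)
    }
}

/// Ok when the clock is within tolerance, otherwise the customer-facing error.
pub fn check_clock(local: SystemTime, reference: SystemTime) -> Result<(), String> {
    let skew = drift(local, reference);
    if skew > MAX_DRIFT {
        Err(format!(
            "system clock skew is {}; JWT validation will fail. Enable systemd-timesyncd or chrony.",
            format_skew(skew)
        ))
    } else {
        Ok(())
    }
}
```

Taking the absolute difference matters because a clock that is behind breaks token validation just as surely as one that is ahead.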

Done when

Test in harmony-fleet-e2e runs against a libvirt VM with clock forced 5 minutes off.
Agent refuses to start with the expected error message.
Recovery: fix the clock → agent comes up clean.

Chapter 10 — Phase 1 smoke wiring (#10)

Goal: real fleet Scores carry real smoke tests. The Phase 0 contract becomes load-bearing.

Scope

HttpHealthy probe — GET a URL, expect 2xx, optional response-body-contains assertion.
K8sPodReady probe — kube client lookup for pod readiness condition.
NatsKvKeyExists probe — KV bucket + key, optional value-deserializes-to-T assertion.
FleetOperatorSmokeTest — pairs with FleetOperatorScore. Operator pod ready + /healthz returns 200 + can write to device-info KV.
FleetAgentSmokeTest — pairs with FleetAgentScore. Agent pod ready + heartbeat published to KV within 30s.
HarmonyEvent::SmokeStage{Started,Finished,Skipped} variants (additive) so the dashboard can render the live pipeline.
Dashboard pipeline view — maud renderer subscribing to instrumentation events.
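
As an illustration of the simplest probe, the pass/fail rule for HttpHealthy can be written as a pure function over an already-received response. This is a sketch only: `ProbeOutcome`, `HttpExpectation`, and `evaluate_http` are hypothetical names and do not reflect the existing Probe trait in harmony-fleet-deploy, and the HTTP request itself is out of scope here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeOutcome {
    Passed,
    /// Carries a human-readable reason for the dashboard.
    Failed(String),
}

pub struct HttpExpectation<'a> {
    /// Optional substring the response body must contain.
    pub body_contains: Option<&'a str>,
}

/// Applies the HttpHealthy rule: status must be 2xx, and if a substring is
/// configured the body must contain it.
pub fn evaluate_http(status: u16, body: &str, expect: &HttpExpectation<'_>) -> ProbeOutcome {
    if !(200..300).contains(&status) {
        return ProbeOutcome::Failed(format!("expected 2xx, got {}", status));
    }
    match expect.body_contains {
        Some(needle) if !body.contains(needle) => {
            ProbeOutcome::Failed(format!("response body does not contain {:?}", needle))
        }
        _ => ProbeOutcome::Passed,
    }
}
```

Separating evaluation from transport lets the unit tests cover the decision table exhaustively, while the integration test against staging only has to prove the request is made.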

Sequencing within this chapter (strict order)

HarmonyEvent variants — one-line additive change to harmony/src/domain/instrumentation.rs.
Probes one at a time — HTTP, K8sPodReady, NatsKvKeyExists. Each: unit tests + an integration test against the staging cluster.
FleetOperatorSmokeTest composing the above.
FleetAgentSmokeTest.
Dashboard renderer last — once the events are flowing, UI is mostly maud + htmx polling.

Done when

deploy_with_smoke(FleetOperatorScore, FleetOperatorSmokeTest, ...) returns successfully against staging.
Dashboard shows the live pipeline.
Deliberate breakage (point the operator's helm chart at a bad image) → smoke fails visibly, failing probe named on dashboard.

Chapter 11 — CI yaml minimization (#11, longer-term)

Pulled out of the chapter-by-chapter v0.3 work.

Frame: workflow yaml files in .gitea/workflows/ (4 files, ~235 LOC) should hold only what Gitea Actions needs for job discovery + parallel viz. Job bodies are one-line calls into portable scripts.

Direction

Build out a harmony-ci Rust CLI crate. Commands like harmony-ci build composer-linux, harmony-ci publish operator-image, harmony-ci check.
Each workflow yaml job becomes run: cargo run -p harmony-ci -- <command>.
Scripts must run identically from a developer's laptop.
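
One possible shape for the CLI's entry point, using only the three example commands named above. This is a sketch under assumptions: the `Command` enum and `parse_command` are hypothetical, and the real crate may well use an argument-parsing library instead of hand-rolled matching.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Command {
    Build { target: String },
    Publish { artifact: String },
    Check,
}

/// Parses the arguments that follow `cargo run -p harmony-ci --`.
pub fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args {
        ["build", target] => Ok(Command::Build { target: target.to_string() }),
        ["publish", artifact] => Ok(Command::Publish { artifact: artifact.to_string() }),
        ["check"] => Ok(Command::Check),
        [] => Err("usage: harmony-ci <build|publish|check> [name]".to_string()),
        other => Err(format!("unrecognized command: {}", other.join(" "))),
    }
}
```

Because the parser is a plain function over string slices, the same code path runs in a Gitea job and on a laptop, which is the property the last bullet asks for.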

Not in v0.3

Multi-day effort; doesn't block the customer.
Slot when bandwidth allows.
Opportunistically convert when touching a workflow file for other reasons.

Chapter 12 — NATS callout CI hardening (#12, minimal)

nats/callout is a low-churn crate that works today.
Workspace-wide cargo test in .gitea/workflows/check.yml covers the non-ignored tests.
Four #[ignore]'d integration tests in nats/integration-test-callout/tests/callout_e2e.rs need podman + a NATS image pull in the runner.

Direction

Don't add CI infra in v0.3 just to run these.
When a runner with podman + image pull exists for other reasons (e2e harness, system upgrade test matrix), add the callout integration tests to it.
Until then: keep current workspace-wide coverage.

Out of scope for v0.3 (deferred deliberately)

| Item | Target | Why deferred |
|------|--------|--------------|
| Deployment-level auto-rollback | maybe never | Customer asked for roll-forward only. |
| System-upgrade LVM-snapshot rollback half | v0.4 | Push to prod first; widen scope after. |
| Live log tailing (streaming) | v0.4 | Chapter 3 ships sync `getLogs`; live tail builds on it. |
| Deployment dependencies (cross-deploy ordering) | TBD | Init containers cover the common case; wait for customer ask. |
| Secrets via Zitadel + OpenBao | v0.3.x | Blocked on `harmony_secret` work. |
| Containerized agent (podman instead of systemd) | v0.4+ | Self-upgrade protocol matures first on systemd. |
| Operator HA (active/active or active/passive) | TBD | One pod sufficient for v0.3; scale-out when fleet size demands. |
| Multi-tenant fleet isolation tests | v0.4 | Callout permissions cover the mechanism; cross-tenant smoke later. |

Open questions

These don't block starting v0.3 work but need resolution before the relevant chapter completes.

Q1 (Chapter 4): Binary distribution mechanism for agent upgrades. Gitea releases vs OCI artifacts vs something else.
Q2 (Chapter 2): Snapshot the aggregate to KV? Faster recovery vs invalidation complexity.
Q3 (Chapter 7): Canary test matrix? Concretely: which Pi models, which base images, which apt sources.
Q4 (Chapters 5 + 10): Sequencing of Chapter 10 vs Chapter 5. Both benefit from smoke; right answer might be to ship Phase 1 smoke during Chapter 5 so upgrade gates on it. Decide when starting Chapter 5.
Q5 (cross-cutting): One operator pod or active/passive? Customer's fleet size answers this; ask before Chapter 2 starts.

When v0.3 is done

Chapters 1–7, 9, and 10 merged (Chapter 8 lands separately once harmony_secret is ready).
A real customer Deployment runs on a real Pi in a real basement.
The dashboard shows live status and logs.
An agent upgrade has been driven through the full protocol successfully (and a failure path tested).
A system upgrade has been driven through the full protocol on a canary.

v0.4 picks up the deferred items in priority order.
