Two design documents framing the next push. `ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push. Replaces the open-ended chapter structure of v0_1_plan.md for the period between the walking-skeleton merge and v0.1.0 in production. Focus is locking the fleet module's public API surface so the inevitable physical refactor (out of `harmony/modules/fleet/`, into `fleet/harmony-fleet/`) is mechanical when we get to it. Anchored in the principle from JG's *Pour l'amour des compilateurs* talk: design the brick before moving the brick. `docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure. K8s rolling-update shape applied to one host: drain in-flight work, stage versioned binary alongside old, smoke-test, atomic symlink swap, both agents alive briefly, operator verifies new agent's heartbeat then sends explicit stop signal to old, old exits cleanly. No version is ever erased — N-history on disk is the rollback target. Operator-driven cutover (not self-stopping) so the most-trusted side decides the handoff. Implementation deferred to post-v0.1 backlog; spec exists so anyone can build it without reinventing the design. ADR README index updated.
Fleet Platform v0.2 — 3-day production push
Authoritative plan for the next three days. Picks up where
v0_1_plan.md left the chapter structure and supersedes its forward
chapters where they conflict. Written 2026-05-06, end of the
feat/iot-walking-skeleton branch (31 996 LOC, 184 commits).
State coming in
- Skeleton end-to-end works against an OKD staging cluster: Zitadel + NATS + auth callout + operator + agent (one VM today, real Pi tomorrow). Verified by hand 2026-05-06.
- ~10 ancillary PRs still open across the team. Branch graph is noisy.
- harmony/modules/fleet/ is the wrong long-term home for the fleet code. Flagged in the April 2026 code review. The reasons we kept it there during bring-up are subtle (cross-module dependencies on K8sAnywhereTopology, HelmChartScore, K8sResourceScore, harmony_secret, the Topology capability traits); those need to be written down before the file move, not after. ADR pending; not started yet.
- Agent upgrade path is undefined. Without it we cannot ship a v0.1 agent into the field.
- ~408 compilation warnings. Not blocking, but the count needs to be 0 before we put -Dwarnings in CI.
Strategy
This isn't 10 weeks of scaffolding. It's three days of locking the
API surface so the inevitable refactor — moving fleet out of
harmony/modules/fleet/ into fleet/harmony-fleet/, splitting
K8sAnywhereTopology into K8sBareTopology, etc. — is mechanical
when we get to it.
The frame from JG's Pour l'amour des compilateurs talk applies directly: design the brick before moving the brick. Physical relocation is cheap. Redesigning a public API after customers depend on it is expensive. We use these three days to make sure the type-level contract is what we want it to be at v1.0, even if the file paths still smell like v0.1.
Day 1 — Lock the brick design
Goal: a fleet façade stable enough to ship to production and refactor freely afterwards.
1.1 Decompose FleetDeviceAuth to resolved states only
Today: TomlShared | ZitadelJwt | ZitadelEnroll. Cardinality 3.
After: ZitadelJwt-shape only. Cardinality 1.
- TomlShared: v0 dev cruft, no production caller. Delete.
- ZitadelEnroll: pre-resolution state (carries unresolved admin credentials). Doesn't belong in a type that represents "the agent's NATS auth on disk". Move to its own type (DeviceEnrollmentIntent) used only by the enrollment Score + binary. Resolution produces a ZitadelJwt and that's what the agent sees.
The render_toml match on &self.auth collapses to one arm. The
"is this resolved yet?" branch class disappears. Test
render_toml_zitadel_enroll_renders_same_as_zitadel_jwt becomes
unnecessary (the question is undefined; you can't render an
unresolved auth).
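A minimal sketch of the cardinality-1 shape described in 1.1. The field names (issuer_url, jwt, admin_client_id), the resolve method, and the fetch_jwt callback are hypothetical stand-ins chosen for illustration; the real types and the real Zitadel exchange will differ. For brevity, render_toml is placed directly on the auth type here rather than on the config that owns it.

```rust
/// Resolved auth: the only shape the agent ever sees on disk.
/// Field names are illustrative assumptions, not the real definition.
#[derive(Debug, Clone, PartialEq)]
pub enum FleetDeviceAuth {
    ZitadelJwt { issuer_url: String, jwt: String },
}

/// Pre-resolution state, kept out of `FleetDeviceAuth` entirely.
/// Only the enrollment Score + binary would construct this.
#[derive(Debug, Clone)]
pub struct DeviceEnrollmentIntent {
    pub issuer_url: String,
    pub admin_client_id: String,
}

impl DeviceEnrollmentIntent {
    /// Resolution consumes the intent and yields resolved auth.
    /// `fetch_jwt` stands in for the real token exchange (hypothetical).
    pub fn resolve<F>(self, fetch_jwt: F) -> Result<FleetDeviceAuth, String>
    where
        F: FnOnce(&str, &str) -> Result<String, String>,
    {
        let jwt = fetch_jwt(&self.issuer_url, &self.admin_client_id)?;
        Ok(FleetDeviceAuth::ZitadelJwt {
            issuer_url: self.issuer_url,
            jwt,
        })
    }
}

impl FleetDeviceAuth {
    /// One variant, one arm: there is no "unresolved" case left to handle.
    pub fn render_toml(&self) -> String {
        match self {
            FleetDeviceAuth::ZitadelJwt { issuer_url, jwt } => format!(
                "[auth]\nkind = \"zitadel_jwt\"\nissuer_url = \"{}\"\njwt = \"{}\"\n",
                issuer_url, jwt
            ),
        }
    }
}
```

Because resolve takes self by value, an intent cannot be rendered or reused after resolution; the compiler rejects the "render an unresolved auth" case instead of a test having to cover it.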
1.2 Define the fleet façade
What does code outside the fleet module see? Today that's a deep
walk into harmony::modules::fleet::operator::chart::ChartOptions.
Leakage. Lock the seam:
harmony::modules::fleet::
FleetServerScore (existing — composed install)
FleetDeviceEnrollScore (new — wraps fleet_device_enroll)
FleetDeviceSetupScore (existing — keeps API)
FleetDeviceAuth (resolved-only, per 1.1)
AdminAuth (existing)
// sealed:
operator:: pub(crate)
setup_score's internals pub(crate)
chart:: pub(crate)
Once locked, the file location doesn't matter. pub use
re-exports preserve callers' imports across the eventual physical
move.
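The seam above could look roughly like the following, reduced to a single Score. This is a sketch under assumptions: the server_score module name, the ChartOptions field, and the constructor are invented for illustration, and only the visibility pattern (pub(crate) internals plus a pub use re-export) is the point.

```rust
pub mod fleet {
    // Sealed internals: reachable inside the crate, invisible outside it.
    pub(crate) mod operator {
        pub(crate) mod chart {
            /// Hypothetical stand-in for the real ChartOptions.
            #[derive(Debug, Clone)]
            pub(crate) struct ChartOptions {
                pub(crate) namespace: String,
            }
        }
    }

    // Private module holding the implementation; its path is not part of
    // the public contract and can move without breaking callers.
    mod server_score {
        use super::operator::chart::ChartOptions;

        #[derive(Debug, Clone)]
        pub struct FleetServerScore {
            // Private field: callers never name ChartOptions themselves.
            chart: ChartOptions,
        }

        impl FleetServerScore {
            pub fn new(namespace: &str) -> Self {
                Self {
                    chart: ChartOptions {
                        namespace: namespace.to_string(),
                    },
                }
            }

            pub fn namespace(&self) -> &str {
                &self.chart.namespace
            }
        }
    }

    // The facade: the only path external code is meant to import from.
    pub use server_score::FleetServerScore;
}
```

Callers write fleet::FleetServerScore and nothing deeper. If server_score later lives in another crate, the same pub use line can point there and the import path callers depend on stays unchanged.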
1.3 Defer the placement ADR
JG isn't satisfied with the design yet. ADR-021 stays in proposed limbo until the seam from 1.2 is committed and we've lived with it for a sprint.
Day 1 done when: fleet façade committed, TomlShared and
ZitadelEnroll removed from FleetDeviceAuth, every existing
caller compiles unchanged, no file moves.
Day 2 — Polish E2E + ship the upgrade ADR
Two streams in parallel.
Stream A — E2E hardening (~½ day)
- A.1 Operator graceful degradation on bad device_id. The CLI now rejects bad ids upfront, but a stray bad KV entry shouldn't take the operator down. Log + skip, don't restart-loop.
- A.2 Persist nats_auth_pass and the issuer NKey via harmony_secret. The regenerate-every-run footgun bit us twice on 2026-05-06. Make these Secrets the same way NatsAdmin and ZitadelAdmin already are.
- A.3 Single regression script. fleet/scripts/e2e-prod-shape.sh. Full bring-up + enroll + assert against a target cluster. Same shape as the existing smoke-a*.sh. CI consumes this later.
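The "log + skip" behaviour from A.1 can be sketched as a filter over the KV keys before reconciliation. Assumptions to note: device ids are taken to be RFC 1123 DNS labels in the lowercase form Kubernetes uses (the plan only says "RFC1123 validation"), and the function names are hypothetical rather than taken from the operator.

```rust
/// RFC 1123 label check (assumed form): 1 to 63 characters, lowercase
/// ASCII letters, digits, or '-', starting and ending with a letter or digit.
pub fn is_rfc1123_label(id: &str) -> bool {
    let bytes = id.as_bytes();
    if bytes.is_empty() || bytes.len() > 63 {
        return false;
    }
    let alnum = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit();
    alnum(bytes[0])
        && alnum(bytes[bytes.len() - 1])
        && bytes.iter().all(|&b| alnum(b) || b == b'-')
}

/// Split raw KV keys into ids the operator will reconcile and ids it skips.
/// A bad entry is logged and dropped; it never aborts the whole pass.
pub fn select_valid_devices<'a>(entries: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut valid = Vec::new();
    let mut skipped = Vec::new();
    for &id in entries {
        if is_rfc1123_label(id) {
            valid.push(id);
        } else {
            // Log and move on instead of returning an error to the caller.
            eprintln!("skipping KV entry with invalid device_id: {id:?}");
            skipped.push(id);
        }
    }
    (valid, skipped)
}
```

Returning the skipped ids alongside the valid ones lets the caller surface them (a metric or a status field, say) without turning one stray entry into a restart loop.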
Stream B — ADR-022: Agent upgrade procedure (~½ day)
The ADR is the deliverable, not the implementation. Specifies the
mechanism so anyone can implement it later without inventing the
design. See docs/adr/022-fleet-agent-upgrade.md.
Summary of the design (full detail in the ADR):
- K8s rolling-update shape, single-host. Wait for in-flight reconciles to complete + all managed services healthy + a scheduling lock from the operator before swapping.
- Versioned binary layout on disk:

      /usr/bin/fleet-agent-v0.1.1
      /usr/bin/fleet-agent-v0.1.2
      /usr/local/bin/fleet-agent → symlink to current

  No version is ever erased; N-history is the rollback target.
- Old verifies new + reports up. Old agent stages new, smoke-tests it (--self-test), starts it, watches for the new agent's heartbeat to land in NATS with the new version. Only then does the operator know the upgrade succeeded.
- Operator drives the cutover. Operator sends an explicit stop signal to the old agent over NATS. Old agent exits cleanly. New agent is already running and takes over.
- Reverse path is identical. Roll back = operator publishes desired_version = previous; new agent does the same dance to hand off to old.
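One common way to implement the symlink swap step on a Unix host is to create the new link under a temporary name and rename it over the current one, since rename(2) replaces the destination in a single step. The sketch below assumes that approach; the ADR may specify something different, and the function name and the ".swap-tmp" suffix are hypothetical.

```rust
use std::ffi::OsString;
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Point `link` at `target` without a window where `link` is missing.
/// Versioned binaries are never touched, so older versions stay on disk
/// as rollback targets.
pub fn swap_symlink(target: &Path, link: &Path) -> io::Result<()> {
    // Temporary name next to the real link, so both are on one filesystem.
    let mut tmp_name: OsString = link.as_os_str().to_owned();
    tmp_name.push(".swap-tmp");
    let tmp = PathBuf::from(tmp_name);

    // Clear a leftover temp link from an interrupted earlier attempt.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    symlink(target, &tmp)?;
    // rename acts on the symlink itself and replaces `link` in one step.
    fs::rename(&tmp, link)
}
```

Rolling back is the same call with the previous versioned binary as target, which fits the "reverse path is identical" property above.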
Day 2 done when: A.1–A.3 committed, ADR-022 landed, regression script green against staging.
Day 3 — Production deploy
Goal: customer cluster on v0.1, runbook accurate, signed off.
- 3.1 Tag v0.1.0 from master after feat/iot-walking-skeleton is merged.
- 3.2 Run e2e-prod-shape.sh against the customer's prod OKD cluster. Every diff between scripted and reality goes back into the script, so the script is the runbook.
- 3.3 Production-shape doc twin of docs/guides/fleet-staging-install.md. Deltas only, ~50 lines.
- 3.4 docs/guides/fleet-device-enrollment.md: operator-facing enrollment runbook. Captures the SSO --admin-oidc-client-id resolution and the --device-id RFC1123 validation we locked in on 2026-05-06.
- 3.5 Operational basics: revoke a device, rotate a key, read the operator's logs, read NATS. Bullet lists are fine; bullet-list-quality docs beat missing docs.
Day 3 done when: customer's prod cluster runs real workloads, the runbook is what we actually used, and we'd hand operations to someone else.
In parallel — frontend (junior, ~1 week, target Day 5 merge)
Junior owns end-to-end. Spec:
- F.1 Read-only Leptos SPA. Devices + Deployments + per-device drilldown (DeviceInfo + last-heartbeat + agent version).
- F.2 NATS tail panel. SSE stream of device-info and device-state updates, plain text.
- F.3 Served by the operator pod itself (one less Deployment). SSO via the existing Zitadel device-code app (harmony-cli).
- F.4 Not in v0.1: write paths, metrics dashboards, fleet-wide rollout views, NATS GUI. None of those.
This validates the platform is observable from outside the operator's logs — the customer's specific ask.
What slips to v0.2+ (post-prod backlog)
No calendar pressure on these; sequence after we see real customer usage.
| Item | Why deferred | Cost when we do it |
|---|---|---|
| Pluggable harmony CLI (kubectl-style PATH discovery) + harmony-fleet plugin | Customer doesn't run it themselves yet; we do. Examples are good enough. | ~1 week, mostly rename/restructure given Day 1's API freeze. |
| Physical refactor of harmony/modules/fleet/ → fleet/harmony-fleet/ | The Day-1 façade settles the design; the move is mechanical and the ADR for it is still in draft. | ~2 days. |
| Agent upgrade implementation (ADR ships Day 2; impl later) | First customer fleet is small enough to hand-upgrade if needed. | ~1 week. |
| ArgoCD chart publishing | Customer uses ArgoCD downstream but their initial deploy goes through harmony directly. | ~3 days. |
| Full CI e2e (k3d nightly + libvirt + OKD daily) | Manual rehearsal works for one customer. | ~1 week + runner capacity. |
| OpenBao integration (replaces ZitadelClientConfig cache file) | Cache file works for single-operator use; OpenBao is the multi-operator answer. | ~1 week. |
| harmony run <ScoreName> --field=value ad-hoc Score CLI | No v0.1 customer flow needs it. | ~2 weeks (Score-flag derive macro is the hard part). |
| Fleet-wide rollout strategies (canary, %-based) on top of the agent-upgrade primitive | Single-device upgrade is sufficient until >100-device fleets. | ~1 week. |
| Drop K8sAnywhereTopology for ad-hoc Score execution; introduce K8sBareTopology | Per the existing v0_1 §"Principles". Not blocking prod. | ~3 days. |
Principles (kept verbatim from v0_1, still load-bearing)
- No yaml in framework code paths. Typed kube-rs everywhere.
- Scores describe desired state; topologies expose capabilities.
- Cross-boundary wire types in harmony-reconciler-contracts.
- Never ship untested code.
- Prove claims about upstream before blaming upstream.
Adding one for v0.2:
- Design the brick before moving the brick. Lock the public API contract first; physical relocation later. Cardinality-matched types, "make impossible states impossible" — the type system is the deterministic feedback loop that scales with LLM-era code generation throughput. (See JG's Pour l'amour des compilateurs, Botpress Meetup, 2026-04-30.)