Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push. Replaces the open-ended chapter structure of v0_1_plan.md for the period between the walking-skeleton merge and v0.1.0 in production. Focus is locking the fleet module's public API surface so the inevitable physical refactor (out of `harmony/modules/fleet/`, into `fleet/harmony-fleet/`) is mechanical when we get to it. Anchored in the principle from JG's *Pour l'amour des compilateurs* talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure. K8s rolling-update shape applied to one host: drain in-flight work, stage versioned binary alongside old, smoke-test, atomic symlink swap, both agents alive briefly, operator verifies new agent's heartbeat then sends explicit stop signal to old, old exits cleanly. No version is ever erased — N-history on disk is the rollback target. Operator-driven cutover (not self-stopping) so the most-trusted side decides the handoff. Implementation deferred to post-v0.1 backlog; spec exists so anyone can build it without reinventing the design.

ADR README index updated.

# Fleet Platform v0.2 — 3-day production push

Authoritative plan for the next three days. Picks up where
`v0_1_plan.md` left the chapter structure and supersedes its forward
chapters where they conflict. Written 2026-05-06, end of the
`feat/iot-walking-skeleton` branch (31 996 LOC, 184 commits).

## State coming in

- Skeleton end-to-end works against an OKD staging cluster: Zitadel +
  NATS + auth callout + operator + agent (one VM today, real Pi
  tomorrow). Verified by hand 2026-05-06.
- ~10 ancillary PRs still open across the team. Branch graph is
  noisy.
- `harmony/modules/fleet/` is the wrong long-term home for the fleet
  code. Flagged in the April 2026 code review. Reasons we kept it
  there during bring-up are subtle (cross-module dependencies on
  `K8sAnywhereTopology`, `HelmChartScore`, `K8sResourceScore`,
  `harmony_secret`, the `Topology` capability traits) — those need
  to be written down before the file move, not after. **ADR
  pending; not started yet.**
- Agent upgrade path is undefined. Without it we cannot ship a
  v0.1 agent into the field.
- ~408 compilation warnings. Not blocking, but the count needs to
  be 0 before we put `-Dwarnings` in CI.

## Strategy

This isn't 10 weeks of scaffolding. It's three days of locking the
**API surface** so the inevitable refactor — moving fleet out of
`harmony/modules/fleet/` into `fleet/harmony-fleet/`, splitting
`K8sAnywhereTopology` into `K8sBareTopology`, etc. — is mechanical
when we get to it.

The frame from JG's *Pour l'amour des compilateurs* talk applies
directly: **design the brick before moving the brick.** Physical
relocation is cheap. Redesigning a public API after customers
depend on it is expensive. We use these three days to make sure
the type-level contract is what we want it to be at v1.0, even if
the file paths still smell like v0.1.

## Day 1 — Lock the brick design

**Goal:** a fleet façade stable enough to ship to production and
refactor freely afterwards.

### 1.1 Decompose `FleetDeviceAuth` to *resolved states only*

Today: `TomlShared | ZitadelJwt | ZitadelEnroll`. Cardinality 3.

After: `ZitadelJwt`-shape only. Cardinality 1.

- `TomlShared` — v0 dev cruft, no production caller. Delete.
- `ZitadelEnroll` — *pre-resolution* state (carries unresolved
  admin credentials). Doesn't belong in a type that represents
  "the agent's NATS auth on disk". Move to its own type
  (`DeviceEnrollmentIntent`) used only by the enrollment Score +
  binary. Resolution produces a `ZitadelJwt` and that's what
  the agent sees.

The `render_toml` match on `&self.auth` collapses to one arm. The
"is this resolved yet?" branch class disappears. Test
`render_toml_zitadel_enroll_renders_same_as_zitadel_jwt` becomes
unnecessary (the question is undefined; you can't render an
unresolved auth).

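
A minimal sketch of the split described in 1.1, to make the cardinality argument concrete. The type names `FleetDeviceAuth` and `DeviceEnrollmentIntent` come from the plan; every field name, the `EnrollError` type, the TOML keys, and the `mint_jwt` closure (a stand-in for the real Zitadel call) are assumptions made for illustration, not the module's actual API.

```rust
/// Resolved NATS auth as the agent reads it from disk.
/// Field names are hypothetical; the plan only fixes the "ZitadelJwt-shape".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetDeviceAuth {
    pub issuer_url: String,
    pub jwt: String,
}

/// Pre-resolution state. Only the enrollment Score and binary would see it.
/// It carries admin credentials, so it never reaches the agent.
#[derive(Debug, Clone)]
pub struct DeviceEnrollmentIntent {
    pub issuer_url: String,
    pub device_id: String,
    pub admin_token: String,
}

/// Hypothetical error type for a failed resolution.
#[derive(Debug, PartialEq, Eq)]
pub enum EnrollError {
    EmptyJwt,
}

impl DeviceEnrollmentIntent {
    /// Consumes the intent and yields the only auth shape the agent knows.
    /// `mint_jwt(device_id, admin_token)` stands in for the Zitadel round trip.
    pub fn resolve<F>(self, mint_jwt: F) -> Result<FleetDeviceAuth, EnrollError>
    where
        F: FnOnce(&str, &str) -> String,
    {
        let jwt = mint_jwt(&self.device_id, &self.admin_token);
        if jwt.is_empty() {
            return Err(EnrollError::EmptyJwt);
        }
        Ok(FleetDeviceAuth {
            issuer_url: self.issuer_url,
            jwt,
        })
    }
}

impl FleetDeviceAuth {
    /// One shape, so there is no `match` and no "resolved yet?" branch.
    /// The TOML keys here are illustrative only.
    pub fn render_toml(&self) -> String {
        format!(
            "[auth]\nissuer_url = \"{}\"\njwt = \"{}\"\n",
            self.issuer_url, self.jwt
        )
    }
}
```

Because `resolve` takes `self` by value, an unresolved intent cannot be rendered or reused after resolution; the compiler enforces what the deleted test used to check.
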
### 1.2 Define the `fleet` façade

What does code outside the fleet module see? Today that's a deep
walk into `harmony::modules::fleet::operator::chart::ChartOptions`.
Leakage. Lock the seam:

```text
harmony::modules::fleet::
    FleetServerScore         (existing — composed install)
    FleetDeviceEnrollScore   (new — wraps fleet_device_enroll)
    FleetDeviceSetupScore    (existing — keeps API)
    FleetDeviceAuth          (resolved-only, per 1.1)
    AdminAuth                (existing)

    // sealed:
    operator::               pub(crate)
    setup_score's internals  pub(crate)
    chart::                  pub(crate)
```

Once locked, the *file location* doesn't matter. `pub use`
re-exports preserve callers' imports across the eventual physical
move.

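
The seam in 1.2 can be sketched with plain Rust visibility rules. This is an illustration under assumptions, not the real module: `fleet` stands in for `harmony::modules::fleet`, the `server_score` submodule name, the `replicas` field, and the `render` and `plan` functions are all hypothetical. Only `FleetServerScore`, `operator`, `chart`, and `ChartOptions` are names taken from the plan.

```rust
pub mod fleet {
    // Sealed: visible inside the crate, invisible to downstream callers.
    pub(crate) mod operator {
        pub(crate) mod chart {
            /// Hypothetical contents; only the name comes from the plan.
            #[derive(Debug, Clone, Default)]
            pub struct ChartOptions {
                pub replicas: u32,
            }
        }

        /// Placeholder for whatever the operator does with chart options.
        pub fn render(opts: &chart::ChartOptions) -> String {
            format!("replicas={}", opts.replicas)
        }
    }

    // Private module: its file can move later without touching callers.
    mod server_score {
        use super::operator::{self, chart::ChartOptions};

        #[derive(Debug, Clone)]
        pub struct FleetServerScore {
            // The sealed type stays behind a private field.
            options: ChartOptions,
        }

        impl FleetServerScore {
            pub fn new(replicas: u32) -> Self {
                Self {
                    options: ChartOptions { replicas },
                }
            }

            pub fn plan(&self) -> String {
                operator::render(&self.options)
            }
        }
    }

    // The façade: the only path callers are meant to import from.
    pub use server_score::FleetServerScore;
}
```

Callers write `fleet::FleetServerScore` and never name `operator::chart::ChartOptions`, so relocating `server_score` or `operator` only requires keeping the `pub use` line pointed at the new home.
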
### 1.3 Defer the placement ADR

JG isn't satisfied with the design yet. ADR-021 stays in *proposed*
limbo until the seam from 1.2 is committed and we've lived with it
for a sprint.

**Day 1 done when:** fleet façade committed, `TomlShared` and
`ZitadelEnroll` removed from `FleetDeviceAuth`, every existing
caller compiles unchanged, no file moves.

---

## Day 2 — Polish E2E + ship the upgrade ADR

Two streams in parallel.

### Stream A — E2E hardening (~½ day)

- **A.1 Operator graceful degradation on bad device_id.** The CLI
  now rejects bad ids upfront, but a stray bad KV entry shouldn't
  take the operator down. Log + skip, don't restart-loop.
- **A.2 Persist `nats_auth_pass` and the issuer NKey via
  `harmony_secret`.** The regenerate-every-run footgun bit us
  twice on 2026-05-06. Make these `Secret`s the same way `NatsAdmin`
  and `ZitadelAdmin` already are.
- **A.3 Single regression script.** `fleet/scripts/e2e-prod-shape.sh`.
  Full bring-up + enroll + assert against a target cluster. Same
  shape as the existing `smoke-a*.sh`. CI consumes this later.

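
The "log + skip" behaviour from A.1 could look roughly like the sketch below. It assumes the device-id rule is the RFC 1123 *label* form (lowercase alphanumerics and `-`, at most 63 characters, alphanumeric at both ends); the plan only says "RFC1123", so the exact variant is an assumption. Both function names are hypothetical, and logging goes to stderr here purely to keep the example dependency-free.

```rust
/// Assumed rule: RFC 1123 label. 1 to 63 bytes, lowercase ASCII letters,
/// digits or '-', and an alphanumeric character at both ends.
pub fn is_rfc1123_label(id: &str) -> bool {
    let bytes = id.as_bytes();
    let alnum = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit();
    !bytes.is_empty()
        && bytes.len() <= 63
        && alnum(bytes[0])
        && alnum(bytes[bytes.len() - 1])
        && bytes.iter().all(|&b| alnum(b) || b == b'-')
}

/// Splits KV keys into (valid, skipped). A bad key is logged and skipped;
/// it never turns into an error that would restart the operator.
pub fn select_valid_devices<'a>(keys: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut valid = Vec::new();
    let mut skipped = Vec::new();
    for &key in keys {
        if is_rfc1123_label(key) {
            valid.push(key);
        } else {
            eprintln!("skipping KV entry with invalid device_id {key:?}");
            skipped.push(key);
        }
    }
    (valid, skipped)
}
```

Returning the skipped keys alongside the valid ones lets the caller surface them in status or metrics instead of losing them in the log.
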
### Stream B — ADR-022: Agent upgrade procedure (~½ day)

The ADR is the deliverable, not the implementation. Specifies the
mechanism so anyone can implement it later without inventing the
design. See `docs/adr/022-fleet-agent-upgrade.md`.

Summary of the design (full detail in the ADR):

- **K8s rolling-update shape, single-host.** Wait for in-flight
  reconciles to complete + all managed services healthy + a
  scheduling lock from the operator before swapping.
- **Versioned binary layout on disk:**
  ```
  /usr/bin/fleet-agent-v0.1.1
  /usr/bin/fleet-agent-v0.1.2
  /usr/local/bin/fleet-agent → symlink to current
  ```
  No version is ever erased — N-history is the rollback target.
- **Old verifies new + reports up.** Old agent stages new,
  smoke-tests it (`--self-test`), starts it, watches for the new
  agent's heartbeat to land in NATS with the new version. Only then
  does the operator know the upgrade succeeded.
- **Operator drives the cutover.** Operator sends an explicit stop
  signal to the old agent over NATS. Old agent exits cleanly. New
  agent is already running and takes over.
- **Reverse path is identical.** Roll back = operator publishes
  desired_version = previous; new agent does the same dance to
  hand off to old.

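
One way to get the atomic symlink swap over the versioned layout above is the usual create-then-rename pattern: build the new symlink under a temporary name next to the real one, then `rename` it into place, which replaces the old link in a single step on POSIX filesystems. The sketch below is Unix-only and is not taken from ADR-022; the function names and the `.tmp` suffix are assumptions. It never deletes a versioned binary, so the previous version remains available as the rollback target, and rolling back is the same call with the older path.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// Path of a staged binary, following the `fleet-agent-v<version>` layout.
pub fn versioned_binary(dir: &Path, version: &str) -> PathBuf {
    dir.join(format!("fleet-agent-v{version}"))
}

/// Points `link` at `target` without a window where `link` is missing.
/// Works for both the first install and every later swap.
pub fn swap_current(link: &Path, target: &Path) -> io::Result<()> {
    // Refuse to point the link at a binary that was never staged.
    if !target.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            "staged binary missing",
        ));
    }
    // Temporary link in the same directory, so the rename stays on one filesystem.
    let tmp = link.with_extension("tmp");
    // Clear a leftover from an interrupted earlier attempt; absence is fine.
    let _ = fs::remove_file(&tmp);
    symlink(target, &tmp)?;
    // rename replaces an existing `link` atomically; old binaries are untouched.
    fs::rename(&tmp, link)
}
```
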
**Day 2 done when:** A.1–A.3 committed, ADR-022 landed, regression
script green against staging.

---

## Day 3 — Production deploy

**Goal:** customer cluster on v0.1, runbook accurate, signed off.

- **3.1** Tag `v0.1.0` from `master` after `feat/iot-walking-skeleton`
  is merged.
- **3.2** Run `e2e-prod-shape.sh` against the customer's prod OKD
  cluster. Every diff between the script and reality goes back into
  the script — so the script *is* the runbook.
- **3.3** Production-shape doc twin of
  `docs/guides/fleet-staging-install.md`. Deltas only, ~50 lines.
- **3.4** `docs/guides/fleet-device-enrollment.md` — operator-facing
  enrollment runbook. Captures the SSO `--admin-oidc-client-id`
  resolution and the `--device-id` RFC1123 validation we locked in
  on 2026-05-06.
- **3.5** Operational basics: revoke a device, rotate a key, read
  the operator's logs, read NATS. Bullet lists are fine —
  bullet-list-quality docs beat missing docs.

**Day 3 done when:** customer's prod cluster runs real workloads,
the runbook is what we actually used, and we'd hand operations to
someone else.

---

## In parallel — frontend (junior, ~1 week, target Day 5 merge)

Junior owns end-to-end. Spec:

- **F.1** Read-only Leptos SPA. Devices + Deployments + per-device
  drilldown (DeviceInfo + last-heartbeat + agent version).
- **F.2** NATS tail panel. SSE stream of `device-info` and
  `device-state` updates, plain text.
- **F.3** Served by the operator pod itself (one less Deployment).
  SSO via the existing Zitadel device-code app (`harmony-cli`).
- **F.4** **Not** in v0.1: write paths, metrics dashboards,
  fleet-wide rollout views, NATS GUI. None of those.

This validates the platform is observable from outside the
operator's logs — the customer's specific ask.

---

## What slips to v0.2+ (post-prod backlog)

No calendar pressure on these; sequence after we see real customer
usage.

| Item | Why deferred | Cost when we do it |
|---|---|---|
| Pluggable `harmony` CLI (kubectl-style PATH discovery) + `harmony-fleet` plugin | Customer doesn't run it themselves yet; we do. Examples are good enough. | ~1 week, mostly rename/restructure given Day 1's API freeze. |
| Physical refactor of `harmony/modules/fleet/` → `fleet/harmony-fleet/` | The Day-1 façade settles the design; the move is mechanical and the ADR for it is still in draft. | ~2 days. |
| Agent upgrade implementation (ADR ships Day 2; impl later) | First customer fleet is small enough to hand-upgrade if needed. | ~1 week. |
| ArgoCD chart publishing | Customer uses ArgoCD downstream but their initial deploy goes through harmony directly. | ~3 days. |
| Full CI e2e (k3d nightly + libvirt + OKD daily) | Manual rehearsal works for one customer. | ~1 week + runner capacity. |
| OpenBao integration (replaces `ZitadelClientConfig` cache file) | Cache file works for single-operator use; OpenBao is the multi-operator answer. | ~1 week. |
| `harmony run <ScoreName> --field=value` ad-hoc Score CLI | No v0.1 customer flow needs it. | ~2 weeks (Score-flag derive macro is the hard part). |
| Fleet-wide rollout strategies (canary, %-based) on top of the agent-upgrade primitive | Single-device upgrade is sufficient until >100-device fleets. | ~1 week. |
| Drop `K8sAnywhereTopology` for ad-hoc Score execution; introduce `K8sBareTopology` | Per the existing v0_1 §"Principles". Not blocking prod. | ~3 days. |

---

## Principles (kept verbatim from v0_1, still load-bearing)

- **No yaml in framework code paths.** Typed kube-rs everywhere.
- **Scores describe desired state; topologies expose capabilities.**
- **Cross-boundary wire types in `harmony-reconciler-contracts`.**
- **Never ship untested code.**
- **Prove claims about upstream before blaming upstream.**

Adding one for v0.2:

- **Design the brick before moving the brick.** Lock the public API
  contract first; physical relocation later. Cardinality-matched
  types, "make impossible states impossible" — the type system is
  the deterministic feedback loop that scales with LLM-era code
  generation throughput. (See JG's *Pour l'amour des compilateurs*,
  Botpress Meetup, 2026-04-30.)