harmony/ROADMAP/fleet_platform/v0_2_plan.md
Jean-Gabriel Gill-Couture 064fa1da0d docs: v0.2 roadmap + ADR-022 fleet agent upgrade procedure
Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push.
Replaces the open-ended chapter structure of v0_1_plan.md for the
period between the walking-skeleton merge and v0.1.0 in production.
Focus is locking the fleet module's public API surface so the
inevitable physical refactor (out of `harmony/modules/fleet/`,
into `fleet/harmony-fleet/`) is mechanical when we get to it.
Anchored in the principle from JG's *Pour l'amour des compilateurs*
talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure.
K8s rolling-update shape applied to one host: drain in-flight
work, stage versioned binary alongside old, smoke-test, atomic
symlink swap, both agents alive briefly, operator verifies new
agent's heartbeat then sends explicit stop signal to old, old
exits cleanly. No version is ever erased — N-history on disk is
the rollback target. Operator-driven cutover (not self-stopping)
so the most-trusted side decides the handoff. Implementation
deferred to post-v0.1 backlog; spec exists so anyone can build
it without reinventing the design.

ADR README index updated.
2026-05-06 22:51:14 -04:00


# Fleet Platform v0.2 — 3-day production push
Authoritative plan for the next three days. Picks up where
`v0_1_plan.md` left the chapter structure and supersedes its forward
chapters where they conflict. Written 2026-05-06, end of the
`feat/iot-walking-skeleton` branch (31 996 LOC, 184 commits).
## State coming in
- Skeleton end-to-end works against an OKD staging cluster: Zitadel
+ NATS + auth callout + operator + agent (one VM today, real Pi
tomorrow). Verified by hand 2026-05-06.
- ~10 ancillary PRs still open across the team. Branch graph is
noisy.
- `harmony/modules/fleet/` is the wrong long-term home for the fleet
code. Flagged in the April 2026 code review. Reasons we kept it
there during bring-up are subtle (cross-module dependencies on
`K8sAnywhereTopology`, `HelmChartScore`, `K8sResourceScore`,
`harmony_secret`, the `Topology` capability traits) — those need
to be written down before the file move, not after. **ADR
pending; not started yet.**
- Agent upgrade path is undefined. Without it we cannot ship a
v0.1 agent into the field.
- ~408 compilation warnings. Not blocking, but the count must reach
0 before we put `-Dwarnings` in CI.
## Strategy
This isn't 10 weeks of scaffolding. It's three days of locking the
**API surface** so the inevitable refactor — moving fleet out of
`harmony/modules/fleet/` into `fleet/harmony-fleet/`, splitting
`K8sAnywhereTopology` into `K8sBareTopology`, etc. — is mechanical
when we get to it.
The frame from JG's *Pour l'amour des compilateurs* talk applies
directly: **design the brick before moving the brick.** Physical
relocation is cheap. Redesigning a public API after customers
depend on it is expensive. We use these three days to make sure
the type-level contract is what we want it to be at v1.0, even if
the file paths still smell like v0.1.
## Day 1 — Lock the brick design
**Goal:** a fleet façade stable enough to ship to production and
refactor freely afterwards.
### 1.1 Decompose `FleetDeviceAuth` to *resolved states only*
Today: `TomlShared | ZitadelJwt | ZitadelEnroll`. Cardinality 3.
After: `ZitadelJwt`-shape only. Cardinality 1.
- `TomlShared` — v0 dev cruft, no production caller. Delete.
- `ZitadelEnroll` is a *pre-resolution* state (carries unresolved
admin credentials). Doesn't belong in a type that represents
"the agent's NATS auth on disk". Move to its own type
(`DeviceEnrollmentIntent`) used only by the enrollment Score
+ binary. Resolution produces a `ZitadelJwt` and that's what
the agent sees.
The `render_toml` match on `&self.auth` collapses to one arm. The
"is this resolved yet?" branch class disappears. Test
`render_toml_zitadel_enroll_renders_same_as_zitadel_jwt` becomes
unnecessary (the question is undefined; you can't render an
unresolved auth).
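The split described in 1.1 can be sketched roughly as follows. Only the names `FleetDeviceAuth`, `ZitadelJwt`, `DeviceEnrollmentIntent` and `render_toml` come from this plan; the field names, the `resolve` method, the `mint_jwt` callback and the TOML layout are hypothetical placeholders, not the real module's API.

```rust
/// Pre-resolution state, used only by the enrollment Score and binary.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceEnrollmentIntent {
    pub device_id: String,
    /// Reference to still-unresolved admin credentials (assumed shape).
    pub admin_credential_ref: String,
}

/// Resolved state only: the agent's NATS auth as it exists on disk.
/// Cardinality 1, so there is no "is this resolved yet?" branch.
#[derive(Debug, Clone, PartialEq)]
pub enum FleetDeviceAuth {
    ZitadelJwt { issuer_url: String, jwt: String },
}

impl DeviceEnrollmentIntent {
    /// Consumes the intent and yields the resolved auth.
    /// `mint_jwt` stands in for whatever call actually issues the token.
    pub fn resolve<F>(self, issuer_url: &str, mint_jwt: F) -> Result<FleetDeviceAuth, String>
    where
        F: FnOnce(&str, &str) -> Result<String, String>,
    {
        let jwt = mint_jwt(&self.device_id, &self.admin_credential_ref)?;
        Ok(FleetDeviceAuth::ZitadelJwt {
            issuer_url: issuer_url.to_string(),
            jwt,
        })
    }
}

impl FleetDeviceAuth {
    /// The match has exactly one arm. The TOML keys are illustrative and
    /// values are written without escaping.
    pub fn render_toml(&self) -> String {
        match self {
            FleetDeviceAuth::ZitadelJwt { issuer_url, jwt } => format!(
                "[auth]\nissuer_url = \"{}\"\njwt = \"{}\"\n",
                issuer_url, jwt
            ),
        }
    }
}
```

Because `render_toml` exists only on the resolved type, rendering an unresolved intent is a compile error rather than a runtime branch.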
### 1.2 Define the `fleet` façade
What does code outside the fleet module see? Today that's a deep
walk into `harmony::modules::fleet::operator::chart::ChartOptions`.
Leakage. Lock the seam:
```text
harmony::modules::fleet::
FleetServerScore (existing — composed install)
FleetDeviceEnrollScore (new — wraps fleet_device_enroll)
FleetDeviceSetupScore (existing — keeps API)
FleetDeviceAuth (resolved-only, per 1.1)
AdminAuth (existing)
// sealed:
operator:: pub(crate)
setup_score's internals pub(crate)
chart:: pub(crate)
```
Once locked, the *file location* doesn't matter. `pub use`
re-exports preserve callers' imports across the eventual physical
move.
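A minimal sketch of how such a seam might look in Rust, assuming a single crate. The module names `operator` and `chart` and the type names `ChartOptions` and `FleetServerScore` come from this plan; the private `server` module, the struct fields and the methods are placeholders invented for illustration.

```rust
pub mod fleet {
    // Sealed internals: nameable inside the crate, invisible outside it.
    pub(crate) mod operator {
        pub(crate) mod chart {
            #[derive(Debug, Default)]
            pub(crate) struct ChartOptions {
                pub(crate) release_name: String,
            }
        }
    }

    // Hypothetical implementation module. It is private, so callers can
    // never depend on this path, only on the re-export below.
    mod server {
        use super::operator::chart::ChartOptions;

        #[derive(Debug)]
        pub struct FleetServerScore {
            // Private field: the crate-private type never leaks.
            chart: ChartOptions,
        }

        impl FleetServerScore {
            pub fn new(release_name: &str) -> Self {
                Self {
                    chart: ChartOptions {
                        release_name: release_name.to_string(),
                    },
                }
            }

            pub fn release_name(&self) -> &str {
                &self.chart.release_name
            }
        }
    }

    // The façade: the only path callers are meant to import from.
    pub use server::FleetServerScore;
}
```

Callers write `fleet::FleetServerScore`. If `server` later moves to another file or crate, updating the single `pub use` line keeps that import path valid.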
### 1.3 Defer the placement ADR
JG isn't satisfied with the design yet. ADR-021 stays in *proposed*
limbo until the seam from 1.2 is committed and we've lived with it
for a sprint.
**Day 1 done when:** fleet façade committed, `TomlShared` and
`ZitadelEnroll` removed from `FleetDeviceAuth`, every existing
caller compiles unchanged, no file moves.
---
## Day 2 — Polish E2E + ship the upgrade ADR
Two streams in parallel.
### Stream A — E2E hardening (~½ day)
- **A.1 Operator graceful degradation on bad device_id.** The CLI
now rejects bad ids upfront, but a stray bad KV entry shouldn't
take the operator down. Log + skip, don't restart-loop.
- **A.2 Persist `nats_auth_pass` and the issuer NKey via
`harmony_secret`.** The regenerate-every-run footgun bit us
twice on 2026-05-06. Make these `Secret`s the same way `NatsAdmin`
and `ZitadelAdmin` already are.
- **A.3 Single regression script.** `fleet/scripts/e2e-prod-shape.sh`.
Full bring-up + enroll + assert against a target cluster. Same
shape as the existing `smoke-a*.sh`. CI consumes this later.
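The "log + skip" behaviour from A.1 could look roughly like the sketch below. It assumes device ids are checked as RFC 1123 labels in the lowercase form Kubernetes uses (1 to 63 characters, lowercase alphanumerics or `-`, alphanumeric at both ends) and that KV entries arrive as plain key/value string pairs. The function names are hypothetical and no real NATS KV API is used.

```rust
fn is_lower_alnum(b: u8) -> bool {
    b.is_ascii_lowercase() || b.is_ascii_digit()
}

/// RFC 1123 label check, lowercase variant (assumed to match the CLI rule).
pub fn is_rfc1123_label(s: &str) -> bool {
    let bytes = s.as_bytes();
    !bytes.is_empty()
        && bytes.len() <= 63
        && bytes.first().map_or(false, |b| is_lower_alnum(*b))
        && bytes.last().map_or(false, |b| is_lower_alnum(*b))
        && bytes.iter().all(|b| is_lower_alnum(*b) || *b == b'-')
}

/// Splits raw KV entries into usable entries and skipped device ids.
/// A bad key is logged and skipped; it never becomes an error that
/// would make the caller exit and restart.
pub fn partition_entries<'a>(
    entries: &[(&'a str, &'a str)],
) -> (Vec<(&'a str, &'a str)>, Vec<&'a str>) {
    let mut accepted = Vec::new();
    let mut skipped = Vec::new();
    for &(device_id, value) in entries {
        if is_rfc1123_label(device_id) {
            accepted.push((device_id, value));
        } else {
            eprintln!("skipping KV entry with invalid device_id {:?}", device_id);
            skipped.push(device_id);
        }
    }
    (accepted, skipped)
}
```

Returning the skipped ids alongside the accepted entries lets the caller surface them in status or metrics without treating them as fatal.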
### Stream B — ADR-022: Agent upgrade procedure (~½ day)
The ADR is the deliverable, not the implementation. Specifies the
mechanism so anyone can implement it later without inventing the
design. See `docs/adr/022-fleet-agent-upgrade.md`.
Summary of the design (full detail in the ADR):
- **K8s rolling-update shape, single-host.** Before swapping, wait
for three things: in-flight reconciles complete, all managed
services healthy, and a scheduling lock from the operator.
- **Versioned binary layout on disk:**
```
/usr/bin/fleet-agent-v0.1.1
/usr/bin/fleet-agent-v0.1.2
/usr/local/bin/fleet-agent → symlink to current
```
No version is ever erased — N-history is the rollback target.
- **Old verifies new + reports up.** Old agent stages new,
smoke-tests it (`--self-test`), starts it, watches for the new
agent's heartbeat to land in NATS with the new version. Only then
does the operator know the upgrade succeeded.
- **Operator drives the cutover.** Operator sends an explicit stop
signal to the old agent over NATS. Old agent exits cleanly. New
agent is already running and takes over.
- **Reverse path is identical.** To roll back, the operator publishes
`desired_version` = previous; the new agent does the same dance to
hand off to the old one.
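The symlink swap in the versioned layout above can be made atomic on Unix by creating the new link under a temporary name and renaming it over the live one. This is a minimal sketch of that general technique, not the ADR's implementation; the function name and the temporary-file naming are assumptions.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Points `link` at `target` atomically (Unix only).
/// The new symlink is built under a temporary name in the same directory,
/// then renamed over `link`; rename replaces the destination in one step,
/// so readers see either the old target or the new one, never a gap.
pub fn swap_symlink(target: &Path, link: &Path) -> io::Result<()> {
    let dir = link.parent().unwrap_or_else(|| Path::new("."));
    let name = link
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("link");
    // Hypothetical temp name; it must live on the same filesystem as `link`.
    let tmp = dir.join(format!(".{}.tmp-{}", name, std::process::id()));

    // Clear a leftover from an earlier interrupted attempt, if any.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    symlink(target, &tmp)?;
    fs::rename(&tmp, link)
}
```

Rollback is the same call with the previous versioned binary as `target`, which is why no version is deleted from disk.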
**Day 2 done when:** A.1–A.3 committed, ADR-022 landed, regression
script green against staging.
---
## Day 3 — Production deploy
**Goal:** customer cluster on v0.1, runbook accurate, signed off.
- **3.1** Tag `v0.1.0` from `master` after `feat/iot-walking-skeleton`
is merged.
- **3.2** Run `e2e-prod-shape.sh` against the customer's prod OKD
cluster. Every diff between scripted and reality goes back into
the script — so the script *is* the runbook.
- **3.3** Production-shape doc twin of
`docs/guides/fleet-staging-install.md`. Deltas only, ~50 lines.
- **3.4** `docs/guides/fleet-device-enrollment.md` — operator-facing
enrollment runbook. Captures the SSO `--admin-oidc-client-id`
resolution and the `--device-id` RFC1123 validation we locked in
on 2026-05-06.
- **3.5** Operational basics: revoke a device, rotate a key, read
the operator's logs, read NATS. Bullet lists are fine;
bullet-list-quality docs beat missing docs.
**Day 3 done when:** customer's prod cluster runs real workloads,
the runbook is what we actually used, and we'd hand operations to
someone else.
---
## In parallel — frontend (junior, ~1 week, target Day 5 merge)
Junior owns end-to-end. Spec:
- **F.1** Read-only Leptos SPA. Devices + Deployments + per-device
drilldown (DeviceInfo + last-heartbeat + agent version).
- **F.2** NATS tail panel. SSE stream of `device-info` and
`device-state` updates, plain text.
- **F.3** Served by the operator pod itself (one less Deployment).
SSO via the existing Zitadel device-code app (`harmony-cli`).
- **F.4** **Not** in v0.1: write paths, metrics dashboards,
fleet-wide rollout views, NATS GUI. None of those.
This validates the platform is observable from outside the
operator's logs — the customer's specific ask.
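For the F.2 tail panel, the wire format of a Server-Sent Events stream is simple enough to sketch without a web framework. The helper below is hypothetical: it assumes each NATS update is forwarded as one plain-text SSE frame whose event name is the stream name (`device-info` or `device-state`) and that the event name itself contains no newlines.

```rust
/// Formats one update as a Server-Sent Events frame.
/// `event:` carries the stream name, every payload line gets its own
/// `data:` field, and a blank line terminates the frame.
pub fn sse_frame(event: &str, payload: &str) -> String {
    let mut frame = format!("event: {}\n", event);
    for line in payload.lines() {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    frame.push('\n');
    frame
}
```

A browser `EventSource` joins the `data:` lines of one frame back together with newlines, so multi-line payloads survive the trip as plain text.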
---
## What slips to v0.2+ (post-prod backlog)
No calendar pressure on these; sequence after we see real customer
usage.
| Item | Why deferred | Cost when we do it |
|---|---|---|
| Pluggable `harmony` CLI (kubectl-style PATH discovery) + `harmony-fleet` plugin | Customer doesn't run it themselves yet; we do. Examples are good enough. | ~1 week, mostly rename/restructure given Day 1's API freeze. |
| Physical refactor of `harmony/modules/fleet/` → `fleet/harmony-fleet/` | The Day-1 façade settles the design; the move is mechanical and the ADR for it is still in draft. | ~2 days. |
| Agent upgrade implementation (ADR ships Day 2; impl later) | First customer fleet is small enough to hand-upgrade if needed. | ~1 week. |
| ArgoCD chart publishing | Customer uses ArgoCD downstream but their initial deploy goes through harmony directly. | ~3 days. |
| Full CI e2e (k3d nightly + libvirt + OKD daily) | Manual rehearsal works for one customer. | ~1 week + runner capacity. |
| OpenBao integration (replaces `ZitadelClientConfig` cache file) | Cache file works for single-operator use; OpenBao is the multi-operator answer. | ~1 week. |
| `harmony run <ScoreName> --field=value` ad-hoc Score CLI | No v0.1 customer flow needs it. | ~2 weeks (Score-flag derive macro is the hard part). |
| Fleet-wide rollout strategies (canary, %-based) on top of the agent-upgrade primitive | Single-device upgrade is sufficient until >100-device fleets. | ~1 week. |
| Drop `K8sAnywhereTopology` for ad-hoc Score execution; introduce `K8sBareTopology` | Per the existing v0_1 §"Principles". Not blocking prod. | ~3 days. |
---
## Principles (kept verbatim from v0_1, still load-bearing)
- **No yaml in framework code paths.** Typed kube-rs everywhere.
- **Scores describe desired state; topologies expose capabilities.**
- **Cross-boundary wire types in `harmony-reconciler-contracts`.**
- **Never ship untested code.**
- **Prove claims about upstream before blaming upstream.**
Adding one for v0.2:
- **Design the brick before moving the brick.** Lock the public API
contract first; physical relocation later. Cardinality-matched
types, "make impossible states impossible" — the type system is
the deterministic feedback loop that scales with LLM-era code
generation throughput. (See JG's *Pour l'amour des compilateurs*,
Botpress Meetup, 2026-04-30.)