
Jean-Gabriel Gill-Couture 064fa1da0d docs: v0.2 roadmap + ADR-022 fleet agent upgrade procedure

Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push.
Replaces the open-ended chapter structure of v0_1_plan.md for the
period between the walking-skeleton merge and v0.1.0 in production.
Focus is locking the fleet module's public API surface so the
inevitable physical refactor (out of `harmony/modules/fleet/`,
into `fleet/harmony-fleet/`) is mechanical when we get to it.
Anchored in the principle from JG's *Pour l'amour des compilateurs*
talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure.
K8s rolling-update shape applied to one host: drain in-flight
work, stage versioned binary alongside old, smoke-test, atomic
symlink swap, both agents alive briefly, operator verifies new
agent's heartbeat then sends explicit stop signal to old, old
exits cleanly. No version is ever erased — N-history on disk is
the rollback target. Operator-driven cutover (not self-stopping)
so the most-trusted side decides the handoff. Implementation
deferred to post-v0.1 backlog; spec exists so anyone can build
it without reinventing the design.

ADR README index updated.

2026-05-06 22:51:14 -04:00


Fleet Platform v0.2 — 3-day production push

Authoritative plan for the next three days. Picks up where v0_1_plan.md left the chapter structure and supersedes its forward chapters where they conflict. Written 2026-05-06, end of the feat/iot-walking-skeleton branch (31 996 LOC, 184 commits).

State coming in

Skeleton end-to-end works against an OKD staging cluster: Zitadel + NATS + auth callout + operator + agent (one VM today, real Pi tomorrow). Verified by hand 2026-05-06.
~10 ancillary PRs still open across the team. Branch graph is noisy.
harmony/modules/fleet/ is the wrong long-term home for the fleet code. Flagged in the April 2026 code review. Reasons we kept it there during bring-up are subtle (cross-module dependencies on K8sAnywhereTopology, HelmChartScore, K8sResourceScore, harmony_secret, the Topology capability traits) — those need to be written down before the file move, not after. ADR pending; not started yet.
Agent upgrade path is undefined. Without it we cannot ship a v0.1 agent into the field.
~408 compilation warnings. Not blocking, but the count needs to be 0 before we turn on -Dwarnings in CI.

Strategy

This isn't 10 weeks of scaffolding. It's three days of locking the API surface so the inevitable refactor — moving fleet out of harmony/modules/fleet/ into fleet/harmony-fleet/, splitting K8sAnywhereTopology into K8sBareTopology, etc. — is mechanical when we get to it.

The frame from JG's Pour l'amour des compilateurs talk applies directly: design the brick before moving the brick. Physical relocation is cheap. Redesigning a public API after customers depend on it is expensive. We use these three days to make sure the type-level contract is what we want it to be at v1.0, even if the file paths still smell like v0.1.

Day 1 — Lock the brick design

Goal: a fleet façade stable enough to ship to production and refactor freely afterwards.

1.1 Decompose `FleetDeviceAuth` to resolved states only

Today: TomlShared | ZitadelJwt | ZitadelEnroll. Cardinality 3.

After: ZitadelJwt-shape only. Cardinality 1.

TomlShared: v0 dev cruft, no production caller. Delete.
ZitadelEnroll: pre-resolution state (carries unresolved admin credentials). Doesn't belong in a type that represents "the agent's NATS auth on disk". Move to its own type (DeviceEnrollmentIntent) used only by the enrollment Score + binary. Resolution produces a ZitadelJwt and that's what the agent sees.

The render_toml match on &self.auth collapses to one arm. The "is this resolved yet?" branch class disappears. Test render_toml_zitadel_enroll_renders_same_as_zitadel_jwt becomes unnecessary (the question is undefined; you can't render an unresolved auth).
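A minimal sketch of the split, to make the cardinality argument concrete. The type names come from this plan; the field names, the `/etc/fleet/` path, and the body of `resolve` are placeholders, not the real code (the real resolution talks to Zitadel).

```rust
/// Resolved NATS auth as the agent sees it on disk. One variant, one shape.
/// Field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum FleetDeviceAuth {
    ZitadelJwt { issuer_url: String, jwt_path: String },
}

/// Pre-resolution state. It carries admin credentials, so it lives in its
/// own type and is only handled by the enrollment Score + binary.
#[derive(Debug, Clone)]
pub struct DeviceEnrollmentIntent {
    pub issuer_url: String,
    pub device_id: String,
    pub admin_token: String,
}

impl DeviceEnrollmentIntent {
    /// Placeholder for the real enrollment call. Consumes the intent, so the
    /// admin credentials cannot travel past this point.
    pub fn resolve(self) -> FleetDeviceAuth {
        FleetDeviceAuth::ZitadelJwt {
            issuer_url: self.issuer_url,
            jwt_path: format!("/etc/fleet/{}.jwt", self.device_id),
        }
    }
}

impl FleetDeviceAuth {
    /// The match has a single arm: there is no "not resolved yet" case to render.
    pub fn render_toml(&self) -> String {
        match self {
            FleetDeviceAuth::ZitadelJwt { issuer_url, jwt_path } => format!(
                "[auth]\nissuer_url = \"{}\"\njwt_path = \"{}\"\n",
                issuer_url, jwt_path
            ),
        }
    }
}
```

Because `resolve` takes `self` by value, an unresolved intent cannot be rendered by accident: the only value with a `render_toml` method is the resolved one.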

1.2 Define the `fleet` façade

What does code outside the fleet module see? Today that's a deep walk into harmony::modules::fleet::operator::chart::ChartOptions. Leakage. Lock the seam:

harmony::modules::fleet::
    FleetServerScore         (existing — composed install)
    FleetDeviceEnrollScore   (new — wraps fleet_device_enroll)
    FleetDeviceSetupScore    (existing — keeps API)
    FleetDeviceAuth          (resolved-only, per 1.1)
    AdminAuth                (existing)

    // sealed:
    operator::                pub(crate)
    setup_score's internals   pub(crate)
    chart::                   pub(crate)

Once locked, the file location doesn't matter. pub use re-exports preserve callers' imports across the eventual physical move.
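The seam can be sketched with plain Rust visibility rules. This is an illustration only: the module and type names follow the list above, while the `release_name` field, the `new` constructor, and the `server` submodule are made up for the example.

```rust
pub mod fleet {
    // Sealed: reachable inside the crate, invisible to downstream callers.
    pub(crate) mod operator {
        pub(crate) mod chart {
            #[derive(Debug, Clone)]
            pub(crate) struct ChartOptions {
                pub(crate) release_name: String,
            }
        }
    }

    // Private implementation module; its location can change freely.
    mod server {
        use super::operator::chart::ChartOptions;

        pub struct FleetServerScore {
            // Internal detail; callers never name ChartOptions.
            chart: ChartOptions,
        }

        impl FleetServerScore {
            pub fn new(environment: &str) -> Self {
                Self {
                    chart: ChartOptions {
                        release_name: format!("fleet-operator-{environment}"),
                    },
                }
            }

            pub fn release_name(&self) -> &str {
                &self.chart.release_name
            }
        }
    }

    // The façade: callers import from `fleet::`, wherever the code lives.
    pub use self::server::FleetServerScore;
}
```

A caller writes `fleet::FleetServerScore::new("staging")` and never sees `operator::chart`. Moving `server` to another file or crate later only requires keeping the `pub use` line pointing at the new location.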

1.3 Defer the placement ADR

JG isn't satisfied with the design yet. ADR-021 stays in proposed limbo until the seam from 1.2 is committed and we've lived with it for a sprint.

Day 1 done when: fleet façade committed, TomlShared and ZitadelEnroll removed from FleetDeviceAuth, every existing caller compiles unchanged, no file moves.

Day 2 — Polish E2E + ship the upgrade ADR

Two streams in parallel.

Stream A — E2E hardening (~½ day)

A.1 Operator graceful degradation on bad device_id. The CLI now rejects bad ids upfront, but a stray bad KV entry shouldn't take the operator down. Log + skip, don't restart-loop.
A.2 Persist nats_auth_pass and the issuer NKey via harmony_secret. The regenerate-every-run footgun bit us twice on 2026-05-06. Make these Secrets the same way NatsAdmin and ZitadelAdmin already are.
A.3 Single regression script. fleet/scripts/e2e-prod-shape.sh. Full bring-up + enroll + assert against a target cluster. Same shape as the existing smoke-a*.sh. CI consumes this later.
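For A.1, the log + skip behaviour might look like the sketch below. It assumes device ids are validated as RFC 1123 DNS labels (the plan only says "RFC1123", so the label form is an assumption), and both function names are hypothetical.

```rust
/// RFC 1123 DNS label: 1 to 63 characters, lowercase ASCII letters, digits
/// or '-', starting and ending with a letter or digit.
pub fn is_rfc1123_label(id: &str) -> bool {
    let bytes = id.as_bytes();
    let alnum = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit();
    !bytes.is_empty()
        && bytes.len() <= 63
        && alnum(bytes[0])
        && alnum(bytes[bytes.len() - 1])
        && bytes.iter().all(|&b| alnum(b) || b == b'-')
}

/// Splits KV keys into device ids to reconcile and keys to skip.
/// A bad key is logged and skipped; it never aborts the whole pass.
pub fn partition_device_ids<'a>(keys: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut valid = Vec::new();
    let mut skipped = Vec::new();
    for &key in keys {
        if is_rfc1123_label(key) {
            valid.push(key);
        } else {
            eprintln!("skipping KV entry with invalid device_id: {key:?}");
            skipped.push(key);
        }
    }
    (valid, skipped)
}
```

The point is that validation failure becomes data (the `skipped` list) rather than an error that propagates up and restarts the operator.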

Stream B — ADR-022: Agent upgrade procedure (~½ day)

The ADR is the deliverable, not the implementation. Specifies the mechanism so anyone can implement it later without inventing the design. See docs/adr/022-fleet-agent-upgrade.md.

Summary of the design (full detail in the ADR):

K8s rolling-update shape, single-host. Wait for in-flight reconciles to complete + all managed services healthy + a scheduling lock from the operator before swapping.

Versioned binary layout on disk:

/usr/bin/fleet-agent-v0.1.1
/usr/bin/fleet-agent-v0.1.2
/usr/local/bin/fleet-agent  → symlink to current

No version is ever erased — N-history is the rollback target.

Old verifies new + reports up. Old agent stages new, smoke-tests it (--self-test), starts it, watches for the new agent's heartbeat to land in NATS with the new version. Only then does the operator know the upgrade succeeded.
Operator drives the cutover. Operator sends an explicit stop signal to the old agent over NATS. Old agent exits cleanly. New agent is already running and takes over.
Reverse path is identical. Roll back = operator publishes desired_version = previous; new agent does the same dance to hand off to old.
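The symlink step in the layout above is commonly done as "create a temporary link, then rename it over the old one". A Unix-only sketch using only the standard library follows; `swap_symlink` and the `.swap-tmp` suffix are hypothetical names, and the ADR remains the authority on the actual procedure.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Points `link` at `target` with no window in which `link` is missing:
/// build a temporary symlink beside it, then rename it over the old one.
/// rename(2) replaces the destination atomically on POSIX filesystems.
pub fn swap_symlink(target: &Path, link: &Path) -> io::Result<()> {
    let tmp = link.with_extension("swap-tmp");
    // Clear a stale temp link left behind by an interrupted earlier attempt.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    symlink(target, &tmp)?;
    fs::rename(&tmp, link)
}
```

Rollback is the same call with the previous versioned binary as `target`, which is why keeping every version on disk is enough to make the reverse path identical.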

Day 2 done when: A.1–A.3 committed, ADR-022 landed, regression script green against staging.

Day 3 — Production deploy

Goal: customer cluster on v0.1, runbook accurate, signed off.

3.1 Tag v0.1.0 from master after feat/iot-walking-skeleton is merged.
3.2 Run e2e-prod-shape.sh against the customer's prod OKD cluster. Every difference between what the script does and what production actually needs goes back into the script, so the script is the runbook.
3.3 Production-shape doc twin of docs/guides/fleet-staging-install.md. Deltas only, ~50 lines.
3.4 docs/guides/fleet-device-enrollment.md — operator-facing enrollment runbook. Captures the SSO --admin-oidc-client-id resolution and the --device-id RFC1123 validation we locked in on 2026-05-06.
3.5 Operational basics: revoke a device, rotate a key, read the operator's logs, read NATS. Bullet lists are fine; bullet-list-quality docs beat missing docs.

Day 3 done when: customer's prod cluster runs real workloads, the runbook is what we actually used, and we'd hand operations to someone else.

In parallel — frontend (junior, ~1 week, target Day 5 merge)

Junior owns end-to-end. Spec:

F.1 Read-only Leptos SPA. Devices + Deployments + per-device drilldown (DeviceInfo + last-heartbeat + agent version).
F.2 NATS tail panel. SSE stream of device-info and device-state updates, plain text.
F.3 Served by the operator pod itself (one less Deployment). SSO via the existing Zitadel device-code app (harmony-cli).
F.4 Not in v0.1: write paths, metrics dashboards, fleet-wide rollout views, NATS GUI. None of those.
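For F.2, the plain-text SSE framing is small enough to sketch without a web framework. The function name is hypothetical and the wiring to NATS and to the HTTP response is left out; only the wire format (an `event:` field, one `data:` field per line, a blank line to end the frame) is shown.

```rust
/// Formats one Server-Sent Events frame. Each line of `data` gets its own
/// `data:` field, and a blank line terminates the frame.
/// Assumes `event` is a fixed identifier with no newline in it.
pub fn sse_frame(event: &str, data: &str) -> String {
    let mut frame = format!("event: {event}\n");
    for line in data.lines() {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    frame.push('\n');
    frame
}
```

Each NATS message on the device-info or device-state subjects would map to one frame, written to a response served with the `text/event-stream` content type.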

This validates the platform is observable from outside the operator's logs — the customer's specific ask.

What slips to v0.2+ (post-prod backlog)

No calendar pressure on these; sequence after we see real customer usage.

| Item | Why deferred | Cost when we do it |
| --- | --- | --- |
| Pluggable `harmony` CLI (kubectl-style PATH discovery) + `harmony-fleet` plugin | Customer doesn't run it themselves yet; we do. Examples are good enough. | ~1 week, mostly rename/restructure given Day 1's API freeze. |
| Physical refactor of `harmony/modules/fleet/` → `fleet/harmony-fleet/` | The Day-1 façade settles the design; the move is mechanical and the ADR for it is still in draft. | ~2 days. |
| Agent upgrade implementation (ADR ships Day 2; impl later) | First customer fleet is small enough to hand-upgrade if needed. | ~1 week. |
| ArgoCD chart publishing | Customer uses ArgoCD downstream but their initial deploy goes through harmony directly. | ~3 days. |
| Full CI e2e (k3d nightly + libvirt + OKD daily) | Manual rehearsal works for one customer. | ~1 week + runner capacity. |
| OpenBao integration (replaces `ZitadelClientConfig` cache file) | Cache file works for single-operator use; OpenBao is the multi-operator answer. | ~1 week. |
| `harmony run <ScoreName> --field=value` ad-hoc Score CLI | No v0.1 customer flow needs it. | ~2 weeks (Score-flag derive macro is the hard part). |
| Fleet-wide rollout strategies (canary, %-based) on top of the agent-upgrade primitive | Single-device upgrade is sufficient until >100-device fleets. | ~1 week. |
| Drop `K8sAnywhereTopology` for ad-hoc Score execution; introduce `K8sBareTopology` | Per the existing v0_1 §"Principles". Not blocking prod. | ~3 days. |

Principles (kept verbatim from v0_1, still load-bearing)

No yaml in framework code paths. Typed kube-rs everywhere.
Scores describe desired state; topologies expose capabilities.
Cross-boundary wire types in harmony-reconciler-contracts.
Never ship untested code.
Prove claims about upstream before blaming upstream.

Adding one for v0.2:

Design the brick before moving the brick. Lock the public API contract first; physical relocation later. Cardinality-matched types, "make impossible states impossible" — the type system is the deterministic feedback loop that scales with LLM-era code generation throughput. (See JG's Pour l'amour des compilateurs, Botpress Meetup, 2026-04-30.)
