harmony/ROADMAP/fleet_platform/v0_2_plan.md
Jean-Gabriel Gill-Couture 064fa1da0d docs: v0.2 roadmap + ADR-022 fleet agent upgrade procedure
Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push.
Replaces the open-ended chapter structure of v0_1_plan.md for the
period between the walking-skeleton merge and v0.1.0 in production.
Focus is locking the fleet module's public API surface so the
inevitable physical refactor (out of `harmony/modules/fleet/`,
into `fleet/harmony-fleet/`) is mechanical when we get to it.
Anchored in the principle from JG's *Pour l'amour des compilateurs*
talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure.
K8s rolling-update shape applied to one host: drain in-flight
work, stage versioned binary alongside old, smoke-test, atomic
symlink swap, both agents alive briefly, operator verifies new
agent's heartbeat then sends explicit stop signal to old, old
exits cleanly. No version is ever erased — N-history on disk is
the rollback target. Operator-driven cutover (not self-stopping)
so the most-trusted side decides the handoff. Implementation
deferred to post-v0.1 backlog; spec exists so anyone can build
it without reinventing the design.

ADR README index updated.
2026-05-06 22:51:14 -04:00


Fleet Platform v0.2 — 3-day production push

Authoritative plan for the next three days. Picks up where v0_1_plan.md left the chapter structure and supersedes its forward chapters where they conflict. Written 2026-05-06, end of the feat/iot-walking-skeleton branch (31 996 LOC, 184 commits).

State coming in

  • Skeleton end-to-end works against an OKD staging cluster: Zitadel + NATS + auth callout + operator + agent (one VM today, real Pi tomorrow). Verified by hand 2026-05-06.
  • ~10 ancillary PRs still open across the team. Branch graph is noisy.
  • harmony/modules/fleet/ is the wrong long-term home for the fleet code. Flagged in the April 2026 code review. Reasons we kept it there during bring-up are subtle (cross-module dependencies on K8sAnywhereTopology, HelmChartScore, K8sResourceScore, harmony_secret, the Topology capability traits) — those need to be written down before the file move, not after. ADR pending; not started yet.
  • Agent upgrade path is undefined. Without it we cannot ship a v0.1 agent into the field.
  • ~408 compilation warnings. Not blocking but needs to be 0 before we put -Dwarnings in CI.

Strategy

This isn't 10 weeks of scaffolding. It's three days of locking the API surface so the inevitable refactor — moving fleet out of harmony/modules/fleet/ into fleet/harmony-fleet/, splitting K8sAnywhereTopology into K8sBareTopology, etc. — is mechanical when we get to it.

The frame from JG's Pour l'amour des compilateurs talk applies directly: design the brick before moving the brick. Physical relocation is cheap. Redesigning a public API after customers depend on it is expensive. We use these three days to make sure the type-level contract is what we want it to be at v1.0, even if the file paths still smell like v0.1.

Day 1 — Lock the brick design

Goal: a fleet façade stable enough to ship to production and refactor freely afterwards.

1.1 Decompose FleetDeviceAuth to resolved states only

Today: TomlShared | ZitadelJwt | ZitadelEnroll. Cardinality 3.

After: ZitadelJwt-shape only. Cardinality 1.

  • TomlShared — v0 dev cruft, no production caller. Delete.
  • ZitadelEnroll is a pre-resolution state (carries unresolved admin credentials). Doesn't belong in a type that represents "the agent's NATS auth on disk". Move to its own type (DeviceEnrollmentIntent) used only by the enrollment Score + binary. Resolution produces a ZitadelJwt and that's what the agent sees.

The render_toml match on &self.auth collapses to one arm. The "is this resolved yet?" branch class disappears. Test render_toml_zitadel_enroll_renders_same_as_zitadel_jwt becomes unnecessary (the question is undefined; you can't render an unresolved auth).
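A minimal sketch of the shape 1.1 is after. The field names (issuer_url, jwt, admin_client_id, admin_client_secret) and the mint_jwt closure standing in for the Zitadel exchange are assumptions for illustration, not the real harmony definitions.

```rust
// Hypothetical sketch: field names and the resolver closure are assumptions,
// not the actual harmony types.

/// Resolved NATS auth as the agent sees it on disk. Only one shape exists.
#[derive(Debug, Clone, PartialEq)]
pub struct FleetDeviceAuth {
    pub issuer_url: String,
    pub jwt: String,
}

impl FleetDeviceAuth {
    /// With a single shape there is nothing left to match on.
    /// (No TOML escaping here; illustration only.)
    pub fn render_toml(&self) -> String {
        format!(
            "[auth]\nissuer_url = \"{}\"\njwt = \"{}\"\n",
            self.issuer_url, self.jwt
        )
    }
}

/// Pre-resolution state, used only by the enrollment Score + binary.
#[derive(Debug, Clone)]
pub struct DeviceEnrollmentIntent {
    pub issuer_url: String,
    pub admin_client_id: String,
    pub admin_client_secret: String,
}

impl DeviceEnrollmentIntent {
    /// Consumes the intent, so unresolved admin credentials cannot
    /// outlive resolution. `mint_jwt` stands in for the real exchange.
    pub fn resolve<F>(self, mint_jwt: F) -> Result<FleetDeviceAuth, String>
    where
        F: FnOnce(&str, &str) -> Result<String, String>,
    {
        let jwt = mint_jwt(&self.admin_client_id, &self.admin_client_secret)?;
        Ok(FleetDeviceAuth {
            issuer_url: self.issuer_url,
            jwt,
        })
    }
}
```

Because resolve takes self by value, "render an unresolved auth" is not a question the compiler lets anyone ask: DeviceEnrollmentIntent has no render_toml at all.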

1.2 Define the fleet façade

What does code outside the fleet module see? Today that's a deep walk into harmony::modules::fleet::operator::chart::ChartOptions. Leakage. Lock the seam:

harmony::modules::fleet::
    FleetServerScore         (existing — composed install)
    FleetDeviceEnrollScore   (new — wraps fleet_device_enroll)
    FleetDeviceSetupScore    (existing — keeps API)
    FleetDeviceAuth          (resolved-only, per 1.1)
    AdminAuth                (existing)

    // sealed:
    operator::                pub(crate)
    setup_score's internals   pub(crate)
    chart::                   pub(crate)

Once locked, the file location doesn't matter. pub use re-exports preserve callers' imports across the eventual physical move.
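The sealing can be sketched in a few lines of Rust. Module bodies below are placeholders and the with_replicas / replicas methods are invented for the example; only the visibility pattern (pub(crate) internals, pub use at the façade) is the point.

```rust
// Sketch only: module bodies are placeholders, not the real fleet code.
pub mod fleet {
    // Sealed internals: visible inside the crate, invisible to callers.
    pub(crate) mod operator {
        pub(crate) mod chart {
            #[derive(Debug, Default)]
            pub struct ChartOptions {
                pub replicas: u32,
            }
        }
    }

    // Private module: its location can change without callers noticing.
    mod server {
        use super::operator::chart::ChartOptions;

        /// Public type; how it builds the chart stays private.
        #[derive(Debug, Default)]
        pub struct FleetServerScore {
            chart: ChartOptions,
        }

        impl FleetServerScore {
            // Hypothetical constructor, for illustration.
            pub fn with_replicas(replicas: u32) -> Self {
                Self {
                    chart: ChartOptions { replicas },
                }
            }

            pub fn replicas(&self) -> u32 {
                self.chart.replicas
            }
        }
    }

    // The façade: the only path callers import from.
    pub use server::FleetServerScore;
}
```

Callers write fleet::FleetServerScore and never name operator::chart. Moving server to another file or crate later only changes the pub use line.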

1.3 Defer the placement ADR

JG isn't satisfied with the design yet. ADR-021 stays in proposed limbo until the seam from 1.2 is committed and we've lived with it for a sprint.

Day 1 done when: fleet façade committed, TomlShared and ZitadelEnroll removed from FleetDeviceAuth, every existing caller compiles unchanged, no file moves.


Day 2 — Polish E2E + ship the upgrade ADR

Two streams in parallel.

Stream A — E2E hardening (~½ day)

  • A.1 Operator graceful degradation on bad device_id. The CLI now rejects bad ids upfront, but a stray bad KV entry shouldn't take the operator down. Log + skip, don't restart-loop.
  • A.2 Persist nats_auth_pass and the issuer NKey via harmony_secret. The regenerate-every-run footgun bit us twice on 2026-05-06. Make these Secrets the same way NatsAdmin and ZitadelAdmin already are.
  • A.3 Single regression script. fleet/scripts/e2e-prod-shape.sh. Full bring-up + enroll + assert against a target cluster. Same shape as the existing smoke-a*.sh. CI consumes this later.
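A rough sketch of the A.1 "log + skip" behaviour. It assumes device ids are validated as RFC 1123 DNS labels (the plan does not say whether the label or subdomain form applies), and partition_device_ids is a hypothetical helper, not an existing operator function.

```rust
/// RFC 1123 DNS label: 1 to 63 chars, lowercase ASCII alphanumerics or '-',
/// starting and ending with an alphanumeric.
/// Assumption: the label form is the one the CLI enforces.
pub fn is_rfc1123_label(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.is_empty() || bytes.len() > 63 {
        return false;
    }
    let alnum = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit();
    alnum(bytes[0])
        && alnum(bytes[bytes.len() - 1])
        && bytes.iter().all(|&b| alnum(b) || b == b'-')
}

/// Splits KV keys into usable device ids and rejected ones.
/// A bad key is logged and skipped; it never aborts the pass.
pub fn partition_device_ids<'a>(keys: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut ok = Vec::new();
    let mut skipped = Vec::new();
    for &key in keys {
        if is_rfc1123_label(key) {
            ok.push(key);
        } else {
            eprintln!("skipping KV entry with invalid device_id: {:?}", key);
            skipped.push(key);
        }
    }
    (ok, skipped)
}
```

The reconcile loop would then iterate only over the first vector, so one stray entry costs a log line instead of a restart loop.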

Stream B — ADR-022: Agent upgrade procedure (~½ day)

The ADR is the deliverable, not the implementation. Specifies the mechanism so anyone can implement it later without inventing the design. See docs/adr/022-fleet-agent-upgrade.md.

Summary of the design (full detail in the ADR):

  • K8s rolling-update shape, single-host. Wait for in-flight reconciles to complete + all managed services healthy + a scheduling lock from the operator before swapping.
  • Versioned binary layout on disk:
    /usr/bin/fleet-agent-v0.1.1
    /usr/bin/fleet-agent-v0.1.2
    /usr/local/bin/fleet-agent  → symlink to current
    
    No version is ever erased — N-history is the rollback target.
  • Old verifies new + reports up. Old agent stages new, smoke-tests it (--self-test), starts it, watches for the new agent's heartbeat to land in NATS with the new version. Only then does the operator know the upgrade succeeded.
  • Operator drives the cutover. Operator sends an explicit stop signal to the old agent over NATS. Old agent exits cleanly. New agent is already running and takes over.
  • Reverse path is identical. Roll back = operator publishes desired_version = previous; new agent does the same dance to hand off to old.
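The symlink step of the layout above can be made atomic with a create-then-rename, sketched here for Unix hosts. swap_symlink and the ".swap-tmp" temp name are hypothetical; ADR-022 is the authority on the actual mechanism.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Points `link` at `target` atomically (Unix only).
/// A temporary symlink is created next to `link`, then renamed over it.
/// rename(2) replaces the destination in one step, so readers see either
/// the old target or the new one, never a missing link.
pub fn swap_symlink(target: &Path, link: &Path) -> io::Result<()> {
    // Hypothetical temp name, kept in the same directory as `link`
    // so the rename stays on one filesystem.
    let tmp = link.with_extension("swap-tmp");

    // Clear a stale temp link left by an interrupted earlier attempt.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    symlink(target, &tmp)?;
    fs::rename(&tmp, link)
}
```

Rollback is the same call with the previous versioned binary as target, which is why no version is ever erased from disk.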

Day 2 done when: A.1–A.3 committed, ADR-022 landed, regression script green against staging.


Day 3 — Production deploy

Goal: customer cluster on v0.1, runbook accurate, signed off.

  • 3.1 Tag v0.1.0 from master after feat/iot-walking-skeleton is merged.
  • 3.2 Run e2e-prod-shape.sh against the customer's prod OKD cluster. Every diff between scripted and reality goes back into the script — so the script is the runbook.
  • 3.3 Production-shape doc twin of docs/guides/fleet-staging-install.md. Deltas only, ~50 lines.
  • 3.4 docs/guides/fleet-device-enrollment.md — operator-facing enrollment runbook. Captures the SSO --admin-oidc-client-id resolution and the --device-id RFC1123 validation we locked in on 2026-05-06.
  • 3.5 Operational basics: revoke a device, rotate a key, read the operator's logs, read NATS. Bullet lists are fine: bullet-list-quality docs beat missing docs.

Day 3 done when: customer's prod cluster runs real workloads, the runbook is what we actually used, and we'd hand operations to someone else.


In parallel — frontend (junior, ~1 week, target Day 5 merge)

Junior owns end-to-end. Spec:

  • F.1 Read-only Leptos SPA. Devices + Deployments + per-device drilldown (DeviceInfo + last-heartbeat + agent version).
  • F.2 NATS tail panel. SSE stream of device-info and device-state updates, plain text.
  • F.3 Served by the operator pod itself (one less Deployment). SSO via the existing Zitadel device-code app (harmony-cli).
  • F.4 Not in v0.1: write paths, metrics dashboards, fleet-wide rollout views, NATS GUI. None of those.
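For F.2, the plain-text wire format is small enough to sketch without any framework. sse_frame is a hypothetical helper; it assumes the event name contains no newline and leaves the HTTP side (content type text/event-stream, keep-alive) to whatever server the operator pod ends up using.

```rust
/// Formats one update as a Server-Sent Events frame:
/// an `event:` line, one `data:` line per payload line, then a blank line.
/// Assumption: `event` contains no line breaks.
pub fn sse_frame(event: &str, payload: &str) -> String {
    let mut out = String::new();
    out.push_str("event: ");
    out.push_str(event);
    out.push('\n');
    // split('\n') yields one empty item for an empty payload, so a
    // `data:` line is always present.
    for line in payload.split('\n') {
        out.push_str("data: ");
        out.push_str(line.trim_end_matches('\r'));
        out.push('\n');
    }
    out.push('\n'); // blank line terminates the frame
    out
}
```

A device-state update relayed from NATS would go out as sse_frame("device-state", text), and the browser's EventSource delivers the payload lines joined with newlines.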

This validates the platform is observable from outside the operator's logs — the customer's specific ask.


What slips to v0.2+ (post-prod backlog)

No calendar pressure on these; sequence after we see real customer usage.

| Item | Why deferred | Cost when we do it |
|---|---|---|
| Pluggable harmony CLI (kubectl-style PATH discovery) + harmony-fleet plugin | Customer doesn't run it themselves yet; we do. Examples are good enough. | ~1 week, mostly rename/restructure given Day 1's API freeze. |
| Physical refactor of harmony/modules/fleet/ → fleet/harmony-fleet/ | The Day-1 façade settles the design; the move is mechanical and the ADR for it is still in draft. | ~2 days. |
| Agent upgrade implementation (ADR ships Day 2; impl later) | First customer fleet is small enough to hand-upgrade if needed. | ~1 week. |
| ArgoCD chart publishing | Customer uses ArgoCD downstream but their initial deploy goes through harmony directly. | ~3 days. |
| Full CI e2e (k3d nightly + libvirt + OKD daily) | Manual rehearsal works for one customer. | ~1 week + runner capacity. |
| OpenBao integration (replaces ZitadelClientConfig cache file) | Cache file works for single-operator use; OpenBao is the multi-operator answer. | ~1 week. |
| harmony run <ScoreName> --field=value ad-hoc Score CLI | No v0.1 customer flow needs it. | ~2 weeks (Score-flag derive macro is the hard part). |
| Fleet-wide rollout strategies (canary, %-based) on top of the agent-upgrade primitive | Single-device upgrade is sufficient until >100-device fleets. | ~1 week. |
| Drop K8sAnywhereTopology for ad-hoc Score execution; introduce K8sBareTopology | Per the existing v0_1 §"Principles". Not blocking prod. | ~3 days. |

Principles (kept verbatim from v0_1, still load-bearing)

  • No yaml in framework code paths. Typed kube-rs everywhere.
  • Scores describe desired state; topologies expose capabilities.
  • Cross-boundary wire types in harmony-reconciler-contracts.
  • Never ship untested code.
  • Prove claims about upstream before blaming upstream.

Adding one for v0.2:

  • Design the brick before moving the brick. Lock the public API contract first; physical relocation later. Cardinality-matched types, "make impossible states impossible" — the type system is the deterministic feedback loop that scales with LLM-era code generation throughput. (See JG's Pour l'amour des compilateurs, Botpress Meetup, 2026-04-30.)