
Jean-Gabriel Gill-Couture 616c05d5a4



docs: fleet architecture review — inventory, principles, alternatives

Working document for the architectural redesign of the fleet
platform before v0.1 ships to production. Captures four sections
of research:

§1 — Current state inventory. Markdown-bullet map of every public
type, score, trait, and module across `harmony/modules/fleet/`,
`harmony-reconciler-contracts`, and `fleet/harmony-fleet-*/`.
Sorted by domain meaning (identity, desired state, observed
state, setup, plumbing) rather than location, so the
cross-cutting concerns become visible. Includes a text "diagram"
of the dependency graph showing the two problematic edges:
runtime crates importing CRD types from the framework crate
(`harmony-fleet-operator` ← `harmony::modules::fleet::operator::crd`
verified at `controller.rs:37`, `device_reconciler.rs:21`,
`main.rs:9`) and the agent importing podman wire types from the
framework crate (`harmony-fleet-agent` ← `harmony::modules::podman`
verified at `main.rs:21-22`, `reconciler.rs:11`).

§2 — Theory review. Pulls principles from JG's *Pour l'amour des
compilateurs* talk (2026-04-30), its references (Crichton,
Feldman, Maguire, Goedecke, Fowler), and harmony's own load-bearing
ADRs (002 hexagonal, 003 infrastructure abstractions, 015 higher-
order topologies, 016 agent + global mesh, 018 template hydration).
Synthesizes eight design principles for the redesign — including
Goedecke's guardrail that "type-driven" ≠ "type-everything" so we
don't over-fit the cardinality argument.

§3 — Ten concrete shape problems (P1–P10), framed as cardinality
mismatches, leaky boundaries, and "is this resolved yet" branches
rather than bugs. P1 is the placement issue JG flagged in code
review; P2 is `FleetDeviceAuth`'s mixed resolved/unresolved
states; P10 is the credential-shape staircase across operator
workstation / operator pod / agent.

§4 — Five design alternatives, each scored against P1–P10:
  A. Move + thin façade (conservative cleanup).
  B. Resolved-only at boundaries + capability traits (principled
     incremental).
  C. Dataflow reframe (events in, state out).
  D. Fleet as kube control plane, period (deliberately weird).
  E. Algebra of fleets (deliberately mathematical).

A is too little, C/D/E are right-shape but wrong-timing for the
3-day window. B is the working recommendation, with explicit
awareness that D is the v2.0 destination and the capability
traits in B are the seam that lets us migrate without breaking
callers.

§5 sketches a concrete shape for B: new `harmony-fleet/` domain
crate with no framework dependency, `harmony-fleet-adapters-*`
crates for NATS/Zitadel/kube, the existing operator/agent/auth
crates wire adapters together, the framework's
`harmony::modules::fleet` collapses to a re-export module that
goes away by v0.2.

§6 — Five open questions for JG's review before locking the
choice. §7 — explicit "spike one slice, then commit or back out"
process so we don't lock the wrong shape.

Not an ADR yet. The ADR happens after JG agrees on which
alternative is the working hypothesis and the spike confirms the
shape feels better in code than on paper.

2026-05-07 05:20:25 -04:00



Fleet platform — architecture review

Working document for the architectural redesign of the fleet platform before v0.1 ships to production. Started 2026-05-07.

This is a research + design document, not a plan to execute. The output of this work is an ADR (or set of ADRs) that lock the new shape; the v0.2 roadmap will reference whichever option we pick.

Why now

Three days from production. No customers depend on the API yet → API/UX/DX is still cheap to change. After ship, every breaking change costs us a week of customer-coordination overhead.
The harmony/modules/fleet/ placement is wrong, as already flagged in code review. The reasons it ended up there are subtle (cross-module imports of K8sAnywhereTopology, HelmChartScore, K8sResourceScore, harmony_secret, Topology capability traits). Those need to be written down before the file move, not after.
The plumbing — NATS + Zitadel + auth callout + operator + agent — is sound. Highly secure, scalable by design, low resource footprint. The redesign is about moving code and better data structures, not rebuilding mechanisms.
The frame from JG's Pour l'amour des compilateurs talk: cardinality-matched types, "make impossible states impossible", expressive types as the deterministic feedback loop that scales with LLM-era code generation throughput. Apply that frame here.

Working plan

Inventory. Map every public type, trait, score, module, and crate that participates in the fleet domain. Markdown-bullet shape; no diagrams.
Read the room. Pull principles from JG's talk, its references, and harmony's existing ADRs (002 hexagonal, 003 infrastructure abstractions, 015 higher-order topologies, 016 harmony agent + global mesh, 017 NATS interconnection, 018 template hydration). Note where the existing fleet design already follows them and where it doesn't.
Identify the design problems. Not bugs — shape problems. Cardinality mismatches, leaky boundaries, "is this resolved yet" branches, location/dependency loops.
Sketch alternatives. Three to five. At least one conventional cleanup, at least one out-of-the-box that reframes the domain. Compare on the same axes (cardinality, placement, ergonomics, extensibility).
Pick (or recommend) one. Land as ADR.

This document covers steps 1–4. The pick happens in conversation with JG before the ADR.

§1 — Current state inventory

§1.1 — Where the code lives

The fleet domain spans three concerns that today live in three locations:

Framework-side scoring (what runs on the operator's workstation when they cargo run the install) → lives in harmony/src/modules/fleet/. This is the wrong home; it's the thing this review is about moving.
- mod.rs — re-exports
- assets.rs — Ubuntu/Debian cloud image fetchers, libvirt SSH keypair management
- libvirt_pool.rs — libvirt storage pool bring-up
- setup_score.rs (1053 LOC, the monster) — FleetDeviceSetupScore, FleetDeviceSetupConfig, FleetDeviceAuth (TomlShared|ZitadelJwt|ZitadelEnroll), AdminAuth, HostsEntry, merge_hosts_file
- vm_score.rs — ProvisionVmScore (libvirt VM bring-up)
- preflight.rs — check_fleet_smoke_preflight* (host system checks)
- server.rs — FleetServerScore, FleetServerInterpret (composed bring-up of Zitadel + NATS + callout + operator)
- operator/
  - mod.rs, score.rs — FleetOperatorScore, FleetOperatorInterpret (operator helm install)
  - chart.rs (453 LOC) — chart rendering (ChartOptions, OperatorCredentials, build_chart, operator_secret, build_operator_deployment, build_cluster_role)
  - crd.rs — Deployment CRD type (DeploymentSpec, Rollout, RolloutStrategy, DeploymentStatus, DeploymentAggregate, AggregateLastError); Device CRD type (DeviceSpec)
Cross-boundary wire types (the "contract" agent and operator both have to agree on) → lives in harmony-reconciler-contracts/.
- fleet.rs — DeviceInfo, DeploymentState, HeartbeatPayload, DeploymentName, InvalidDeploymentName
- kv.rs — bucket name constants + key-builder functions
- status.rs — Phase, InventorySnapshot
- re-exports harmony_types::id::Id
Runtime binaries (what runs in the cluster + on devices) → lives in fleet/.
- harmony-fleet-operator/ — the operator pod. controller.rs, device_reconciler.rs, fleet_aggregator.rs (833 LOC), install.rs, main.rs. Pulls Deployment/Device CRDs from harmony::modules::fleet::operator::crd (cross-crate import that should give us pause).
- harmony-fleet-agent/ — the on-device daemon. config.rs, reconciler.rs, fleet_publisher.rs, main.rs.
- harmony-fleet-auth/ — JWT-bearer / NATS-credentials helpers used by both the operator AND the agent. config.rs, credentials.rs (553 LOC). Sits between contracts and the runtime crates.

§1.2 — Public types, sorted by domain meaning (not location)

Identity & devices

harmony_types::id::Id — opaque, sortable, collision-safe identifier. Used as device id, deployment id, …
DeploymentName (newtype with validation, harmony-reconciler-contracts)
DeviceInfo — heartbeat payload that materializes into a Device CR
DeviceSpec — kube CRD, holds an optional InventorySnapshot
InventorySnapshot — hardware/OS facts published once at registration

Deployment desired-state

DeploymentSpec — kube CRD: target_selector: LabelSelector, score: ReconcileScore, rollout: Rollout
ReconcileScore (in harmony::modules::podman, re-exported from harmony::modules::fleet::operator::crd) — externally-tagged enum, today only PodmanV0(PodmanV0Score)
PodmanV0Score, PodmanService, EnvVar, VolumeMount, RestartPolicy
Rollout, RolloutStrategy::Immediate

Deployment observed-state

DeploymentState — what the agent publishes per device per deployment after reconcile
DeploymentStatus (kube CRD) — operator-side rollup of all device states for one Deployment CR
DeploymentAggregate — counts (matched, succeeded, failed, pending) + last_error: Option<AggregateLastError>
Phase — Pending | Running | Failed

Authentication / identity provider

FleetDeviceAuth — sum type with TomlShared | ZitadelJwt | ZitadelEnroll. The ZitadelEnroll arm carries unresolved-state — admin credentials that must be turned into a device JSON key at execute time. Mixes resolved and unresolved states in one type, which is the cardinality bug we keep hitting.
AdminAuth — Sso { client_id } | Token(String) (used inside ZitadelEnroll)
CredentialsSection — TOML-on-disk shape (in harmony-fleet-auth, parallel to FleetDeviceAuth)
CredentialSource — runtime credential factory
NatsCredential — what async-nats actually consumes
MachineKeyFile, CachedToken

Setup procedures (Scores)

FleetDeviceSetupScore (FleetDeviceSetupConfig) — the workhorse: installs podman, drops the agent binary, drops the credentials TOML, drops the keyfile, brings up the systemd unit.
FleetServerScore — orchestrates Zitadel install + identity setup + NATS install + callout install + operator install. Wraps five other scores.
FleetOperatorScore — operator helm chart render + install + the credentials Secret apply.
ProvisionVmScore — libvirt VM bring-up. Used by VM rehearsals.
(External, not in fleet/) ZitadelScore, ZitadelSetupScore, NatsK8sScore, NatsAuthCalloutScore — all consumed by the composed install.

Operator-internal types

FleetState, SharedFleetState, DeploymentKey, DevicePair, CachedDeployment, Context, Error (the controller's local error type), selector_matches, apply_state, drop_state, compute_aggregate

Agent-internal types

AgentConfig, AgentSection, NatsSection, CredentialsSection
FleetPublisher, Reconciler

Fleet plumbing for development

FleetSshKeypair, the cloud-image consts, HarmonyFleetPool, merge_hosts_file, HostsEntry, check_fleet_smoke_preflight*

NATS subjects + KV buckets (the wire seam)

BUCKET_DESIRED_STATE = "desired-state"
BUCKET_DEVICE_INFO = "device-info"
BUCKET_DEVICE_STATE = "device-state"
BUCKET_DEVICE_HEARTBEAT = "device-heartbeat"
Key builders: desired_state_key(device_id, deployment_name), device_info_key(device_id), device_state_key(device_id, deployment_name), device_heartbeat_key(device_id)

§1.3 — Concept clusters

When you squint at the inventory, the domain falls into five clusters:

Identity — who is this device, who is this deployment, who is the operator, what auth do they have.
Desired state — what should be running where.
Observed state — what is actually running where.
Setup — bringing all this into existence on a fresh cluster + fresh device.
Plumbing — the NATS/kube/Zitadel mechanisms that make 1–4 work.

The current code does not cleanly separate these. Examples:

setup_score.rs mixes Setup (drop binary, run systemd) with Identity (FleetDeviceAuth). 1053 LOC.
FleetDeviceAuth mixes resolved-Identity (ZitadelJwt — here's a key) with Setup-time-Identity-resolution-intent (ZitadelEnroll — here's how to mint a key).
The chart-render helpers (build_operator_deployment, etc.) are pub from harmony::modules::fleet::operator::chart so the composed-install scores can pluck the secret out before helm install. Plumbing leaking through Setup.
harmony::modules::fleet::operator::crd::DeploymentSpec is the CRD definition AND it's the type the operator daemon imports to reconcile. Cross-crate import from a runtime crate (harmony-fleet-operator) into a framework crate (harmony). This is the placement bug.

§1.4 — The shape problem in one diagram (text)

                         framework/operator workstation
                              │
   harmony::modules::fleet  ──┤  Scores: FleetServerScore, FleetDeviceSetupScore,
                              │          FleetOperatorScore, ProvisionVmScore
                              │  CRD types: Deployment, Device, DeploymentSpec, ...
                              │  Chart rendering helpers (operator/chart.rs)
                              │
   harmony-reconciler-contracts ── wire types: DeviceInfo, DeploymentState,
                              │                HeartbeatPayload, KV constants
                              │  ▲                                              ▲
                              │  │                                              │
                              │  │  imports                              imports│
                              │  │                                              │
                       fleet/harmony-fleet-agent          fleet/harmony-fleet-operator
                              ▲                                          ▲
                              │                                          │
                              │  ALSO imports                ALSO imports│
                              │  from harmony::modules::      from harmony::modules::
                              │  podman (PodmanV0Score)       fleet::operator::crd

Two problematic edges:

harmony-fleet-operator imports harmony::modules::fleet::operator::crd::Deployment. The runtime daemon depends on the framework crate just for CRD type definitions.
harmony-fleet-agent imports harmony::modules::podman::{PodmanV0Score, PodmanTopology, ReconcileScore}. The agent depends on the framework crate's podman module for the score it deserializes off the wire.

Both edges should run through harmony-reconciler-contracts, not around it. That's the placement bug surfaced.

§2 — Theory review

§2.1 — From the talk

Pulling the load-bearing principles, ranked by relevance to this redesign:

Cardinality matters. Types should match the cardinality of the real-world concept. &str for "primary color" admits infinite invalid inputs; enum { Red, Yellow, Blue } admits exactly three. Friction is proportional to mismatch.
Make impossible states impossible. Don't comment the constraint, code it. Push runtime errors to the design phase.
Representations matter. Same data, different shapes ↔ different operations are cheap. Roman numerals ↔ addition; Arabic ↔ multiplication. "An API is a computational representation of real-world concepts."
The compiler is a deterministic feedback channel. In an era when LLMs generate code at 5–10K LOC/day, the only sensor that keeps up runs in milliseconds and is deterministic. Lean on it.
Strong types reduce code volume + test boilerplate + token waste + review burden + CI time + production incidents — and increase refactoring confidence and velocity-over-time. The bet is asymmetric.
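The cardinality point can be made concrete with the primary-color example from the list above. This is a minimal sketch, not code from the talk or from harmony: the names PrimaryColor, NotAPrimaryColor and hex are hypothetical. The unbounded string is inspected in exactly one place, and everything downstream works with a three-valued type.

```rust
use std::str::FromStr;

/// Closed set: exactly three inhabitants, matching the concept's cardinality.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimaryColor {
    Red,
    Yellow,
    Blue,
}

/// Error returned when a string does not name a primary color (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub struct NotAPrimaryColor(pub String);

impl FromStr for PrimaryColor {
    type Err = NotAPrimaryColor;

    // The only place an unbounded string is inspected; callers downstream
    // receive the three-valued enum.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "red" => Ok(PrimaryColor::Red),
            "yellow" => Ok(PrimaryColor::Yellow),
            "blue" => Ok(PrimaryColor::Blue),
            other => Err(NotAPrimaryColor(other.to_string())),
        }
    }
}

/// Exhaustive match: adding a fourth variant would make this fail to compile
/// until the new arm is written.
pub fn hex(color: PrimaryColor) -> &'static str {
    match color {
        PrimaryColor::Red => "#ff0000",
        PrimaryColor::Yellow => "#ffff00",
        PrimaryColor::Blue => "#0000ff",
    }
}
```

The parse boundary is where the mismatch between the infinite input space and the three-valued concept is paid for, once.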

§2.2 — From the references

Grouping by what they imply for this redesign:

Will Crichton — Type-Driven API Design + Rust API Type Patterns

Typestate. Encode "phase of an operation" in the type parameter. A ProgressBar<Bounded> exposes .with_eta(); a ProgressBar<Unbounded> doesn't. The contradictory call doesn't compile.
Direct application: FleetDeviceAuth mixes phases. The ZitadelEnroll arm is unresolved, the ZitadelJwt arm is resolved, the TomlShared arm doesn't even need resolution. A typestate would model these as distinct types; only one of them has agent.write_to_disk().
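A minimal typestate sketch along the lines of the ProgressBar example above. It is neither Crichton's code nor any real progress-bar library's API: the method names with_total and remaining are assumptions, with remaining standing in for the bounded-only .with_eta() mentioned above.

```rust
use std::marker::PhantomData;

/// Marker types for the two phases. They carry no data.
pub struct Bounded;
pub struct Unbounded;

/// Hypothetical progress bar whose phase lives in the type parameter.
pub struct ProgressBar<State> {
    done: u64,
    total: u64, // only meaningful when State = Bounded
    _state: PhantomData<State>,
}

impl ProgressBar<Unbounded> {
    pub fn new() -> Self {
        ProgressBar { done: 0, total: 0, _state: PhantomData }
    }

    /// Phase transition: consumes the unbounded bar and returns a bounded one.
    pub fn with_total(self, total: u64) -> ProgressBar<Bounded> {
        ProgressBar { done: self.done, total, _state: PhantomData }
    }
}

impl<State> ProgressBar<State> {
    /// Available in every phase.
    pub fn tick(&mut self) {
        self.done += 1;
    }
}

impl ProgressBar<Bounded> {
    /// Only exists on the bounded phase. Calling it on
    /// ProgressBar<Unbounded> is a compile error, not a runtime check.
    pub fn remaining(&self) -> u64 {
        self.total.saturating_sub(self.done)
    }
}
```

Applied to auth, the same pattern would put the disk-writing method on the resolved type only, so the unresolved ZitadelEnroll case could not reach it.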

Richard Feldman — Making Impossible States Impossible

Slogan-as-tool. Look at every Option<T> and ask "can two of these be inconsistent at once?" If yes, that's an impossible state — refactor.
Direct application: FleetDeviceSetupConfig has auth: FleetDeviceAuth AND agent_binary_path: PathBuf. Today nothing prevents auth = TomlShared (no Zitadel) with agent_binary_path pointing at the wrong-arch binary. We could encode the agent binary's target arch as a typestate parameter and refuse to deploy to a device with a known-different arch inventory.

Sandy Maguire — Protos Are Wrong

Protocol buffers throw away information real type systems preserve. Sum types, exhaustiveness, parametric polymorphism, Maybe/Result — protos can't express any of them precisely. The "loose contract" sells you weak invariants.
Direct application: harmony-reconciler-contracts is JSON-shaped at the wire (matched on type tag for ReconcileScore). We're already paying the proto-class tax: any new variant requires both ends to know about it; the wire format doesn't enforce a schema; old agents see new variants as parse errors. This is an honest constraint — wire formats need to be permissive by design — but it argues for keeping the wire types small and obviously evolvable while letting in-memory types be cardinality-matched.
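One way to keep the wire permissive while the in-memory type stays closed, sketched under assumptions: WireScore stands in for the JSON envelope (a type tag plus an opaque payload), and KnownScore, Decoded and decode are hypothetical names. The current contracts crate may handle this differently.

```rust
/// Permissive wire shape: a type tag plus an opaque payload.
/// Hypothetical stand-in for the JSON envelope.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WireScore {
    pub tag: String,
    pub payload: String,
}

/// Closed in-memory shape: only the variants this build understands.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum KnownScore {
    PodmanV0 { payload: String },
}

/// Decoding outcome. An unknown tag is a first-class result rather than a
/// parse error, so an old agent can report it instead of crashing.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decoded {
    Known(KnownScore),
    UnknownVariant { tag: String },
}

/// The single translation point between the loose wire contract and the
/// cardinality-matched in-memory type.
pub fn decode(wire: WireScore) -> Decoded {
    let WireScore { tag, payload } = wire;
    if tag == "PodmanV0" {
        Decoded::Known(KnownScore::PodmanV0 { payload })
    } else {
        Decoded::UnknownVariant { tag }
    }
}
```

Code past decode matches exhaustively on KnownScore; only the decode boundary needs to tolerate variants it has never seen.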

Sean Goedecke — Invalid States

The skeptic's case: making impossible states impossible can be over-applied. Sometimes a String is the right cardinality even when an enum exists, because the enum binds you to a closed world.
Direct application: Don't make device_id a closed enum. The newtype + RFC1123 validation we just added is the right cardinality match: it's string-like, but admits only valid strings. Over-modeling would have us build enum DeviceId { Pi(PiSerial), Vm(VmName), …}: a closed world that breaks the first time a customer plugs in an x86 box.
Useful guardrail: type-driven ≠ type-everything. The question to ask each time is "what's the cardinality of this concept in reality" — not "can I model this".
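A sketch of what "string-like, but only valid strings" can look like. The DeviceId type and its error enum are hypothetical, and the rules are an assumption: the Kubernetes reading of an RFC 1123 label (lowercase alphanumerics and '-', alphanumeric at both ends, at most 63 characters). The validation actually in the repo may differ.

```rust
/// Open-world identifier: any valid label is accepted, no closed enum.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DeviceId(String);

/// Reasons a raw string is rejected (hypothetical error shape).
#[derive(Debug, PartialEq, Eq)]
pub enum InvalidDeviceId {
    Empty,
    TooLong(usize),
    BadCharacter(char),
    BadEdge,
}

impl DeviceId {
    /// Validates once at construction; holders of a DeviceId never re-check.
    pub fn parse(raw: &str) -> Result<Self, InvalidDeviceId> {
        if raw.is_empty() {
            return Err(InvalidDeviceId::Empty);
        }
        if raw.len() > 63 {
            return Err(InvalidDeviceId::TooLong(raw.len()));
        }
        let is_allowed = |c: char| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-';
        if let Some(bad) = raw.chars().find(|c| !is_allowed(*c)) {
            return Err(InvalidDeviceId::BadCharacter(bad));
        }
        if raw.starts_with('-') || raw.ends_with('-') {
            return Err(InvalidDeviceId::BadEdge);
        }
        Ok(DeviceId(raw.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

A new hardware class needs no code change here, which is the point of keeping the set open.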

Martin Fowler — Harness Engineering (April 2026)

Computational sensors (compilers, type checkers, linters) over inferential ones (tests, code review). Compiler runs on every change; tests don't.
Direct application: prefer compiler-checked invariants over doc-comment invariants. If the docs say "this Score's auth field must be resolved at the call site of execute()", the compiler should enforce it.

§2.3 — From harmony's own ADRs

Reading the existing ADRs as design language already in use — what vocabulary should the new fleet shape stay consistent with?

ADR-002 (hexagonal architecture)

"Domain isolated from adapters." Domain types own the vocabulary; adapters (k8s client, NATS, helm) translate at the edge.
Implication for fleet: the domain is identity + desired state + observed state. The adapters are NATS-KV, kube-CRD, helm-chart, ansible-over-SSH. The current harmony::modules::fleet mixes both. Pulling adapters out is the refactor.

ADR-003 (infrastructure abstractions)

"Abstractions at domain level, not provider level. DnsServer not OPNsenseDns."
Implication for fleet: capability traits like DeviceRegistry, DesiredStatePublisher, ObservedStateConsumer — each a standard infrastructure need that NATS-KV happens to fulfill today, that another transport (gRPC streaming, MQTT, Redis streams) could fulfill tomorrow.
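A rough sketch of one such capability trait with an in-memory implementation. The trait name comes from the text; the method set, the FleetDevice fields and the helper count_by_arch are assumptions. A real version would likely be async and fallible, both omitted here to keep the sketch dependency-free.

```rust
use std::collections::BTreeMap;

/// Plain domain record (field set is an assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetDevice {
    pub id: String,
    pub arch: String,
}

/// Domain-level capability: says what is needed, not which transport
/// provides it.
pub trait DeviceRegistry {
    fn upsert(&mut self, device: FleetDevice);
    fn get(&self, id: &str) -> Option<FleetDevice>;
    fn list(&self) -> Vec<FleetDevice>;
    /// Returns true if a device was removed.
    fn delete(&mut self, id: &str) -> bool;
}

/// Test double. A NATS-KV-backed type would implement the same trait.
#[derive(Default)]
pub struct InMemoryDeviceRegistry {
    devices: BTreeMap<String, FleetDevice>,
}

impl DeviceRegistry for InMemoryDeviceRegistry {
    fn upsert(&mut self, device: FleetDevice) {
        self.devices.insert(device.id.clone(), device);
    }
    fn get(&self, id: &str) -> Option<FleetDevice> {
        self.devices.get(id).cloned()
    }
    fn list(&self) -> Vec<FleetDevice> {
        self.devices.values().cloned().collect()
    }
    fn delete(&mut self, id: &str) -> bool {
        self.devices.remove(id).is_some()
    }
}

/// Domain code is written against the capability only.
pub fn count_by_arch<R: DeviceRegistry>(registry: &R, arch: &str) -> usize {
    registry.list().iter().filter(|d| d.arch == arch).count()
}
```

Nothing in count_by_arch would change if the backing store moved from NATS-KV to another transport.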

ADR-015 (higher-order topologies)

Higher-order topologies (FailoverTopology<T>, DecentralizedTopology<T>) compose via blanket trait impls. T: PostgreSQL ⇒ FailoverTopology<T>: PostgreSQL. Zero boilerplate.
Implication for fleet: FleetTopology<T> could compose with a base K8sTopology<T> rather than being a parallel concept. "A fleet is a thing that is both a kube cluster and a device registry."
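The blanket-impl composition can be sketched as follows. Only the names PostgreSQL and FailoverTopology<T> come from the ADR summary above; the trait method, the struct fields and SingleCluster are hypothetical and not harmony's actual definitions.

```rust
/// Capability trait (the method is a placeholder for illustration).
pub trait PostgreSQL {
    fn connection_url(&self) -> String;
}

/// Higher-order topology: wraps two topologies of the same kind.
pub struct FailoverTopology<T> {
    pub primary: T,
    pub replica: T,
    pub primary_healthy: bool,
}

/// Blanket impl: T: PostgreSQL implies FailoverTopology<T>: PostgreSQL.
/// Written once, it covers every base topology that has the capability.
impl<T: PostgreSQL> PostgreSQL for FailoverTopology<T> {
    fn connection_url(&self) -> String {
        if self.primary_healthy {
            self.primary.connection_url()
        } else {
            self.replica.connection_url()
        }
    }
}

/// A base topology (hypothetical).
pub struct SingleCluster {
    pub host: &'static str,
}

impl PostgreSQL for SingleCluster {
    fn connection_url(&self) -> String {
        format!("postgres://{}:5432", self.host)
    }
}

/// Consumers bound on the capability and accept base or wrapped topologies.
pub fn describe<T: PostgreSQL>(topology: &T) -> String {
    topology.connection_url()
}
```

Because the impl is generic, a FailoverTopology of FailoverTopologies also satisfies the bound with no extra code; a FleetTopology<T> could follow the same shape.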

ADR-016 (Harmony Agent + Global Mesh)

Agents are processes that observe + reconcile per a desired state published into a NATS mesh. Mesh is the reliable hop; agents are stateless processors at the edge.
Implication for fleet: the IoT fleet is a specialization of the agent + mesh ADR — devices are agents, the operator is a coordinator. The fleet domain types should fit ADR-016's vocabulary, not invent a parallel one.

ADR-017 (NATS clusters interconnection)

Trust topology: per-cluster account isolation, gateway-mediated cross-cluster traffic. Per-device permissions are a specialization of per-account.
Implication for fleet: the auth callout's per-device permission templates should compose with the cluster-interconnection account model — currently they're treated as orthogonal, which is fine until we actually cross fleets.

ADR-018 (template hydration)

Hydrating templates at the edge of the framework, not in the middle. Same pattern as our generated chart YAML: render once, apply via typed code.
Implication for fleet: chart-rendering helpers (build_operator_deployment et al.) are template-hydration edges. They should be hidden from domain code. Today they're pub — visible to consumers like fleet_staging_install who reach in and grab operator_secret(opts). That's adapter leakage.

§2.4 — Synthesis: principles for the redesign

A short list, ordered. Each line is something the new shape should satisfy:

Domain types in harmony-reconciler-contracts (or a sibling crate), with no dependency on harmony framework types.
Resolved types only at the API surface. Pre-resolution intent is a separate type, used only by the resolver.
Capabilities as traits, not concrete types. DeviceRegistry, DesiredStatePublisher, etc. The NATS-backed impl is one of several allowed.
Closed cardinality where reality is closed; open where reality is open. Goedecke's check, not Feldman's.
Higher-order topology, not parallel topology. A fleet is a FleetTopology<T> over a base K8s topology, not a separate capability hierarchy.
Adapters hidden behind capabilities. Helm chart rendering, k8s resource apply, NATS subjects — none of these surface from the fleet's public API.
No yaml in framework code paths. Existing principle from v0_1; keep.
Keep wire types minimal + permissive. Not because they're the canonical model, but because they're the evolvability seam (Maguire's protos critique applies in reverse — embrace the loose contract on the wire, reject it in-memory).

§3 — Design problems with the current shape

Concrete issues the redesign needs to fix. Not "bugs" — shape problems. Each numbered so we can refer back when comparing alternatives.

P1. harmony/modules/fleet/ is in the wrong crate. It pulls framework dependencies (HelmChartScore, K8sResourceScore, K8sAnywhereTopology, harmony_secret, etc.) and the runtime daemons import from it. This makes the operator/agent depend transitively on every harmony module — including the OPNsense XML codegen, OKD bootstrap stuff, etc. Compile times suffer; the release surface is wrong (you can't cargo install harmony-fleet-operator without all of harmony).
P2. FleetDeviceAuth mixes resolved + unresolved states. ZitadelEnroll is pre-resolution intent; ZitadelJwt is post-resolution credential. A single match arm has to handle both. The "render TOML for both" hack we wrote works but is a symptom — the TOML for an unresolved auth should be undefined, not "same as resolved".
P3. setup_score.rs is a 1053 LOC monolith. Eight responsibilities in one file: ssh-vs-local connection, ansible orchestration, systemd unit text, hosts-file merging, podman package install, fleet-agent user provisioning, keyfile writing, agent restart. Readability is poor; testability is per-orchestration not per-step.
P4. CRD types live in framework crate. Deployment and Device CRDs are defined in harmony::modules::fleet::operator::crd. The runtime operator crate (harmony-fleet-operator) imports them from there. This is the most visible symptom of P1.
P5. ReconcileScore polymorphism is anemic. Today there's exactly one variant, PodmanV0. The wire format is set up for evolution but no second variant exists, and the cross-crate import from harmony::modules::podman makes adding one expensive (re-export dance).
P6. Adapter leakage from chart rendering. build_operator_deployment, operator_secret, build_chart are pub. Consumers in examples/ reach in to compose helm releases by hand. Domain code should not see "what does the operator's helm chart look like".
P7. Composed scores wrap composed scores wrap composed scores. FleetServerScore wraps {ZitadelScore, ZitadelSetupScore, NatsK8sScore, NatsAuthCalloutScore, FleetOperatorScore}. Each of those does its own k8s resource apply + helm install. Failure modes are deep: a problem in one score's interpret surfaces wrapped through five layers of "context()". Hard to debug; hard to reason about ordering.
P8. Topology assumptions are everywhere. Every Score bound is a hand-rolled union of capability traits, e.g. T: Topology + HelmCommand + K8sclient + TlsRouter + 'static. Add a new capability and every callsite has to be updated. Higher-order topology composition (ADR-015) would let us name "a thing that is a fleet-capable cluster" once.
P9. Id is overloaded. Same type for device IDs, machine user IDs, deployment IDs, topology names. Newtype-ing each would catch arg-order swaps at compile time.
P10. Configuration is a staircase. Operator workstation has a ZitadelClientConfig cache file. Operator pod has env-var-from-Secret. Agent has TOML on disk. Three different shapes for fundamentally the same data (issuer URL, audience, key material). Maguire's protos critique applies internally: we're using several loose-contract serializations of the same domain object.
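For P8, the smallest way to name the bound once is a supertrait with a blanket impl. This is a plain Rust idiom, not the ADR-015 mechanism itself. The four trait names are taken from the bound quoted in P8, but their methods here are placeholders, and FleetCapable, LocalCluster and plan_install are hypothetical.

```rust
// Placeholder capability traits; the real ones have different methods.
pub trait Topology { fn name(&self) -> String; }
pub trait HelmCommand { fn helm_binary(&self) -> String; }
pub trait K8sclient { fn kube_context(&self) -> String; }
pub trait TlsRouter { fn tls_domain(&self) -> String; }

/// The union of bounds, named once.
pub trait FleetCapable: Topology + HelmCommand + K8sclient + TlsRouter + 'static {}

/// Blanket impl: anything with all four capabilities is fleet-capable.
/// Adding a fifth capability means editing these two items, not every
/// callsite.
impl<T> FleetCapable for T where T: Topology + HelmCommand + K8sclient + TlsRouter + 'static {}

/// A Score-like consumer states one bound instead of the full union.
pub fn plan_install<T: FleetCapable>(t: &T) -> String {
    format!(
        "{}: {} --kube-context {} (tls: {})",
        t.name(),
        t.helm_binary(),
        t.kube_context(),
        t.tls_domain()
    )
}

/// Hypothetical topology used only to exercise the bound.
pub struct LocalCluster;
impl Topology for LocalCluster { fn name(&self) -> String { "local".to_string() } }
impl HelmCommand for LocalCluster { fn helm_binary(&self) -> String { "helm".to_string() } }
impl K8sclient for LocalCluster { fn kube_context(&self) -> String { "k3d-fleet".to_string() } }
impl TlsRouter for LocalCluster { fn tls_domain(&self) -> String { "fleet.example.invalid".to_string() } }
```

This only shortens the bounds; it does not by itself give the composition that a higher-order FleetTopology<T> would.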

§4 — Design alternatives

Five sketches. The first three are increasingly principled cleanups; the last two are deliberately weird, included to force us to recognize where the core of the domain actually is.

For each: one paragraph of premise, the resulting top-level types, how it answers each of P1–P10 (✓ / ✗ / partial), and the honest pros + cons.

Alternative A — Move + thin façade (the conservative cleanup)

Premise: the existing types are mostly right; the location is wrong and the façade leaks. Move harmony/modules/fleet/ to fleet/harmony-fleet/. Re-export only what's intended public. Don't redesign types.

Top-level types: unchanged. FleetDeviceSetupScore, FleetServerScore, FleetOperatorScore, FleetDeviceAuth, AdminAuth, Deployment CRD, Device CRD. Same shapes, new location.

P1 ✓ (location fix is the goal). P2 ✗ (auth still mixes resolved/unresolved). P3 ✗ (monolith preserved). P4 ✓ (CRDs co-located with operator). P5 ✗. P6 partial (we can pub(crate) the chart helpers but the underlying coupling remains). P7 ✗. P8 ✗. P9 ✗. P10 ✗.

Pros: small, safe, mechanical. Two days of work. No customer-visible breakage. Unblocks P4 cleanup naturally.

Cons: doesn't actually fix the shape. We'd be back here in six weeks. JG's review already said this isn't enough. Not the right answer for v0.1 timing — would be the right answer if we'd already shipped to two customers and couldn't break their code.

Alternative B — Resolved-only at boundaries + capability traits (the principled cleanup)

Premise: Crichton's typestate + ADR-003's domain capabilities applied to the existing shape. Split resolved vs. unresolved auth into separate types. Define capability traits for the adapters. Move into the right crate. No wholesale rewrite.

Top-level types:

New crate harmony-fleet/ (sibling to harmony-fleet-operator, -agent, -auth). Domain types live here.
FleetIdentity, FleetDevice, FleetDeployment — domain records. Plain data.
DeviceCredential — resolved only (a JSON keyfile + issuer URL + audience). Replaces FleetDeviceAuth::ZitadelJwt.
EnrollmentIntent — pre-resolution. Carries AdminAuth and what to mint. Method resolve(&self) -> Result<DeviceCredential>.
Scores become small + single-responsibility:
- EnrollDeviceScore — runs EnrollmentIntent::resolve then publishes to NATS.
- InstallAgentScore — drops binary + config + systemd unit. Takes a DeviceCredential. Doesn't know about Zitadel.
- InstallOperatorScore — helm chart + Secret. Doesn't know about devices.
- BringUpFleetScore — composes the above. Single layer of composition, not five.
Capability traits:
- DeviceRegistry — list/get/upsert/delete a FleetDevice. Implementations: NatsKvDeviceRegistry, (later) RedisStreamsDeviceRegistry.
- DesiredStatePublisher, ObservedStateConsumer — same shape.
- IdentityProvider — mint a device credential, issue an admin token. Today: Zitadel. Tomorrow: something else.
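A sketch of how the resolved/unresolved split and the IdentityProvider capability could fit together. The type names come from the list above. Everything else is an assumption: the field names, the EnrollError type, the TOML keys in render_agent_credentials, and passing the provider into resolve (the list shows resolve(&self) with no argument). FakeProvider is a test double, not a Zitadel client.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AdminAuth {
    Sso { client_id: String },
    Token(String),
}

/// Resolved only: everything an agent needs, nothing about how it was minted.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceCredential {
    pub key_json: String,
    pub issuer_url: String,
    pub audience: String,
}

#[derive(Debug, PartialEq, Eq)]
pub struct EnrollError(pub String);

/// Capability: today Zitadel, tomorrow something else.
pub trait IdentityProvider {
    fn mint_device_credential(&self, admin: &AdminAuth, device_id: &str)
        -> Result<DeviceCredential, EnrollError>;
}

/// Pre-resolution intent. Has no method that writes anything to a device.
pub struct EnrollmentIntent {
    pub admin: AdminAuth,
    pub device_id: String,
}

impl EnrollmentIntent {
    /// The only path from intent to credential.
    pub fn resolve<P: IdentityProvider>(&self, provider: &P) -> Result<DeviceCredential, EnrollError> {
        provider.mint_device_credential(&self.admin, &self.device_id)
    }
}

/// Stand-in for the agent-install step: it accepts a DeviceCredential, so an
/// unresolved intent cannot be rendered. The TOML keys are made up.
pub fn render_agent_credentials(cred: &DeviceCredential) -> String {
    format!("issuer = \"{}\"\naudience = \"{}\"\n", cred.issuer_url, cred.audience)
}

/// Test double that rejects an empty admin token.
pub struct FakeProvider;

impl IdentityProvider for FakeProvider {
    fn mint_device_credential(&self, admin: &AdminAuth, device_id: &str)
        -> Result<DeviceCredential, EnrollError> {
        if let AdminAuth::Token(token) = admin {
            if token.is_empty() {
                return Err(EnrollError("empty admin token".to_string()));
            }
        }
        Ok(DeviceCredential {
            key_json: format!("{{\"device\":\"{}\"}}", device_id),
            issuer_url: "https://idp.example.invalid".to_string(),
            audience: "fleet".to_string(),
        })
    }
}
```

With this split, "render TOML for an unresolved auth" is not expressible, which is the P2 fix stated as a type.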

P1 ✓ P2 ✓ P3 ✓ (split into 4–5 small Scores). P4 ✓ P5 ✓ (resolve in the runtime crate, contracts stay neutral). P6 ✓ (chart helpers pub(crate), surfaced via IdentityProvider + DeploymentReleaseManager traits). P7 ✓ (one composer, not five). P8 partial (capability traits defined but bound unions still get long). P9 ✓ with newtypes. P10 partial (still three on-disk shapes for credentials, but unified by trait).

Pros: highest-leverage incremental redesign. Buys us most of the principles without rebuilding plumbing. Customer-visible breakage is contained to public API renames + import path moves — no behavior change. Three days is realistic.

Cons: we still have a Score-shaped mental model where the unit of execution is "a Score". If the right primitive turns out to be smaller (an effect, an event, a capability call), this choice wastes some leverage.

Alternative C — The dataflow reframe (events in, state out)

Premise: the fleet platform is, in essence, a stream processor. Events flow in (heartbeats, intent CR creates, agent reconcile reports). State materializes out (Device CRs, DeploymentAggregate counters, KV desired-state writes). Today we model it imperatively as a series of Scores; the dataflow shape is fighting that.

Top-level types:

FleetEvent — sum type. DeviceHeartbeat | DeviceFirstSeen | DeploymentDesired | DeploymentObserved | DeploymentDeleted | …
FleetStateSnapshot — what the operator currently knows. Pure data, derivable.
Reducer — (state, event) → state. Pure function. Tests trivially.
Effect — sum type of side-effects the reducer wants done: WriteKv(bucket, key, value) | UpsertCr(cr) | EmitMetric(...). Reducer returns (new_state, Vec<Effect>).
EffectRunner — adapter that performs effects. The only thing that touches NATS / kube. One implementation per environment.
The operator pod's main loop: for event in stream { (state, effects) = reduce(state, event); runner.run_all(effects) }. ~50 lines.
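The reducer shape can be sketched in a few dozen lines. The type names follow the list above. The state fields, the metric names, the reduced event set and the RecordingRunner test double are assumptions, and a real stream would be asynchronous rather than a Vec.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase { Pending, Running, Failed }

/// Subset of the event sum type, enough to show the shape.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FleetEvent {
    DeviceHeartbeat { device_id: String },
    DeploymentObserved { device_id: String, deployment: String, phase: Phase },
}

/// Pure data; fields are assumptions.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct FleetStateSnapshot {
    pub heartbeats: BTreeMap<String, u64>,
    /// Keyed by (deployment, device_id).
    pub observed: BTreeMap<(String, String), Phase>,
}

/// Side effects the reducer requests. WriteKv is listed for completeness
/// and is not produced by this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Effect {
    WriteKv { bucket: String, key: String, value: String },
    EmitMetric { name: String, value: u64 },
}

/// (state, event) -> (state, effects). No I/O happens here.
pub fn reduce(mut state: FleetStateSnapshot, event: FleetEvent) -> (FleetStateSnapshot, Vec<Effect>) {
    let effects = match event {
        FleetEvent::DeviceHeartbeat { device_id } => {
            let count = state.heartbeats.entry(device_id.clone()).or_insert(0);
            *count += 1;
            let value = *count;
            vec![Effect::EmitMetric { name: format!("{}.heartbeats", device_id), value }]
        }
        FleetEvent::DeploymentObserved { device_id, deployment, phase } => {
            state.observed.insert((deployment.clone(), device_id), phase);
            let mut failed = 0u64;
            for ((dep, _device), p) in &state.observed {
                if *dep == deployment && *p == Phase::Failed {
                    failed += 1;
                }
            }
            vec![Effect::EmitMetric { name: format!("{}.failed", deployment), value: failed }]
        }
    };
    (state, effects)
}

/// The only component allowed to touch NATS or kube.
pub trait EffectRunner {
    fn run(&mut self, effect: Effect);
}

/// Test double that records effects instead of performing them.
#[derive(Default)]
pub struct RecordingRunner {
    pub log: Vec<Effect>,
}

impl EffectRunner for RecordingRunner {
    fn run(&mut self, effect: Effect) {
        self.log.push(effect);
    }
}

/// The main loop from the text, over an in-memory event list.
pub fn drive<R: EffectRunner>(events: Vec<FleetEvent>, runner: &mut R) -> FleetStateSnapshot {
    let mut state = FleetStateSnapshot::default();
    for event in events {
        let (next, effects) = reduce(state, event);
        state = next;
        for effect in effects {
            runner.run(effect);
        }
    }
    state
}
```

Since reduce performs no I/O, it can be tested with plain values; only EffectRunner implementations need an environment.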

P1 ✓ P2 ✓ P3 ✓ P4 ✓ P5 ✓ P6 ✓ P7 ✓ P8 ✓ (capabilities collapse into the EffectRunner trait). P9 ✓ P10 partial.

Pros: dramatically simpler operator code. Reducer is pure → property-test-friendly. The dataflow is the platform. Aligns with how Kafka / Materialize / Flink-class systems are structured. Easy to add a new event type — the compiler shows you every reducer arm to update.

Cons: large rewrite of the operator. Three days is unrealistic. The current fleet_aggregator.rs (833 LOC) already roughly does this but in a less disciplined shape; maybe the incremental version of this is "make apply_state a real reducer and split compute_aggregate into pure pieces". That's more like Alternative B with extra discipline. The full effect-typed version is a nice end-state but not a sprint goal.

Cite: Materialize's dataflow paper; Kent Beck's Augmented Coding on factoring; Gergely Orosz on event-sourcing; the talk's "good Lego bricks" framing applies — events are the bricks.

Alternative D — The fleet as a kube control plane, period (deliberately weird)

Premise: strip the design to one observation. A fleet is a Kubernetes cluster whose Nodes happen to be devices, not servers. Stop modelling Devices and Deployments separately from kube primitives. Use Kubernetes itself as the data model. The operator is one CRD reconciler. NATS is just the transport between the API server (in the cluster) and the device-side kubelet-equivalent.

Top-level types:

Device is a Node CR. Already exists; we stop wrapping it.
Deployment is a DaemonSet (one pod per matching node) or a Deployment (count: N targeted nodes). We stop inventing a CRD; we use the standard one.
DeviceInfo is the Node's .status (capacity, allocatable, conditions). We stop publishing parallel data; we update Node status from the agent's NATS messages.
The agent on the device is a custom kubelet that speaks NATS to the operator instead of HTTPS to the API server.
The auth callout still exists; it gates NATS access.
No harmony-fleet-operator-specific CRDs. No Deployment / Device CRs of our own.

P1 ✓ P2 ✓ P3 ✓ P4 N/A (no CRDs of our own to misplace). P5 ✓ P6 ✓ P7 ✓ P8 ✓ P9 ✓ P10 ✓.

Pros: the simplest conceptual answer. We stop fighting kube and inventing parallel concepts. Customers already understand DaemonSets, Node selectors, and kubectl get nodes. The agent becomes a known kind of thing (a kubelet variant) with shoulders to stand on (k3s-iot, kine, virtual-kubelet projects already prove this works).

Cons: a lot of plumbing changes. Devices need to register as Nodes (which means either a real kubelet on each Pi, or a virtual-kubelet façade). The agent's reconcile loop becomes "watch a CR via NATS, render manifests, run pods" — bigger than "watch a KV value, run podman". JetStream KV becomes redundant with the kube API server. Probably the right end-state for v2.0, wrong for v0.1. Worth noting, though, because comparing A/B/C to D pulls out which of our current invented concepts are load-bearing (very few — DeviceInfo is mostly just Node.status; DeploymentAggregate is mostly just kube's .status.observedGeneration / .status.conditions stuff).

Cite: virtual-kubelet, k3s-iot, KubeEdge, OpenYurt. They've walked this path; the lessons are public.

Alternative E — Algebra of fleets (deliberately weird, mathematical)

Premise: model the platform as a small algebra. A fleet is a set of devices + an assignment function (selector → set of deployments). Operations on fleets are set-theoretic + function composition. Treat the API as a query language over this algebra.

Top-level types:

Fleet ::= Set<Device>. With operations: union, intersection, filter-by-selector, partition.
Selector ::= a pure predicate Device → bool. Built from primitives label("k") = "v", arch = aarch64, …, combined with &, |, !.
Assignment ::= Selector → Set<Deployment>. Pure function.
World ::= (Fleet, Assignment). Pure data. The operator's job is to make reality match the World.
Diff(World, Reality) → Vec<Action>. Pure function. Closed form — given the algebra, you can prove what actions are necessary and sufficient.
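
The algebra above can be sketched in a few dozen lines of standard-library Rust. Everything here is an illustrative assumption rather than existing code: the `Device` fields, representing the fleet as a slice instead of a set, encoding `Assignment` as a list of (selector, deployments) rules, and the `Placement` and `Action` types.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone)]
pub struct Device {
    pub id: String,
    pub arch: String,
    pub labels: BTreeMap<String, String>,
}

/// Selector as data: primitives combined with and / or / not.
#[derive(Debug, Clone)]
pub enum Selector {
    Label(String, String),
    Arch(String),
    And(Box<Selector>, Box<Selector>),
    Or(Box<Selector>, Box<Selector>),
    Not(Box<Selector>),
}

impl Selector {
    /// Pure predicate Device -> bool.
    pub fn matches(&self, d: &Device) -> bool {
        match self {
            Selector::Label(k, v) => d.labels.get(k) == Some(v),
            Selector::Arch(a) => &d.arch == a,
            Selector::And(l, r) => l.matches(d) && r.matches(d),
            Selector::Or(l, r) => l.matches(d) || r.matches(d),
            Selector::Not(s) => !s.matches(d),
        }
    }
}

/// Assignment (hypothetical encoding): each rule maps a selector to deployment names.
pub type Assignment = Vec<(Selector, BTreeSet<String>)>;

/// Set of (device id, deployment name) pairs.
pub type Placement = BTreeSet<(String, String)>;

/// What the World says should run where.
pub fn desired(fleet: &[Device], assignment: &Assignment) -> Placement {
    let mut out = Placement::new();
    for device in fleet {
        for (selector, deployments) in assignment {
            if selector.matches(device) {
                for dep in deployments {
                    out.insert((device.id.clone(), dep.clone()));
                }
            }
        }
    }
    out
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    Start { device: String, deployment: String },
    Stop { device: String, deployment: String },
}

/// Diff as two set differences: start what is missing, stop what is extra.
pub fn diff(desired: &Placement, reality: &Placement) -> Vec<Action> {
    let starts = desired
        .difference(reality)
        .map(|(d, dep)| Action::Start { device: d.clone(), deployment: dep.clone() });
    let stops = reality
        .difference(desired)
        .map(|(d, dep)| Action::Stop { device: d.clone(), deployment: dep.clone() });
    starts.chain(stops).collect()
}
```

In this encoding, "necessary and sufficient" falls out of the set differences: an action is emitted exactly for the pairs on which the two placements disagree. The dirty edges discussed under Cons (half-applied state, partitions) are not represented in this sketch.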

P1–P10 ✓ (in principle). Code volume probably 30% of current.

Pros: clarity. Properties become provable: "no device gets an unassigned deployment", "removing a label removes the assignment", "two operators can edit independently and the merge is well-defined" (because functions compose). The "make impossible states impossible" principle, applied to the fleet shape itself, not to individual types.

Cons: almost certainly an over-fit. The real platform has dirty edges (devices that fail, network partitions, half-applied state) that don't sit naturally in a pure algebra. Most teams that go down this road end up bolting "real-world" escape hatches back on, ending up with the original design plus extra category theory. Useful as a north star for the cardinality choices, not as the platform's actual shape.

Cite: Hillel Wayne Using Formal Methods at Work; Conal Elliott on functional reactive programming; the classic "set theory for systems people" talks.

Comparison matrix

| | A. Move | B. Capabilities | C. Dataflow | D. Kube-native | E. Algebra |
|---|---|---|---|---|---|
| Fixes P1 (location) | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fixes P2 (auth states) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P3 (monolith) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P4 (CRD placement) | ✓ | ✓ | ✓ | N/A | N/A |
| Fixes P5 (anemic enum) | ✗ | ✓ | ✓ | N/A | partial |
| Fixes P6 (adapter leak) | partial | ✓ | ✓ | ✓ | ✓ |
| Fixes P7 (deep wrap) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P8 (trait union) | ✗ | partial | ✓ | ✓ | ✓ |
| Fixes P9 (Id overload) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P10 (config staircase) | ✗ | partial | partial | ✓ | partial |
| Fits 3-day window | ✓ | ✓ (tight) | ✗ | ✗ | ✗ |
| Customer-visible breakage | low | medium | medium | very high | high |
| Risk to demo schedule | very low | low | medium | very high | high |
| Long-term ceiling | low | high | high | very high | very high |

§5 — Recommendation (preliminary)

Read the matrix as: B is the right answer for now, with explicit awareness of D as the v2.0 destination.

A is too little. We'd be back here.
C and E are right in shape but wrong in timing — we don't have a week to rebuild the operator's reconcile loop, and the platform isn't in production yet, so there's no urgent "we have to refactor anyway" pressure.
D is conceptually the cleanest, but a v0.1 production push is the wrong moment to start running custom kubelets.
B captures most of the leverage of C/D within the 3-day window, with a clean migration path to either of them later (the capability traits are the seam — swap the implementation, not the callers).

One concrete shape to pursue under Alternative B (worth sketching as the strawman ADR):

New crate harmony-fleet/ (the domain crate). Depends on harmony-reconciler-contracts only.
- Domain records: FleetDevice, FleetDeployment, FleetState.
- Capability traits: DeviceRegistry, DesiredStatePublisher, ObservedStateConsumer, IdentityProvider, AgentLifecycle.
harmony-fleet-adapters-nats/ — NatsDeviceRegistry, NatsDesiredStatePublisher, etc. NATS-specific.
harmony-fleet-adapters-zitadel/ — ZitadelIdentityProvider.
harmony-fleet-adapters-kube/ — KubeFleetReflector (writes Device and Deployment CRs as a reflection of the domain state, not as the source of truth).
harmony-fleet-operator/ — daemon. Wires adapters together.
harmony-fleet-agent/ — daemon. Wires adapters together.
harmony-fleet-cli/ — tomorrow's harmony-fleet plugin.
harmony/modules/fleet/ is deleted. The framework harmony crate keeps a thin, re-export-only harmony::modules::fleet module that points at harmony-fleet. Once v0.2 has shipped, the re-export module goes away too.

CRDs (Deployment, Device) move to harmony-fleet-adapters-kube/ because they're a kube-specific projection of the domain, not the domain itself. The agent imports harmony-fleet's domain types, not the CRDs.

The setup-side scores stay in harmony (because they need the framework's HelmCommand, K8sclient, etc.) but they consume harmony-fleet's domain types. The fleet's domain doesn't depend on the framework; the framework's deploy procedures depend on the fleet's domain. Direction of dependency is the inverse of today.
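
To make the "capability traits are the seam" idea concrete, here is a small sketch of one capability with an in-memory adapter. Only the names `DeviceRegistry` and `FleetDevice` come from the list above; the method signatures, the `FleetDevice` fields, `RegistryError`, `InMemoryDeviceRegistry`, and `enroll_all` are assumptions, and a real trait would most likely be async.

```rust
use std::collections::BTreeMap;

/// Domain record. Fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct FleetDevice {
    pub id: String,
    pub arch: String,
}

#[derive(Debug, PartialEq)]
pub enum RegistryError {
    AlreadyEnrolled(String),
}

/// Capability trait: lives in the domain crate and names what callers need,
/// not which backend provides it.
pub trait DeviceRegistry {
    fn enroll(&mut self, device: FleetDevice) -> Result<(), RegistryError>;
    fn get(&self, id: &str) -> Option<FleetDevice>;
}

/// Stand-in adapter for tests; a NATS-backed adapter would implement the
/// same trait in its own crate.
#[derive(Default)]
pub struct InMemoryDeviceRegistry {
    devices: BTreeMap<String, FleetDevice>,
}

impl DeviceRegistry for InMemoryDeviceRegistry {
    fn enroll(&mut self, device: FleetDevice) -> Result<(), RegistryError> {
        if self.devices.contains_key(&device.id) {
            return Err(RegistryError::AlreadyEnrolled(device.id));
        }
        self.devices.insert(device.id.clone(), device);
        Ok(())
    }

    fn get(&self, id: &str) -> Option<FleetDevice> {
        self.devices.get(id).cloned()
    }
}

/// Caller written against the capability only. Swapping the adapter does
/// not change this function. Returns the errors of the failed enrollments.
pub fn enroll_all(registry: &mut dyn DeviceRegistry, devices: Vec<FleetDevice>) -> Vec<RegistryError> {
    devices
        .into_iter()
        .filter_map(|d| registry.enroll(d).err())
        .collect()
}
```

Under this shape the daemons would be the only places that name a concrete adapter, which is what lets the dependency arrow point from adapters toward the domain crate.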

§6 — Open questions before we lock this

These are real questions; pulling them out so JG's review has something concrete to react to:

Q1. Is IdentityProvider the right capability name, or is it more honest to name it after what we actually need (DeviceCredentialMinter, OperatorTokenProvider)? The talk argues against generic names — if reality has two distinct concerns, two traits.
Q2. Should Device CRD live in adapters-kube, or should it not exist at all (replaced by reading kube-API node info, per alternative D)? The middle ground (own CRD that mirrors kube Node) is what we have today, and it's the worst of both.
Q3. The agent's wire-format for ReconcileScore — externally tagged enum, today only PodmanV0. Move it to harmony-reconciler-contracts (canonical wire seam) and let both the agent and the operator import only that crate. This removes the harmony::modules::podman cross-crate dependency. Worth doing in any of A/B/C.
Q4. Does the v0.1 prod push wait for this redesign, or does it ship on the current shape with the redesign happening in v0.2? Tradeoff: shipping now means committing to some public API; shipping after means slipping the customer date. Recommendation: ship the redesign first, slip 3 days, on the grounds that public API churn after a customer is on it costs more than a 3-day delay before they're on it.
Q5. Where do the runtime tools (the harmony-fleet CLI plugin, future frontend) sit in the dependency graph? If they depend on harmony-fleet's domain crate only, we can build them without pulling in helm / kube / ansible at compile time. This is what we want for the device-side enrollment binary too (already feature-gated; the redesign should make the gate unnecessary).
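
For Q1, one way to picture the two-trait option is below. The trait names come from the question itself; the method names, the `DeviceCredential` and `OperatorToken` shapes, `FakeIdp`, and `enroll_device` are hypothetical and only serve to show how a caller's signature narrows.

```rust
/// Hypothetical credential shape handed to a device at enrollment.
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceCredential {
    pub device_id: String,
    pub secret: String,
}

/// Hypothetical token used by the operator itself.
#[derive(Debug, Clone, PartialEq)]
pub struct OperatorToken(pub String);

/// Concern 1: mint credentials for devices.
pub trait DeviceCredentialMinter {
    fn mint(&self, device_id: &str) -> DeviceCredential;
}

/// Concern 2: obtain a token for the operator.
pub trait OperatorTokenProvider {
    fn operator_token(&self) -> OperatorToken;
}

/// Enrollment depends on minting only; it has no way to request an
/// operator token through this parameter.
pub fn enroll_device(minter: &dyn DeviceCredentialMinter, device_id: &str) -> DeviceCredential {
    minter.mint(device_id)
}

/// One backend may still implement both traits (a fake stands in here).
pub struct FakeIdp;

impl DeviceCredentialMinter for FakeIdp {
    fn mint(&self, device_id: &str) -> DeviceCredential {
        DeviceCredential {
            device_id: device_id.to_string(),
            secret: format!("secret-for-{device_id}"),
        }
    }
}

impl OperatorTokenProvider for FakeIdp {
    fn operator_token(&self) -> OperatorToken {
        OperatorToken("operator-token".to_string())
    }
}
```

A single `IdentityProvider` trait carrying both methods would compile just as well; the difference is that every caller would then be able to reach both concerns.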

§7 — Next steps

Sit with this document. Walk away from it for an hour.
Round-table on §3 — do P1–P10 capture the problems, or are we missing one?
Round-table on §4 — does the comparison matrix feel honest, or is it tilted?
Pick one alternative as the working hypothesis.
Spike: take one slice through the chosen alternative (suggested: EnrollmentIntent::resolve + DeviceCredential + the IdentityProvider trait — the smallest end-to-end shape that touches every layer). Commit it on a branch. Eyeball: does the resulting code feel better?
Either: commit to the alternative as ADR-023, or back out and try another.

This document gets updated as we go. It is NOT meant to be locked at first draft.


