Working document for the architectural redesign of the fleet
platform before v0.1 ships to production. Captures four sections
of research:
§1 — Current state inventory. Markdown-bullet map of every public
type, score, trait, and module across `harmony/modules/fleet/`,
`harmony-reconciler-contracts`, and `fleet/harmony-fleet-*/`.
Sorted by domain meaning (identity, desired state, observed
state, setup, plumbing) rather than location, so the
cross-cutting concerns become visible. Includes a text "diagram"
of the dependency graph showing the two problematic edges:
runtime crates importing CRD types from the framework crate
(`harmony-fleet-operator` ← `harmony::modules::fleet::operator::crd`
verified at `controller.rs:37`, `device_reconciler.rs:21`,
`main.rs:9`) and the agent importing podman wire types from the
framework crate (`harmony-fleet-agent` ← `harmony::modules::podman`
verified at `main.rs:21-22`, `reconciler.rs:11`).
§2 — Theory review. Pulls principles from JG's *Pour l'amour des
compilateurs* talk (2026-04-30), its references (Crichton,
Feldman, Maguire, Goedecke, Fowler), and harmony's own load-bearing
ADRs (002 hexagonal, 003 infrastructure abstractions, 015 higher-
order topologies, 016 agent + global mesh, 018 template hydration).
Synthesizes eight design principles for the redesign — including
Goedecke's guardrail that "type-driven" ≠ "type-everything" so we
don't over-fit the cardinality argument.
§3 — Ten concrete shape problems (P1–P10), framed as cardinality
mismatches, leaky boundaries, and "is this resolved yet" branches
rather than bugs. P1 is the placement issue JG flagged in code
review; P2 is `FleetDeviceAuth`'s mixed resolved/unresolved
states; P10 is the credential-shape staircase across operator
workstation / operator pod / agent.
§4 — Five design alternatives, each scored against P1–P10:
A. Move + thin façade (conservative cleanup).
B. Resolved-only at boundaries + capability traits (principled
incremental).
C. Dataflow reframe (events in, state out).
D. Fleet as kube control plane, period (deliberately weird).
E. Algebra of fleets (deliberately mathematical).
A is too little, C/D/E are right-shape but wrong-timing for the
3-day window. B is the working recommendation, with explicit
awareness that D is the v2.0 destination and the capability
traits in B are the seam that lets us migrate without breaking
callers.
§5 sketches a concrete shape for B: new `harmony-fleet/` domain
crate with no framework dependency, `harmony-fleet-adapters-*`
crates for NATS/Zitadel/kube, the existing operator/agent/auth
crates wire adapters together, the framework's
`harmony::modules::fleet` collapses to a re-export module that
goes away by v0.2.
§6 — Five open questions for JG's review before locking the
choice. §7 — explicit "spike one slice, then commit or back out"
process so we don't lock the wrong shape.
Not an ADR yet. The ADR happens after JG agrees on which
alternative is the working hypothesis and the spike confirms the
shape feels better in code than on paper.
Fleet platform — architecture review
Working document for the architectural redesign of the fleet platform before v0.1 ships to production. Started 2026-05-07.
This is a research + design document, not a plan to execute. The output of this work is an ADR (or set of ADRs) that lock the new shape; the v0.2 roadmap will reference whichever option we pick.
Why now
- Three days from production. No customers depend on the API yet → API/UX/DX is still cheap to change. After ship, every breaking change costs us a week of customer-coordination overhead.
- The `harmony/modules/fleet/` placement is wrong; it was already flagged in code review. The reasons it ended up there are subtle (cross-module imports of `K8sAnywhereTopology`, `HelmChartScore`, `K8sResourceScore`, `harmony_secret`, `Topology` capability traits). Those need to be written down before the file move, not after.
- The plumbing (NATS + Zitadel + auth callout + operator + agent) is sound. Highly secure, scalable by design, low resource footprint. The redesign is about moving code and better data structures, not rebuilding mechanisms.
- The frame from JG's Pour l'amour des compilateurs talk: cardinality-matched types, "make impossible states impossible", expressive types as the deterministic feedback loop that scales with LLM-era code generation throughput. Apply that frame here.
Working plan
1. Inventory. Map every public type, trait, score, module, and crate that participates in the fleet domain. Markdown-bullet shape; no diagrams.
2. Read the room. Pull principles from JG's talk, its references, and harmony's existing ADRs (002 hexagonal, 003 infrastructure abstractions, 015 higher-order topologies, 016 harmony agent + global mesh, 017 NATS interconnection, 018 template hydration). Note where the existing fleet design already follows them and where it doesn't.
3. Identify the design problems. Not bugs: shape problems. Cardinality mismatches, leaky boundaries, "is this resolved yet" branches, location/dependency loops.
4. Sketch alternatives. Three to five. At least one conventional cleanup, at least one out-of-the-box that reframes the domain. Compare on the same axes (cardinality, placement, ergonomics, extensibility).
5. Pick (or recommend) one. Land as ADR.
This document covers steps 1–4. The pick happens in conversation with JG before the ADR.
§1 — Current state inventory
§1.1 — Where the code lives
The fleet domain spans three concerns that today live in three locations:
- Framework-side scoring (what runs on the operator's workstation when they `cargo run` the install) → lives in `harmony/src/modules/fleet/`. This is the wrong home; it's the thing this review is about moving.
  - `mod.rs`: re-exports
  - `assets.rs`: Ubuntu/Debian cloud image fetchers, libvirt SSH keypair management
  - `libvirt_pool.rs`: libvirt storage pool bring-up
  - `setup_score.rs` (1053 LOC, the monster): `FleetDeviceSetupScore`, `FleetDeviceSetupConfig`, `FleetDeviceAuth` (`TomlShared` | `ZitadelJwt` | `ZitadelEnroll`), `AdminAuth`, `HostsEntry`, `merge_hosts_file`
  - `vm_score.rs`: `ProvisionVmScore` (libvirt VM bring-up)
  - `preflight.rs`: `check_fleet_smoke_preflight*` (host system checks)
  - `server.rs`: `FleetServerScore`, `FleetServerInterpret` (composed bring-up of Zitadel + NATS + callout + operator)
  - `operator/`
    - `mod.rs`, `score.rs`: `FleetOperatorScore`, `FleetOperatorInterpret` (operator helm install)
    - `chart.rs` (453 LOC): chart rendering (`ChartOptions`, `OperatorCredentials`, `build_chart`, `operator_secret`, `build_operator_deployment`, `build_cluster_role`)
    - `crd.rs`: `Deployment` CRD type (`DeploymentSpec`, `Rollout`, `RolloutStrategy`, `DeploymentStatus`, `DeploymentAggregate`, `AggregateLastError`); `Device` CRD type (`DeviceSpec`)
- Cross-boundary wire types (the "contract" agent and operator both have to agree on) → lives in `harmony-reconciler-contracts/`.
  - `fleet.rs`: `DeviceInfo`, `DeploymentState`, `HeartbeatPayload`, `DeploymentName`, `InvalidDeploymentName`
  - `kv.rs`: bucket name constants + key-builder functions
  - `status.rs`: `Phase`, `InventorySnapshot`
  - re-exports `harmony_types::id::Id`
- Runtime binaries (what runs in the cluster + on devices) → lives in `fleet/`.
  - `harmony-fleet-operator/`: the operator pod. `controller.rs`, `device_reconciler.rs`, `fleet_aggregator.rs` (833 LOC), `install.rs`, `main.rs`. Pulls `Deployment`/`Device` CRDs from `harmony::modules::fleet::operator::crd` (cross-crate import that should give us pause).
  - `harmony-fleet-agent/`: the on-device daemon. `config.rs`, `reconciler.rs`, `fleet_publisher.rs`, `main.rs`.
  - `harmony-fleet-auth/`: JWT-bearer / NATS-credentials helpers used by both the operator AND the agent. `config.rs`, `credentials.rs` (553 LOC). Sits between contracts and the runtime crates.
§1.2 — Public types, sorted by domain meaning (not location)
Identity & devices
- `harmony_types::id::Id`: opaque, sortable, collision-safe identifier. Used as device id, deployment id, …
- `DeploymentName` (newtype with validation, `harmony-reconciler-contracts`)
- `DeviceInfo`: heartbeat payload that materializes into a `Device` CR
- `DeviceSpec`: kube CRD, holds an optional `InventorySnapshot`
- `InventorySnapshot`: hardware/OS facts published once at registration
Deployment desired-state
- `DeploymentSpec`: kube CRD with `target_selector: LabelSelector`, `score: ReconcileScore`, `rollout: Rollout`
- `ReconcileScore` (in `harmony::modules::podman`, re-exported from `harmony::modules::fleet::operator::crd`): externally-tagged enum, today only `PodmanV0(PodmanV0Score)`
- `PodmanV0Score`, `PodmanService`, `EnvVar`, `VolumeMount`, `RestartPolicy`
- `Rollout`, `RolloutStrategy::Immediate`
Deployment observed-state
- `DeploymentState`: what the agent publishes per device per deployment after reconcile
- `DeploymentStatus` (kube CRD): operator-side rollup of all device states for one Deployment CR
- `DeploymentAggregate`: counts (matched, succeeded, failed, pending) + `last_error: Option<AggregateLastError>`
- `Phase`: `Pending | Running | Failed`
Authentication / identity provider
- `FleetDeviceAuth`: sum type with `TomlShared | ZitadelJwt | ZitadelEnroll`. The `ZitadelEnroll` arm carries unresolved state: admin credentials that must be turned into a device JSON key at execute time. Mixes resolved and unresolved states in one type, which is the cardinality bug we keep hitting.
- `AdminAuth`: `Sso { client_id } | Token(String)` (used inside `ZitadelEnroll`)
- `CredentialsSection`: TOML-on-disk shape (in `harmony-fleet-auth`, parallel to `FleetDeviceAuth`)
- `CredentialSource`: runtime credential factory
- `NatsCredential`: what async-nats actually consumes
- `MachineKeyFile`, `CachedToken`
Setup procedures (Scores)
- `FleetDeviceSetupScore(FleetDeviceSetupConfig)`: the workhorse. Installs podman, drops the agent binary, drops the credentials TOML, drops the keyfile, brings up the systemd unit.
- `FleetServerScore`: orchestrates Zitadel install + identity setup + NATS install + callout install + operator install. Wraps five other scores.
- `FleetOperatorScore`: operator helm chart render + install + the credentials Secret apply.
- `ProvisionVmScore`: libvirt VM bring-up. Used by VM rehearsals.
- (External, not in fleet/) `ZitadelScore`, `ZitadelSetupScore`, `NatsK8sScore`, `NatsAuthCalloutScore`: all consumed by the composed install.
Operator-internal types
- `FleetState`, `SharedFleetState`, `DeploymentKey`, `DevicePair`, `CachedDeployment`, `Context`, `Error` (the controller's local error type), `selector_matches`, `apply_state`, `drop_state`, `compute_aggregate`
Agent-internal types
- `AgentConfig`, `AgentSection`, `NatsSection`, `CredentialsSection`
- `FleetPublisher`, `Reconciler`
Fleet plumbing for development
- `FleetSshKeypair`, the cloud-image consts, `HarmonyFleetPool`, `merge_hosts_file`, `HostsEntry`, `check_fleet_smoke_preflight*`
NATS subjects + KV buckets (the wire seam)
- `BUCKET_DESIRED_STATE` = `"desired-state"`
- `BUCKET_DEVICE_INFO` = `"device-info"`
- `BUCKET_DEVICE_STATE` = `"device-state"`
- `BUCKET_DEVICE_HEARTBEAT` = `"device-heartbeat"`
- Key builders: `desired_state_key(device_id, deployment_name)`, `device_info_key(device_id)`, `device_state_key(device_id, deployment_name)`, `device_heartbeat_key(device_id)`
§1.3 — Concept clusters
When you squint at the inventory, the domain falls into five clusters:
1. Identity: who is this device, who is this deployment, who is the operator, what auth do they have.
2. Desired state: what should be running where.
3. Observed state: what is actually running where.
4. Setup: bringing all this into existence on a fresh cluster + fresh device.
5. Plumbing: the NATS/kube/Zitadel mechanisms that make 1–4 work.
The current code does not cleanly separate these. Examples:
- `setup_score.rs` mixes Setup (drop binary, run systemd) with Identity (`FleetDeviceAuth`). 1053 LOC.
- `FleetDeviceAuth` mixes resolved-Identity (`ZitadelJwt`: here's a key) with Setup-time-Identity-resolution-intent (`ZitadelEnroll`: here's how to mint a key).
- The chart-render helpers (`build_operator_deployment`, etc.) are `pub` from `harmony::modules::fleet::operator::chart` so the composed-install scores can pluck the secret out before helm install. Plumbing leaking through Setup.
- `harmony::modules::fleet::operator::crd::DeploymentSpec` is the CRD definition AND it's the type the operator daemon imports to reconcile. Cross-crate import from a runtime crate (`harmony-fleet-operator`) into a framework crate (`harmony`). This is the placement bug.
§1.4 — The shape problem in one diagram (text)
framework/operator workstation
│
harmony::modules::fleet ──┤ Scores: FleetServerScore, FleetDeviceSetupScore,
│ FleetOperatorScore, ProvisionVmScore
│ CRD types: Deployment, Device, DeploymentSpec, ...
│ Chart rendering helpers (operator/chart.rs)
│
harmony-reconciler-contracts ── wire types: DeviceInfo, DeploymentState,
│ HeartbeatPayload, KV constants
│ ▲ ▲
│ │ │
│ │ imports imports│
│ │ │
fleet/harmony-fleet-agent fleet/harmony-fleet-operator
▲ ▲
│ │
│ ALSO imports ALSO imports│
│ from harmony::modules:: from harmony::modules::
│ podman (PodmanV0Score) fleet::operator::crd
Two problematic edges:
- `harmony-fleet-operator` imports `harmony::modules::fleet::operator::crd::Deployment`. The runtime daemon depends on the framework crate just for CRD type definitions.
- `harmony-fleet-agent` imports `harmony::modules::podman::{PodmanV0Score, PodmanTopology, ReconcileScore}`. The agent depends on the framework crate's podman module for the score it deserializes off the wire.
Both edges should run through harmony-reconciler-contracts, not around it. That's the placement bug surfaced.
§2 — Theory review
§2.1 — From the talk
Pulling the load-bearing principles, ranked by relevance to this redesign:
- Cardinality matters. Types should match the cardinality of the real-world concept. `&str` for "primary color" admits infinite invalid inputs; `enum { Red, Yellow, Blue }` admits exactly three. Friction is proportional to mismatch.
- Make impossible states impossible. Don't comment the constraint, code it. Push runtime errors to the design phase.
- Representations matter. Same data, different shapes ↔ different operations are cheap. Roman numerals ↔ addition; Arabic ↔ multiplication. "An API is a computational representation of real-world concepts."
- The compiler is a deterministic feedback channel. In an era when LLMs generate code at 5–10K LOC/day, the only sensor that keeps up runs in milliseconds and is deterministic. Lean on it.
- Strong types reduce code volume + test boilerplate + token waste + review burden + CI time + production incidents — and increase refactoring confidence and velocity-over-time. The bet is asymmetric.
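A minimal sketch of the cardinality point, using the talk's primary-color example. The names (`PrimaryColor`, `NotAPrimaryColor`, `hex`) and the hex values are ours for illustration, not taken from the talk.

```rust
use std::str::FromStr;

/// Exactly three inhabitants, matching the real-world concept.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimaryColor {
    Red,
    Yellow,
    Blue,
}

/// Returned at the boundary when raw text is not a primary color.
#[derive(Debug, PartialEq, Eq)]
pub struct NotAPrimaryColor(pub String);

impl FromStr for PrimaryColor {
    type Err = NotAPrimaryColor;

    // Parsing happens once, at the edge. Everything downstream takes the enum.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "red" => Ok(PrimaryColor::Red),
            "yellow" => Ok(PrimaryColor::Yellow),
            "blue" => Ok(PrimaryColor::Blue),
            other => Err(NotAPrimaryColor(other.to_string())),
        }
    }
}

/// No catch-all arm: adding a fourth variant stops this from compiling
/// until the new case is handled.
pub fn hex(color: PrimaryColor) -> &'static str {
    match color {
        PrimaryColor::Red => "#ff0000",
        PrimaryColor::Yellow => "#ffff00",
        PrimaryColor::Blue => "#0000ff",
    }
}
```

The `&str` version of `hex` would need a runtime error path for every caller; here the only fallible step is the parse at the boundary.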
§2.2 — From the references
Grouping by what they imply for this redesign:
Will Crichton — Type-Driven API Design + Rust API Type Patterns
- Typestate. Encode "phase of an operation" in the type parameter. A `ProgressBar<Bounded>` exposes `.with_eta()`; a `ProgressBar<Unbounded>` doesn't. The contradictory call doesn't compile.
- Direct application: `FleetDeviceAuth` mixes phases. The `ZitadelEnroll` arm is unresolved, the `ZitadelJwt` arm is resolved, the `TomlShared` arm doesn't even need resolution. A typestate would model these as distinct types; only one of them has `agent.write_to_disk()`.
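The typestate idea can be sketched with the progress-bar example. This is an illustration under our own assumptions, not Crichton's code: the fields, `with_total`, `advance`, and `remaining` are hypothetical, and `remaining` stands in for the bounded-only `.with_eta()` capability.

```rust
/// State marker: no total known yet.
pub struct Unbounded;

/// State marker that also carries the data only this phase has.
pub struct Bounded {
    total: u64,
}

/// Progress bar parameterized by its phase.
pub struct ProgressBar<B> {
    done: u64,
    bound: B,
}

impl ProgressBar<Unbounded> {
    pub fn new() -> Self {
        ProgressBar { done: 0, bound: Unbounded }
    }

    /// Consumes the unbounded bar and returns a bounded one.
    pub fn with_total(self, total: u64) -> ProgressBar<Bounded> {
        ProgressBar { done: self.done, bound: Bounded { total } }
    }
}

impl<B> ProgressBar<B> {
    /// Available in every phase.
    pub fn advance(&mut self, n: u64) {
        self.done += n;
    }
}

impl ProgressBar<Bounded> {
    /// Only exists when a total is known. Calling it on
    /// `ProgressBar<Unbounded>` is a compile error, not a runtime check.
    pub fn remaining(&self) -> u64 {
        self.bound.total.saturating_sub(self.done)
    }
}
```

The same mechanics would give the resolved auth phase the only `write_to_disk`-style method, so the unresolved arm cannot reach it.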
Richard Feldman — Making Impossible States Impossible
- Slogan-as-tool. Look at every `Option<T>` and ask "can two of these be inconsistent at once?" If yes, that's an impossible state: refactor.
- Direct application: `FleetDeviceSetupConfig` has `auth: FleetDeviceAuth` AND `agent_binary_path: PathBuf`. Today nothing prevents `auth = TomlShared` (no Zitadel) with `agent_binary_path` pointing at the wrong-arch binary. We could encode the agent binary's target arch as a typestate parameter and refuse to deploy to a device with a known-different arch inventory.
Sandy Maguire — Protos Are Wrong
- Protocol buffers throw away information real type systems preserve. Sum types, exhaustiveness, parametric polymorphism, Maybe/Result — protos can't express any of them precisely. The "loose contract" sells you weak invariants.
- Direct application: `harmony-reconciler-contracts` is JSON-shaped at the wire (matched on `type` tag for `ReconcileScore`). We're already paying the proto-class tax: any new variant requires both ends to know about it; the wire format doesn't enforce a schema; old agents see new variants as parse errors. This is an honest constraint (wire formats need to be permissive by design), but it argues for keeping the wire types small and obviously evolvable while letting in-memory types be cardinality-matched.
Sean Goedecke — Invalid States
- The skeptic's case: making impossible states impossible can be over-applied. Sometimes a `String` is the right cardinality even when an enum exists, because the enum binds you to a closed world.
- Direct application: Don't make `device_id` a closed enum. The newtype + RFC1123 validation we just added is the right cardinality match: it's string-like, but only valid strings. Over-modeling would have us build `enum DeviceId { Pi(PiSerial), Vm(VmName), … }`: closed world, breaks the first time a customer plugs in an x86 box.
- Useful guardrail: type-driven ≠ type-everything. The question to ask each time is "what's the cardinality of this concept in reality", not "can I model this".
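A sketch of the "string-like, but only valid strings" shape. `DeviceId` and `InvalidDeviceId` are hypothetical names, and the rule encoded here is the RFC 1123 DNS label rule as Kubernetes applies it to names (1 to 63 characters, lowercase alphanumerics or `-`, alphanumeric at both ends); the validation actually shipped in the contracts crate may differ.

```rust
/// HYPOTHETICAL newtype: open-world (any valid label), but never invalid.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DeviceId(String);

#[derive(Debug, PartialEq, Eq)]
pub enum InvalidDeviceId {
    Empty,
    TooLong(usize),
    BadCharacter(char),
    BadEdge,
}

impl DeviceId {
    /// The only constructor, so every `DeviceId` in the program is valid.
    pub fn parse(raw: &str) -> Result<Self, InvalidDeviceId> {
        if raw.is_empty() {
            return Err(InvalidDeviceId::Empty);
        }
        if raw.len() > 63 {
            return Err(InvalidDeviceId::TooLong(raw.len()));
        }
        // Report the first character outside [a-z0-9-].
        let bad = raw
            .chars()
            .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'));
        if let Some(c) = bad {
            return Err(InvalidDeviceId::BadCharacter(c));
        }
        if raw.starts_with('-') || raw.ends_with('-') {
            return Err(InvalidDeviceId::BadEdge);
        }
        Ok(DeviceId(raw.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

An x86 box named `nuc-07` parses as easily as a Pi serial, which is the open-world property the closed enum would lose.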
Martin Fowler — Harness Engineering (April 2026)
- Computational sensors (compilers, type checkers, linters) over inferential ones (tests, code review). Compiler runs on every change; tests don't.
- Direct application: prefer compiler-checked invariants over doc-comment invariants. If the docs say "this Score's `auth` field must be resolved at the call site of `execute()`", the compiler should enforce it.
§2.3 — From harmony's own ADRs
Reading the existing ADRs as design language already in use — what vocabulary should the new fleet shape stay consistent with?
ADR-002 (hexagonal architecture)
- "Domain isolated from adapters." Domain types own the vocabulary; adapters (k8s client, NATS, helm) translate at the edge.
- Implication for fleet: the domain is identity + desired state + observed state. The adapters are NATS-KV, kube-CRD, helm-chart, ansible-over-SSH. The current `harmony::modules::fleet` mixes both. Pulling adapters out is the refactor.
ADR-003 (infrastructure abstractions)
- "Abstractions at domain level, not provider level. `DnsServer` not `OPNsenseDns`."
- Implication for fleet: capability traits like `DeviceRegistry`, `DesiredStatePublisher`, `ObservedStateConsumer`. Each is a standard infrastructure need that NATS-KV happens to fulfill today, and that another transport (gRPC streaming, MQTT, Redis streams) could fulfill tomorrow.
ADR-015 (higher-order topologies)
- Higher-order topologies (`FailoverTopology<T>`, `DecentralizedTopology<T>`) compose via blanket trait impls. `T: PostgreSQL` ⇒ `FailoverTopology<T>: PostgreSQL`. Zero boilerplate.
- Implication for fleet: `FleetTopology<T>` could compose with a base `K8sTopology<T>` rather than being a parallel concept. "A fleet is a thing that is both a kube cluster and a device registry."
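The blanket-impl composition can be sketched as follows. Only the trait and wrapper names come from the ADR; the `connection_string` method, the wrapper's fields, and `SingleNode` are assumptions made so the example compiles on its own.

```rust
/// Capability trait. The method is illustrative, not harmony's real API.
pub trait PostgreSQL {
    fn connection_string(&self) -> String;
}

/// Higher-order topology wrapping two instances of any base topology.
/// HYPOTHETICAL fields.
pub struct FailoverTopology<T> {
    pub primary: T,
    pub replica: T,
    pub primary_healthy: bool,
}

/// Blanket impl: if the base topology offers PostgreSQL, the failover
/// wrapper offers it too, with no per-topology boilerplate.
impl<T: PostgreSQL> PostgreSQL for FailoverTopology<T> {
    fn connection_string(&self) -> String {
        if self.primary_healthy {
            self.primary.connection_string()
        } else {
            self.replica.connection_string()
        }
    }
}

/// HYPOTHETICAL base topology, used only to exercise the sketch.
pub struct SingleNode {
    pub host: String,
}

impl PostgreSQL for SingleNode {
    fn connection_string(&self) -> String {
        format!("postgres://{}:5432", self.host)
    }
}

/// Code written against the capability accepts either shape.
pub fn describe<T: PostgreSQL>(topology: &T) -> String {
    topology.connection_string()
}
```

A `FleetTopology<T>` would follow the same pattern: forward the kube capabilities of `T` and add the device-registry ones.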
ADR-016 (Harmony Agent + Global Mesh)
- Agents are processes that observe + reconcile per a desired state published into a NATS mesh. Mesh is the reliable hop; agents are stateless processors at the edge.
- Implication for fleet: the IoT fleet is a specialization of the agent + mesh ADR — devices are agents, the operator is a coordinator. The fleet domain types should fit ADR-016's vocabulary, not invent a parallel one.
ADR-017 (NATS clusters interconnection)
- Trust topology: per-cluster account isolation, gateway-mediated cross-cluster traffic. Per-device permissions are a specialization of per-account.
- Implication for fleet: the auth callout's per-device permission templates should compose with the cluster-interconnection account model — currently they're treated as orthogonal, which is fine until we actually cross fleets.
ADR-018 (template hydration)
- Hydrating templates at the edge of the framework, not in the middle. Same pattern as our generated chart YAML: render once, apply via typed code.
- Implication for fleet: chart-rendering helpers (`build_operator_deployment` et al.) are template-hydration edges. They should be hidden from domain code. Today they're `pub`, visible to consumers like `fleet_staging_install` who reach in and grab `operator_secret(opts)`. That's adapter leakage.
§2.4 — Synthesis: principles for the redesign
A short list, ordered. Each line is something the new shape should satisfy:
1. Domain types in `harmony-reconciler-contracts` (or a sibling crate), with no dependency on `harmony` framework types.
2. Resolved types only at the API surface. Pre-resolution intent is a separate type, used only by the resolver.
3. Capabilities as traits, not concrete types. `DeviceRegistry`, `DesiredStatePublisher`, etc. The NATS-backed impl is one of several allowed.
4. Closed cardinality where reality is closed; open where reality is open. Goedecke's check, not Feldman's.
5. Higher-order topology, not parallel topology. A fleet is a `FleetTopology<T>` over a base K8s topology, not a separate capability hierarchy.
6. Adapters hidden behind capabilities. Helm chart rendering, k8s resource apply, NATS subjects: none of these surface from the fleet's public API.
7. No yaml in framework code paths. Existing principle from v0_1; keep.
8. Keep wire types minimal + permissive. Not because they're the canonical model, but because they're the evolvability seam (Maguire's protos critique applies in reverse: embrace the loose contract on the wire, reject it in-memory).
§3 — Design problems with the current shape
Concrete issues the redesign needs to fix. Not "bugs" — shape problems. Each numbered so we can refer back when comparing alternatives.
- P1. `harmony/modules/fleet/` is in the wrong crate. It pulls framework dependencies (`HelmChartScore`, `K8sResourceScore`, `K8sAnywhereTopology`, `harmony_secret`, etc.) and the runtime daemons import from it. This makes the operator/agent depend transitively on every harmony module, including the OPNsense XML codegen, OKD bootstrap stuff, etc. Compile times suffer; the release surface is wrong (you can't `cargo install harmony-fleet-operator` without all of harmony).
- P2. `FleetDeviceAuth` mixes resolved + unresolved states. `ZitadelEnroll` is pre-resolution intent; `ZitadelJwt` is post-resolution credential. A single match arm has to handle both. The "render TOML for both" hack we wrote works but is a symptom: the TOML for an unresolved auth should be undefined, not "same as resolved".
- P3. `setup_score.rs` is a 1053 LOC monolith. Eight responsibilities in one file: ssh-vs-local connection, ansible orchestration, systemd unit text, hosts-file merging, podman package install, fleet-agent user provisioning, keyfile writing, agent restart. Readability is poor; testability is per-orchestration, not per-step.
- P4. CRD types live in the framework crate. `Deployment` and `Device` CRDs are defined in `harmony::modules::fleet::operator::crd`. The runtime operator crate (`harmony-fleet-operator`) imports them from there. This is the most visible symptom of P1.
- P5. `ReconcileScore` polymorphism is anemic. Today there's exactly one variant, `PodmanV0`. The wire format is set up for evolution but no second variant exists, and the cross-crate import from `harmony::modules::podman` makes adding one expensive (re-export dance).
- P6. Adapter leakage from chart rendering. `build_operator_deployment`, `operator_secret`, `build_chart` are `pub`. Consumers in `examples/` reach in to compose helm releases by hand. Domain code should not see "what does the operator's helm chart look like".
- P7. Composed scores wrap composed scores wrap composed scores. `FleetServerScore` wraps {`ZitadelScore`, `ZitadelSetupScore`, `NatsK8sScore`, `NatsAuthCalloutScore`, `FleetOperatorScore`}. Each of those does its own k8s resource apply + helm install. Failure modes are deep: a problem in one score's interpret surfaces wrapped through five layers of `context()`. Hard to debug; hard to reason about ordering.
- P8. Topology assumptions are everywhere. Every `Score` bound is a hand-rolled union of capability traits: `T: Topology + HelmCommand + K8sclient + TlsRouter + 'static`. Add a new capability and every callsite has to be updated. Higher-order topology composition (ADR-015) would let us name "a thing that is a fleet-capable cluster" once.
- P9. `Id` is overloaded. Same type for device IDs, machine user IDs, deployment IDs, topology names. Newtype-ing each would catch arg-order swaps at compile time.
- P10. Configuration is a staircase. Operator workstation has a `ZitadelClientConfig` cache file. Operator pod has env-var-from-Secret. Agent has TOML on disk. Three different shapes for fundamentally the same data (issuer URL, audience, key material). Maguire's protos critique applies internally: we're using several loose-contract serializations of the same domain object.
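For P8, one way to "name the bound once" is a supertrait with a blanket impl. This is a sketch under assumptions: the four capability traits are reduced to empty stand-ins, and `FleetCapable`, `install_operator`, and `DemoCluster` are hypothetical names.

```rust
// Empty stand-ins for the capability traits named in P8. The real traits
// carry methods; markers are enough to show how the bound collapses.
pub trait Topology {}
pub trait HelmCommand {}
pub trait K8sclient {}
pub trait TlsRouter {}

/// HYPOTHETICAL name for "a thing that is a fleet-capable cluster".
pub trait FleetCapable: Topology + HelmCommand + K8sclient + TlsRouter + 'static {}

/// Blanket impl: anything with all four capabilities is FleetCapable,
/// so no topology has to opt in by hand.
impl<T> FleetCapable for T where T: Topology + HelmCommand + K8sclient + TlsRouter + 'static {}

/// Call sites name one bound. Adding a fifth capability changes the two
/// lines above, not every function signature.
pub fn install_operator<T: FleetCapable>(_topology: &T) -> &'static str {
    "installed"
}

/// Demo topology used only to exercise the sketch.
pub struct DemoCluster;
impl Topology for DemoCluster {}
impl HelmCommand for DemoCluster {}
impl K8sclient for DemoCluster {}
impl TlsRouter for DemoCluster {}
```

This does not replace ADR-015 style wrappers; it only shortens the bounds, which is why Alternative B scores P8 as partial.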
§4 — Design alternatives
Five sketches. The first three are increasingly principled cleanups; the last two are deliberately weird, included to force us to recognize where the core of the domain actually is.
For each: one paragraph of premise, the resulting top-level types, how it answers each of P1–P10 (✓ / ✗ / partial), and the honest pros + cons.
Alternative A — Move + thin façade (the conservative cleanup)
Premise: the existing types are mostly right; the location is
wrong and the façade leaks. Move harmony/modules/fleet/ to
fleet/harmony-fleet/. Re-export only what's intended public.
Don't redesign types.
Top-level types: unchanged. FleetDeviceSetupScore,
FleetServerScore, FleetOperatorScore, FleetDeviceAuth,
AdminAuth, Deployment CRD, Device CRD. Same shapes, new
location.
P1 ✓ (location fix is the goal). P2 ✗ (auth still mixes
resolved/unresolved). P3 ✗ (monolith preserved). P4 ✓
(CRDs co-located with operator). P5 ✗. P6 partial (we
can pub(crate) the chart helpers but the underlying coupling
remains). P7 ✗. P8 ✗. P9 ✗. P10 ✗.
Pros: small, safe, mechanical. Two days of work. No customer-visible breakage. Unblocks P4 cleanup naturally.
Cons: doesn't actually fix the shape. We'd be back here in six weeks. JG's review already said this isn't enough. Not the right answer for v0.1 timing — would be the right answer if we'd already shipped to two customers and couldn't break their code.
Alternative B — Resolved-only at boundaries + capability traits (the principled cleanup)
Premise: Crichton's typestate + ADR-003's domain capabilities applied to the existing shape. Split resolved vs. unresolved auth into separate types. Define capability traits for the adapters. Move into the right crate. No wholesale rewrite.
Top-level types:
- New crate `harmony-fleet/` (sibling to `harmony-fleet-operator`, `-agent`, `-auth`). Domain types live here.
  - `FleetIdentity`, `FleetDevice`, `FleetDeployment`: domain records. Plain data.
  - `DeviceCredential`: resolved only (a JSON keyfile + issuer URL + audience). Replaces `FleetDeviceAuth::ZitadelJwt`.
  - `EnrollmentIntent`: pre-resolution. Carries `AdminAuth` and what to mint. Method `resolve(&self) -> Result<DeviceCredential>`.
- Scores become small + single-responsibility:
  - `EnrollDeviceScore`: runs `EnrollmentIntent::resolve` then publishes to NATS.
  - `InstallAgentScore`: drops binary + config + systemd unit. Takes a `DeviceCredential`. Doesn't know about Zitadel.
  - `InstallOperatorScore`: helm chart + Secret. Doesn't know about devices.
  - `BringUpFleetScore`: composes the above. Single layer of composition, not five.
- Capability traits:
  - `DeviceRegistry`: list/get/upsert/delete a `FleetDevice`. Implementations: `NatsKvDeviceRegistry`, (later) `RedisStreamsDeviceRegistry`.
  - `DesiredStatePublisher`, `ObservedStateConsumer`: same shape.
  - `IdentityProvider`: mint a device credential, issue an admin token. Today: Zitadel. Tomorrow: something else.
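A rough sketch of how the resolved/unresolved split and one capability trait could fit together. Everything beyond the type names listed above is an assumption: the field names, the `String` error type, passing the provider into `resolve` (the list shows `resolve(&self)`), `render_agent_config`, and the `StaticIdp` test double.

```rust
/// Stand-in for the existing `AdminAuth` sum type.
#[derive(Debug, Clone)]
pub enum AdminAuth {
    Sso { client_id: String },
    Token(String),
}

/// Resolved credential. Field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceCredential {
    pub issuer_url: String,
    pub audience: String,
    pub key_json: String,
}

/// Capability trait: the only place that knows how keys get minted.
pub trait IdentityProvider {
    fn mint_device_key(&self, admin: &AdminAuth, device_id: &str) -> Result<String, String>;
}

/// Pre-resolution intent. It can be resolved, but never installed.
pub struct EnrollmentIntent {
    pub admin: AdminAuth,
    pub device_id: String,
    pub issuer_url: String,
    pub audience: String,
}

impl EnrollmentIntent {
    /// ASSUMPTION: the provider is injected here rather than stored.
    pub fn resolve(&self, idp: &dyn IdentityProvider) -> Result<DeviceCredential, String> {
        let key_json = idp.mint_device_key(&self.admin, &self.device_id)?;
        Ok(DeviceCredential {
            issuer_url: self.issuer_url.clone(),
            audience: self.audience.clone(),
            key_json,
        })
    }
}

/// Installer-side rendering accepts only the resolved type, so "TOML for
/// an unresolved auth" is not expressible.
pub fn render_agent_config(cred: &DeviceCredential) -> String {
    format!(
        "issuer_url = \"{}\"\naudience = \"{}\"\n",
        cred.issuer_url, cred.audience
    )
}

/// Test double standing in for a real provider such as Zitadel.
pub struct StaticIdp;

impl IdentityProvider for StaticIdp {
    fn mint_device_key(&self, _admin: &AdminAuth, device_id: &str) -> Result<String, String> {
        Ok(format!("key-for-{device_id}"))
    }
}
```

With this split, `InstallAgentScore` would take a `DeviceCredential` and never see `AdminAuth`, which is the P2 fix in type form.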
P1 ✓ P2 ✓ P3 ✓ (split into 4–5 small Scores). P4 ✓ P5 ✓
(resolve in the runtime crate, contracts stay neutral).
P6 ✓ (chart helpers `pub(crate)`, surfaced via `IdentityProvider` / `DeploymentReleaseManager` traits). P7 ✓ (one composer, not five). P8 partial (capability traits defined but bound unions still get long). P9 ✓ with newtypes. P10 partial (still three on-disk shapes for credentials, but unified by trait).
Pros: highest-leverage incremental redesign. Buys us most of the principles without rebuilding plumbing. Customer-visible breakage is contained to public API renames + import path moves — no behavior change. Three days is realistic.
Cons: we still have a Score-shaped mental model where the
unit of execution is "a Score". If the right primitive turns
out to be smaller (an effect, an event, a capability call), this
choice wastes some leverage.
Alternative C — The dataflow reframe (events in, state out)
Premise: the fleet platform is, in essence, a stream
processor. Events flow in (heartbeats, intent CR creates,
agent reconcile reports). State materializes out (Device CRs,
DeploymentAggregate counters, KV desired-state writes). Today
we model it imperatively as a series of Scores; the dataflow
shape is fighting that.
Top-level types:
- `FleetEvent`: sum type. `DeviceHeartbeat | DeviceFirstSeen | DeploymentDesired | DeploymentObserved | DeploymentDeleted | …`
- `FleetStateSnapshot`: what the operator currently knows. Pure data, derivable.
- `Reducer`: `(state, event) → state`. Pure function. Tests trivially.
- `Effect`: sum type of side-effects the reducer wants done: `WriteKv(bucket, key, value) | UpsertCr(cr) | EmitMetric(...)`. Reducer returns `(new_state, Vec<Effect>)`.
- `EffectRunner`: adapter that performs effects. The only thing that touches NATS / kube. One implementation per environment.
- The operator pod's main loop: `for event in stream { (state, effects) = reduce(state, event); runner.run_all(effects) }`. ~50 lines.
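A small sketch of the reducer shape, cut down to two events and two effects. The payload fields, the `"registered"` value, the `deployments_failed` metric name, and `RecordingRunner` are assumptions for illustration; only the type names and the `device-info` bucket come from this document.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FleetEvent {
    DeviceHeartbeat { device_id: String },
    DeploymentObserved { device_id: String, deployment: String, ok: bool },
}

#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct FleetStateSnapshot {
    pub devices_seen: BTreeSet<String>,
    /// (device id, deployment) -> last observed success flag.
    pub observed: BTreeMap<(String, String), bool>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Effect {
    WriteKv { bucket: String, key: String, value: String },
    EmitMetric { name: String, value: u64 },
}

/// Pure: no I/O, only a new state plus the effects it wants performed.
pub fn reduce(mut state: FleetStateSnapshot, event: FleetEvent) -> (FleetStateSnapshot, Vec<Effect>) {
    match event {
        FleetEvent::DeviceHeartbeat { device_id } => {
            // `insert` returns true only the first time a device is seen.
            let first_seen = state.devices_seen.insert(device_id.clone());
            let effects = if first_seen {
                vec![Effect::WriteKv {
                    bucket: "device-info".to_string(),
                    key: device_id,
                    value: "registered".to_string(),
                }]
            } else {
                Vec::new()
            };
            (state, effects)
        }
        FleetEvent::DeploymentObserved { device_id, deployment, ok } => {
            state.observed.insert((device_id, deployment), ok);
            let failed = state.observed.values().filter(|ok| !**ok).count() as u64;
            let metric = Effect::EmitMetric { name: "deployments_failed".to_string(), value: failed };
            (state, vec![metric])
        }
    }
}

/// The only piece that would touch NATS or kube in a real build.
pub trait EffectRunner {
    fn run(&mut self, effect: Effect);
}

/// Test double that records effects instead of performing them.
#[derive(Default)]
pub struct RecordingRunner {
    pub effects: Vec<Effect>,
}

impl EffectRunner for RecordingRunner {
    fn run(&mut self, effect: Effect) {
        self.effects.push(effect);
    }
}

/// The main loop from the list above, over any event source.
pub fn run_loop(
    events: impl IntoIterator<Item = FleetEvent>,
    runner: &mut impl EffectRunner,
) -> FleetStateSnapshot {
    let mut state = FleetStateSnapshot::default();
    for event in events {
        let (next, effects) = reduce(state, event);
        state = next;
        for effect in effects {
            runner.run(effect);
        }
    }
    state
}
```

Because `reduce` is pure, the same event sequence can be replayed in a unit test with `RecordingRunner` and compared against expected effects.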
P1 ✓ P2 ✓ P3 ✓ P4 ✓ P5 ✓ P6 ✓ P7 ✓ P8 ✓ (capabilities
collapse into the EffectRunner trait). P9 ✓ P10 partial.
Pros: dramatically simpler operator code. Reducer is pure → property-test-friendly. The dataflow is the platform. Aligns with how Kafka / Materialize / Flink-class systems are structured. Easy to add a new event type — the compiler shows you every reducer arm to update.
Cons: large rewrite of the operator. Three days is
unrealistic. The current fleet_aggregator.rs (833 LOC) already
roughly does this but in a less disciplined shape — maybe the
incremental version of this is "make apply_state a real
reducer and split compute_aggregate into pure pieces". That's
more like Alternative B with extra discipline. The full
effect-typed version is a nice end-state but not a sprint goal.
Cite: Materialize's dataflow paper; Kent Beck's Augmented Coding on factoring; Gergely Orosz on event-sourcing; the talk's "good Lego bricks" framing applies — events are the bricks.
Alternative D — The fleet as a kube control plane, period (deliberately weird)
Premise: strip the design to one observation. A fleet is a Kubernetes cluster whose Nodes happen to be devices, not servers. Stop modelling Devices and Deployments separately from kube primitives. Use Kubernetes itself as the data model. The operator is one CRD reconciler. NATS is just the transport between the API server (in the cluster) and the device-side kubelet-equivalent.
Top-level types:
- `Device` is a Node CR. Already exists; we stop wrapping it.
- `Deployment` is a `DaemonSet` (one pod per matching node) or a `Deployment` (count: N targeted nodes). We stop inventing a CRD; we use the standard one.
- `DeviceInfo` is the Node's `.status` (capacity, allocatable, conditions). We stop publishing parallel data; we update Node status from the agent's NATS messages.
- The agent on the device is a custom kubelet that speaks NATS to the operator instead of HTTPS to the API server.
- The auth callout still exists; it gates NATS access.
- No `harmony-fleet-operator`-specific CRDs. No `Deployment`/`Device` CRs of our own.
P1 ✓ P2 ✓ P3 ✓ P4 N/A (no CRDs of our own to misplace). P5 ✓ P6 ✓ P7 ✓ P8 ✓ P9 ✓ P10 ✓.
Pros: the simplest conceptual answer. We stop fighting kube + inventing parallel concepts. Customers already understand DaemonSets, Node selectors, and `kubectl get nodes`. The agent becomes a known kind of thing (a kubelet variant) with shoulders to stand on (k3s-iot, kine, virtual-kubelet projects already prove this works).
Cons: a lot of plumbing changes. Devices need to register as Nodes (which means either a real kubelet on each Pi, or a virtual-kubelet façade). The agent's reconcile loop becomes "watch a CR via NATS, render manifests, run pods" — bigger than "watch a KV value, run podman". JetStream KV becomes redundant with the kube API server. Probably the right end-state for v2.0, wrong for v0.1. Worth noting, though, because comparing A/B/C to D pulls out which of our current invented concepts are load-bearing (very few — DeviceInfo is mostly just Node.status; DeploymentAggregate is mostly just kube's .status.observedGeneration / .status.conditions stuff).
Cite: virtual-kubelet, k3s-iot, KubeEdge, OpenYurt. They've walked this path; the lessons are public.
Alternative E — Algebra of fleets (deliberately weird, mathematical)
Premise: model the platform as a small algebra. A fleet is a set of devices + an assignment function (selector → set of deployments). Operations on fleets are set-theoretic + function composition. Treat the API as a query language over this algebra.
Top-level types:
- `Fleet` ::= `Set<Device>`. With operations: union, intersection, filter-by-selector, partition.
- `Selector` ::= a pure predicate `Device → bool`. Built from primitives `label("k") = "v"`, `arch = aarch64`, …, combined with `&`, `|`, `!`.
- `Assignment` ::= `Selector → Set<Deployment>`. Pure function.
- `World` ::= `(Fleet, Assignment)`. Pure data. The operator's job is to make reality match the `World`.
- `Diff(World, Reality) → Vec<Action>`. Pure function. Closed form: given the algebra, you can prove what actions are necessary and sufficient.
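To make the algebra concrete, here is a minimal Rust sketch of `Selector`, `Assignment`, and `Diff`. Everything in it is an assumption for illustration: the `Device` fields, deployments represented as plain strings, `Reality` modelled as a set of `(device, deployment)` pairs, and the `Start` / `Stop` action names. The `arch = aarch64` primitive is left out to keep it short.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical device record: a name plus free-form labels.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Device {
    name: String,
    labels: BTreeMap<String, String>,
}

/// Selector kept as data (not a closure) so it can be printed and compared.
#[derive(Debug, Clone)]
enum Selector {
    Label(String, String),
    And(Box<Selector>, Box<Selector>),
    Or(Box<Selector>, Box<Selector>),
    Not(Box<Selector>),
}

impl Selector {
    /// Pure predicate: Device -> bool.
    fn matches(&self, d: &Device) -> bool {
        match self {
            Selector::Label(k, v) => d.labels.get(k) == Some(v),
            Selector::And(a, b) => a.matches(d) && b.matches(d),
            Selector::Or(a, b) => a.matches(d) || b.matches(d),
            Selector::Not(a) => !a.matches(d),
        }
    }
}

type Fleet = BTreeSet<Device>;
/// Assignment as a finite table: each selector maps to a set of deployment names.
type Assignment = Vec<(Selector, BTreeSet<String>)>;
/// Assumed shape for both desired and observed state: (device, deployment) pairs.
type Placement = BTreeSet<(String, String)>;

#[derive(Debug, PartialEq, Eq)]
enum Action {
    Start { device: String, deployment: String },
    Stop { device: String, deployment: String },
}

/// What the World says should be running where.
fn desired(fleet: &Fleet, assignment: &Assignment) -> Placement {
    let mut out = Placement::new();
    for (selector, deployments) in assignment {
        for device in fleet.iter().filter(|d| selector.matches(d)) {
            for deployment in deployments {
                out.insert((device.name.clone(), deployment.clone()));
            }
        }
    }
    out
}

/// Diff(World, Reality): two set differences, starts first, then stops.
fn diff(fleet: &Fleet, assignment: &Assignment, reality: &Placement) -> Vec<Action> {
    let want = desired(fleet, assignment);
    let mut actions = Vec::new();
    for (device, deployment) in want.difference(reality) {
        actions.push(Action::Start { device: device.clone(), deployment: deployment.clone() });
    }
    for (device, deployment) in reality.difference(&want) {
        actions.push(Action::Stop { device: device.clone(), deployment: deployment.clone() });
    }
    actions
}
```

In this sketch the "necessary and sufficient" claim reduces to two set differences, and a converged fleet yields an empty action list. Nothing here models partial failure, which is exactly the gap the Cons paragraph points at.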
P1–P10 ✓ (in principle). Code volume probably 30% of current.
Pros: clarity. Properties become provable: "no device gets an unassigned deployment", "removing a label removes the assignment", "two operators can edit independently and the merge is well-defined" (because functions compose). The "make impossible states impossible" principle, applied to the fleet shape itself, not to individual types.
Cons: almost certainly an over-fit. The real platform has dirty edges (devices that fail, network partitions, half-applied state) that don't sit naturally in a pure algebra. Most teams that go down this road end up bolting "real-world" escape hatches back on, ending up with the original design plus extra category theory. Useful as a north star for the cardinality choices, not as the platform's actual shape.
Cite: Hillel Wayne, *Using Formal Methods at Work*; Conal Elliott on functional reactive programming; the classic "set theory for systems people" talks.
Comparison matrix
| | A. Move | B. Capabilities | C. Dataflow | D. Kube-native | E. Algebra |
|---|---|---|---|---|---|
| Fixes P1 (location) | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fixes P2 (auth states) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P3 (monolith) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P4 (CRD placement) | ✓ | ✓ | ✓ | N/A | N/A |
| Fixes P5 (anemic enum) | ✗ | ✓ | ✓ | N/A | partial |
| Fixes P6 (adapter leak) | partial | ✓ | ✓ | ✓ | ✓ |
| Fixes P7 (deep wrap) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P8 (trait union) | ✗ | partial | ✓ | ✓ | ✓ |
| Fixes P9 (Id overload) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P10 (config staircase) | ✗ | partial | partial | ✓ | partial |
| Fits 3-day window | ✓ | ✓ (tight) | ✗ | ✗ | ✗ |
| Customer-visible breakage | low | medium | medium | very high | high |
| Risk to demo schedule | very low | low | medium | very high | high |
| Long-term ceiling | low | high | high | very high | very high |
§5 — Recommendation (preliminary)
Read the matrix as: B is the right answer for now, with explicit awareness of D as the v2.0 destination.
- A is too little. We'd be back here.
- C and E are right in shape but wrong in timing — we don't have a week to rebuild the operator's reconcile loop, and the platform isn't in production yet, so there's no urgent "we have to refactor anyway" pressure.
- D is conceptually the cleanest, but a v0.1 production push is the wrong moment to start running custom kubelets.
- B captures most of the leverage of C/D within the 3-day window, with a clean migration path to either of them later (the capability traits are the seam — swap the implementation, not the callers).
One concrete shape to pursue under Alternative B (worth sketching as the strawman ADR):
- New crate `harmony-fleet/` (the domain crate). Depends on `harmony-reconciler-contracts` only.
  - Domain records: `FleetDevice`, `FleetDeployment`, `FleetState`.
  - Capability traits: `DeviceRegistry`, `DesiredStatePublisher`, `ObservedStateConsumer`, `IdentityProvider`, `AgentLifecycle`.
- `harmony-fleet-adapters-nats/`: `NatsDeviceRegistry`, `NatsDesiredStatePublisher`, etc. NATS-specific.
- `harmony-fleet-adapters-zitadel/`: `ZitadelIdentityProvider`.
- `harmony-fleet-adapters-kube/`: `KubeFleetReflector` (writes `Device` and `Deployment` CRs as a reflection of the domain state, not as the source of truth).
- `harmony-fleet-operator/`: daemon. Wires adapters together.
- `harmony-fleet-agent/`: daemon. Wires adapters together.
- `harmony-fleet-cli/`: tomorrow's `harmony-fleet` plugin.
- `harmony/modules/fleet/` is deleted. The framework `harmony` crate gets a thin `harmony::modules::fleet` re-export only module that points at `harmony-fleet`. After v0.2 is shipped, the re-export module goes away too.
CRDs (`Deployment`, `Device`) move to `harmony-fleet-adapters-kube/` because they're a kube-specific projection of the domain, not the domain itself. The agent imports `harmony-fleet`'s domain types, not the CRDs.

The setup-side scores stay in `harmony` (because they need the framework's `HelmCommand`, `K8sclient`, etc.) but they consume `harmony-fleet`'s domain types. The fleet's domain doesn't depend on the framework; the framework's deploy procedures depend on the fleet's domain. Direction of dependency is the inverse of today.
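The dependency direction under Alternative B can be sketched in a single file, with modules standing in for crates. This is a strawman under stated assumptions: the trait methods, the `RegistryError` type, the in-memory adapter, and the `enroll` function are all hypothetical, and the real traits would most likely be async. Only the names `FleetDevice` and `DeviceRegistry` come from the list above.

```rust
/// Stand-in for the proposed `harmony-fleet` domain crate: records and traits only.
mod harmony_fleet {
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct FleetDevice {
        pub id: String,
    }

    /// Hypothetical error type for registry operations.
    #[derive(Debug)]
    pub struct RegistryError(pub String);

    /// Capability trait. The method set is an assumption, not the final API.
    pub trait DeviceRegistry {
        fn register(&mut self, device: FleetDevice) -> Result<(), RegistryError>;
        fn list(&self) -> Vec<FleetDevice>;
    }
}

/// Stand-in for an adapter crate (the NATS adapter would sit here in practice).
/// It depends on the domain module, never the other way round.
mod adapters_in_memory {
    use super::harmony_fleet::{DeviceRegistry, FleetDevice, RegistryError};

    #[derive(Default)]
    pub struct InMemoryDeviceRegistry {
        devices: Vec<FleetDevice>,
    }

    impl DeviceRegistry for InMemoryDeviceRegistry {
        fn register(&mut self, device: FleetDevice) -> Result<(), RegistryError> {
            // Reject duplicates so callers see a domain-level error.
            if self.devices.iter().any(|d| d.id == device.id) {
                return Err(RegistryError(format!("device {} already registered", device.id)));
            }
            self.devices.push(device);
            Ok(())
        }

        fn list(&self) -> Vec<FleetDevice> {
            self.devices.clone()
        }
    }
}

/// Stand-in for a daemon crate. It names only the trait, so swapping the
/// adapter does not touch this code.
mod operator {
    use super::harmony_fleet::{DeviceRegistry, FleetDevice, RegistryError};

    /// Registers a device and returns how many devices the registry now holds.
    pub fn enroll<R: DeviceRegistry>(registry: &mut R, id: &str) -> Result<usize, RegistryError> {
        registry.register(FleetDevice { id: id.to_string() })?;
        Ok(registry.list().len())
    }
}
```

The point of the sketch is the import lines: `operator` and `adapters_in_memory` both import from `harmony_fleet`, and `harmony_fleet` imports from neither. That is the seam the recommendation relies on when it says "swap the implementation, not the callers".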
§6 — Open questions before we lock this
These are real questions; pulling them out so JG's review has something concrete to react to:
- Q1. Is `IdentityProvider` the right capability name, or is it more honest to name it after what we actually need (`DeviceCredentialMinter`, `OperatorTokenProvider`)? The talk argues against generic names: if reality has two distinct concerns, two traits.
- Q2. Should `Device` CRD live in `adapters-kube`, or should it not exist at all (replaced by reading kube-API node info, per alternative D)? The middle ground (own CRD that mirrors kube Node) is what we have today, and it's the worst of both.
- Q3. The agent's wire-format for `ReconcileScore`: externally tagged enum, today only `PodmanV0`. Move it to `harmony-reconciler-contracts` (canonical wire seam) and let both the agent and the operator import only that crate. This removes the `harmony::modules::podman` cross-crate dependency. Worth doing in any of A/B/C.
- Q4. Does the v0.1 prod push wait for this redesign, or does it ship on the current shape with the redesign happening in v0.2? Tradeoff: shipping now means committing to some public API; shipping after means slipping the customer date. Recommendation: ship the redesign first, slip 3 days, on the grounds that public API churn after a customer is on it costs more than a 3-day delay before they're on it.
- Q5. Where do the runtime tools (the `harmony-fleet` CLI plugin, future frontend) sit in the dependency graph? If they depend on `harmony-fleet`'s domain crate only, we can build them without pulling in helm / kube / ansible at compile time. This is what we want for the device-side enrollment binary too (already feature-gated; the redesign should make the gate unnecessary).
§7 — Next steps
- Sit with this document. Walk away from it for an hour.
- Round-table on §3 — do P1–P10 capture the problems, or are we missing one?
- Round-table on §4 — does the comparison matrix feel honest, or is it tilted?
- Pick one alternative as the working hypothesis.
- Spike: take one slice through the chosen alternative (suggested: `EnrollmentIntent::resolve` + `DeviceCredential` + the `IdentityProvider` trait, the smallest end-to-end shape that touches every layer). Commit it on a branch. Eyeball: does the resulting code feel better?
- Either: commit to the alternative as ADR-023, or back out and try another.
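As a starting point for the suggested spike, the slice might look roughly like the sketch below. Only the three names `EnrollmentIntent`, `DeviceCredential`, and `IdentityProvider` come from this document; the fields, the `mint_token` method, the `ResolveError` variants, and `StaticProvider` are assumptions made up for illustration.

```rust
/// Hypothetical unresolved request: what the operator asks for before
/// anything has been minted.
#[derive(Debug, Clone)]
struct EnrollmentIntent {
    device_id: String,
}

/// Hypothetical resolved credential. In a real crate the fields would be
/// private so that `resolve` is the only way to construct one.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeviceCredential {
    device_id: String,
    token: String,
}

#[derive(Debug, PartialEq, Eq)]
enum ResolveError {
    EmptyDeviceId,
    Provider(String),
}

/// Capability trait. The single method is an assumption about its shape.
trait IdentityProvider {
    fn mint_token(&self, device_id: &str) -> Result<String, String>;
}

impl EnrollmentIntent {
    /// Consumes the intent and yields the resolved type, so no value can be
    /// "half resolved" after this call.
    fn resolve<P: IdentityProvider>(self, provider: &P) -> Result<DeviceCredential, ResolveError> {
        if self.device_id.trim().is_empty() {
            return Err(ResolveError::EmptyDeviceId);
        }
        let token = provider
            .mint_token(&self.device_id)
            .map_err(ResolveError::Provider)?;
        Ok(DeviceCredential { device_id: self.device_id, token })
    }
}

/// Downstream code accepts only the resolved type, so it needs no
/// "is this resolved yet" branch.
fn agent_config_line(cred: &DeviceCredential) -> String {
    format!("device={} token={}", cred.device_id, cred.token)
}

/// Test double standing in for the Zitadel adapter.
struct StaticProvider;

impl IdentityProvider for StaticProvider {
    fn mint_token(&self, device_id: &str) -> Result<String, String> {
        Ok(format!("token-for-{}", device_id))
    }
}
```

If the spike reads about like this, the eyeball test is whether callers such as `agent_config_line` lost their resolved/unresolved branching compared with today's `FleetDeviceAuth`.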
This document gets updated as we go. It is NOT meant to be locked at first draft.