harmony/ROADMAP/fleet_platform/architecture_review.md

# Fleet platform — architecture review

Working document for the architectural redesign of the fleet platform
before v0.1 ships to production. Started 2026-05-07.

This is a research + design document, not a plan to execute. The
output of this work is an ADR (or set of ADRs) that locks in the new
shape; the v0.2 roadmap will reference whichever option we pick.

## Why now

- Three days from production. No customers depend on the API yet
  → API/UX/DX is still cheap to change. After ship, every breaking
  change costs us a week of customer-coordination overhead.
- The `harmony/modules/fleet/` placement is wrong — already flagged
  in code review. The reasons it ended up there are subtle (cross-
  module imports of `K8sAnywhereTopology`, `HelmChartScore`,
  `K8sResourceScore`, `harmony_secret`, `Topology` capability
  traits). Those need to be written down before the file move,
  not after.
- The plumbing — NATS + Zitadel + auth callout + operator + agent
  — is sound. Highly secure, scalable by design, low resource
  footprint. The redesign is about **moving code** and **better
  data structures**, not rebuilding mechanisms.
- The frame from JG's *Pour l'amour des compilateurs* talk:
  cardinality-matched types, "make impossible states impossible",
  expressive types as the deterministic feedback loop that scales
  with LLM-era code generation throughput. Apply that frame here.

## Working plan

1. **Inventory.** Map every public type, trait, score, module, and
   crate that participates in the fleet domain. Markdown-bullet
   shape; no diagrams.
2. **Read the room.** Pull principles from JG's talk, its
   references, and harmony's existing ADRs (002 hexagonal, 003
   infrastructure abstractions, 015 higher-order topologies, 016
   harmony agent + global mesh, 017 NATS interconnection, 018
   template hydration). Note where the existing fleet design
   already follows them and where it doesn't.
3. **Identify the design problems.** Not bugs — *shape* problems.
   Cardinality mismatches, leaky boundaries, "is this resolved
   yet" branches, location/dependency loops.
4. **Sketch alternatives.** Three to five. At least one
   conventional cleanup, at least one out-of-the-box that
   reframes the domain. Compare on the same axes (cardinality,
   placement, ergonomics, extensibility).
5. **Pick (or recommend) one.** Land as ADR.

This document covers steps 1–4. The pick happens in conversation
with JG before the ADR.

---

## §1 — Current state inventory

### §1.1 — Where the code lives

The fleet domain spans **three concerns** that today live in
**three locations**:

- **Framework-side scoring** (what runs on the operator's
  workstation when they `cargo run` the install) → lives in
  `harmony/src/modules/fleet/`. This is the wrong home; it's the
  thing this review is about moving.
  - `mod.rs` — re-exports
  - `assets.rs` — Ubuntu/Debian cloud image fetchers, libvirt SSH
    keypair management
  - `libvirt_pool.rs` — libvirt storage pool bring-up
  - `setup_score.rs` (1053 LOC, the monster) — `FleetDeviceSetupScore`,
    `FleetDeviceSetupConfig`, `FleetDeviceAuth`
    (TomlShared|ZitadelJwt|ZitadelEnroll), `AdminAuth`, `HostsEntry`,
    `merge_hosts_file`
  - `vm_score.rs` — `ProvisionVmScore` (libvirt VM bring-up)
  - `preflight.rs` — `check_fleet_smoke_preflight*` (host system
    checks)
  - `server.rs` — `FleetServerScore`, `FleetServerInterpret`
    (composed bring-up of Zitadel + NATS + callout + operator)
  - `operator/`
    - `mod.rs`, `score.rs` — `FleetOperatorScore`,
      `FleetOperatorInterpret` (operator helm install)
    - `chart.rs` (453 LOC) — chart rendering (`ChartOptions`,
      `OperatorCredentials`, `build_chart`, `operator_secret`,
      `build_operator_deployment`, `build_cluster_role`)
    - `crd.rs` — `Deployment` CRD type (`DeploymentSpec`,
      `Rollout`, `RolloutStrategy`, `DeploymentStatus`,
      `DeploymentAggregate`, `AggregateLastError`); `Device` CRD type
      (`DeviceSpec`)
- **Cross-boundary wire types** (the "contract" agent and operator
  both have to agree on) → lives in `harmony-reconciler-contracts/`.
  - `fleet.rs` — `DeviceInfo`, `DeploymentState`, `HeartbeatPayload`,
    `DeploymentName`, `InvalidDeploymentName`
  - `kv.rs` — bucket name constants + key-builder functions
  - `status.rs` — `Phase`, `InventorySnapshot`
  - re-exports `harmony_types::id::Id`
- **Runtime binaries** (what runs in the cluster + on devices) →
  lives in `fleet/`.
  - `harmony-fleet-operator/` — the operator pod. `controller.rs`,
    `device_reconciler.rs`, `fleet_aggregator.rs` (833 LOC),
    `install.rs`, `main.rs`. Pulls `Deployment`/`Device` CRDs from
    `harmony::modules::fleet::operator::crd` (cross-crate import
    that should give us pause).
  - `harmony-fleet-agent/` — the on-device daemon. `config.rs`,
    `reconciler.rs`, `fleet_publisher.rs`, `main.rs`.
  - `harmony-fleet-auth/` — JWT-bearer / NATS-credentials helpers
    used by both the operator AND the agent. `config.rs`,
    `credentials.rs` (553 LOC). Sits between contracts and the
    runtime crates.

### §1.2 — Public types, sorted by domain meaning (not location)

#### Identity & devices

- `harmony_types::id::Id` — opaque, sortable, collision-safe
  identifier. Used as device id, deployment id, …
- `DeploymentName` (newtype with validation, `harmony-reconciler-contracts`)
- `DeviceInfo` — heartbeat payload that materializes into a
  `Device` CR
- `DeviceSpec` — kube CRD, holds an optional `InventorySnapshot`
- `InventorySnapshot` — hardware/OS facts published once at
  registration

#### Deployment desired-state

- `DeploymentSpec` — kube CRD: `target_selector: LabelSelector`,
  `score: ReconcileScore`, `rollout: Rollout`
- `ReconcileScore` (in `harmony::modules::podman`, re-exported
  from `harmony::modules::fleet::operator::crd`) — externally-tagged
  enum, today only `PodmanV0(PodmanV0Score)`
- `PodmanV0Score`, `PodmanService`, `EnvVar`, `VolumeMount`,
  `RestartPolicy`
- `Rollout`, `RolloutStrategy::Immediate`

#### Deployment observed-state

- `DeploymentState` — what the agent publishes per device per
  deployment after reconcile
- `DeploymentStatus` (kube CRD) — operator-side rollup of all
  device states for one Deployment CR
- `DeploymentAggregate` — counts (matched, succeeded, failed,
  pending) + `last_error: Option<AggregateLastError>`
- `Phase` — `Pending | Running | Failed`

#### Authentication / identity provider

- `FleetDeviceAuth` — sum type with `TomlShared | ZitadelJwt |
  ZitadelEnroll`. **The `ZitadelEnroll` arm carries
  unresolved-state — admin credentials that must be turned into a
  device JSON key at execute time. Mixes resolved and unresolved
  states in one type, which is the cardinality bug we keep hitting.**
- `AdminAuth` — `Sso { client_id } | Token(String)` (used inside
  `ZitadelEnroll`)
- `CredentialsSection` — TOML-on-disk shape (in
  `harmony-fleet-auth`, parallel to `FleetDeviceAuth`)
- `CredentialSource` — runtime credential factory
- `NatsCredential` — what async-nats actually consumes
- `MachineKeyFile`, `CachedToken`

#### Setup procedures (Scores)

- `FleetDeviceSetupScore` (`FleetDeviceSetupConfig`) — the workhorse:
  installs podman, drops the agent binary, drops the credentials
  TOML, drops the keyfile, brings up the systemd unit.
- `FleetServerScore` — orchestrates Zitadel install + identity
  setup + NATS install + callout install + operator install. Wraps
  five other scores.
- `FleetOperatorScore` — operator helm chart render + install + the
  credentials Secret apply.
- `ProvisionVmScore` — libvirt VM bring-up. Used by VM rehearsals.
- (External, not in fleet/) `ZitadelScore`, `ZitadelSetupScore`,
  `NatsK8sScore`, `NatsAuthCalloutScore` — all consumed by the
  composed install.

#### Operator-internal types

- `FleetState`, `SharedFleetState`, `DeploymentKey`, `DevicePair`,
  `CachedDeployment`, `Context`, `Error` (the controller's local
  error type), `selector_matches`, `apply_state`, `drop_state`,
  `compute_aggregate`

#### Agent-internal types

- `AgentConfig`, `AgentSection`, `NatsSection`, `CredentialsSection`
- `FleetPublisher`, `Reconciler`

#### Fleet plumbing for development

- `FleetSshKeypair`, the cloud-image consts, `HarmonyFleetPool`,
  `merge_hosts_file`, `HostsEntry`, `check_fleet_smoke_preflight*`

#### NATS subjects + KV buckets (the wire seam)

- `BUCKET_DESIRED_STATE` = `"desired-state"`
- `BUCKET_DEVICE_INFO` = `"device-info"`
- `BUCKET_DEVICE_STATE` = `"device-state"`
- `BUCKET_DEVICE_HEARTBEAT` = `"device-heartbeat"`
- Key builders: `desired_state_key(device_id, deployment_name)`,
  `device_info_key(device_id)`, `device_state_key(device_id,
  deployment_name)`, `device_heartbeat_key(device_id)`

### §1.3 — Concept clusters

When you squint at the inventory, the domain falls into **five
clusters**:

1. **Identity** — who is this device, who is this deployment, who
   is the operator, what auth do they have.
2. **Desired state** — what should be running where.
3. **Observed state** — what is actually running where.
4. **Setup** — bringing all this into existence on a fresh
   cluster + fresh device.
5. **Plumbing** — the NATS/kube/Zitadel mechanisms that make 1–4
   work.

The current code does not cleanly separate these. Examples:

- `setup_score.rs` mixes **Setup** (drop binary, run systemd) with
  **Identity** (`FleetDeviceAuth`). 1053 LOC.
- `FleetDeviceAuth` mixes resolved-Identity (`ZitadelJwt` —
  here's a key) with Setup-time-Identity-resolution-intent
  (`ZitadelEnroll` — here's how to mint a key).
- The chart-render helpers (`build_operator_deployment`, etc.) are
  `pub` from `harmony::modules::fleet::operator::chart` so the
  composed-install scores can pluck the secret out before helm
  install. Plumbing leaking through Setup.
- `harmony::modules::fleet::operator::crd::DeploymentSpec` is the
  CRD definition AND it's the type the operator daemon imports to
  reconcile. Cross-crate import from a runtime crate
  (`harmony-fleet-operator`) into a framework crate (`harmony`).
  This is the placement bug.

### §1.4 — The shape problem in one diagram (text)

```
                         framework/operator workstation
                              │
   harmony::modules::fleet  ──┤  Scores: FleetServerScore, FleetDeviceSetupScore,
                              │          FleetOperatorScore, ProvisionVmScore
                              │  CRD types: Deployment, Device, DeploymentSpec, ...
                              │  Chart rendering helpers (operator/chart.rs)
                              │
   harmony-reconciler-contracts ── wire types: DeviceInfo, DeploymentState,
                              │                HeartbeatPayload, KV constants
                              │  ▲                                              ▲
                              │  │                                              │
                              │  │  imports                              imports│
                              │  │                                              │
                       fleet/harmony-fleet-agent          fleet/harmony-fleet-operator
                              ▲                                          ▲
                              │                                          │
                              │  ALSO imports                ALSO imports│
                              │  from harmony::modules::      from harmony::modules::
                              │  podman (PodmanV0Score)       fleet::operator::crd
```

Two problematic edges:

1. `harmony-fleet-operator` imports `harmony::modules::fleet::operator::crd::Deployment`. The runtime daemon depends on the framework crate just for CRD type definitions.
2. `harmony-fleet-agent` imports `harmony::modules::podman::{PodmanV0Score, PodmanTopology, ReconcileScore}`. The agent depends on the framework crate's *podman module* for the score it deserializes off the wire.

Both edges should run *through* `harmony-reconciler-contracts`, not around it. That's the placement bug surfaced.

---

## §2 — Theory review

### §2.1 — From the talk

Pulling the load-bearing principles, ranked by relevance to this
redesign:

1. **Cardinality matters.** Types should match the cardinality of
   the real-world concept. `&str` for "primary color" admits
   infinite invalid inputs; `enum { Red, Yellow, Blue }` admits
   exactly three. Friction is proportional to mismatch.
2. **Make impossible states impossible.** Don't comment the
   constraint, code it. Push runtime errors to the design phase.
3. **Representations matter.** The same data in different shapes
   makes different operations cheap: Roman numerals suit addition,
   Arabic numerals suit multiplication. "An API is a computational
   representation of real-world concepts."
4. **The compiler is a deterministic feedback channel.** In an era
   when LLMs generate code at 5–10K LOC/day, the only sensor that
   keeps up runs in milliseconds and is deterministic. Lean on it.
5. **Strong types reduce code volume + test boilerplate + token
   waste + review burden + CI time + production incidents** — and
   *increase* refactoring confidence and velocity-over-time. The
   bet is asymmetric.
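
Principle 1 in code, as a minimal sketch. The `PrimaryColor` type and the `hex` helper are illustrative only and do not exist in harmony. Parsing is the one place where the unbounded `&str` space gets narrowed; everything downstream works with exactly three values.

```rust
use std::str::FromStr;

/// Cardinality-matched: exactly three inhabitants, same as the concept.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimaryColor {
    Red,
    Yellow,
    Blue,
}

/// The single boundary where arbitrary text is accepted or rejected.
impl FromStr for PrimaryColor {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "red" => Ok(Self::Red),
            "yellow" => Ok(Self::Yellow),
            "blue" => Ok(Self::Blue),
            other => Err(format!("not a primary color: {other}")),
        }
    }
}

/// Exhaustive match: adding a fourth variant stops this from compiling
/// until the new case is handled.
pub fn hex(color: PrimaryColor) -> &'static str {
    match color {
        PrimaryColor::Red => "#ff0000",
        PrimaryColor::Yellow => "#ffff00",
        PrimaryColor::Blue => "#0000ff",
    }
}
```

A function taking `PrimaryColor` needs no "is this a valid color" branch and no test for the invalid case, because the invalid case cannot be constructed.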

### §2.2 — From the references

Grouping by what they imply for *this* redesign:

#### Will Crichton — *Type-Driven API Design* + *Rust API Type Patterns*

- **Typestate.** Encode "phase of an operation" in the type
  parameter. A `ProgressBar<Bounded>` exposes `.with_eta()`; a
  `ProgressBar<Unbounded>` doesn't. The contradictory call doesn't
  compile.
- Direct application: **`FleetDeviceAuth` mixes phases.** The
  `ZitadelEnroll` arm is unresolved, the `ZitadelJwt` arm is
  resolved, the `TomlShared` arm doesn't even need resolution. A
  typestate would model these as distinct types; only one of them
  has `agent.write_to_disk()`.
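
The typestate mechanics, sketched on the progress-bar example. This is not Crichton's actual code: `with_total` and `remaining` are stand-in names chosen for this sketch (`remaining` plays the role of `.with_eta()` above).

```rust
/// Phase markers. `Bounded` carries the data that only exists in that phase.
pub struct Unbounded;
pub struct Bounded {
    total: u64,
}

/// The phase is part of the type, so phase-specific methods are
/// selected at compile time.
pub struct ProgressBar<Bound> {
    done: u64,
    bound: Bound,
}

impl ProgressBar<Unbounded> {
    pub fn new() -> Self {
        Self { done: 0, bound: Unbounded }
    }

    /// Consumes the unbounded bar and returns a value of a different type.
    pub fn with_total(self, total: u64) -> ProgressBar<Bounded> {
        ProgressBar { done: self.done, bound: Bounded { total } }
    }
}

impl<Bound> ProgressBar<Bound> {
    /// Available in every phase.
    pub fn advance(&mut self, n: u64) {
        self.done += n;
    }
}

impl ProgressBar<Bounded> {
    /// Only defined once a total is known; calling it on
    /// `ProgressBar<Unbounded>` is a compile error, not a runtime check.
    pub fn remaining(&self) -> u64 {
        self.bound.total.saturating_sub(self.done)
    }
}
```

There is no `Option<u64>` for the total and no "is it set yet" branch: the unresolved phase simply has no field to read.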

#### Richard Feldman — *Making Impossible States Impossible*

- Slogan-as-tool. Look at every `Option<T>` and ask *"can two of
  these be inconsistent at once?"* If yes, that's an impossible
  state — refactor.
- Direct application: `FleetDeviceSetupConfig` has `auth:
  FleetDeviceAuth` AND `agent_binary_path: PathBuf`. Today nothing
  prevents `auth = TomlShared` (no Zitadel) with
  `agent_binary_path` pointing at the wrong-arch binary. We could
  encode the agent binary's target arch as a typestate parameter
  and refuse to deploy to a device with a known-different arch
  inventory.

#### Sandy Maguire — *Protos Are Wrong*

- Protocol buffers throw away information real type systems
  preserve. Sum types, exhaustiveness, parametric polymorphism,
  Maybe/Result — protos can't express any of them precisely. The
  "loose contract" sells you weak invariants.
- Direct application: `harmony-reconciler-contracts` is JSON-shaped
  at the wire (matched on `type` tag for `ReconcileScore`).
  We're already paying the proto-class tax: any new variant
  requires both ends to know about it; the wire format doesn't
  enforce a schema; old agents see new variants as parse errors.
  This is an honest constraint — wire formats need to be permissive
  by design — but it argues for keeping the **wire types small and
  obviously evolvable** while letting in-memory types be
  cardinality-matched.

#### Sean Goedecke — *Invalid States*

- The skeptic's case: making impossible states impossible *can be
  over-applied*. Sometimes a `String` is the right cardinality
  even when an enum exists, because the enum binds you to a
  closed world.
- Direct application: **Don't make `device_id` a closed enum.**
  The newtype + RFC1123 validation we just added is the right
  cardinality match: it's a string-like, but only valid strings.
  Over-modeling would have us build `enum DeviceId {
  Pi(PiSerial), Vm(VmName), …}` — closed world, breaks first time
  a customer plugs in an x86 box.
- Useful guardrail: **type-driven** ≠ **type-everything**. The
  question to ask each time is "what's the cardinality of this
  concept in reality" — not "can I model this".
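
A sketch of what "string-like, but only valid strings" can look like. Assumptions: the names `DeviceId` and `InvalidDeviceId` are hypothetical, and the rules used are the RFC 1123 *label* rules as Kubernetes applies them to object names (1 to 63 characters, lowercase alphanumerics or `-`, alphanumeric at both ends). The validation actually added in harmony may differ.

```rust
/// Open set of values (any customer hardware), but only well-formed ones.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DeviceId(String);

#[derive(Debug, PartialEq, Eq)]
pub enum InvalidDeviceId {
    Empty,
    TooLong(usize),
    BadChar(char),
    BadEdge,
}

impl DeviceId {
    /// ASSUMPTION: RFC 1123 label rules; see the paragraph above.
    pub fn parse(s: &str) -> Result<Self, InvalidDeviceId> {
        if s.is_empty() {
            return Err(InvalidDeviceId::Empty);
        }
        // Length is measured in bytes; valid ids are ASCII, so bytes == chars.
        if s.len() > 63 {
            return Err(InvalidDeviceId::TooLong(s.len()));
        }
        // Report the first character outside [a-z0-9-].
        if let Some(bad) = s
            .chars()
            .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'))
        {
            return Err(InvalidDeviceId::BadChar(bad));
        }
        if s.starts_with('-') || s.ends_with('-') {
            return Err(InvalidDeviceId::BadEdge);
        }
        Ok(Self(s.to_owned()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

The inner `String` is private, so `parse` is the only way to obtain a `DeviceId`. An x86 box with a new naming scheme needs no code change as long as its id is a valid label.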

#### Martin Fowler — *Harness Engineering* (April 2026)

- Computational sensors (compilers, type checkers, linters) over
  inferential ones (tests, code review). Compiler runs on every
  change; tests don't.
- Direct application: prefer compiler-checked invariants over
  doc-comment invariants. If the docs say "this Score's `auth`
  field must be resolved at the call site of `execute()`", the
  compiler should enforce it.

### §2.3 — From harmony's own ADRs

Reading the existing ADRs *as design language already in use* —
what vocabulary should the new fleet shape stay consistent with?

#### ADR-002 (hexagonal architecture)

- "Domain isolated from adapters." Domain types own the
  vocabulary; adapters (k8s client, NATS, helm) translate at the
  edge.
- **Implication for fleet:** the *domain* is identity + desired
  state + observed state. The *adapters* are NATS-KV, kube-CRD,
  helm-chart, ansible-over-SSH. The current
  `harmony::modules::fleet` mixes both. Pulling adapters out is the
  refactor.

#### ADR-003 (infrastructure abstractions)

- "Abstractions at domain level, not provider level. `DnsServer`
  not `OPNsenseDns`."
- **Implication for fleet:** capability traits like
  `DeviceRegistry`, `DesiredStatePublisher`, `ObservedStateConsumer`
  — each a standard infrastructure need that NATS-KV happens to
  fulfill today, that another transport (gRPC streaming, MQTT,
  Redis streams) could fulfill tomorrow.
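
A rough sketch of one such capability. Everything here is hypothetical: the method set, the `FleetDevice` fields, and the in-memory implementation are placeholders, and a real trait would almost certainly be async and fallible. The point is only that domain code names the capability and never the transport.

```rust
use std::collections::BTreeMap;

/// Hypothetical domain record; the real one would carry more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetDevice {
    pub id: String,
    pub arch: String,
}

/// Domain-level capability. Nothing in the signature mentions NATS.
pub trait DeviceRegistry {
    fn upsert(&mut self, device: FleetDevice);
    fn get(&self, id: &str) -> Option<FleetDevice>;
    fn list(&self) -> Vec<FleetDevice>;
    fn delete(&mut self, id: &str) -> bool;
}

/// In-memory stand-in, useful for tests. A NATS-KV-backed
/// implementation would sit next to it behind the same trait.
#[derive(Default)]
pub struct InMemoryDeviceRegistry {
    devices: BTreeMap<String, FleetDevice>,
}

impl DeviceRegistry for InMemoryDeviceRegistry {
    fn upsert(&mut self, device: FleetDevice) {
        self.devices.insert(device.id.clone(), device);
    }
    fn get(&self, id: &str) -> Option<FleetDevice> {
        self.devices.get(id).cloned()
    }
    fn list(&self) -> Vec<FleetDevice> {
        self.devices.values().cloned().collect()
    }
    fn delete(&mut self, id: &str) -> bool {
        self.devices.remove(id).is_some()
    }
}

/// Domain code is generic over the capability, not the backend.
pub fn count_arch<R: DeviceRegistry>(registry: &R, arch: &str) -> usize {
    registry.list().iter().filter(|d| d.arch == arch).count()
}
```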

#### ADR-015 (higher-order topologies)

- Higher-order topologies (`FailoverTopology<T>`,
  `DecentralizedTopology<T>`) compose via blanket trait impls.
  `T: PostgreSQL` ⇒ `FailoverTopology<T>: PostgreSQL`. Zero
  boilerplate.
- **Implication for fleet:** `FleetTopology<T>` could compose with
  a base `K8sTopology<T>` rather than being a parallel concept.
  "A fleet is a thing that is *both* a kube cluster *and* a
  device registry."
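
The composition mechanism, reduced to a sketch. The `PostgreSQL` trait below is a one-method stand-in (the real trait has a different surface), and the `primary_healthy` flag and `SingleCluster` type are invented for illustration.

```rust
/// Stand-in capability trait; the real one is richer.
pub trait PostgreSQL {
    fn connection_string(&self) -> String;
}

/// Higher-order topology: wraps two topologies of the same kind.
pub struct FailoverTopology<T> {
    pub primary: T,
    pub replica: T,
    pub primary_healthy: bool,
}

/// Generic impl: whenever `T: PostgreSQL`, the wrapper is `PostgreSQL`
/// too, by delegating to whichever side is currently usable.
impl<T: PostgreSQL> PostgreSQL for FailoverTopology<T> {
    fn connection_string(&self) -> String {
        if self.primary_healthy {
            self.primary.connection_string()
        } else {
            self.replica.connection_string()
        }
    }
}

/// A base topology that provides the capability directly.
pub struct SingleCluster {
    pub host: &'static str,
}

impl PostgreSQL for SingleCluster {
    fn connection_string(&self) -> String {
        format!("postgres://{}:5432", self.host)
    }
}
```

A `FleetTopology<T>` would follow the same pattern: forward the kube capabilities of `T` and add the device-registry capability on top.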

#### ADR-016 (Harmony Agent + Global Mesh)

- Agents are processes that observe + reconcile per a desired
  state published into a NATS mesh. Mesh is the reliable hop;
  agents are stateless processors at the edge.
- **Implication for fleet:** the IoT fleet is a *specialization*
  of the agent + mesh ADR — devices are agents, the operator is
  a coordinator. The fleet domain types should fit ADR-016's
  vocabulary, not invent a parallel one.

#### ADR-017 (NATS clusters interconnection)

- Trust topology: per-cluster account isolation, gateway-mediated
  cross-cluster traffic. Per-device permissions are a
  specialization of per-account.
- **Implication for fleet:** the auth callout's per-device permission
  templates should compose with the cluster-interconnection
  account model — currently they're treated as orthogonal, which
  is fine until we actually cross fleets.

#### ADR-018 (template hydration)

- Hydrating templates at the edge of the framework, not in the
  middle. Same pattern as our generated chart YAML: render once,
  apply via typed code.
- **Implication for fleet:** chart-rendering helpers
  (`build_operator_deployment` et al.) are template-hydration
  edges. They *should* be hidden from domain code. Today they're
  `pub` — visible to consumers like `fleet_staging_install` who
  reach in and grab `operator_secret(opts)`. That's adapter
  leakage.

### §2.4 — Synthesis: principles for the redesign

A short list, ordered. Each line is something the new shape
should satisfy:

1. **Domain types in `harmony-reconciler-contracts` (or a sibling
   crate)**, with no dependency on `harmony` framework types.
2. **Resolved types only at the API surface.** Pre-resolution
   intent is a separate type, used only by the resolver.
3. **Capabilities as traits**, not concrete types. `DeviceRegistry`,
   `DesiredStatePublisher`, etc. The NATS-backed impl is one of
   several allowed.
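4. **Closed cardinality where reality is closed; open where reality
   is open.** Use Goedecke's question ("what is the cardinality of
   this concept in reality?") as the test, rather than applying
   Feldman's slogan everywhere.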
5. **Higher-order topology, not parallel topology.** A fleet is a
   `FleetTopology<T>` over a base K8s topology, not a separate
   capability hierarchy.
6. **Adapters hidden behind capabilities.** Helm chart rendering,
   k8s resource apply, NATS subjects — none of these surface from
   the fleet's public API.
7. **No YAML in framework code paths.** Existing principle from
   v0_1; keep.
8. **Keep wire types minimal + permissive.** Not because they're
   the canonical model, but because they're the
   evolvability seam (Maguire's protos critique applies in
   reverse — *embrace* the loose contract on the wire, *reject* it
   in-memory).

---

## §3 — Design problems with the current shape

Concrete issues the redesign needs to fix. Not "bugs" — *shape*
problems. Each numbered so we can refer back when comparing
alternatives.

- **P1. `harmony/modules/fleet/` is in the wrong crate.** It pulls
  framework dependencies (`HelmChartScore`, `K8sResourceScore`,
  `K8sAnywhereTopology`, `harmony_secret`, etc.) and the runtime
  daemons import *from it*. This makes the operator/agent depend
  transitively on every harmony module — including the OPNsense
  XML codegen, OKD bootstrap stuff, etc. Compile times suffer; the
  release surface is wrong (you can't `cargo install
  harmony-fleet-operator` without all of harmony).
- **P2. `FleetDeviceAuth` mixes resolved + unresolved states.**
  `ZitadelEnroll` is pre-resolution intent; `ZitadelJwt` is
  post-resolution credential. A single match arm has to handle
  both. The "render TOML for both" hack we wrote works but is a
  symptom — the TOML for an unresolved auth should be undefined,
  not "same as resolved".
- **P3. `setup_score.rs` is a 1053-LOC monolith.** Eight responsibilities
  in one file: ssh-vs-local connection, ansible orchestration,
  systemd unit text, hosts-file merging, podman package install,
  fleet-agent user provisioning, keyfile writing, agent restart.
  Readability is poor; testability is per-orchestration not
  per-step.
- **P4. CRD types live in framework crate.** `Deployment` and
  `Device` CRDs are defined in
  `harmony::modules::fleet::operator::crd`. The runtime operator
  crate (`harmony-fleet-operator`) imports them from there. This
  is the most visible symptom of P1.
- **P5. `ReconcileScore` polymorphism is anemic.** Today there's
  exactly one variant, `PodmanV0`. The wire format is set up for
  evolution but no second variant exists, and the cross-crate
  import from `harmony::modules::podman` makes adding one
  expensive (re-export dance).
- **P6. Adapter leakage from chart rendering.**
  `build_operator_deployment`, `operator_secret`, `build_chart`
  are `pub`. Consumers in `examples/` reach in to compose helm
  releases by hand. Domain code should not see "what does the
  operator's helm chart look like".
- **P7. Composed scores wrap composed scores wrap composed scores.**
  `FleetServerScore` wraps {ZitadelScore, ZitadelSetupScore,
  NatsK8sScore, NatsAuthCalloutScore, FleetOperatorScore}. Each
  of those does its own k8s resource apply + helm install.
  Failure modes are deep: a problem in one score's interpret
  surfaces wrapped through five layers of "context()". Hard to
  debug; hard to reason about ordering.
- **P8. Topology assumptions are everywhere.** Every `Score`
  bound is a hand-rolled union of capability traits — `T:
  Topology + HelmCommand + K8sclient + TlsRouter + 'static`. Add
  a new capability and every callsite has to be updated. Higher-
  order topology composition (ADR-015) would let us name "a
  thing that is a fleet-capable cluster" once.
- **P9. `Id` is overloaded.** Same type for device IDs, machine
  user IDs, deployment IDs, topology names. Newtype-ing each
  would catch arg-order swaps at compile time.
- **P10. Configuration is a staircase.** Operator workstation has
  `ZitadelClientConfig` cache file. Operator pod has env-var-from-
  Secret. Agent has TOML on disk. Three different shapes for
  fundamentally the same data (issuer URL, audience, key
  material). Maguire's protos critique applies internally — we're
  using *several* loose-contract serializations of the same
  domain object.
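
For P8, one low-tech way to "name it once" is a supertrait with a blanket impl. This is a sketch under assumptions: the four traits below are empty stand-ins for the real capability traits, and `FleetCapable` is a hypothetical name. It is a weaker tool than ADR-015's higher-order topologies, but it shows the call-site effect.

```rust
// Empty stand-ins for the capability traits listed in P8.
pub trait Topology {}
pub trait HelmCommand {}
pub trait K8sclient {}
pub trait TlsRouter {}

/// One name for the whole union of bounds.
pub trait FleetCapable: Topology + HelmCommand + K8sclient + TlsRouter + 'static {}

/// Blanket impl: any type with all four capabilities is `FleetCapable`
/// automatically; nobody implements it by hand.
impl<T> FleetCapable for T where T: Topology + HelmCommand + K8sclient + TlsRouter + 'static {}

/// Call sites bind on the single name instead of repeating the union.
pub fn describe<T: FleetCapable>(_topology: &T) -> &'static str {
    "fleet-capable"
}

/// Demo topology providing all four capabilities.
pub struct DemoCluster;
impl Topology for DemoCluster {}
impl HelmCommand for DemoCluster {}
impl K8sclient for DemoCluster {}
impl TlsRouter for DemoCluster {}
```

Adding a fifth capability then touches the supertrait list and the blanket impl's `where` clause, not every `Score` bound.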

---

## §4 — Design alternatives

Five sketches. The first three are increasingly principled
cleanups; the last two are deliberately weird, included to force
us to recognize where the *core* of the domain actually is.

For each: one paragraph of premise, the resulting top-level types,
how it answers each of P1–P10 (✓ / ✗ / partial), and the
honest pros + cons.

### Alternative A — Move + thin façade (the conservative cleanup)

**Premise:** the existing types are mostly right; the location is
wrong and the façade leaks. Move `harmony/modules/fleet/` to
`fleet/harmony-fleet/`. Re-export only what's intended public.
Don't redesign types.

**Top-level types:** unchanged. `FleetDeviceSetupScore`,
`FleetServerScore`, `FleetOperatorScore`, `FleetDeviceAuth`,
`AdminAuth`, `Deployment` CRD, `Device` CRD. Same shapes, new
location.

**P1 ✓** (location fix is the goal). **P2 ✗** (auth still mixes
resolved/unresolved). **P3 ✗** (monolith preserved). **P4 ✓**
(CRDs co-located with operator). **P5 ✗**. **P6 partial** (we
can `pub(crate)` the chart helpers but the underlying coupling
remains). **P7 ✗**. **P8 ✗**. **P9 ✗**. **P10 ✗**.

**Pros:** small, safe, mechanical. Two days of work. No customer-
visible breakage. Unblocks P4 cleanup naturally.

**Cons:** doesn't actually fix the shape. We'd be back here in
six weeks. JG's review already said this isn't enough. Not the
right answer for v0.1 timing — *would* be the right answer if
we'd already shipped to two customers and couldn't break their
code.

### Alternative B — Resolved-only at boundaries + capability traits (the principled cleanup)

**Premise:** Crichton's typestate + ADR-003's domain capabilities
applied to the existing shape. Split resolved vs. unresolved
auth into separate types. Define capability traits for the
adapters. Move into the right crate. **No wholesale rewrite.**

**Top-level types:**

- New crate `harmony-fleet/` (sibling to `harmony-fleet-operator`,
  -agent, -auth). Domain types live here.
- `FleetIdentity`, `FleetDevice`, `FleetDeployment` — domain
  records. Plain data.
- `DeviceCredential` — *resolved* only (a JSON keyfile + issuer
  URL + audience). Replaces `FleetDeviceAuth::ZitadelJwt`.
- `EnrollmentIntent` — pre-resolution. Carries `AdminAuth` and
  what to mint. Method `resolve(&self) -> Result<DeviceCredential>`.
- `Score`s become small + single-responsibility:
  - `EnrollDeviceScore` — runs `EnrollmentIntent::resolve` then
    publishes to NATS.
  - `InstallAgentScore` — drops binary + config + systemd unit.
    Takes a `DeviceCredential`. Doesn't know about Zitadel.
  - `InstallOperatorScore` — helm chart + Secret. Doesn't know
    about devices.
  - `BringUpFleetScore` — composes the above. Single layer of
    composition, not five.
- Capability traits:
  - `DeviceRegistry` — list/get/upsert/delete a `FleetDevice`.
    Implementations: `NatsKvDeviceRegistry`,
    (later) `RedisStreamsDeviceRegistry`.
  - `DesiredStatePublisher`, `ObservedStateConsumer` — same
    shape.
  - `IdentityProvider` — mint a device credential, issue an
    admin token. Today: Zitadel. Tomorrow: something else.
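
A sketch of how the resolved/unresolved split and `IdentityProvider` could fit together. Assumptions are flagged in comments: field names are guesses, `resolve` takes the provider as an explicit argument here (the list above writes `resolve(&self)`), errors are plain `String`s, and `FakeIdp` is a test double, not a Zitadel client.

```rust
/// Resolved credential: everything an agent install needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceCredential {
    pub key_json: String,
    pub issuer_url: String,
    pub audience: String,
}

/// Shape taken from the inventory in §1.2.
pub enum AdminAuth {
    Sso { client_id: String },
    Token(String),
}

/// Capability: mint a device credential. ASSUMPTION: method name and signature.
pub trait IdentityProvider {
    fn mint_device_key(&self, admin: &AdminAuth, device_id: &str)
        -> Result<DeviceCredential, String>;
}

/// Pre-resolution intent. Only the resolver ever sees this type.
pub struct EnrollmentIntent {
    pub admin: AdminAuth,
    pub device_id: String,
}

impl EnrollmentIntent {
    /// ASSUMPTION: the provider is injected; the text has `resolve(&self)`.
    pub fn resolve(&self, idp: &dyn IdentityProvider) -> Result<DeviceCredential, String> {
        idp.mint_device_key(&self.admin, &self.device_id)
    }
}

/// Accepts a resolved credential only; there is no way to hand it an intent.
pub struct InstallAgentScore {
    pub credential: DeviceCredential,
}

/// Test double standing in for the Zitadel-backed provider.
pub struct FakeIdp;

impl IdentityProvider for FakeIdp {
    fn mint_device_key(&self, admin: &AdminAuth, device_id: &str)
        -> Result<DeviceCredential, String> {
        match admin {
            AdminAuth::Token(t) if !t.is_empty() => Ok(DeviceCredential {
                key_json: format!("{{\"device\":\"{device_id}\"}}"),
                issuer_url: "https://idp.example".to_string(),
                audience: "fleet".to_string(),
            }),
            AdminAuth::Token(_) => Err("empty admin token".to_string()),
            AdminAuth::Sso { client_id } => {
                Err(format!("fake cannot run SSO for client {client_id}"))
            }
        }
    }
}
```

With this split, "render TOML for an unresolved auth" stops being a question: `InstallAgentScore` never holds anything unresolved.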

**P1 ✓ P2 ✓ P3 ✓** (split into 4–5 small Scores). **P4 ✓ P5 ✓**
(resolve in the runtime crate, contracts stay neutral).
**P6 ✓** (chart helpers `pub(crate)`, surfaced via `IdentityProvider`
+ `DeploymentReleaseManager` traits). **P7 ✓** (one composer,
not five). **P8 partial** (capability traits defined but bound
unions still get long). **P9 ✓** with newtypes. **P10 partial**
(still three on-disk shapes for credentials, but unified by
trait).

**Pros:** highest-leverage incremental redesign. Buys us most of
the principles without rebuilding plumbing. Customer-visible
breakage is contained to public API renames + import path
moves — no behavior change. Three days is realistic.

**Cons:** we still have a `Score`-shaped mental model where the
*unit of execution* is "a Score". If the right primitive turns
out to be smaller (an effect, an event, a capability call), this
choice wastes some leverage.

### Alternative C — The dataflow reframe (events in, state out)

**Premise:** the fleet platform is, in essence, a **stream
processor**. Events flow in (heartbeats, intent CR creates,
agent reconcile reports). State materializes out (Device CRs,
DeploymentAggregate counters, KV desired-state writes). Today
we model it imperatively as a series of `Score`s; the dataflow
shape is fighting that.

**Top-level types:**

- `FleetEvent` — sum type. `DeviceHeartbeat | DeviceFirstSeen |
  DeploymentDesired | DeploymentObserved | DeploymentDeleted | …`
- `FleetStateSnapshot` — what the operator currently knows. Pure
  data, derivable.
- `Reducer` — `(state, event) → state`. Pure function. Tests
  trivially.
- `Effect` — sum type of side-effects the reducer wants done:
  `WriteKv(bucket, key, value) | UpsertCr(cr) | EmitMetric(...)`.
  Reducer returns `(new_state, Vec<Effect>)`.
- `EffectRunner` — adapter that performs effects. The only thing
  that touches NATS / kube. One implementation per environment.
- The operator pod's main loop: `for event in stream { (state,
  effects) = reduce(state, event); runner.run_all(effects) }`.
  ~50 lines.
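
The reducer shape, as a compact sketch. Only two event variants and two effect variants are modelled, and the rules inside `reduce` (upsert a Device CR on first heartbeat, emit a success counter) are invented for illustration; they are not the operator's actual logic.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FleetEvent {
    DeviceHeartbeat { device_id: String },
    DeploymentObserved { device_id: String, deployment: String, succeeded: bool },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Effect {
    UpsertCr { kind: &'static str, name: String },
    EmitMetric { name: &'static str, value: u64 },
}

/// What the operator currently knows. Plain data.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct FleetStateSnapshot {
    pub heartbeats: BTreeMap<String, u64>,
    pub observed: BTreeMap<(String, String), bool>,
}

/// Pure: no I/O, just the next state plus the effects it wants performed.
pub fn reduce(mut state: FleetStateSnapshot, event: FleetEvent)
    -> (FleetStateSnapshot, Vec<Effect>) {
    let effects = match event {
        FleetEvent::DeviceHeartbeat { device_id } => {
            let seen = state.heartbeats.entry(device_id.clone()).or_insert(0);
            *seen += 1;
            // Illustrative rule: the first heartbeat materializes the Device CR.
            if *seen == 1 {
                vec![Effect::UpsertCr { kind: "Device", name: device_id }]
            } else {
                vec![]
            }
        }
        FleetEvent::DeploymentObserved { device_id, deployment, succeeded } => {
            state.observed.insert((device_id, deployment), succeeded);
            let ok = state.observed.values().filter(|s| **s).count() as u64;
            vec![Effect::EmitMetric { name: "deployments_succeeded", value: ok }]
        }
    };
    (state, effects)
}

/// The only component allowed to touch NATS or kube.
pub trait EffectRunner {
    fn run(&mut self, effect: Effect);
}

/// Test runner that records effects instead of performing them.
#[derive(Default)]
pub struct RecordingRunner {
    pub log: Vec<Effect>,
}

impl EffectRunner for RecordingRunner {
    fn run(&mut self, effect: Effect) {
        self.log.push(effect);
    }
}

/// The main loop from the list above, over a finite batch of events.
pub fn run_loop<R: EffectRunner>(events: Vec<FleetEvent>, runner: &mut R)
    -> FleetStateSnapshot {
    let mut state = FleetStateSnapshot::default();
    for event in events {
        let (next, effects) = reduce(state, event);
        state = next;
        effects.into_iter().for_each(|e| runner.run(e));
    }
    state
}
```

Because `reduce` is pure, a test is a list of events in and a list of effects out, with no NATS or kube fixture.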

**P1 ✓ P2 ✓ P3 ✓ P4 ✓ P5 ✓ P6 ✓ P7 ✓ P8 ✓** (capabilities
collapse into the `EffectRunner` trait). **P9 ✓ P10 partial**.

**Pros:** dramatically simpler operator code. Reducer is pure →
property-test-friendly. The dataflow is the platform. Aligns
with how Kafka / Materialize / Flink-class systems are
structured. Easy to add a new event type — the compiler shows
you every reducer arm to update.

**Cons:** large rewrite of the operator. Three days is
unrealistic. The current `fleet_aggregator.rs` (833 LOC) already
roughly does this but in a less disciplined shape — maybe the
incremental version of this is "make `apply_state` a real
reducer and split `compute_aggregate` into pure pieces". That's
more like Alternative B with extra discipline. The full effect-
typed version is a nice end-state but not a sprint goal.

**Cite:** Materialize's dataflow paper; Kent Beck's *Augmented
Coding* on factoring; Gergely Orosz on event-sourcing; the talk's
"good Lego bricks" framing applies — *events* are the bricks.

### Alternative D — The fleet as a **kube control plane**, period (deliberately weird)

**Premise:** strip the design to one observation. **A fleet is a
Kubernetes cluster whose Nodes happen to be devices, not
servers.** Stop modelling Devices and Deployments separately
from kube primitives. Use Kubernetes itself as the data model.
The operator is one CRD reconciler. NATS is just the transport
between the API server (in the cluster) and the device-side
kubelet-equivalent.

**Top-level types:**

- `Device` is a Node resource. Already exists; we stop wrapping it.
- `Deployment` is a `DaemonSet` (one pod per matching node) or a
  `Deployment` (count: N targeted nodes). We stop inventing a
  CRD; we use the standard one.
- `DeviceInfo` is the Node's `.status` (capacity, allocatable,
  conditions). We stop publishing parallel data; we update
  Node status from the agent's NATS messages.
- The agent on the device is a custom kubelet that speaks NATS to
  the operator instead of HTTPS to the API server.
- The auth callout still exists; it gates NATS access.
- No `harmony-fleet-operator`-specific CRDs. No `Deployment` /
  `Device` CRs of our own.

**P1 ✓ P2 ✓ P3 ✓ P4 N/A** (no CRDs of our own to misplace).
**P5 ✓ P6 ✓ P7 ✓ P8 ✓ P9 ✓ P10 ✓**.

**Pros:** the simplest *conceptual* answer. We stop fighting kube
+ inventing parallel concepts. Customers already understand
DaemonSets, Node selectors, and `kubectl get nodes`. The agent
becomes a known kind of thing (a kubelet variant) with shoulders
to stand on (k3s-iot, kine, virtual-kubelet projects already
prove this works).

**Cons:** *a lot* of plumbing changes. Devices need to register
as Nodes (which means either a real kubelet on each Pi, or a
virtual-kubelet façade). The agent's reconcile loop becomes
"watch a CR via NATS, render manifests, run pods" — bigger than
"watch a KV value, run podman". JetStream KV becomes redundant
with the kube API server. **Probably the right end-state for
v2.0, wrong for v0.1.** Worth noting, though, because comparing
A/B/C to D pulls out which of our current invented concepts are
load-bearing (very few: `DeviceInfo` is mostly just `Node.status`;
`DeploymentAggregate` is mostly just kube's
`.status.observedGeneration` and `.status.conditions`).

**Cite:** virtual-kubelet, k3s-iot, KubeEdge, OpenYurt. They've
walked this path; the lessons are public.

### Alternative E — Algebra of fleets (deliberately weird, mathematical)

**Premise:** model the platform as a small algebra. A fleet is a
**set of devices** + an **assignment function** (selector → set
of deployments). Operations on fleets are set-theoretic +
function composition. Treat the API as a query language over
this algebra.

**Top-level types:**

- `Fleet` ::= `Set<Device>`. With operations: union, intersection,
  filter-by-selector, partition.
- `Selector` ::= a pure predicate `Device → bool`. Built from
  primitives `label("k") = "v"`, `arch = aarch64`, …, combined
  with `&`, `|`, `!`.
- `Assignment` ::= `Selector → Set<Deployment>`. Pure function.
- `World` ::= `(Fleet, Assignment)`. Pure data. The operator's job
  is to make reality match the World.
- `Diff(World, Reality) → Vec<Action>`. Pure function. Closed
  form — given the algebra, you can prove what actions are
  *necessary* and *sufficient*.
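
The types above can be sketched in a few dozen lines of Rust. This
is an illustration under stated assumptions, not a proposed
implementation: the `Device` fields, the `Action` variants, and the
choice to represent `Assignment` as a list of `(Selector, name)`
rules are all hypothetical, and only the `label` primitive is
modelled.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical device record: an id plus free-form labels.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Device {
    pub id: String,
    pub labels: BTreeMap<String, String>,
}

/// A pure predicate over devices, built from a primitive and
/// combined with and / or / not.
#[derive(Debug, Clone)]
pub enum Selector {
    Label(String, String),
    And(Box<Selector>, Box<Selector>),
    Or(Box<Selector>, Box<Selector>),
    Not(Box<Selector>),
}

impl Selector {
    pub fn matches(&self, d: &Device) -> bool {
        match self {
            Selector::Label(k, v) => d.labels.get(k) == Some(v),
            Selector::And(a, b) => a.matches(d) && b.matches(d),
            Selector::Or(a, b) => a.matches(d) || b.matches(d),
            Selector::Not(a) => !a.matches(d),
        }
    }
}

/// Desired state: the fleet plus (selector, deployment name) rules.
pub struct World {
    pub fleet: BTreeSet<Device>,
    pub assignment: Vec<(Selector, String)>,
}

/// Observed state: which deployments each device id currently runs.
pub type Reality = BTreeMap<String, BTreeSet<String>>;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    Start { device: String, deployment: String },
    Stop { device: String, deployment: String },
}

/// Pure diff over devices in the fleet. Devices that appear only in
/// `reality` are ignored in this sketch.
pub fn diff(world: &World, reality: &Reality) -> Vec<Action> {
    let empty = BTreeSet::new();
    let mut actions = Vec::new();
    for device in &world.fleet {
        let desired: BTreeSet<String> = world
            .assignment
            .iter()
            .filter(|(sel, _)| sel.matches(device))
            .map(|(_, dep)| dep.clone())
            .collect();
        let actual = reality.get(&device.id).unwrap_or(&empty);
        for dep in desired.difference(actual) {
            actions.push(Action::Start { device: device.id.clone(), deployment: dep.clone() });
        }
        for dep in actual.difference(&desired) {
            actions.push(Action::Stop { device: device.id.clone(), deployment: dep.clone() });
        }
    }
    actions
}
```

Because `diff` is a pure function over ordered sets, its output is
deterministic for a given `(World, Reality)` pair, which is what
makes the "necessary and sufficient" claim checkable in tests.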

**P1–P10 ✓** (in principle). **Code volume probably 30% of
current.**

**Pros:** clarity. Properties become provable: "no device gets
an unassigned deployment", "removing a label removes the
assignment", "two operators can edit independently and the merge
is well-defined" (because functions compose). The "make
impossible states impossible" principle, applied to the *fleet
shape itself*, not to individual types.

**Cons:** **almost certainly an over-fit.** The real platform has
dirty edges (devices that fail, network partitions, half-applied
state) that don't sit naturally in a pure algebra. Most teams
that go down this road bolt "real-world" escape hatches back on
and end up with the original design plus extra category
theory. **Useful as a north star** for the cardinality choices,
**not as the platform's actual shape.**

**Cite:** Hillel Wayne *Using Formal Methods at Work*; Conal
Elliott on functional reactive programming; the classic "set
theory for systems people" talks.

### Comparison matrix

| | A. Move | B. Capabilities | C. Dataflow | D. Kube-native | E. Algebra |
|---|---|---|---|---|---|
| Fixes P1 (location) | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fixes P2 (auth states) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P3 (monolith) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P4 (CRD placement) | ✓ | ✓ | ✓ | N/A | N/A |
| Fixes P5 (anemic enum) | ✗ | ✓ | ✓ | N/A | partial |
| Fixes P6 (adapter leak) | partial | ✓ | ✓ | ✓ | ✓ |
| Fixes P7 (deep wrap) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P8 (trait union) | ✗ | partial | ✓ | ✓ | ✓ |
| Fixes P9 (Id overload) | ✗ | ✓ | ✓ | ✓ | ✓ |
| Fixes P10 (config staircase) | ✗ | partial | partial | ✓ | partial |
| Fits 3-day window | ✓ | ✓ (tight) | ✗ | ✗ | ✗ |
| Customer-visible breakage | low | medium | medium | very high | high |
| Risk to demo schedule | very low | low | medium | very high | high |
| Long-term ceiling | low | high | high | very high | very high |

---

## §5 — Recommendation (preliminary)

Read the matrix as: **B is the right answer for now**, with
**explicit awareness of D as the v2.0 destination**.

- A is too little. We'd be back here.
- C and E are right in shape but wrong in timing — we don't have a
  week to rebuild the operator's reconcile loop, and the platform
  isn't in production yet, so there's no urgent "we have to
  refactor anyway" pressure.
- D is conceptually the cleanest, but a v0.1 production push
  is the wrong moment to start running custom kubelets.
- B captures most of the leverage of C/D within the 3-day window,
  with a clean migration path to either of them later (the
  capability traits are the seam — swap the implementation, not the
  callers).

**One concrete shape** to pursue under Alternative B (worth
sketching as the strawman ADR):

- New crate `harmony-fleet/` (the domain crate). Depends on
  `harmony-reconciler-contracts` only.
  - Domain records: `FleetDevice`, `FleetDeployment`, `FleetState`.
  - Capability traits: `DeviceRegistry`, `DesiredStatePublisher`,
    `ObservedStateConsumer`, `IdentityProvider`,
    `AgentLifecycle`.
- `harmony-fleet-adapters-nats/` — `NatsDeviceRegistry`,
  `NatsDesiredStatePublisher`, etc. NATS-specific.
- `harmony-fleet-adapters-zitadel/` — `ZitadelIdentityProvider`.
- `harmony-fleet-adapters-kube/` — `KubeFleetReflector` (writes
  `Device` and `Deployment` CRs as a *reflection* of the domain
  state, not as the source of truth).
- `harmony-fleet-operator/` — daemon. Wires adapters together.
- `harmony-fleet-agent/` — daemon. Wires adapters together.
- `harmony-fleet-cli/` — tomorrow's `harmony-fleet` plugin.
- `harmony/modules/fleet/` is **deleted**. The framework `harmony`
  crate gets a thin `harmony::modules::fleet` *re-export only*
  module that points at `harmony-fleet`. After v0.2 is shipped,
  the re-export module goes away too.
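
A minimal sketch of what "the capability traits are the seam" could
look like for `DeviceRegistry`. Everything beyond the trait and
record names is an assumption: the `FleetDevice` fields, the method
set, and the in-memory adapter are hypothetical, and the real trait
would most likely be async. The point is only that callers name the
trait, never the NATS adapter.

```rust
use std::collections::BTreeMap;

/// Hypothetical domain record; the real fields are not decided here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetDevice {
    pub id: String,
    pub arch: String,
}

/// Capability trait owned by the domain crate. Nothing here mentions
/// NATS, kube, or Zitadel.
pub trait DeviceRegistry {
    fn upsert(&mut self, device: FleetDevice);
    fn get(&self, id: &str) -> Option<FleetDevice>;
    fn list(&self) -> Vec<FleetDevice>;
}

/// In-memory adapter, standing in for `NatsDeviceRegistry` in tests.
#[derive(Default)]
pub struct InMemoryDeviceRegistry {
    devices: BTreeMap<String, FleetDevice>,
}

impl DeviceRegistry for InMemoryDeviceRegistry {
    fn upsert(&mut self, device: FleetDevice) {
        self.devices.insert(device.id.clone(), device);
    }

    fn get(&self, id: &str) -> Option<FleetDevice> {
        self.devices.get(id).cloned()
    }

    fn list(&self) -> Vec<FleetDevice> {
        self.devices.values().cloned().collect()
    }
}

/// Caller code is generic over the capability, so swapping the
/// adapter does not touch it.
pub fn count_by_arch<R: DeviceRegistry>(registry: &R, arch: &str) -> usize {
    registry.list().iter().filter(|d| d.arch == arch).count()
}
```

With this shape, the operator and agent daemons are the only places
that pick a concrete adapter.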

CRDs (`Deployment`, `Device`) move to
`harmony-fleet-adapters-kube/` because they're a kube-specific
projection of the domain, not the domain itself. The agent
imports `harmony-fleet`'s domain types, not the CRDs.

The setup-side scores stay in `harmony` (because they need the
framework's `HelmCommand`, `K8sclient`, etc.) but they consume
`harmony-fleet`'s domain types. The fleet's *domain* doesn't
depend on the framework; the framework's *deploy procedures*
depend on the fleet's domain. Direction of dependency is the
inverse of today.

## §6 — Open questions before we lock this

These are real questions; pulling them out so JG's review has
something concrete to react to:

- **Q1.** Is `IdentityProvider` the right capability name, or is
  it more honest to name it after what we actually need
  (`DeviceCredentialMinter`, `OperatorTokenProvider`)? The talk
  argues against generic names — if reality has two distinct
  concerns, two traits.
- **Q2.** Should the `Device` CRD live in adapters-kube, or should
  it not exist at all (replaced by reading kube-API Node info, per
  Alternative D)? The middle ground (our own CRD that mirrors the
  kube Node) is what we have today, and it's the worst of both
  worlds.
- **Q3.** Should the agent's wire format for `ReconcileScore` (an
  externally tagged enum, today only `PodmanV0`) move to
  `harmony-reconciler-contracts` (the canonical wire seam), so that
  *both* the agent and the operator import only that crate? This
  would remove the `harmony::modules::podman` cross-crate
  dependency. Worth doing in any of A/B/C.
- **Q4.** Does the v0.1 prod push wait for this redesign, or does
  it ship on the current shape with the redesign happening in
  v0.2? Tradeoff: shipping now means committing to *some* public
  API; shipping after means slipping the customer date.
  Recommendation: **ship the redesign first, slip 3 days**, on
  the grounds that public API churn after a customer is on it
  costs more than a 3-day delay before they're on it.
- **Q5.** Where do the *runtime tools* (the `harmony-fleet` CLI
  plugin, future frontend) sit in the dependency graph? If they
  depend on `harmony-fleet`'s domain crate only, we can build
  them without pulling in helm / kube / ansible at compile time.
  This is what we want for the device-side enrollment binary too
  (already feature-gated; the redesign should make the gate
  unnecessary).
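
To give Q1 something concrete to react to, here is a sketch of the
two-trait option. The trait names come from Q1; the method names,
the `DeviceCredential` fields, `OperatorToken`, and `FakeIdentity`
are hypothetical placeholders, and the real traits would probably be
async and fallible.

```rust
/// Hypothetical credential handed to a device at enrollment.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceCredential {
    pub device_id: String,
    pub secret: String,
}

/// Hypothetical token the operator uses for its own calls.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OperatorToken(pub String);

/// Concern 1: mint a credential for one device.
pub trait DeviceCredentialMinter {
    fn mint(&self, device_id: &str) -> DeviceCredential;
}

/// Concern 2: obtain the operator's own token.
pub trait OperatorTokenProvider {
    fn operator_token(&self) -> OperatorToken;
}

/// One adapter may still implement both traits; a fake stands in
/// for `ZitadelIdentityProvider` here.
pub struct FakeIdentity;

impl DeviceCredentialMinter for FakeIdentity {
    fn mint(&self, device_id: &str) -> DeviceCredential {
        DeviceCredential {
            device_id: device_id.to_string(),
            secret: format!("secret-for-{}", device_id),
        }
    }
}

impl OperatorTokenProvider for FakeIdentity {
    fn operator_token(&self) -> OperatorToken {
        OperatorToken("operator-token".to_string())
    }
}

/// Enrollment names only the concern it needs, so it cannot reach
/// the operator token by accident.
pub fn enroll(minter: &impl DeviceCredentialMinter, device_id: &str) -> DeviceCredential {
    minter.mint(device_id)
}
```

The trade-off is one more trait to wire, in exchange for signatures
that state which identity concern each caller depends on.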

---

## §7 — Next steps

1. Sit with this document. Walk away from it for an hour.
2. Round-table on §3 — do P1–P10 capture *the* problems, or are
   we missing one?
3. Round-table on §4 — does the comparison matrix feel honest,
   or is it tilted?
4. Pick one alternative as the working hypothesis.
5. Spike: take one slice through the chosen alternative
   (suggested: `EnrollmentIntent::resolve` + `DeviceCredential` +
   the `IdentityProvider` trait — the smallest end-to-end shape
   that touches every layer). Commit it on a branch. Eyeball:
   does the resulting code feel better?
6. Either: commit to the alternative as ADR-023, or back out
   and try another.

This document gets updated as we go. It is NOT meant to be
locked at first draft.