feat: scaffold IoT walking skeleton — podman module, operator, and agent #264

Merged
johnride merged 210 commits from feat/iot-walking-skeleton into master 2026-05-22 22:16:18 +00:00
  • Add PodmanV0Score/IotScore (adjacent-tagged serde) and PodmanV0Interpret stub
  • Gate virt behind kvm feature and podman-api behind podman feature
  • Scaffold iot-operator-v0 (kube-rs operator stub) and iot-agent-v0 (NATS KV watch)
  • Add PodmanV0 to InterpretName enum
  • Fix aarch64 cross-compilation by making kvm/podman optional features
  • Align async-nats across workspace, add workspace deps for tracing/toml/tracing-subscriber
  • Remove unused deps (serde_yaml from agent, schemars from operator)
  • Add Send+Sync to CredentialSource, fix &PathBuf → &Path, remove dead_code allow
  • Update 5 KVM example Cargo.tomls with explicit features = ["kvm"]
johnride added 1 commit 2026-04-18 02:37:58 +00:00
feat: scaffold IoT walking skeleton — podman module, operator, and agent
Some checks failed
Run Check Script / check (pull_request) Has been cancelled
65ef540b97
- Add PodmanV0Score/IotScore (adjacent-tagged serde) and PodmanV0Interpret stub
- Gate virt behind kvm feature and podman-api behind podman feature
- Scaffold iot-operator-v0 (kube-rs operator stub) and iot-agent-v0 (NATS KV watch)
- Add PodmanV0 to InterpretName enum
- Fix aarch64 cross-compilation by making kvm/podman optional features
- Align async-nats across workspace, add workspace deps for tracing/toml/tracing-subscriber
- Remove unused deps (serde_yaml from agent, schemars from operator)
- Add Send+Sync to CredentialSource, fix &PathBuf → &Path, remove dead_code allow
- Update 5 KVM example Cargo.tomls with explicit features = ["kvm"]
johnride added 2 commits 2026-04-18 14:08:04 +00:00
Implement the A1 task from the IoT walking-skeleton roadmap:

- CRD (kube-derive): `iot.nationtech.io/v1alpha1/Deployment`, namespaced,
  with `targetDevices`, `score {type, data}`, `rollout.strategy`, and a
  status subresource carrying `observedScoreString`.
- Controller: `kube::runtime::Controller` + `finalizer` helper. On Apply,
  writes `<device_id>.<deployment_name>` into NATS KV bucket
  `desired-state` and patches `.status.observedScoreString` via
  server-side apply. Skips KV write + status patch when the score is
  unchanged to avoid reconcile-loop churn. On Cleanup, removes the
  per-device keys before releasing the finalizer.
- CLI: `gen-crd` subcommand prints the CRD YAML from the Rust types;
  `run` (default) starts the controller. `deploy/crd.yaml` is generated
  by that subcommand — single source of truth, no drift.
- Deploy manifests: `deploy/operator.yaml` (Namespace, SA, ClusterRole,
  ClusterRoleBinding, Deployment) and generated `deploy/crd.yaml`.
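
The key layout described in the controller bullet can be sketched with the standard library alone. This is an illustration rather than the operator's code: `kv_key` and `keys_for` are hypothetical helper names, and the real controller writes through async-nats instead of returning strings.

```rust
/// Build the `<device_id>.<deployment_name>` key used in the `desired-state` bucket.
/// Hypothetical helper; the operator's real function name may differ.
fn kv_key(device_id: &str, deployment_name: &str) -> String {
    format!("{device_id}.{deployment_name}")
}

/// One key per target device. Apply writes this set of keys,
/// and Cleanup deletes the same set before the finalizer is released.
fn keys_for(target_devices: &[&str], deployment_name: &str) -> Vec<String> {
    target_devices
        .iter()
        .map(|device| kv_key(device, deployment_name))
        .collect()
}
```

Because the device id is the first token of the key, an agent can select its own keys with a single subject wildcard.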

Agent fixes surfaced while aligning with the operator's key layout:

- Watch filter: was `starts_with("desired-state.<id>.")` on
  `watch_all()`; bucket name is not a key prefix, so it never matched.
  Now uses `bucket.watch("<id>.>")` with the NATS wildcard and handles
  `Put`/`Delete`/`Purge` distinctly.
- Multi-server connect: was joining `nats.urls` with `","` into a single
  malformed URL. Pass the `Vec<String>` to `ConnectOptions::connect`.
- `credentials.type` is now validated (rejects unknown discriminators)
  so a v0.2 `zitadel` config doesn't silently fall back to shared creds.
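
To see why the old watch filter never matched and the new one does, here is a minimal std-only sketch of NATS-style subject matching. `subject_matches` is a hypothetical helper written for illustration; the agent itself relies on the server-side filter behind `bucket.watch`, not on code like this.

```rust
/// Minimal NATS-style subject match over dot-separated tokens:
/// `*` matches exactly one token, `>` matches one or more remaining tokens.
fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut sub = subject.split('.');
    loop {
        match (pat.next(), sub.next()) {
            // `>` must be the last pattern token and needs at least one subject token.
            (Some(">"), Some(_)) => return pat.next().is_none(),
            (Some("*"), Some(_)) => continue,
            (Some(p), Some(s)) if p == s => continue,
            // Both exhausted at the same time: every token matched.
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

With keys shaped like `pi-01.hello-web`, the pattern `pi-01.>` matches, while a prefix check against `desired-state.pi-01.` cannot, since the bucket name is never part of the key.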

Verification on feat/iot-walking-skeleton:
- cargo clippy --no-deps -D warnings: clean (agent + operator).
- cargo fmt --check: clean.
- x86_64 + aarch64 cross-compile: both build.
- podman module unit tests: pass.
test(iot-operator): A1 end-to-end smoke test + CRD/patch fixes
All checks were successful
Run Check Script / check (pull_request) Successful in 2m15s
1c916340f1
`iot/scripts/smoke-a1.sh` drives the A1 acceptance flow end-to-end:
spins up NATS and a k3d cluster via podman, applies the generated CRD,
runs the operator, applies a Deployment CR, asserts the expected
`<device>.<deployment>` key lands in the `desired-state` KV bucket and
`.status.observedScoreString` round-trips the same JSON, then deletes
the CR and asserts the finalizer removes the KV key. Cleans up on exit.

Two fixes surfaced while running it:

1. `ScorePayload.data: serde_json::Value` generated an empty `{}`
   schema, which the API server rejects. Attach a
   `schemars(schema_with = preserve_arbitrary)` helper that emits
   `x-kubernetes-preserve-unknown-fields: true`, letting the Score payload
   be any JSON shape.
2. `Patch::Merge` combined with `PatchParams::apply(...).force()` is
   rejected by kube-rs (force is Apply-only). Use a plain `Merge` patch
   for the status subresource — simpler and correct for v0.
johnride added 1 commit 2026-04-18 14:37:10 +00:00
feat(iot-operator): CEL-validate score.type as a Rust identifier
All checks were successful
Run Check Script / check (pull_request) Successful in 2m15s
d21bdef050
The CRD previously accepted any string for `score.type`, so typos like
`"pdoman"` or `"PodmnV0"` would be persisted by the apiserver and only
surface on-device as agent-side deserialize warnings. That class of
failure is distasteful and hard to debug.

Replace the auto-derived schema for `ScorePayload` with a hand-rolled
one that keeps the same visible shape but adds two apiserver-level
guardrails:

- `score.type` gets `minLength: 1` and an `x-kubernetes-validations`
  CEL rule requiring it to match `^[A-Za-z_][A-Za-z0-9_]*$` — a valid
  Rust identifier, since score variants *are* Rust struct names in
  `harmony::modules::podman::IotScore`. Message points operators at
  the concrete example `PodmanV0`.
- `score.data` still carries only
  `x-kubernetes-preserve-unknown-fields: true`. The rule validates the
  discriminator's *shape*, not
  its *value*, so v0.3+ variants (OkdApplyV0, KubectlApplyV0) don't
  require an operator release — preserves ROADMAP §6.1's
  generic-router design.

The `x-kubernetes-preserve-unknown-fields` extension stays scoped to
`score.data` alone; every other field in the CRD has a strict schema,
so the whole document contains exactly one preserve-unknown-fields
marker and exactly one validations block.
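
As a rough mirror of what the apiserver-side rule accepts, the shape check can be written without a regex engine. This is a sketch for reasoning about the rule, not code from the operator; `is_identifier_shaped` is a hypothetical name, and the real validation runs as CEL inside the apiserver.

```rust
/// Mirror of the CRD rule: non-empty (minLength: 1) and matching
/// `^[A-Za-z_][A-Za-z0-9_]*$`.
fn is_identifier_shaped(score_type: &str) -> bool {
    let mut chars = score_type.chars();
    match chars.next() {
        // First character: ASCII letter or underscore.
        Some(first) if first.is_ascii_alphabetic() || first == '_' => {
            // Remaining characters: ASCII letters, digits, or underscore.
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        // Empty string or an invalid first character.
        _ => false,
    }
}
```

Note that a misspelling such as `pdoman` still passes this check: the rule constrains the shape of the discriminator, not its value, so unknown variants are still caught agent-side.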

Smoke test extended: phase 2b applies a CR with
`score.type: "has spaces"` and asserts the apiserver rejects it with the
CEL message
before the operator ever sees it. Positive phases (kubectl apply ->
NATS KV put -> status observed -> delete -> KV key removed) still
PASS end-to-end.

Matches the `preserve_arbitrary` pattern used by ArgoCD
(`Application.spec.source.helm.valuesObject`) and Flux
(`HelmRelease.spec.values`), both of which similarly use narrow
preserve-unknown-fields on a payload field without coupling the CRD
to their variant catalog.

Code review guide — feat/iot-walking-skeleton

One-line summary. Thin end-to-end thread for the IoT platform (ROADMAP/iot_platform/v0_walking_skeleton.md): kubectl apply Deployment in a
central cluster → operator writes to NATS KV → on-device agent runs the container, all in Rust with Harmony's Score/Topology/Interpret
pattern.

Size. 5 commits, ~3k lines added, almost all new code — two new crates (iot-operator-v0, iot-agent-v0), one new Harmony module
(modules/podman/), one new capability trait (domain::topology::ContainerRuntime), one new inventory constructor, one smoke test script.


The 5 commits, what each is for

┌─────┬───────────────────────┬─────────────────────────────────────────────────────────────────────┬───────────────────────────────────┐
│ # │ Commit │ What it does │ Review weight │
├─────┼───────────────────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┤
│ 1 │ 65ef540 scaffold │ Two iot crates with stub binaries, modules/podman/ with typed Score │ Skim — structural, low risk │
│ │ │ + stub interpret, workspace plumbing │ │
├─────┼───────────────────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┤
│ 2 │ e50ab74 operator │ kube-rs Controller + finalizer, Deployment CRD types, deploy │ Heavy — this is the whole │
│ │ controller │ manifests, gen-crd subcommand │ operator │
├─────┼───────────────────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┤
│ 3 │ 1c91634 smoke test + │ smoke-a1.sh, x-kubernetes-preserve-unknown-fields on score.data, │ Medium — fixes are small, script │
│ │ CRD/patch fixes │ drop invalid Patch::Merge + .force() combo │ is important │
├─────┼───────────────────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┤
│ │ d21bdef CEL │ Hand-rolled schema_with emitting x-kubernetes-validations that │ Medium — the schemars::r#gen │
│ 4 │ validation │ forces score.type to match a Rust identifier │ escape hatch is the only │
│ │ │ │ non-obvious part │
├─────┼───────────────────────┼─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┤
│ 5 │ 1112125 agent │ ContainerRuntime trait, PodmanTopology, PodmanV0Interpret wired to │ Heavy — largest, most novel │
│ │ reconciliation │ podman-api, Inventory::from_localhost, reconciler with 30s tick │ │
└─────┴───────────────────────┴─────────────────────────────────────────────────────────────────────┴───────────────────────────────────┘

Suggested reading order

Don't read chronologically — read by architecture layer.

  1. Start at the domain layer. Read harmony/src/domain/topology/container_runtime.rs first. Small file. Anchor point for everything else.
    Confirm the capability shape, the MANAGED_BY_LABEL fleet-safety story, the intentional "Podman-shaped not CRI-shaped" scope comment.
  2. Then the wire format. harmony/src/modules/podman/score.rs: note the ContainerRuntime bound on the Score<T> impls, the IotScore
    adjacently-tagged serde enum (#[serde(tag = "type", content = "data")]), the existing round-trip tests. This is the only polymorphic variant
    today; the shape is designed for OkdApplyV0 / KubectlApplyV0 later without operator changes.
  3. The operator. iot/iot-operator-v0/src/crd.rs → src/controller.rs → src/main.rs. 300-line budget held. Most interesting bits: the
    hand-rolled score_payload_schema (commit 4), the finalizer helper usage, the no-op guard in apply() that skips KV write + status patch when
    the score is unchanged.
  4. The podman topology. harmony/src/modules/podman/topology.rs — drift detection via matches_spec, image pull before create, graceful stop
    with 5-min timeout per ROADMAP §5.6.
  5. The agent. iot/iot-agent-v0/src/reconciler.rs → src/main.rs. Pay attention to the HashMap<key, (serialized, parsed)> cache — that's the
    §5.5 string-compare idempotency made explicit.
  6. The smoke test. iot/scripts/smoke-a1.sh. Six phases top to bottom. This is what proves the whole thread.
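
The fleet-safety story from step 1 comes down to a label filter: anything without the managed-by label is invisible to Harmony and therefore never touched. A minimal sketch follows, assuming a placeholder label key and value (the real constant lives in container_runtime.rs and may differ) and a simplified `Container` type that is not the one from podman-api.

```rust
use std::collections::HashMap;

// Placeholder key and value; assumptions for illustration only.
const MANAGED_BY_LABEL: &str = "example.invalid/managed-by";
const MANAGED_BY_VALUE: &str = "harmony";

/// Simplified stand-in for a container as reported by the runtime.
struct Container {
    name: String,
    labels: HashMap<String, String>,
}

/// Test helper that builds a container from label pairs.
fn container(name: &str, labels: &[(&str, &str)]) -> Container {
    Container {
        name: name.to_string(),
        labels: labels
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// Keep only containers carrying the managed-by label with the expected value.
fn list_managed(all: &[Container]) -> Vec<&str> {
    all.iter()
        .filter(|c| c.labels.get(MANAGED_BY_LABEL).map(String::as_str) == Some(MANAGED_BY_VALUE))
        .map(|c| c.name.as_str())
        .collect()
}
```

A container started by hand on the same host has no such label, so a listing built this way would skip it.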

What to look at closely

CRD schema (commit 3 + 4). The x-kubernetes-preserve-unknown-fields: true extension is scoped to .spec.score.data only — everything else has
a strict schema. Verify this is still true after any rebase: grep -c preserve-unknown iot/iot-operator-v0/deploy/crd.yaml should print exactly 1.
The CEL rule on score.type requires a Rust identifier (^[A-Za-z_][A-Za-z0-9_]*$); it validates shape, not the variant catalog, so the
operator stays generic. See harmony/src/modules/podman/score.rs::deployment_label for how that string flows through.

serde_json::Value schema hack. crd.rs::preserve_arbitrary and score_payload_schema use schemars::r#gen::SchemaGenerator (the raw-identifier
escape is needed because gen is a reserved keyword in the Rust 2024 edition). This is the only place that synthesises non-derived OpenAPI
schema; if schemars ever grows first-class x-kubernetes-* support, this is the migration target.

Finalizer + status subresource. iot-operator-v0/src/controller.rs::reconcile uses kube::runtime::finalizer::finalizer(...) so delete goes
through Cleanup → KV key removed → finalizer released. Patch::Merge on the status subresource rather than Patch::Apply because .force() is
Apply-only (that's the fix in commit 3). Look for "drift between KV write and status patch" — no transaction across those two. If the KV
write succeeds and the status patch fails, the next reconcile retries both and the no-op guard sees the mismatch and re-writes. Fine for v0.
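
The partial-failure case can be walked through with an in-memory model. This sketch assumes the no-op guard compares the desired score with the observed status string; `World` and its fields are hypothetical stand-ins for the KV bucket and the status subresource, not types from the operator.

```rust
/// In-memory stand-ins for the two side effects of a reconcile.
#[derive(Default)]
struct World {
    kv_value: Option<String>,       // value stored under the device key
    observed_score: Option<String>, // models .status.observedScoreString
    kv_writes: u32,                 // counts KV puts, to show the retry
    fail_next_status_patch: bool,   // test hook simulating an API error
}

/// Returns Ok(true) if work was done and Ok(false) if the guard skipped it.
fn apply(world: &mut World, desired: &str) -> Result<bool, String> {
    // No-op guard: the status already reflects the desired score.
    if world.observed_score.as_deref() == Some(desired) {
        return Ok(false);
    }
    // Step 1: KV write. No transaction ties this to step 2.
    world.kv_value = Some(desired.to_string());
    world.kv_writes += 1;
    // Step 2: status patch, which may fail independently.
    if world.fail_next_status_patch {
        world.fail_next_status_patch = false;
        return Err("status patch failed".to_string());
    }
    world.observed_score = Some(desired.to_string());
    Ok(true)
}
```

After a failed status patch the guard still sees a mismatch, so the next reconcile repeats both steps; the duplicate KV write is redundant but harmless, since it stores the same value.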

ContainerRuntime surface area. Three methods, no networks/volumes/stacks. Doc comment on the trait explains why — Docker likely fits without
change, Containerd/CRI-O need a separate capability. If a reviewer argues for Docker/Containerd compat today, push back: ROADMAP §6.4
explicitly says the capability must be a "real industry concept, not a tool" but the PostgreSQL exception applies here (the Score author
writes container-runtime-specific configs).

Agent reconcile loop. reconciler.rs::apply compares the incoming serialized JSON byte-string to the last-seen before dispatching — ROADMAP
§5.5 "change detection via string comparison (not content hash), cheap, deterministic." The 30s run_periodic tick re-runs every cached score
so podman rm outside the agent self-heals. remove() iterates the last-seen score's services; if the agent restarts after a delete it'll log
"unknown key — nothing to remove" which is correct.

Inventory::from_localhost. Minimal — single PhysicalHost in worker_host, hostname as a label, one synthetic CPU, one synthetic MemoryModule.
Everything else empty. If a Score later reaches for inventory.firewall_mgmt or inventory.switch it'll get the ManualManagementInterface /
empty vec, which is correct for a single-host topology.

Smoke test coverage. The test asserts (a) CRD validates, (b) CEL rejects a typo discriminator, (c) operator reconciles (KV put + status
set), (d) agent reconciles (container runs, curl passes), (e) delete propagates (KV gone, container gone). It does not test multi-device,
drift recovery, agent restart, or NATS restart. Those are v0.1 per the roadmap.


Intentional trade-offs the reviewer might flag

  • No auth between operator and NATS. Deferred to v0.2. Roadmap §5.6 explicitly covers this ("Same k8s cluster, same namespace. Network
    trust.").
  • Operator doesn't parse score.data. That's the feature, not a bug — §6.1 generic router. Validation is agent-side via typed IotScore
    deserialise.
  • String-compare idempotency rather than structural equality. §5.5 explicit choice, removes hashing-algorithm risk.
  • x-kubernetes-preserve-unknown-fields on one field. See the commit 4 planning notes — ArgoCD Application.spec.source.helm.valuesObject uses
    the same pattern.
  • Podman user socket, not shell-out. Directly per partner-feedback context in ROADMAP §5.3 / context_conversation.md Phase 5 ("Use
    podman-api Rust crate, not shell-out").
  • No Quadlet. Explicit §4 deferral to v0.1.
  • MANAGED_BY_LABEL only on containers, not on KV keys. The key namespace is <device>.<name>; no need for a second label there.
  • Graceful stop timeout = 5 min. ROADMAP §5.6, not kubelet-style draining.
  • is_not_found matches the error string. podman-api doesn't cleanly expose 404 variants; string match is documented in the function doc.
  • dad/ directory in repo root. Untracked, not part of this branch — looks like a stray worktree copy. Not mine; flag separately.

Not in this branch (explicit v0 scope cuts, per ROADMAP §4)

  • Zitadel / OpenBao / real auth — v0.2
  • Multiple devices — v0.1
  • Inventory reporting from agent — v0.1
  • Status aggregation in operator CRD — v0.1
  • Log streaming over NATS — v0.1
  • Real Pi testing — roadmap Sunday item (we tested against localhost podman on Arch)
  • Power cycle / network-out / crash-loop testing — ROADMAP §8 Monday item
  • Installer script (A3) — Saturday item, not merged here

How to verify locally

# Prerequisite: rootless podman user socket (one-time)
systemctl --user enable --now podman.socket

# Build everything
./build/check.sh # or at least cargo check --all-targets --all-features

# aarch64 cross-compile sanity
cargo build --target aarch64-unknown-linux-gnu \
  -p harmony --features podman \
  -p iot-agent-v0 -p iot-operator-v0

# End-to-end smoke (~1 minute)
./iot/scripts/smoke-a1.sh
# KEEP=1 leaves NATS + k3d cluster up for manual poking afterwards

The smoke test teardown trap removes the demo container unconditionally, even with KEEP=1, so it won't leave a rogue nginx on host:8080.

One reviewer-requested thing I'd like a second opinion on

The ContainerRuntime trait has no notion of networks or volumes. For Docker parity it'd need both eventually. Two options when we get there:
(1) grow the trait, (2) add sibling capability traits (ContainerNetworks, ContainerVolumes) that a topology opts into separately. The
latter keeps Pi-sized topologies from having to implement stubs. Not a blocker for v0; worth an eventual decision.
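
To make option (2) concrete, here is a small sketch of the sibling-trait shape. The three `ContainerRuntime` method names come from this PR, but the signatures are simplified, synchronous assumptions; `ContainerNetworks::ensure_network`, `PiTopology`, and `ServerTopology` are hypothetical names used only for illustration.

```rust
/// Core capability with simplified, assumed signatures.
trait ContainerRuntime {
    fn ensure_service_running(&mut self, name: &str) -> Result<(), String>;
    fn remove_service(&mut self, name: &str) -> Result<(), String>;
    fn list_managed_services(&self) -> Vec<String>;
}

/// Sibling capability that a topology opts into separately.
trait ContainerNetworks {
    fn ensure_network(&mut self, name: &str) -> Result<(), String>;
}

/// A Pi-sized topology implements only the core capability, with no stubs.
#[derive(Default)]
struct PiTopology {
    services: Vec<String>,
}

impl ContainerRuntime for PiTopology {
    fn ensure_service_running(&mut self, name: &str) -> Result<(), String> {
        if !self.services.iter().any(|s| s == name) {
            self.services.push(name.to_string());
        }
        Ok(())
    }
    fn remove_service(&mut self, name: &str) -> Result<(), String> {
        self.services.retain(|s| s != name);
        Ok(())
    }
    fn list_managed_services(&self) -> Vec<String> {
        self.services.clone()
    }
}

/// A larger topology reuses the core behaviour and adds networks.
#[derive(Default)]
struct ServerTopology {
    runtime: PiTopology,
    networks: Vec<String>,
}

impl ContainerRuntime for ServerTopology {
    fn ensure_service_running(&mut self, name: &str) -> Result<(), String> {
        self.runtime.ensure_service_running(name)
    }
    fn remove_service(&mut self, name: &str) -> Result<(), String> {
        self.runtime.remove_service(name)
    }
    fn list_managed_services(&self) -> Vec<String> {
        self.runtime.list_managed_services()
    }
}

impl ContainerNetworks for ServerTopology {
    fn ensure_network(&mut self, name: &str) -> Result<(), String> {
        if !self.networks.iter().any(|n| n == name) {
            self.networks.push(name.to_string());
        }
        Ok(())
    }
}

/// A Score that needs only containers asks for the core capability.
fn deploy_plain<T: ContainerRuntime>(t: &mut T, service: &str) -> Result<(), String> {
    t.ensure_service_running(service)
}

/// A Score that needs networks says so in its bound.
fn deploy_with_network<T: ContainerRuntime + ContainerNetworks>(
    t: &mut T,
    service: &str,
    network: &str,
) -> Result<(), String> {
    t.ensure_network(network)?;
    t.ensure_service_running(service)
}
```

With this shape, calling `deploy_with_network` on a `PiTopology` is a compile error rather than a runtime "not supported" stub, which is the main argument for option (2).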

johnride added 4 commits 2026-04-20 13:40:43 +00:00
The agent now finishes the walking-skeleton thread end-to-end: a Deployment
CR applied in the central cluster flows through the operator into NATS KV,
the agent reconciles it into a running container on the host, and deletion
(or drift) runs through the same loop in reverse.

Key additions:

- `domain::topology::ContainerRuntime` — new capability trait for
  node-level container runtimes with `ensure_service_running` /
  `remove_service` / `list_managed_services`. Intentional scope doc
  notes Docker likely fits, Containerd/CRI-O likely need a separate
  capability; no attempt to generalise further up front. `ContainerSpec`
  carries a `MANAGED_BY_LABEL` so `list_managed_services` can filter
  out containers Harmony didn't create.
- `modules::podman::PodmanTopology` (feature-gated behind `podman`)
  implements both `Topology` and `ContainerRuntime` over
  `podman_api::Podman` on the local user socket. Handles image pull,
  create/start, drift-triggered recreate, and a 5-minute graceful stop
  per ROADMAP §5.6.
- `modules::podman::PodmanV0Interpret::execute` is no longer a stub —
  its bound is tightened to `T: Topology + ContainerRuntime` and it
  dispatches each `PodmanService` to the capability. `IotScore` /
  `PodmanV0Score` carry the same bound so agent code calls
  `Score::create_interpret` cleanly.
- `domain::inventory::Inventory::from_localhost()` — minimal
  single-host inventory (hostname as label, logical CPU count, total
  memory). Pulls in `sysinfo 0.30` (already a transitive dep via
  `harmony_inventory_agent`).
- `iot-agent-v0` rewired around a `Reconciler` that owns the topology
  + inventory + a `HashMap<key, (serialized_score, parsed_score)>`
  cache. KV Put → dispatch iff the serialized score changed
  (ROADMAP §5.5 string-compare). KV Delete/Purge → tear down the
  cached score's containers. Separate 30s reconcile tick re-runs
  every cached score against podman (ROADMAP §5.6 "polls podman
  every 30s as ground truth; KV watch events are accelerators").
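
The Reconciler's cache behaviour can be sketched as follows. This is a simplified model under stated assumptions: the cache here stores only the serialized score (the real one also keeps the parsed score), and `Action` is a hypothetical enum standing in for the actual dispatch to the topology.

```rust
use std::collections::HashMap;

/// What the reconciler would do for an event; hypothetical, for illustration.
#[derive(Debug, PartialEq)]
enum Action {
    Dispatch,   // score changed: run it against the runtime
    Skip,       // identical serialized score: nothing to do
    Remove,     // tear down the cached score's containers
    UnknownKey, // delete for a key we never saw (e.g. after a restart)
}

#[derive(Default)]
struct Reconciler {
    // key -> last-seen serialized score
    cache: HashMap<String, String>,
}

impl Reconciler {
    /// KV Put: dispatch only if the serialized string differs from the last one seen.
    fn apply(&mut self, key: &str, serialized: &str) -> Action {
        if self.cache.get(key).map(String::as_str) == Some(serialized) {
            return Action::Skip;
        }
        self.cache.insert(key.to_string(), serialized.to_string());
        Action::Dispatch
    }

    /// KV Delete/Purge: remove based on the cached entry, if any.
    fn remove(&mut self, key: &str) -> Action {
        match self.cache.remove(key) {
            Some(_) => Action::Remove,
            None => Action::UnknownKey,
        }
    }

    /// Periodic tick: every cached key is re-run regardless of KV events.
    fn tick(&self) -> Vec<&str> {
        let mut keys: Vec<&str> = self.cache.keys().map(String::as_str).collect();
        keys.sort();
        keys
    }
}
```

Because the comparison is on the raw string, two payloads that differ only in whitespace count as a change and trigger a dispatch; that is the cost accepted in exchange for a cheap, deterministic check.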

Smoke test (`iot/scripts/smoke-a1.sh`) extended with phase 3b
(builds + starts agent) and phase 4b (verifies the container is
running and `curl http://127.0.0.1:8080/` returns nginx). Phase 5
now also asserts the container is gone after CR delete. PASS locally
against a fresh k3d + NATS podman container + rootless podman on the
dev host. aarch64 + x86_64 cross-compile stay green.
Adds the plumbing so Harmony can both provision a VM to stand in for a
fleet device and (re)configure any Linux host to join the fleet. The
walking skeleton's "VM-as-device" test path needs all three pieces:

- `domain::topology::HostConfigurationProvider` — new capability trait
  with `ensure_package`, `ensure_user`, `ensure_file`,
  `ensure_systemd_unit`, `restart_service`, `ensure_linger`,
  `ensure_user_unit_active`, and a reachability `ping`. Returns
  `ChangeReport { changed: bool }` so callers can reconcile-restart only
  when something actually changed. Trait doc calls out the narrow scope
  (not a general CM replacement) and the swappability story.

- `modules::linux::AnsibleHostConfigurator` + `LinuxHostTopology` —
  concrete impl that shells out to `ansible-playbook --stdout-callback
  json`, one play per trait method, parsing the JSON for the task's
  `changed` flag. Deliberately the laziest reasonable adapter: when
  Ansible's error surface becomes painful, this is the piece we replace
  with a Rust-native impl behind the same trait, with zero Score churn.
  Runtime requirement: `ansible-playbook` (>= 2.15) on the Harmony
  runner host.

- `modules::kvm::KvmVmScore` + cloud-init seed ISO generation — thin
  Score that wraps `KvmExecutor::ensure_vm` with a generated cloud-init
  seed ISO (hostname + authorized SSH key + sudoer user, nothing more).
  Uses `xorriso -as mkisofs` to build the ISO; returns the booted VM's
  IP. Docs note cloud-init is strictly for the VM test rig — customer
  Pi deployments go through rpi-imager / PXE instead. New `KvmHost`
  capability + `KvmHostTopology` expose the underlying `KvmExecutor`.

- `modules::iot::IotDeviceSetupScore` — customer-facing Score bound to
  `T: Topology + HostConfigurationProvider`. Installs podman +
  systemd-container, creates the `iot-agent` system user with linger,
  activates user podman.socket, uploads the agent binary via a
  base64-in-tmpfile + oneshot unit pattern (docstring flags this as a
  v0.1 candidate for a proper remote-fetch), writes
  `/etc/iot-agent/config.toml` and the systemd unit, and restarts only
  if any of the config/unit/binary-install tasks reported changes.
  Re-running with a different `group` rewrites the TOML and bounces
  the agent.
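
A minimal sketch of the "restart only when something changed" idea behind `ChangeReport`. The `merge` and `needs_restart` helpers are hypothetical names used for illustration, not the trait's actual API:

```rust
/// Outcome of one idempotent primitive: did it modify the host?
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct ChangeReport {
    pub changed: bool,
}

impl ChangeReport {
    /// Combine two reports: the result is "changed" if either step was.
    /// (Hypothetical helper, not confirmed by the PR.)
    pub fn merge(self, other: ChangeReport) -> ChangeReport {
        ChangeReport {
            changed: self.changed || other.changed,
        }
    }
}

/// Decide whether the agent needs a restart after a batch of steps
/// (for example config, unit file, and binary install).
pub fn needs_restart(reports: &[ChangeReport]) -> bool {
    reports
        .iter()
        .copied()
        .fold(ChangeReport::default(), ChangeReport::merge)
        .changed
}
```

With this shape, a re-run where every step reports `changed: false` skips the restart entirely.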

Scope note: this turn stops at one VM. Multi-VM + group routing is the
next step — `group` in the config is a label that the agent will carry
into its status bucket, but `Deployment.spec.targetGroups` isn't wired
anywhere yet. `smoke-a3.sh` (VM-as-device end-to-end) lands in the
next commit.
- New binary crate `examples/iot_vm_setup` — composes the two Scores
  from the previous commit (`KvmVmScore`, `IotDeviceSetupScore`) with
  `KvmHostTopology` + `LinuxHostTopology`. CLI flags cover everything
  a customer-facing "onboard this VM" invocation would need (device
  id, group, NATS URL+creds, SSH key paths, cloud image path, agent
  binary path). `--only-vm` skips the setup step when iterating on VM
  provisioning.

- `iot/scripts/smoke-a3.sh` — end-to-end smoke that stands up a NATS
  podman container, builds the iot-agent, runs the example, and waits
  for the VM's agent to write its `status.<device-id>` key into the
  `agent-status` KV bucket. Preflight fails fast with copy-paste
  commands when any of `virsh`, `xorriso`, `ansible-playbook`, the
  Ubuntu cloud image, or an SSH keypair is missing — the script does
  not try to self-bootstrap these (would turn a 90-second smoke into a
  ~20-minute download-and-generate session).

- Clippy cleanups: redundant closure + useless `format!`s.
refactor(linux): ansible ad-hoc mode + self-installing venv
All checks were successful
Run Check Script / check (pull_request) Successful in 2m20s
1577348dbb
Rewrites AnsibleHostConfigurator to avoid the two coupling points that
last year's Kubespray investigation taught us to stay away from: YAML
playbook generation and Ansible inventory.

- **No more YAML, no more inventory files.** Every primitive is now one
  or two `ansible all -i '<ip>,' -m <module> -a '<json>'` ad-hoc
  invocations. JSON args go straight through Ansible's own module
  interface; the tmpfile-playbook-and-inventory dance is gone entirely.
  Harmony owns 100% of orchestration, Ansible owns only per-host
  idempotent module execution. `ensure_systemd_unit` collapses to two
  ad-hoc calls (copy + systemd) rather than a multi-task playbook.
  `ensure_linger` sentinels change-state through the shell module's
  stdout since ad-hoc has no `changed_when`.

- **Self-installing venv.** New `modules::linux::ansible_venv`:
  `ensure_ansible_venv()` creates `$HARMONY_DATA_DIR/ansible-venv/` via
  `python3 -m venv` + `pip install ansible-core==2.17.*` on first use,
  cached via `tokio::sync::OnceCell`. No more "install ansible before
  running Harmony" step — python3 + venv is the only host requirement,
  and we print the exact package names for Arch/Debian/Fedora when
  python is missing.

- **smoke-a3.sh**: drop `ansible-playbook` from preflight, add
  `python3`. Example gains `--bootstrap-ansible-only` for warming the
  venv ahead of the real run (turns a ~60s first-run smoke into
  deterministic sub-second after bootstrap).

Output parsing uses the `oneline` callback (`host | VERB => {json}`),
which is trivial to split without a regex and treats FAILED!/UNREACHABLE!
as errors. SSH control sockets are pinned under
`$HARMONY_DATA_DIR/ansible-cp` so multiple Harmony processes don't race in /tmp.
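
For illustration, splitting the `host | VERB => {json}` shape could look roughly like the sketch below. `OnelineResult` and `parse_oneline` are hypothetical names, and the JSON payload is left as a raw string slice rather than deserialized:

```rust
/// One parsed line of Ansible's `oneline` callback output.
#[derive(Debug, PartialEq, Eq)]
pub struct OnelineResult<'a> {
    pub host: &'a str,
    pub verb: &'a str,
    pub json: &'a str,
}

/// Split `host | VERB => {json}` without a regex.
/// Lines whose verb is FAILED! or UNREACHABLE! are returned as errors.
pub fn parse_oneline(line: &str) -> Result<OnelineResult<'_>, String> {
    let (host, rest) = line
        .split_once(" | ")
        .ok_or_else(|| format!("no ` | ` separator in: {line}"))?;
    let (verb, json) = rest
        .split_once(" => ")
        .ok_or_else(|| format!("no ` => ` separator in: {line}"))?;
    let verb = verb.trim();
    if verb.starts_with("FAILED") || verb.starts_with("UNREACHABLE") {
        return Err(format!("{host}: {verb}: {json}"));
    }
    Ok(OnelineResult {
        host: host.trim(),
        verb,
        json: json.trim(),
    })
}
```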

Verified: `ensure_ansible_venv()` first call installs ansible-core
2.17.14 into the managed venv (~12s, network-bound); second call is
cache-fast (<50ms). Clippy + fmt clean, aarch64 cross-compile green.
johnride added 1 commit 2026-04-20 18:15:51 +00:00
fix(iot): end-to-end smoke-a3 greens; CI-ready
All checks were successful
Run Check Script / check (pull_request) Successful in 2m16s
63847ac059
Fixes surfaced by actually running the VM-as-device flow end to
end. Each delta is small and self-contained.

KvmVmScore + cloud-init:
- **Overlay disk**: VM now boots off a per-VM qcow2 backed by the base
  image instead of writing into the base in-place. Re-runs of the same
  vm_name reuse the overlay (idempotent); fresh runs wipe the overlay
  so cloud-init starts clean. Requires `qemu-img`.
- **UUID instance-id**: cloud-init's meta-data now carries a fresh
  UUID per seed build, so when the overlay gets recreated cloud-init
  treats it as a first boot and re-runs all per-instance modules.
  Without this, repeated runs silently skipped user/hostname/ssh setup.
- **xorriso deadlock**: `.status()` with piped stderr filled the pipe
  buffer and blocked the child on write; switched to `.output()` which drains
  both. Also unlink any pre-existing seed ISO before running xorriso,
  since it otherwise treats the file as overwriteable input "media"
  and aborts with exit 32.
- **wait_for_ip**: 180s → 300s. First boot of a cloud image on a
  constrained runner (or CI worker) can take 2-4 minutes.
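
The pipe-buffer hang described in the xorriso item is a general `std::process` pitfall. A small sketch of the draining variant, where `run_and_drain` is a hypothetical wrapper name:

```rust
use std::io;
use std::process::{Command, Output};

/// Run a child to completion while draining stdout and stderr.
///
/// `Command::output()` pipes both streams and reads them until the child
/// exits, so a chatty child cannot fill the pipe buffer and stall.
/// By contrast, setting `.stderr(Stdio::piped())` and then calling
/// `.status()` only waits: nobody reads the pipe, and the child blocks
/// once the kernel buffer is full.
pub fn run_and_drain(program: &str, args: &[&str]) -> io::Result<Output> {
    Command::new(program).args(args).output()
}
```

A child that writes far more than a typical pipe buffer to stderr, such as `sh -c 'head -c 200000 /dev/zero >&2'`, completes under this wrapper.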

Ansible adapter — a half-dozen sharp corners of ad-hoc mode that only
show up in a live run:
- **`--ssh-common-args=VALUE`** (equals form, single token). Separate
  `--ssh-common-args VALUE` form has ansible's argparse re-interpret
  the `-o …` inside the value as its own `-o` flag and dump a help
  screen. Lost an afternoon to this decades ago on another project.
- **Skip `-a` when empty**: `-a '{}'` trips ansible-core 2.17's "extra
  params" check on parameterless modules like `ping`. Pass no `-a`
  when the JSON dict is empty.
- **`ANSIBLE_LOAD_CALLBACK_PLUGINS=True`**: ad-hoc mode silently
  ignores `ANSIBLE_STDOUT_CALLBACK` without this. Default callback
  produces multi-line JSON that's fragile to parse.
- **`ANSIBLE_PIPELINING=True`**: required when `become`-ing an
  unprivileged user (iot-agent for the user-scope podman.socket),
  otherwise ansible's temp-file shuffle falls back to an ACL chmod
  syntax no Linux distro accepts.
- **Parse shell/command oneline shape**: oneline callback emits
  `host | VERB | rc=N | (stdout) … | (stderr) …` for shell-style
  modules in addition to the `host | VERB => {json}` shape. Parser
  now handles both and synthesises a JSON payload from the shell form.
- **Auto-create parent dir in ensure_file**: ansible's `copy` module
  won't create `/etc/iot-agent/` for you; a `file state=directory`
  call before every `copy` is idempotent and cheap.
- **ensure_package uses apt directly**: `ansible.builtin.package` is
  distro-agnostic but doesn't auto-run `apt update`, so a fresh cloud
  image fails with "no package matching". Switched to
  `ansible.builtin.apt` with `update_cache=true, cache_valid_time=3600`.
  Debian-family only for v0 (ROADMAP §5.3); RHEL switch is a future
  capability refinement.
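
Several of these corners concern how the ad-hoc command line is assembled. A rough sketch under stated assumptions: `AdHocInvocation` and `build_adhoc` are hypothetical names, and the SSH option in the usage note is only an example value:

```rust
/// argv and environment for one `ansible` ad-hoc call.
pub struct AdHocInvocation {
    pub args: Vec<String>,
    pub env: Vec<(&'static str, &'static str)>,
}

pub fn build_adhoc(
    host: &str,
    module: &str,
    module_args_json: &str,
    ssh_opts: &str,
) -> AdHocInvocation {
    let mut args = vec![
        "all".to_string(),
        "-i".to_string(),
        format!("{host},"), // trailing comma marks an inline inventory
        "-m".to_string(),
        module.to_string(),
        // Equals form, single token, so the `-o ...` inside the value is
        // not re-parsed as a separate flag.
        format!("--ssh-common-args={ssh_opts}"),
    ];
    // Skip `-a` entirely for parameterless modules such as `ping`.
    let trimmed = module_args_json.trim();
    if !(trimmed.is_empty() || trimmed == "{}") {
        args.push("-a".to_string());
        args.push(trimmed.to_string());
    }
    AdHocInvocation {
        args,
        env: vec![
            ("ANSIBLE_LOAD_CALLBACK_PLUGINS", "True"),
            ("ANSIBLE_STDOUT_CALLBACK", "oneline"),
            ("ANSIBLE_PIPELINING", "True"),
        ],
    }
}
```

For example, `build_adhoc("10.0.0.5", "ping", "{}", "-o StrictHostKeyChecking=no")` yields an argv with no `-a` at all.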

HostConfigurationProvider surface:
- **`FileSpec.source: FileSource`**: new `Content(String)` vs
  `LocalPath(PathBuf)`. LocalPath ships binary files over SFTP via
  ansible's native mechanism instead of passing base64 content through
  argv (which hit ARG_MAX on the ~10MB agent). This replaces the whole
  base64-in-tmpfile + oneshot install-unit dance in
  IotDeviceSetupScore — the binary now installs in a single idempotent
  `ensure_file` call that reports `changed` only when bytes differ.
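
A sketch of how the two `FileSource` variants might map onto the `copy` module's `content` and `src` parameters. The `path` field and the `copy_arg` helper are assumptions made for illustration; the actual `FileSpec` likely carries more fields:

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileSource {
    /// Small text payloads (TOML, unit files) passed inline.
    Content(String),
    /// Files on the runner host, transferred by Ansible itself.
    LocalPath(PathBuf),
}

#[derive(Debug, Clone)]
pub struct FileSpec {
    pub path: String, // destination on the target host (assumed field)
    pub source: FileSource,
}

/// Pick which `copy` argument carries the payload.
pub fn copy_arg(spec: &FileSpec) -> (&'static str, String) {
    match &spec.source {
        FileSource::Content(text) => ("content", text.clone()),
        FileSource::LocalPath(p) => ("src", p.display().to_string()),
    }
}
```

Only a path crosses argv in the `LocalPath` case, which is what sidesteps the ARG_MAX limit.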

IotDeviceSetupScore:
- Dropped the base64 + oneshot install machinery (80 fewer lines).
- Dropped the explicit primary `group:` on ensure_user — Debian-family
  useradd auto-creates a group matching the username; setting `group:`
  required pre-creating it.

smoke-a3.sh: builds iot-agent-v0 `--release` instead of debug (400MB
debug binary filled the VM's thin-provisioned 3.5GB cloud rootfs).

Verified end-to-end three times on this host:
  run 1: 9 changes  (fresh install — package install, user create, binary, config, restart)
  run 2: 0 changes  (true NOOP — `already configured`)
  run 3: 2 changes  (group swap — only TOML + agent restart)
Agent reports status.iot-smoke-vm into NATS after each run.
johnride reviewed 2026-04-20 20:50:51 +00:00
johnride left a comment
Author
Owner

Halfway through the review: many small things and a few bigger things to fix. Overall not terrible. But take the time to step back, clearly understand the code review, revisit the entire PR with the comments in mind, and improve it.

@@ -0,0 +25,4 @@
/// deliberately Ansible-agnostic so a Rust-native impl can be dropped in
/// later without Score changes.
#[async_trait]
pub trait HostConfigurationProvider: Send + Sync {
Author
Owner

I have some doubts about this trait.

First, it is Linux-specific because of systemd and other Linux-specific references (which is not a problem apart from the naming).

I also feel it is packing many things into a single interface, and it is very likely to cause interface segregation and LSP (Liskov substitution) problems.

Author
Owner

Also, many of the things this does require sudo (like creating a user), which might end up in cloud-init for one reason or another. I think it could be misleading to have a single trait with implementations all over the place, from cloud-init to calling an Ansible module to install a package, etc.

@@ -0,0 +35,4 @@
pub struct IotDeviceSetupConfig {
/// Stable device identifier. Written into the agent's TOML and used
/// as the KV key prefix (`<device_id>.<deployment>`).
pub device_id: String,
Author
Owner

This could very well use a harmony id from harmony_types. I like the id format, it is unique, relatively short and contains a timestamp.

@@ -0,0 +52,4 @@
impl IotDeviceSetupConfig {
/// Render the agent's `/etc/iot-agent/config.toml` content.
pub fn render_toml(&self) -> String {
Author
Owner

This is ugly. Use a long string with format! or askama templates like we do in other places.

@@ -0,0 +87,4 @@
/// Render the systemd unit file content.
pub fn render_systemd_unit(&self) -> String {
String::from(
"[Unit]
Author
Owner

Are there alternatives to systemd unit files to make sure it restarts on reboot? What about running the agent itself as a podman container? It would probably have to be privileged but that would reduce the configuration burden on the host and centralize our logic around podman instead of spreading it to systemd.

@@ -0,0 +45,4 @@
cfg: &CloudInitSeedConfig<'_>,
output_dir: &Path,
) -> Result<PathBuf, KvmError> {
if which_xorriso().await.is_none() {
Author
Owner

Why do we need that? It is yet another dependency? Can't we pass the cloud init any other way when creating the vm? Do we need that for cloud init as I'm assuming or something else?

@@ -0,0 +121,4 @@
}
fn render_user_data(cfg: &CloudInitSeedConfig<'_>) -> String {
let mut s = String::new();
Author
Owner

unreadable crap

@@ -0,0 +15,4 @@
/// `VirtualMachineHost` capability can be introduced then, and we'll
/// either implement it *in terms of* `KvmHost` or drop `KvmHost`
/// altogether.
pub trait KvmHost {
Author
Owner

I tend to disagree with the comment that we need to be tool specific here.

All we really need is :

a vm with given cpu, iso, ip address, storage size, cloud init . This is absolutely not tool specific. The tool specific details can be hidden inside the kvm implementation of the VMHost capability.

The way I see it VirtualMachineHost capability trait should have a few methods like :

list_vms() -> Vec<VirtualMachine>
ensure_vm(VirtualMachine)
delete_vm(VirtualMachine)
get_vm_info(VirtualMachine) -> VirtualMachineRuntimeInfo // ip address, network method, hypervisor name etc

I don't see what cannot work with kvm/virtualbox/vmware/proxmox/openstack right here. I know it is limited but it's fine. Most users don't care about the details, especially for the original intended use here of CI runners and development environments.

@@ -0,0 +43,4 @@
}
async fn ensure_ready(&self) -> Result<PreparationOutcome, PreparationError> {
// The executor holds the URI — a cheap hypervisor-version query is
Author
Owner

This is tricky and is actually an architectural problem in harmony. Ensure ready is executed eagerly on a topology, but this topology won't necessarily be running a kvm related workload this run and/or might be doing the kvm setup in an earlier task before calling the kvm dependent scores. I am pretty sure we have a ROADMAP entry on that topic of topology initialization dependency. Add a reference here to the roadmap entry with a TODO so we capture this use case when we get to this work.

@@ -0,0 +66,4 @@
/// if a domain with this name already exists.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct KvmVmScore {
pub config: CloudInitVmConfig,
Author
Owner

This feels straight up wrong. KvmVmScore (sounds generic) depending on CloudInit (specific) is a crime against good architecture and naming.

@@ -0,0 +133,4 @@
tokio::fs::create_dir_all(&cfg.seed_output_dir)
.await
.map_err(|e| InterpretError::new(format!("create seed dir: {e}")))?;
let status = Command::new("qemu-img")
Author
Owner

Use args vec, more readable. Also true for all other Command::new calls.

@@ -0,0 +9,4 @@
//! `podman-api` over shelling to `podman` elsewhere — use the mature
//! upstream where it's mature (apt/systemd/user module idempotency),
//! don't adopt its orchestration model (playbooks, inventory, YAML
//! templating, the Kubespray mess).
Author
Owner

Don't insult other projects, kubespray is cool and suits a purpose.

@@ -0,0 +58,4 @@
// keeps re-runs cheap: the update is skipped if the cache was
// refreshed within the last hour.
//
// When we grow RHEL-family support, switch on the distro
Author
Owner

This comment does not feel correct. `ensure_package` is distribution-agnostic. For now we choose to support only Debian as this is our first concrete target, but this may change soon and the encapsulation is correct: choosing the correct tool based on the distribution is this function's burden. We might move that to the topology and separate the topologies more granularly between Debian, RHEL, and others, but at this moment I think this would be wrong.

There is complexity involved though, as very often (most of the time?) packages providing a utility have different names and ideas depending on the distribution family. For example, installing qemu + kvm tooling means different package names on the Red Hat family than on the Debian family or the Arch family. We could easily provide a nice cross-distribution score to install the qemu/kvm dependencies. That would probably be cleanest. So right here we have to work towards that. And I confirm using Ansible to perform the actual installation is correct, as we leverage Ansible's strength in low-level module idempotency.

@@ -0,0 +108,4 @@
host: IpAddress,
creds: &SshCredentials,
spec: &FileSpec,
) -> Result<ChangeReport, ExecutorError> {
Author
Owner

I'd like to see here a proper Rust struct for the file module config, built as proper Rust code and simply serialized to the Ansible module format, instead of a bunch of `json!` invocations feeling very fragile. We should research whether Ansible Rust crates already exist. I doubt we will find mature and solid ones, but it's worth looking; we could be surprised.

Author
Owner

The file spec is probably the correct one, all that is lacking is a function to serialize it directly to ansible file module json.

@@ -0,0 +170,4 @@
spec: &SystemdUnitSpec,
) -> Result<ChangeReport, ExecutorError> {
// Step 1: write the unit file.
let (unit_path, scope_user) = match &spec.scope {
Author
Owner

I was not convinced that we should use ansible for the file copy but here for the systemd unit I am sure that we want to use the ansible builtin systemd module https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/systemd_module.html

That note applies as well to the podman setup. It might make a lot of sense to use the podman ansible module to deploy the iot containers. https://docs.ansible.com/projects/ansible/latest/collections/containers/podman/index.html

This is also true for all other non trivial installation tasks. ansible is great at running commands and ensuring a package/service/file is installed correctly.

@@ -0,0 +256,4 @@
.await
}
pub async fn ensure_linger(
Author
Owner

feels like ansible should have a purpose built module here too. Don't be lazy and call shell commands. Calling shell commands through ansible is completely useless as they're not truly idempotent in the way that file and package installations are.

@@ -0,0 +377,4 @@
let mut cmd = Command::new(&bins.ansible);
cmd.arg("all")
.arg("-i")
Author
Owner

args vec more readable

johnride added 15 commits 2026-04-21 19:04:55 +00:00
The smoke test now runs end-to-end against a pristine host with only
generic deps installed (libvirt, qemu, xorriso, python3, podman,
cargo, kubectl) — no manual ISO downloads, ssh-keygen rituals, or
chmod dances. Pairs with a hard power-cycle recovery phase that
matches ROADMAP §8's "power cycle test" shape.

Harmony-side bootstrap (all under $HARMONY_DATA_DIR/iot/):

- `modules::iot::assets` — SHA256-verified Ubuntu 24.04 cloud image
  download (cached, streaming via reqwest) + ed25519 SSH keypair
  generation. OnceCell-cached like `ensure_ansible_venv`.

- `modules::iot::libvirt_pool` — user-owned dir-backed libvirt
  storage pool at $HARMONY_DATA_DIR/iot/kvm/pool/. Per-VM overlay
  disks + seed ISOs land here; libvirt dynamic-ownership handles the
  libvirt-qemu chown transitions we used to do by hand. Pool is
  defined/built once via the `virt` crate inside a spawn_blocking,
  then auto-started + auto-autostarted on every process boot.

- `modules::iot::preflight::check_iot_smoke_preflight()` — fail-fast
  checks for every runner-host prereq (`virsh`, `qemu-img`, `xorriso`,
  `python3`, `ssh-keygen`, libvirt-group membership, default
  network active). Each missing piece surfaces with the Arch/Debian/
  Fedora install command inline.

KvmVmScore now owns these calls internally — `CloudInitVmConfig`
loses `base_image_path`, `seed_output_dir`, `authorized_key`. The
Score returns the SSH private-key path in its outcome details so the
caller can hand it straight to `LinuxHostTopology`.

smoke-a3.sh dropped from 125 lines of manual setup to a thin
orchestration script. Adds phase 5: `virsh destroy` + `sleep` +
`virsh start`, then a wall-clock gate that rejects any status writes
from before the reboot. Verified: real power-cycles produce
timestamps ~14s after the gate (agent boot + connect latency); the
gate catches in-flight writes that happen during destroy.
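
The wall-clock gate can be pictured as a simple timestamp filter. This is a sketch only: the real check lives in the shell script, and `first_write_after` is a hypothetical name:

```rust
use std::time::{Duration, SystemTime};

/// Return the earliest status write that is strictly newer than the gate.
/// Writes at or before the gate are treated as pre-reboot leftovers,
/// including in-flight writes that landed while the VM was being destroyed.
pub fn first_write_after(gate: SystemTime, writes: &[SystemTime]) -> Option<SystemTime> {
    writes.iter().copied().filter(|t| *t > gate).min()
}

/// Helper for building timestamps in examples.
pub fn at_secs(secs: u64) -> SystemTime {
    SystemTime::UNIX_EPOCH + Duration::from_secs(secs)
}
```

Recovery counts as proven only once this returns `Some(..)`.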

Verified end-to-end from a fully nuked `$HARMONY_DATA_DIR/iot/`:
- cold boot: downloads 600MB cloud image (~25s), generates SSH key,
  defines + starts libvirt pool, provisions VM, onboards device,
  verifies phase 5 power-cycle recovery
- warm boot: cache hits on all bootstrap steps; same end-to-end
  PASS in 2-3 minutes total

aarch64 cross-compile still green.
Structural changes (the biggest items from the review):

- `HostConfigurationProvider` split into five narrower capabilities:
  `HostReachable`, `PackageInstaller`, `FileDelivery`,
  `UnixUserManager`, `SystemdManager`. Each implementation now only
  implements what it can actually deliver — a future cloud-init /
  ignition / podman-agent backend can pick a subset without
  inheriting systemd assumptions it can't honour. Added an umbrella
  trait `LinuxHostConfiguration` blanket-impl'd for any type that
  has all five, so Scores keep a single bound.

- New `VirtualMachineHost` capability in domain/topology/: `list_vms`
  / `ensure_vm` / `delete_vm` / `get_vm_info`, with generic
  `VirtualMachineSpec` carrying a typed optional `VmFirstBootConfig`
  (hostname, admin user, authorized keys). `KvmHost` trait and
  `KvmHostTopology` deleted; `KvmVirtualMachineHost` is the
  concrete libvirt implementation. Cloud-init stays a KVM-impl
  detail — callers never see it.

- `KvmVmScore` + `CloudInitVmConfig` deleted; replaced by a generic
  `ProvisionVmScore` in `modules::iot::vm_score` bound to
  `T: VirtualMachineHost`. The Score itself has no knowledge of the
  hypervisor or its first-boot delivery mechanism.

- `IotDeviceSetupConfig.device_id` is now `harmony_types::id::Id`
  (timestamp-prefixed, sortable-by-creation, collision-safe).

- `ensure_ready` on `KvmVirtualMachineHost` is a Noop with a TODO
  pointing at ROADMAP/12-code-review-april-2026.md §12.1 (phased
  topology). Captures the concern about eagerly probing the
  hypervisor even when the current run doesn't need KVM.
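
The umbrella-trait pattern from the first item can be sketched as follows. The method signatures are simplified assumptions (synchronous, returning a bare "changed" flag); the real traits are async and richer. `FreshHost` and `count_changes` exist only for this example:

```rust
// Five narrow capabilities, simplified to sync methods returning "changed".
pub trait HostReachable { fn ping(&self) -> bool; }
pub trait PackageInstaller { fn ensure_package(&self, name: &str) -> bool; }
pub trait FileDelivery { fn ensure_file(&self, path: &str, content: &str) -> bool; }
pub trait UnixUserManager { fn ensure_user(&self, name: &str) -> bool; }
pub trait SystemdManager { fn ensure_unit(&self, unit: &str) -> bool; }

/// Umbrella trait: no methods of its own, just the sum of the five.
pub trait LinuxHostConfiguration:
    HostReachable + PackageInstaller + FileDelivery + UnixUserManager + SystemdManager
{
}

/// Blanket impl: any type with all five capabilities gets the umbrella.
impl<T> LinuxHostConfiguration for T where
    T: HostReachable + PackageInstaller + FileDelivery + UnixUserManager + SystemdManager
{
}

/// A Score-like function keeps a single bound.
pub fn count_changes<T: LinuxHostConfiguration>(host: &T) -> usize {
    if !host.ping() {
        return 0;
    }
    [
        host.ensure_package("podman"),
        host.ensure_user("iot-agent"),
        host.ensure_file("/etc/iot-agent/config.toml", "group = \"demo\"\n"),
        host.ensure_unit("iot-agent.service"), // unit name is illustrative
    ]
    .iter()
    .filter(|changed| **changed)
    .count()
}

/// Stand-in host on which every step reports a change.
pub struct FreshHost;
impl HostReachable for FreshHost { fn ping(&self) -> bool { true } }
impl PackageInstaller for FreshHost { fn ensure_package(&self, _: &str) -> bool { true } }
impl FileDelivery for FreshHost { fn ensure_file(&self, _: &str, _: &str) -> bool { true } }
impl UnixUserManager for FreshHost { fn ensure_user(&self, _: &str) -> bool { true } }
impl SystemdManager for FreshHost { fn ensure_unit(&self, _: &str) -> bool { true } }
```

A backend that cannot honour systemd simply does not implement `SystemdManager`, and so never satisfies the umbrella bound by accident.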

Code quality fixes from the line-level comments:

- `render_toml` / `render_systemd_unit` / `render_user_data`
  rewritten as `format!` with raw-string templates (no more
  push_str chains).

- Every `Command::new(…).arg().arg().arg()` chain in the touched
  files converted to `.args([…])`.

- Ansible module args are now typed Rust structs (`AptArgs`,
  `AnsibleFileArgs`, `AnsibleUserArgs`, `AnsibleCopyArgs`,
  `AnsibleSystemdArgs`, `AnsibleCommandArgs`, `AnsibleStatArgs`)
  serialized via `serde_json::to_value`. No more `json!` macros
  with ad-hoc string keys.

- `ensure_linger`: no more shell sentinel. Uses
  `ansible.builtin.stat` on `/var/lib/systemd/linger/<user>` for
  the idempotent change-state check, then `ansible.builtin.command
  loginctl enable-linger` only on miss. `loginctl` is required
  (not just `file state=touch`) because systemd-logind needs the
  dbus signal to actually start the user manager; a plain file
  touch doesn't wake it up and every subsequent `systemctl --user
  …` fails with "Failed to connect to bus". Documented in-place.

- `ensure_user_unit_active`: picks up the user's UID first via
  `ansible.builtin.command id -u <user>` and wraps the
  `systemctl --user enable --now <unit>` invocation in `env
  XDG_RUNTIME_DIR=/run/user/<UID>`. The systemd module's
  task-level `environment:` keyword isn't available in ad-hoc
  mode; this is the cleanest equivalent. Documented the
  inline-playbook path as a future option for when we get more
  task-level-env callsites.

- `ensure_package` comment clarified: distro dispatch is this
  function's job; Debian-family is the first concrete target and
  extending to RHEL/Fedora/Alpine is an implementation detail,
  not a capability change.

- Kubespray line removed.
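
As an illustration of the `format!` plus raw-string style mentioned for the render functions, a unit-file renderer might look like this. The unit contents, the parameters, and the `--config` flag are assumptions for the example, not the actual unit shipped by the Score:

```rust
/// Render a systemd unit from a single raw-string template.
/// All directives below are illustrative placeholders.
pub fn render_systemd_unit(binary_path: &str, config_path: &str) -> String {
    format!(
        r#"[Unit]
Description=IoT agent
After=network-online.target
Wants=network-online.target

[Service]
ExecStart={binary_path} --config {config_path}
Restart=always

[Install]
WantedBy=multi-user.target
"#
    )
}
```

The whole file is visible at a glance, which is the readability gain over a chain of `push_str` calls.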

Verified: from a primed `$HARMONY_DATA_DIR/iot/`, smoke-a3.sh
still completes all 5 phases (bootstrap + provision + 9 setup
changes + initial NATS status + power-cycle recovery).
Ansible's `command` module is a Python-wrapped SSH round trip with
zero added value when the operation isn't built around Ansible's
idempotency primitives. `russh` is already a workspace dep and
gives us the exit code + stdout + stderr in a typed struct, with
one round trip. Moving the two call sites that were using
`ansible.builtin.command` to russh directly:

- New `modules::linux::ssh_executor::ssh_exec(host, creds, cmd)`
  returning `SshCommandOutput { rc, stdout, stderr }`. Loads the
  private key via `russh::keys::load_secret_key`, authenticates,
  opens an exec channel, drains all `ChannelMsg` until the
  channel closes, returns the collected data. Draining past `Eof`
  matters: some sshd implementations emit `ExitStatus` *after*
  `Eof`, and an early break loses the rc.

- `ensure_linger`: `test -e /var/lib/systemd/linger/<user>` over
  russh for the check, then `sudo loginctl enable-linger <user>`
  only on miss. Two SSH round trips, no Ansible. Same semantics
  as the previous `stat` + `command` pair but without the Python
  hop.

- `ensure_user_unit_active`: `id -u <user>` + `sudo -u <user>
  env XDG_RUNTIME_DIR=/run/user/<uid> systemctl --user enable
  --now <unit>`. This is the case that couldn't be done cleanly
  via ad-hoc `ansible.builtin.systemd` in the first place because
  task-level `environment:` isn't available in ad-hoc; russh makes
  it a one-liner.
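
The "drain past `Eof`" point can be shown without a live SSH session. In this sketch, `ChannelEvent` is a stand-in enum loosely modelled on the kinds of messages an exec channel delivers, not the russh type itself, and the field types of `SshCommandOutput` are assumed:

```rust
/// Stand-in for the messages an SSH exec channel can deliver.
#[derive(Debug, Clone)]
pub enum ChannelEvent {
    Data(Vec<u8>),         // stdout bytes
    ExtendedData(Vec<u8>), // stderr bytes
    Eof,
    ExitStatus(u32),
    Close,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct SshCommandOutput {
    pub rc: Option<u32>,
    pub stdout: String,
    pub stderr: String,
}

/// Collect everything until the channel closes.
/// `Eof` is deliberately not a stop condition: the exit status may
/// arrive after it, and breaking early would lose the return code.
pub fn collect_output(events: impl IntoIterator<Item = ChannelEvent>) -> SshCommandOutput {
    let mut out = SshCommandOutput::default();
    for event in events {
        match event {
            ChannelEvent::Data(b) => out.stdout.push_str(&String::from_utf8_lossy(&b)),
            ChannelEvent::ExtendedData(b) => out.stderr.push_str(&String::from_utf8_lossy(&b)),
            ChannelEvent::Eof => {} // keep reading
            ChannelEvent::ExitStatus(code) => out.rc = Some(code),
            ChannelEvent::Close => break,
        }
    }
    out
}
```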

Ansible still owns: `apt` (distro dispatch + cache), `user`
(idempotent account management), `copy` (file delivery with
content-diff change reporting), `file` (directory/mode), `systemd`
(daemon-reload + enable + start as one atomic call). Those are
where `ansible`'s value is real; `command` was a category error.

Verified: smoke-a3 PASS end-to-end — same 9-change initial setup,
NATS status, and power-cycle recovery as before.
Adds the type-safe arch dimension for the aarch64-on-x86_64
emulation work to follow. No behaviour change: every existing call
site gets `VmArchitecture::X86_64` via `Default`, and the XML
renderer (unchanged in this commit) emits the same bytes it
always did.

- `VmArchitecture { X86_64 (default), Aarch64 }` in
  domain/topology/virtualization.rs, with `as_str()` and
  `ubuntu_cloudimg_suffix()` helpers (Ubuntu uses `amd64`/`arm64`
  in filenames, not the `uname -m` spelling).
- `VirtualMachineSpec.architecture` + `#[serde(default)]` for
  on-disk compat.
- `VmConfig.architecture` + `VmConfig.firmware: Option<UefiFirmware>`
  in modules/kvm/types.rs. `UefiFirmware { code, vars }` is the
  typed pair libvirt's `<loader>` + `<nvram>` need for aarch64
  guests; x86_64 leaves it None. `VmConfigBuilder::architecture()`
  / `firmware()` setters added.
- `KvmVirtualMachineHost::ensure_vm` threads the arch through to
  VmConfig; firmware wiring is commit 3.
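
A sketch of what the enum and its two helpers could look like, based only on the description above (serde derives omitted to keep the example dependency-free):

```rust
/// Guest CPU architecture; defaults to x86_64 so existing call sites
/// keep their behaviour.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

impl VmArchitecture {
    /// The `uname -m` style spelling.
    pub fn as_str(self) -> &'static str {
        match self {
            VmArchitecture::X86_64 => "x86_64",
            VmArchitecture::Aarch64 => "aarch64",
        }
    }

    /// Ubuntu cloud image filenames use Debian arch names instead.
    pub fn ubuntu_cloudimg_suffix(self) -> &'static str {
        match self {
            VmArchitecture::X86_64 => "amd64",
            VmArchitecture::Aarch64 => "arm64",
        }
    }
}
```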

Re-exported: `VmArchitecture`, `UefiFirmware` from
`modules::kvm`. `VmArchitecture` is a type-alias re-export from
domain/topology so the arch enum lives in one place.

Verified: cargo check clean, fmt clean, aarch64 cross-compile of
harmony + iot crates still green.
Rewrites `domain_xml` to consume a resolved `DomainXmlParams`
(domain_type / arch / machine / emulator / cpu_block / firmware)
so per-arch branching happens once — at param resolution — and
the XML template itself stays a single readable format-string.

Per-arch values (from Linaro's "QEMU: A Tale of Performance
analysis" Jan 2025 for the aarch64 TCG knobs):

- **x86_64** → `<domain type='kvm'>` + machine `q35` + emulator
  `qemu-system-x86_64` + `<cpu mode='host-model'/>`. No firmware.
  (Unchanged — all existing XML still emits byte-identical output
  on the default arch.)

- **aarch64** → `<domain type='qemu'>` (TCG emulation), machine
  `virt`, emulator `qemu-system-aarch64`, custom CPU
  `<model>max</model>` with `<feature policy='require'
  name='pauth-impdef'/>`. MTTCG (`-accel tcg,thread=multi`) is
  the default in QEMU ≥ 9.1 so no libvirt-side knob is needed.
  UEFI via `<loader readonly='yes' type='pflash'>CODE</loader>`
  + `<nvram>VARS</nvram>` — a `UefiFirmware` pair is required
  (populated by `KvmVirtualMachineHost` in commit 3).
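
The "branch once, at param resolution" idea might be sketched like this. `Arch`, `for_arch`, and `needs_firmware` are simplifications invented for the example (the PR's entry point is `DomainXmlParams::for_vm` and it carries a firmware pair rather than a flag); the emulator is given as a bare binary name and the CPU XML is abbreviated:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arch {
    X86_64,
    Aarch64,
}

/// Everything the XML template needs, already resolved per architecture.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DomainXmlParams {
    pub domain_type: &'static str,
    pub arch: &'static str,
    pub machine: &'static str,
    pub emulator: &'static str,
    pub cpu_block: &'static str,
    pub needs_firmware: bool,
}

impl DomainXmlParams {
    /// All per-arch branching lives here; the template stays branch-free.
    pub fn for_arch(arch: Arch) -> Self {
        match arch {
            Arch::X86_64 => DomainXmlParams {
                domain_type: "kvm",
                arch: "x86_64",
                machine: "q35",
                emulator: "qemu-system-x86_64",
                cpu_block: "<cpu mode='host-model'/>",
                needs_firmware: false,
            },
            Arch::Aarch64 => DomainXmlParams {
                domain_type: "qemu", // TCG emulation on an x86_64 host
                arch: "aarch64",
                machine: "virt",
                emulator: "qemu-system-aarch64",
                cpu_block: "<cpu mode='custom'><model>max</model>\
<feature policy='require' name='pauth-impdef'/></cpu>",
                needs_firmware: true,
            },
        }
    }
}
```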

Four new unit tests verify the aarch64 path emits the right
domain type, arch, machine, emulator, CPU features, and firmware
elements — and that x86_64 stays BIOS-default with no loader/
nvram leakage. 26/26 `modules::kvm::xml` tests green.

When a native-aarch64 runner (Ampere) shows up, it's a one-line
fork inside `DomainXmlParams::for_vm` to switch to `kvm` +
`host-model` for the aarch64 branch — the shape already handles
it.
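
A rough sketch of the "resolve once, render once" shape described above. The field names follow the list in the message, but `Arch`, `for_arch`, and `render_domain_skeleton` are hypothetical names, the firmware field is left out for brevity, and the XML is trimmed to a skeleton rather than the real template.

```rust
/// Simplified stand-in for the real architecture enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arch {
    X86_64,
    Aarch64,
}

/// Everything the XML template needs, resolved once per VM.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DomainXmlParams {
    pub domain_type: &'static str,
    pub arch: &'static str,
    pub machine: &'static str,
    pub emulator: &'static str,
    pub cpu_block: &'static str,
}

impl DomainXmlParams {
    /// All per-arch branching lives here (hypothetical name; the
    /// message refers to the real entry point as `for_vm`).
    pub fn for_arch(arch: Arch) -> Self {
        match arch {
            Arch::X86_64 => Self {
                domain_type: "kvm",
                arch: "x86_64",
                machine: "q35",
                emulator: "qemu-system-x86_64",
                cpu_block: "<cpu mode='host-model'/>",
            },
            Arch::Aarch64 => Self {
                domain_type: "qemu", // TCG emulation
                arch: "aarch64",
                machine: "virt",
                emulator: "qemu-system-aarch64",
                cpu_block: "<cpu mode='custom'><model>max</model></cpu>",
            },
        }
    }
}

/// The template itself never branches on architecture.
pub fn render_domain_skeleton(name: &str, p: &DomainXmlParams) -> String {
    format!(
        "<domain type='{}'>\n  <name>{}</name>\n  <os><type arch='{}' machine='{}'>hvm</type></os>\n  {}\n  <devices><emulator>{}</emulator></devices>\n</domain>\n",
        p.domain_type, name, p.arch, p.machine, p.cpu_block, p.emulator
    )
}
```

Under this shape, a native-aarch64 runner would only change the values returned by the `Aarch64` arm; the renderer stays untouched.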
aarch64 guests boot via UEFI — there is no SeaBIOS equivalent for
the arm64 `virt` machine type. Libvirt needs two paths:

  - CODE (read-only firmware image, shared across VMs)
  - VARS (writable NVRAM, per-VM)

Every distro ships these under a different filename. New module
`modules/kvm/firmware.rs`:

- `AarchFirmware { code, vars_template }` — typed pair.
- `discover_aarch64_firmware()` walks four known-paths groups
  (Arch `edk2-armvirt`, Arch old naming, Debian/Ubuntu
  `qemu-efi-aarch64`, Fedora `edk2-aarch64`). First pair where
  both files exist wins. Miss → `ExecutorError` carrying the
  per-distro `pacman`/`apt`/`dnf` install command + the full
  candidate list for diagnosis.
- `copy_vars_template_for_vm(fw, dest)` produces the per-VM NVRAM
  at `$pool/<vm>-VARS.fd` and chmods 0644 so libvirt-qemu's
  dynamic-ownership chown on VM start works.

Wired into `KvmVirtualMachineHost::ensure_vm`: when
`spec.architecture == Aarch64`, the topology runs firmware
discovery + per-VM copy before composing the `VmConfig`, then
hands the resolved `UefiFirmware` to the XML renderer
(commit 2 already consumes it). x86_64 path unchanged.

Firmware discovery is deliberately a runtime check with a clear
error, not a preflight — this lets x86_64-only runs succeed on
hosts without AAVMF installed. Commit 4 adds an arch-aware
preflight that surfaces it upfront when a caller asks for
aarch64.

Verified: 26/26 kvm::xml tests still green, cargo check clean,
cargo fmt clean.
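
The "first pair where both files exist wins" probe can be sketched like this. It is an illustration under assumptions: `FirmwarePair` and `discover_firmware` are hypothetical names, the candidate list is passed in by the caller instead of being a built-in table, and the error is a plain `String` rather than `ExecutorError`.

```rust
use std::path::PathBuf;

/// Read-only CODE image plus the VARS template copied per VM.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FirmwarePair {
    pub code: PathBuf,
    pub vars_template: PathBuf,
}

/// Walk the candidates in order; the first pair where both files
/// exist wins. On a miss, the error lists every path that was probed.
pub fn discover_firmware(candidates: &[(PathBuf, PathBuf)]) -> Result<FirmwarePair, String> {
    for (code, vars) in candidates {
        if code.is_file() && vars.is_file() {
            return Ok(FirmwarePair {
                code: code.clone(),
                vars_template: vars.clone(),
            });
        }
    }
    let checked: Vec<String> = candidates
        .iter()
        .map(|(c, v)| format!("{} + {}", c.display(), v.display()))
        .collect();
    Err(format!(
        "no aarch64 UEFI firmware pair found; checked: {}",
        checked.join(", ")
    ))
}
```

Requiring both halves of a pair avoids mixing a CODE image from one package with a VARS template from another.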
Add the pinned Ubuntu 24.04 arm64 cloud image alongside the existing
amd64 pin, with sha256 verification and a per-arch OnceCell cache so
both images can coexist under $HARMONY_DATA_DIR/iot/cloud-images/.

New entry point `ensure_ubuntu_2404_cloud_image_for_arch` selects the
right URL/sha256/filename tuple by VmArchitecture; the existing
`ensure_ubuntu_2404_cloud_image` becomes a back-compat shim pointing
at x86_64 so current callers don't need to thread an arch through yet.

Preflight gains `check_iot_smoke_preflight_for_arch`: on top of the
host-generic checks, an aarch64 target additionally requires
`qemu-system-aarch64` on PATH and a usable AAVMF firmware pair
(same `discover_aarch64_firmware` call the topology makes at
ensure_vm time — preflight surfaces it up front). Package-map
helpers learn `qemu-system-aarch64` for pacman/apt/dnf.
Wire the VmArchitecture story all the way to the user-facing entry
points so an arm64 smoke run is a single env flip.

Example (`example_iot_vm_setup`):
  * New `--arch {x86-64|aarch64}` flag (default x86-64) backed by a
    `CliArch` enum that converts cleanly to `VmArchitecture`.
  * Preflight and cloud-image bootstrap now call the `_for_arch`
    variants, and the `VirtualMachineSpec.architecture` field gets
    the real value instead of `Default::default()`.

Smoke script (`iot/scripts/smoke-a3.sh`):
  * Reads `ARCH=x86-64|aarch64` from env (default x86-64).
  * When `ARCH=aarch64`, `rustup target add aarch64-unknown-linux-gnu`
    + `cargo build --target ...` produces an arm64 agent binary;
    otherwise the existing host-target build path is kept.
  * Threads `--arch` to the example.
  * Extends the phase-4 initial-status timeout (60s → 300s) and the
    phase-5 post-reboot wait (240s → 900s) under TCG, which runs
    3-5× slower than native KVM.

New `smoke-a3-arm.sh` wrapper: exports `ARCH=aarch64` and a separate
`VM_NAME` / NATS container name so an arm smoke run can coexist with
an x86 one on the same host without stepping on libvirt state.

Topology side (`KvmVirtualMachineHost::ensure_vm`): `wait_for_ip`
timeout is now arch-derived — 300s for x86_64, 900s for aarch64 —
because first-boot cloud-init under TCG routinely needs 8-12 min
on a constrained worker.
The on-device agent builds `harmony` with `default-features = false,
features = ["podman"]`, which does not pull in the `kvm` feature.
Cross-compiling iot-agent-v0 for `aarch64-unknown-linux-gnu` to put
it on a Pi / arm64 VM currently fails with:

    error[E0433]: failed to resolve: could not find `kvm` in `modules`
     --> harmony/src/modules/iot/preflight.rs:18:21
        use crate::modules::kvm::firmware::discover_aarch64_firmware;

Gate the import and the `discover_aarch64_firmware()` call inside
`check_iot_smoke_preflight_for_arch` behind `#[cfg(feature = "kvm")]`.
Callers who build `harmony` without kvm (the agent) still get the
`qemu-system-aarch64` PATH check — the firmware probe only matters
to the host that will actually boot the VM, and that host always
builds with `kvm` enabled anyway.

Verification: `cargo build --release --target aarch64-unknown-linux-gnu
-p iot-agent-v0` now succeeds and produces a valid ELF aarch64 binary
(~13 MB).
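
A minimal sketch of the gating pattern, assuming a simplified preflight: `binary_on_path` and `check_aarch64_preflight` are hypothetical names, the firmware probe is a stub, and errors are plain strings. Only the `#[cfg(feature = "kvm")]` placement mirrors what the message describes.

```rust
/// Stub standing in for the real probe, which lives in the kvm
/// module and therefore only exists when the feature is enabled.
#[cfg(feature = "kvm")]
fn discover_aarch64_firmware() -> Result<(), String> {
    Ok(())
}

/// True if an entry named `name` exists in any PATH directory.
pub fn binary_on_path(name: &str) -> bool {
    std::env::var_os("PATH")
        .map(|paths| std::env::split_paths(&paths).any(|dir| dir.join(name).is_file()))
        .unwrap_or(false)
}

/// The PATH check always runs; the firmware probe is compiled in
/// only for builds that enable the `kvm` feature.
pub fn check_aarch64_preflight() -> Result<(), String> {
    if !binary_on_path("qemu-system-aarch64") {
        return Err("qemu-system-aarch64 not found on PATH".to_string());
    }
    #[cfg(feature = "kvm")]
    {
        discover_aarch64_firmware()?;
    }
    Ok(())
}
```

Because both the helper and its call site carry the same `cfg`, a build without `kvm` never references the missing module.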
Current Arch edk2-armvirt ships the pair as
  /usr/share/edk2/aarch64/QEMU_EFI.fd
  /usr/share/edk2/aarch64/QEMU_VARS.fd
(plus a compatibility copy under /usr/share/edk2-armvirt/aarch64/).
The previous CANDIDATES list looked for `QEMU_CODE.fd` and
`vars-template-pflash.raw` — neither name matches the actual
distro layout, so `discover_aarch64_firmware` reported
"no firmware found" on a fully-provisioned Arch host.

Add the `QEMU_EFI.fd` + `QEMU_VARS.fd` pair at both Arch paths at the
top of the probe order; keep the older raw-pflash variant and the
speculative CODE/VARS naming as later fallbacks. Sync the error
message's "checked paths" hint with the new list so the diagnostic
matches what's actually probed.

Verified against /usr/share/edk2/aarch64/QEMU_{EFI,VARS}.fd on this
host — `discover_aarch64_firmware` now returns the pair and
`cargo run -p example_iot_vm_setup -- --arch aarch64 --bootstrap-only`
completes (downloads + sha256-verifies the 598 MB arm64 image and
caches it under $HARMONY_DATA_DIR/iot/cloud-images/).
Three fixes landed during arm smoke debugging. Each is a real
correctness / perf issue that would bite anyone running aarch64
under TCG via libvirt, independent of any particular firmware.

**xml.rs — qemu:commandline overrides for -cpu and -accel**

`pauth-impdef=on` is a QEMU property of `-cpu max`, not a libvirt
`<feature>` entry. Putting it under `<cpu><feature policy='require'
name='pauth-impdef'/>` is rejected by libvirt with:

    error: unsupported configuration: unknown CPU feature: pauth-impdef

Route it instead via `<qemu:commandline>` (with the qemu namespace
declared on `<domain>`). QEMU takes the LAST `-cpu` arg as
authoritative, so libvirt's `-cpu max` followed by our
`-cpu max,pauth-impdef=on` yields max + pauth-impdef.

Same mechanism forces MTTCG: despite docs claiming QEMU ≥ 9.1
defaults to `thread=multi` on aarch64, observation on QEMU 10.2
shows cross-arch `-accel tcg` runs single-threaded (`vcpu.1.time`
stays at 0 forever). Appending `-accel tcg,thread=multi` creates
a real per-vcpu thread and roughly halves cold-boot wall time.

Also added a `<rng model='virtio'>` device feeding host `/dev/urandom`.
aarch64 cloud-init blocks minutes on first-boot SSH host-key
generation without it under TCG (entropy pool never fills on its
own). Cheap insurance on x86_64 too.

**topology.rs — 30-min wait_for_ip budget for aarch64**

Cold boot under TCG on an 8-core x86 host is 10-15 min even with
virtio-rng + pauth-impdef + MTTCG. The previous 900s ceiling
trips healthy boots; 1800s covers slower CI workers.

**smoke-a3.sh — cleanup must pass --nvram**

`virsh undefine --remove-all-storage` refuses to remove an aarch64
domain without `--nvram`, because NVRAM files aren't considered
"storage." Before this, a failed run left the domain definition
behind with yesterday's XML — subsequent runs would replay the
stale XML (ensure_vm is idempotent and doesn't redefine when the
domain already exists), masking any XML change until a manual
`virsh undefine` was issued. Also bump REBOOT_STEPS to match the
new topology-side budget.

Verified: `cargo test -p harmony --lib kvm::xml` passes (26/26),
including the 5 aarch64 assertions (namespace, cpu block, pflash
wiring, qemu:commandline contents for both -cpu and -accel).
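
For reference, the passthrough block described in the xml.rs section could be rendered roughly as below. The function names are hypothetical and the real renderer is not shown in this PR text; the namespace URI and the `<qemu:commandline>` / `<qemu:arg value='…'/>` element names are libvirt's. No XML escaping is done, on the assumption that the arguments are fixed literals.

```rust
/// Namespace that must be declared on `<domain>` before libvirt
/// accepts `<qemu:commandline>`.
pub const QEMU_XMLNS: &str = "http://libvirt.org/schemas/domain/qemu/1.0";

/// Render one `<qemu:arg>` element per raw QEMU argument.
/// Arguments are assumed to be trusted literals (no escaping).
pub fn qemu_commandline_block(args: &[&str]) -> String {
    let mut out = String::from("  <qemu:commandline>\n");
    for arg in args {
        out.push_str(&format!("    <qemu:arg value='{arg}'/>\n"));
    }
    out.push_str("  </qemu:commandline>\n");
    out
}

/// The two overrides from the message: a later `-cpu` that adds
/// `pauth-impdef=on`, and an explicit multi-threaded TCG accel.
pub fn aarch64_tcg_overrides() -> String {
    qemu_commandline_block(&["-cpu", "max,pauth-impdef=on", "-accel", "tcg,thread=multi"])
}
```

The opening tag would then read `<domain type='qemu' xmlns:qemu='…'>` with `QEMU_XMLNS` as the attribute value.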
QEMU's `virt` machine hardwires pflash unit 0 as a CFI flash device
of fixed size 64 MiB. When libvirt's `<loader type='pflash'>` points
at a file smaller than that, qemu refuses to start:

    cfi.pflash01 device '/machine/virt.flash0' requires 67108864
    bytes, block backend provides 3145728 bytes

Different distros ship the CODE firmware differently:

- Pre-padded (upstream QEMU pc-bios/edk2-aarch64-code.fd, Debian/
  Ubuntu qemu-efi-aarch64): file is exactly 64 MiB, zero-padded at
  the tail. Works as-is with libvirt's pflash loader.
- Raw edk2 build output (Arch `edk2-aarch64 202508+`): file is
  ~2-4 MiB, just the firmware volume without pflash padding. Has
  to be padded before libvirt accepts it.

Our discovery previously handed the discovered path straight to
libvirt. That works on pre-padded distros and silently fails on
raw-output distros.

Add `ensure_code_pflash_padded` in modules/kvm/firmware.rs:

- If the source is already 64 MiB, return the path unchanged —
  no copy, no bytes moved.
- If smaller, check a cache path (pool_dir/aarch64-code-padded.fd)
  for a correctly-sized copy newer than the source and reuse it.
- Otherwise copy + `File::set_len(64 MiB)` (sparse zero pad, one
  syscall), chmod 0644, return the cached path.
- If larger than 64 MiB, error out — no amount of padding saves us.

`ensure_vm_firmware` in topology.rs now runs the discovered code
through the padder before handing it to libvirt. One padded copy
per pool, reused across every aarch64 VM on that pool.

Verification path: `cargo test -p harmony --lib kvm::` passes
(26 tests — XML suite unchanged since this is runtime-only).
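
The four padder cases can be sketched with `std::fs` alone. Assumptions: `ensure_padded` is a hypothetical name, the target size is a parameter so the logic can be exercised with small files, the chmod 0644 step is omitted, and a cache with an mtime equal to the source is treated as fresh (the message says "newer").

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Size QEMU's `virt` machine expects for pflash unit 0.
pub const PFLASH_SIZE: u64 = 64 * 1024 * 1024;

/// Return a path to a firmware image of exactly `target_len` bytes.
pub fn ensure_padded(source: &Path, cache: &Path, target_len: u64) -> io::Result<PathBuf> {
    let src_meta = fs::metadata(source)?;
    // Already the right size: hand back the source, no copy.
    if src_meta.len() == target_len {
        return Ok(source.to_path_buf());
    }
    // Too large: padding cannot help.
    if src_meta.len() > target_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} is larger than {target_len} bytes", source.display()),
        ));
    }
    // Reuse a correctly sized cache that is not older than the source.
    if let Ok(cache_meta) = fs::metadata(cache) {
        let fresh = match (cache_meta.modified(), src_meta.modified()) {
            (Ok(c), Ok(s)) => c >= s,
            _ => false,
        };
        if cache_meta.len() == target_len && fresh {
            return Ok(cache.to_path_buf());
        }
    }
    // Copy, then extend with zeros; `set_len` pads without writing.
    fs::copy(source, cache)?;
    OpenOptions::new().write(true).open(cache)?.set_len(target_len)?;
    Ok(cache.to_path_buf())
}
```

A real caller would pass `PFLASH_SIZE` and a cache path inside the pool directory.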
fix(kvm): wait for port 22 after DHCP lease when first_boot is set
All checks were successful
Run Check Script / check (pull_request) Successful in 2m14s
762e3b5b99
`wait_for_ip` returns as soon as libvirt sees a DHCP lease, but the
guest may still be minutes away from accepting SSH connections —
cloud-init is usually mid-firstboot (SSH host-key generation, runcmd,
etc.). Any Score that SSHes in immediately after `ensure_vm`
resolves races with sshd startup:

    ansible.builtin.ping failed against 192.168.122.11: UNREACHABLE!
    ssh: connect to host 192.168.122.11 port 22: Connection refused

This is painful on native KVM (seconds) and catastrophic under TCG
(1-3 min between DHCP and sshd listening).

When `spec.first_boot.is_some()` — i.e. the caller asked us to run
cloud-init and therefore almost certainly intends to SSH next — also
block on `wait_for_tcp_port(ip, 22, budget)` before returning. The
budget is reused from `wait_for_ip` (300 s x86_64 / 1800 s aarch64)
because if cloud-init takes that long to bring SSH up, something is
broken that a longer wait wouldn't fix.

`wait_for_tcp_port` uses 1 s backoff polling with a 5 s per-attempt
TCP connect timeout, so a silently dropped SYN doesn't burn half
the budget on a single hung syscall.

Cases without `first_boot` (caller bringing their own pre-baked
image and not expecting SSH) get the old behavior: return as soon
as DHCP resolves.
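
A blocking, std-only sketch of the polling loop described above. The real `wait_for_tcp_port` signature takes `(ip, 22, budget)` and is presumably async; this version takes a `SocketAddr`, returns a `String` error, and only illustrates the 1 s backoff and the 5 s per-attempt connect cap.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `addr` until a TCP connect succeeds or `budget` runs out.
pub fn wait_for_tcp_port(addr: SocketAddr, budget: Duration) -> Result<(), String> {
    let deadline = Instant::now() + budget;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(format!(
                "{addr} did not accept a TCP connection within {budget:?}"
            ));
        }
        // Cap each attempt at 5 s so a dropped SYN cannot consume
        // the whole budget in a single connect call.
        let attempt = remaining.min(Duration::from_secs(5));
        if TcpStream::connect_timeout(&addr, attempt).is_ok() {
            return Ok(());
        }
        // 1 s between attempts, never sleeping past the deadline.
        let pause = deadline.saturating_duration_since(Instant::now());
        sleep(pause.min(Duration::from_secs(1)));
    }
}
```

A refused connection returns immediately, so on a booting guest the loop mostly spends its time in the 1 s pause rather than in the connect call.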
Merge pull request 'feat/iot-arm-vm' (#269) from feat/iot-arm-vm into feat/iot-walking-skeleton
All checks were successful
Run Check Script / check (pull_request) Successful in 2m15s
4e787ddb71
Reviewed-on: #269
johnride reviewed 2026-04-21 19:57:13 +00:00
@@ -0,0 +1,71 @@
apiVersion: apiextensions.k8s.io/v1
Author
Owner

never write yaml. This must be typed rust. kube-rs provides all we need to declare a fully typed crd. Same goes for operator.yaml file.

@@ -0,0 +78,4 @@
};
obj.extensions.insert(
"x-kubernetes-validations".to_string(),
serde_json::json!([{
Author
Owner

Why is this not part of the full crd? Why add a bit more json into it after the fact? Am I missing something?

johnride added 6 commits 2026-04-21 20:13:18 +00:00
refactor(iot): extract iot-contracts crate for cross-boundary types
All checks were successful
Run Check Script / check (pull_request) Successful in 2m13s
24b94a362d
Consolidate the data types, NATS bucket names, and KV key formats
that were scattered across the IoT operator, on-device agent, and
harmony's podman module. Each was defined in one place and quoted /
reimplemented in the others, which is exactly the kind of contract
drift the roadmap v0.1 §2 called for consolidating before we start
layering new features on top.

New crate `iot/iot-contracts`:
  * score.rs — `IotScore`, `PodmanV0Score`, `PodmanService` (moved
    from `harmony::modules::podman::score`). Pure data, no harmony
    deps.
  * kv.rs — `BUCKET_DESIRED_STATE`, `BUCKET_AGENT_STATUS` constants,
    `desired_state_key(device, deployment)`, `status_key(device)`.
    These values used to be hard-coded in five places (agent main.rs,
    operator main.rs, operator/deploy/operator.yaml, smoke-a1.sh,
    smoke-a3.sh). Tests lock the literals so a flip can't slip.
  * status.rs — typed `AgentStatus { device_id, status, timestamp }`.
    Replaces the anonymous `serde_json::json!{}` the agent was
    publishing, so the operator can deserialize the heartbeat
    payload via a shared struct when §12 v0.1 status aggregation
    lands.

Consumer updates:
  * `harmony::modules::podman::score` now holds only the
    `Score<T>` / `Interpret<T>` trait bindings; the pure types are
    re-exported from iot-contracts. Trait impls can't move because
    the trait lives in harmony, so this is the cleanest split.
  * `iot-operator-v0` uses `BUCKET_DESIRED_STATE` and
    `desired_state_key` — the inline `kv_key` fn now delegates so
    the existing internal call sites stay untouched.
  * `iot-agent-v0` uses `BUCKET_DESIRED_STATE`, `BUCKET_AGENT_STATUS`,
    `status_key`, and `AgentStatus` for the heartbeat publish.

No behavior change. Tests: `cargo test -p iot-contracts` passes
(8/8). Regression: `smoke-a3.sh` on x86_64 PASSes end-to-end
(reboot-reconnect loop included) — wire format is byte-identical
to the pre-refactor serialization.

Next consumers on deck: operator-side status aggregation (§12 v0.1
#3) and journald log streaming (§12 v0.1 #5), both of which need
shared types across the operator/agent boundary and were the
reason this extraction was prioritized.
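
A sketch of what a `kv.rs`-style module looks like. The `agent-status` bucket name and the `status.<device>` key shape appear elsewhere in this PR; the `desired-state` literal and the `<device>.<deployment>` key format are assumptions made for illustration and may not match the real constants.

```rust
/// Bucket the operator writes desired state into.
/// ASSUMPTION: the real literal is not shown in this PR text.
pub const BUCKET_DESIRED_STATE: &str = "desired-state";

/// Bucket the agent publishes heartbeats into.
pub const BUCKET_AGENT_STATUS: &str = "agent-status";

/// Key for one deployment on one device.
/// ASSUMPTION: the `<device>.<deployment>` layout is illustrative.
pub fn desired_state_key(device: &str, deployment: &str) -> String {
    format!("{device}.{deployment}")
}

/// Key for a device's heartbeat entry.
pub fn status_key(device: &str) -> String {
    format!("status.{device}")
}
```

With one definition, the operator, the agent, and the tests all derive keys from the same functions, and a test asserting the literal output catches an accidental format change.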
Replaces an 8-link `.arg("-t").arg("ed25519").arg("-N")…` chain with
a single `.args([...])` of string literals, plus one trailing `.arg()`
for the `&PathBuf` (kept separate so we don't force it through the
`IntoIterator<Item=&str>` channel). No behavior change.
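
The shape of that refactor, sketched with `std::process::Command`. The binary name `ssh-keygen`, the function name, and every flag after `-N` are assumptions, since the original chain is elided in the message.

```rust
use std::path::Path;
use std::process::Command;

/// Build the key-generation command without running it.
/// ASSUMPTION: binary and trailing flags are illustrative only.
pub fn keygen_command(key_path: &Path) -> Command {
    let mut cmd = Command::new("ssh-keygen");
    // String literals go through a single `.args([...])` call.
    cmd.args(["-t", "ed25519", "-N", "", "-q", "-f"])
        // The path stays a separate `.arg()` so it is passed as an
        // `OsStr` and never has to be converted to `&str`.
        .arg(key_path);
    cmd
}
```

Keeping the path out of the array also avoids a lossy conversion for non-UTF-8 paths.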
feat(iot-contracts): type AgentStatus fields with Id + DateTime<Utc>
All checks were successful
Run Check Script / check (pull_request) Successful in 2m8s
0d01a71cd5
`AgentStatus.device_id` and `AgentStatus.timestamp` were stringly
typed. Both now carry real types that prevent a whole class of
wire-format typos while keeping the on-wire JSON shape intact.

**device_id: String → harmony_types::id::Id**

Agent config + heartbeat payload now share the same `Id` that the
example IoT pipeline already uses for `IotDeviceSetupConfig`. Mixing
a device id with a deployment name or arbitrary `String` is now a
type error. `Id` is re-exported from `iot-contracts` so consumers
don't need a direct `harmony_types` dependency just to name the
field.

To keep the wire format byte-compatible, `harmony_types::Id` gains
`#[serde(transparent)]`. Audit: no consumer in the tree relies on
the previous `{"value": "…"}` shape — `Id` is persisted by sqlite
via `to_string()`, never serialized directly — so this is a
latent-bug fix more than a behavior change.

**timestamp: String → chrono::DateTime<Utc>**

The agent was calling `chrono::Utc::now().to_rfc3339()` and stuffing
the String into the payload. It now holds a real `DateTime<Utc>`
which serde-serializes as RFC 3339 anyway. The smoke script's
reboot-gate lexicographic comparison still works: the date and time
digits decide the ordering before the comparison reaches the suffix,
so the trailing `Z` (chrono default) vs `+00:00` (prior format)
difference does not matter.
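
Why the comparison survives the suffix change can be shown with plain string ordering. This models the smoke script's shell comparison in Rust as an illustration; `heartbeat_is_newer` is a hypothetical name, and the argument only holds for fixed-width UTC timestamps with whole seconds.

```rust
/// Byte-wise string ordering, standing in for the script's check.
/// Valid only for fixed-width RFC 3339 timestamps in UTC.
pub fn heartbeat_is_newer(candidate: &str, baseline: &str) -> bool {
    candidate > baseline
}

/// New `Z` format vs old `+00:00` format, same day.
pub fn demo() -> (bool, bool) {
    let before_reboot = "2026-04-21T18:10:00+00:00";
    (
        // Later time: decided at the minutes digit, suffix unread.
        heartbeat_is_newer("2026-04-21T18:15:42Z", before_reboot),
        // Earlier time: also decided before the suffix.
        heartbeat_is_newer("2026-04-21T18:09:59Z", before_reboot),
    )
}
```

The one edge case is an identical instant in both formats: the `Z` form sorts after the `+00:00` form because `Z` has the higher byte value.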

**Plumbing**

- `iot/iot-agent-v0/src/config.rs`: `AgentSection.device_id: Id`.
  TOML deserializes the bare string thanks to `#[serde(transparent)]`.
- `iot/iot-agent-v0/src/main.rs`: `watch_desired_state` and
  `report_status` take `Id` instead of `String`.
- `iot/iot-contracts/Cargo.toml`: adds `harmony_types` path dep and
  `chrono = { workspace = true, features = ["serde"] }`.

**Verification**

- `cargo test -p iot-contracts`: 8/8 passes. New assertions pin the
  wire format: `"device_id":"pi-01"` (not `{"value":"pi-01"}`) and
  `"timestamp":"2026-04-21T18:15:42Z"` (RFC 3339).
- x86_64 smoke-a3.sh PASSes end-to-end including the reboot-
  reconnect loop — wire format remains compatible with the existing
  smoke-script parsing.
Review feedback: `ContainerRuntime` is a first-class harmony
capability (already lives at
`harmony/src/domain/topology/container_runtime.rs`) and the Score
types that describe what containers a caller wants running belong
next to the trait impls, not hidden in an IoT-labeled contracts
crate. Putting `PodmanService`, `PodmanV0Score`, and `IotScore` in
`iot-contracts` conflated the product-shape (IoT fleet agent) with
a reusable container-orchestration primitive.

Move the data definitions (plus the three serde tests) back to
`harmony/src/modules/podman/score.rs` where they were before the
extraction in commit 24b94a3. That file now again holds the types
and their `Score<T>` / `Interpret<T>` trait impls in one place.

No behavior change:
- `harmony::modules::podman::{IotScore, PodmanV0Score, PodmanService}`
  re-exports still resolve (through the restored local module rather
  than a forwarded re-export from iot-contracts).
- The single external consumer that imports these types —
  `iot-agent-v0/src/reconciler.rs` — already went through
  `harmony::modules::podman::*`, so no import flip needed.

iot-contracts now holds only the cross-boundary bits that are
genuinely reconciler-wire-format-specific (bucket names + key
helpers, `AgentStatus`, `Id` re-export). A follow-up commit will
rename the crate itself to reflect that scope.

Verification: `cargo test -p harmony --features podman --lib podman`
(3 score tests pass in their restored home), `cargo test -p
iot-contracts` (5 remaining tests), `cargo check --all-features`
clean.
refactor(reconciler): rename iot-contracts → harmony-reconciler-contracts
All checks were successful
Run Check Script / check (pull_request) Successful in 2m29s
75c3ef9bb8
Review feedback: "iot" is the wrong scope label. The pattern this
crate encodes — a central operator writing desired state to NATS
JetStream KV, a remote agent watching KV and reconciling — is the
foundation for harmony's decentralized infrastructure management,
not an IoT thing. Raspberry Pi is one concrete use case; the next
consumers (OKD fleet agents, edge-compute reconcilers, any host
harmony can't reach directly over a control-plane API) aren't IoT
either.

Rename the crate to reflect what it actually is:

- `iot/iot-contracts/` → `harmony-reconciler-contracts/` (moved to
  the repo root, alongside the other support crates).
- Package name `iot-contracts` → `harmony-reconciler-contracts`.
- Consumer `Cargo.toml` path references updated in operator, agent.
- `use iot_contracts::…` → `use harmony_reconciler_contracts::…`
  across agent + operator sources.
- Crate-level prose in lib.rs + kv.rs rewritten to drop the IoT
  framing and describe the reconciler pattern in its own terms.
- harmony/Cargo.toml drops the dep entirely — after the preceding
  commit moved podman Score types back in-tree, harmony no longer
  pulls anything from this crate.

No behavior change. Wire format unchanged — the two existing public
modules (`kv`, `status`) are byte-identical.

Verified:
- `cargo check --all-targets --all-features` clean.
- `cargo test -p harmony-reconciler-contracts` — 5/5 pass.
- x86_64 `smoke-a3.sh` end-to-end PASS (reboot-reconnect included).

Out of scope / follow-up: the operator and agent crate names
(`iot-operator-v0`, `iot-agent-v0`) and `IotScore` are still
IoT-branded. Evaluating whether to flip those in this branch next.
Reviewed-on: #270
johnride added 4 commits 2026-04-21 20:53:34 +00:00
refactor(operator): replace gen-crd yaml pipeline with a harmony Score
All checks were successful
Run Check Script / check (pull_request) Successful in 2m12s
588afb9ab9
Review feedback: writing yaml and shelling out to kubectl is the
exact anti-pattern harmony exists to eliminate. The operator already
has typed Rust for its CRD (`#[derive(CustomResource)]`), and
harmony-k8s already has a typed apply path. So the "install" step
should be a Score, not `cargo run -- gen-crd | kubectl apply -f -`.

Changes:

- **New** `iot/iot-operator-v0/src/install.rs` — `install_crds()`
  builds `Deployment::crd()` via `kube::CustomResourceExt`, wraps it
  in `harmony::modules::k8s::resource::K8sResourceScore`, and
  executes the Score against a tiny local `InstallTopology` that
  just carries a `K8sClient` loaded from `KUBECONFIG`.

  The local topology exists because `K8sAnywhereTopology::ensure_ready`
  does a lot of product-level setup (cert-manager, tenant manager,
  helm probes) that isn't appropriate for a narrow "apply a CRD"
  action. A 30-line inline topology that implements `K8sclient` +
  a noop `ensure_ready` is the right-sized abstraction for now.
  When a larger "install the operator in-cluster" Score lands
  (Deployment + SA + RBAC + ClusterRoleBinding), that may justify
  promoting the topology to a shared crate.

- **Renamed subcommand** `gen-crd` → `install`. Old path: print yaml
  to stdout for kubectl to consume. New path: apply the CRD directly
  via the Score, using whatever `KUBECONFIG` points at.

- **Deleted** `iot/iot-operator-v0/deploy/crd.yaml` and
  `deploy/operator.yaml`. The CRD yaml was derived from Rust and
  committed alongside the source — a drift hazard (nothing guaranteed
  they stayed in sync). `operator.yaml` was never actually applied by
  any smoke script; it existed only for documentation. Both go.

- **Rewired** `iot/scripts/smoke-a1.sh` phase 2 to call the `install`
  subcommand instead of piping yaml to kubectl. Everything downstream
  (kubectl wait for Established, apiserver CEL rejection check,
  operator + agent + container lifecycle) unchanged.

- **Dropped** `serde_yaml` from the operator's `Cargo.toml` — it was
  only used to print the CRD as yaml. Added `harmony`, `harmony-k8s`,
  and `async-trait` deps.

Verification — `smoke-a1.sh` PASSes end-to-end on x86_64 k3d:
k3d cluster → install CRD via Score → apiserver rejects bad
score.type (CEL still works through the Score-applied CRD) →
operator → agent → nginx container up → curl 200 → delete CR →
KV + container removed.

Out of scope / follow-up: a proper "install operator in-cluster"
Score that also applies Namespace + SA + ClusterRole +
ClusterRoleBinding + Deployment (the manifests that used to live in
the deleted operator.yaml). Smoke-a1 currently runs the operator
as a host-side process, so that Score isn't on the test path today.
docs(topology): flag InstallTopology smell + add roadmap §12.6
All checks were successful
Run Check Script / check (pull_request) Successful in 2m14s
b8db8241d1
The InstallTopology in iot/iot-operator-v0/src/install.rs is
architecturally a workaround: harmony's existing opinionated
topologies (K8sAnywhereTopology, HAClusterTopology) have accumulated
product-level side effects in ensure_ready that make them unfit for
narrow actions like "apply a CRD," so the module vendored its own
tiny Topology impl. If this pattern proliferates, the topology
ecosystem drifts toward "one bespoke topology per Score," which is
exactly the proliferation harmony's design was meant to prevent.

Two documentation changes, no code/behavior change:

- **Inline:** doc comment on `InstallTopology` flagging it as a
  smell, explaining the root cause, and pointing at the roadmap
  entry below. Anyone finding this code later (or tempted to copy
  the pattern) reads the warning before they do.

- **Roadmap §12.6** (new): "Topology proliferation — opinionated
  topologies leaking into narrow use cases." Captures the
  architectural direction (minimal `K8sBareTopology` in harmony,
  unbundle product setup from `ensure_ready`) without prescribing
  an implementation. Includes an explicit done-check: the smoke
  test for "this roadmap item is fixed" is that install.rs can
  delete its inline Topology and use the shared type in one line.
Merge branch 'feat/iot-walking-skeleton' into feat/install-reconcile-operator-score
All checks were successful
Run Check Script / check (pull_request) Successful in 2m13s
6676023aa8
Reviewed-on: #271
johnride added 53 commits 2026-04-25 13:52:24 +00:00
v0 walking skeleton is substantially done (CRD → operator → NATS KV
→ on-device agent → podman reconcile; VM-as-device for x86_64 and
aarch64 via TCG; power-cycle resilience; operator install via Score
instead of yaml/kubectl). Time to switch the `ROADMAP/iot_platform`
folder from "plan to build the skeleton" to "plan to build on top of
the skeleton."

- **NEW** `ROADMAP/iot_platform/v0_1_plan.md` — the authoritative
  forward plan. Five chapters in execution order:
    1. Hands-on end-to-end demo the user can drive by hand
       (imminent, fully detailed: composed smoke, typed-Rust CR
       applier, natsbox command menu, in-cluster NATS).
    2. Status reflect-back + inventory (enrich `AgentStatus`,
       operator aggregates into `.status.aggregate`).
    3. Helm chart packaging (ArgoCD deferred — user's clusters have
       it already, bringing it into the smoke adds no validation
       value).
    4. Zitadel + OpenBao + per-device auth.
    5. Frontend (web / CLI / TUI — deferred).

  Chapters 2-5 are sketched; they expand to their own docs as each
  becomes the active chapter.

- **EDIT** `ROADMAP/iot_platform/v0_walking_skeleton.md` — add a
  SHIPPED banner at the top pointing at v0_1_plan.md. Keep the
  707-line design diary intact as archaeology; don't rewrite
  history.

- Incorporates the post-v0 architectural principles that emerged
  from review (no yaml in framework paths, minimal ad-hoc
  topologies, cross-boundary types in harmony-reconciler-contracts,
  verify before blaming upstream).
Roadmap §12.6 ("topology proliferation") is partially resolved by
extracting the ad-hoc InstallTopology from iot-operator-v0/install.rs
into harmony as a reusable shared type, now that a second consumer
(NatsBasicScore, landing next) makes the extraction genuinely
load-bearing rather than speculative.

What's new:

- harmony/src/modules/k8s/bare_topology.rs — K8sBareTopology carries
  one K8sClient, implements K8sclient + Topology (noop ensure_ready).
  Constructors: from_client(name, client) for callers building their
  own client, from_kubeconfig(name) for callers reading the standard
  KUBECONFIG chain.
- modules::k8s::K8sBareTopology re-export.

What's gone:

- iot-operator-v0/src/install.rs: the ~30-line InstallTopology struct
  + its async_trait-decorated impls. The crate also drops async-trait
  and harmony-k8s as direct deps (neither is used now that the
  topology is shared).
- Long "architectural smell" comment from install.rs — the smell is
  fixed; the explanation belongs at the shared type now (with the
  history captured in its module doc).

Behavior-preserving. cargo check --all-targets --all-features clean.
smoke-a1 wire path unchanged.

Compounding-value move: every future Score that needs "apply a
typed resource against an existing cluster" consumes K8sBareTopology
instead of inventing its own Topology impl. That's the pattern
Harmony's design is meant to encourage.
Harmony's existing NATS story starts at `NatsK8sScore`, which is
designed for production multi-site superclusters: TLS-fronted
gateways, cert-manager-minted certs, ingress + Route, helm chart
with gateway merge blocks, NatsAdmin secret prompts. All of that is
overhead for a local smoke or a single-site decentralized deployment
that just needs a live JetStream server.

Add `NatsBasicScore` beside it. Deliberately minimal:
  - Single replica
  - Official `nats:*-alpine` image via typed k8s_openapi Deployment
  - JetStream (-js) on by default, toggle via builder setter
  - Namespace created if missing
  - Service: ClusterIP by default, or NodePort via
    `.node_port(port)` for off-cluster clients (e.g. a libvirt VM
    connecting through the host's loadbalancer port)

Trait bounds are just `Topology + K8sclient` — no `HelmCommand`,
no `TlsRouter`, no `Nats` capability. Composes cleanly with
`K8sBareTopology` (added in the previous commit) so consumers can
`score.create_interpret().execute(&inventory, &topology)` against
any cluster `KUBECONFIG` points at.

Constructed via a small builder:

    NatsBasicScore::new("iot-nats", "iot-system")
        .node_port(4222)
        .jetstream(true)

Under the hood the interpret runs three `K8sResourceScore`s in
sequence (namespace → deployment → service). No new machinery —
just composition of existing primitives.

Deliberately NOT in scope for this Score:
  - TLS / PKI — use NatsK8sScore when you need those
  - Gateways / supercluster — use NatsSuperclusterScore
  - Auth (user/password or JWT) — add a ConfigMap mount when
    the Chapter 4 auth work lands

Tests (4, all passing): default is ClusterIP; node_port() flips
Service to NodePort with the right nodePort field; jetstream() toggle
controls the `-js` arg.

Part of the "compound framework value" mindset: every future Score
that wants a local NATS now points at this one type instead of
inventing its own yaml.
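
The builder shape shown above can be sketched without any Kubernetes types. Everything here is a stand-in: `NatsBasicSketch`, `ServiceExposure`, `server_args`, and `service_type` are hypothetical names, and the real Score produces typed k8s_openapi resources instead of strings.

```rust
/// How the Service is exposed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceExposure {
    ClusterIp,
    NodePort(u16),
}

/// Stand-in for the Score's configuration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NatsBasicSketch {
    pub name: String,
    pub namespace: String,
    pub jetstream: bool,
    pub exposure: ServiceExposure,
}

impl NatsBasicSketch {
    /// Defaults from the message: JetStream on, ClusterIP Service.
    pub fn new(name: &str, namespace: &str) -> Self {
        Self {
            name: name.to_string(),
            namespace: namespace.to_string(),
            jetstream: true,
            exposure: ServiceExposure::ClusterIp,
        }
    }

    /// Switch the Service to NodePort for off-cluster clients.
    pub fn node_port(mut self, port: u16) -> Self {
        self.exposure = ServiceExposure::NodePort(port);
        self
    }

    /// Toggle the `-js` server argument.
    pub fn jetstream(mut self, enabled: bool) -> Self {
        self.jetstream = enabled;
        self
    }

    /// Container args implied by the configuration.
    pub fn server_args(&self) -> Vec<&'static str> {
        if self.jetstream { vec!["-js"] } else { Vec::new() }
    }

    /// Kubernetes Service `type` implied by the configuration.
    pub fn service_type(&self) -> &'static str {
        match self.exposure {
            ServiceExposure::ClusterIp => "ClusterIP",
            ServiceExposure::NodePort(_) => "NodePort",
        }
    }
}
```

Consuming `self` in each setter is what allows the chained `new(...).node_port(...).jetstream(...)` style used in the message.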
Replaces what would otherwise be a yaml fixture for the hands-on
demo. The CRD is already fully typed (DeploymentSpec + ScorePayload
+ PodmanV0Score + Rollout), so the applier uses those types
directly, constructs the CR via kube::Api, and either applies it
server-side or prints the JSON for `kubectl apply -f -`.

CLI:

  iot_apply_deployment \
      --namespace iot-demo \
      --name hello-world \
      --target-device iot-smoke-vm \
      --image docker.io/library/nginx:latest \
      --port 8080:80                       # apply
  iot_apply_deployment --image nginx:1.26  # upgrade (same name, new img)
  iot_apply_deployment --delete            # tear down
  iot_apply_deployment --print ...         # JSON to stdout → kubectl -f -

Uses server-side apply (PatchParams::apply().force()) so repeated
invocations patch the existing CR cleanly — the upgrade path the
demo exercises.

To expose the CRD types to an external consumer, iot-operator-v0
gains a thin `src/lib.rs` that re-exports the `crd` module. The
binary target now imports from the library (`use iot_operator_v0::crd;`)
instead of declaring its own `mod crd;` — avoids compiling the
types twice.

No change in operator runtime behavior.

Part of the ROADMAP/iot_platform/v0_1_plan.md Chapter 1 work.
Small CLI that installs a single-node NATS server into the cluster
KUBECONFIG points at, using harmony's `NatsBasicScore` composed
against `K8sBareTopology`.

This is the glue between `smoke-a4.sh` and the framework Score:

    cargo run -q -p example_iot_nats_install -- \
        --namespace iot-system \
        --name iot-nats \
        --node-port 4222

Defaults cover the demo exactly: iot-system namespace, NodePort 4222
so the libvirt VM agent can reach NATS through the k3d loadbalancer
port mapping.

No reinvented topology, no hand-rolled yaml, no helm shell-out. The
actual work (Namespace + Deployment + Service with the right
selector/ports/probes) lives inside `NatsBasicScore::Interpret` in
harmony where it can be reused by any future consumer.

Part of ROADMAP/iot_platform/v0_1_plan.md Chapter 1.
Composed demo that brings up operator + in-cluster NATS + ARM (or
x86) VM agent, then either hands the full stack off to the user
with a command menu (default) or drives an apply + upgrade + delete
regression loop (`--auto`).

Phases:
  1. k3d cluster with NATS port exposed via `-p 4222:4222@loadbalancer`.
  2. NATS in-cluster via the new `example_iot_nats_install` binary
     → `NatsBasicScore` → typed k8s_openapi Namespace + Deployment +
     NodePort Service.
  3. CRD install via `iot-operator-v0 install` (Score-based, no yaml).
  4. Operator spawned host-side, connects to nats://localhost:4222.
  5. VM provisioned via `example_iot_vm_setup` (reused from smoke-a3);
     agent inside the VM connects to nats://<libvirt-gateway>:4222.
  6. Sanity: NATS pod Running, agent heartbeat
     `status.<device>` present in `agent-status` bucket.
  7a. DEFAULT: print a command menu (kubectl watch, typed Rust
      applier, ssh/console, natsbox one-liners, curl) and block on
      Ctrl-C with a cleanup trap tearing everything down.
  7b. `--auto`: apply nginx:latest, wait for container on the VM,
      curl, upgrade to nginx:1.26, assert container id CHANGED,
      curl, delete, assert container gone.

Prereqs documented at the top of the script. Handles both x86-64
(native KVM) and aarch64 (TCG emulation) via `ARCH=` env.

Design notes captured in ROADMAP/iot_platform/v0_1_plan.md. Uses
every piece landed in this branch so far: K8sBareTopology,
NatsBasicScore, the typed CR applier, the Score-based CRD install.
Previous commit landed the script without the +x bit (a chmod
between write and commit was swallowed). Fix with git
update-index --chmod=+x so the file is executable on checkout.
Kubernetes NodePort Services must use a port in the apiserver's
configured nodeport range (default 30000-32767). NatsBasicScore's
first cut accepted any port via `.node_port(port)`, which was fine
for strict use of the capital-N NodePort Service type, but made
the demo's "use NATS client port 4222 directly from the host"
story awkward, because 4222 falls outside that default range.

Replace the `node_port: Option<i32>` field with a proper
`NatsServiceType` enum (ClusterIP | NodePort(i32) | LoadBalancer).
Three builder methods — one per variant. LoadBalancer is the right
idiom for the demo: k3d's built-in `klipper-lb` fronts
LoadBalancer Services on their `port` (not their nodePort), so
`k3d cluster create -p 4222:4222@loadbalancer` delivers external
traffic straight to the Service's client port. No nodeport range
juggling.

Signatures:

    NatsBasicScore::new(name, namespace)   // ClusterIP default
        .node_port(30422)                   // NodePort(30422)
        .load_balancer()                    // LoadBalancer
        .jetstream(true)
        .image("docker.io/library/nats:2.10-alpine")

Tests: 5 pass. New assertion: `load_balancer()` produces a Service
with type LoadBalancer and no pinned nodePort (apiserver assigns).

Consumers:
- `example_iot_nats_install` gets a `--expose {cluster-ip | node-port
   | load-balancer}` flag (default `load-balancer` since that's what
  the demo wants). The legacy `--node-port N` flag survives as the
  NodePort port value.
- `smoke-a4.sh` asks for `--expose load-balancer`, matching its
  `-p 4222:4222@loadbalancer` k3d port mapping.
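
A minimal sketch of the enum-plus-builder shape described above, using
only std. The field layout, the `jetstream` default of false, and the
`pinned_node_port` helper are assumptions for illustration, not the
actual harmony definitions.

```rust
/// How the NATS client port is exposed. The three variants come from
/// the commit message; everything else in this sketch is illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NatsServiceType {
    ClusterIP,
    NodePort(i32),
    LoadBalancer,
}

#[derive(Debug, Clone)]
pub struct NatsBasicScore {
    pub name: String,
    pub namespace: String,
    pub service_type: NatsServiceType,
    pub jetstream: bool,
}

impl NatsBasicScore {
    /// ClusterIP is the default, as in the signatures above.
    pub fn new(name: &str, namespace: &str) -> Self {
        Self {
            name: name.to_string(),
            namespace: namespace.to_string(),
            service_type: NatsServiceType::ClusterIP,
            jetstream: false, // assumed default
        }
    }

    pub fn node_port(mut self, port: i32) -> Self {
        self.service_type = NatsServiceType::NodePort(port);
        self
    }

    pub fn load_balancer(mut self) -> Self {
        self.service_type = NatsServiceType::LoadBalancer;
        self
    }

    pub fn jetstream(mut self, enabled: bool) -> Self {
        self.jetstream = enabled;
        self
    }

    /// Hypothetical helper: the nodePort to pin on the Service, if any.
    /// Only the NodePort variant pins one; LoadBalancer leaves the
    /// choice to the apiserver.
    pub fn pinned_node_port(&self) -> Option<i32> {
        match self.service_type {
            NatsServiceType::NodePort(port) => Some(port),
            _ => None,
        }
    }
}

fn main() {
    let score = NatsBasicScore::new("iot-nats", "iot-system")
        .load_balancer()
        .jetstream(true);
    println!("{:?} pins {:?}", score.service_type, score.pinned_node_port());
}
```

Modelling the exposure mode as one enum means a caller cannot ask for a
LoadBalancer and a pinned nodePort at the same time.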
Ubuntu 24.04 `useradd --system` does not allocate `/etc/subuid` +
`/etc/subgid` ranges. Rootless podman silently fails on image-layer
unpack:

    potentially insufficient UIDs or GIDs available in user namespace
    (requested 0:42 for /etc/gshadow): ... lchown /etc/gshadow:
    invalid argument

`smoke-a1.sh` didn't hit this because it runs the agent on the
*host* user, which has subuid/subgid populated by default. `smoke-a4.sh`
drives a podman pull inside the VM — the FIRST time we actually
exercise rootless-podman-on-a-fresh-system, and the failure surfaces
immediately.

The fix belongs in harmony, not in ad-hoc cloud-init scripts. Add
`UnixUserManager::ensure_subordinate_ids` alongside the existing
`ensure_user` + `ensure_linger` methods:

- `domain/topology/host_configuration.rs`: new trait method. Doc
  explains why every rootless-container-runtime consumer needs it.
- `modules/linux/ansible_configurator.rs`: impl follows `ensure_linger`'s
  pattern — a grep probe on /etc/subuid+/etc/subgid, then a single
  `usermod --add-subuids 100000-165535 --add-subgids 100000-165535`
  only when missing. Idempotent, no-ops on re-run.
- `modules/linux/topology.rs`: forwarder for `LinuxHostTopology`.
- `modules/iot/setup_score.rs`: call the new method right after
  `ensure_linger` in `IotDeviceSetupScore`. Any future consumer that
  runs rootless podman reaches for the same primitive.
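
The probe-then-mutate pattern above can be sketched without touching
the filesystem. This is an illustration only: the function names are
hypothetical, the real impl runs a grep probe through the ansible
configurator, and the file contents are passed in as strings here so
the logic stays testable.

```rust
/// True if `contents` (the text of /etc/subuid or /etc/subgid) already
/// holds a range for `user`. Entries look like `name:start:count`.
fn has_subordinate_range(contents: &str, user: &str) -> bool {
    contents
        .lines()
        .any(|line| line.split(':').next() == Some(user))
}

/// Returns the usermod argv to run, or None when both files already
/// carry a range for the user (the idempotent no-op case).
fn subordinate_ids_command(subuid: &str, subgid: &str, user: &str) -> Option<Vec<String>> {
    if has_subordinate_range(subuid, user) && has_subordinate_range(subgid, user) {
        return None;
    }
    let args = [
        "usermod",
        "--add-subuids",
        "100000-165535",
        "--add-subgids",
        "100000-165535",
        user,
    ];
    Some(args.iter().map(|a| a.to_string()).collect())
}

fn main() {
    // Fresh system user: no entries yet, so a command is produced.
    println!("{:?}", subordinate_ids_command("", "", "iot-agent"));
    // Second run: both files populated, nothing to do.
    let populated = "iot-agent:100000:65536\n";
    println!("{:?}", subordinate_ids_command(populated, populated, "iot-agent"));
}
```

Matching on the exact first field avoids treating `iot-agent2` as a
hit for `iot-agent`, which a plain substring grep would do.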

Verified: `cargo check --all-features` clean. End-to-end smoke-a4
regression pending (re-running after this commit).
The agent runs rootless podman as the `iot-agent` user (system
user, created by IotDeviceSetupScore). Each user has their own
podman state tree under ~/.local/share/containers. The smoke
was running `podman ps` as `iot-admin` (the ssh login user),
so it saw an empty store even when the agent had happily created
the nginx container — leading to a spurious "container never
appeared" failure despite the reconciler reporting SUCCESS.

Fix: go through `sudo su - iot-agent -c` with
`XDG_RUNTIME_DIR=/run/user/$(id -u)` so the command runs in
the right user session. Update the hand-off command menu with the
equivalent one-liner so the user can inspect the fleet's actual
container state without tripping over the same gotcha.

Smoke-a4 PASSes end-to-end on x86_64:
  - CRD apply → container materializes
  - Upgrade via new image → container id changes (not patched)
  - Delete → container removed

With the previous commit (ensure_subordinate_ids), this closes
Chapter 1 of ROADMAP/iot_platform/v0_1_plan.md: the full v0 loop
works, hands-on driven by kubectl / a typed Rust binary / natsbox.
Initial 180 s wait assumed native-KVM x86 speed. Under aarch64 TCG
the same nginx:latest pull (~250 MB image + layered userns unpack)
takes 4-8 min observed; 180 s was catching post-heartbeat reconcile
mid-pull and reporting FAIL.

Bump `CONTAINER_WAIT_STEPS` per arch:
  - x86 KVM: 90 iterations × 2 s = 180 s (unchanged)
  - aarch64 TCG: 450 × 2 s = 900 s (15 min)

Apply to both the 'first-boot container' and 'upgrade container id
change' loops.
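
The budget arithmetic, written out as a sketch. The smoke itself is a
shell script; the function names below are hypothetical and only
restate the numbers above.

```rust
/// Seconds slept between polls, per the figures above.
const POLL_INTERVAL_SECS: u64 = 2;

/// Poll iterations per arch: 450 under aarch64 TCG, 90 otherwise.
fn container_wait_steps(arch: &str) -> u64 {
    match arch {
        "aarch64" => 450,
        _ => 90,
    }
}

/// Total wait budget in seconds for the given arch.
fn container_wait_budget_secs(arch: &str) -> u64 {
    container_wait_steps(arch) * POLL_INTERVAL_SECS
}

fn main() {
    for arch in ["x86-64", "aarch64"] {
        println!("{arch}: {} s", container_wait_budget_secs(arch));
    }
}
```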
Docker Hub's unauthenticated rate limit (100 pulls per 6h per IP,
counted per-manifest-query) is the most reliable way for a CI-style
smoke loop to produce false negatives. The NATS pod failing with
'429 Too Many Requests' after a handful of runs today was that —
not a real regression.

Fix inside the smoke: before running the install Score, sideload the
NATS image into the k3d cluster via a podman→docker→k3d bridge:

  - If the image isn't already in docker's store:
      - If it's not in podman's store either, podman pull (this is
        the one-time hit we can't avoid).
      - podman save → docker load.
  - k3d image import into the cluster's containerd.

Steady-state this is a few-hundred-ms operation (no Hub calls, no
registry traffic). Require docker in the preflight list since we
depend on it for the cross-runtime bridge.

Also bump the Available-wait from 60 s to 120 s — the post-import
pod spin-up is fast but the scheduler + loadbalancer update take
longer than I initially budgeted.

VM-side nginx pulls are still at Hub's mercy; addressing that
requires either (a) docker login before the smoke, (b) an
authenticated registry mirror, or (c) arch-specific image
pre-seeding into the VM. All Chapter-2+ follow-ups.
Chapter 2 groundwork. The on-wire AgentStatus the agent publishes
every 30 s was only carrying device_id + status + timestamp — not
enough for the operator to answer "how are my deployments doing."
Enrich it so the operator can aggregate into a useful
DeploymentStatus.aggregate subtree on the CR (second commit).

**harmony-reconciler-contracts/src/status.rs**

- `AgentStatus.deployments: BTreeMap<String, DeploymentPhase>` —
  keyed by deployment name (CR's metadata.name). Each phase carries
  `{ phase: Running|Failed|Pending, last_event_at, last_error }`.
- `AgentStatus.recent_events: Vec<EventEntry>` — ring buffer of the
  most recent reconcile events on this device. Each entry is
  `{ at, severity: Info|Warn|Error, message, deployment: Option }`.
  Bounded agent-side to keep JetStream per-message size sane.
- `AgentStatus.inventory: Option<InventorySnapshot>` — hostname,
  arch, os, kernel, cpu_cores, memory_mb, agent_version. Published
  once on startup.
- All three new fields are `#[serde(default)]` — mixed-fleet upgrades
  don't break: an old agent's payload deserializes into the new
  struct (deployments empty, events empty, inventory None); a new
  agent's payload deserializes into an old operator just losing the
  fields.

New tests (kept forward-compat front and center):
  - `minimal_status_roundtrip` — empty maps / None
  - `enriched_status_roundtrip` — full population
  - `old_wire_format_parses_into_enriched_struct` — pre-Chapter-2
    payload must still parse (the upgrade guarantee)
  - `wire_keys_present` — literal wire-format pins for smoke greps

**iot-agent-v0**

Reconciler gains a `StatusState { deployments, recent_events }` side
map with a bounded ring buffer (`EVENT_RING_CAP = 32`). Every code
path that changes deployment state now also records phase + event:

  - `apply()`: Pending → Running on success, Failed + error event on
    failure.
  - `remove()`: drops phase, emits "deployment deleted" info event.
  - `tick()` (periodic reconcile): keeps phase at Running on noop;
    flips to Failed + event on error (deliberately no event on
    successful no-change ticks — 30 s cadence would drown the ring).

New helper `deployment_from_key(key)` unwraps `<device>.<deployment>`
into just the deployment name. `short(s)` truncates error strings to
512 chars so the payload stays well under NATS JetStream limits.
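
A sketch of the three small helpers described above, using only std.
The signatures are guesses: the events are plain strings here rather
than EventEntry values, `push_bounded` is a hypothetical name, and the
real `short` may also append a truncation marker.

```rust
use std::collections::VecDeque;

const EVENT_RING_CAP: usize = 32;
const MAX_ERROR_CHARS: usize = 512;

/// Push onto the ring, evicting the oldest entry once the cap is hit.
fn push_bounded(ring: &mut VecDeque<String>, event: String) {
    if ring.len() == EVENT_RING_CAP {
        ring.pop_front();
    }
    ring.push_back(event);
}

/// Truncate to at most MAX_ERROR_CHARS characters. Counting chars
/// rather than bytes never splits a UTF-8 sequence.
fn short(s: &str) -> String {
    s.chars().take(MAX_ERROR_CHARS).collect()
}

/// `<device>.<deployment>` -> `<deployment>`; None when there is no dot.
fn deployment_from_key(key: &str) -> Option<&str> {
    key.split_once('.').map(|(_device, deployment)| deployment)
}

fn main() {
    let mut ring = VecDeque::new();
    for i in 0..40 {
        push_bounded(&mut ring, format!("event {i}"));
    }
    println!("ring holds {} events, oldest is {:?}", ring.len(), ring.front());
    println!("{:?}", deployment_from_key("iot-smoke-vm.hello-world"));
    println!("{} chars", short(&"x".repeat(600)).chars().count());
}
```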

`report_status()` in main.rs now snapshots the reconciler's status
state on every heartbeat and publishes the full enriched payload
alongside a startup-captured InventorySnapshot. Inventory reads
`/proc/sys/kernel/osrelease` + `/proc/meminfo` + `std::env::consts::ARCH`
with graceful fallbacks — no new sys-info crate dep.

Verified: `cargo test -p harmony-reconciler-contracts --lib` 7/7 green
(5 new). Operator consumption of the new fields lands in the next
commit.
The operator watches the `agent-status` bucket, keeps a per-device
snapshot in memory, and folds it into each Deployment CR's
`.status.aggregate` subtree every 5 seconds. The answer to the user's
stated requirement ("CRD .status reflect-back: per-device
succeeded/failed counts + recent log lines") now lives in the CR
itself, observable via `kubectl get -o jsonpath` or any UI that
speaks k8s status subresources.

**Shape (in iot/iot-operator-v0/src/crd.rs)**

  DeploymentStatus {
    observed_score_string,   // unchanged; controller change-detect
    aggregate: Option<{
      succeeded: u32,        // devices with Phase::Running
      failed: u32,           // devices with Phase::Failed
      pending: u32,          // devices with Phase::Pending or
                             // reported-but-no-phase-entry-yet
      unreported: u32,       // target devices that never heartbeated
      last_error: Option<{   // most recent failing device + short msg
        device_id, message, at
      }>,
      recent_events: Vec<{   // last-N events across the fleet, newest first
        at, severity, device_id, message, deployment
      }>,
      last_heartbeat_at,     // freshness signal for the whole fleet
    }>
  }

**New module** `iot/iot-operator-v0/src/aggregate.rs`

  - `watch_status_bucket`: subscribes to `status.>` on the
    agent-status bucket, maintains a `BTreeMap<device_id, AgentStatus>`
    in memory. Malformed payloads + malformed keys log-and-skip; the
    snapshot map is always the latest good shape.
  - `aggregate_loop`: 5 s ticker. Per tick: list Deployment CRs,
    clone the snapshot (no lock held across network calls), compute
    each CR's aggregate, JSON-Merge-Patch `.status.aggregate`. Merge
    patch composes cleanly with the controller's
    `observedScoreString` patch; neither clobbers the other.
  - `compute_aggregate` pure fn: classification logic is in one
    place, four unit tests pin its behaviour (counts + unreported,
    reported-but-no-phase-entry = pending, event filter matches
    deployment name only, status-key parser).
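
The classification rules can be sketched as a pure function over a
snapshot map. The types below are simplified stand-ins (the real
snapshot holds full AgentStatus values and the aggregate also carries
last_error and recent_events), so read this as an illustration of the
counting rules only.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Running,
    Failed,
    Pending,
}

#[derive(Debug, Default, PartialEq, Eq)]
struct Counts {
    succeeded: u32,
    failed: u32,
    pending: u32,
    unreported: u32,
}

/// `snapshots` maps device id -> (deployment name -> last reported phase).
fn compute_aggregate(
    deployment: &str,
    target_devices: &[&str],
    snapshots: &BTreeMap<String, BTreeMap<String, Phase>>,
) -> Counts {
    let mut counts = Counts::default();
    for device in target_devices {
        match snapshots.get(*device) {
            // Target device that never heartbeated.
            None => counts.unreported += 1,
            Some(deployments) => match deployments.get(deployment) {
                Some(Phase::Running) => counts.succeeded += 1,
                Some(Phase::Failed) => counts.failed += 1,
                // Reported, but no phase entry yet, counts as pending.
                Some(Phase::Pending) | None => counts.pending += 1,
            },
        }
    }
    counts
}

fn main() {
    let mut snapshots = BTreeMap::new();
    snapshots.insert(
        "vm1".to_string(),
        BTreeMap::from([("hello-world".to_string(), Phase::Running)]),
    );
    snapshots.insert("vm2".to_string(), BTreeMap::new());
    let counts = compute_aggregate("hello-world", &["vm1", "vm2", "vm3"], &snapshots);
    println!("{counts:?}");
}
```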

**Operator wiring** (`main.rs`)

  `run()` now opens *both* KV buckets at startup, spawns the
  controller and the aggregator concurrently via
  `tokio::select!`. Either returning an error tears the process
  down — kube-rs's Controller already absorbs transient reconcile
  errors internally, so anything escaping is genuinely fatal.

**Controller tweak**

  The apply path's `patch_status` was rebuilding the whole
  `DeploymentStatus` struct, which would clobber the aggregator's
  writes. Switched to raw JSON-Merge-Patch for the
  `observedScoreString` field only. Behaviour preserved, aggregate
  subtree left intact.

**Smoke assertion** (smoke-a4.sh --auto)

  After apply + curl succeeds, the --auto path now asserts
  `kubectl get deployment.iot.nationtech.io ... -o
  jsonpath='{.status.aggregate.succeeded}'` reaches 1 within
  60 s. Proves the full agent → status bucket → operator aggregate →
  CRD status loop, end to end.

Verified locally: `cargo test -p iot-operator-v0 --lib` 4/4 green,
`cargo check --all-targets --all-features` clean.
Two changes that compose into one win: the smoke no longer needs a
functional Docker Hub to exercise the agent → podman → container
loop.

**harmony/src/modules/podman/topology.rs — IfNotPresent for image pull**

`PodmanTopology::ensure_service_running` was calling `podman pull`
on every reconcile, even when the image was already in the local
store. For a long-lived device agent reconciling against a public
registry, that's a guaranteed rate-limit collision: Docker Hub caps
unauthenticated pulls at 100 manifests per 6 h per IP, and an agent
ticking every 30 s (120 reconciles per hour) exhausts that allowance
in under an hour.

Change the pull path to check the local store first:

    if images.get(image).exists().await? { return Ok(()); }
    // else: pull

Matches Kubernetes' `imagePullPolicy: IfNotPresent` semantics.
Correct default for the IoT platform: upgrades change the image
STRING (tag or digest), so they still hit the pull branch —
"use local if available, pull the new thing if the reference changed."

**iot/scripts/smoke-a4.sh — tarball sideload in place of registry**

An earlier iteration of this smoke stood up a local `registry:2`
container and pushed tagged images into it. That pattern itself
needs to pull `registry:2` from Docker Hub — cute demo, still
Hub-dependent. Gone now.

New phase 4.5 / 5c pair:

  4.5: podman save the cached `nginx:alpine` under two local tags
       (`localdev/nginx:v1`, `localdev/nginx:v2`) into a tarball on
       the host.
  5c:  scp the tarball to the VM, `podman load` it into the
       iot-agent user's rootless store.

Paired with the new IfNotPresent semantics, the agent's reconcile
sees both images already present and never touches a registry. The
upgrade test still works because `v1` and `v2` are distinct tag
strings → spec drift → container id changes.
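
A synchronous sketch of the IfNotPresent behaviour the smoke now leans
on. `ImageStore`, `FakeStore`, and `ensure_image` are hypothetical
stand-ins; the real code is async and talks to podman. The point is
only the control flow: check the local store, pull on a miss.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the podman image API.
trait ImageStore {
    fn exists(&self, image: &str) -> bool;
    fn pull(&mut self, image: &str) -> Result<(), String>;
}

/// Ok(true) if a pull happened, Ok(false) if the local image was reused.
fn ensure_image<S: ImageStore>(store: &mut S, image: &str) -> Result<bool, String> {
    if store.exists(image) {
        return Ok(false);
    }
    store.pull(image)?;
    Ok(true)
}

/// In-memory store that counts how often the "registry" is hit.
#[derive(Default)]
struct FakeStore {
    local: HashSet<String>,
    pulls: u32,
}

impl ImageStore for FakeStore {
    fn exists(&self, image: &str) -> bool {
        self.local.contains(image)
    }

    fn pull(&mut self, image: &str) -> Result<(), String> {
        self.pulls += 1;
        self.local.insert(image.to_string());
        Ok(())
    }
}

fn main() {
    let mut store = FakeStore::default();
    // Sideloaded tags, as in phase 5c.
    store.local.insert("localdev/nginx:v1".to_string());
    store.local.insert("localdev/nginx:v2".to_string());
    for image in ["localdev/nginx:v1", "localdev/nginx:v2", "localdev/nginx:v3"] {
        let pulled = ensure_image(&mut store, image).unwrap();
        println!("{image}: pulled={pulled}");
    }
    println!("registry hits: {}", store.pulls);
}
```

Only a reference that is missing locally reaches the pull branch, so
repeated reconciles of an unchanged spec cost no registry traffic.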

Dropped the `docker` preflight (no more k3d-side registry transfer)
and the `LOCAL_REGISTRY_*` env vars.

Verified end-to-end: x86 smoke-a4 --auto PASS.
  - apply v1 → container up → curl 200
  - .status.aggregate.succeeded = 1 (Chapter 2 aggregator working)
  - apply v2 → container id changes (upgrade confirmed)
  - delete → container removed

Aarch64 run next.
Running smoke-a4 with `ARCH=aarch64` after an `ARCH=x86-64` run
rebinds the local `nginx:alpine` tag to arm64 (or vice versa),
silently breaking the other arch's next run. Fail fast if the
cached image arch doesn't match the smoke's ARCH, with the exact
command to fix it (`podman pull --platform=linux/<arch> ...`).
`podman save -m` produces an OCI multi-image archive format that
older podman versions in the Ubuntu 24.04 cloud image cannot load:

  Error: payload does not match any of the supported image formats:
   * oci-archive: loading index: ...index.json: no such file or directory

Downgrade to the single-image docker-archive format (default for
`podman save`): save the source image once, load once in the VM,
then `podman tag` twice to expose it under `localdev/nginx:v1` and
`:v2`. Same bits on disk, two distinct tag references, so the
upgrade test still sees a container-id change when the Score
flips from v1 to v2.
`kubectl wait --for=condition=Available` reports on pod readiness, but k3d's
klipper-lb takes a few more seconds to wire the host loadbalancer
port to Service endpoints. Without this extra wait the operator
races the routing and dies with 'expected INFO, got nothing.'
qemu-img create with no trailing size inherits the backing
image's virtual size. The Ubuntu cloud image ships with ~2 GiB
of root, which fills up as soon as we sideload a container
tarball in the smoke. Pass disk_size_gb through to qemu-img and
rely on cloud-initramfs-growroot (already in the base) to grow
the partition on first boot. example_iot_vm_setup defaults to
16 GiB.
Chapter 1 + Chapter 2 are both green end-to-end on x86_64 and
aarch64. Chapter 3 (helm packaging) is next. Design sketches kept
as the historical record — the running code is the source of
truth for 'how'.
push_str("…") → push('…'), and drop redundant .trim() before
.split_whitespace() in /proc/meminfo parsing.
Design doc for the aggregation rework. Chapter 2's aggregator
(O(deployments × devices) per tick) works for a 10-device smoke but
doesn't scale past a partner fleet of even modest size. Replaces it
with CQRS-style incrementally-maintained counters driven by
JetStream state-change events, device-authoritative per-device
state keys, and a separate log transport that doesn't touch
JetStream.

Review first, implement after. No runtime code changes in this
commit.

Covers data model (KV buckets, streams, subjects), counter
invariants (transition-based, duplicate-safe), cold-start protocol
(walk once, then consume), CR patch cadence (debounced dirty set),
failure modes, scale back-of-envelope for 1M devices + 10k
deployments, schema migration path (clean break, same CRD
v1alpha1), and eight-milestone landing plan.
First milestone of the aggregation rework. Lands the contract layer
without any runtime side effects: the agent + operator still run
their legacy paths unchanged.

New types (module `fleet`):
  - DeviceInfo: routing labels + inventory, rewritten on label
    change. Stored in KV `device-info` at `info.<device_id>`.
  - DeploymentState: current phase per (device, deployment).
    Stored in KV `device-state` at `state.<device>.<deployment>`.
    Authoritative snapshot; operator rebuilds counters from it on
    cold-start.
  - HeartbeatPayload: tiny liveness ping in KV `device-heartbeat`.
    Payload capped by a test (< 96 bytes) so it stays cheap at
    1M-device rates.
  - StateChangeEvent: `from: Option<Phase>, to: Phase, sequence`
    emitted on each transition to JS stream
    `device-state-events` on subject
    `events.state.<device>.<deployment>`. Operator folds these
    events into in-memory counters.
  - LogEvent: shorter-retention user-facing event log to JS stream
    `device-log-events` on subject `events.log.<device>`.

Transport constants + key/subject helpers in `kv` with
cross-component wire-stability tests so a rename here gets caught.
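
For illustration, the key and subject formats listed above written as
plain helper functions. The function names are assumptions, not
necessarily the ones in `kv`; only the string formats come from this
message.

```rust
/// KV `device-info` key: `info.<device_id>`.
fn device_info_key(device_id: &str) -> String {
    format!("info.{device_id}")
}

/// KV `device-state` key: `state.<device>.<deployment>`.
fn device_state_key(device_id: &str, deployment: &str) -> String {
    format!("state.{device_id}.{deployment}")
}

/// Stream `device-state-events` subject.
fn state_event_subject(device_id: &str, deployment: &str) -> String {
    format!("events.state.{device_id}.{deployment}")
}

/// Stream `device-log-events` subject.
fn log_event_subject(device_id: &str) -> String {
    format!("events.log.{device_id}")
}

fn main() {
    println!("{}", device_info_key("iot-smoke-vm"));
    println!("{}", device_state_key("iot-smoke-vm", "hello-world"));
    println!("{}", state_event_subject("iot-smoke-vm", "hello-world"));
    println!("{}", log_event_subject("iot-smoke-vm"));
}
```

A wire-stability test in this style pins the literal strings, so a
rename on one side of the wire fails a test instead of silently
splitting agent and operator.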

10 new tests (roundtrip serde, forward-compat parse, size bound,
key/subject format). Legacy `AgentStatus` tests + constants stay
green; retirement is scheduled for M8 once the live path has
switched over.
Agent now writes the new per-concern KV shapes + event streams
alongside the legacy AgentStatus. Nothing consumes the new data
yet — the legacy aggregator still drives CR .status from
`agent-status`. M3 will add the operator-side cold-start +
consumer paths in parity mode; M5 flips the CR-patch source once
counters verify against the legacy aggregator.

New module `fleet_publisher.rs` owns:
  - Opening + idempotent-creating the three new KV buckets
    (`device-info`, `device-state`, `device-heartbeat`) and
    two JetStream streams (`device-state-events`,
    `device-log-events`).
  - Publish methods for DeviceInfo, HeartbeatPayload, DeploymentState
    (KV put), StateChangeEvent + LogEvent (stream publish), and
    delete for deployment-state cleanup.
  - Log-and-swallow failure mode. The operator re-walks KV on
    cold-start, so a missed event publish is self-healing on the
    next transition or operator restart.

Reconciler grew:
  - `device_id`: Id + `fleet`: Option<Arc<FleetPublisher>>
  - per-(deployment) monotonic sequence counter in StatusState
  - `set_phase` detects actual transitions (prev_phase vs new) and
    emits a DeploymentState KV write + StateChangeEvent stream
    publish only on change. No-op re-confirmation still bumps the
    sequence (lets operator detect duplicate events via sequence
    comparison) but stays off the wire.
  - `drop_phase` deletes the device-state KV entry.
  - `push_event` also publishes a LogEvent to the stream.

main.rs:
  - Builds FleetPublisher after connect_nats, passes into Reconciler.
  - Publishes DeviceInfo once at startup (empty labels — populated
    by the selector-targeting branch once it merges).
  - Spawns a heartbeat loop on 30 s cadence.
  - Legacy `report_status` AgentStatus task kept running unchanged.

8 unit tests added for the transition-detection + sequence + ring-
buffer invariants (drive set_phase / drop_phase / push_event with
fleet: None). 18 contract tests from M1 still green.
New module `fleet_aggregator` spawns a 5 s tick task that:
  - Walks the Chapter 4 KV buckets (`device-info`,
    `device-state`) every tick.
  - Computes per-CR phase counters via `compute_counters` (pure
    function, unit tested).
  - Computes the legacy aggregator's counts from the same
    `agent-status` snapshot map the legacy task is already
    maintaining.
  - Compares the two per CR and logs per-tick at DEBUG level
    (matches) or WARN (mismatches), with running totals at INFO
    every 60 s.

Explicit `cr_targets_device` predicate is the one-line plug
point for the selector-based rewrite coming from the review-fix
branch: swap `target_devices.contains()` for
`target_selector.matches(&info.labels)`, everything else in the
aggregator is label/selector-agnostic.

Refactored `aggregate::run` to accept the `StatusSnapshots` map
from outside so the parity-check task reads the same agent-status
view the legacy aggregator writes to. Added `aggregate::new_snapshots()`
helper so `main` owns the one shared Arc.

The task is strictly read-only: no CR patches, no side effects. M5
flips `.status.aggregate` over to the new counter-driven path once
M4 replaces the periodic re-walk with the event-stream consumer and
the parity check has stayed green under load.

5 unit tests cover the pure counter logic (target match, multi-CR
fan-in, zero-target CR, phase dispatch).
Replaces M3's per-tick KV re-walk with an incremental
JetStream consumer on `device-state-events`. Cold-start still
walks KV once to seed counters; steady state consumes events and
applies `from -= 1; to += 1` diffs.

New in `fleet_aggregator`:

  FleetState (shared via Arc<Mutex<_>>):
    - counters: per-deployment phase counts.
    - phase_of: per-(device, deployment) current phase, for
      duplicate + resync detection.
    - latest_sequence: per-(device, deployment) highest sequence
      applied, drops stale and duplicate deliveries.
    - deployment_namespace: name → namespace map refreshed each
      parity tick from the CR list (events carry only the
      deployment name, matching the `<device>.<deployment>`
      KV key format).

  apply_state_change_event():
    - Idempotent for duplicate sequence numbers.
    - Idempotent for out-of-order lower-sequence events.
    - On from-phase disagreement with our belief, trusts the
      event and re-syncs (logs warn — parity check will catch
      any resulting drift against the legacy aggregator).
    - Counter decrement saturates at zero so replays can't
      underflow.

  run_event_consumer():
    - Durable JetStream pull consumer on STATE_EVENT_WILDCARD,
      DeliverPolicy::New (cold-start already seeded state from
      KV — replaying from the beginning would double-count).
    - Explicit ack; malformed payloads are logged + acked to
      avoid infinite redelivery.

  parity_tick() no longer walks KV — it reads live counters
  from the shared FleetState and compares with the legacy
  aggregator's per-CR fold. Same match/mismatch/running-totals
  logging as M3.
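
The event-apply rules listed above can be sketched with std maps. All
type and field layouts here are simplified assumptions. On a from-phase
disagreement this sketch decrements whatever phase was actually counted
for the pair and then adopts the event's `to`, which is one possible
reading of "re-sync", not necessarily what the operator does.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Phase {
    Running,
    Failed,
    Pending,
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Applied,
    Resynced, // event's `from` disagreed with our recorded phase
    Stale,    // duplicate or lower sequence, ignored
}

struct StateChangeEvent {
    device: String,
    deployment: String,
    from: Option<Phase>,
    to: Phase,
    sequence: u64,
}

/// Hypothetical convenience constructor for the examples.
fn event(device: &str, deployment: &str, from: Option<Phase>, to: Phase, sequence: u64) -> StateChangeEvent {
    StateChangeEvent { device: device.into(), deployment: deployment.into(), from, to, sequence }
}

#[derive(Default)]
struct FleetState {
    counters: HashMap<(String, Phase), u32>,         // (deployment, phase)
    phase_of: HashMap<(String, String), Phase>,      // (device, deployment)
    latest_sequence: HashMap<(String, String), u64>, // (device, deployment)
}

impl FleetState {
    fn apply(&mut self, ev: &StateChangeEvent) -> Outcome {
        let key = (ev.device.clone(), ev.deployment.clone());
        // Duplicates and out-of-order deliveries are no-ops.
        if let Some(&seen) = self.latest_sequence.get(&key) {
            if ev.sequence <= seen {
                return Outcome::Stale;
            }
        }
        let believed = self.phase_of.get(&key).copied();
        if let Some(prev) = believed {
            let count = self.counters.entry((ev.deployment.clone(), prev)).or_insert(0);
            *count = count.saturating_sub(1); // never underflows on replay
        }
        *self.counters.entry((ev.deployment.clone(), ev.to)).or_insert(0) += 1;
        self.phase_of.insert(key.clone(), ev.to);
        self.latest_sequence.insert(key, ev.sequence);
        if believed == ev.from { Outcome::Applied } else { Outcome::Resynced }
    }

    fn count(&self, deployment: &str, phase: Phase) -> u32 {
        self.counters.get(&(deployment.to_string(), phase)).copied().unwrap_or(0)
    }
}

fn main() {
    let mut state = FleetState::default();
    println!("{:?}", state.apply(&event("vm1", "hello", None, Phase::Pending, 1)));
    println!("{:?}", state.apply(&event("vm1", "hello", Some(Phase::Pending), Phase::Running, 2)));
    println!("{:?}", state.apply(&event("vm1", "hello", Some(Phase::Pending), Phase::Running, 2)));
    println!("running = {}", state.count("hello", Phase::Running));
}
```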

8 new unit tests cover the event-apply invariants: first
transition (no from), transition (from+to), duplicate sequence,
out-of-order sequence, from-disagreement resync, unknown-
deployment ignore, cold-start seeding, underflow saturation.
Plus the 5 M3 tests from before — 13 aggregator tests total,
all green.
Smoke was silent about the Chapter 4 parity check because the
operator log got discarded on successful runs. Add a pre-cleanup
step that greps for `fleet-aggregator` log lines and prints the
last 20; if any `parity MISMATCH` line is present, upgrade to
`fail` — smoke exit 0 shouldn't hide a silently-wrong new
aggregator.
Chapter 4's parity check in smoke-a4 caught M4 dropping events —
operator's consumer saw 1 of 3 state transitions, parity-mismatch
assertion fired.

Root cause: async-nats's jetstream.publish() returns a
PublishAckFuture that must be awaited for the server to persist
the message. Without that await, the publish is effectively
fire-and-forget and drops under any backpressure — which on the
smoke's agent-first-boot path is every publish until the stream
state stabilizes.

Fix awaits both the publish future (send) and the returned
PublishAckFuture (server ack) for state-change + log events.
State-change events are warn-on-failure (operator needs them);
log events are debug-on-failure (device-side ring buffer is
authoritative).
Two findings from the M4 smoke runs:

1. **Event consumer dropped events for unknown-namespace deployments.**
   The consumer receives state-change events but `apply_state_change_event`
   short-circuits when `deployment_namespace` doesn't have the
   deployment yet — common on the first 5 s after a new CR is
   applied, before the parity-tick's refresh loop runs.

   Fix: on unknown deployment, consumer eagerly does a kube
   `Api::list()` and populates the map. Subsequent events for
   that deployment are fast-path (map already has it).

   Also: added instrumentation on publish + receive paths so
   future debugging against the parity check produces actionable
   traces. Log level is DEBUG to keep INFO clean.

2. **Parity MISMATCH during transitions is correct behavior.**
   The legacy aggregator reads AgentStatus which the agent
   republishes every 30 s. Chapter 4 state-change events land in
   ~100 ms. So during a Pending→Running transition there's a
   window where the new counter shows succeeded=1 while legacy
   still shows pending=1 — precisely because the new path is
   faster, which is the point of this rework.

   The smoke's hard-fail-on-any-mismatch was too strict; relaxed
   to a diagnostic print. Steady state should still converge to
   zero mismatches once the next AgentStatus heartbeat lands; the
   summary lets the user spot sustained divergence by eye. M5
   removes the legacy path entirely, making the parity check
   moot.

Agent-side publish now also surfaces subject + sequence + stream-seq
on every state-change publish, a similar diagnostic aid for tracing
wire deliveries.
Newtypes (review point #3) were the entry point. Introducing them forced
the event-payload redesign, and the redesign made the other two
bugs obvious + trivial to fix.

New contract types (harmony-reconciler-contracts::fleet):
  - DeploymentName: validated newtype. Rejects empty, > 253 bytes,
    '.' (would alias an extra NATS subject token), NATS wildcards, and
    whitespace. Serde impl validates on deserialize so a malformed
    payload is rejected at the wire, not later.
  - AgentEpoch(u64): random-per-process. Prefixes every sequence.
  - Revision { agent_epoch, sequence } with lexicographic Ord.
  - LifecycleTransition enum: Applied { from, to, last_error } |
    Removed { from }. Replaces (from: Option<Phase>, to: Phase) so
    deletion is modeled explicitly in the wire format.
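
A sketch of the validated-newtype idea using only std. The error enum
and `parse` constructor are assumptions; the real type also validates
inside its serde Deserialize impl, which is left out here to avoid a
crate dependency.

```rust
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct DeploymentName(String);

#[derive(Debug, PartialEq, Eq)]
pub enum NameError {
    Empty,
    TooLong,
    ForbiddenChar(char),
}

impl DeploymentName {
    /// Rejects the cases listed above: empty, > 253 bytes, '.', the
    /// NATS wildcards '*' and '>', and any whitespace.
    pub fn parse(s: &str) -> Result<Self, NameError> {
        if s.is_empty() {
            return Err(NameError::Empty);
        }
        if s.len() > 253 {
            return Err(NameError::TooLong);
        }
        let forbidden = s
            .chars()
            .find(|c| matches!(*c, '.' | '*' | '>') || c.is_whitespace());
        if let Some(c) = forbidden {
            return Err(NameError::ForbiddenChar(c));
        }
        Ok(Self(s.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    println!("{:?}", DeploymentName::parse("hello-world"));
    println!("{:?}", DeploymentName::parse("hello.world"));
    println!("{:?}", DeploymentName::parse(""));
}
```

Because the only way to obtain a DeploymentName is through the
validating constructor, code that builds KV keys or subjects from one
does not need to re-check it.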

Bug fixes that fell out of the redesign:

  #1 (drop_phase was silent on the wire): `drop_phase` now
     produces a RecordedTransition with Removed { from }, which
     the publisher serializes into a StateChangeEvent. Operator
     applies the Removed variant by decrementing `from` without
     a paired increment. Counters no longer over-count after
     deletions.

  #2 (sequence reset on agent restart): (agent_epoch, sequence)
     lexicographic ordering means the first post-restart event
     (seq=1 under a fresh epoch) outranks any pre-restart event
     the operator had applied. No more silently-dropped events
     after an agent crash.
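
The ordering argument in #2 can be illustrated with derived Ord, which
compares struct fields in declaration order. This sketch assumes the
post-restart epoch compares greater than the pre-restart one; how the
real code guarantees that with a random epoch is not shown in this
message.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct AgentEpoch(u64);

/// Field order matters: derived Ord compares agent_epoch first and
/// only falls back to sequence on a tie.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Revision {
    agent_epoch: AgentEpoch,
    sequence: u64,
}

fn main() {
    let before_restart = Revision { agent_epoch: AgentEpoch(3), sequence: 900 };
    // Sequence restarts at 1, but under a (here: larger) fresh epoch.
    let after_restart = Revision { agent_epoch: AgentEpoch(7), sequence: 1 };
    println!("post-restart event is newer: {}", after_restart > before_restart);
}
```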

Split recommended in review point #4:
  - `record_apply` / `record_remove`: pure in-memory state
    updates returning Option<RecordedTransition>.
  - `publish_transition`: side-effectful wire emission.
  - `apply_phase` / `drop_phase`: thin composite helpers the
    hot path uses.

Typed keys in the operator:
  - DevicePair { device_id, deployment: DeploymentName } replaces
    (String, String) so the two identifiers can't be swapped.
  - FleetState.deployment_namespace is keyed by DeploymentName.
  - Controller's kv_key signature takes &DeploymentName; invalid
    CR names surface as a clear Error rather than corrupting KV.

Tests:
  - 27 contract tests (roundtrip every payload shape, including
    forward-compat parsing; validate DeploymentName rejection
    paths; assert Revision ordering across epochs).
  - 19 operator fleet_aggregator tests, including regression
    guards named for the specific bugs:
      removed_transition_decrements_without_paired_increment  (#1)
      revision_ordering_handles_agent_restart                 (#2)
  - 8 agent reconciler tests (record_apply/record_remove purity,
    sequence monotonicity, agent_epoch stamping, ring buffer
    cap).

Agent main wires a fresh AgentEpoch via rand::random::<u64>() at
startup; FleetPublisher::connect takes it and includes it in every
DeviceInfo + state-change event.
Chapter 4 shipped per-concern wire types (DeviceInfo, DeploymentState,
HeartbeatPayload, StateChangeEvent) as replacements for the monolithic
AgentStatus heartbeat. The parity check proved the new path matches the
legacy one; legacy now goes.

Removed:
- AgentStatus, DeploymentPhase, EventEntry, agent-status bucket, status_key
- iot-operator-v0/src/aggregate.rs (legacy full-recompute aggregator)
- Parity machinery in fleet_aggregator.rs (ParityStats, parity_tick, dual-write)
- Agent recent_events ring + push_event (consumed only by AgentStatus)
- publish_log_event + device-log-events stream (no consumer, YAGNI)

fleet_aggregator now drives CR .status.aggregate directly: event consumer
maintains counters incrementally, 1 Hz patch_tick flushes only deployments
in the `dirty` set.

Net: ~1000 lines removed (4263 → 3216 across the three iot crates).
Wire surface: 5 types → 4. Operator tasks: 4 → 2 (controller + aggregator).

Tests: 21 contracts + 9 operator + 6 agent — all green.
Zero consumers, zero publishers — pure speculative surface area.
Drops LogEvent struct, EventSeverity enum, STREAM_DEVICE_LOG_EVENTS,
log_event_subject, logs_subject, logs_query_subject.

If per-device log streaming lands later, it arrives with a real
consumer attached.

Contracts tests: 21 → 19 (removed two roundtrip tests for the deleted type).
Collapses the Chapter 4 event-stream architecture into pure KV watch.
The operator was maintaining a durable JetStream consumer on
device-state-events in parallel with the KV bucket it was meant to
shadow — the stream was an optimization over KV scanning, but with
async-nats's ordered bucket watch it's redundant.

Gone:
- StateChangeEvent, LifecycleTransition, STREAM_DEVICE_STATE_EVENTS,
  state_event_subject, STATE_EVENT_WILDCARD (contracts)
- Revision, AgentEpoch (contracts) — restart ordering now handled by
  DeploymentState.last_event_at monotonic check
- PhaseCounters.apply_event + incremental diff machinery (operator) —
  counters recomputed per dirty CR from the states snapshot
- RecordedTransition + publish_transition split (agent) — without an
  event to publish, the pure/publish boundary has no reason to exist
- Agent sequence counter + agent_epoch generation (agent main.rs)
- CR aggregate fields recent_events, last_heartbeat_at, unreported —
  never populated, pure speculation

New shape:
- fleet_aggregator.rs watches device-state via bucket.watch_all_from_revision(0)
- apply_state / drop_state mutate an in-memory snapshot
- patch_tick refreshes CR index from kube, recomputes aggregates for
  CRs marked dirty, patches CR status
- DeploymentAggregate = succeeded/failed/pending + last_error only
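
A minimal sketch of the snapshot-plus-dirty-set shape listed above, using only std. The `Snapshot` type, its field layout, the `Phase` enum, and the Running -> succeeded mapping are illustrative assumptions, not the operator's actual code; only the names apply_state, drop_state, patch_tick, last_event_at, and the aggregate fields come from this commit.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Phase reported by a device for one deployment (assumed variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Running,
    Failed,
    Pending,
}

/// Hypothetical stand-in for the DeploymentState wire type.
#[derive(Clone, Debug)]
pub struct DeploymentState {
    pub phase: Phase,
    pub last_event_at: u64, // assumed: a timestamp that only grows per key
    pub last_error: Option<String>,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct DeploymentAggregate {
    pub succeeded: u32,
    pub failed: u32,
    pub pending: u32,
    pub last_error: Option<String>,
}

#[derive(Default)]
pub struct Snapshot {
    /// (deployment, device) -> latest accepted state.
    states: BTreeMap<(String, String), DeploymentState>,
    /// Deployments whose aggregate must be recomputed on the next tick.
    dirty: BTreeSet<String>,
}

impl Snapshot {
    /// Accepts a state unless it is older than the one already held
    /// (the last_event_at monotonic check). Returns false when dropped as stale.
    pub fn apply_state(&mut self, deployment: &str, device: &str, state: DeploymentState) -> bool {
        let key = (deployment.to_string(), device.to_string());
        if let Some(current) = self.states.get(&key) {
            if state.last_event_at < current.last_event_at {
                return false;
            }
        }
        self.states.insert(key, state);
        self.dirty.insert(deployment.to_string());
        true
    }

    /// Removes a device's state and marks the deployment dirty if it existed.
    pub fn drop_state(&mut self, deployment: &str, device: &str) {
        let key = (deployment.to_string(), device.to_string());
        if self.states.remove(&key).is_some() {
            self.dirty.insert(deployment.to_string());
        }
    }

    /// Recomputes aggregates for dirty deployments only, then clears the dirty set.
    pub fn patch_tick(&mut self) -> BTreeMap<String, DeploymentAggregate> {
        let dirty = std::mem::take(&mut self.dirty);
        let mut out = BTreeMap::new();
        for name in dirty {
            let mut agg = DeploymentAggregate::default();
            let mut latest_error_at = 0u64;
            for ((dep, _device), st) in &self.states {
                if dep != &name {
                    continue;
                }
                match st.phase {
                    Phase::Running => agg.succeeded += 1, // assumed mapping
                    Phase::Failed => agg.failed += 1,
                    Phase::Pending => agg.pending += 1,
                }
                if let Some(err) = &st.last_error {
                    if st.last_event_at >= latest_error_at {
                        latest_error_at = st.last_event_at;
                        agg.last_error = Some(err.clone());
                    }
                }
            }
            out.insert(name, agg);
        }
        out
    }
}
```

Recomputing from the snapshot trades a little CPU per dirty CR for not having to keep incremental counters consistent across restarts and replays.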

Line counts (3 iot crates):
  4263 -> 3090 -> 2162 (-49% overall, -30% this pass)

Tests: 24 total (13 contracts + 6 operator + 5 agent), all green.
- agent-status bucket -> device-heartbeat bucket
- status.<device> key -> heartbeat.<device>
- drop parity check summary from smoke-a4 (legacy path is gone)
- tidy stale AgentStatus comment in agent main
`bucket.watch_all_from_revision(0)` sends the JetStream consumer
request with DeliverByStartSequence but no start sequence set,
which the server rejects with error 10094:

  consumer delivery policy is deliver by start sequence, but
  optional start sequence is not set

`watch_with_history(">")` uses DeliverPolicy::LastPerSubject instead —
replays the current value of every key, then streams live updates.
Same cold-start-plus-steady-state semantics, correct wire.

Caught by smoke-a4 --auto: state watcher exited immediately on
startup, no deployments ever reconciled.
- example_iot_load_test: simulates N devices (default 100 across 10
  groups: 55 + 9×5) pushing DeploymentState every tick to NATS, no
  real podman. Applies one Deployment CR per group, runs for a
  bounded duration, verifies each CR's .status.aggregate counters
  sum to the target device count.

- iot/scripts/load-test.sh: minimum harness — k3d cluster + NATS via
  NatsBasicScore + CRD + operator + load-test binary. No VM, no
  agent build.

- operator: connect_with_retry() on startup. The NATS TCP probe that
  the smoke scripts do isn't enough to guarantee the protocol
  handshake is ready (k3d loadbalancer can accept SYNs before the
  pod is serving); the load harness hit this racing against a
  freshly-rebuilt operator binary.

- drop unused rand dep from iot-agent-v0 Cargo.toml.
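
The connect_with_retry() idea from the list above can be sketched as a capped exponential backoff loop. This is a synchronous, std-only illustration; the signature, the attempt limit, and the delay parameters are assumptions, and the operator's real helper is async and talks to NATS.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `connect` until it succeeds or `max_attempts` is reached,
/// doubling the delay after each failure up to `max_delay`.
/// The closure receives the 1-based attempt number.
pub fn connect_with_retry<T, E, F>(
    mut connect: F,
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match connect(attempt) {
            Ok(conn) => return Ok(conn),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

Retrying the full handshake, rather than probing the TCP port, is what covers the window where the load balancer accepts connections before the pod is serving.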

100-device run: 6002 state writes in 60s at a clean 100 writes/s,
all 10 CR aggregates converge to target_devices.len() (e.g.
group-00 → 55 = 45 Running + 9 Failed + 1 Pending).
Sequential apply was fine at 10 groups; becomes the startup bottleneck
at 1000. 32-way concurrent CR apply lands 1000 Deployment CRs in ~1.6s;
64-way concurrent DeviceInfo seed seeds 10k devices in ~0.3s.

Also zero-pad CR names and device ids to the largest width so large
runs sort lexicographically in kubectl.
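
A small sketch of the zero-padding described above. The function name and the choice to derive the width from the total count are assumptions for illustration.

```rust
/// Builds `count` names like "group-00", padded so that plain string
/// ordering (as used by kubectl) matches numeric ordering.
pub fn padded_names(prefix: &str, count: usize) -> Vec<String> {
    // Width = number of decimal digits in `count` (assumed convention).
    let width = count.to_string().len();
    (0..count)
        .map(|i| format!("{prefix}-{i:0width$}"))
        .collect()
}
```

Without padding, "cr-10" sorts before "cr-2"; with a fixed width the listing order and the numeric order agree.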
feat(iot-load-test): stable paths + HOLD=1 interactive mode
Some checks failed
Run Check Script / check (pull_request) Failing after 52s
5e8e72df52
- Stable working dir under /tmp/iot-load-test/ — kubeconfig at
  /tmp/iot-load-test/kubeconfig, operator log at
  /tmp/iot-load-test/operator.log. No more chasing mktemp paths.

- Print an explore banner before the load run so the user can
  `export KUBECONFIG=...` and `kubectl get deployments -w` in
  another terminal while the load actually runs.

- HOLD=1 env var keeps the stack alive after the load completes;
  script blocks on sleep until Ctrl-C. Forwards --keep to the
  binary so CRs + KV entries stay in place for inspection.

- DEBUG=1 bumps operator RUST_LOG to surface every status patch.

- Keep operator.log after successful runs (cheap, often useful).

- Load-test binary: --cleanup bool → --keep flag (clap bool with
  default_value_t = true doesn't accept `--cleanup=false`).
Kills the "CRD owns a list of device ids" smell. Deployment CR now
carries a standard K8s LabelSelector; Device is a first-class cluster-
scoped CR (like Node). Matching, desired-state KV writes, and status
aggregation all run off selector evaluation against the Device cache
— no list of device ids anywhere in the CRD spec.

Cross-resource model:
- Agent publishes DeviceInfo (with labels) to NATS `device-info` KV.
- device_reconciler watches that bucket → server-side-applies a
  cluster-scoped Device CR with metadata.labels + spec.inventory.
- Deployment controller is now just validation + finalizer cleanup.
- fleet_aggregator watches Deployment CRs + Device CRs + device-state
  KV, maintains in-memory selector → target device sets, writes/deletes
  `desired-state.<device>.<deployment>` KV on match changes, patches
  `.status.aggregate` at 1 Hz with matchedDeviceCount + phase counters.
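
To make the selector evaluation concrete, here is a std-only sketch of Kubernetes-style label matching (matchLabels plus matchExpressions) and of the desired-state key format quoted above. The `Selector`, `Requirement`, and `Operator` types are simplified stand-ins for the k8s LabelSelector types, not the operator's code.

```rust
use std::collections::BTreeMap;

pub type Labels = BTreeMap<String, String>;

/// The four operators allowed in matchExpressions.
pub enum Operator {
    In,
    NotIn,
    Exists,
    DoesNotExist,
}

pub struct Requirement {
    pub key: String,
    pub operator: Operator,
    pub values: Vec<String>,
}

/// Simplified stand-in for a LabelSelector.
#[derive(Default)]
pub struct Selector {
    pub match_labels: Labels,
    pub match_expressions: Vec<Requirement>,
}

impl Selector {
    /// All matchLabels and all matchExpressions must hold (logical AND).
    /// An empty selector matches every device.
    pub fn matches(&self, labels: &Labels) -> bool {
        let labels_ok = self
            .match_labels
            .iter()
            .all(|(k, v)| labels.get(k) == Some(v));
        let exprs_ok = self.match_expressions.iter().all(|req| {
            let actual = labels.get(&req.key);
            match req.operator {
                Operator::In => actual.map_or(false, |v| req.values.contains(v)),
                // NotIn also matches when the key is absent.
                Operator::NotIn => actual.map_or(true, |v| !req.values.contains(v)),
                Operator::Exists => actual.is_some(),
                Operator::DoesNotExist => actual.is_none(),
            }
        });
        labels_ok && exprs_ok
    }
}

/// KV key written for each (device, deployment) match.
pub fn desired_state_key(device: &str, deployment: &str) -> String {
    format!("desired-state.{device}.{deployment}")
}
```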

Applied CRD shape verified on a live k3d cluster:
  kubectl get crd deployments.iot.nationtech.io -o json
    .spec.versions[0].schema.openAPIV3Schema.properties.spec
      → rollout / score / targetSelector (matchLabels + matchExpressions)
    .spec.versions[0].schema.openAPIV3Schema.properties.status.aggregate
      → matchedDeviceCount / succeeded / failed / pending / lastError
  kubectl get crd devices.iot.nationtech.io -o json
    .spec.scope = "Cluster"
    .spec.versions[0].schema.openAPIV3Schema.properties.spec
      → inventory (nullable, camelCased fields)

Load-test run: DEVICES=20 GROUP_SIZES=10,5,5 DURATION=20
  all 3 CRs hit expected matched=N / succeeded+failed+pending=N.

Other changes:
- k8s-openapi gets the `schemars` feature so LabelSelector derives JsonSchema.
- InventorySnapshot uses `#[serde(rename_all = "camelCase")]` for consistency with the rest of the CRD schema.
- agent publishes `device-id=<id>` as a default label so the
  example_iot_apply_deployment `--target-device <id>` shorthand
  works out-of-the-box (implemented as `--selector device-id=<id>`).
- example_iot_apply_deployment gains `--selector key=value` repeatable flag.
- load-test.sh explore banner exposes Device CR commands + new
  matchedDeviceCount column.
Roadmap:
- v0_1_plan.md Chapter 2: rewrite to describe the shipped selector +
  Device CRD model (matchedDeviceCount, LabelSelector, per-concern KV).
  Drop AgentStatus / observed_score_string / target_devices references.
  Update "State of the world" preamble to match 2026-04-23 reality.
- chapter_4_aggregation_scale.md: SUPERSEDED banner at top with a
  clear what-was-kept vs. what-was-dropped summary. Original body
  preserved as decision-trail archaeology.

Code review pass on the iot crates, behavior-preserving:
- fleet_aggregator: owned_targets is now keyed by DeploymentName
  (matches the KV key space — globally unique, no namespace). The
  old DeploymentKey keying created an orphan-leak on operator
  restart: seed_owned_targets stashed entries under a sentinel
  namespace ("") that on_deployment_upsert never merged. Now
  seeding populates the map correctly so restart + selector change
  diffs properly.
- fleet_aggregator: reuse the Client passed into run() for the
  patch_api instead of calling Client::try_default() a second time.
- fleet_aggregator: delete _use_list_params / _use_deployment_spec
  placeholder scaffolding + unused ListParams / DeploymentSpec /
  ScorePayload imports. Inline one-liner serialize_score.
- fleet_aggregator: clean up `then(|| ...)` → filter/map split.
- device_reconciler: `is_label_value(v).then_some(()).is_some()`
  → plain `is_label_value(v)`.
- crd: delete speculative DeviceStatus + DeviceCondition (no one
  writes to them; the comment in DeviceSpec documents where they'd
  land when a heartbeat-reflection reconciler shows up).
- controller: compute `obj.name_any()` once in cleanup().
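
The owned_targets fix in the first bullet can be illustrated with a map keyed by deployment name only, seeded from existing KV keys and diffed on every upsert. The types and method names below are hypothetical; the point is that seeding and upserts share one key space, so a restart followed by a selector change yields a correct diff.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Devices to start and stop targeting after a selector re-evaluation.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct TargetDiff {
    pub added: BTreeSet<String>,
    pub removed: BTreeSet<String>,
}

/// Deployment name -> devices that currently hold a desired-state entry.
#[derive(Default)]
pub struct OwnedTargets {
    by_deployment: BTreeMap<String, BTreeSet<String>>,
}

impl OwnedTargets {
    /// Seeds the map at startup from (device, deployment) pairs found in KV.
    pub fn seed<'a>(&mut self, keys: impl IntoIterator<Item = (&'a str, &'a str)>) {
        for (device, deployment) in keys {
            self.by_deployment
                .entry(deployment.to_string())
                .or_default()
                .insert(device.to_string());
        }
    }

    /// Records the new matched set and returns what changed against the
    /// previous (possibly seeded) set for the same deployment name.
    pub fn on_upsert(&mut self, deployment: &str, matched: BTreeSet<String>) -> TargetDiff {
        let previous = self
            .by_deployment
            .insert(deployment.to_string(), matched.clone())
            .unwrap_or_default();
        TargetDiff {
            added: matched.difference(&previous).cloned().collect(),
            removed: previous.difference(&matched).cloned().collect(),
        }
    }
}
```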

All 24 tests green. End-to-end load test (20 devices / 3 groups /
20s) PASS after the changes.
feat(iot): Chapter 3 — operator helm chart (local, no registry)
Some checks failed
Run Check Script / check (pull_request) Failing after 50s
24b8282b7f
Generates a self-contained helm chart directory from typed Rust
(ADR 018 — Template Hydration). The chart packages:

- Deployment CRD (from Deployment::crd())
- Device CRD (from Device::crd())
- ServiceAccount, ClusterRole, ClusterRoleBinding with the exact
  verbs the operator uses — nothing aspirational
- operator Deployment (image, env NATS_URL + RUST_LOG)

No hand-authored yaml, no Helm templating. Re-run the chart
subcommand to regenerate for different inputs. When a publishable
chart is needed (user-facing `values.yaml`), layer a templating
pass on this output; for the load test the plain chart is enough.

New surface:
- `iot-operator-v0 chart --output <dir> [--image ... --nats-url ...]`
  writes the chart tree and prints its path.
- `iot/iot-operator-v0/Dockerfile` — minimal archlinux:base wrapper
  around the host-built release binary (glibc-ABI match without a
  two-stage Docker build).

load-test.sh: drops the host-side operator spawn entirely. Phase 3
now builds the operator image, sideloads it into k3d via `podman
save | docker load | k3d image import`, generates the chart via
the `chart` subcommand, and `helm upgrade --install` it into the
cluster. `dump_operator_log` pulls `kubectl logs` into the stable
work dir so HOLD=1 + failure-tail hooks keep working.

Two gotchas debugged along the way, preserved in code comments:
- workspace `.dockerignore` excludes `target/`, so the image build
  uses a staged build context under $WORK_DIR/image-ctx.
- `podman build -t foo/bar:tag` stores as
  `localhost/foo/bar:tag`, which k3d image import can't find under
  the original tag. Use `localhost/iot-operator-v0:latest` as the
  canonical image ref end-to-end.

Load-test results (selector architecture, operator in helm-
installed pod, same envelope as the host-side baseline):

| Scale | Duration | Writes | Rate | Errors | CR aggregates |
|-------|---------:|-------:|-----:|-------:|:-------------:|
| 20 devices / 3 CRs | 20s | 400 | 20/s | 0 | 3/3 ok |
| 10k / 1000 CRs | 120s | 1,201,967 | 10,009/s | 0 | 1000/1000 ok |

No operator warnings, no errors across the run. Image build +
sideload + helm install adds ~30s to startup; steady-state
throughput unchanged from host-side.
Two changes with a single motivation — make the iot-agent runtime
robust under multi-user hosts + unblock chaos-testing workflows
on the VM admin user.

1. iot-agent user is no longer --system.
   Rootless podman needs subuid/subgid ranges in /etc/subuid +
   /etc/subgid before layer unpacking. Ubuntu's useradd --system
   deliberately skips those allocations (system users aren't
   expected to run user namespaces), so we were patching the gap
   with a hardcoded "usermod --add-subuids 100000-165535". That
   range collides with any other user on the host that also runs
   rootless containers — a real footgun. Dropping --system lets
   useradd's default allocator pick a non-overlapping range, and
   the whole ensure_subordinate_ids trait method + ansible impl
   goes away as dead code.

2. VmFirstBootConfig.admin_password (Option<String>).
   When set, cloud-init unlocks the account and enables
   ssh_pwauth on the guest — intended for reliability / chaos
   testing sessions where the operator wants to log in and break
   things on purpose. Default is still key-only auth.
   example_iot_vm_setup plumbs a --admin-password flag +
   IOT_VM_ADMIN_PASSWORD env var; smoke-a4 passes them through
   so chaos sessions are one env var away from a ready VM.

3 cloud-init unit tests cover the locked + unlocked + YAML-escape
paths.
refactor(nats): extract typed single-node primitive; NatsBasicScore becomes a thin wrapper
Some checks failed
Run Check Script / check (pull_request) Failing after 54s
a616204b1c
Addresses the review point that NatsBasicScore was introduced as a
parallel NATS path instead of sharing primitives with the rest of
the module. The render logic (Deployment + Service + Namespace for
one NATS server pod) is now pulled into a new `nats::node`
module built on ADR 018 — typed k8s_openapi structs, no helm
templating — and NatsBasicScore is a high-level preset that sets
defaults on a NatsNodeSpec and runs the shared render fns.

Module-level doc on `nats::node` explicitly flags that future
high-level scores (clustered, TLS, gateway) should grow the spec
and reuse the same primitive, and that NatsK8sScore +
NatsSuperclusterScore are scheduled to migrate onto this primitive
in a follow-up so the helm-templating path disappears entirely
from the NATS module.

7 unit tests between node (the primitive) + score_nats_basic (the
wrapper) cover service-type routing + JetStream flag propagation.
Before: the agent published only `device-id=<id>` on DeviceInfo,
which collapsed every Deployment.spec.targetSelector to "target one
device by id" — usable, but not the actual scalability story. The
K8s-Node analogue wants kubelet-declared node labels driving
DaemonSet nodeSelector; we were missing the equivalent.

After: a new `[labels]` section in the agent's TOML config, set by
IotDeviceSetupScore and plumbed through to every DeviceInfo
publish. Config labels merge with the default `device-id` on
startup. Re-running the Score with a changed label map regenerates
the TOML, triggers the byte-compare idempotency path, restarts the
agent; new labels propagate into Device.metadata.labels and
Deployment selectors re-resolve on the operator side. Manual toml
edits + `systemctl restart iot-agent` is the break-glass path.
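
A sketch of why the byte-compare path stays stable: rendering a `[labels]` section from a BTreeMap always emits keys in sorted order, so the same label map produces identical bytes on every run. The function names and the decision to quote every key are assumptions; the real render_toml may differ.

```rust
use std::collections::BTreeMap;

/// Renders `s` as a TOML basic (double-quoted) string with escapes.
fn toml_basic_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if c.is_control() => out.push_str(&format!("\\u{:04X}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Renders the [labels] table. BTreeMap iteration is sorted by key,
/// so insertion order never affects the output bytes.
pub fn render_labels_section(labels: &BTreeMap<String, String>) -> String {
    let mut out = String::from("[labels]\n");
    for (key, value) in labels {
        out.push_str(&format!(
            "{} = {}\n",
            toml_basic_string(key),
            toml_basic_string(value)
        ));
    }
    out
}
```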

Scope:
- iot/iot-agent-v0/src/config.rs: `labels: BTreeMap<String,String>`
  on AgentConfig, defaults to empty via #[serde(default)]. Two
  parse tests cover the "section present" + "section absent"
  cases.
- iot/iot-agent-v0/src/main.rs: merge cfg.labels with the default
  `device-id` entry before DeviceInfo publish. Config wins on
  key conflicts — unusual but legal.
- harmony/src/modules/iot/setup_score.rs: IotDeviceSetupConfig
  gains `labels: BTreeMap<String,String>` (replacing the
  dedicated `group` field — group is just a conventional label
  now, not a distinct axis). render_toml renders a [labels]
  section; BTreeMap iteration guarantees sorted output so the
  Score's byte-compare change detection stays idempotent. Three
  unit tests: section content, byte-identical rendering across
  runs, value escaping.
- examples/iot_vm_setup/src/main.rs: `--labels key=val,key=val`
  with a parser that errors on malformed chunks, empty keys/values,
  or an empty map (a device with no labels is practically
  untargetable, better to fail at the CLI than onboard a ghost).
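
The parsing and merge rules in the scope list can be sketched as follows. Function names and error strings are illustrative assumptions; the behavior mirrors what the list states: malformed chunks, empty keys or values, and an empty map are errors, and config labels win over the default `device-id`.

```rust
use std::collections::BTreeMap;

/// Parses `key=val,key=val` into a label map, rejecting bad input.
pub fn parse_labels(raw: &str) -> Result<BTreeMap<String, String>, String> {
    if raw.trim().is_empty() {
        return Err("at least one label is required".to_string());
    }
    let mut labels = BTreeMap::new();
    for chunk in raw.split(',') {
        let (key, value) = chunk
            .split_once('=')
            .ok_or_else(|| format!("malformed label `{chunk}`, expected key=value"))?;
        let (key, value) = (key.trim(), value.trim());
        if key.is_empty() || value.is_empty() {
            return Err(format!("empty key or value in `{chunk}`"));
        }
        labels.insert(key.to_string(), value.to_string());
    }
    Ok(labels)
}

/// Merges the default `device-id` label with config labels.
/// Config entries overwrite the default on key conflicts.
pub fn effective_labels(
    device_id: &str,
    config: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut labels = BTreeMap::new();
    labels.insert("device-id".to_string(), device_id.to_string());
    labels.extend(config.iter().map(|(k, v)| (k.clone(), v.clone())));
    labels
}
```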

Live label changes require an agent restart (same as kubelet's
--node-labels on a running Node). Edit-labels-on-running-fleet
is a later chapter; for v0 the restart cost is negligible.

Tests: 7 iot-agent + 3 iot setup_score + existing operator/
contracts suite — all green.
Extends HelmResourceKind with typed variants for Namespace,
ServiceAccount, ClusterRole, ClusterRoleBinding, and
CustomResourceDefinition. Previously only Service + Deployment
had typed variants; everything else went through the
`from_serializable`/`CustomYaml` escape hatch.

The escape hatch stays (documented as "always prefer a typed
variant") for forward-compat with types we haven't imported yet.
Any consumer currently using `from_serializable` for one of the
new typed variants can switch; serialization output is byte-
equivalent (both paths route through serde_yaml on the same
k8s_openapi struct).

Motivation: every Rust operator built on harmony wants the same
five resources — Namespace, SA, ClusterRole, ClusterRoleBinding,
CRD — to be chart-template-ready. Typing them once here means
every operator's chart.rs stays short and IDE-discoverable
instead of a string-of-from_serializable-calls.

Filenames carry the resource name where applicable
(serviceaccount-<name>.yaml, clusterrole-<name>.yaml, etc.) so
charts with multiple ClusterRoles don't collide on a single
`clusterrole.yaml` file.
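
The filename rule above, sketched with a hypothetical enum. `ChartResource` is not the real HelmResourceKind; the sketch assumes all five kinds carry a name and follow the `<kind>-<name>.yaml` pattern, although the commit only says "where applicable".

```rust
/// Simplified stand-in for the typed chart resource variants.
pub enum ChartResource {
    Namespace(String),
    ServiceAccount(String),
    ClusterRole(String),
    ClusterRoleBinding(String),
    Crd(String),
}

impl ChartResource {
    /// File name inside the chart's templates directory: kind prefix
    /// plus resource name, so two resources of one kind never collide.
    pub fn filename(&self) -> String {
        let (kind, name) = match self {
            ChartResource::Namespace(n) => ("namespace", n),
            ChartResource::ServiceAccount(n) => ("serviceaccount", n),
            ChartResource::ClusterRole(n) => ("clusterrole", n),
            ChartResource::ClusterRoleBinding(n) => ("clusterrolebinding", n),
            ChartResource::Crd(n) => ("crd", n),
        };
        format!("{kind}-{name}.yaml")
    }
}
```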

2 unit tests: unique-filename invariant across the five typed
variants, and crd-name round-trip.
feat(iot/chart): typed variants + CRD-keep + Pod security context
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
61d3a6b757
Three production-path improvements bundled into one chart change,
all verified end-to-end (helm lint + load-test pass):

1. Switch from `HelmResourceKind::from_serializable(...)` to the
   typed `HelmResourceKind::{Namespace, ServiceAccount, ClusterRole,
   ClusterRoleBinding, Crd}` variants added to the shared harmony
   helm module. Serialization output is byte-equivalent; IDE
   discoverability + type-safety go up.

2. Annotate both CRDs with `helm.sh/resource-policy: keep`. Without
   this, `helm uninstall iot-operator-v0` cascade-deletes the CRDs;
   the kube GC then deletes every Deployment CR and every Device CR;
   the operator finalizer fires on each deletion and wipes the
   `desired-state` KV; agents tear down every container. One typo
   on uninstall would be a fleet-wide catastrophe. `keep` makes
   uninstall data-preserving and idempotent — wipe requires an
   explicit `kubectl delete crd …`.

3. Lock down the operator Pod's securityContext:
   - `runAsNonRoot: true`
   - `readOnlyRootFilesystem: true`
   - `allowPrivilegeEscalation: false`
   - `capabilities: drop [ALL]`
   - `seccompProfile: RuntimeDefault`
   Deliberately *no* `runAsUser` — OpenShift's `restricted-v2` SCC
   assigns namespace-specific UIDs and rejects fixed ones. The
   image's `USER 65532:65532` (Dockerfile) gives vanilla k8s a
   non-root UID; OpenShift's SCC overrides with its own. Same chart
   works on both without custom SCC bindings.

Dockerfile adds `USER 65532:65532` — required for vanilla k8s to
accept `runAsNonRoot: true` without a Pod-level `runAsUser`. 65532
is the distroless/chainguard `nonroot` convention; arbitrary but
safe (no overlap with common system UIDs).

Tests: 2 chart unit tests locking in the keep annotation + SC
shape. End-to-end load test at 20 devices / 3 CRs: pod comes up
clean under the restricted SC, all aggregates correct, zero
operator warnings.
Addresses the review point that NatsBasicScore was a parallel
typed-k8s_openapi path — reinventing probes, resource shapes, pod
anti-affinity, JetStream storage — instead of reusing what
NatsK8sScore already does via the upstream nats/nats helm chart.
Every shape the project will ever ship (supercluster, single node,
TLS, gateway, leaf nodes) is expressible as values on that chart.
Parallel resource construction was churn waiting to diverge.

The shape now:

  HelmChartScore              [existing helm-install primitive]
      ▲
      │ pins chart + repo
      │
  NatsHelmChartScore (new)    [exposes values_yaml only]
      ▲                ▲
      │                │
  NatsBasicScore   NatsK8sScore
   (single node)   (supercluster + TLS + gateways)

Changes:

- Delete harmony/src/modules/nats/node.rs (279 lines of typed
  k8s_openapi Deployment/Service/Namespace — gone).

- New harmony/src/modules/nats/helm_chart.rs: NatsHelmChartScore
  pins chart_name = "nats/nats" and its official repository;
  values_yaml is the only varying input. Implements Score<T> for
  any topology with HelmCommand; caller hands it to
  K8sBareTopology / HAClusterTopology / K8sAnywhereTopology.

- Rewrite score_nats_basic.rs as a thin preset: build a minimal
  single-node values_yaml (fullnameOverride, replicaCount=1,
  cluster.enabled=false, jetstream on/off, service type via the
  chart's `service.merge.spec.type` knob, optional image
  override). 10 unit tests on render_values covering every
  builder combination + image-ref splitting. Score bound moves
  from `T: K8sclient` to `T: HelmCommand` since installation is
  now helm-based.

- score_nats_k8s.rs: last step in deploy_nats switches from a
  hand-constructed HelmChartScore to NatsHelmChartScore::new(...).
  Supercluster values_yaml construction untouched — a supercluster
  is just a more elaborate values file against the same chart.

- bare_topology.rs: add `impl HelmCommand for K8sBareTopology`
  so the in-load-test flow (K8sBareTopology → NatsBasicScore →
  NatsHelmChartScore → HelmChartScore) compiles. Returns a bare
  `helm` command; KUBECONFIG resolution mirrors how HAClusterTopology
  does it.

- mod.rs: export NatsHelmChartScore + the re-shaped NatsServiceType.

- load-test.sh: the nats/nats chart provisions a StatefulSet, not
  a Deployment. Wait on `pod -l app.kubernetes.io/name=nats`
  instead of `deployment/iot-nats` — works across workload kinds.
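
The image-ref splitting mentioned for the optional image override could look like the sketch below. The function name and return shape are assumptions; the one subtle case it illustrates is that a colon before the last `/` is a registry port, not a tag separator. Digest references (`@sha256:...`) are returned untouched in this sketch.

```rust
/// Splits "repo:tag" into (repository, Some(tag)).
/// Returns (image, None) when there is no tag or the ref uses a digest.
pub fn split_image_ref(image: &str) -> (&str, Option<&str>) {
    // Digest references are left as-is (assumed out of scope here).
    if image.contains('@') {
        return (image, None);
    }
    match image.rfind(':') {
        // A ':' only separates a tag if nothing after it looks like a
        // path segment; "localhost:5000/nats" has a port, not a tag.
        Some(idx) if idx + 1 < image.len() && !image[idx + 1..].contains('/') => {
            (&image[..idx], Some(&image[idx + 1..]))
        }
        _ => (image, None),
    }
}
```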

Tests:
- 2 helm_chart unit tests (chart+repo pinning, default install-
  upgrade semantics)
- 10 score_nats_basic unit tests covering every values shape
- Full load-test.sh e2e (20 devices / 3 CRs / 20s): PASS.
refactor(examples): rename iot_apply_deployment → harmony_apply_deployment
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
61cdb9c326
Addresses the review point that the applier CLI was anchored in IoT
vocabulary, but the CRD it applies is a generic declarative-
reconcile intent that works for Pi podman today and OKD / KVM /
anything-reconcilable tomorrow. The name now reflects what it
actually does.

Mechanical rename: crate, binary, `PatchParams::apply(...)` field
manager, doc comments, every reference in smoke-a4.sh, the
v0_1_plan.md Chapter 1 section, and the example itself. The CRD
types + paths + operator name are *not* touched by this commit —
that's the broader rebrand, planned for a dedicated branch.

- examples/iot_apply_deployment/ → examples/harmony_apply_deployment/
- crate name: example_iot_apply_deployment → example_harmony_apply_deployment
- binary name: iot_apply_deployment → harmony_apply_deployment
- PatchParams field manager: "iot-apply-deployment" → "harmony-apply-deployment"

0 stragglers: `grep example_iot_apply_deployment` across the tree
returns empty.
refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-*
All checks were successful
Run Check Script / check (pull_request) Successful in 2m25s
7c1fedb303
The IoT vocabulary was anchoring the codebase to one customer's
domain. The reconciler pattern is generic — operator in k8s, NATS
KV as desired-state bus, agents reconciling podman / OKD / KVM /
anything that can register. "Fleet" captures that neutrally; IoT
stays acknowledged in docs as the first customer use case.

Done now, while nothing is deployed. After a partner fleet lands,
changing the CRD group alone is a multi-quarter migration.

Scope (nothing left over):

Paths + crates
- iot/ → fleet/
- iot/iot-operator-v0 → fleet/harmony-fleet-operator
- iot/iot-agent-v0 → fleet/harmony-fleet-agent
- harmony/src/modules/iot → harmony/src/modules/fleet
- ROADMAP/iot_platform → ROADMAP/fleet_platform
- examples/iot_{vm_setup, load_test, nats_install} → examples/fleet_*
- -v0 suffix dropped on the operator + agent crates (semver in
  Cargo.toml already tracks version)

Rust identifiers
- enum IotScore (podman score payload) → ReconcileScore
- struct IotDeviceSetupScore/Config → FleetDeviceSetupScore/Config
- InterpretName::IotDeviceSetup → InterpretName::FleetDeviceSetup
- HarmonyIotPool → HarmonyFleetPool (libvirt pool)
- HARMONY_IOT_POOL_NAME (default "harmony-iot") → HARMONY_FLEET_POOL_NAME ("harmony-fleet")
- IotSshKeypair → FleetSshKeypair
- ensure_iot_ssh_keypair / ensure_harmony_iot_pool /
  check_iot_smoke_preflight_for_arch → fleet-prefixed variants

Wire / config surfaces
- CRD group `iot.nationtech.io` → `fleet.nationtech.io`
- Finalizer `iot.nationtech.io/finalizer` → `fleet.nationtech.io/finalizer`
- Shortnames iotdep/iotdevice → fleetdep/fleetdev
- Env var IOT_AGENT_CONFIG → FLEET_AGENT_CONFIG
- Env var IOT_VM_ADMIN_PASSWORD → FLEET_VM_ADMIN_PASSWORD
- Binary /usr/local/bin/iot-agent → /usr/local/bin/fleet-agent
- Systemd user `iot-agent` → `fleet-agent`
- VM admin user `iot-admin` → `fleet-admin`

Defaults
- Namespaces iot-system/iot-demo/iot-load → fleet-system/fleet-demo/fleet-load
- Helm release iot-nats → fleet-nats
- Helm release iot-operator-v0 → harmony-fleet-operator
- Container image localhost/iot-operator-v0:latest →
  localhost/harmony-fleet-operator:latest
- On-disk cache $HARMONY_DATA_DIR/iot/ → $HARMONY_DATA_DIR/fleet/
  (cloud-images, ssh keypairs, libvirt pool)

What stayed
- harmony-reconciler-contracts — already neutrally named
- Wire types (DeviceInfo, DeploymentState, HeartbeatPayload,
  DeploymentName) — already neutral
- KV buckets (device-info, device-state, device-heartbeat,
  desired-state) — already neutral
- CRD kind names (Deployment, Device) — already neutral
- NatsBasicScore / NatsHelmChartScore / HelmChart / etc. —
  framework-scope, unchanged

Verification
- cargo check --workspace --all-targets: clean
- All harmony lib tests (114), fleet-operator (6), fleet-agent
  (7), harmony-reconciler-contracts (13): green
- End-to-end load-test (20 devices / 3 CRs / 20s under
  fleet/scripts/load-test.sh): PASS. Image built as
  localhost/harmony-fleet-operator:latest, chart installed as
  release harmony-fleet-operator in namespace fleet-system,
  all CR aggregates correct.

Zero stragglers: grep across the tree for \biot\b / IOT_ /
\bIot[A-Z] returns empty (excluding docs explicitly talking about
IoT as the first customer's domain).
Reviewed-on: #276
Merge pull request 'feat/iot-helm' (#275) from feat/iot-helm into feat/iot-walking-skeleton
All checks were successful
Run Check Script / check (pull_request) Successful in 2m10s
01d2cfa0ba
Reviewed-on: #275
stremblay added 15 commits 2026-05-04 17:28:47 +00:00
systemctl --user enable --now is systemd-level idempotent, but the
prior implementation always returned ChangeReport::CHANGED. This made
every re-run of any score that touches a user-scoped unit (notably
FleetDeviceSetupScore's podman.socket step) lie about its change
count, defeating the noop detection the rest of the score honors.

Probe is-enabled --quiet && is-active --quiet first; only call
enable --now (and report CHANGED) when the unit isn't already in the
desired state. Mirrors the existing ensure_linger pattern in the
same file.
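
A sketch of the probe-then-enable flow described above, with the command runner injected as a closure so the logic is testable without SSH. The `ChangeReport` enum here is a simplified stand-in for harmony's type, and the closure contract (Ok(true) means exit status 0) is an assumption.

```rust
/// Simplified stand-in for the change report returned by a step.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeReport {
    Noop,
    Changed,
}

/// Enables and starts a user unit only if it is not already enabled
/// and active. `run` executes a shell command and returns Ok(true)
/// when it exits with status 0.
pub fn ensure_user_unit_active<F>(unit: &str, mut run: F) -> Result<ChangeReport, String>
where
    F: FnMut(&str) -> Result<bool, String>,
{
    let probe = format!(
        "systemctl --user is-enabled --quiet {unit} && systemctl --user is-active --quiet {unit}"
    );
    if run(&probe)? {
        // Already in the desired state: report no change.
        return Ok(ChangeReport::Noop);
    }
    let enable = format!("systemctl --user enable --now {unit}");
    if run(&enable)? {
        Ok(ChangeReport::Changed)
    } else {
        Err(format!("failed to enable {unit}"))
    }
}
```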

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sibling of fleet_vm_setup with the libvirt provisioning step removed:
the operator has already booted Pi OS Lite themselves (rpi-imager,
preloaded SSH key, passwordless sudo on the admin user), so the
example goes straight to applying FleetDeviceSetupScore over SSH.

Defaults match the typical rpi-imager flow (--pi-user pi,
--ssh-key ~/.ssh/id_ed25519); --ssh-key supports tilde expansion.
The harmony dep is pulled in without the kvm feature since no VM is
created here. RUST_LOG defaults to info so the score's per-step
traces show up without the operator having to set the env var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When stdout already parses into UNREACHABLE!/FAILED! + msg, the
trailing (ansible-exit=..., stderr=..., stdout=...) envelope just
duplicated the same text. Strip it when stderr is empty and the
verb is recognized; keep it when it adds debug signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the opaque change-log with tagged per-step info traces and
a human-readable Outcome.details recap (Device ID / NATS / Labels /
User / Agent binary -> remote / Service). User and Service lines
carry their own state markers (🔄 being one of them); the final line
uses a distinct marker for noop and 🎉 for runs that applied changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the bespoke framed renderer, failure hint catalog, and custom
env_logger setup. Score output now flows through harmony_cli's
standard reporter (bullet list under "🚀 All done!"), matching the
other examples. cli_logger::init() at the top of main so early
logs (ensure_ansible_venv) get the same formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cli_reporter only accumulated details for SUCCESS, dropping the
recap on idempotent re-runs that legitimately return NOOP with
populated details. FleetDeviceSetupScore is the first score to
exercise this path; the filter was over-restrictive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the "-> /usr/local/bin/fleet-agent" continuation into the
"Agent binary:" line. Removes the hardcoded-indent fragility (bullet
prefix shifts in cli_reporter would have broken alignment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors FileDelivery in the opposite direction: returns Some(content)
or None if the file doesn't exist. AnsibleHostConfigurator implements
it via two SSH calls (sudo test -e + sudo cat), routed through sudo
to handle root- or service-owned config files. Added to the
LinuxHostConfiguration umbrella so any score with that bound gets it.

Enables scores to pre-flight-compare desired state against current
state before committing to a destructive change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New first step (1/7): read /etc/fleet-agent/config.toml off the
device and compare against the rendered desired config. Three
branches:

  - missing  → info, first install
  - matches  → warn, converge anyway
  - differs  → warn + unified diff (similar::TextDiff with 2-line
    context radius, '-/+' marker style) + inquire::Confirm prompt
    defaulting to N. Aborts with InterpretError if declined.

Existing 6 steps renumbered to 2/7-7/7. The diff replaces the
previous "dump both full configs" approach which was unreadable
even for one-line differences.
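
The three branches above can be modeled as a small classification step. This is a std-only sketch with hypothetical names; the line-set comparison stands in for the real unified diff (which the commit produces with similar::TextDiff), and `confirmed` stands in for the interactive prompt result.

```rust
/// Outcome of comparing the on-device config with the desired one.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigPreflight {
    /// No config on the device yet.
    FirstInstall,
    /// Device config is byte-identical to the desired config.
    Unchanged,
    /// Configs differ; carries a naive per-line summary.
    Differs {
        removed: Vec<String>,
        added: Vec<String>,
    },
}

/// Classifies the fetched config (None = file missing) against `desired`.
pub fn classify_config(current: Option<&str>, desired: &str) -> ConfigPreflight {
    match current {
        None => ConfigPreflight::FirstInstall,
        Some(c) if c == desired => ConfigPreflight::Unchanged,
        Some(c) => ConfigPreflight::Differs {
            // Lines present on the device but not in the desired config.
            removed: c
                .lines()
                .filter(|l| !desired.lines().any(|d| d == *l))
                .map(str::to_string)
                .collect(),
            // Lines in the desired config that the device lacks.
            added: desired
                .lines()
                .filter(|l| !c.lines().any(|x| x == *l))
                .map(str::to_string)
                .collect(),
        },
    }
}

/// Only a differing config needs explicit confirmation to proceed.
pub fn should_proceed(preflight: &ConfigPreflight, confirmed: bool) -> bool {
    match preflight {
        ConfigPreflight::Differs { .. } => confirmed,
        _ => true,
    }
}
```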

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sudo password for a Linux bootstrap admin user. Stored under key
"SudoPassword" via SecretManager when a host doesn't have
passwordless sudo configured. Same shape as the other single-field
Secret types in this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets callers populate creds.sudo_password when the bootstrap admin
doesn't have passwordless sudo. None = current behavior unchanged.

Wire-level injection:
- ansible runs: when Some, write to a tempfile::NamedTempFile and
  pass ANSIBLE_BECOME_PASSWORD_FILE=<path> via Command::env. Path
  in env, never value in argv. File deletes on drop.
- direct ssh_exec sudo paths (ensure_linger, ensure_user_unit_active,
  fetch_file): new sudo_exec helper that uses `sudo -S` with the
  password piped via the new ssh_exec stdin parameter, otherwise
  plain sudo. ensure_user_unit_active's && chain folded into one
  sudo+sh -c call since `sudo -S` only reads stdin once.

ssh_executor.rs: ssh_exec gains an optional stdin: Option<&str>; on
Some, writes via channel.data() then channel.eof() so the remote
reader doesn't hang. Existing 4 call sites pass None.
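
A sketch of how a sudo_exec-style helper might build its invocation: the whole script is wrapped in one `sh -c` so `sudo -S` reads the password from stdin exactly once, and the password travels on stdin rather than in argv. The struct, function names, and quoting helper are illustrative assumptions, not the actual harmony code.

```rust
/// Command line to run remotely plus optional data to write to stdin.
#[derive(Debug)]
pub struct SudoInvocation {
    pub command: String,
    pub stdin: Option<String>,
}

/// Quotes `s` for POSIX sh by wrapping it in single quotes and
/// rewriting each embedded single quote as '\''.
fn sh_single_quote(s: &str) -> String {
    format!("'{}'", s.replace('\'', "'\\''"))
}

/// Builds a single sudo call for `script`. With a password, uses
/// `sudo -S` and supplies the password (newline-terminated) on stdin;
/// without one, falls back to plain sudo.
pub fn sudo_exec_invocation(script: &str, password: Option<&str>) -> SudoInvocation {
    let wrapped = sh_single_quote(script);
    match password {
        Some(pw) => SudoInvocation {
            command: format!("sudo -S sh -c {wrapped}"),
            stdin: Some(format!("{pw}\n")),
        },
        None => SudoInvocation {
            command: format!("sudo sh -c {wrapped}"),
            stdin: None,
        },
    }
}
```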

fleet_vm_setup updated to set sudo_password: None (behavior
identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probe `sudo -n true` over SSH before constructing the topology. If
the probe succeeds (passwordless sudo, the typical rpi-imager
default), proceed silently. If it fails, fetch the password through
SecretManager::get_or_prompt::<SudoPassword>() — first run prompts
the operator, subsequent runs reuse the cached value (same flow
SshKeyPair etc. use).

Adds harmony_secret dep, env.sh with the standard
HARMONY_SECRET_NAMESPACE / HARMONY_SECRET_STORE / HARMONY_DATABASE_URL
/ RUST_LOG variables, and a doc snippet at the top of main.rs
pointing at it.
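
The probe-then-prompt decision can be sketched as below. `resolve_sudo_password` is a hypothetical helper; the closure stands in for the `SecretManager::get_or_prompt::<SudoPassword>()` call, whose real signature is not shown here.

```rust
/// Decide which sudo password (if any) to put into the SSH credentials.
/// `passwordless_sudo_works` is the result of probing `sudo -n true`.
/// `fetch_password` is only invoked when the probe failed.
pub fn resolve_sudo_password<E>(
    passwordless_sudo_works: bool,
    fetch_password: impl FnOnce() -> Result<String, E>,
) -> Result<Option<String>, E> {
    if passwordless_sudo_works {
        // Typical rpi-imager default: nothing to ask the operator.
        Ok(None)
    } else {
        fetch_password().map(Some)
    }
}
```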

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new sudo_password field is strictly for privilege escalation on
the remote host (sudo -S, ansible become) — not for SSH login. SSH
auth is still key-only. Adds a TODO on SshCredentials pointing at
where SSH password support would land if/when we want it, and a
matching note on the SudoPassword Secret type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat: add little script to call the fleet_rpi_setup example
Some checks failed
Run Check Script / check (pull_request) Failing after -44h57m30s
b86f8f11f9
Merge pull request 'feat/prepare-rpi' (#280) from feat/prepare-rpi into feat/iot-walking-skeleton
Some checks failed
Run Check Script / check (pull_request) Failing after -44h57m29s
ebd199b22e
Reviewed-on: #280
johnride added 38 commits 2026-05-05 13:46:16 +00:00
- nats-jwt crate: JWT builder types for user claims, authorization
  request/response, account claims, algorithm encode/decode
- harmony-nats-callout crate: Zitadel OIDC JWT validator, callout
  service scaffold, account manager (WIP)
- integration-test-callout: end-to-end test validating the full
  auth callout flow — device connects with Zitadel JWT → callout
  validates JWT → returns per-device user JWT with scoped
  permissions → device can pub/sub on its own subjects only
- Mock OIDC server for test (JWKS + openid-configuration)
- Negative test: device A cannot subscribe to device B's subjects
- Added UserClaimsBuilder::audience() for account-scoped user JWTs
- Remove operator-mode files: account_manager, authorizer, service, config,
  main.rs, plan.md from callout crate
- Remove operator/activation claims from nats-jwt (builder and claims)
- Inline PermissionsConfig into permissions.rs (config.rs removed)
- Remove harmony-nats-callout dep from integration test (unused)
- Remove unused imports in algorithm.rs tests
- Clean up callout Cargo.toml (remove bin, unused deps)
nats-jwt:
- Add NkeyPub newtype with prefix validation
- Add ClaimType and Algorithm typed enums
- Add impl_nats_claims! macro eliminating 4x duplicated impl blocks
- Add AuthorizationRequestClaimsBuilder (completing all builder types)
- Fix AuthorizationResponseBuilder: add issuer() builder method, stop
  mutating iss in sign()
- Tighten trait bounds: encode<T: Serialize>, decode_unverified<T:
  DeserializeOwned>
- Remove dead error variants Expired/NotYetValid
- Add builder tests for all 4 claims types
- Deduplicate is_zero helper

harmony-nats-callout (rewritten):
- AuthCalloutService: production service connecting to NATS, subscribing
  to .REQ.USER.AUTH, dispatching auth requests
- AuthCalloutConfig with builder pattern
- handler.rs: pure auth request handler (decode → validate → mint →
  respond) extracted from test
- Fix ZitadelValidator: validate() is now async (was blocking_read
  deadlock in async contexts)
- Remove dead fields kid_map, jwks_uri
- Make danger_accept_invalid_certs configurable
- permissions: InterpolatedPermissions named struct instead of 4-tuple

integration-test-callout:
- Converted to lib+test crate: src/lib.rs exports test utilities
- Tests now exercise the REAL AuthCalloutService (not inline handler)
- Extracted MockOidcServer, NatsServer, CalloutContext into library
- Replace yasna with rsa crate for DER parsing
- Add Drop to NatsServer for container cleanup
- Add module constants for all magic values
- README updated with new architecture diagram
feat: default for ubuntu aws linux topology
Some checks failed
Run Check Script / check (pull_request) Failing after 12m51s
7fa1ca2683
Helm releases without a pinned `chart_version` previously short-circuited
to a NOOP when already installed, which silently dropped any
`values_yaml` / `values_overrides` changes the caller had made. Now we
fall through to `helm upgrade --install` whenever:

- the release isn't installed (unchanged), or
- it's installed and either unpinned or pinned-and-matching.

Helm itself becomes the source of truth for "did anything actually
change" — no-op upgrades are cheap and changed values get applied
automatically without the caller having to opt in via a flag.

`install_only=true` keeps the prior skip-if-installed shortcut so
bootstrap operators (cert-manager, prometheus-operator, CRDs) that
should not be touched on re-runs continue to behave the same.

Pinned-version safety net is unchanged: a different version installed
than what the score requests is an error, never a silent change.
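
As a rough sketch of the decision table above (not the actual implementation): `HelmAction` and `plan_helm_action` are hypothetical names, and checking the pinned-version mismatch before the `install_only` shortcut is an assumption, since the ordering of those two checks is not spelled out here.

```rust
#[derive(Debug, PartialEq)]
pub enum HelmAction {
    /// Run `helm upgrade --install` and let Helm decide whether anything changed.
    UpgradeInstall,
    /// Leave the installed release untouched.
    Skip,
    /// Installed version differs from the pinned one: an error, never a silent change.
    VersionMismatch { installed: String, requested: String },
}

pub fn plan_helm_action(
    installed_version: Option<&str>,
    pinned_version: Option<&str>,
    install_only: bool,
) -> HelmAction {
    // Not installed yet: always install.
    let Some(installed) = installed_version else {
        return HelmAction::UpgradeInstall;
    };
    // Assumption: the pinned-version safety net is evaluated first.
    if let Some(requested) = pinned_version {
        if requested != installed {
            return HelmAction::VersionMismatch {
                installed: installed.to_string(),
                requested: requested.to_string(),
            };
        }
    }
    if install_only {
        HelmAction::Skip
    } else {
        // Unpinned, or pinned and matching: fall through to upgrade.
        HelmAction::UpgradeInstall
    }
}
```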
ZitadelScore:
- Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's
  FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on
  this secret existing; without the chart values, the prior code path
  was non-functional.
- New `external_port: Option<u32>` field. Controls Zitadel's emitted
  issuer URL when the host port mapping isn't 80/443 (k3d typically
  maps 8080:80). Without it, JWT-bearer audience validation 500s with
  `Errors.Internal` because the assertion's `aud` doesn't match the
  chart-default issuer at port 80.

ZitadelSetupScore is extended for the JWT-bearer flow needed by the
NATS auth callout:
- API apps (resource servers — required for project-id audience scope)
- Project roles (`POST .../projects/{id}/roles`, idempotent)
- Machine users with KEY_TYPE_JSON keys (provisioned + cached
  device-side; Zitadel does not expose the key material on subsequent
  reads, so the local cache is the source of truth)
- User grants (project + role keys)

Cache (ZitadelClientConfig) gains projects, machine_user_ids,
machine_keys, and user_grants — keyed for idempotency across re-runs.
Backwards compatible with existing harmony_sso example: the new fields
have `#[serde(default)]` and prior callers just need empty vecs.

The upgrade-by-default change to Helm releases (separate commit) lets
ExternalPort changes propagate to existing releases on re-run.
harmony-nats-callout becomes a deployable service, not just a library:
- New [[bin]] target with env+secret-file driven config and
  SIGINT/SIGTERM-aware shutdown.
- Dockerfile (single-stage archlinux:base, non-root, matches
  harmony-fleet-operator convention).
- Refactored handler into a pure `decide()` function so the entire
  authorization decision tree is unit-testable without async-nats.
- New `roles` module with role resolution + a `validate_device_id`
  security gate that rejects NATS subject metacharacters in device_id
  (.>* whitespace) — closes a real escalation path through the
  `{device_id}` placeholder in the per-device permissions block.
- Configurable role claim path + admin/device role names; admin wins
  when both are present (privilege-escalation invariant).
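
The two security rules in the bullets above could look roughly like this. It is a sketch under stated assumptions: the function signatures and the `Role` enum are hypothetical, and rejecting an empty device_id is an extra assumption not stated in this commit message.

```rust
#[derive(Debug, PartialEq)]
pub enum Role {
    Admin,
    Device,
}

/// Reject NATS subject metacharacters so a device_id cannot widen the
/// `{device_id}` placeholder into a wildcard or an extra subject token.
pub fn validate_device_id(id: &str) -> Result<&str, String> {
    if id.is_empty() {
        // Assumption: an empty id is also rejected.
        return Err("device_id is empty".to_string());
    }
    let forbidden = |c: &char| matches!(c, '.' | '>' | '*') || c.is_whitespace();
    if let Some(bad) = id.chars().find(forbidden) {
        return Err(format!("device_id contains forbidden character {bad:?}"));
    }
    Ok(id)
}

/// Resolve the effective role from the roles found in the token.
/// Admin wins when both role names are present.
pub fn resolve_role(claimed: &[&str], admin_role: &str, device_role: &str) -> Option<Role> {
    if claimed.contains(&admin_role) {
        Some(Role::Admin)
    } else if claimed.contains(&device_role) {
        Some(Role::Device)
    } else {
        None
    }
}
```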

57 unit tests cover every reachable branch of the security decision
tree; 4 e2e tests in nats/integration-test-callout exercise real NATS
in podman with: device pubsub on own subjects, cross-device subject
isolation, admin-can-read-anything, and JWT-without-role rejection.

harmony/src/modules/nats_auth_callout/:
- New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment +
  Secret. fsGroup + 0o440 secret mode so the non-root container can
  read its mounted seed/password without leaving them in env vars.
- `render_auth_callout_block` helper produces the YAML for NATS Helm
  `config.merge.authorization.auth_callout` so both halves stay in
  sync.

examples/fleet_auth_callout/:
- `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres ->
  CoreDNS rewrite -> project + roles + machine users with JWT keys
  -> NATS Helm with auth_callout block -> callout image build +
  sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs
  (issuer NKey persisted in a K8s secret so user JWTs survive
  restarts).
- `mint_access_token()` RFC 7523 JWT-bearer client. Uses Host header
  with port so Zitadel emits a matching issuer.
- main.rs prints URLs/creds/keyIds and waits for Ctrl-C.
- Three #[tokio::test] functions sharing one cluster via OnceCell:
  admin_can_read_any_device_subject, device_can_only_access_own_subjects,
  unknown_role_is_rejected. All green on real k3d.
The IoT walking-skeleton's PodmanV0Score and the underlying
ContainerSpec capability were name+image+ports only. Real customer
workloads (the demo target's docker-compose for example) need at
minimum:

- Environment variables for runtime config + secrets injected at
  deploy time.
- Bind-mount volumes so the container can persist data across
  recreates (sqlite db files, config dirs).
- Restart policy so the container survives device reboot or crash.

PodmanService and ContainerSpec gain `env: Vec<(String, String)>`,
`volumes: Vec<VolumeMount>`, and `restart_policy: RestartPolicy`. All
three default to empty / `unless-stopped` via #[serde(default)] so any
Deployment CR written before this change still deserializes — that
includes the existing smoke harnesses and any field-side state.

VolumeMount is bind-only in v0 (host_path -> container_path, optional
read_only). Named/anonymous volumes can be added behind the same field
later by inspecting host_path's shape; the customer's compose file is
expected to use bind mounts only.

RestartPolicy mirrors podman/docker convention — `no`,
`unless-stopped` (default, matching docker-compose), `on-failure`,
`always`. Serialized kebab-case so docker-compose translation is
mechanical.

PodmanTopology::ensure_service_running now passes env / mounts /
restart policy to the podman API. matches_spec conservatively forces
recreate whenever the spec carries non-empty env / volumes or a non-
default restart policy: the podman list endpoint doesn't surface those
fields, so a structural compare isn't possible from ListContainer
alone. Recreating an unchanged container is cheap (~hundreds of ms);
the alternative (silent stale-config window) isn't acceptable for
fleet-managed devices.
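
The conservative recreate rule can be illustrated with a simplified sketch. The types below are stand-ins: the real `ContainerSpec` also carries ports and uses `VolumeMount`, and `must_recreate` is a hypothetical name covering only the "force recreate" part of matches_spec.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RestartPolicy {
    No,
    #[default]
    UnlessStopped,
    OnFailure,
    Always,
}

impl RestartPolicy {
    /// Kebab-case name, matching the podman/docker convention.
    pub fn as_kebab(self) -> &'static str {
        match self {
            RestartPolicy::No => "no",
            RestartPolicy::UnlessStopped => "unless-stopped",
            RestartPolicy::OnFailure => "on-failure",
            RestartPolicy::Always => "always",
        }
    }
}

/// Simplified stand-in for the real spec (volumes reduced to path pairs).
#[derive(Debug, Default)]
pub struct ContainerSpec {
    pub name: String,
    pub image: String,
    pub env: Vec<(String, String)>,
    pub volumes: Vec<(String, String)>,
    pub restart_policy: RestartPolicy,
}

/// True when the spec carries fields the container list cannot confirm,
/// so the only safe answer is to recreate the container.
pub fn must_recreate(spec: &ContainerSpec) -> bool {
    !spec.env.is_empty()
        || !spec.volumes.is_empty()
        || spec.restart_policy != RestartPolicy::default()
}
```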

example_harmony_apply_deployment grows --env, --volume, and --restart
flags so an operator can drive the new shape from the CLI when
authoring a Deployment CR.

Tests:
- legacy CR JSON without the new fields deserializes (wire-compat).
- env ordering survives roundtrip (drift-detection invariant).
- restart policy serializes kebab-case (compose-translation contract).
- podman_v0_score_roundtrip exercises env + volumes + restart.
The previous commit swept in `.claude/worktrees/*` (ephemeral agent
worktree submodules) and a few scratch files that landed at the repo
root during prior sessions. None of them are project artifacts.
Removing them from the index and adding to .gitignore so future
`git add -A` doesn't re-include them.

Files on disk are unchanged.
The fleet agent's NATS connection is the load-bearing piece of the
"never lose connectivity to a device" guarantee. This commit makes
that hold even when Zitadel access tokens expire across NATS pod
restarts and network partitions.

New `[credentials]` config variants (internally tagged on `type`):

  type = "toml-shared"   { nats_user, nats_pass }   # v0/dev
  type = "zitadel-jwt"   { key_path, oidc_issuer_url, audience, ... }

A `CredentialSource` enum dispatches per variant:

- TomlShared returns the same user/pass each call.
- ZitadelJwt mints an access token from Zitadel via the JWT-bearer
  flow (RFC 7523). The keyfile at `key_path` is the only durable
  secret on the device; the bearer token is short-lived and refreshed
  in-memory when the cached value is within 5 minutes of expiry.
  Two concurrent refreshes are race-safe — the second writer's mint
  is wasted but produces a correct token.
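
The refresh rule for the cached bearer token can be sketched as follows. The field names on `CachedToken` and the `needs_refresh` helper are assumptions for illustration; only the 5-minute margin comes from the text above.

```rust
use std::time::{Duration, SystemTime};

/// Refresh when the cached token is within this margin of expiry.
pub const REFRESH_MARGIN: Duration = Duration::from_secs(5 * 60);

/// Hypothetical shape of the in-memory cache entry.
#[derive(Debug, Clone)]
pub struct CachedToken {
    pub access_token: String,
    pub expires_at: SystemTime,
}

/// True when a new token must be minted before the next (re)connect.
pub fn needs_refresh(cached: Option<&CachedToken>, now: SystemTime) -> bool {
    match cached {
        // Nothing cached yet: mint the first token.
        None => true,
        Some(token) => match token.expires_at.duration_since(now) {
            Ok(remaining) => remaining <= REFRESH_MARGIN,
            // `expires_at` is already in the past.
            Err(_) => true,
        },
    }
}
```

Because the check is a pure function of the cache and the clock, two concurrent callers can both decide to refresh; the later write simply replaces the earlier token with another valid one.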

The agent's `connect_nats` is rewritten on top of async-nats's
`with_auth_callback`, which is invoked on every (re)connect attempt:

- async-nats reconnects automatically on disconnect (default
  behaviour of ConnectOptions) — we don't need a watchdog.
- Each reconnect attempt invokes the callback, which calls
  `next_credential()`. If the cached token is expired, a fresh one
  is minted before the reconnect proceeds. So a Pi that loses NATS
  while its token has just expired will pick up a brand-new token
  on the next reconnect attempt with no operator intervention.
- An `event_callback` surfaces Connected / Disconnected / SlowConsumer
  / ServerError events into tracing — operators can see exactly when
  reconnects happen, which is non-negotiable for an out-of-warranty
  device fleet.

A subtle constraint drove the trait shape: async-nats's
`with_auth_callback` requires the returned future to be `Send + Sync`,
which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does
not satisfy. The credential source is therefore an enum (concrete
dispatch) rather than `dyn CredentialSource`. Two variants is small
enough that enum dispatch beats trait-object plumbing.

Out of scope, tracked for follow-up: a separate daemon for SSH access
to the Pi via Tailscale/Headscale ("secure backdoor"), and the
device-join-request + admin-approve flow that would replace the
current admin-PAT bootstrap pattern.
The merge of feat/prepare-rpi added a `sudo_password: Option<String>`
field to SshCredentials but the `default_ubuntu_aws` constructor on
the destination branch was authored before that field existed. Add
the missing field as `None` (matches the prepare-rpi semantics:
passwordless sudo expected unless explicitly configured).
The Pi onboarding flow can now mint a per-device Zitadel machine user
on the operator's machine and ship the resulting JWT key to the Pi —
the agent then authenticates to NATS via JWT-bearer instead of shared
nats_user/nats_pass.

`FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous
flat `nats_user` / `nats_pass` fields. Two variants:

- TomlShared { nats_user, nats_pass } — legacy / dev fallback.
- ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } —
  per-device JWT-bearer. The Score:
    * Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json
      (mode 0640, owner fleet-agent — matches the agent's secret-mount
      conventions).
    * Renders [credentials] type = "zitadel-jwt" pointing at that
      keyfile + the issuer + audience the agent's CredentialSource
      needs.
  A change to either the keyfile content or the TOML triggers an
  agent restart, same as binary / unit drift.

`fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path.
The bootstrap PAT is held in the CLI's memory only; it never lands
on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id,
--zitadel-device-role (default `device`), --danger-accept-invalid-certs.

`zitadel_bootstrap` is a slim ManagementAPI client that, idempotently
per device:
1. Find-or-create machine user `device-${device_id}`.
2. Find-or-skip a project role grant (defaults to `device`).
3. Always mint a fresh JSON key and return its content. (Zitadel
   doesn't expose the private half of an existing key, so reusing
   isn't possible — stale keys remain valid until expiry, which is
   fine because each setup run overwrites the on-device keyfile.)

Three new render_toml tests cover the zitadel-jwt path; eleven
existing agent tests still pass.

Out of scope, tracked: device-join-request + admin-approve flow that
would replace bootstrap-PAT entirely (closer to the OKD
node-approval pattern). Long-lived admin PAT is acceptable for the
demo per product call.
Adds `examples/fleet_staging_deploy/` — the operator-side, run-once-
per-customer harness that brings up the fleet platform's central
services on a real OKD/K8s cluster. Complements the existing
`fleet_auth_callout` (k3d local-dev harness, kept unchanged) and
`fleet_rpi_setup` (per-device onboarding).

`FleetDomainConfig` is the single source of truth for hostnames:

  base_domain = "customer1.nationtech.io"
  → zitadel.<base>     (Zitadel HTTPS via OKD HAProxy edge-TLS)
  → nats.<base>        (NATS WSS through the same ingress)

Nothing is hardcoded; the operator supplies one --base-domain flag
and the deploy is fully parameterized. Re-running is idempotent
(rides the helm-upgrade-by-default + ZitadelSetupScore search-then-
create + persisted issuer-NKey-secret idempotency layers).
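
A minimal sketch of deriving every hostname from the single base domain. The method names and the URL helpers are hypothetical; only the `zitadel.<base>` / `nats.<base>` layout comes from the description above.

```rust
/// Single source of truth for the fleet platform's hostnames.
pub struct FleetDomainConfig {
    pub base_domain: String,
}

impl FleetDomainConfig {
    pub fn zitadel_host(&self) -> String {
        format!("zitadel.{}", self.base_domain)
    }

    pub fn nats_host(&self) -> String {
        format!("nats.{}", self.base_domain)
    }

    /// Zitadel is served over HTTPS on the default port 443.
    pub fn zitadel_url(&self) -> String {
        format!("https://{}", self.zitadel_host())
    }

    /// NATS is reached over WebSocket-over-TLS through the same ingress.
    pub fn nats_wss_url(&self) -> String {
        format!("wss://{}", self.nats_host())
    }
}
```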

NATS values render under config.merge.{auth_callout, accounts,
system_account}, with WSS via `websocket: { enabled, port: 8443,
ingress: { className: openshift-default, ... } }` and the OKD-flavored
HAProxy edge-TLS annotations:

  route.openshift.io/termination: edge
  haproxy.router.openshift.io/timeout: "1h"

(Switch to `reencrypt` when the customer wants pod-to-edge TLS;
gateway-api migration is on their roadmap, separate from the demo.)

bring_up_staging():
- Deploys ZitadelScore (external_secure: true, no external_port → 443).
- Waits for HTTPS .well-known.
- Provisions the project + API app + roles via ZitadelSetupScore
  hitting Zitadel through the public ingress (port 443, TLS verified).
  No machine users provisioned — fleet_rpi_setup mints them on demand
  per device, so the staging deploy stays device-count-agnostic.
- Persists / reads the issuer NKey seed in the
  `callout-issuer-seed` K8s secret (so re-runs don't invalidate
  user JWTs already in flight on customer Pis).
- Deploys NATS via NatsHelmChartScore with the WSS values.
- Deploys NatsAuthCalloutScore (oidc_audience = project_id;
  external_secure path means no danger_accept_invalid_certs).

main.rs ends by printing the exact `cargo run -p
example-fleet-rpi-setup ...` invocation the operator runs against a
Pi, with the project_id and zitadel/nats URLs filled in.

Three unit tests cover the domain config + NATS values rendering
(WSS + edge-TLS annotations + auth_callout under merge).
Adds `examples/fleet_sso_login/` — the developer-side CLI that proves
the SSO works end-to-end against a deployed staging instance. RFC 8628
device-code flow:

- POSTs `/oauth/v2/device_authorization` with the harmony-cli client_id.
- Prints `verification_uri_complete` so the developer opens one URL in
  the browser; Zitadel handles the auth (username/password, MFA,
  whatever the customer has wired into Zitadel's auth chain).
- Polls `/oauth/v2/token` honouring the standard `authorization_pending`
  / `slow_down` polling protocol.
- On success: decodes the access token's claims, prints
  `Welcome <name> <email>`, persists the session (issuer + client_id +
  access_token + claims) at $DATA_DIR/harmony/sso-session.json with
  mode 0600.
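
The polling protocol from RFC 8628 can be sketched as a small pure function. `PollStep` and `next_poll_step` are hypothetical names; the 5-second increase on `slow_down` is what the RFC specifies.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum PollStep {
    /// Wait for the given interval, then poll the token endpoint again.
    RetryAfter(Duration),
    /// Stop polling; the flow failed with this OAuth error code.
    Fail(String),
}

/// Decide what to do after the token endpoint returned an OAuth error code.
pub fn next_poll_step(error_code: &str, current_interval: Duration) -> PollStep {
    match error_code {
        // The user has not finished logging in yet: keep the same cadence.
        "authorization_pending" => PollStep::RetryAfter(current_interval),
        // The server asks us to back off: add 5 seconds to the interval.
        "slow_down" => PollStep::RetryAfter(current_interval + Duration::from_secs(5)),
        // access_denied, expired_token, and anything else end the flow.
        other => PollStep::Fail(other.to_string()),
    }
}
```

The caller keeps the returned interval for all later polls, so repeated `slow_down` responses keep widening the gap.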

For the demo this proves the SSO chain end-to-end. The actual
`harmony fleet apply` operation (which would consume the persisted
token through a fleet-platform API gateway) is post-demo — clusters
typically don't accept Zitadel JWTs as kube-apiserver bearer tokens
without an OIDC integration the customer would have to opt into.

`fleet_staging_deploy` now also provisions a `harmony-cli` Device
Code OIDC application alongside the existing API app, captures its
client_id from the ZitadelClientConfig cache, and prints both the
client_id and the exact `cargo run -p example-fleet-sso-login ...`
invocation in the operator's "next steps" panel.
Hands-on walkthrough for the 48-hour customer demo:

- Operator: build/push the callout image → fleet-staging-deploy →
  capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster,
  cross-compile the agent for aarch64, run fleet-rpi-setup once
  per device with --bootstrap-token. Each Pi's agent connects to
  NATS over WSS using the JWT-bearer token minted from its
  per-device keyfile.
- Deploy a container to a labeled subset via
  example_harmony_apply_deployment with --env / --volume / --restart
  flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth
  callout's logs.

Also captures what's deliberately NOT in the demo (compose
auto-translation, UI, Tailscale backdoor, device-join-request
flow, OpenBao, K8s OIDC) so the customer call has clean expectation-
setting.

The runbook is the closing piece of the 48h-demo work plan;
sequenced after the eight feat / refactor commits that built the
underlying functionality.
The VM smoke harness still uses shared NATS creds for v0 (no Zitadel
JWT path through libvirt — the customer-facing Pi flow has it via
fleet_rpi_setup --bootstrap-token). This commit rewrites the
FleetDeviceSetupConfig literal against the new `auth: FleetDeviceAuth` field.
Adds ROADMAP/fleet_platform/v0_demo_e2e.md and threads it from
v0_1_plan.md. The VM rehearsal extends smoke-a4 (already-green k3d
+ libvirt VM + agent + apply CR + reconcile loop) with Zitadel +
auth callout + agent JWT auth. Two devices + one admin, real
cargo tests sharing a OnceCell-bringup.

Plan calls out:
- The 7 tests, including the load-bearing
  `agent_recovers_from_nats_pod_restart` (asserts the auto-reconnect
  + auth-callback re-mint path under realistic disturbance).
- Five known risks / debugging traps to expect on first cold-start
  (iam-admin-pat secret timing, /etc/hosts injection, k3d port
  collisions, etc.).
- Success criteria for the rehearsal day: cold cargo run greens in
  <20 min, all 7 tests green on a clean machine, the NATS-restart
  test reliably greens 5 runs in a row.
- Anything below the success criteria → reframe the customer call
  to "architecture walkthrough + local k3d demo + pilot in 1-2
  weeks." Avoids burning the relationship to keep a deadline.

Once VM rehearsal is green the residual OKD deltas are configuration
(Route annotations, image registry, real DNS, cert) — no new code.
Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's
existing pieces (Zitadel + auth callout deploy) with per-device
machine-user provisioning (one ZitadelSetupScore call per VM) and
FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness
expects pre-provisioned libvirt VMs (one per device) reachable via
`FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via
ProvisionVmScore is a follow-up — keeping the harness observable in
pieces during the cold-start debugging tomorrow.

Constituent helpers in `fleet_auth_callout::lib.rs` flipped from
private to `pub` (deploy_zitadel, wait_for_zitadel_ready,
ensure_issuer_seed, build_and_load_callout_image, etc.) so the new
harness composes them rather than re-implementing.

`bring_up_full_stack`:
1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d).
2. Deploy Zitadel + Postgres.
3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the
   chart-provisioned `iam-admin-pat` secret. (Last step is new and
   load-bearing — without it ZitadelSetupScore races the chart's
   setup job and fails on first cold-run.)
4. ZitadelSetupScore for project + API app + roles + admin
   machine-user (admin gets fleet-admin role grant).
5. Issuer NKey from a persisted secret + NATS deploy with
   auth_callout block + callout pod.
6. For each device i: per-device ZitadelSetupScore (machine-user
   with `device` role grant), pull the JSON keyfile from cache,
   render the agent's TOML with the keyfile path. (FleetDeviceSetupScore
   invocation is wired structurally; the SSH-and-apply step is
   gated behind the VM provisioning follow-up.)

`HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so
VMs on a libvirt NAT can resolve `sso.fleet.local` to the host
gateway. Managed-block markers in /etc/hosts make the merge
idempotent across re-runs and removable when entries are dropped
from the score. Four new unit tests cover the merge invariants
(insert, replace, strip, byte-stable).
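
A managed-block merge along these lines would satisfy the four invariants. This is a sketch, not the code in FleetDeviceSetupScore: the marker strings, the `HostsEntry` fields, and the choice to re-append the block at the end of the file are all assumptions.

```rust
// Hypothetical marker lines delimiting the managed block.
const BEGIN: &str = "# BEGIN fleet-managed hosts";
const END: &str = "# END fleet-managed hosts";

pub struct HostsEntry {
    pub ip: String,
    pub hostname: String,
}

/// Return the new /etc/hosts content: everything outside the managed block
/// is kept, and the block is rebuilt from `entries` (or dropped when empty).
pub fn merge_hosts_file(existing: &str, entries: &[HostsEntry]) -> String {
    let mut out = String::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            l if l == BEGIN => in_block = true,
            l if l == END => in_block = false,
            // Old managed entries are discarded.
            _ if in_block => {}
            _ => {
                out.push_str(line);
                out.push('\n');
            }
        }
    }
    if !entries.is_empty() {
        out.push_str(BEGIN);
        out.push('\n');
        for e in entries {
            out.push_str(&format!("{}\t{}\n", e.ip, e.hostname));
        }
        out.push_str(END);
        out.push('\n');
    }
    out
}
```

Running the merge on its own output with the same entries yields identical bytes, which is what lets a Score treat "content changed" as the drift signal.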

Test skeleton in `tests/e2e_walking_skeleton.rs`:
- `both_devices_heartbeat_within_60s` — implemented; reads from
  device-info KV via admin token.
- `admin_jwt_reads_any_device_subject` — implemented; subscribes
  to `device-state.>` as admin.
- `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending
  per-device-key plumbing through E2eHandles.
- `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending
  the NATS-pod-restart driver.

The two `#[ignore]`d tests cover the load-bearing reconnect and
isolation invariants. Wiring them is the morning-of-rehearsal
priority since those are the customer-facing claims.

Out of scope of this commit (called out in the roadmap doc):
- ProvisionVmScore integration (today operator runs fleet_vm_setup
  out-of-band).
- Operator install via Helm (smoke-a4 runs operator host-side; this
  harness inherits that pattern).
- Full SSH-based agent install via FleetDeviceSetupScore — Score
  built, invocation gated.
Wires the previously-built FleetDeviceSetupScore through to a
LinuxHostTopology against each pre-provisioned VM. Mirrors the
fleet_rpi_setup pattern but synthesizes inline so the harness drives
N VMs in sequence without re-deriving the CLI plumbing.

Each VM gets:
- An /etc/hosts entry mapping `sso.fleet.local` → libvirt host IP
  via the new HostsEntry support, so the in-VM agent's HTTP client
  to Zitadel can resolve the issuer.
- The per-device Zitadel machine key dropped at
  /etc/fleet-agent/zitadel-key.json.
- Agent TOML with `type = "zitadel-jwt"` pointing at the keyfile.
- Agent service started under systemd.

SSH user assumed `fleet-admin` (matches what fleet_vm_setup +
smoke-a4 cloud-init create). Private key from the harmony fleet
keypair (ensure_fleet_ssh_keypair).

After this commit, `cargo run -p example-fleet-e2e-demo` is the
single command that turns a fresh k3d + 2 booted VMs into a
fully-converged stack: Zitadel + NATS callout + 2 agents speaking
JWT-bearer to NATS. Tomorrow morning: prove it actually does
that on a clean machine.
Zitadel only includes the project-roles block in an access token when
the JWT-bearer request asks for it via the
`urn:zitadel:iam:org:projects:roles` scope (PLURAL "projects"). Without
it the agent's token has a valid signature/audience but no roles, so
the NATS auth callout rejects with "no authorized role in token" even
though the machine user has a "device" grant.

Discovered while running the VM-based e2e rehearsal: agents could mint
a token, connect to NATS, then immediately fail authorization. The
plural-projects vs. singular-project distinction is a Zitadel
convention; both scopes are required, and the comment now spells out
what each one does.
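
For illustration, a scope string carrying both reserved scopes might be assembled as below. The function signature is hypothetical, and both the inclusion of `openid` and the exact singular-project scope (`urn:zitadel:iam:org:project:id:{project_id}:aud`, Zitadel's reserved audience scope) are assumptions about what the agent sends.

```rust
/// Plural "projects": asks Zitadel to include the project-roles block.
pub const PROJECTS_ROLES_SCOPE: &str = "urn:zitadel:iam:org:projects:roles";

/// Build the space-separated `scope` parameter for the JWT-bearer request.
pub fn build_scope(project_id: &str) -> String {
    [
        "openid".to_string(),
        // Singular "project": adds the project id to the token audience.
        format!("urn:zitadel:iam:org:project:id:{project_id}:aud"),
        PROJECTS_ROLES_SCOPE.to_string(),
    ]
    .join(" ")
}
```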
The cargo bin target is `harmony-fleet-agent`, not `fleet-agent` —
the latter never existed under target/release. Smoke-a4 happened to
work because callers passed --agent-binary explicitly; the harness
defaults didn't.
No behavior changes; only re-flowing existing expressions.
fix(callout): align device permissions with KV key formats and machine-user prefix
Some checks failed
Run Check Script / check (pull_request) Failing after -44h57m23s
d4fd4859ec
Two bugs surfaced when the agent went live against NATS JetStream KV
in the VM-based e2e rehearsal:

1. The default `device` role only allowed flat `device-state.<id>` /
   `device-commands.<id>` subjects. The agent's actual data plane is
   JetStream KV, which puts every operation on `$KV.<bucket>.<key>`
   subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`.
   With the old role config, the very first KV publish died with
   `Permissions Violation for Publish to "$JS.API.INFO"`.

   The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four
   per-device data subjects derived from
   harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>,
   heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct
   `device-state.<id>` / `device-commands.<id>` subjects are kept so
   non-JetStream callers of NatsAuthCalloutScore still work.

   A new unit test (`device_role_covers_reconciler_contract_kv_subjects`)
   imports the contract crate as a dev-dep and asserts each contract-
   produced subject is matched, plus that cross-device subjects are
   *not* matched. This locks the role config to the contract surface so
   future renames break the test before they break prod.

2. Zitadel's `client_id` claim for a machine user equals the userName
   verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the
   user as `device-{device_id}`, so the JWT carries
   `device-vm-device-00` while the agent's KV keys use the bare
   `vm-device-00`. The callout was interpolating the prefixed string
   into permissions, producing rules that never matched what the
   agent actually publishes.

   Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`,
   defaults empty so existing deployments are unaffected). When set,
   the validator strips the prefix from the extracted claim before
   permission interpolation. The fleet_auth_callout example wires it
   to `device-` so the e2e harness stays end-to-end correct without
   reaching into either naming convention.
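
Both bugs come down to NATS subject matching, which the sketch below makes concrete. Everything here is illustrative: the rule list uses a `*` wildcard for the bucket token and a made-up bucket name in the usage, whereas the real role config derives subjects from harmony_reconciler_contracts::kv; the function names are hypothetical too.

```rust
/// NATS subject matching: `*` matches exactly one token, `>` matches one
/// or more trailing tokens. Assumes `pattern` is a valid NATS subject.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut p = pattern.split('.');
    let mut s = subject.split('.');
    loop {
        match (p.next(), s.next()) {
            (Some(">"), Some(_)) => return true,
            (Some("*"), Some(_)) => {}
            (Some(a), Some(b)) if a == b => {}
            (None, None) => return true,
            _ => return false,
        }
    }
}

/// Strip the machine-user prefix from the claim; an empty prefix is a no-op.
pub fn strip_device_prefix<'a>(claim: &'a str, prefix: &str) -> &'a str {
    claim.strip_prefix(prefix).unwrap_or(claim)
}

/// Per-device KV rules with the device id interpolated (bucket wildcarded).
pub fn device_kv_rules(device_id: &str) -> Vec<String> {
    vec![
        format!("$KV.*.info.{device_id}"),
        format!("$KV.*.state.{device_id}.*"),
        format!("$KV.*.heartbeat.{device_id}"),
        format!("$KV.*.desired-state.{device_id}.*"),
    ]
}
```

With the claim `device-vm-device-00` interpolated verbatim, none of the rules match a subject keyed by `vm-device-00`; stripping `device-` first makes them line up, while subjects for other devices still fail to match.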

Verified end-to-end: both VM agents now publish DeviceInfo /
heartbeat through JetStream KV with no permission errors and zero
service restarts since the rollout.
chore: formatting
Some checks failed
Run Check Script / check (pull_request) Failing after -44h56m9s
54308fd7a4
feat(fleet-agent): emit state pulse on direct device-state.<id> subject
Some checks failed
Run Check Script / check (pull_request) Failing after -44h56m12s
c6284c09bc
The agent's data plane was JetStream-KV-only, so live observers
that don't want to consume the JS stream had no signal to subscribe
to. The walking-skeleton e2e admin test was failing as a result —
admin subscribes to `device-state.>` (the per-device direct
subject) and saw nothing in 30s.

This commit adds a small core-NATS publish on `device-state.<id>`
alongside the existing KV writes:

- `FleetPublisher::publish_state_pulse()` emits a tiny
  `{device_id, kind: "heartbeat", at}` payload on
  `device-state.<device_id>`, called from the heartbeat loop so
  observers see traffic on the same 30s cadence as the KV
  heartbeat write — but on a non-JetStream subject anyone can sub
  to.
- `write_deployment_state()` now fans out the same payload it puts
  in the KV bucket on the direct subject, so live admin tooling
  picks up reconcile transitions immediately without watching the
  KV stream.

Also threads `device_id_prefix_strip = "device-"` through the
fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore
construction (parallel to fleet_auth_callout's `bring_up_stack`),
and was missing the prefix-strip line, so the deployed callout was
interpolating permissions against `device-vm-device-00` instead of
the bare device id the agent uses.

Locks the regression with a unit test
(`device_id_prefix_strip_lands_as_env_value`) on the deployment
manifest builder.

Verified end-to-end in the VM rehearsal:
  test both_devices_heartbeat_within_60s ... ok
  test admin_jwt_reads_any_device_subject ... ok
Merge remote-tracking branch 'origin' into feat/nats-auth-callout-e2e
Some checks failed
Run Check Script / check (pull_request) Failing after -44h57m27s
3069f5b9ae
ZitadelClientConfig was used as both a key store (machine keys —
which Zitadel cannot return after creation, so caching is required)
AND a lookup cache (project_id, machine_user_ids, user_grants).
The latter introduced a silent drift class:

- ZitadelSetupScore writes the cache incrementally as it creates
  each resource.
- If Zitadel is reset between runs (Postgres recreated, IDs
  reissued), the cache still holds the old IDs.
- ensure_project / ensure_app / ensure_machine_user / user_grant
  short-circuited on cache hit and never consulted Zitadel — so
  downstream Scores got the stale ID.
- The legacy `project_id` field was further `is_none`-guarded so it
  preserved the very first id ever seen, surviving any number of
  Zitadel resets.

Net effect in the wild: the deployed callout's `OIDC_AUDIENCE`
silently pointed at a project that no longer existed, while
agents kept working only because their TOML config carried the
matching stale id. A manual mint script reading `project_id` from
the cache would produce tokens that pass signature validation but
fail the audience check — exactly the symptom that surfaced this
bug.

Fix: drop the cache-hit short-circuit in every ensure_* path and
always live-query. The cache now only holds machine key material
(its only legitimate role) and a record of last-known IDs that
get refreshed on every apply. Cost: ~1 extra HTTP request per
project / app / user / grant per Score apply. These are not hot paths.

Also: stop is_none-guarding `config.project_id` so the legacy
field tracks live state for older single-project consumers.
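
A minimal sketch of the "always live-query, then refresh the cache" shape, under stated assumptions: the `ProjectDirectory` trait, `IdCache`, and `FakeDirectory` are hypothetical stand-ins for the Zitadel management API and the client config, not the real types.

```rust
use std::collections::HashMap;

/// Stand-in for the identity provider's management API (hypothetical).
trait ProjectDirectory {
    fn find_project(&self, name: &str) -> Option<String>;
    fn create_project(&mut self, name: &str) -> String;
}

/// Record of last-known ids. It is never consulted to skip the live lookup.
#[derive(Default)]
struct IdCache {
    project_ids: HashMap<String, String>,
}

/// Ask the live system first, create on miss, then refresh the cache.
fn ensure_project(dir: &mut dyn ProjectDirectory, cache: &mut IdCache, name: &str) -> String {
    let id = match dir.find_project(name) {
        Some(id) => id,
        None => dir.create_project(name),
    };
    cache.project_ids.insert(name.to_string(), id.clone());
    id
}

/// In-memory fake used to exercise the sketch.
#[derive(Default)]
struct FakeDirectory {
    projects: HashMap<String, String>,
    next_id: u32,
}

impl ProjectDirectory for FakeDirectory {
    fn find_project(&self, name: &str) -> Option<String> {
        self.projects.get(name).cloned()
    }

    fn create_project(&mut self, name: &str) -> String {
        self.next_id += 1;
        let id = format!("id-{}", self.next_id);
        self.projects.insert(name.to_string(), id.clone());
        id
    }
}
```

Clearing `FakeDirectory::projects` between two calls simulates a reset: the second call returns the newly issued id and overwrites the stale cache entry.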
Working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.

Cross-linked from fleet-zitadel-faq.md.
The agent's `credentials.rs` + `CredentialsSection` enum graduate
into a workspace crate (`fleet/harmony-fleet-auth/`) so the
operator can consume the same code path. Single type, single
factory, single auth-callback wiring. The only thing that varies
between consumers is where the `[credentials]` TOML bytes come
from — the agent reads them from a config file on disk, the
operator (next commit) will read them from an env var.

Public surface of the new crate:
  CredentialsSection                    — the deserializable config enum
  CredentialSource / NatsCredential     — the runtime objects
  MachineKeyFile / CachedToken          — helper types
  credential_source_from_config         — factory
  connect_options_with_credentials      — async-nats wiring

Agent consumes via `pub use harmony_fleet_auth::CredentialsSection`
in its own `config.rs` so existing call sites keep working.
Existing 5 tests in the new crate + 7 in the agent all green.

This commit is structurally a move; behavior unchanged. Operator
wiring, additional unit tests, and the JWT-mint refactor (split
build_assertion / build_scope / build_token_url for testability)
follow in the next commits.
Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The
new tests lock the corners we burned cycles on while debugging
the live system:

  * cache freshness boundary (within-leeway, outside-leeway,
    no-cache, non-zitadel variant)
  * assertion claim shape (iss/sub/aud/exp/iat) and the 60-second
    lifetime constant Zitadel enforces server-side
  * scope string content (plural-projects-roles + singular-project-id
    URN + openid base)
  * token URL strips trailing slashes (the //oauth/v2/token 404
    waiting to bite the next operator)
  * MachineKeyFile JSON parsing under Zitadel's wire shape

Refactor: build_assertion now delegates to build_assertion_claims
+ build_assertion_header (pure, no signing). Lets the claim/header
shape be unit-tested without an RSA private-key fixture; the
sign-and-decode end-to-end is still covered by the e2e harness.

No new deps. wiremock not needed — every meaningful assertion is
on pure logic.
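
Two of the corners listed above are small enough to sketch as pure functions. This is an illustration only: the function names and the exact leeway semantics (fresh while `now + leeway` is strictly before expiry) are assumptions, not the crate's code.

```rust
/// Join the issuer base URL with the token path without producing `//`.
fn build_token_url(issuer: &str) -> String {
    format!("{}/oauth/v2/token", issuer.trim_end_matches('/'))
}

/// A cached token counts as fresh only while `now + leeway` is still
/// strictly before its expiry (assumed semantics).
fn is_fresh(expires_at_unix: u64, now_unix: u64, leeway_secs: u64) -> bool {
    now_unix.saturating_add(leeway_secs) < expires_at_unix
}
```

Keeping both functions free of I/O is what lets them be tested without an HTTP mock or a key fixture.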
The operator was opening a bare async_nats::connect with no auth,
which would fail closed against a callout-protected NATS. Wires it
through the same JWT-bearer flow the agent uses, sharing the
recently-extracted harmony-fleet-auth crate.

Operator side
-------------
* main.rs: read FLEET_OPERATOR_CREDENTIALS_TOML (TOML snippet, same
  shape as the agent's [credentials] block — single
  CredentialsSection type, just a different byte source). Empty
  string bypasses (callout-less dev only, with a loud warning).
* chart.rs: ChartOptions gains an optional OperatorCredentials field.
  When set, build_chart's Deployment mounts a Secret as both
  envFrom (TOML payload → FLEET_OPERATOR_CREDENTIALS_TOML) and a
  volume mount for the JSON keyfile at the configured key_path
  (defaults to /etc/fleet-operator/zitadel-key.json). On-disk helm
  chart still emits credentials: None — those are environment-
  specific and out of scope for a redistributable chart.
* Public manifest builders (build_service_account, build_cluster_role,
  build_cluster_role_binding, build_operator_deployment,
  operator_secret) so the e2e bring-up can apply each resource via
  K8sResourceScore without re-implementing the manifests.
* mod chart now lives in lib.rs so external consumers (the e2e
  bring-up) can reach into it.

E2e bring-up
------------
* Bring-up gains a separate `fleet-operator` machine user with the
  fleet-admin role grant — distinct from the manual-admin
  `fleet-ops` user so audit logs can tell automated operator
  actions apart from human ones.
* New steps 8/10 (build + sideload operator image) and 9/10 (apply
  CRDs + RBAC + Secret + Deployment + wait for Ready). Devices step
  becomes 10/10.
* Reuses harmony_fleet_operator's manifest builders + operator_secret
  via K8sResourceScore — no duplicated YAML, no shell-out.

Tests
-----
* All existing tests pass (harmony-fleet-auth: 18, harmony-fleet-agent:
  7, harmony-fleet-operator: 2). E2e walking-skeleton is exercised
  by the next phase's clean rerun.
The agent's periodic reconcile destroys and recreates any service
whose ContainerSpec has env or volumes on every 30s tick. Root cause:
matches_spec returns false unconditionally for those fields because
podman's list endpoint doesn't surface them; the original author
chose to declare "any spec with state is drifted" as a fail-safe.
That fail-safe weaponizes the polling reconciler into a loop.

Tags the offending line with a multi-paragraph FIXME explaining
the symptom, the root cause, the proposed fix (containers.inspect
+ structural compare + an integration test), and the demo-time
workaround (keep demo specs trivial — the hello-web nginx demo
already is).

Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's
known-risks section so it's visible at planning time.

Out of scope for tonight; in scope for delivery alongside the
upcoming health-check support on ContainerSpec.
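
The proposed structural compare could look roughly like this. It is a sketch under stated assumptions: `DesiredSpec` and `InspectedContainer` are hypothetical, reduced shapes, and how the inspect data is fetched is out of scope here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Desired state, reduced to the fields relevant here (hypothetical shape).
struct DesiredSpec {
    image: String,
    env: Vec<(String, String)>,
    volumes: Vec<String>,
}

/// What a per-container inspect call would report (hypothetical shape).
struct InspectedContainer {
    image: String,
    env: Vec<(String, String)>,
    volumes: Vec<String>,
}

/// Order-insensitive structural compare. Extra env injected by the image or
/// the runtime is tolerated: only the desired keys are checked.
fn matches_spec(desired: &DesiredSpec, actual: &InspectedContainer) -> bool {
    if desired.image != actual.image {
        return false;
    }
    let actual_env: BTreeMap<&str, &str> = actual
        .env
        .iter()
        .map(|(k, v)| (k.as_str(), v.as_str()))
        .collect();
    let env_ok = desired
        .env
        .iter()
        .all(|(k, v)| actual_env.get(k.as_str()) == Some(&v.as_str()));
    let want: BTreeSet<&str> = desired.volumes.iter().map(String::as_str).collect();
    let have: BTreeSet<&str> = actual.volumes.iter().map(String::as_str).collect();
    env_ok && want == have
}
```

With a compare like this, a spec that carries env or volumes only reports drift when a desired value actually differs, so the polling reconciler stops recreating healthy containers.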
fix(zitadel,operator): user-grant search endpoint + operator keyfile mode
Some checks failed
Run Check Script / check (pull_request) Failing after 2m15s
29896bfeab
Two bugs uncovered while running the full e2e walk end to end:

1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search
   which Zitadel rejects with 405 Method Not Allowed (the original
   author's note in the comment hinted at this). The cache previously
   masked it: first apply created the grant + cached the id; second
   apply hit the cache and skipped the broken search. The live-query
   refactor (f4d6fb94) removed the cache short-circuit, surfacing
   the bug as "Create user grant failed: User grant already exists"
   on every re-apply.

   Fix: switch to the collection endpoint
   /management/v1/users/grants/_search with a userIdQuery filter,
   matching the Zitadel API that's actually wired up. Now returns
   the existing grant on re-apply and the create_user_grant fallback
   is correctly skipped.

2. Operator keyfile mounted as 0o400 owned by root. The operator pod
   runs as non-root (image USER directive — no fixed runAsUser
   because we want SCC compatibility). Result: operator boots,
   tries to load the JSON keyfile from the Secret volume, hits
   EACCES, fails the credential factory, retries forever.

   Fix: mode 0o444. World-read inside the pod is fine — single
   container, no other consumers, the Secret namespace is locked
   down, and the file never escapes pod-fs. The proper fsGroup-based
   alternative requires pinning a UID/GID, which conflicts with our
   SCC-friendly choice of leaving runAsUser unset.
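
The EACCES in item 2 follows directly from the mode bits. The check below is a deliberately simplified model (owner and other bits only, group bits and capabilities ignored), with a hypothetical function name, just to make the reasoning concrete.

```rust
/// Simplified POSIX read check: can a process running as `uid` read a file
/// owned by `owner_uid` with permission bits `mode`? Group bits are ignored.
fn can_read(mode: u32, owner_uid: u32, uid: u32) -> bool {
    if uid == owner_uid {
        mode & 0o400 != 0 // owner-read bit
    } else {
        mode & 0o004 != 0 // other-read bit
    }
}
```

A root-owned `0o400` file has no other-read bit, so a non-root UID is denied; `0o444` sets it.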

Also fixes a stale `git rm` from commit 4194baac
(harmony-fleet-auth extraction) — the agent's local credentials.rs
was deleted from disk but never staged.

Verified end to end:
  * STACK READY in 2 min on warm cluster
  * Operator pod: "minted fresh Zitadel access token", "NATS connected",
    "starting Deployment controller", "watching device-info KV"
  * 2 Device CRs auto-created with full label set
  * `kubectl apply -f` of a Deployment CR with
    targetSelector.matchLabels: { group: group-a } produced:
      - status.aggregate { matched=1, succeeded=1, failed=0 }
      - HTTP 200 from nginx on vm-device-00:8080
      - connection refused from vm-device-01:8080 (correctly excluded)
Reviewed-on: #279
johnride added 2 commits 2026-05-05 14:04:45 +00:00
refactor(fleet-operator): replace ScorePayload with ReconcileScore in Deployment CRD
Some checks failed
Run Check Script / check (pull_request) Failing after 2m45s
95ccc974f9
Removes the hand-typed ScorePayload struct and its custom schemars
schema function. DeploymentSpec.score is now typed as the strongly
typed ReconcileScore enum already used by the agent, eliminating
duplication and ensuring the CRD schema is derived automatically.

- Add JsonSchema derive to PodmanService, PodmanV0Score, ReconcileScore
- Enable podman feature on harmony dependency in operator
- Re-export ReconcileScore/PodmanV0Score/PodmanService from crd module
- Update harmony_apply_deployment and fleet_load_test examples
- Remove TODO comment from harmony_apply_deployment

Wire format is unchanged (adjacently tagged {type, data}), so the
operator -> NATS KV -> agent path remains fully backward compatible.
Reviewed-on: #278
johnride added 13 commits 2026-05-05 14:42:07 +00:00
The operator Dockerfile previously copied a host-built binary into
archlinux:base — archlinux was a glibc-ABI workaround for that
host-build path. Convert to a two-stage build (rust:1.94-slim →
debian:bookworm-slim) so cargo runs inside the image. load-test.sh
loses its host cargo build + staging-context trick and now points
podman at the workspace root with -f. Add build_docker.sh as the
local Harbor entry point (DOCKER_TAG, PUSH overrides).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors .gitea/workflows/harmony_composer.yaml: on push to master (or
manual dispatch), build the multi-stage Dockerfile and push
hub.nationtech.io/harmony/harmony-fleet-operator:latest. No buildx
caching yet — TODO comment in the workflow tracks it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure rustfmt wrapping on long lines that pre-dated this branch — surfaced
when running `cargo fmt --check` as part of unrelated work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses the load-test harness's chart-gen + helm-install dance into
first-class Harmony Scores. Customer-facing path:

  let score = FleetServerScore::new(nats, operator);
  score.create_interpret().execute(&Inventory::empty(), &topology).await?;

FleetOperatorScore renders the operator chart (CRDs + RBAC + ServiceAccount
+ Deployment) into a tempdir and delegates to HelmChartScore. FleetServerScore
composes it with NatsBasicScore via fail-fast `?` chaining; Zitadel + Argo
hang off the same chain when their Scores land.

Structural change: CRD type definitions and chart-builder moved from
fleet/harmony-fleet-operator/src/{crd,chart}.rs into
harmony/src/modules/fleet/operator/. Harmony can't depend on the operator
crate (cycle), so the score-side code lives in harmony and the operator
binary imports the types right back via
`harmony::modules::fleet::operator::*`. Considered keeping CRDs in the
operator crate with the score either there or in a sibling crate, but
putting customer-facing scores in harmony/src/modules/fleet/ matches the
existing convention (FleetDeviceSetupScore, ProvisionVmScore) and keeps
the CRDs reachable from future harmony scores (e.g. an inventory aggregator
reading Device CRs) without dragging in the operator binary.

The operator's `chart` subcommand stays as a developer convenience
(routes through harmony::modules::fleet::operator::build_chart) so
`cargo run -p harmony-fleet-operator -- chart` still produces an
identical chart on disk for inspection. Existing examples
(fleet_load_test, harmony_apply_deployment) updated to import CRD types
from harmony directly.

load-test.sh phase 3c collapses to a single
`cargo run -p example_fleet_server_install` invocation; phase 2b's NATS
install still runs separately so the host-side NATS reachability probe
sits where it always did. Idempotency: re-running short-circuits via
HelmChartScore::find_installed_release on both inner installs.

Verified: cargo fmt --check, cargo clippy, cargo test all pass; the
4 fleet operator unit tests (2 migrated from operator crate, 2 new on
FleetOperatorScore defaults/builders) pass under `cargo test -p harmony`;
operator chart subcommand produces an identical chart structure
post-refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scripts for running the new install Score against a local cluster:

- examples/fleet_server_install/run.sh — generic, cwd-independent
  passthrough around `cargo run -p example_fleet_server_install`.
- fleet/scripts/run_server_install.sh — opinionated k3d test harness:
  creates `fleet-server-test` cluster if absent (with NATS port 4222
  mapped through klipper-lb), builds the operator image via
  build_docker.sh, sideloads it, runs the Score, and leaves the
  cluster up. Prints teardown + redeploy commands at the end. Header
  documents the helm-idempotency limitation: a rebuilt image won't
  redeploy on a second run unless `helm uninstall` is invoked first
  (HelmChartScore short-circuits on chart_version match). Proper fix
  is deferred — content-hash chart_version or a force_upgrade flag.
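
One way the deferred content-hash `chart_version` could work is sketched below. Everything here is an assumption for illustration: the function names, the use of FNV-1a, and the choice to append the digest as SemVer build metadata after `+`.

```rust
/// FNV-1a 64-bit, written out by hand so the digest is stable across
/// toolchains (std's DefaultHasher makes no such promise).
fn fnv1a64(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Hypothetical: derive a chart version from a base version plus a digest
/// of the image reference and the rendered chart files.
fn content_chart_version(base: &str, image: &str, files: &[(&str, &str)]) -> String {
    // Sort so the digest does not depend on render order.
    let mut sorted: Vec<&(&str, &str)> = files.iter().collect();
    sorted.sort();
    let mut h = fnv1a64(image.as_bytes(), 0xcbf2_9ce4_8422_2325);
    for (path, body) in sorted {
        // A zero byte separates fields so ("ab", "c") and ("a", "bc") differ.
        h = fnv1a64(path.as_bytes(), h);
        h = fnv1a64(&[0], h);
        h = fnv1a64(body.as_bytes(), h);
        h = fnv1a64(&[0], h);
    }
    format!("{}+{:016x}", base, h)
}
```

This only detects a rebuilt image if the image reference itself changes (for example a digest-pinned reference); a mutable tag such as `latest` would hash the same, which is where a `force_upgrade` flag would still be needed.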

Dockerfile glibc pin: builder pinned to `rust:1.94-slim-bookworm`.
Unsuffixed `rust:slim` follows Debian's latest stable (trixie =
glibc 2.41), so binaries built there fail to start on the
`debian:bookworm-slim` runtime (glibc 2.36) with "GLIBC_2.39 not
found". Surfaced when running the new scripts end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch example_fleet_server_install from a manual `create_interpret().
execute()` + `println!` to `harmony_cli::run`, which wires up the
framework's standard logger + reporter — emoji-tagged per-Score
progress lines and an end-of-run summary listing each Score's
`Outcome.details`. Mirrors the okd_add_node example's pattern.

For events to fire on the inner Scores, FleetServerScore now calls
`Score::interpret` (not `create_interpret().execute`) on
NatsBasicScore + FleetOperatorScore. Same change inside
FleetOperatorScore for its inner HelmChartScore.

Outcome.details populated:
- FleetOperatorScore: image, namespace, release_name, NATS URL.
- FleetServerScore: in-cluster NATS URL, kubectl pointer to the
  operator deployment, kubectl tip for verifying CRDs.

Progress logs added inside FleetOperatorScore between the chart-
render and helm-install phases (`info!`).

FleetOperatorScore fields are now `pub` so callers can read them
post-construction (FleetServerScore needs `operator.namespace` for
its summary). Builder methods unchanged; both styles coexist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When `--score-only` is set, skips cluster create + operator image build
+ k3d sideload: it just refreshes the kubeconfig and runs the Score
against the already-bootstrapped cluster. Shaves the slow rebuild +
sideload off the dev loop when iterating on Score-side code with the
operator binary unchanged.

Errors out cleanly if --score-only is passed but the cluster is
missing (instead of letting cargo trip on a missing kube context).
Unknown flags also fail-fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FleetServerScore gains `pub identity: Option<ZitadelScore>` and a
conditional `.interpret()` call after the operator install. Trait
bounds widen from `Topology + HelmCommand` to
`Topology + HelmCommand + K8sclient + PostgreSQL` to satisfy the
ZitadelScore impl — both inner Scores need the wider topology even
when identity is None (Rust trait bounds are static).

Example crate consequences:
- Switched topology from K8sBareTopology to K8sAnywhereTopology
  (provides PostgreSQL via CNPG). `ensure_ready` now installs
  cert-manager as a side effect — Zitadel's prod ingress needs it
  anyway, and it's harmless on k3d.
- New CLI flags: --zitadel-host (Option<String>; omitted = no Zitadel),
  --zitadel-version, --zitadel-insecure. Dev-friendly defaults: hosts
  ending in .localhost / .test default to external_secure=false.
- Outcome details now include the Zitadel URL when identity is set.
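
The dev-friendly default for `external_secure` can be sketched as a pure function. The function name is hypothetical, and treating a trailing dot and mixed case as equivalent is an assumption beyond what the commit states.

```rust
/// Hypothetical helper: hosts under `.localhost` or `.test` default to
/// plain HTTP; everything else defaults to TLS.
fn default_external_secure(host: &str) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    !(host.ends_with(".localhost") || host.ends_with(".test"))
}
```

An explicit flag such as `--zitadel-insecure` would still take precedence over this default.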

Auxiliary:
- Added env.sh next to the example, mirroring okd_add_node's pattern
  (KUBECONFIG / RUST_LOG / sqlite secret store paths, with optional
  ZITADEL_HOST documented).
- run_server_install.sh now reads ZITADEL_HOST / ZITADEL_VERSION env
  and passes them through. Trailing banner conditionally prints the
  Zitadel `helm uninstall` command alongside the operator one.

Out of scope: load-test.sh drives the same example crate and may
need a topology audit after this change. Flagged for follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the polarity of the Zitadel knobs in run_server_install.sh: the
Score is now installed on every run, and `NO_ZITADEL=1` is the
explicit skip. Defaults: ZITADEL_HOST=zitadel.localhost (HTTP ingress
auto-selected by the example crate's `.localhost` rule). ZITADEL_VERSION
stays optional (empty = inherit the example's clap default).

Updates env.sh to document the new polarity (NO_ZITADEL as the opt-out,
ZITADEL_HOST/VERSION as overrides on top of the defaults).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_server_install.sh now unconditionally sources
examples/fleet_server_install/env.sh after computing REPO_ROOT, so
the example's env knobs (KUBECONFIG, RUST_LOG, NO_ZITADEL,
ZITADEL_HOST, …) are picked up without the user having to source
manually before invoking the script. The script's `${VAR:-default}`
block only fills in values env.sh leaves unset.

env.sh keeps a (commented-out) KUBECONFIG hint and the new optional
Zitadel knobs documented post-source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat: add deploy-apache.sh example script
All checks were successful
Run Check Script / check (pull_request) Successful in 2m52s
e5caaba1e4
Merge branch 'feat/iot-walking-skeleton' into feat/deploy_fleet_server_side
Some checks failed
Run Check Script / check (pull_request) Failing after 59s
22eed9b533
Merge branch 'feat/deploy_fleet_server_side' into feat/iot-walking-skeleton
Some checks failed
Run Check Script / check (pull_request) Failing after 58s
69b74d572e
johnride added 19 commits 2026-05-06 03:08:36 +00:00
Extends NatsK8sScore additively (every new field optional, defaults
preserve supercluster shape):

  pub gateway: Option<GatewayConfig>          // None = single-instance
  pub auth_callout: Option<AuthCalloutCfg>    // delegate auth to callout
  pub websocket: Option<WebSocketRouteCfg>    // public WS Route + edge TLS

Render-side:
  * `gateway = None` → cluster.enabled=false, replicas=1, gateway
    block disabled, no `tlsCA`, no service.ports.gateway
  * `auth_callout = Some` → emits authorization.auth_callout block
    (using harmony's existing render_auth_callout_block convention)
    + accounts.<account>.users for the bypass user the callout
    connects as + accounts.SYS + system_account: SYS. Drops the
    legacy testUser + default_permissions — the callout is the
    sole authority.
  * `websocket = Some` → enables config.websocket.enabled with
    no_tls (the Route owns TLS termination).

Routes:
  * `gateway` Route stays gated to gateway.is_some(). passthrough on
    7222, host = cluster.dns_name. Preserves supercluster behavior.
  * `websocket` Route is new. Edge-TLS termination on port 8080
    (chart's WS listener), Redirect insecure-edge policy, host from
    WebSocketRouteCfg. cert-manager.io/cluster-issuer annotation
    drives the Route certificate.

OKDRouteScore gains an `annotations: BTreeMap<String, String>` field
(default empty) + `with_annotation()` builder so callers can attach
the cert-manager annotation without reaching for K8sResourceScore
manually.

Side-effect: `harmony` lib's default features now include `podman`.
The CRD types in `modules::fleet::operator::crd` embed
`ReconcileScore` from `modules::podman` unconditionally — without
the feature on by default, harmony's lib-only builds fail. Existing
explicit `features = ["podman"]` callers are unaffected.

K8sAnywhereTopology's `Nats::deploy` impl populates the new fields
with `gateway = Some(default)` so the capability path keeps the
supercluster behavior it had before this commit.
Replaces the volume-mounted Secret (`/etc/callout/{issuer-nkey-seed,
nats-auth-pass}`) with `valueFrom.secretKeyRef` env vars
(`ISSUER_NKEY_SEED`, `NATS_AUTH_PASS`). The callout binary's
`read_secret` helper already supports both `<NAME>_FILE` and
`<NAME>` — it just falls through to env when the `_FILE` variant
is absent.
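
The `<NAME>_FILE` then `<NAME>` fallback can be sketched as below. This is not the callout's `read_secret`: the signature is hypothetical, and the environment and file reads are injected as closures so the logic stays testable without touching the process environment. Trimming the trailing newline from file contents is also an assumption.

```rust
use std::io;

/// Resolve a secret named `name`: prefer the file named by `<NAME>_FILE`,
/// otherwise fall back to the `<NAME>` variable itself.
fn read_secret(
    name: &str,
    env: &dyn Fn(&str) -> Option<String>,
    read_file: &dyn Fn(&str) -> io::Result<String>,
) -> io::Result<String> {
    if let Some(path) = env(&format!("{}_FILE", name)) {
        // Files written by hand usually end in a newline; drop it.
        return read_file(&path).map(|s| s.trim_end().to_string());
    }
    env(name).ok_or_else(|| {
        io::Error::new(io::ErrorKind::NotFound, format!("{} is not set", name))
    })
}
```

In a real binary the two closures would wrap `std::env::var(..).ok()` and `std::fs::read_to_string(..)`.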

Also drops the pod-level `securityContext` block that pinned
`runAsUser: 65532, runAsGroup: 65532, fsGroup: 65532`. OKD's
restricted-v2 SCC rejects pods that pin UID/GID outside the
namespace's allocated range; the SCC will assign appropriate
values from that range when the fields are unset. Container-level
hardening (runAsNonRoot, no-privilege-escalation, RO root fs,
capabilities drop ALL) stays intact.

Tests rewritten to assert the new shape: env vars come from Secret
key refs, no volumes, no pinned UID/GID/fsGroup. 7 callout tests
green.
Operator-side: drops the Secret-as-volume mount entirely. The
operator pod consumes the entire `[credentials]` TOML block —
including the Zitadel JSON keyfile — through one
`valueFrom.secretKeyRef` env var
(`FLEET_OPERATOR_CREDENTIALS_TOML`). No volume, no mount, no
fsGroup, no `0o444` workaround. OKD restricted-v2 SCC compatible.

`OperatorCredentials` collapses to a single field:
  pub credentials_toml: String   // JSON keyfile inlined under key_json

`SECRET_KEY_ZITADEL_KEYFILE` and `KEYFILE_VOLUME_NAME` constants
removed — no longer used.

harmony-fleet-auth: `CredentialsSection::ZitadelJwt` gains
`key_json: Option<String>`. The factory prefers `key_json` when
non-empty, falls back to `key_path` otherwise. Agent (file-based,
`key_path` populated) keeps working unchanged. Operator (env-only,
`key_json` populated) skips the file read entirely. Tests cover
both shapes plus the default-key_path path.

Internal refactor: `load_machine_key` now delegates to
`parse_machine_key(&str)`, shared with the inline path.
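
The preference order between `key_json` and `key_path` can be sketched as a small selection function. `KeySource` and `select_key_source` are hypothetical names, and treating a whitespace-only `key_json` as empty is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Where the machine key JSON should be loaded from.
#[derive(Debug, PartialEq)]
enum KeySource {
    Inline(String),
    File(PathBuf),
}

/// Prefer inline `key_json` when present and non-empty; otherwise fall back
/// to `key_path`.
fn select_key_source(key_json: Option<&str>, key_path: &Path) -> KeySource {
    match key_json {
        Some(json) if !json.trim().is_empty() => KeySource::Inline(json.to_string()),
        _ => KeySource::File(key_path.to_path_buf()),
    }
}
```

Both branches can then feed the same parser, which is the point of routing the file path and the inline path through one `&str`-taking function.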

fleet_e2e_demo bring-up rewires the credentials TOML it renders
to embed the JSON keyfile via `key_json = """..."""` instead of
`key_path = "..."`. The `OPERATOR_KEY_MOUNT_PATH` constant is gone
along with the now-unused mount logic. 7 callout tests + 19
fleet-auth tests still green.
`FleetServerScore` now composes:

  * `nats: NatsK8sScore` — replaces NatsBasicScore. Same Score that
    knows about OKD Routes, the auth_callout block in NATS Helm
    values, and the WS edge-TLS wiring. The NatsBasicScore-using
    `fleet_server_install` example registers the simple inner
    Scores directly (no FleetServerScore wrapper) — keeps the basic
    k3d-style install working without forcing it through the
    K8s-flavor Score.

  * `identity_setup: Option<ZitadelSetupScore>` — runs after the
    Zitadel helm install. Provisions project + roles + machine
    users via Zitadel's management API. The keys it produces are
    what the operator authenticates with.

  * `auth_callout: Option<NatsAuthCalloutScore>` — deploys the
    callout pod. Pair with `nats.auth_callout = Some(...)` so the
    rendered NATS values delegate to the same issuer pubkey.

Execute order:

  identity (helm) → identity_setup (API) → nats (with auth_callout
  block in values) → auth_callout (pod) → operator

The operator goes last so it doesn't burn reconnect attempts while
the rest comes up; its `connect_with_retry` covers any small
remaining race.
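
The execute order with optional stages and fail-fast behavior can be sketched generically. This is an illustration, not FleetServerScore's code: the plan is modeled as `(name, enabled)` pairs and the per-step work as a single closure, both hypothetical.

```rust
/// Apply each enabled step in order; stop at the first failure and report
/// which step failed. Returns the names of the steps that were applied.
fn run_in_order<F>(plan: &[(&str, bool)], mut apply: F) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut applied = Vec::new();
    for &(name, enabled) in plan {
        if !enabled {
            continue; // optional stage left unset
        }
        apply(name).map_err(|e| format!("{}: {}", name, e))?;
        applied.push(name.to_string());
    }
    Ok(applied)
}
```

Because of the `?`, a failure in an early stage means later stages, including the operator, are never attempted.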

Trait bounds widen to include `Nats + TlsRouter` (for NatsK8sScore's
Route + capability path).

Post-install summary lines added: NATS WS public URL when set,
and a kubectl pointer to the callout deployment.
ZitadelScore gains two fields, both with defaults that preserve
the previous hardcoded behavior:

  pub namespace: String        // default "zitadel"
  pub cluster_issuer: String   // default "letsencrypt-prod"

The hardcoded `NAMESPACE` const becomes `pub const DEFAULT_NAMESPACE`
and the YAML's `cert-manager.io/cluster-issuer` annotation now
substitutes `{cluster_issuer}` from the field. Existing struct-literal
ZitadelScore call sites (5 examples) updated to fall through to
`..Default::default()` so older callers compile unchanged.

New example: `examples/fleet_staging_install`. One-shot install of
the fleet stack on OKD-shaped clusters, composing in order:

  1. ZitadelScore (helm) into `--zitadel-namespace`
  2. ZitadelSetupScore (project + roles + fleet-ops + fleet-operator
     machine users)
  3. NatsK8sScore: single-instance + auth_callout + WS Route
  4. NatsAuthCalloutScore: env-var-only Secret config
  5. FleetOperatorScore: credentials TOML inlining the operator's
     JSON keyfile via key_json (no volume mounts)

Public hostnames derive from one CLI flag: `--base-domain`. The
demo uses `cb1.nationtech.io` → sso-staging.cb1.nationtech.io and
nats-fleet-staging.cb1.nationtech.io. cert-manager `--cluster-issuer`
defaults to `letsencrypt-prod`. Image refs (`--operator-image`,
`--callout-image`) are required (private registry, no sensible
default).

Generates the issuer NKey + auth pass at install time; the callout
consumes them from its Secret via `valueFrom.secretKeyRef` env vars.
One TOML file end-to-end: the only Secret the operator pod consumes
is the single-key credentials TOML, injected as an env var with no
volumes.

Idempotency note: re-running ZitadelSetupScore with the same project
name short-circuits via the cached client-config. Re-runs of NATS /
operator / callout are idempotent at the Helm/K8sResourceScore level.
One-shot script to build + push the operator and auth-callout
container images. Pre-builds the callout binary on the host (its
Dockerfile expects target/release/harmony-nats-callout to exist —
matches the local-k3d iteration convention). Operator image is
self-contained multi-stage.

Defaults: REGISTRY=hub.nationtech.io/harmony, IMAGE_TAG=dev, PUSH=1.
Override via env. Built refs are echoed at the end as the exact
flags to paste into fleet_staging_install.
Walks through: build+push images, namespace creation, KUBECONFIG
sanity, fleet_staging_install run, layer-by-layer verification
(Zitadel cert + URL, NATS pod + callout subscribe, operator auth +
controller, public WSS reachable, CRDs registered), per-device
machine user creation in Zitadel UI, agent config TOML render +
launch, end-to-end Deployment CRD walk, common failure modes with
diagnostic commands, teardown.

Cross-linked from the existing FAQ + manual-mint-recipe guides.
The Zitadel helm chart's JSON schema validates each securityContext
block against integer types for runAsUser/fsGroup. Setting either
to `null` in values.yaml triggers:

  Error: values don't meet the specifications of the schema(s):
  zitadel:
  - at '/login/podSecurityContext/runAsUser': got null, want integer

The intent of the original `null`s was "let OpenShift's
restricted-v2 SCC assign UID/GID", but the chart's schema rejects
`null` for these integer-typed fields. The right way to leave the
fields unset is to omit them from the values block entirely; with no
key, the chart's default (also null/unset) applies and the SCC takes
over at admission time.

Strips 14 occurrences of `runAsUser: null` / `fsGroup: null` across
the main pod, init job, setup job, and login pod security contexts.
runAsNonRoot/seccompProfile/capabilities-drop stay — those are
fields the chart accepts.
The build context for `podman build` was the workspace root —
fine for cargo's path-deps, but `COPY . .` shipped 147 GB to the
build daemon (target/, .claude/worktrees, .git, demos, network
test data, manual_mint scratch). Tightens the .dockerignore to
exclude the heavy items, dropping the context to ~180 MB.

The callout Dockerfile was also single-stage with a host pre-built
binary (`COPY target/release/harmony-nats-callout`), which conflicts
with the new strict .dockerignore (target/ is now excluded). Rewrote
to mirror the operator's multi-stage cargo-in-Docker shape — same
builder + runtime images, same USER 65532 convention.

Build script consequences:
* No more host-side `cargo build --release -p harmony-nats-callout`
  step. Both images now build self-contained from the workspace
  context.
* Two podman build invocations (operator + callout), then push.

The k3d e2e harness (`fleet_auth_callout::build_and_load_callout_image`)
was relying on the old single-stage Dockerfile via tempdir staging;
it now writes its own minimal single-stage Dockerfile inline so the
fast local-iteration path is unaffected by the production-shape
change in `nats/callout/Dockerfile`.

Also includes `topology.ensure_ready()` in fleet_staging_install
(needed for cert-manager bootstrap on first apply).

Verified: `podman build` for the callout completes successfully;
operator build is the same shape and was mid-compile in testing.
The chart's defaults pin runAsUser=1000 / fsGroup=1000 in the
chart-wide podSecurityContext + securityContext blocks. On
OpenShift, restricted-v2 SCC rejects pods that pin a UID outside
the namespace's allocated `openshift.io/sa.scc.uid-range` range
(typically `1000700000/10000`).

Previous attempts:
- `runAsUser: null` in our overrides → schema rejects (`type: integer`)
- omit our overrides → chart defaults apply → SCC rejects 1000

Right answer: read the namespace's `openshift.io/sa.scc.uid-range`
annotation at install time, parse the start UID, inject it as
`runAsUser` + `fsGroup` into every securityContext block we emit.
Schema is happy (integer), SCC is happy (UID is in range).

Wired into the OpenshiftFamily branch of the values renderer:
chart-wide pod + container securityContext, initJob, setupJob,
and login (per-component override that the chart's helpers prefer
over chart-wide). K3s / vanilla K8s gets `1000` literal — chart
default, no SCC to worry about.
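
Parsing the start UID out of the annotation value can be sketched as below, assuming the `<start>/<size>` shape shown above (`1000700000/10000`). The function name is hypothetical, and reading the annotation from the namespace is left out.

```rust
/// Parse the start UID out of an annotation value shaped `<start>/<size>`.
fn parse_uid_range_start(annotation: &str) -> Option<u64> {
    let (start, size) = annotation.trim().split_once('/')?;
    // Validate the size as well, so malformed values are rejected as a whole.
    size.parse::<u64>().ok()?;
    start.parse::<u64>().ok()
}
```

Returning `None` on anything malformed lets the caller fail the install with a clear error instead of injecting a bogus UID.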

Caveat: the namespace must exist before this Score runs, since the
annotation is read from it (caller's job; the staging install doc
already covers this).
The chart's OpenShift-flavored values previously omitted
`ExternalPort` from the configmapConfig. Zitadel falls back to its
internal listen port (8080), which then leaks into every
externally-emitted URL — most visibly the management console URL
and the OIDC issuer claim:

  Management Console URL: https://sso-staging.cb1.nationtech.io:8080/ui/console
  iss in tokens:          https://sso-staging.cb1.nationtech.io:8080

But clients reach Zitadel through the OKD edge-TLS Route on 443.
The mismatch surfaces as JWT-bearer 500s (`Errors.Internal`) and
broken OIDC discovery for any client that compares the issuer to
the URL it actually used.

Fix: resolve `ExternalPort` defensively. When the caller passes
`external_port: Some(p)`, honor it. When `None`, default to 443
for `external_secure: true` and 80 otherwise — matching the
public port the OKD Route serves on.
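
The defensive resolution rule fits in one function. A minimal sketch, with a hypothetical function name; the parameter names follow the fields mentioned above.

```rust
/// Honor an explicit port; otherwise pick the public default for the scheme.
fn resolve_external_port(external_port: Option<u16>, external_secure: bool) -> u16 {
    external_port.unwrap_or(if external_secure { 443 } else { 80 })
}
```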

The K3s/local branch already supported `external_port` overrides
via a separate code path (k3d port mappings); behavior unchanged
there.
ZitadelSetupScore was hardcoded to look for the `iam-admin-pat`
secret in `zitadel`. After ZitadelScore gained a configurable
namespace (so it can deploy into `zitadel-staging`), the setup
score continued reading from the wrong place and failed:

  Secret 'iam-admin-pat' not found in namespace 'zitadel' —
  ensure ZitadelScore Helm values configure FirstInstance.Org.Machine.Pat

Adds `pub namespace: String` to ZitadelSetupScore (default
"zitadel" via serde for backward compatibility). The 5 example
call sites get explicit `namespace:` fields — fleet_staging_install
threads `cli.zitadel_namespace` through, the rest hardcode the
legacy value to keep their behavior unchanged.

The `read_admin_pat` helper now uses `self.score.namespace`
instead of the const, and the error message points at the
mismatch between ZitadelScore.namespace and ZitadelSetupScore.namespace
as the most likely cause.
`PodmanService.env: Vec<(String, String)>` made schemars emit
`items: [{type: string}, {type: string}]` (OpenAPI tuple validation),
which k8s apiextensions rejects with "Forbidden: items must be a
schema object and not an array" — install of the operator's
`deployments.fleet.nationtech.io` CRD blew up at the Helm step.

Introduces `EnvVar { name, value }` in `domain::topology` (with
`From<(String,String)>` for ergonomics) and switches both
`PodmanService.env` and `ContainerSpec.env` to `Vec<EnvVar>`. schemars
now produces `items: { type: object, properties: { name, value } }`
which validates cleanly.

Adds `env_schema_is_object_not_tuple_for_crd_validation` to lock the
schema shape — if anyone reverts to a tuple the test fails before the
operator install does.
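
A rough sketch of the struct-plus-conversion shape, kept to std only. The real type presumably also derives the serde and schemars traits, which are omitted here; `env_from_pairs` is a hypothetical helper added for illustration:

```rust
/// Named environment variable. A struct (rather than a tuple) gives schema
/// generators a single object shape for list items.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EnvVar {
    pub name: String,
    pub value: String,
}

/// Keeps tuple-based call sites ergonomic.
impl From<(String, String)> for EnvVar {
    fn from((name, value): (String, String)) -> Self {
        EnvVar { name, value }
    }
}

/// Hypothetical convenience: convert a legacy tuple list in one call.
pub fn env_from_pairs(pairs: Vec<(String, String)>) -> Vec<EnvVar> {
    pairs.into_iter().map(EnvVar::from).collect()
}
```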
A grab-bag of fixes the OKD staging install surfaced. Each landed as a
diagnosable failure during real deploys:

* URL parametrization. ZitadelSetupScore was hardcoded to
  `http://127.0.0.1:{port}` with a `Host:` header — fine for k3d
  port-forward, broken everywhere else. Adds `scheme: ZitadelScheme`
  (Http/Https), `port: Option<u16>` (None → scheme default), and
  `endpoint: Option<String>` for the rare port-forward case. The
  `Host:` header is now only injected when `endpoint` is set.

* HTTP readiness gate. Helm reports SUCCESS when pods are Ready but
  on OKD the Route + cert-manager Certificate reconcile asynchronously
  — the first management call after install was dying with
  `CaUsedAsEndEntity` (rustls rejecting OKD's bootstrap CA cert
  served while cert-manager was still issuing). Score now polls
  `/debug/ready` with retry; treats connect / TLS errors as transient.

* Admin password persistence. ZitadelScore was generating a fresh
  random password on every run, then printing it in the success
  banner — but Zitadel's chart only honors FirstInstance.* on the
  first install, so the printed password didn't match what was live
  in the DB. Now persisted via harmony_secret (LocalFile by default).

* Login banner shows full SSO loginName. Default Zitadel org name is
  ZITADEL → org primary domain is `zitadel.<ExternalDomain>` → admin
  preferredLoginName is `admin@zitadel.<host>`. Print the full
  string so the operator pastes the right value.

* Shared TLS Secret across Zitadel + login Ingresses. Two
  cert-manager-annotated Ingresses on the same host create two
  Certificates → two ACME Orders → competing HTTP01 challenges; the
  loser's Secret never lands and on OKD the second Ingress's Route
  is silently never admitted because the controller inlines TLS
  material into the Route at creation time. Login Ingress now
  references `zitadel-tls` (same as main) and drops its
  cert-manager.io annotation. Documented in
  docs/guides/kubernetes-ingress.md as the canonical pattern with the
  diagnostic signature so this doesn't get rediscovered.

* fleet_staging_deploy namespaces. The OLDER staging deploy example
  hardcoded `fleet-system` / `zitadel`; renamed to `fleet-staging` /
  `zitadel-staging` to match `fleet_staging_install`'s convention.

Five example call sites updated for the new ZitadelSetupScore shape;
fleet_e2e_demo / fleet_auth_callout / harmony_sso pass the k3d
port-forward as `endpoint: Some("http://127.0.0.1:8080")`, the
staging examples take the defaults (direct https on 443).

Tests: 8 new unit tests in setup.rs lock the URL builder, Host-header
conditional, scheme serde, and minimal-fields deserialization. One
new test in setup_score covers render_toml.
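
The URL parametrization and Host-header conditional can be sketched roughly as follows. `ZitadelTarget`, its `host` field, and the method names are assumptions for illustration; only `scheme`, `port`, and `endpoint` come from the description above:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZitadelScheme {
    Http,
    Https,
}

impl ZitadelScheme {
    fn as_str(self) -> &'static str {
        match self {
            ZitadelScheme::Http => "http",
            ZitadelScheme::Https => "https",
        }
    }

    fn default_port(self) -> u16 {
        match self {
            ZitadelScheme::Http => 80,
            ZitadelScheme::Https => 443,
        }
    }
}

/// Hypothetical connection description (the `host` field is an assumption).
pub struct ZitadelTarget {
    pub host: String,
    pub scheme: ZitadelScheme,
    pub port: Option<u16>,
    pub endpoint: Option<String>,
}

impl ZitadelTarget {
    /// URL that requests are actually sent to.
    pub fn base_url(&self) -> String {
        if let Some(endpoint) = &self.endpoint {
            // Port-forward case: talk to the tunnel directly.
            return endpoint.trim_end_matches('/').to_string();
        }
        let port = self.port.unwrap_or(self.scheme.default_port());
        format!("{}://{}:{}", self.scheme.as_str(), self.host, port)
    }

    /// `Host:` override, present only when tunnelling through `endpoint`.
    pub fn host_header(&self) -> Option<&str> {
        self.endpoint.as_ref().map(|_| self.host.as_str())
    }
}
```

In this sketch the staging examples leave `endpoint` as `None` and get a direct https URL on 443 with no `Host:` override, while the k3d examples set `endpoint` and keep the override.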
`FleetDeviceSetupScore` gains `FleetDeviceAuth::ZitadelEnroll` —
resolves the device's Zitadel machine user + JSON key inline, then
falls through to the existing keyfile-drop flow exactly as if a
pre-resolved `ZitadelJwt` had been passed.

Two operator workflows fall out of this:

* Dev-on-device — developer runs the score on a Pi with display
  attached, browser opens locally to Zitadel SSO, dev signs in with
  their personal account (must hold IAM_OWNER or equivalent), score
  mints credentials for that one device and brings up the agent.
* Production-via-SSH — operator runs from a workstation, targets
  each device over SSH. Browser opens once on the workstation; the
  resulting access token is in-memory only for v0 (per-batch token
  caching tracked in
  ROADMAP/fleet_platform/device_enrollment_token_caching.md).

Implementation:

* `harmony/src/modules/zitadel/admin_auth.rs` — RFC 8628 device-code
  flow against Zitadel. Tries `webbrowser::open`, falls back to
  printing the URL (SSH sessions just see the URL). Minimum scope
  set is `openid urn:zitadel:iam:org:project:id:zitadel:aud`:
  enough to call `/management/v1/*`, nothing more.
* `harmony/src/modules/zitadel/setup.rs` — `mint_device_credentials`
  helper that reuses the existing find-or-create methods (project,
  machine user, user grant) plus `create_machine_key`. Idempotent on
  user + grant; always mints a new key because Zitadel does not
  return existing key material.
* `harmony/src/modules/fleet/setup_score.rs` — new `ZitadelEnroll`
  variant + `AdminAuth::{Sso, Token}`. Resolution runs at the top
  of execute(); the rest of the score sees a single shape.
  render_toml's match collapses both Zitadel variants into one arm
  (they share the issuer/audience/danger fields).
* `harmony/src/modules/fleet/assets.rs` — Debian bookworm arm64
  generic-cloud image fetcher. This is the same Debian base
  Raspberry Pi OS is built on; Pi OS itself is locked to Pi
  hardware (Broadcom firmware) and won't boot in generic KVM.
  No sha pin (Debian's `latest/` URL rotates per point release);
  swap to a dated subdir if you need cryptographic provenance.
* `examples/fleet_device_enroll/` — single CLI covering both
  workflows + a `--launch-pi-vm` switch that boots a Pi-equivalent
  VM with one command and prints the SSH details + suggested
  follow-up enrollment command. README walks the three flows.

Tests: `render_toml_zitadel_enroll_renders_same_as_zitadel_jwt`
locks the byte-equivalence between the unresolved (Enroll) and
resolved (Jwt) variants — the invariant `execute()` relies on so
TOML rendering is independent of when admin auth resolves.

Adds `webbrowser` as a regular dependency on `harmony` (small,
no feature gate).
Symptom: `--launch-pi-vm` boots a Debian bookworm arm64 VM, SSH comes
up, but the configured `fleet-admin` user doesn't exist and key auth
fails. The seed ISO is well-formed (CIDATA volume label, valid
user-data, valid meta-data), but cloud-init never finds it.

Root cause: Debian's `linux-image-cloud-arm64` kernel — and other
slimmed cloud-image kernels — ship WITHOUT `ahci.ko`, because real
clouds don't expose SATA. The SATA cdrom we attach is invisible to
the guest:

* `dmesg` has zero ata/ahci/scsi/sr0 lines (confirmed by inspecting
  the post-boot overlay's journald).
* `blkid -tLABEL=CIDATA` returns nothing.
* cloud-init's NoCloud datasource gives up, falls through to
  `DataSourceNone`, applies no user-data, the user the score wanted
  to create never gets created.

  Final cloud-init log line:
  `Cloud-init v. 22.4.2 finished at … Datasource DataSourceNone`
  `cc_final_message.py[WARNING]: Used fallback datasource`

Fix: attach the seed as `device='disk'` `bus='virtio'` with
`<readonly/>`. virtio-blk is the universal cloud-image baseline —
every cloud kernel includes the driver — and cloud-init's NoCloud
datasource finds the seed via the volume label regardless of device
type. The `cdrom`/`CdromConfig` naming on the public API is kept
(callers mentally model the seed as removable media), but the wire
shape is now virtio-blk on every arch. Device name moves from `hdb`
to `vdb` accordingly.

Tests: `domain_xml_cdrom_device_uses_virtio_blk_readonly` pins the
new shape and explicitly asserts that the SATA / IDE-cdrom shape
does NOT come back — that's the regression this test exists to
prevent.
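
For orientation, a sketch of what the new wire shape looks like when rendered as libvirt domain XML. The function `seed_disk_xml` is hypothetical and does no XML escaping (it assumes the path contains no quotes); the element layout follows the standard libvirt `<disk>` format:

```rust
/// Render a libvirt <disk> element for a cloud-init seed image, attached
/// as a read-only virtio-blk disk instead of a SATA cdrom.
/// Hypothetical helper; assumes `seed_path` needs no XML escaping.
pub fn seed_disk_xml(seed_path: &str, target_dev: &str) -> String {
    [
        "<disk type='file' device='disk'>".to_string(),
        // The seed ISO is a plain image file, so the raw driver applies.
        "  <driver name='qemu' type='raw'/>".to_string(),
        format!("  <source file='{seed_path}'/>"),
        format!("  <target dev='{target_dev}' bus='virtio'/>"),
        "  <readonly/>".to_string(),
        "</disk>".to_string(),
    ]
    .join("\n")
}
```

A regression check in the spirit of the test above would assert both that `bus='virtio'` is present and that `device='cdrom'` and `bus='sata'` are absent.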
`harmony`'s `kvm` feature pulls in `libvirt`, which doesn't link on
aarch64-unknown-linux-gnu (no aarch64 `libvirt-dev` package on most
distros). The device-side workflow needs a binary that runs ON the
Pi and only does enrollment — no VM-rehearsal — but the example was
unconditionally enabling `kvm`, so the cross-compile failed at link
time with `undefined reference to virStoragePoolFree` etc.

Fixes by gating the rehearsal bits behind a new `vm-rehearsal`
Cargo feature (default-on for workstation builds, opt-out via
`--no-default-features` for device builds):

* `Cargo.toml`: harmony dep is now `default-features = false,
  features = ["podman"]` (podman is needed unconditionally — the
  operator CRD types depend on it). New `vm-rehearsal` feature
  enables `harmony/kvm` on demand.
* `main.rs`: every libvirt-touching import, CLI flag
  (`--launch-pi-vm`, `--vm-rehearsal`, `--vm-*`), CLI branch, and
  helper function (`boot_*_vm`, `RehearsalImage`) is now
  `#[cfg(feature = "vm-rehearsal")]`. With the feature off, none
  of it is referenced and nothing tries to link libvirt.
* README: documents both build flavors with copy-paste commands.

Workstation build (unchanged):
  cargo build --release -p example_fleet_device_enroll

Device-side build (the new path):
  cargo build --release --target aarch64-unknown-linux-gnu \
      -p example_fleet_device_enroll --no-default-features
Two related issues from a real run.

(1) Image was Debian 12 bookworm — released June 2023, glibc 2.36, two
releases old by mid-2026. Bumping to Debian 13 trixie (current stable
since Aug 2025, glibc 2.41) keeps the rehearsal kernel + userland
roughly aligned with what's likely sitting on a fresh Pi imaged today.

URL pattern is unchanged (`cloud.debian.org/.../latest/`), still
no sha pin (latest/ rotates per point release; swap to a dated
subdir if cryptographic provenance matters). The `cdrom` is still
attached as virtio-blk read-only — that fix is independent and
still required (Debian's cloud-arm64 kernel ships without ahci.ko).

Renames in `harmony::modules::fleet`:
  ensure_debian_bookworm_arm64_cloud_image →
  ensure_debian_trixie_arm64_cloud_image
  DEBIAN_BOOKWORM_CLOUDIMG_ARM64_{URL,FILENAME} →
  DEBIAN_TRIXIE_CLOUDIMG_ARM64_{URL,FILENAME}

(2) The device-side `--target aarch64-unknown-linux-gnu` cross-compile
produced a binary that linked against the workstation's glibc
(2.41 on a current Arch host). Running it on the rehearsal VM
(Debian 12 / 13) blew up immediately:

  /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found

This is fundamental to the gnu target — the binary depends
dynamically on whatever glibc the host happens to have. The fix
isn't a workaround on the harmony side; it's switching the device
build to `aarch64-unknown-linux-musl`, which produces a fully-static
binary that runs on any aarch64 Linux regardless of the device's
libc generation.

README updated with the musl recipe (rustup target, cargo config
linker, optional `cross` shortcut) and the rationale for why musl
beats gnu for device-side cross-compiles. Workstation build is
unchanged.
fix(fleet): provision device-code OIDC app + require numeric client_id
Some checks failed
Run Check Script / check (pull_request) Failing after 1m6s
a1c9e33955
The SSO login from `fleet_device_enroll` was hitting Zitadel with the
app name (`harmony-cli`) as the OAuth client_id, getting back:

  400 Bad Request: invalid_client: no active client not found

Two real problems behind that error:

* `fleet_staging_install` never created the device-code OIDC app in
  the first place. Its `applications: vec![]` was empty — the only
  Zitadel resources provisioned were the API app, the project roles,
  and the machine users. The `harmony-cli` device-code app that the
  enrollment example assumed was provisioned simply did not exist.
  Adds it via `ZitadelApplication { app_type: DeviceCode }` so a
  fresh staging install yields a real OIDC app.

* `--admin-oidc-client-id` defaulted to the literal string
  `"harmony-cli"`, which is the app's *display name*, not the
  client_id. Zitadel issues numeric client_ids of the form
  `<number>@<project>` when the app is created — that's what OAuth
  endpoints want. Defaulting to the name was misleading: it produces
  no warning, just a confusing 400 from Zitadel about a "client not
  found" that the operator can't easily map back to "wrong field
  passed to the flag".

  Removes the default; the flag is now required when SSO is in use
  (skipped only with `--admin-token`). Help text and README spell
  out the distinction explicitly. The staging install now reads the
  resolved client_id from `ZitadelClientConfig::client_id(...)` and
  prints it in the success banner, alongside a copy-paste-ready
  `fleet_device_enroll` invocation.

  README also documents the post-install lookup path
  (`jq -r '.apps."harmony-cli"' ~/.local/share/harmony/zitadel/client-config.json`)
  and adds the `invalid_client` error to the troubleshooting list.
johnride added 4 commits 2026-05-06 15:22:52 +00:00
Real symptom from a staging run:

  Error: FleetDeviceSetupScore: Project 'fleet' not found in Zitadel —
  run ZitadelSetupScore first to create it

…even though the project clearly existed and was visible in the
Zitadel UI. Cause: `/management/v1/*` scopes by the caller's org. The
SSO operator's primary org is whatever org their personal account
lives in; the project was created by the system iam-admin user, in
the system org. With no `x-zitadel-orgid` override, the search runs
in the operator's org and returns empty. Project effectively
"invisible" to that token.

Three changes:

* `ZitadelSetupScore` gains `admin_org_id: Option<String>`. When set,
  every management API call sends `x-zitadel-orgid: <id>`. Plumbed
  through `request()` next to the existing conditional `Host:`
  header. Default `None`, serde-default for backward compat.

* `FleetDeviceAuth::ZitadelEnroll` gains a matching `admin_org_id`
  field, threaded through `resolve_zitadel_enroll` into the
  synthetic `ZitadelSetupScore` connection it builds for
  `mint_device_credentials`. CLI surface: `--admin-org-id` on
  `fleet_device_enroll`, with help text explaining the symptom and
  where to find the value (Zitadel UI → Organization → Resource ID).

* `find_project` now uses a `nameQuery` filter rather than scanning
  the full default-paginated list, so it doesn't depend on the
  project being on page 1. When the filter returns empty it falls
  back to an unfiltered enumeration and logs the project names that
  ARE visible to the token — that list is usually enough for the
  operator to spot an org-context mismatch in seconds. The not-found
  error in `mint_device_credentials` was rewritten to spell out the
  three real causes (org context, role, no project) instead of the
  misleading "run ZitadelSetupScore first".

All 7 existing `ZitadelSetupScore` initializer sites updated with
`admin_org_id: None`. README's troubleshooting section gets the new
failure-mode entry.
Two ergonomic fixes for the dev-on-device workflow.

(1) Ansible local connection. `LinuxHostTopology` always went through
SSH, so running `fleet_device_enroll` with `--target ssh://you@127.0.0.1`
required the operator to set up sshd loopback access on their own Pi —
clunky for a dev who's sitting in front of the device. Adds
`LinuxLocalhostTopology` that drives the same `LinuxHostConfiguration`
trait surface using ansible's `-c local` connection (no SSH at all)
plus direct `sh -c` subprocess calls for the loginctl / systemctl
--user paths.

The configurator now takes a unified `AnsibleConnection<'a>` enum
(`Ssh { host, creds }` | `Local { sudo_password }`) instead of a
`(host, creds)` pair. Internal `host_exec`/`host_sudo_exec` helpers
branch by transport and return the same `SshCommandOutput` shape
either way, so the public methods (ping, ensure_package, ensure_file,
etc.) are transport-agnostic.

`fleet_device_enroll` switches `--target` to optional: omitted →
local, present → SSH. No magic `localhost` string, no special-case
for 127.0.0.1. README + the flag's help text describe both modes.
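
A simplified sketch of the transport branching. The field types here are assumptions (the real `Ssh` variant carries `creds`, reduced to a user name below), and `connection_for` / `ansible_args` are hypothetical helpers; `-i`, `-u`, and `-c local` are standard ansible CLI options:

```rust
/// How the configurator reaches the host. Field types are simplified
/// assumptions for this sketch.
pub enum AnsibleConnection<'a> {
    Ssh { host: &'a str, user: &'a str },
    Local { sudo_password: Option<&'a str> },
}

/// `--target` omitted selects the local transport, present selects SSH.
/// The tuple is (user, host), already parsed from the target URL.
pub fn connection_for<'a>(target: Option<(&'a str, &'a str)>) -> AnsibleConnection<'a> {
    match target {
        Some((user, host)) => AnsibleConnection::Ssh { host, user },
        None => AnsibleConnection::Local { sudo_password: None },
    }
}

/// Connection-related arguments for an ansible invocation.
pub fn ansible_args(conn: &AnsibleConnection<'_>) -> Vec<String> {
    match conn {
        AnsibleConnection::Ssh { host, user } => vec![
            "-i".into(),
            // A trailing comma makes ansible treat this as an inline host list.
            format!("{host},"),
            "-u".into(),
            user.to_string(),
        ],
        AnsibleConnection::Local { .. } => vec![
            "-i".into(),
            "localhost,".into(),
            "-c".into(),
            "local".into(),
        ],
    }
}
```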

(2) Auto-install `python3-venv` on Debian. First-run venv creation
fails on stock Debian/Ubuntu with `ensurepip is not available`
because Debian splits venv into the `python3-venv` apt package.
`ensure_ansible_venv` now detects that failure, checks for
`/etc/debian_version`, runs `sudo apt-get update && sudo apt-get
install -y python3-venv`, and retries. Idempotent on re-runs (apt
is a noop when already installed). On non-Debian or genuinely
broken environments, the operator gets a clear error pointing at
the right install command per distro family. Sudo prompts for a
password if not configured passwordless — that's fine, the operator
expects it.
`loginctl enable-linger` returns to the caller before logind has
actually finished bringing up `user@<uid>.service`. The next step in
`FleetDeviceSetupScore` (Step 4/7 — activating user-scoped
podman.socket) calls `systemctl --user` against the just-lingered
user, which fails with:

  Failed to connect to user scope bus via local transport:
  No such file or directory

…because `/run/user/<uid>/bus` doesn't exist yet. The user manager
is on its way up but the score has already moved on. Reproducible
on a fresh dev-on-device run.

Adds a `wait_for_user_bus` helper that polls `/run/user/<uid>/bus`
for up to 5s after `enable-linger`. We've never seen the wait take
more than a fraction of a second in practice; 5s is a generous
ceiling that gives a clear error pointing at the right diagnostic
commands (`journalctl -u user@<uid>.service`, `loginctl user-status`)
if logind is genuinely stuck.
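
A blocking, std-only sketch of the polling loop. The real helper may well be async and run the check on the remote host; `wait_for_path` and the 100ms poll interval are assumptions for illustration:

```rust
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll until `path` exists or `timeout` elapses. Returns true if it appeared.
pub fn wait_for_path(path: &Path, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if path.exists() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

/// Wait for the user manager's D-Bus socket after `loginctl enable-linger`.
pub fn wait_for_user_bus(uid: u32) -> Result<(), String> {
    let bus = format!("/run/user/{uid}/bus");
    let seen = wait_for_path(
        Path::new(&bus),
        Duration::from_secs(5),
        Duration::from_millis(100), // assumed poll interval
    );
    if seen {
        Ok(())
    } else {
        Err(format!(
            "{bus} did not appear within 5s; check `journalctl -u user@{uid}.service` \
             and `loginctl user-status`"
        ))
    }
}
```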
chore(fleet-agent): default tracing filter to info
Some checks failed
Run Check Script / check (pull_request) Failing after 1m39s
9baae65171
`EnvFilter::from_default_env()` falls back to an ERROR-only filter when
`RUST_LOG` isn't set, which hides every info-level log line. The systemd
unit installed by `FleetDeviceSetupScore` does pass
`RUST_LOG=info`, but a hand-launched binary, an overridden unit, or
any other invocation path produced a silent agent — including the
dev-on-device run the user just hit.

Switches to `try_from_default_env().unwrap_or_else(|_|
EnvFilter::new("info"))` so:

* RUST_LOG unset → info-level by default (what the operator wants
  the moment they look for logs).
* RUST_LOG set → respected as before (`RUST_LOG=debug` for
  troubleshooting, `RUST_LOG=warn` if it's too chatty, etc.).

The systemd unit's existing `Environment=RUST_LOG=info` line is
left in place — explicit + harmless, and lets a customer toggle
the unit's verbosity without rebuilding the binary.
johnride added 1 commit 2026-05-06 15:58:04 +00:00
fix(fleet-operator): apply credentials Secret before helm install
Some checks failed
Run Check Script / check (pull_request) Failing after 57s
71312d27ba
The operator chart's Deployment references
`harmony-fleet-operator-secrets` via `envFrom`/`secretKeyRef` for the
`FLEET_OPERATOR_CREDENTIALS_TOML` env var, but the Secret is
intentionally NOT bundled in the on-disk helm chart (credentials are
operator-environment-specific — see comment in `chart::build_chart`).
The chart docs say "applies the Secret directly via
`operator_secret()` (used as a `K8sResourceScore`)", but
`FleetOperatorInterpret::execute` never actually did that. Result: the
operator pod stalls forever in `CreateContainerConfigError` with
`secret "harmony-fleet-operator-secrets" not found`.

Fix: when `score.credentials` is set, build the Secret via
`operator_secret(&chart_options)` and apply it via `K8sResourceScore`
**before** the helm install fires. This way kube has the Secret in
place by the time the chart's Deployment lands and the pod starts
cleanly. Mirrors the pattern `NatsAuthCalloutScore` already uses for
its own callout Secret.

Trait bound widens from `T: Topology + HelmCommand` to
`T: Topology + HelmCommand + K8sclient` to support the
`K8sResourceScore::interpret` call. The only existing caller
(`fleet_staging_install`) drives this through `K8sAnywhereTopology`
which already implements all three.

When `credentials` is `None` (no-auth dev mode) we skip the Secret
apply entirely — the chart's Deployment doesn't reference it in
that case either.
johnride added 3 commits 2026-05-06 17:27:56 +00:00
The operator's `credentials.toml` embeds Zitadel's JSON machine-key
content under `key_json`. Both `fleet_staging_install` and the
docstring example used basic triple-quoted strings (`"""..."""`),
which interpret backslash escapes — every `\n` in the embedded RSA
private key gets expanded to a literal 0x0A before the value lands
in the operator's env var. The operator's `harmony-fleet-auth`
deserializer then runs `serde_json::from_str` on a "JSON" string
that contains raw control chars inside string literals and rejects
it with "control character found while parsing a string at line 2
column 0".

The fix is a one-character delta: switch to TOML *literal*
multi-line strings (triple single-quote). Literal strings preserve
backslash sequences as-is, so `\n` reaches the JSON parser as the
two chars `\` + `n`, gets interpreted as a string escape, and the
multi-line PEM decodes correctly.

Updates `fleet_staging_install`'s `format!()` template to render
`key_json = '''<json>'''` and rewrites the docstring example on
`OperatorCredentials::credentials_toml` to spell out which string
form is required, with the failure mode that comes from picking
the wrong one.
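
A sketch of the rendering step, assuming a single `key_json` field for brevity. `render_key_json_toml` is a hypothetical helper, not the `format!()` template in `fleet_staging_install`; the guard reflects that a TOML multi-line literal string cannot contain `'''`:

```rust
/// Render `key_json` as a TOML multi-line literal string so that backslash
/// sequences such as `\n` reach the JSON parser untouched.
/// Hypothetical helper for illustration.
pub fn render_key_json_toml(key_json: &str) -> Result<String, String> {
    // Literal strings have no escape mechanism, so the delimiter itself
    // must not occur in the payload.
    if key_json.contains("'''") {
        return Err("key_json contains ''' and cannot be embedded as a TOML literal string".into());
    }
    Ok(format!("key_json = '''{key_json}'''\n"))
}
```

With a basic string (`"""..."""`) the same payload would have each `\n` expanded by the TOML parser before `serde_json` ever sees it, which is the failure described above.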
NATS server-level `jetstream: { ... }` config doesn't extend to
explicit accounts — each one has to opt in individually with
`jetstream: enabled` (or a per-account quota object). The rendered
values block declared `FLEET` and `SYS` accounts but never enabled
JetStream on `FLEET`, so the operator's first call to create its
desired-state KV bucket died immediately with:

  JetStream error: JetStream not enabled for account
  (code 503, error code 10039)

Adds `jetstream: enabled` to the callout account block in
`render_values_yaml`. SYS deliberately stays without it — system
account doesn't host streams. Reference:
https://docs.nats.io/nats-concepts/jetstream/account_jetstream

Adds `auth_callout_account_has_jetstream_enabled` regression test
that:
* asserts `jetstream: enabled` appears under the callout account
  block in the rendered YAML;
* defense-in-depth: asserts `jetstream:` does NOT appear under SYS,
  so a future regex slip can't silently flip system-account
  JetStream on.
fix(deps): enable async-nats websockets feature for wss:// support
Some checks failed
Run Check Script / check (pull_request) Failing after 1m0s
af06177502
The fleet agent connects to NATS via the OKD edge-TLS Route at
`wss://nats-fleet-stg.cb1.nationtech.io`. Without the `websockets`
feature on async-nats, the connector parses the URL but doesn't know
how to do the HTTP Upgrade — it opens a raw TCP socket to port 443
and sits waiting for NATS's plaintext `INFO` frame, which never
comes (the OKD router speaks TLS+HTTPS, not raw NATS). 30s later:

  ERROR async_nats::connector: expected INFO, got nothing
  Error: Nats connection FAILED : IO error: expected INFO, got nothing

…and systemd restart-loops forever.

`websockets` isn't in async-nats 0.45's default feature set; the
crate's own Cargo.toml lists it as
`websockets = ["dep:tokio-websockets"]`. Enabling it on the
workspace dep makes the connector route `wss://` URLs through
tokio-websockets which does the TLS+upgrade dance correctly. Curl
already proved the server-side path works (`101 Switching
Protocols` + NATS `INFO`); the missing piece was always client
support.

The operator wasn't affected because it talks to NATS in-cluster
on `nats://fleet-nats.fleet-staging.svc.cluster.local:4222` (plain
TCP). Only external clients going through the public wss:// Route
hit this.
johnride added 5 commits 2026-05-11 20:45:24 +00:00
The auto-generated `Id::default()` shape (`fb5310_Qm2kPoQ`) contains
underscores and uppercase, so once the agent published its
DeviceInfo and the operator tried to upsert a Device CR using
`device_id` as `metadata.name`, kube rejected it:

  ApiError: Device.fleet.nationtech.io "fb5310_Qm2kPoQ" is invalid:
  metadata.name: Invalid value ... must consist of lower case
  alphanumeric characters, '-' ...

Failing at operator-reconcile time is bad UX: the Zitadel machine
user is already provisioned, the agent is already running, and the
auth callout's per-device permissions are already templated to a
device_id the kube layer will never accept. Re-enrolling requires
manually deleting state in three places.

Makes `--device-id` **required** and validates it against RFC1123
DNS subdomain rules upfront, before any Zitadel call:

* non-empty, ≤253 chars total
* dot-separated labels, each 1-63 chars, lowercase a-z + 0-9 + `-`
* labels must start AND end with an alphanumeric

Stricter than just "kube name valid" because the same id flows into
NATS subjects (auth callout's permission templates) — `_`/uppercase
silently passes NATS auth but breaks the kube path. Rejecting at
the CLI is the only failure point that catches both layers in one
place.
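
The rules above can be sketched as a small std-only validator. The function name `validate_device_id` and the error strings are assumptions for illustration:

```rust
/// Validate a device id against RFC 1123 DNS subdomain rules.
/// Hypothetical helper; error messages are illustrative.
pub fn validate_device_id(id: &str) -> Result<(), String> {
    if id.is_empty() {
        return Err("device id must not be empty".into());
    }
    if id.len() > 253 {
        return Err(format!("device id is {} chars, max is 253", id.len()));
    }
    for label in id.split('.') {
        // An empty label means a leading, trailing, or doubled dot.
        if label.is_empty() || label.len() > 63 {
            return Err(format!("label '{label}' must be 1-63 chars"));
        }
        let allowed = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'-';
        if !label.bytes().all(allowed) {
            return Err(format!("label '{label}' may only contain a-z, 0-9 and '-'"));
        }
        if label.starts_with('-') || label.ends_with('-') {
            return Err(format!("label '{label}' must start and end with an alphanumeric"));
        }
    }
    Ok(())
}
```

Running this before any Zitadel call means an id such as `fb5310_Qm2kPoQ` is rejected at the CLI rather than at operator-reconcile time.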

8 unit tests cover the accept set + every reject path
(underscore — the regression that triggered this — uppercase,
leading/trailing dash, empty, consecutive dots, label too long,
total too long).

CLI banner + README updated. The `Id::default()` fallback path is
removed entirely; no backward compat with the old auto-generated
shape (the user explicitly opted out — anything that ran before now
needs re-enrollment with an explicit id).
Two design documents framing the next push.

`ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push.
Replaces the open-ended chapter structure of v0_1_plan.md for the
period between the walking-skeleton merge and v0.1.0 in production.
Focus is locking the fleet module's public API surface so the
inevitable physical refactor (out of `harmony/modules/fleet/`,
into `fleet/harmony-fleet/`) is mechanical when we get to it.
Anchored in the principle from JG's *Pour l'amour des compilateurs*
talk: design the brick before moving the brick.

`docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure.
K8s rolling-update shape applied to one host: drain in-flight
work, stage versioned binary alongside old, smoke-test, atomic
symlink swap, both agents alive briefly, operator verifies new
agent's heartbeat then sends explicit stop signal to old, old
exits cleanly. No version is ever erased — N-history on disk is
the rollback target. Operator-driven cutover (not self-stopping)
so the most-trusted side decides the handoff. Implementation
deferred to post-v0.1 backlog; spec exists so anyone can build
it without reinventing the design.

ADR README index updated.
Workspace warning count: 408 → 105.

Three buckets cleared:

* Auto-fixable (`cargo fix` + `cargo clippy --fix`): unused imports
  removed, unused variables prefixed with `_`, deprecated method
  calls updated. Applied across harmony, harmony-k8s, harmony-agent,
  harmony_inventory_agent, the fleet/ workspace, and ~15 examples.
* Generated code (opnsense-api/src/generated/): 269 snake_case
  warnings + ~10 unreachable-pattern warnings come from
  CamelCase-preserving bindings to OPNsense's HAProxy/Caddy XML
  schemas. Scoped a single `#[allow(non_snake_case,
  unreachable_patterns)]` at `pub mod generated;` rather than
  fighting the codegen — renaming would break serde round-trips
  and the codegen would regenerate them anyway.
* opnsense-codegen parser's defensive `let...else` guards on
  `XmlNode` (currently single-variant): file-level
  `#![allow(irrefutable_let_patterns)]` with a comment explaining
  why we keep the `else` arms (they re-arm if the IR grows a
  second variant).

`harmony_inventory_agent::local_presence::{DiscoveryEvent,
discover_agents}` re-exports were stripped twice by the auto-fix
passes (consumers live in another crate, so the local crate looks
"unused" to lint). Anchored with explicit `pub use` + an
`#[allow(unused_imports)]` annotation noting why.

All 151 harmony lib tests still pass. Remaining ~105 warnings are
mostly real dead code in non-fleet modules + a handful of
unused-imports/variables clippy couldn't auto-resolve; cleared in
the next pass.
Picks up where the auto-fix pass left off. Workspace warning count
goes from 105 to 0 across `cargo build --workspace --all-targets`.

Three categories of fixes:

1. Mechanical fixes the auto-pass couldn't handle (unused imports
   inside braced multi-name `use` statements, unused variables that
   needed an underscore prefix without breaking other references):
   batched via a small Python script, then 6 manual edits where the
   warning location and the actual identifier were on different
   lines.

2. Dead-code that's intentionally kept around for future wiring or
   debug visibility — `#[allow(dead_code)]` at the right scope:
   - 19 individual items (struct fields, methods, free functions,
     type aliases, enum variants), e.g. `default_namespace` / `default_cluster_issuer`
     in zitadel/mod.rs (used via serde defaults, opaque to rustc),
     `score` fields on the OKD bootstrap interpret types,
     `crd_exists` methods on the prometheus alerting scores, the
     `harmony_inventory_agent::local_presence::{DiscoveryEvent,
     discover_agents}` re-exports.
   - 5 module-level allows for files where most items are
     aspirational scaffolding (harmony_agent's replica workflow,
     opnsense-config dnsmasq, three opnsense-api examples).

3. Special cases that needed real fixes, not allows:
   - `opnsense-config-xml/src/data/haproxy.rs`: deprecated
     `rand::thread_rng` / `Rng::gen` updated to `rng()` / `random`.
   - `harmony_secret/src/lib.rs`: the `secrete2etest` integration
     test gate is now declared in Cargo.toml's `[lints.rust]
     unexpected_cfgs.check-cfg`; the gated test module is structured
     so its dead `TestSecret`/`TestUserMeta` types come along for
     the cfg ride and don't show up as unconditional dead code.
   - `harmony/src/modules/nats/score_nats_k8s.rs:241`: `K8sIngressScore
     { name: todo!(), ... }`'s unreachable expression annotated.
   - `harmony/src/domain/topology/k8s_anywhere/k8s_anywhere.rs:982`:
     wrap the dead-after-`return Ok(Noop)` branch in
     `#[allow(unreachable_code)] { ... }`. Behavior unchanged.
   - `examples/try_rust_webapp/Cargo.toml`: `autobins = false` so
     `src/main.rs` isn't auto-registered as both bin AND example.

All 16 lib-test suites pass: 437 tests, 0 failed, 13 ignored.

Ready for `-Dwarnings` in CI as a follow-up — the gate makes
sense once we're sure no contributor's local builds slip warnings
back in.
docs: fleet architecture review — inventory, principles, alternatives
Some checks failed
Run Check Script / check (pull_request) Failing after 52s
616c05d5a4
Working document for the architectural redesign of the fleet
platform before v0.1 ships to production. Captures four sections
of research:

§1 — Current state inventory. Markdown-bullet map of every public
type, score, trait, and module across `harmony/modules/fleet/`,
`harmony-reconciler-contracts`, and `fleet/harmony-fleet-*/`.
Sorted by domain meaning (identity, desired state, observed
state, setup, plumbing) rather than location, so the
cross-cutting concerns become visible. Includes a text "diagram"
of the dependency graph showing the two problematic edges:
runtime crates importing CRD types from the framework crate
(`harmony-fleet-operator` ← `harmony::modules::fleet::operator::crd`
verified at `controller.rs:37`, `device_reconciler.rs:21`,
`main.rs:9`) and the agent importing podman wire types from the
framework crate (`harmony-fleet-agent` ← `harmony::modules::podman`
verified at `main.rs:21-22`, `reconciler.rs:11`).

§2 — Theory review. Pulls principles from JG's *Pour l'amour des
compilateurs* talk (2026-04-30), its references (Crichton,
Feldman, Maguire, Goedecke, Fowler), and harmony's own load-bearing
ADRs (002 hexagonal, 003 infrastructure abstractions, 015 higher-
order topologies, 016 agent + global mesh, 018 template hydration).
Synthesizes eight design principles for the redesign — including
Goedecke's guardrail that "type-driven" ≠ "type-everything" so we
don't over-fit the cardinality argument.

§3 — Ten concrete shape problems (P1–P10), framed as cardinality
mismatches, leaky boundaries, and "is this resolved yet" branches
rather than bugs. P1 is the placement issue JG flagged in code
review; P2 is `FleetDeviceAuth`'s mixed resolved/unresolved
states; P10 is the credential-shape staircase across operator
workstation / operator pod / agent.

§4 — Five design alternatives, each scored against P1–P10:
  A. Move + thin façade (conservative cleanup).
  B. Resolved-only at boundaries + capability traits (principled
     incremental).
  C. Dataflow reframe (events in, state out).
  D. Fleet as kube control plane, period (deliberately weird).
  E. Algebra of fleets (deliberately mathematical).

A is too little, C/D/E are right-shape but wrong-timing for the
3-day window. B is the working recommendation, with explicit
awareness that D is the v2.0 destination and the capability
traits in B are the seam that lets us migrate without breaking
callers.

§5 sketches a concrete shape for B: new `harmony-fleet/` domain
crate with no framework dependency, `harmony-fleet-adapters-*`
crates for NATS/Zitadel/kube, the existing operator/agent/auth
crates wire adapters together, the framework's
`harmony::modules::fleet` collapses to a re-export module that
goes away by v0.2.

§6 — Five open questions for JG's review before locking the
choice. §7 — explicit "spike one slice, then commit or back out"
process so we don't lock the wrong shape.

Not an ADR yet. The ADR happens after JG agrees on which
alternative is the working hypothesis and the spike confirms the
shape feels better in code than on paper.
johnride added 4 commits 2026-05-20 10:35:09 +00:00
feat: maud + htmx + tailwindcss frontend for fleet operator, initial commit, still much work to do
Some checks failed
Run Check Script / check (pull_request) Failing after 59s
ee95a5d1a3
add auth to frontend through lib (#284)
Some checks failed
Run Check Script / check (pull_request) Failing after 38s
96e7d43b2f
Adds OIDC login support to the harmony-fleet-operator web dashboard using Zitadel SSO.

PKCE was the recommended option here because the app does not need to hold a client secret. The server generates a random code verifier and sends only its SHA-256 hash (the code challenge) to Zitadel. At token exchange the app sends the verifier itself, and Zitadel validates the request by recomputing the hash and comparing the two values.

PKCE auth flow

 1. User visits a protected dashboard route, like /devices.
 2. If no valid harmony_fleet_session cookie exists, the app redirects to /login.
 3. /login creates:
     - random state
     - random pkce_code_verifier
     - derived code_challenge = base64url(sha256(pkce_code_verifier))
 4. The app stores state and pkce_code_verifier in a temporary HTTP-only login-attempt cookie.
 5. The browser is redirected to Zitadel’s authorize endpoint with:
     - client_id
     - redirect_uri
     - scope
     - state
     - code_challenge
     - code_challenge_method=S256
 6. After SSO login, Zitadel redirects back to /auth/callback?code=...&state=....
 7. The callback handler:
     - parses the raw query into a strict success/failure enum
     - reads the temporary login-attempt cookie
     - validates returned state
     - exchanges code + pkce_code_verifier for tokens
     - validates the returned ID token using OIDC discovery/JWKS
     - creates a local harmony_fleet_session cookie
     - redirects to /
 8. Protected routes validate the local dashboard session cookie on each request.
 9. /logout clears the dashboard session cookie and redirects to /login.
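
A minimal, std-only sketch of the "strict success/failure enum" from step 7, assuming the callback carries either `code` + `state` or the standard OAuth `error` parameter. The type and function names (`CallbackQuery`, `parse_callback`) are hypothetical, not the ones used in the dashboard, and percent-decoding is left out to keep the sketch dependency-free.

```rust
use std::collections::HashMap;

/// Outcome of parsing the raw `/auth/callback` query string.
#[derive(Debug, PartialEq)]
enum CallbackQuery {
    Success { code: String, state: String },
    Failure { error: String, state: Option<String> },
}

/// Returned when the query is neither a complete success nor an error reply.
#[derive(Debug, PartialEq)]
enum CallbackParseError {
    Malformed,
}

/// Splits `a=b&c=d` into key/value pairs (no percent-decoding in this sketch).
fn parse_pairs(raw: &str) -> HashMap<&str, &str> {
    raw.split('&')
        .filter_map(|pair| pair.split_once('='))
        .collect()
}

fn parse_callback(raw: &str) -> Result<CallbackQuery, CallbackParseError> {
    let params = parse_pairs(raw);

    // An `error` parameter always wins: the provider rejected the login.
    if let Some(error) = params.get("error") {
        return Ok(CallbackQuery::Failure {
            error: error.to_string(),
            state: params.get("state").map(|s| s.to_string()),
        });
    }

    // Success requires both values to be present and non-empty.
    match (params.get("code"), params.get("state")) {
        (Some(code), Some(state)) if !code.is_empty() && !state.is_empty() => {
            Ok(CallbackQuery::Success {
                code: code.to_string(),
                state: state.to_string(),
            })
        }
        _ => Err(CallbackParseError::Malformed),
    }
}
```

Parsing into an enum before touching the cookie means the rest of the handler never sees a half-valid query: it either has both `code` and `state`, or it has an error to report.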

---

 Auth middleware responses depending on request type:

 - normal browser request: redirect to /login
 - SSE request: 401 authentication required
 - HTMX request: 401 with HX-Redirect: /login (letting HTMX perform the redirect is more idiomatic here than returning a redirect response from Axum)
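
The three cases above can be sketched as a pure decision function. This is an illustration only: it assumes the request type is detected from the `HX-Request: true` header that HTMX sends and from an `Accept: text/event-stream` header for SSE, which may not be how the middleware classifies requests. All type names are hypothetical.

```rust
/// How the unauthenticated request reached the dashboard.
#[derive(Debug, PartialEq)]
enum RequestKind {
    Browser,
    Sse,
    Htmx,
}

/// What the middleware answers when no valid session cookie is present.
#[derive(Debug, PartialEq)]
enum Unauthenticated {
    /// Plain redirect to the given location.
    Redirect { location: &'static str },
    /// 401, optionally carrying an `HX-Redirect` header value.
    Unauthorized { hx_redirect: Option<&'static str> },
}

/// Classifies a request from two header values (assumed detection rule).
fn classify(accept: Option<&str>, hx_request: Option<&str>) -> RequestKind {
    if hx_request == Some("true") {
        RequestKind::Htmx
    } else if accept.map_or(false, |a| a.contains("text/event-stream")) {
        RequestKind::Sse
    } else {
        RequestKind::Browser
    }
}

fn unauthenticated_response(kind: RequestKind) -> Unauthenticated {
    match kind {
        RequestKind::Browser => Unauthenticated::Redirect { location: "/login" },
        RequestKind::Sse => Unauthenticated::Unauthorized { hx_redirect: None },
        RequestKind::Htmx => Unauthenticated::Unauthorized { hx_redirect: Some("/login") },
    }
}
```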

Reviewed-on: #284
Reviewed-by: johnride <jg@nationtech.io>
Co-authored-by: Reda Tarzalt <tarzaltreda@gmail.com>
Co-committed-by: Reda Tarzalt <tarzaltreda@gmail.com>
Reviewed-on: #283
johnride added 1 commit 2026-05-20 14:19:46 +00:00
chore: Move claude.md to agents.md and symlink back
Some checks failed
Run Check Script / check (pull_request) Failing after 38s
b72ac7c99d
johnride added 4 commits 2026-05-20 15:00:35 +00:00
First slice of the device-commands.* protocol from
fleet/requests_over_nats.md. Lands `Verb::Ping` plus the harness that
proves it works against a real in-cluster agent.

Wire types (`harmony-reconciler-contracts::commands`):
- `Verb::Ping`, `CommandRequest`, `PingReply`, `ErrorReply`/`ErrorKind`
- `device_command_subject` / `device_command_subscription` helpers
- `X-Harmony-*` header constants
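
A rough sketch of what the two subject helpers might look like. The function names echo the helpers listed above, but the signatures are guesses, and placing the verb as the third subject token is an assumption. Only the `device-commands.<id>.>` subscription pattern comes from this commit. `subscription_matches` is a hypothetical helper that handles just the trailing `>` wildcard, not full NATS matching.

```rust
const SUBJECT_PREFIX: &str = "device-commands";

/// Wildcard subject an agent would subscribe on for one device.
fn device_command_subscription(device_id: &str) -> String {
    format!("{}.{}.>", SUBJECT_PREFIX, device_id)
}

/// Concrete subject for one verb (verb position is an assumption).
fn device_command_subject(device_id: &str, verb: &str) -> String {
    format!("{}.{}.{}", SUBJECT_PREFIX, device_id, verb)
}

/// Matches a subject against a subscription that may end in `.>`.
/// `>` must cover at least one token, as in NATS.
fn subscription_matches(subscription: &str, subject: &str) -> bool {
    match subscription.strip_suffix(".>") {
        Some(prefix) => subject
            .strip_prefix(prefix)
            .and_then(|rest| rest.strip_prefix('.'))
            .map_or(false, |rest| !rest.is_empty()),
        None => subscription == subject,
    }
}
```

Requiring the `.` after the prefix keeps device `dev-1` from receiving commands addressed to `dev-10`.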

Agent:
- `command_server.rs` subscribes on `device-commands.<id>.>` and
  dispatches verbs; ping handler replies with `PingReply`
- New `[agent].runtime_enabled` config flag (default true). When
  false, podman init + reconciler loop are skipped so the agent can
  run as a Pod on containerd-only k3d nodes; command server +
  heartbeat still run
- `Dockerfile`: canonical multi-stage build for production registries

Operator:
- `commands::FleetCommandsClient` with typed `CommandError`
  (`DeviceOffline` via `no_responders`, `Timeout`, `BadReply`, `Nats`)
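
For illustration, the typed error surface could look roughly like the sketch below. The variant names come from this commit; the payloads, the `Outcome` enum, and `classify_outcome` are hypothetical. `Display` is implemented by hand with std only to keep the sketch dependency-free; it is not the actual error derivation used in the crate.

```rust
use std::fmt;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum CommandError {
    /// NATS reported `no_responders`: nothing is subscribed for this device.
    DeviceOffline,
    /// A subscriber exists but did not reply in time.
    Timeout(Duration),
    /// A reply arrived but could not be interpreted.
    BadReply(String),
    /// Any other transport-level failure.
    Nats(String),
}

impl fmt::Display for CommandError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CommandError::DeviceOffline => write!(f, "device offline (no responders)"),
            CommandError::Timeout(d) => write!(f, "no reply within {:?}", d),
            CommandError::BadReply(why) => write!(f, "malformed reply: {}", why),
            CommandError::Nats(why) => write!(f, "NATS error: {}", why),
        }
    }
}

impl std::error::Error for CommandError {}

/// Hypothetical stand-in for what a request/reply round trip can yield.
enum Outcome {
    Reply(String),
    NoResponders,
    TimedOut,
    Transport(String),
}

fn classify_outcome(outcome: Outcome, timeout: Duration) -> Result<String, CommandError> {
    match outcome {
        Outcome::Reply(body) if body.trim().is_empty() => {
            Err(CommandError::BadReply("empty body".to_string()))
        }
        Outcome::Reply(body) => Ok(body),
        Outcome::NoResponders => Err(CommandError::DeviceOffline),
        Outcome::TimedOut => Err(CommandError::Timeout(timeout)),
        Outcome::Transport(why) => Err(CommandError::Nats(why)),
    }
}
```

Separating `DeviceOffline` from `Timeout` lets a caller tell "no agent is listening" apart from "an agent is listening but slow", which a single stringly-typed error would blur.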

E2E harness (`harmony-fleet-e2e`):
- Library crate + integration test. `Stack::bring_up` provisions a
  fresh `e2e-<uuid8>` namespace in a shared `fleet-e2e` k3d cluster,
  deploys NATS (UserPass auth, JetStream on) + the agent Pod, returns
  a connected admin NATS client, and tears the namespace down on Drop
- v1 ships `AuthMode::UserPass` only; the `Callout` variant is
  reserved on the public API for the follow-up PR that adds the mock
  OIDC fixture + NatsAuthCalloutScore deployment
- Operator pod deployment is also follow-up — for ping the test
  process drives `FleetCommandsClient` directly against the cluster's
  NATS NodePort
- `HARMONY_FLEET_E2E=1` gates the integration test so default
  `cargo test --workspace` runs don't depend on k3d/podman
- Image build + sideload mirrors the `fleet_auth_callout` pattern:
  host `cargo build --release` → single-stage Dockerfile → `podman
  build` → `k3d image import`. ~12s warm bring-up, ~80s cold
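
A small sketch of the `HARMONY_FLEET_E2E=1` gate, assuming only the exact value `1` enables the test (how other values are treated is not stated here). The helper names are hypothetical.

```rust
/// Pure decision: the gate opens only for the exact value "1" (assumption).
fn e2e_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads the gate from the process environment.
fn e2e_enabled_from_env() -> bool {
    e2e_enabled(std::env::var("HARMONY_FLEET_E2E").ok().as_deref())
}
```

An integration test would call `e2e_enabled_from_env()` first and return early when it is false, so a default `cargo test --workspace` run reports the test as passed without touching k3d or podman.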
The previous e2e harness handrolled k8s manifests in `stack.rs`,
bypassing the Score-Topology-Interpret machinery harmony exists to
provide. This commit:

1. **ADR-023** codifies the rules: deploy with Scores (not
   manifests), e2e uses the same Scores as production, one Score
   per component, deploy blocks on smoke-test success, deploy logic
   lives in `*-deploy` crates, topologies are compile-time,
   thiserror over anyhow. CLAUDE.md mirrors the principles.

2. **New `fleet/harmony-fleet-deploy` crate** is the canonical home
   for fleet-component Scores:
   - `FleetOperatorScore` + helm-chart generator + `install_crds`
     moved out of `harmony::modules::fleet::operator` (they should
     never have lived in `harmony` core). `FleetServerScore`
     (composite of NATS + operator + Zitadel + callout) moved too.
   - New `FleetNatsScore` (preset over `NatsHelmChartScore` with
     fleet's required values; v1 supports `UserPass` auth, callout
     mode reserved on the public API for PR 1.5).
   - New `FleetAgentScore` with `FleetAgentTarget::Pod`; `Vm`
     target is a future variant that absorbs `FleetDeviceSetupScore`.
   - `harmony-fleet-deploy` binary built on the existing
     `harmony_cli` crate — no new CLI scaffolding.

3. **Operator runtime binary trimmed**: `Install` and `Chart`
   subcommands removed; both jobs now belong to
   `harmony-fleet-deploy`. The runtime binary becomes leaner.

4. **E2E harness rewritten** as a thin Score composer:
   `harmony-fleet-e2e/src/stack.rs` deploys the stack via
   `FleetNatsScore` + `FleetAgentScore`. The inline NATS manifest
   factory and the bespoke agent Pod renderer are gone.
   - Bring-up runs once per test binary via `shared_stack` +
     `tokio::sync::OnceCell` (matches the `fleet_e2e_demo` pattern).
   - Stale `e2e-*` namespaces from prior runs get pruned at
     startup so the leaks the OnceCell creates don't compound.

5. **`thiserror` for the agent's `CommandServer`** — replaces the
   anyhow-based surface with typed `CommandError` /
   `CommandServerError`.

6. **Memory** captures eight load-bearing principles (saved to
   `~/.claude/projects/.../memory/`) so future sessions don't drift
   back into manifest-handrolling.

Verified: `cargo test -p harmony-fleet-e2e --test ping` green
end-to-end against k3d in 25s warm.
docs(fleet): top-level README; harden e2e namespace prune to wait for NodePort release
Some checks failed
Run Check Script / check (pull_request) Failing after 41s
1b21176215
- Add `fleet/README.md`: overview of the crates, ADR-023 pointer,
  quickstart for the e2e ping test, env knobs (`HARMONY_FLEET_E2E`,
  `FLEET_E2E_KEEP`, `RUST_LOG`), how to connect to NATS from the host
  and in-cluster, how to inspect the agent, the `harmony-fleet-deploy`
  production CLI, the operator dashboard, and the roadmap (Zitadel +
  callout next).
- `prune_stale_namespaces` now polls until each pruned namespace is
  fully gone (up to 90 s). NATS NodePort 30423 is cluster-scoped, so
  a still-`Terminating` namespace from the prior run was blocking the
  new bring-up with "provided port is already allocated".
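
The polling behaviour can be sketched as a generic wait-until-gone loop. This is a blocking, std-only illustration: the real `prune_stale_namespaces` is presumably async and talks to the kube API, and `wait_until_gone` plus its parameters are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Polls `still_exists` until it returns false or `timeout` elapses.
/// Returns the number of polls on success, or the elapsed time on timeout.
fn wait_until_gone<F>(
    mut still_exists: F,
    timeout: Duration,
    interval: Duration,
) -> Result<u32, Duration>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    let mut polls = 0;
    loop {
        polls += 1;
        if !still_exists() {
            return Ok(polls);
        }
        if start.elapsed() >= timeout {
            return Err(start.elapsed());
        }
        std::thread::sleep(interval);
    }
}
```

With a 90 s timeout, bring-up only proceeds once the `Terminating` namespace has actually released the cluster-scoped NodePort, instead of racing it.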

Verified: e2e ping test green back-to-back after the fix, with a
prior namespace left behind.
Merge commit '1b211762' into feat/iot-walking-skeleton
Some checks failed
Run Check Script / check (pull_request) Failing after 37s
fdd8383caa
johnride added 1 commit 2026-05-20 16:03:29 +00:00
chore: Fix clippy and ci lints, cleanup docs a bit, rewrite adr 023 with better language, etc.
Some checks failed
Run Check Script / check (pull_request) Failing after 1m52s
7ab15415c7
johnride added 1 commit 2026-05-20 16:21:39 +00:00
fix: interactive test now has injected mock data
Some checks failed
Run Check Script / check (pull_request) Failing after 1m51s
4e80101f26
johnride added 1 commit 2026-05-20 17:41:44 +00:00
feat: refactor fleet agent config into a strongly typed struct, remove brittle string processing
Some checks failed
Run Check Script / check (pull_request) Failing after 1m51s
34807511b4
johnride added 1 commit 2026-05-20 19:45:55 +00:00
doc: fleet/ARCHITECTURE.html overview of the whole fleet plumbing, components, flows, code cheatsheet
Some checks failed
Run Check Script / check (pull_request) Failing after 1m52s
20b14c4648
reda added 6 commits 2026-05-22 18:07:59 +00:00
feat: Fleet E2E tests harness improving a lot, firing up a VM and testing agent behavior
Some checks failed
Run Check Script / check (pull_request) Failing after 2m3s
cdebeb8a9f
feat: fleet e2e x86 vm support
Some checks failed
Run Check Script / check (pull_request) Failing after 37s
8e6e1fa1bc
doc: fleet e2e x86 arch support
Some checks failed
Run Check Script / check (pull_request) Failing after 2m23s
ba685baddb
feat: fleet e2e x86 support
Some checks failed
Run Check Script / check (pull_request) Failing after 39s
433a66dac2
resolve conflicts
Some checks failed
Run Check Script / check (pull_request) Failing after 36s
1bedbd0f62
reda added 1 commit 2026-05-22 18:57:50 +00:00
add test for operator and update README
Some checks failed
Run Check Script / check (pull_request) Failing after 38s
f273d07657
johnride added 1 commit 2026-05-22 21:04:54 +00:00
chore: fix fmt
Some checks failed
Run Check Script / check (pull_request) Failing after 2m5s
b37c76d0a5
johnride added 2 commits 2026-05-22 22:13:09 +00:00
Caller must pass `UserPassCredentials` to `FleetNatsScore::user_pass`
— no more `e2e-admin`/`e2e-device` defaults shipped in the library.
The deploy binary reads `HARMONY_FLEET_*` env vars (default namespace
`harmony-fleet-system`) and fails fast when NATS creds aren't set.
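
A sketch of the fail-fast configuration load, under stated assumptions: the variable names `HARMONY_FLEET_NAMESPACE`, `HARMONY_FLEET_NATS_USER` and `HARMONY_FLEET_NATS_PASSWORD` are hypothetical (the commit only says `HARMONY_FLEET_*`), as are the fields of `UserPassCredentials` and the `DeployConfig` / `ConfigError` types. Only the default namespace is taken from the text above.

```rust
use std::collections::HashMap;

const DEFAULT_NAMESPACE: &str = "harmony-fleet-system";

#[derive(Debug, PartialEq)]
struct UserPassCredentials {
    user: String,
    password: String,
}

#[derive(Debug, PartialEq)]
struct DeployConfig {
    namespace: String,
    nats: UserPassCredentials,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    MissingVar(&'static str),
}

/// Returns the variable's value, treating unset and empty as missing.
fn required(env: &HashMap<String, String>, name: &'static str) -> Result<String, ConfigError> {
    match env.get(name) {
        Some(value) if !value.is_empty() => Ok(value.clone()),
        _ => Err(ConfigError::MissingVar(name)),
    }
}

/// Builds the config from an environment snapshot; no credential defaults.
fn load_config(env: &HashMap<String, String>) -> Result<DeployConfig, ConfigError> {
    let namespace = env
        .get("HARMONY_FLEET_NAMESPACE")
        .cloned()
        .unwrap_or_else(|| DEFAULT_NAMESPACE.to_string());
    let nats = UserPassCredentials {
        user: required(env, "HARMONY_FLEET_NATS_USER")?,
        password: required(env, "HARMONY_FLEET_NATS_PASSWORD")?,
    };
    Ok(DeployConfig { namespace, nats })
}
```

A binary would snapshot the process environment with `let env: HashMap<String, String> = std::env::vars().collect();` and exit on `Err` before deploying anything. Taking the map as a parameter keeps the loader testable without mutating global state.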

Also: `style/dist/` gitignored, `manual_mint/mint.py` moved next to
`nats/callout/` with README + secrets gitignore (the real RSA key
that was sitting untracked has been removed), `architecture_review.md`
moved to `docs/adr/drafts/024-`, three low-value ROADMAP docs deleted.

Updates pre-merge checklist (§1.6, §1.8, §3.1, §5).
chore(fleet): drop pre_merge_checklist.md — served its purpose
Some checks failed
Run Check Script / check (pull_request) Failing after 1m50s
46ceb6b493
Branch is ready to merge; the checklist was working scaffolding for
that. Remaining deferred items (CI image libvirt-dev, smoke-test
contract, bash → Rust smoke migration, ignored-test CI runner,
ADR-024) live in the merge commit body and should be tracked as
real issues from there.
johnride merged commit 27accb399e into master 2026-05-22 22:16:18 +00:00
johnride deleted branch feat/iot-walking-skeleton 2026-05-22 22:16:18 +00:00
Reference: NationTech/harmony#264