feat/iot-helm #275

Merged
johnride merged 52 commits from feat/iot-helm into feat/iot-walking-skeleton 2026-04-25 13:52:24 +00:00

52 Commits

fbe58228f2 Merge pull request 'refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-*' (#276) from feat/iot-rebrand into feat/iot-helm
All checks were successful
Run Check Script / check (pull_request) Successful in 2m12s
Reviewed-on: #276
2026-04-25 13:48:23 +00:00
7c1fedb303 refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-*
All checks were successful
Run Check Script / check (pull_request) Successful in 2m25s
The IoT vocabulary was anchoring the codebase to one customer's
domain. The reconciler pattern is generic — operator in k8s, NATS
KV as desired-state bus, agents reconciling podman / OKD / KVM /
anything that can register. "Fleet" captures that neutrally; IoT
stays acknowledged in docs as the first customer use case.

Done now, while nothing is deployed. After a partner fleet lands,
changing the CRD group alone is a multi-quarter migration.

Scope (nothing left over):

Paths + crates
- iot/ → fleet/
- iot/iot-operator-v0 → fleet/harmony-fleet-operator
- iot/iot-agent-v0 → fleet/harmony-fleet-agent
- harmony/src/modules/iot → harmony/src/modules/fleet
- ROADMAP/iot_platform → ROADMAP/fleet_platform
- examples/iot_{vm_setup, load_test, nats_install} → examples/fleet_*
- -v0 suffix dropped on the operator + agent crates (semver in
  Cargo.toml already tracks version)

Rust identifiers
- enum IotScore (podman score payload) → ReconcileScore
- struct IotDeviceSetupScore/Config → FleetDeviceSetupScore/Config
- InterpretName::IotDeviceSetup → InterpretName::FleetDeviceSetup
- HarmonyIotPool → HarmonyFleetPool (libvirt pool)
- HARMONY_IOT_POOL_NAME (default "harmony-iot") → HARMONY_FLEET_POOL_NAME ("harmony-fleet")
- IotSshKeypair → FleetSshKeypair
- ensure_iot_ssh_keypair / ensure_harmony_iot_pool /
  check_iot_smoke_preflight_for_arch → fleet-prefixed variants

Wire / config surfaces
- CRD group `iot.nationtech.io` → `fleet.nationtech.io`
- Finalizer `iot.nationtech.io/finalizer` → `fleet.nationtech.io/finalizer`
- Shortnames iotdep/iotdevice → fleetdep/fleetdev
- Env var IOT_AGENT_CONFIG → FLEET_AGENT_CONFIG
- Env var IOT_VM_ADMIN_PASSWORD → FLEET_VM_ADMIN_PASSWORD
- Binary /usr/local/bin/iot-agent → /usr/local/bin/fleet-agent
- Systemd user `iot-agent` → `fleet-agent`
- VM admin user `iot-admin` → `fleet-admin`
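
A minimal sketch of pinning the renamed wire surface in one place (module layout and names are illustrative, not the actual harmony code):

```rust
// Illustrative sketch, not the actual harmony module layout: centralize the
// renamed wire-surface constants so the CRD group and its finalizer can
// never drift apart.
pub const CRD_GROUP: &str = "fleet.nationtech.io";
pub const AGENT_CONFIG_ENV: &str = "FLEET_AGENT_CONFIG";

/// Derive the finalizer from the group, so any future rename is one edit.
pub fn finalizer() -> String {
    format!("{CRD_GROUP}/finalizer")
}

fn main() {
    println!("{}", finalizer()); // fleet.nationtech.io/finalizer
}
```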

Defaults
- Namespaces iot-system/iot-demo/iot-load → fleet-system/fleet-demo/fleet-load
- Helm release iot-nats → fleet-nats
- Helm release iot-operator-v0 → harmony-fleet-operator
- Container image localhost/iot-operator-v0:latest →
  localhost/harmony-fleet-operator:latest
- On-disk cache $HARMONY_DATA_DIR/iot/ → $HARMONY_DATA_DIR/fleet/
  (cloud-images, ssh keypairs, libvirt pool)

What stayed
- harmony-reconciler-contracts — already neutrally named
- Wire types (DeviceInfo, DeploymentState, HeartbeatPayload,
  DeploymentName) — already neutral
- KV buckets (device-info, device-state, device-heartbeat,
  desired-state) — already neutral
- CRD kind names (Deployment, Device) — already neutral
- NatsBasicScore / NatsHelmChartScore / HelmChart / etc. —
  framework-scope, unchanged

Verification
- cargo check --workspace --all-targets: clean
- All harmony lib tests (114), fleet-operator (6), fleet-agent
  (7), harmony-reconciler-contracts (13): green
- End-to-end load-test (20 devices / 3 CRs / 20s under
  fleet/scripts/load-test.sh): PASS. Image built as
  localhost/harmony-fleet-operator:latest, chart installed as
  release harmony-fleet-operator in namespace fleet-system,
  all CR aggregates correct.

Zero stragglers: grep across the tree for \biot\b / IOT_ /
\bIot[A-Z] returns empty (excluding docs explicitly talking about
IoT as the first customer's domain).
2026-04-23 11:10:10 -04:00
61cdb9c326 refactor(examples): rename iot_apply_deployment → harmony_apply_deployment
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
Addresses the review point that the applier CLI was anchored in IoT
vocabulary, but the CRD it applies is a generic declarative-
reconcile intent that works for Pi podman today and OKD / KVM /
anything-reconcilable tomorrow. The name now reflects what it
actually does.

Mechanical rename: crate, binary, `PatchParams::apply(...)` field
manager, doc comments, every reference in smoke-a4.sh, the
v0_1_plan.md Chapter 1 section, and the example itself. The CRD
types + paths + operator name are *not* touched by this commit —
that's the broader rebrand, planned for a dedicated branch.

- examples/iot_apply_deployment/ → examples/harmony_apply_deployment/
- crate name: example_iot_apply_deployment → example_harmony_apply_deployment
- binary name: iot_apply_deployment → harmony_apply_deployment
- PatchParams field manager: "iot-apply-deployment" → "harmony-apply-deployment"

0 stragglers: `grep example_iot_apply_deployment` across the tree
returns empty.
2026-04-23 11:00:19 -04:00
4254a2092c refactor(nats): share the helm-chart primitive across all NATS scores
Addresses the review point that NatsBasicScore was a parallel
typed-k8s_openapi path — reinventing probes, resource shapes, pod
anti-affinity, JetStream storage — instead of reusing what
NatsK8sScore already does via the upstream nats/nats helm chart.
Every shape the project will ever ship (supercluster, single node,
TLS, gateway, leaf nodes) is expressible as values on that chart.
Parallel resource construction was churn waiting to diverge.

The shape now:

  HelmChartScore              [existing helm-install primitive]
      ▲
      │ pins chart + repo
      │
  NatsHelmChartScore (new)    [exposes values_yaml only]
      ▲                ▲
      │                │
  NatsBasicScore   NatsK8sScore
   (single node)   (supercluster + TLS + gateways)
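
The layering reads as plain data flow; a toy reduction of the three types (stand-in structs, not the real Score implementations; the repository URL is assumed to be the upstream nats-io chart repo):

```rust
// Stand-in structs, not the real Score implementations: the point is only
// that the NATS layer pins chart + repo and exposes values_yaml, while the
// presets above it differ solely in the values they render.
struct HelmChartScore {
    repository: String,
    chart_name: String,
    values_yaml: String,
}

struct NatsHelmChartScore {
    values_yaml: String,
}

impl NatsHelmChartScore {
    /// Lower into the generic helm primitive with chart + repo pinned.
    fn into_helm(self) -> HelmChartScore {
        HelmChartScore {
            // Assumed to be the upstream nats-io chart repository.
            repository: "https://nats-io.github.io/k8s/helm/charts/".into(),
            chart_name: "nats/nats".into(),
            values_yaml: self.values_yaml,
        }
    }
}

fn main() {
    let helm = NatsHelmChartScore { values_yaml: "replicaCount: 1\n".into() }.into_helm();
    println!("{} via {}", helm.chart_name, helm.repository);
}
```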

Changes:

- Delete harmony/src/modules/nats/node.rs (279 lines of typed
  k8s_openapi Deployment/Service/Namespace — gone).

- New harmony/src/modules/nats/helm_chart.rs: NatsHelmChartScore
  pins chart_name = "nats/nats" and its official repository;
  values_yaml is the only varying input. Implements Score<T> for
  any topology with HelmCommand; caller hands it to
  K8sBareTopology / HAClusterTopology / K8sAnywhereTopology.

- Rewrite score_nats_basic.rs as a thin preset: build a minimal
  single-node values_yaml (fullnameOverride, replicaCount=1,
  cluster.enabled=false, jetstream on/off, service type via the
  chart's `service.merge.spec.type` knob, optional image
  override). 10 unit tests on render_values covering every
  builder combination + image-ref splitting. Score bound moves
  from `T: K8sclient` to `T: HelmCommand` since installation is
  now helm-based.

- score_nats_k8s.rs: last step in deploy_nats switches from a
  hand-constructed HelmChartScore to NatsHelmChartScore::new(...).
  Supercluster values_yaml construction untouched — a supercluster
  is just a more elaborate values file against the same chart.

- bare_topology.rs: add `impl HelmCommand for K8sBareTopology`
  so the in-load-test flow (K8sBareTopology → NatsBasicScore →
  NatsHelmChartScore → HelmChartScore) compiles. Returns a bare
  `helm` command; KUBECONFIG resolution mirrors how HAClusterTopology
  does it.

- mod.rs: export NatsHelmChartScore + the re-shaped NatsServiceType.

- load-test.sh: the nats/nats chart provisions a StatefulSet, not
  a Deployment. Wait on `pod -l app.kubernetes.io/name=nats`
  instead of `deployment/iot-nats` — works across workload kinds.

Tests:
- 2 helm_chart unit tests (chart+repo pinning, default install-
  upgrade semantics)
- 10 score_nats_basic unit tests covering every values shape
- Full load-test.sh e2e (20 devices / 3 CRs / 20s): PASS.
2026-04-23 10:58:17 -04:00
61d3a6b757 feat(iot/chart): typed variants + CRD-keep + Pod security context
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
Three production-path improvements bundled into one chart change,
all verified end-to-end (helm lint + load-test pass):

1. Switch from `HelmResourceKind::from_serializable(...)` to the
   typed `HelmResourceKind::{Namespace, ServiceAccount, ClusterRole,
   ClusterRoleBinding, Crd}` variants added to the shared harmony
   helm module. Serialization output is byte-equivalent; IDE
   discoverability + type-safety go up.

2. Annotate both CRDs with `helm.sh/resource-policy: keep`. Without
   this, `helm uninstall iot-operator-v0` cascade-deletes the CRDs;
   the kube GC then deletes every Deployment CR and every Device CR;
   the operator finalizer fires on each deletion and wipes the
   `desired-state` KV; agents tear down every container. One typo
   on uninstall would be fleet-wide catastrophe. `keep` makes
   uninstall data-preserving and idempotent — wipe requires an
   explicit `kubectl delete crd …`.

3. Lock down the operator Pod's securityContext:
   - `runAsNonRoot: true`
   - `readOnlyRootFilesystem: true`
   - `allowPrivilegeEscalation: false`
   - `capabilities: drop [ALL]`
   - `seccompProfile: RuntimeDefault`
   Deliberately *no* `runAsUser` — OpenShift's `restricted-v2` SCC
   assigns namespace-specific UIDs and rejects fixed ones. The
   image's `USER 65532:65532` (Dockerfile) gives vanilla k8s a
   non-root UID; OpenShift's SCC overrides with its own. Same chart
   works on both without custom SCC bindings.

Dockerfile adds `USER 65532:65532` — required for vanilla k8s to
accept `runAsNonRoot: true` without a Pod-level `runAsUser`. 65532
is the distroless/chainguard `nonroot` convention; arbitrary but
safe (no overlap with common system UIDs).
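
For reference, the restricted securityContext above renders to this shape; a hedged sketch that just emits the YAML block (illustrative helper, not the chart code):

```rust
// Illustrative helper, not the chart code: render the restricted
// securityContext described above. Note the deliberate absence of
// runAsUser, so OpenShift's restricted-v2 SCC can inject its own UID.
fn security_context_yaml() -> String {
    [
        "securityContext:",
        "  runAsNonRoot: true",
        "  readOnlyRootFilesystem: true",
        "  allowPrivilegeEscalation: false",
        "  capabilities:",
        "    drop: [\"ALL\"]",
        "  seccompProfile:",
        "    type: RuntimeDefault",
    ]
    .join("\n")
}

fn main() {
    println!("{}", security_context_yaml());
}
```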

Tests: 2 chart unit tests locking in the keep annotation + SC
shape. End-to-end load test at 20 devices / 3 CRs: pod comes up
clean under the restricted SC, all aggregates correct, zero
operator warnings.
2026-04-23 10:32:03 -04:00
20b94dfacf feat(harmony/helm): typed HelmResourceKind variants for RBAC + Namespace + CRD
Extends HelmResourceKind with typed variants for Namespace,
ServiceAccount, ClusterRole, ClusterRoleBinding, and
CustomResourceDefinition. Previously only Service + Deployment
had typed variants; everything else went through the
`from_serializable`/`CustomYaml` escape hatch.

The escape hatch stays (documented as "always prefer a typed
variant") for forward-compat with types we haven't imported yet.
Any consumer currently using `from_serializable` for one of the
new typed variants can switch; serialization output is byte-
equivalent (both paths route through serde_yaml on the same
k8s_openapi struct).

Motivation: every Rust operator built on harmony wants the same
five resources — Namespace, SA, ClusterRole, ClusterRoleBinding,
CRD — to be chart-template-ready. Typing them once here means
every operator's chart.rs stays short and IDE-discoverable
instead of a string-of-from_serializable-calls.

Filenames carry the resource name where applicable
(serviceaccount-<name>.yaml, clusterrole-<name>.yaml, etc.) so
charts with multiple ClusterRoles don't collide on a single
`clusterrole.yaml` file.

2 unit tests: unique-filename invariant across the five typed
variants, and crd-name round-trip.
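
The unique-filename invariant is easy to picture; a hypothetical reduction of the typed variants (the real HelmResourceKind wraps k8s_openapi structs rather than bare names):

```rust
// Hypothetical reduction of the typed-variant + filename idea; the real
// HelmResourceKind carries full k8s_openapi resource structs.
enum HelmResourceKind {
    Namespace(String),
    ServiceAccount(String),
    ClusterRole(String),
    ClusterRoleBinding(String),
    Crd(String),
}

impl HelmResourceKind {
    /// Filenames carry the resource name so two ClusterRoles in one chart
    /// land in distinct template files.
    fn filename(&self) -> String {
        use HelmResourceKind::*;
        match self {
            Namespace(n) => format!("namespace-{n}.yaml"),
            ServiceAccount(n) => format!("serviceaccount-{n}.yaml"),
            ClusterRole(n) => format!("clusterrole-{n}.yaml"),
            ClusterRoleBinding(n) => format!("clusterrolebinding-{n}.yaml"),
            Crd(n) => format!("crd-{n}.yaml"),
        }
    }
}

fn main() {
    let a = HelmResourceKind::ClusterRole("reader".into());
    let b = HelmResourceKind::ClusterRole("writer".into());
    println!("{} vs {}", a.filename(), b.filename());
}
```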
2026-04-23 10:27:11 -04:00
3d39b670dd feat(iot-agent): config-driven routing labels
Before: the agent published only `device-id=<id>` on DeviceInfo,
which collapsed every Deployment.spec.targetSelector to "target one
device by id" — usable, but not the actual scalability story. The
K8s-Node analogue wants kubelet-declared node labels driving
DaemonSet nodeSelector; we were missing the equivalent.

After: a new `[labels]` section in the agent's TOML config, set by
IotDeviceSetupScore and plumbed through to every DeviceInfo
publish. Config labels merge with the default `device-id` on
startup. Re-running the Score with a changed label map regenerates
the TOML, triggers the byte-compare idempotency path, restarts the
agent; new labels propagate into Device.metadata.labels and
Deployment selectors re-resolve on the operator side. Manual TOML
edits + `systemctl restart iot-agent` remain the break-glass path.
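
The startup merge can be sketched as follows (assumed shape, not the agent's exact code):

```rust
use std::collections::BTreeMap;

// Sketch of the startup merge: start from the default device-id label, then
// overlay config labels so the config wins on key conflicts. BTreeMap keeps
// iteration sorted, which is what keeps the rendered TOML byte-stable.
fn effective_labels(
    device_id: &str,
    config_labels: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut labels = BTreeMap::new();
    labels.insert("device-id".to_string(), device_id.to_string());
    // `extend` overwrites existing keys, so config wins on conflicts.
    labels.extend(config_labels.clone());
    labels
}

fn main() {
    let mut cfg = BTreeMap::new();
    cfg.insert("site".to_string(), "plant-7".to_string());
    println!("{:?}", effective_labels("dev-01", &cfg));
}
```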

Scope:
- iot/iot-agent-v0/src/config.rs: `labels: BTreeMap<String,String>`
  on AgentConfig, defaults to empty via #[serde(default)]. Two
  parse tests cover the "section present" + "section absent"
  cases.
- iot/iot-agent-v0/src/main.rs: merge cfg.labels with the default
  `device-id` entry before DeviceInfo publish. Config wins on
  key conflicts — unusual but legal.
- harmony/src/modules/iot/setup_score.rs: IotDeviceSetupConfig
  gains `labels: BTreeMap<String,String>` (replacing the
  dedicated `group` field — group is just a conventional label
  now, not a distinct axis). render_toml renders a [labels]
  section; BTreeMap iteration guarantees sorted output so the
  Score's byte-compare change detection stays idempotent. Three
  unit tests: section content, byte-identical rendering across
  runs, value escaping.
- examples/iot_vm_setup/src/main.rs: `--labels key=val,key=val`
  with a parser that errors on malformed chunks, empty keys/values,
  or an empty map (a device with no labels is practically
  untargetable, better to fail at the CLI than onboard a ghost).
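
A sketch of what such a parser might look like, assuming the error behavior described above (hypothetical helper, not the example's exact code):

```rust
use std::collections::BTreeMap;

// Hypothetical parser matching the behavior described above: error on
// malformed chunks, empty keys or values, and an empty map.
fn parse_labels(arg: &str) -> Result<BTreeMap<String, String>, String> {
    let mut out = BTreeMap::new();
    for chunk in arg.split(',') {
        let (k, v) = chunk
            .split_once('=')
            .ok_or_else(|| format!("malformed label chunk: {chunk:?}"))?;
        if k.trim().is_empty() || v.trim().is_empty() {
            return Err(format!("empty key or value in chunk: {chunk:?}"));
        }
        out.insert(k.trim().to_string(), v.trim().to_string());
    }
    if out.is_empty() {
        return Err("label map may not be empty".to_string());
    }
    Ok(out)
}

fn main() {
    println!("{:?}", parse_labels("site=plant-7,rack=b2"));
}
```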

Live label changes require an agent restart (same as kubelet's
--node-labels on a running Node). Edit-labels-on-running-fleet
is a later chapter; for v0 the restart cost is negligible.

Tests: 7 iot-agent + 3 iot setup_score + existing operator/
contracts suite — all green.
2026-04-23 10:25:25 -04:00
a616204b1c refactor(nats): extract typed single-node primitive; NatsBasicScore becomes a thin wrapper
Some checks failed
Run Check Script / check (pull_request) Failing after 54s
Addresses the review point that NatsBasicScore was introduced as a
parallel NATS path instead of sharing primitives with the rest of
the module. The render logic (Deployment + Service + Namespace for
one NATS server pod) is now pulled into a new `nats::node`
module built on ADR 018 — typed k8s_openapi structs, no helm
templating — and NatsBasicScore is a high-level preset that sets
defaults on a NatsNodeSpec and runs the shared render fns.

Module-level doc on `nats::node` explicitly flags that future
high-level scores (clustered, TLS, gateway) should grow the spec
and reuse the same primitive, and that NatsK8sScore +
NatsSuperclusterScore are scheduled to migrate onto this primitive
in a follow-up so the helm-templating path disappears entirely
from the NATS module.

7 unit tests between node (the primitive) + score_nats_basic (the
wrapper) cover service-type routing + JetStream flag propagation.
2026-04-23 09:48:42 -04:00
1df0ba7cdc refactor(iot): drop --system from iot-agent; add optional admin password
Two changes with a single motivation — make the iot-agent runtime
robust under multi-user hosts + unblock chaos-testing workflows
on the VM admin user.

1. iot-agent user is no longer --system.
   Rootless podman needs subuid/subgid ranges in /etc/subuid +
   /etc/subgid before layer unpacking. Ubuntu's useradd --system
   deliberately skips those allocations (system users aren't
   expected to run user namespaces), so we were patching the gap
   with a hardcoded "usermod --add-subuids 100000-165535". That
   range collides with any other user on the host that also runs
   rootless containers — a real footgun. Dropping --system lets
   useradd's default allocator pick a non-overlapping range, and
   the whole ensure_subordinate_ids trait method + ansible impl
   goes away as dead code.

2. VmFirstBootConfig.admin_password (Option<String>).
   When set, cloud-init unlocks the account and enables
   ssh_pwauth on the guest — intended for reliability / chaos
   testing sessions where the operator wants to log in and break
   things on purpose. Default is still key-only auth.
   example_iot_vm_setup plumbs a --admin-password flag +
   IOT_VM_ADMIN_PASSWORD env var; smoke-a4 passes them through
   so chaos sessions are one env var away from a ready VM.
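
A minimal sketch of the optional-password rendering, assuming standard cloud-init keys (`lock_passwd`, `plain_text_passwd`, `ssh_pwauth`); the real template also YAML-escapes values:

```rust
// Illustrative cloud-init user-data rendering, not the actual harmony
// template. Only when a password is present does the guest get an unlocked
// account + ssh_pwauth; the default stays key-only. Real code must also
// YAML-escape the password value.
fn render_user_data(admin_user: &str, admin_password: Option<&str>) -> String {
    let mut out = String::from("#cloud-config\nusers:\n");
    out.push_str(&format!("  - name: {admin_user}\n"));
    match admin_password {
        Some(pw) => {
            out.push_str("    lock_passwd: false\n");
            out.push_str(&format!("    plain_text_passwd: {pw}\n"));
            out.push_str("ssh_pwauth: true\n");
        }
        None => out.push_str("    lock_passwd: true\n"),
    }
    out
}

fn main() {
    println!("{}", render_user_data("fleet-admin", None));
}
```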

3 cloud-init unit tests cover the locked + unlocked + YAML-escape
paths.
2026-04-23 09:48:36 -04:00
24b8282b7f feat(iot): Chapter 3 — operator helm chart (local, no registry)
Some checks failed
Run Check Script / check (pull_request) Failing after 50s
Generates a self-contained helm chart directory from typed Rust
(ADR 018 — Template Hydration). The chart packages:

- Deployment CRD (from Deployment::crd())
- Device CRD (from Device::crd())
- ServiceAccount, ClusterRole, ClusterRoleBinding with the exact
  verbs the operator uses — nothing aspirational
- operator Deployment (image, env NATS_URL + RUST_LOG)

No hand-authored yaml, no Helm templating. Re-run the chart
subcommand to regenerate for different inputs. When a publishable
chart is needed (user-facing `values.yaml`), layer a templating
pass on this output; for the load test the plain chart is enough.

New surface:
- `iot-operator-v0 chart --output <dir> [--image ... --nats-url ...]`
  writes the chart tree and prints its path.
- `iot/iot-operator-v0/Dockerfile` — minimal archlinux:base wrapper
  around the host-built release binary (glibc-ABI match without a
  two-stage Docker build).

load-test.sh: drops the host-side operator spawn entirely. Phase 3
now builds the operator image, sideloads it into k3d via `podman
save | docker load | k3d image import`, generates the chart via
the `chart` subcommand, and `helm upgrade --install` it into the
cluster. `dump_operator_log` pulls `kubectl logs` into the stable
work dir so HOLD=1 + failure-tail hooks keep working.

Two gotchas debugged along the way, preserved in code comments:
- workspace `.dockerignore` excludes `target/`, so the image build
  uses a staged build context under $WORK_DIR/image-ctx.
- `podman build -t foo/bar:tag` stores as
  `localhost/foo/bar:tag`, which k3d image import can't find under
  the original tag. Use `localhost/iot-operator-v0:latest` as the
  canonical image ref end-to-end.

Load-test results (selector architecture, operator in helm-
installed pod, same envelope as the host-side baseline):

| Scale | Duration | Writes | Rate | Errors | CR aggregates |
|-------|---------:|-------:|-----:|-------:|:-------------:|
| 20 devices / 3 CRs | 20s | 400 | 20/s | 0 | 3/3 ok |
| 10k devices / 1000 CRs | 120s | 1,201,967 | 10,009/s | 0 | 1000/1000 ok |

No operator warnings, no errors across the run. Image build +
sideload + helm install adds ~30s to startup; steady-state
throughput unchanged from host-side.
2026-04-23 06:57:56 -04:00
173f549918 chore(iot): roadmap doc sync + code review pass
Roadmap:
- v0_1_plan.md Chapter 2: rewrite to describe the shipped selector +
  Device CRD model (matchedDeviceCount, LabelSelector, per-concern KV).
  Drop AgentStatus / observed_score_string / target_devices references.
  Update "State of the world" preamble to match 2026-04-23 reality.
- chapter_4_aggregation_scale.md: SUPERSEDED banner at top with a
  clear what-was-kept vs. what-was-dropped summary. Original body
  preserved as decision-trail archaeology.

Code review pass on the iot crates, behavior-preserving:
- fleet_aggregator: owned_targets is now keyed by DeploymentName
  (matches the KV key space — globally unique, no namespace). The
  old DeploymentKey keying created an orphan-leak on operator
  restart: seed_owned_targets stashed entries under a sentinel
  namespace ("") that on_deployment_upsert never merged. Now
  seeding populates the map correctly so restart + selector change
  diffs properly.
- fleet_aggregator: reuse the Client passed into run() for the
  patch_api instead of calling Client::try_default() a second time.
- fleet_aggregator: delete _use_list_params / _use_deployment_spec
  placeholder scaffolding + unused ListParams / DeploymentSpec /
  ScorePayload imports. Inline one-liner serialize_score.
- fleet_aggregator: clean up `then(|| ...)` → filter/map split.
- device_reconciler: `is_label_value(v).then_some(()).is_some()`
  → plain `is_label_value(v)`.
- crd: delete speculative DeviceStatus + DeviceCondition (no one
  writes to them; the comment in DeviceSpec documents where they'd
  land when a heartbeat-reflection reconciler shows up).
- controller: compute `obj.name_any()` once in cleanup().

All 24 tests green. End-to-end load test (20 devices / 3 groups /
20s) PASS after the changes.
2026-04-23 06:35:36 -04:00
8a6a9f1a03 refactor(iot): Deployment.targetSelector + Device CRD (DaemonSet-like)
Kills the "CRD owns a list of device ids" smell. Deployment CR now
carries a standard K8s LabelSelector; Device is a first-class cluster-
scoped CR (like Node). Matching, desired-state KV writes, and status
aggregation all run off selector evaluation against the Device cache
— no list of device ids anywhere in the CRD spec.

Cross-resource model:
- Agent publishes DeviceInfo (with labels) to NATS `device-info` KV.
- device_reconciler watches that bucket → server-side-applies a
  cluster-scoped Device CR with metadata.labels + spec.inventory.
- Deployment controller is now just validation + finalizer cleanup.
- fleet_aggregator watches Deployment CRs + Device CRs + device-state
  KV, maintains in-memory selector → target device sets, writes/deletes
  `desired-state.<device>.<deployment>` KV on match changes, patches
  `.status.aggregate` at 1 Hz with matchedDeviceCount + phase counters.
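
The selector evaluation at the heart of this can be sketched for the matchLabels case (the shipped code evaluates a full K8s LabelSelector, including matchExpressions):

```rust
use std::collections::BTreeMap;

// matchLabels-only sketch; an empty selector matches everything, as in K8s.
fn matches(
    match_labels: &BTreeMap<String, String>,
    device_labels: &BTreeMap<String, String>,
) -> bool {
    match_labels
        .iter()
        .all(|(k, v)| device_labels.get(k) == Some(v))
}

fn main() {
    let sel = BTreeMap::from([("site".to_string(), "plant-7".to_string())]);
    let dev = BTreeMap::from([
        ("site".to_string(), "plant-7".to_string()),
        ("device-id".to_string(), "dev-01".to_string()),
    ]);
    println!("{}", matches(&sel, &dev)); // true
}
```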

Applied CRD shape verified on a live k3d cluster:
  kubectl get crd deployments.iot.nationtech.io -o json
    .spec.versions[0].schema.openAPIV3Schema.properties.spec
      → rollout / score / targetSelector (matchLabels + matchExpressions)
    .spec.versions[0].schema.openAPIV3Schema.properties.status.aggregate
      → matchedDeviceCount / succeeded / failed / pending / lastError
  kubectl get crd devices.iot.nationtech.io -o json
    .spec.scope = "Cluster"
    .spec.versions[0].schema.openAPIV3Schema.properties.spec
      → inventory (nullable, camelCased fields)

Load-test run: DEVICES=20 GROUP_SIZES=10,5,5 DURATION=20
  all 3 CRs hit expected matched=N / succeeded+failed+pending=N.

Other changes:
- k8s-openapi gets the `schemars` feature so LabelSelector derives JsonSchema.
- InventorySnapshot uses `#[serde(rename_all = "camelCase")]` for consistency with the rest of the CRD schema.
- agent publishes `device-id=<id>` as a default label so the
  example_iot_apply_deployment `--target-device <id>` shorthand
  works out-of-the-box (implemented as `--selector device-id=<id>`).
- example_iot_apply_deployment gains `--selector key=value` repeatable flag.
- load-test.sh explore banner exposes Device CR commands + new
  matchedDeviceCount column.
2026-04-22 22:55:38 -04:00
5e8e72df52 feat(iot-load-test): stable paths + HOLD=1 interactive mode
Some checks failed
Run Check Script / check (pull_request) Failing after 52s
- Stable working dir under /tmp/iot-load-test/ — kubeconfig at
  /tmp/iot-load-test/kubeconfig, operator log at
  /tmp/iot-load-test/operator.log. No more chasing mktemp paths.

- Print an explore banner before the load run so the user can
  `export KUBECONFIG=...` and `kubectl get deployments -w` in
  another terminal while the load actually runs.

- HOLD=1 env var keeps the stack alive after the load completes;
  script blocks on sleep until Ctrl-C. Forwards --keep to the
  binary so CRs + KV entries stay in place for inspection.

- DEBUG=1 bumps operator RUST_LOG to surface every status patch.

- Keep operator.log after successful runs (cheap, often useful).

- Load-test binary: --cleanup bool → --keep flag (clap bool with
  default_value_t = true doesn't accept `--cleanup=false`).
2026-04-22 21:59:26 -04:00
4d0aa069e5 perf(iot-load-test): parallel CR apply + DeviceInfo seed via tokio::JoinSet
Sequential apply was fine at 10 groups; becomes the startup bottleneck
at 1000. 32-way concurrent CR apply lands 1000 Deployment CRs in ~1.6s;
64-way concurrent DeviceInfo seed seeds 10k devices in ~0.3s.

Also zero-pad CR names and device ids to the largest width so large
runs sort lexicographically in kubectl.
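
The zero-padding trick in isolation (hypothetical helper name):

```rust
// Hypothetical helper: pad every index to the width of the largest one so
// generated names sort lexicographically in kubectl output.
fn padded_name(prefix: &str, index: usize, max_index: usize) -> String {
    let width = max_index.to_string().len();
    format!("{}-{:0width$}", prefix, index, width = width)
}

fn main() {
    println!("{}", padded_name("group", 7, 1000)); // group-0007
}
```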
2026-04-22 21:55:30 -04:00
ce7ad75dbf feat(iot): synthetic load test for fleet_aggregator + operator NATS connect retry
- example_iot_load_test: simulates N devices (default 100 across 10
  groups: 55 + 9×5) pushing DeploymentState every tick to NATS, no
  real podman. Applies one Deployment CR per group, runs for a
  bounded duration, verifies each CR's .status.aggregate counters
  sum to the target device count.

- iot/scripts/load-test.sh: minimal harness — k3d cluster + NATS via
  NatsBasicScore + CRD + operator + load-test binary. No VM, no
  agent build.

- operator: connect_with_retry() on startup. The NATS TCP probe that
  the smoke scripts do isn't enough to guarantee the protocol
  handshake is ready (k3d loadbalancer can accept SYNs before the
  pod is serving); the load harness hit this racing against a
  freshly-rebuilt operator binary.

- drop unused rand dep from iot-agent-v0 Cargo.toml.

100-device run: 6002 state writes in 60s at a clean 100 writes/s,
all 10 CR aggregates converge to target_devices.len() (e.g.
group-00 → 55 = 45 Running + 9 Failed + 1 Pending).
2026-04-22 21:43:02 -04:00
5c65ba71cc fix(iot-operator): watch device-state with LastPerSubject, not StartSequence(0)
`bucket.watch_all_from_revision(0)` sends the JetStream consumer
request with DeliverByStartSequence and an optional-missing start
sequence, which the server rejects with error 10094:

  consumer delivery policy is deliver by start sequence, but
  optional start sequence is not set

`watch_with_history(">")` uses DeliverPolicy::LastPerSubject instead —
replays the current value of every key, then streams live updates.
Same cold-start-plus-steady-state semantics, correct wire.

Caught by smoke-a4 --auto: state watcher exited immediately on
startup, no deployments ever reconciled.
2026-04-22 21:17:52 -04:00
9e42c15901 refactor(iot/smoke): update smoke scripts for new KV wire layout
- agent-status bucket -> device-heartbeat bucket
- status.<device> key -> heartbeat.<device>
- drop parity check summary from smoke-a4 (legacy path is gone)
- tidy stale AgentStatus comment in agent main
2026-04-22 21:10:55 -04:00
2d99880770 refactor(iot): operator watches device-state KV directly; drop event stream
Collapses the Chapter 4 event-stream architecture into pure KV watch.
The operator was maintaining a durable JetStream consumer on
device-state-events in parallel with the KV bucket it was meant to
shadow — the stream was an optimization over KV scanning, but with
async-nats's ordered bucket watch it's redundant.

Gone:
- StateChangeEvent, LifecycleTransition, STREAM_DEVICE_STATE_EVENTS,
  state_event_subject, STATE_EVENT_WILDCARD (contracts)
- Revision, AgentEpoch (contracts) — restart ordering now handled by
  DeploymentState.last_event_at monotonic check
- PhaseCounters.apply_event + incremental diff machinery (operator) —
  counters recomputed per dirty CR from the states snapshot
- RecordedTransition + publish_transition split (agent) — without an
  event to publish, the pure/publish boundary has no reason to exist
- Agent sequence counter + agent_epoch generation (agent main.rs)
- CR aggregate fields recent_events, last_heartbeat_at, unreported —
  never populated, pure speculation

New shape:
- fleet_aggregator.rs watches device-state via bucket.watch_all_from_revision(0)
- apply_state / drop_state mutate an in-memory snapshot
- patch_tick refreshes CR index from kube, recomputes aggregates for
  CRs marked dirty, patches CR status
- DeploymentAggregate = succeeded/failed/pending + last_error only
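
The new shape in miniature; a toy snapshot + dirty-set aggregator (assumed structure, stripped of all kube and NATS wiring):

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

// Toy version of the flow above: apply_state / drop_state mutate an
// in-memory map and mark the owning deployment dirty; patch_tick recomputes
// counters only for dirty deployments, then clears the dirty set.
#[derive(Clone, Copy)]
enum Phase { Pending, Running, Failed }

#[derive(Default)]
struct Snapshot {
    // (device, deployment) -> phase
    states: HashMap<(String, String), Phase>,
    dirty: HashSet<String>,
}

impl Snapshot {
    fn apply_state(&mut self, device: &str, deployment: &str, phase: Phase) {
        self.states.insert((device.to_string(), deployment.to_string()), phase);
        self.dirty.insert(deployment.to_string());
    }

    fn drop_state(&mut self, device: &str, deployment: &str) {
        let key = (device.to_string(), deployment.to_string());
        self.states.remove(&key);
        self.dirty.insert(deployment.to_string());
    }

    /// Recompute (succeeded, failed, pending) per dirty deployment.
    fn patch_tick(&mut self) -> BTreeMap<String, (usize, usize, usize)> {
        let mut out = BTreeMap::new();
        for dep in self.dirty.drain() {
            let mut agg = (0, 0, 0);
            for ((_, d), phase) in &self.states {
                if *d == dep {
                    match phase {
                        Phase::Running => agg.0 += 1,
                        Phase::Failed => agg.1 += 1,
                        Phase::Pending => agg.2 += 1,
                    }
                }
            }
            out.insert(dep, agg);
        }
        out
    }
}

fn main() {
    let mut s = Snapshot::default();
    s.apply_state("dev-01", "web", Phase::Running);
    s.apply_state("dev-02", "web", Phase::Pending);
    println!("{:?}", s.patch_tick());
}
```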

Line counts (3 iot crates):
  4263 -> 3090 -> 2162 (-49% overall, -30% this pass)

Tests: 24 total (13 contracts + 6 operator + 5 agent), all green.
2026-04-22 21:09:09 -04:00
d28cc6a184 refactor(iot): drop LogEvent type + log subject helpers
Zero consumers, zero publishers — pure speculative surface area.
Drops LogEvent struct, EventSeverity enum, STREAM_DEVICE_LOG_EVENTS,
log_event_subject, logs_subject, logs_query_subject.

If per-device log streaming lands later, it arrives with a real
consumer attached.

Contracts tests: 21 → 19 (removed two roundtrip tests for the deleted type).
2026-04-22 20:57:35 -04:00
9b35bc5314 refactor(iot): delete legacy AgentStatus path; event-driven aggregation is now authoritative
Chapter 4 shipped per-concern wire types (DeviceInfo, DeploymentState,
HeartbeatPayload, StateChangeEvent) as replacements for the monolithic
AgentStatus heartbeat. The parity check proved the new path matches the
legacy one; legacy now goes.

Removed:
- AgentStatus, DeploymentPhase, EventEntry, agent-status bucket, status_key
- iot-operator-v0/src/aggregate.rs (legacy full-recompute aggregator)
- Parity machinery in fleet_aggregator.rs (ParityStats, parity_tick, dual-write)
- Agent recent_events ring + push_event (consumed only by AgentStatus)
- publish_log_event + device-log-events stream (no consumer, YAGNI)

fleet_aggregator now drives CR .status.aggregate directly: event consumer
maintains counters incrementally, 1 Hz patch_tick flushes only deployments
in the `dirty` set.

Net: ~1000 lines removed (4263 → 3216 across the three iot crates).
Wire surface: 5 types → 4. Operator tasks: 4 → 2 (controller + aggregator).

Tests: 21 contracts + 9 operator + 6 agent — all green.
2026-04-22 20:54:39 -04:00
2f08643aa0 refactor(iot): DeploymentName + Revision newtypes; LifecycleTransition models deletion; fixes bugs #1 and #2 from the review
Newtypes (review point #3) were the entry. Introducing them forced
the event-payload redesign, and the redesign made the other two
bugs obvious + trivial to fix.

New contract types (harmony-reconciler-contracts::fleet):
  - DeploymentName: validated newtype. Rejects empty, > 253 bytes,
'.' (it would alias an extra NATS subject token), NATS wildcards, and
    whitespace. Serde impl validates on deserialize so a malformed
    payload is rejected at the wire, not later.
  - AgentEpoch(u64): random-per-process. Prefixes every sequence.
  - Revision { agent_epoch, sequence } with lexicographic Ord.
  - LifecycleTransition enum: Applied { from, to, last_error } |
    Removed { from }. Replaces (from: Option<Phase>, to: Phase) so
    deletion is modeled explicitly in the wire format.
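
The validation rules can be sketched as a plain constructor (the shipped newtype also enforces them on serde deserialize, so bad payloads die at the wire):

```rust
// Sketch of the DeploymentName rejection paths listed above; the real type
// additionally validates inside its Deserialize impl.
#[derive(Debug, PartialEq)]
struct DeploymentName(String);

impl DeploymentName {
    fn new(s: &str) -> Result<Self, String> {
        if s.is_empty() {
            return Err("empty".into());
        }
        if s.len() > 253 {
            return Err("longer than 253 bytes".into());
        }
        if s.contains('.') {
            return Err("'.' would alias an extra NATS subject token".into());
        }
        if s.contains('*') || s.contains('>') {
            return Err("NATS wildcard".into());
        }
        if s.chars().any(char::is_whitespace) {
            return Err("whitespace".into());
        }
        Ok(DeploymentName(s.to_string()))
    }
}

fn main() {
    println!("{:?}", DeploymentName::new("web-frontend"));
}
```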

Bug fixes that fell out of the redesign:

  #1 (drop_phase was silent on the wire): `drop_phase` now
     produces a RecordedTransition with Removed { from }, which
     the publisher serializes into a StateChangeEvent. Operator
     applies the Removed variant by decrementing `from` without
     a paired increment. Counters no longer over-count after
     deletions.

  #2 (sequence reset on agent restart): (agent_epoch, sequence)
     lexicographic ordering means the first post-restart event
     (seq=1 under a fresh epoch) outranks any pre-restart event
     the operator had applied. No more silently-dropped events
     after an agent crash.
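
The restart-ordering fix in miniature: derived Ord compares fields in declaration order, giving exactly the lexicographic (agent_epoch, sequence) comparison described above.

```rust
// Derived Ord on a struct is lexicographic over its fields in declaration
// order, so any event under a larger epoch outranks every event from a
// smaller one regardless of sequence number.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Revision {
    agent_epoch: u64,
    sequence: u64,
}

fn main() {
    let pre_restart = Revision { agent_epoch: 1, sequence: 9_999 };
    let post_restart = Revision { agent_epoch: 2, sequence: 1 };
    println!("{}", post_restart > pre_restart); // true
}
```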

Split recommended in review point #4:
  - `record_apply` / `record_remove`: pure in-memory state
    updates returning Option<RecordedTransition>.
  - `publish_transition`: side-effectful wire emission.
  - `apply_phase` / `drop_phase`: thin composite helpers the
    hot path uses.

Typed keys in the operator:
  - DevicePair { device_id, deployment: DeploymentName } replaces
    (String, String) so the two identifiers can't be swapped.
  - FleetState.deployment_namespace is keyed by DeploymentName.
  - Controller's kv_key signature takes &DeploymentName; invalid
    CR names surface as a clear Error rather than corrupting KV.

Tests:
  - 27 contract tests (roundtrip every payload shape, including
    forward-compat parsing; validate DeploymentName rejection
    paths; assert Revision ordering across epochs).
  - 19 operator fleet_aggregator tests, including regression
    guards named for the specific bugs:
      removed_transition_decrements_without_paired_increment  (#1)
      revision_ordering_handles_agent_restart                 (#2)
  - 8 agent reconciler tests (record_apply/record_remove purity,
    sequence monotonicity, agent_epoch stamping, ring buffer
    cap).

Agent main wires a fresh AgentEpoch via rand::random::<u64>() at
startup; FleetPublisher::connect takes it and includes it in every
DeviceInfo + state-change event.
2026-04-22 17:42:42 -04:00
367d63cfba test(iot/smoke-a4): clarify parity summary — matches are DEBUG-level so don't report them 2026-04-22 14:42:27 -04:00
3b111df578 fix(iot-operator): lazy namespace refresh in event consumer + relax smoke parity check
Two findings from the M4 smoke runs:

1. **Event consumer dropped events for unknown-namespace deployments.**
   The consumer receives state-change events but `apply_state_change_event`
   short-circuits when `deployment_namespace` doesn't have the
   deployment yet — common in the first 5 s after a new CR is
   applied, before the parity-tick's refresh loop runs.

   Fix: on unknown deployment, consumer eagerly does a kube
   `Api::list()` and populates the map. Subsequent events for
   that deployment are fast-path (map already has it).

   Also: added instrumentation on publish + receive paths so
   future debugging against the parity check produces actionable
   traces. Log level is DEBUG to keep INFO clean.
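Reduced to its cache logic (the kube `Api::list()` call replaced by a loader closure for illustration):

```rust
use std::collections::HashMap;

/// On an unknown deployment, refresh the map once, then look up again;
/// subsequent events for the same deployment hit the fast path.
fn namespace_of<'a, F>(
    cache: &'a mut HashMap<String, String>,
    deployment: &str,
    refresh: F, // stands in for the eager kube Api::list()
) -> Option<&'a String>
where
    F: FnOnce() -> HashMap<String, String>,
{
    if !cache.contains_key(deployment) {
        cache.extend(refresh()); // cache miss: eager repopulate
    }
    cache.get(deployment) // fast path once the map has it
}
```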

2. **Parity MISMATCH during transitions is correct behavior.**
   The legacy aggregator reads AgentStatus which the agent
   republishes every 30 s. Chapter 4 state-change events land in
   ~100 ms. So during a Pending→Running transition there's a
   window where the new counter shows succeeded=1 while legacy
   still shows pending=1 — precisely because the new path is
   faster, which is the point of this rework.

   The smoke's hard-fail-on-any-mismatch was too strict; relaxed
   to a diagnostic print. Steady state should still converge to
   zero mismatches once the next AgentStatus heartbeat lands; the
   summary lets the user spot sustained divergence by eye. M5
   removes the legacy path entirely, making the parity check
   moot.

Agent-side publish now also surfaces subject + sequence + stream-seq
on every state-change publish, as a similar diagnostic aid for
tracing wire deliveries.
2026-04-22 14:38:48 -04:00
cc8d908fcb fix(iot-agent/fleet-publisher): await PublishAckFuture so events are durably persisted
Chapter 4's parity check in smoke-a4 caught M4 dropping events —
operator's consumer saw 1 of 3 state transitions, parity-mismatch
assertion fired.

Root cause: async-nats's jetstream.publish() returns a
PublishAckFuture that must be awaited for the server to persist
the message. Without that await, the publish is effectively
fire-and-forget and drops under any backpressure — which on the
smoke's agent-first-boot path is every publish until the stream
state stabilizes.

Fix awaits both the publish future (send) and the returned
PublishAckFuture (server ack) for state-change + log events.
State-change events are warn-on-failure (operator needs them);
log events are debug-on-failure (device-side ring buffer is
authoritative).
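A toy synchronous model of why the second await matters — "confirmed" here stands for the client's knowledge that the server persisted the message; the real API is async-nats, where `publish().await` returns a `PublishAckFuture` that must itself be awaited:

```rust
// Two-stage publish, modeled synchronously (illustration only).
#[derive(Default)]
struct FakeJetStream {
    confirmed: Vec<String>, // messages the client knows the server persisted
}

struct AckFuture {
    msg: String,
}

impl FakeJetStream {
    /// Stage 1 — "send": returns the ack handle. Dropping it without
    /// "awaiting" models the fire-and-forget bug this commit fixes.
    fn publish(&self, msg: &str) -> AckFuture {
        AckFuture { msg: msg.to_string() }
    }

    /// Stage 2 — "await the ack": only now is the publish confirmed durable.
    fn await_ack(&mut self, fut: AckFuture) {
        self.confirmed.push(fut.msg);
    }
}
```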
2026-04-22 14:24:58 -04:00
6d4335771e test(iot/smoke-a4): surface fleet-aggregator parity summary on PASS
Smoke was silent about the Chapter 4 parity check because the
operator log got discarded on successful runs. Add a pre-cleanup
step that greps for `fleet-aggregator` log lines and prints the
last 20; if any `parity MISMATCH` line is present, upgrade to
`fail` — smoke exit 0 shouldn't hide a silently-wrong new
aggregator.
2026-04-22 14:18:50 -04:00
64d8295a65 feat(iot-operator): M4 — event-driven counters + duplicate-safe apply
Replaces M3's per-tick KV re-walk with an incremental
JetStream consumer on `device-state-events`. Cold-start still
walks KV once to seed counters; steady state consumes events and
applies `from -= 1; to += 1` diffs.

New in `fleet_aggregator`:

  FleetState (shared via Arc<Mutex<_>>):
    - counters: per-deployment phase counts.
    - phase_of: per-(device, deployment) current phase, for
      duplicate + resync detection.
    - latest_sequence: per-(device, deployment) highest sequence
      applied, drops stale and duplicate deliveries.
    - deployment_namespace: name → namespace map refreshed each
      parity tick from the CR list (events carry only the
      deployment name, matching the `<device>.<deployment>`
      KV key format).

  apply_state_change_event():
    - Idempotent for duplicate sequence numbers.
    - Idempotent for out-of-order lower-sequence events.
    - On from-phase disagreement with our belief, trusts the
      event and re-syncs (logs warn — parity check will catch
      any resulting drift against the legacy aggregator).
    - Counter decrement saturates at zero so replays can't
      underflow.

  run_event_consumer():
    - Durable JetStream pull consumer on STATE_EVENT_WILDCARD,
      DeliverPolicy::New (cold-start already seeded state from
      KV — replaying from the beginning would double-count).
    - Explicit ack; malformed payloads are logged + acked to
      avoid infinite redelivery.

  parity_tick() no longer walks KV — it reads live counters
  from the shared FleetState and compares with the legacy
  aggregator's per-CR fold. Same match/mismatch/running-totals
  logging as M3.

8 new unit tests cover the event-apply invariants: first
transition (no from), transition (from+to), duplicate sequence,
out-of-order sequence, from-disagreement resync, unknown-
deployment ignore, cold-start seeding, underflow saturation.
Plus the 5 M3 tests from before — 13 aggregator tests total,
all green.
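The event-apply core, in sketch form (illustrative types; the module's real signature differs):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Phase { Pending, Running, Failed }

/// Drop stale/duplicate sequences, then fold `from -= 1; to += 1`,
/// with the decrement saturating at zero.
fn apply_event(
    counters: &mut HashMap<Phase, u64>,
    latest_sequence: &mut u64,
    sequence: u64,
    from: Option<Phase>,
    to: Phase,
) {
    if sequence <= *latest_sequence {
        return; // duplicate or out-of-order delivery: idempotent no-op
    }
    *latest_sequence = sequence;
    if let Some(f) = from {
        let c = counters.entry(f).or_insert(0);
        *c = c.saturating_sub(1); // replays can't underflow
    }
    *counters.entry(to).or_insert(0) += 1;
}
```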
2026-04-22 14:15:48 -04:00
adb015bdea feat(iot-operator): M3 — parity-check task reading Chapter 4 KV alongside legacy aggregator
New module `fleet_aggregator` spawns a 5 s tick task that:
  - Walks the Chapter 4 KV buckets (`device-info`,
    `device-state`) every tick.
  - Computes per-CR phase counters via `compute_counters` (pure
    function, unit tested).
  - Computes the legacy aggregator's counts from the same
    `agent-status` snapshot map the legacy task is already
    maintaining.
  - Compares the two per CR and logs per-tick at DEBUG level
    (matches) or WARN (mismatches), with running totals at INFO
    every 60 s.

Explicit `cr_targets_device` predicate is the one-line plug
point for the selector-based rewrite coming from the review-fix
branch: swap `target_devices.contains()` for
`target_selector.matches(&info.labels)`, everything else in the
aggregator is label/selector-agnostic.

Refactored `aggregate::run` to accept the `StatusSnapshots` map
from outside so the parity-check task reads the same agent-status
view the legacy aggregator writes to. Added `aggregate::new_snapshots()`
helper so `main` owns the one shared Arc.

The task is strictly read-only: no CR patches, no side effects. M5
flips `.status.aggregate` over to the new counter-driven path once
M4 replaces the periodic re-walk with the event-stream consumer and
the parity check has stayed green under load.

5 unit tests cover the pure counter logic (target match, multi-CR
fan-in, zero-target CR, phase dispatch).
2026-04-22 14:09:46 -04:00
c123c058b7 feat(iot-agent): M2 — publish Chapter 4 wire format in parallel with AgentStatus
Agent now writes the new per-concern KV shapes + event streams
alongside the legacy AgentStatus. Nothing consumes the new data
yet — the legacy aggregator still drives CR .status from
`agent-status`. M3 will add the operator-side cold-start +
consumer paths in parity mode; M5 flips the CR-patch source once
counters verify against the legacy aggregator.

New module `fleet_publisher.rs` owns:
  - Opening + idempotent-creating the three new KV buckets
    (`device-info`, `device-state`, `device-heartbeat`) and
    two JetStream streams (`device-state-events`,
    `device-log-events`).
  - Publish methods for DeviceInfo, HeartbeatPayload, DeploymentState
    (KV put), StateChangeEvent + LogEvent (stream publish), and
    delete for deployment-state cleanup.
  - Log-and-swallow failure mode. The operator re-walks KV on
    cold-start, so a missed event publish is self-healing on the
    next transition or operator restart.

Reconciler grew:
  - `device_id`: Id + `fleet`: Option<Arc<FleetPublisher>>
  - per-(deployment) monotonic sequence counter in StatusState
  - `set_phase` detects actual transitions (prev_phase vs new) and
    emits a DeploymentState KV write + StateChangeEvent stream
    publish only on change. No-op re-confirmation still bumps the
    sequence (lets operator detect duplicate events via sequence
    comparison) but stays off the wire.
  - `drop_phase` deletes the device-state KV entry.
  - `push_event` also publishes a LogEvent to the stream.

main.rs:
  - Builds FleetPublisher after connect_nats, passes into Reconciler.
  - Publishes DeviceInfo once at startup (empty labels — populated
    by the selector-targeting branch once it merges).
  - Spawns a heartbeat loop on 30 s cadence.
  - Legacy `report_status` AgentStatus task kept running unchanged.

8 unit tests added for the transition-detection + sequence + ring-
buffer invariants (drive set_phase / drop_phase / push_event with
fleet: None). 18 contract tests from M1 still green.
2026-04-22 14:04:58 -04:00
bfef5fad54 feat(contracts): M1 — Chapter 4 wire-format types + bucket/subject constants
First milestone of the aggregation rework. Lands the contract layer
without any runtime side effects: the agent + operator still run
their legacy paths unchanged.

New types (module `fleet`):
  - DeviceInfo: routing labels + inventory, rewritten on label
    change. Stored in KV `device-info` at `info.<device_id>`.
  - DeploymentState: current phase per (device, deployment).
    Stored in KV `device-state` at `state.<device>.<deployment>`.
    Authoritative snapshot; operator rebuilds counters from it on
    cold-start.
  - HeartbeatPayload: tiny liveness ping in KV `device-heartbeat`.
    Payload capped by a test (< 96 bytes) so it stays cheap at
    1M-device rates.
  - StateChangeEvent: `from: Option<Phase>, to: Phase, sequence`
    emitted on each transition to JS stream
    `device-state-events` on subject
    `events.state.<device>.<deployment>`. Operator folds these
    events into in-memory counters.
  - LogEvent: shorter-retention user-facing event log to JS stream
    `device-log-events` on subject `events.log.<device>`.

Transport constants + key/subject helpers in `kv` with
cross-component wire-stability tests so a rename here gets caught.

10 new tests (roundtrip serde, forward-compat parse, size bound,
key/subject format). Legacy `AgentStatus` tests + constants stay
green; retirement is scheduled for M8 once the live path has
switched over.
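Helpers along these lines are what the wire-stability tests pin (function names are illustrative; the formats are the ones stated above):

```rust
// Key/subject builders for the Chapter 4 buckets and streams.
fn info_key(device: &str) -> String {
    format!("info.{device}")
}

fn state_key(device: &str, deployment: &str) -> String {
    format!("state.{device}.{deployment}")
}

fn state_event_subject(device: &str, deployment: &str) -> String {
    format!("events.state.{device}.{deployment}")
}

fn log_event_subject(device: &str) -> String {
    format!("events.log.{device}")
}
```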
2026-04-22 13:57:57 -04:00
0decb1ab61 docs(iot): chapter 4 — aggregation architecture at IoT scale (design draft)
Design doc for the aggregation rework. Chapter 2's aggregator
(O(deployments × devices) per tick) works for a 10-device smoke but
doesn't scale past a partner fleet of even modest size. Replaces it
with CQRS-style incrementally-maintained counters driven by
JetStream state-change events, device-authoritative per-device
state keys, and a separate log transport that doesn't touch
JetStream.

Review first, implement after. No runtime code changes in this
commit.

Covers data model (KV buckets, streams, subjects), counter
invariants (transition-based, duplicate-safe), cold-start protocol
(walk once, then consume), CR patch cadence (debounced dirty set),
failure modes, scale back-of-envelope for 1M devices + 10k
deployments, schema migration path (clean break, same CRD
v1alpha1), and eight-milestone landing plan.
2026-04-22 12:40:06 -04:00
c081f2cf5e style(iot-agent): silence two clippy nits in Chapter 2 code
push_str("…") → push('…'), and drop redundant .trim() before
.split_whitespace() in /proc/meminfo parsing.
2026-04-21 23:23:11 -04:00
c1dc7d56ea docs(iot): mark Chapter 2 shipped in v0_1_plan
Chapter 1 + Chapter 2 are both green end-to-end on x86_64 and
aarch64. Chapter 3 (helm packaging) is next. Design sketches kept
as the historical record — the running code is the source of
truth for 'how'.
2026-04-21 23:01:47 -04:00
9a08978e34 style(kvm): rustfmt the overlay args vec literal 2026-04-21 23:00:20 -04:00
9fb3691c3d feat(kvm): honor spec.disk_size_gb in overlay creation
qemu-img create with no trailing size inherits the backing
image's virtual size. The Ubuntu cloud image ships with ~2 GiB
of root, which fills up as soon as we sideload a container
tarball in the smoke. Pass disk_size_gb through to qemu-img and
rely on cloud-initramfs-growroot (already in the base) to grow
the partition on first boot. example_iot_vm_setup defaults to
16 GiB.
2026-04-21 22:41:59 -04:00
633f015444 fix(iot/smoke-a4): probe NATS TCP port after Available condition
kubectl wait --for=Available reports on pod readiness, but k3d's
klipper-lb takes a few more seconds to wire the host loadbalancer
port to Service endpoints. Without this extra wait the operator
races the routing and dies with 'expected INFO, got nothing.'
2026-04-21 22:32:25 -04:00
087af2f6f4 fix(iot/smoke-a4): single-archive save + post-load tagging on VM
`podman save -m` produces an OCI multi-image archive format that
older podman versions in the Ubuntu 24.04 cloud image cannot load:

  Error: payload does not match any of the supported image formats:
   * oci-archive: loading index: ...index.json: no such file or directory

Downgrade to the single-image docker-archive format (default for
`podman save`): save the source image once, load once in the VM,
then `podman tag` twice to expose it under `localdev/nginx:v1` and
`:v2`. Same bits on disk, two distinct tag references, so the
upgrade test still sees a container-id change when the Score
flips from v1 to v2.
2026-04-21 22:28:59 -04:00
97e10927d2 fix(iot/smoke-a4): arch-match guard on cached SRC_IMAGE
Running smoke-a4 with `ARCH=aarch64` after an `ARCH=x86-64` run
rebinds the local `nginx:alpine` tag to arm64 (or vice versa),
silently breaking the other arch's next run. Fail fast if the
cached image arch doesn't match the smoke's ARCH, with the exact
command to fix it (`podman pull --platform=linux/<arch> ...`).
2026-04-21 22:19:46 -04:00
92f1519f8e feat(podman): IfNotPresent pull + smoke-a4 tarball sideload for images
Two changes that compose into one win: the smoke no longer needs a
functional Docker Hub to exercise the agent → podman → container
loop.

**harmony/src/modules/podman/topology.rs — IfNotPresent for image pull**

`PodmanTopology::ensure_service_running` was calling `podman pull`
on every reconcile, even when the image was already in the local
store. For a long-lived device agent reconciling against a public
registry, that's a guaranteed rate-limit collision: Docker Hub caps
unauthenticated pulls at 100 manifests per 6 h per IP, and an agent
ticking every 30 s chews through that allowance in a day.

Change the pull path to check the local store first:

    if images.get(image).exists().await? { return Ok(()); }
    // else: pull

Matches Kubernetes' `imagePullPolicy: IfNotPresent` semantics.
Correct default for the IoT platform: upgrades change the image
STRING (tag or digest), so they still hit the pull branch —
"use local if available, pull the new thing if the reference changed."

**iot/scripts/smoke-a4.sh — tarball sideload in place of registry**

An earlier iteration of this smoke stood up a local `registry:2`
container and pushed tagged images into it. That pattern itself
needs to pull `registry:2` from Docker Hub — cute demo, still
Hub-dependent. Gone now.

New phase 4.5 / 5c pair:

  4.5: podman save the cached `nginx:alpine` under two local tags
       (`localdev/nginx:v1`, `localdev/nginx:v2`) into a tarball on
       the host.
  5c:  scp the tarball to the VM, `podman load` it into the
       iot-agent user's rootless store.

Paired with the new IfNotPresent semantics, the agent's reconcile
sees both images already present and never touches a registry. The
upgrade test still works because `v1` and `v2` are distinct tag
strings → spec drift → container id changes.

Dropped the `docker` preflight (no more k3d-side registry transfer)
and the `LOCAL_REGISTRY_*` env vars.

Verified end-to-end: x86 smoke-a4 --auto PASS.
  - apply v1 → container up → curl 200
  - .status.aggregate.succeeded = 1 (Chapter 2 aggregator working)
  - apply v2 → container id changes (upgrade confirmed)
  - delete → container removed

Aarch64 run next.
2026-04-21 22:15:37 -04:00
37e69b36cf feat(iot-operator): aggregate agent-status into DeploymentStatus.aggregate
The operator watches the `agent-status` bucket, keeps a per-device
snapshot in memory, and folds it into each Deployment CR's
`.status.aggregate` subtree every 5 seconds. The answer to the user's
stated requirement — "CRD .status reflect-back: per-device
succeeded/failed counts + recent log lines" — now lives in the CR
itself, observable via `kubectl get -o jsonpath` or any UI that
speaks k8s status subresources.

**Shape (in iot/iot-operator-v0/src/crd.rs)**

  DeploymentStatus {
    observed_score_string,   // unchanged; controller change-detect
    aggregate: Option<{
      succeeded: u32,        // devices with Phase::Running
      failed: u32,           // devices with Phase::Failed
      pending: u32,          // devices with Phase::Pending or
                             // reported-but-no-phase-entry-yet
      unreported: u32,       // target devices that never heartbeated
      last_error: Option<{   // most recent failing device + short msg
        device_id, message, at
      }>,
      recent_events: Vec<{   // last-N events across the fleet, newest first
        at, severity, device_id, message, deployment
      }>,
      last_heartbeat_at,     // freshness signal for the whole fleet
    }>
  }

**New module** `iot/iot-operator-v0/src/aggregate.rs`

  - `watch_status_bucket`: subscribes to `status.>` on the
    agent-status bucket, maintains a `BTreeMap<device_id, AgentStatus>`
    in memory. Malformed payloads + malformed keys log-and-skip; the
    snapshot map is always the latest good shape.
  - `aggregate_loop`: 5 s ticker. Per tick: list Deployment CRs,
    clone the snapshot (no lock held across network calls), compute
    each CR's aggregate, JSON-Merge-Patch `.status.aggregate`. Merge
    patch composes cleanly with the controller's
    `observedScoreString` patch — neither clobbers the other.
  - `compute_aggregate` pure fn: classification logic is in one
    place, four unit tests pin its behaviour (counts + unreported,
    reported-but-no-phase-entry = pending, event filter matches
    deployment name only, status-key parser).
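The classification can be sketched as a pure fold (illustrative shape, not the operator's exact types):

```rust
/// Each targeted device lands in exactly one bucket.
#[derive(Default, Debug, PartialEq)]
struct Aggregate {
    succeeded: u32,
    failed: u32,
    pending: u32,
    unreported: u32,
}

#[derive(Clone, Copy)]
enum Phase { Running, Failed, Pending }

// Per target device: None = never heartbeated, Some(None) = reported but
// no phase entry yet, Some(Some(p)) = reported phase p.
fn classify(targets: &[(&str, Option<Option<Phase>>)]) -> Aggregate {
    let mut agg = Aggregate::default();
    for (_device, status) in targets {
        match status {
            None => agg.unreported += 1,
            Some(None) | Some(Some(Phase::Pending)) => agg.pending += 1,
            Some(Some(Phase::Running)) => agg.succeeded += 1,
            Some(Some(Phase::Failed)) => agg.failed += 1,
        }
    }
    agg
}
```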

**Operator wiring** (`main.rs`)

  `run()` now opens *both* KV buckets at startup, spawns the
  controller and the aggregator concurrently via
  `tokio::select!`. Either returning an error tears the process
  down — kube-rs's Controller already absorbs transient reconcile
  errors internally, so anything escaping is genuinely fatal.

**Controller tweak**

  The apply path's `patch_status` was rebuilding the whole
  `DeploymentStatus` struct, which would clobber the aggregator's
  writes. Switched to raw JSON-Merge-Patch for the
  `observedScoreString` field only. Behaviour preserved, aggregate
  subtree left intact.

**Smoke assertion** (smoke-a4.sh --auto)

  After apply + curl succeeds, the --auto path now asserts
  `kubectl get deployment.iot.nationtech.io ... -o
  jsonpath='{.status.aggregate.succeeded}'` reaches 1 within
  60 s. Proves the full agent → status bucket → operator aggregate →
  CRD status loop, end to end.

Verified locally: `cargo test -p iot-operator-v0 --lib` 4/4 green,
`cargo check --all-targets --all-features` clean.
2026-04-21 21:50:00 -04:00
7dd89a7617 feat(reconciler-contracts): enrich AgentStatus with per-deployment phase + events + inventory
Chapter 2 groundwork. The on-wire AgentStatus the agent publishes
every 30 s was only carrying device_id + status + timestamp — not
enough for the operator to answer "how are my deployments doing."
Enrich it so the operator can aggregate into a useful
DeploymentStatus.aggregate subtree on the CR (second commit).

**harmony-reconciler-contracts/src/status.rs**

- `AgentStatus.deployments: BTreeMap<String, DeploymentPhase>` —
  keyed by deployment name (CR's metadata.name). Each phase carries
  `{ phase: Running|Failed|Pending, last_event_at, last_error }`.
- `AgentStatus.recent_events: Vec<EventEntry>` — ring buffer of the
  most recent reconcile events on this device. Each entry is
  `{ at, severity: Info|Warn|Error, message, deployment: Option }`.
  Bounded agent-side to keep JetStream per-message size sane.
- `AgentStatus.inventory: Option<InventorySnapshot>` — hostname,
  arch, os, kernel, cpu_cores, memory_mb, agent_version. Published
  once on startup.
- All three new fields are `#[serde(default)]` — mixed-fleet upgrades
  don't break: an old agent's payload deserializes into the new
  struct (deployments empty, events empty, inventory None); a new
  agent's payload deserializes into an old operator just losing the
  fields.

New tests (kept forward-compat front and center):
  - `minimal_status_roundtrip` — empty maps / None
  - `enriched_status_roundtrip` — full population
  - `old_wire_format_parses_into_enriched_struct` — pre-Chapter-2
    payload must still parse (the upgrade guarantee)
  - `wire_keys_present` — literal wire-format pins for smoke greps

**iot-agent-v0**

Reconciler gains a `StatusState { deployments, recent_events }` side
map with a bounded ring buffer (`EVENT_RING_CAP = 32`). Every code
path that changes deployment state now also records phase + event:

  - `apply()`: Pending → Running on success, Failed + error event on
    failure.
  - `remove()`: drops phase, emits "deployment deleted" info event.
  - `tick()` (periodic reconcile): keeps phase at Running on noop;
    flips to Failed + event on error (deliberately no event on
    successful no-change ticks — 30 s cadence would drown the ring).

New helper `deployment_from_key(key)` unwraps `<device>.<deployment>`
into just the deployment name. `short(s)` truncates error strings to
512 chars so the payload stays well under NATS JetStream limits.
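Both helpers are simple enough to sketch (assumed shapes based on the formats this commit names: `<device>.<deployment>` keys, 512-char cap):

```rust
// Split at the FIRST '.', treating the device id as one dot-free segment.
fn deployment_from_key(key: &str) -> Option<&str> {
    key.split_once('.').map(|(_device, deployment)| deployment)
}

// Truncate error strings so payloads stay under JetStream message limits.
fn short(s: &str) -> String {
    s.chars().take(512).collect()
}
```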

`report_status()` in main.rs now snapshots the reconciler's status
state on every heartbeat and publishes the full enriched payload
alongside a startup-captured InventorySnapshot. Inventory reads
`/proc/sys/kernel/osrelease` + `/proc/meminfo` + `std::env::consts::ARCH`
with graceful fallbacks — no new sys-info crate dep.

Verified: `cargo test -p harmony-reconciler-contracts --lib` 7/7 green
(5 new). Operator consumption of the new fields lands in the next
commit.
2026-04-21 21:45:48 -04:00
ec3d3a9d63 fix(iot/smoke-a4): sideload NATS image into k3d to dodge Docker Hub rate limits
Docker Hub's unauthenticated rate limit (100 pulls per 6h per IP,
counted per-manifest-query) is the most reliable way for a CI-style
smoke loop to produce false negatives. The NATS pod failing with
'429 Too Many Requests' after a handful of runs today was that —
not a real regression.

Fix inside the smoke: before running the install Score, sideload the
NATS image into the k3d cluster via a podman→docker→k3d bridge:

  - If the image isn't already in docker's store:
      - If it's not in podman's store either, podman pull (this is
        the one-time hit we can't avoid).
      - podman save → docker load.
  - k3d image import into the cluster's containerd.

Steady-state this is a few-hundred-ms operation (no Hub calls, no
registry traffic). Require docker in the preflight list since we
depend on it for the cross-runtime bridge.

Also bump the Available-wait from 60 s to 120 s — the post-import
pod spin-up is fast but the scheduler + loadbalancer update take
longer than I initially budgeted.

VM-side nginx pulls are still at Hub's mercy; addressing that
requires either (a) docker login before the smoke, (b) an
authenticated registry mirror, or (c) arch-specific image
pre-seeding into the VM. All Chapter-2+ follow-ups.
2026-04-21 21:37:55 -04:00
9fd283183d fix(iot/smoke-a4): per-arch container-wait timeouts for TCG
Initial 180 s wait assumed native-KVM x86 speed. Under aarch64 TCG
the same nginx:latest pull (~250 MB image + layered userns unpack)
takes 4-8 min observed; 180 s was catching post-heartbeat reconcile
mid-pull and reporting FAIL.

Bump `CONTAINER_WAIT_STEPS` per arch:
  - x86 KVM: 90 iterations × 2 s = 180 s (unchanged)
  - aarch64 TCG: 450 × 2 s = 900 s (15 min)

Apply to both the 'first-boot container' and 'upgrade container id
change' loops.
2026-04-21 20:53:59 -04:00
a098e48e29 fix(iot/smoke-a4): query podman as iot-agent, not iot-admin
The agent runs rootless podman as the `iot-agent` user (system
user, created by IotDeviceSetupScore). Each user has their own
podman state tree under ~/.local/share/containers. The smoke
was running `podman ps` as `iot-admin` (the ssh login user),
so it saw an empty store even when the agent had happily created
the nginx container — leading to a spurious "container never
appeared" failure despite the reconciler reporting SUCCESS.

Fix: go through `sudo su - iot-agent -c` with
`XDG_RUNTIME_DIR=/run/user/$(id -u)` so the command runs in
the right user session. Update the hand-off command menu with the
equivalent one-liner so the user can inspect the fleet's actual
container state without tripping over the same gotcha.

Smoke-a4 PASSes end-to-end on x86_64:
  - CRD apply → container materializes
  - Upgrade via new image → container id changes (not patched)
  - Delete → container removed

With the previous commit (ensure_subordinate_ids), this closes
Chapter 1 of ROADMAP/iot_platform/v0_1_plan.md: the full v0 loop
works, hands-on driven by kubectl / a typed Rust binary / natsbox.
2026-04-21 20:25:00 -04:00
1737374a93 fix(iot/linux): ensure_subordinate_ids so rootless podman can pull images
Ubuntu 24.04 `useradd --system` does not allocate `/etc/subuid` +
`/etc/subgid` ranges. Rootless podman silently fails on image-layer
unpack:

    potentially insufficient UIDs or GIDs available in user namespace
    (requested 0:42 for /etc/gshadow): ... lchown /etc/gshadow:
    invalid argument

`smoke-a1.sh` didn't hit this because it runs the agent on the
*host* user, which has subuid/subgid populated by default. `smoke-a4.sh`
drives a podman pull inside the VM — the FIRST time we actually
exercise rootless-podman-on-a-fresh-system, and the failure surfaces
immediately.

The fix belongs in harmony, not in ad-hoc cloud-init scripts. Add
`UnixUserManager::ensure_subordinate_ids` alongside the existing
`ensure_user` + `ensure_linger` methods:

- `domain/topology/host_configuration.rs`: new trait method. Doc
  explains why every rootless-container-runtime consumer needs it.
- `modules/linux/ansible_configurator.rs`: impl follows `ensure_linger`'s
  pattern — a grep probe on /etc/subuid+/etc/subgid, then a single
  `usermod --add-subuids 100000-165535 --add-subgids 100000-165535`
  only when missing. Idempotent, no-ops on re-run.
- `modules/linux/topology.rs`: forwarder for `LinuxHostTopology`.
- `modules/iot/setup_score.rs`: call the new method right after
  `ensure_linger` in `IotDeviceSetupScore`. Any future consumer that
  runs rootless podman reaches for the same primitive.

Verified: `cargo check --all-features` clean. End-to-end smoke-a4
regression pending (re-running after this commit).
2026-04-21 20:09:03 -04:00
b226bc9d29 feat(nats): NatsBasicScore gets LoadBalancer expose mode
Kubernetes NodePort Services must use a port in the apiserver's
configured nodeport range (default 30000-32767). NatsBasicScore's
first cut accepted any port via `.node_port(port)`, which was fine
for strict use of the capital-N NodePort Service type, but made
the demo's "use NATS client port 4222 directly from the host"
story awkward.

Replace the `node_port: Option<i32>` field with a proper
`NatsServiceType` enum (ClusterIP | NodePort(i32) | LoadBalancer).
Three builder methods — one per variant. LoadBalancer is the right
idiom for the demo: k3d's built-in `klipper-lb` fronts
LoadBalancer Services on their `port` (not their nodePort), so
`k3d cluster create -p 4222:4222@loadbalancer` delivers external
traffic straight to the Service's client port. No nodeport range
juggling.

Signatures:

    NatsBasicScore::new(name, namespace)   // ClusterIP default
        .node_port(30422)                   // NodePort(30422)
        .load_balancer()                    // LoadBalancer
        .jetstream(true)
        .image("docker.io/library/nats:2.10-alpine")

Tests: 5 pass. New assertion: `load_balancer()` produces a Service
with type LoadBalancer and no pinned nodePort (apiserver assigns).

Consumers:
- `example_iot_nats_install` gets a `--expose {cluster-ip | node-port
   | load-balancer}` flag (default `load-balancer` since that's what
  the demo wants). The legacy `--node-port N` flag survives as the
  NodePort port value.
- `smoke-a4.sh` asks for `--expose load-balancer`, matching its
  `-p 4222:4222@loadbalancer` k3d port mapping.
2026-04-21 19:10:19 -04:00
818525824c chore(iot): make smoke-a4.sh executable
Previous commit landed the script without the +x bit (a chmod
between write and commit was swallowed). Fix with git
update-index --chmod=+x so the file is executable on checkout.
2026-04-21 19:06:58 -04:00
5e8fb429ca feat(iot): smoke-a4.sh — hands-on end-to-end demo harness
Composed demo that brings up operator + in-cluster NATS + ARM (or
x86) VM agent, then either hands the full stack off to the user
with a command menu (default) or drives an apply + upgrade + delete
regression loop (`--auto`).

Phases:
  1. k3d cluster with NATS port exposed via `-p 4222:4222@loadbalancer`.
  2. NATS in-cluster via the new `example_iot_nats_install` binary
     → `NatsBasicScore` → typed k8s_openapi Namespace + Deployment +
     NodePort Service.
  3. CRD install via `iot-operator-v0 install` (Score-based, no yaml).
  4. Operator spawned host-side, connects to nats://localhost:4222.
  5. VM provisioned via `example_iot_vm_setup` (reused from smoke-a3);
     agent inside the VM connects to nats://<libvirt-gateway>:4222.
  6. Sanity: NATS pod Running, agent heartbeat
     `status.<device>` present in `agent-status` bucket.
  7a. DEFAULT: print a command menu (kubectl watch, typed Rust
      applier, ssh/console, natsbox one-liners, curl) and block on
      Ctrl-C with a cleanup trap tearing everything down.
  7b. `--auto`: apply nginx:latest, wait for container on the VM,
      curl, upgrade to nginx:1.26, assert container id CHANGED,
      curl, delete, assert container gone.

Prereqs documented at the top of the script. Handles both x86-64
(native KVM) and aarch64 (TCG emulation) via `ARCH=` env.

Design notes captured in ROADMAP/iot_platform/v0_1_plan.md. Uses
every piece landed in this branch so far: K8sBareTopology,
NatsBasicScore, the typed CR applier, the Score-based CRD install.
2026-04-21 19:03:07 -04:00
18dd712f8e feat(iot): example_iot_nats_install — single-node NATS via NatsBasicScore
Small CLI that installs a single-node NATS server into the cluster
KUBECONFIG points at, using harmony's `NatsBasicScore` composed
against `K8sBareTopology`.

This is the glue between `smoke-a4.sh` and the framework Score:

    cargo run -q -p example_iot_nats_install -- \
        --namespace iot-system \
        --name iot-nats \
        --node-port 4222

Defaults cover the demo exactly: iot-system namespace, NodePort 4222
so the libvirt VM agent can reach NATS through the k3d loadbalancer
port mapping.

No reinvented topology, no hand-rolled yaml, no helm shell-out. The
actual work (Namespace + Deployment + Service with the right
selector/ports/probes) lives inside `NatsBasicScore::Interpret` in
harmony where it can be reused by any future consumer.

Part of ROADMAP/iot_platform/v0_1_plan.md Chapter 1.
2026-04-21 18:33:35 -04:00
287ecdfb30 feat(iot): typed-Rust Deployment CR applier (example_iot_apply_deployment)
Replaces what would otherwise be a yaml fixture for the hands-on
demo. The CRD is already fully typed (DeploymentSpec + ScorePayload
+ PodmanV0Score + Rollout), so the applier uses those types
directly, constructs the CR via kube::Api, and either applies it
server-side or prints the JSON for `kubectl apply -f -`.

CLI:

  iot_apply_deployment \
      --namespace iot-demo \
      --name hello-world \
      --target-device iot-smoke-vm \
      --image docker.io/library/nginx:latest \
      --port 8080:80                       # apply
  iot_apply_deployment --image nginx:1.26  # upgrade (same name, new img)
  iot_apply_deployment --delete            # tear down
  iot_apply_deployment --print ...         # JSON to stdout → kubectl -f -

Uses server-side apply (PatchParams::apply().force()) so repeated
invocations patch the existing CR cleanly — the upgrade path the
demo exercises.

To expose the CRD types to an external consumer, iot-operator-v0
gains a thin `src/lib.rs` that re-exports the `crd` module. The
binary target now imports from the library (`use iot_operator_v0::crd;`)
instead of declaring its own `mod crd;` — avoids compiling the
types twice.

No change in operator runtime behavior.

Part of the ROADMAP/iot_platform/v0_1_plan.md Chapter 1 work.
2026-04-21 18:32:17 -04:00
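The server-side apply call can be sketched against kube-rs's typed API. This is a pseudocode-level sketch, not the applier's actual code: the field-manager string, the `Deployment` CR type name (derived from `DeploymentSpec`), and the surrounding setup are assumptions.

```rust
// Sketch only — mirrors kube-rs's server-side-apply surface.
use kube::api::{Api, Patch, PatchParams};

async fn apply_cr(client: kube::Client, cr: &Deployment) -> anyhow::Result<()> {
    let api: Api<Deployment> = Api::namespaced(client, "iot-demo");
    // .force() lets repeated invocations take ownership of the managed
    // fields, so the upgrade path (same name, new image) patches the
    // existing CR cleanly instead of conflicting.
    let pp = PatchParams::apply("iot-apply-deployment").force();
    api.patch("hello-world", &pp, &Patch::Apply(cr)).await?;
    Ok(())
}
```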
7e2882425f feat(nats): NatsBasicScore — single-node NATS, no helm/PKI/ingress
Harmony's existing NATS story starts at `NatsK8sScore`, which is
designed for production multi-site superclusters: TLS-fronted
gateways, cert-manager-minted certs, ingress + Route, helm chart
with gateway merge blocks, NatsAdmin secret prompts. All of that is
overhead for a local smoke or a single-site decentralized deployment
that just needs a live JetStream server.

Add `NatsBasicScore` beside it. Deliberately minimal:
  - Single replica
  - Official `nats:*-alpine` image via typed k8s_openapi Deployment
  - JetStream (-js) on by default, toggle via builder setter
  - Namespace created if missing
  - Service: ClusterIP by default, or NodePort via
    `.node_port(port)` for off-cluster clients (e.g. a libvirt VM
    connecting through the host's loadbalancer port)

Trait bounds are just `Topology + K8sclient` — no `HelmCommand`,
no `TlsRouter`, no `Nats` capability. Composes cleanly with
`K8sBareTopology` (added in the previous commit) so consumers can
`score.create_interpret().execute(&inventory, &topology)` against
any cluster that `KUBECONFIG` points at.

Constructed via a small builder:

    NatsBasicScore::new("iot-nats", "iot-system")
        .node_port(4222)
        .jetstream(true)

Under the hood the interpret runs three `K8sResourceScore`s in
sequence (namespace → deployment → service). No new machinery —
just composition of existing primitives.

Deliberately NOT in scope for this Score:
  - TLS / PKI — use NatsK8sScore when you need those
  - Gateways / supercluster — use NatsSuperclusterScore
  - Auth (user/password or JWT) — add a ConfigMap mount when
    the Chapter 4 auth work lands

Tests (4, all passing): default is ClusterIP; node_port() flips
Service to NodePort with the right nodePort field; jetstream() toggle
controls the `-js` arg.

Part of the "compound framework value" mindset: every future Score
that wants a local NATS now points at this one type instead of
inventing its own yaml.
2026-04-21 18:29:16 -04:00
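End to end, consumption looks roughly like this. A pseudocode-level sketch assembled from this commit's own identifiers; error handling, the `Inventory` value, and `from_kubeconfig`'s exact return type are assumptions:

```rust
// Sketch: compose the builder above with K8sBareTopology and execute.
let topology = K8sBareTopology::from_kubeconfig("local")?;
let score = NatsBasicScore::new("iot-nats", "iot-system")
    .node_port(4222)
    .jetstream(true);
score.create_interpret().execute(&inventory, &topology).await?;
```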
6863162655 feat(k8s): K8sBareTopology — minimal topology for ad-hoc Score execution
Roadmap §12.6 ("topology proliferation") is partially resolved by
extracting the ad-hoc InstallTopology from iot-operator-v0/install.rs
into harmony as a reusable shared type, now that a second consumer
(NatsBasicScore, landing next) makes the extraction genuinely
load-bearing rather than speculative.

What's new:

- harmony/src/modules/k8s/bare_topology.rs — K8sBareTopology carries
  one K8sClient, implements K8sclient + Topology (noop ensure_ready).
  Constructors: from_client(name, client) for callers building their
  own client, from_kubeconfig(name) for callers reading the standard
  KUBECONFIG chain.
- modules::k8s::K8sBareTopology re-export.

What's gone:

- iot-operator-v0/src/install.rs: the ~30-line InstallTopology struct
  + its async_trait-decorated impls. The crate also drops async-trait
  and harmony-k8s as direct deps (neither is used now that the
  topology is shared).
- Long "architectural smell" comment from install.rs — the smell is
  fixed; the explanation belongs at the shared type now (with the
  history captured in its module doc).

Behavior-preserving. cargo check --all-targets --all-features clean.
smoke-a1 wire path unchanged.

Compounding-value move: every future Score that needs "apply a
typed resource against an existing cluster" consumes K8sBareTopology
instead of inventing its own Topology impl. That's the pattern v0
Harmony's design is meant to encourage.
2026-04-21 18:26:30 -04:00
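The two constructors map onto the two consumer shapes. A sketch only: `existing_client` is a placeholder, and whether `from_kubeconfig` returns a `Result` is an assumption:

```rust
// Caller already holds a K8sClient (e.g. built for other work):
let topo = K8sBareTopology::from_client("smoke", existing_client);

// Caller just wants whatever the standard KUBECONFIG chain resolves to:
let topo = K8sBareTopology::from_kubeconfig("smoke")?;
```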
d4c8731941 docs(iot): forward plan (v0.1 and beyond) + mark v0 walking skeleton as SHIPPED
v0 walking skeleton is substantially done (CRD → operator → NATS KV
→ on-device agent → podman reconcile; VM-as-device for x86_64 and
aarch64 via TCG; power-cycle resilience; operator install via Score
instead of yaml/kubectl). Time to switch the `ROADMAP/iot_platform`
folder from "plan to build the skeleton" to "plan to build on top of
the skeleton."

- **NEW** `ROADMAP/iot_platform/v0_1_plan.md` — the authoritative
  forward plan. Five chapters in execution order:
    1. Hands-on end-to-end demo the user can drive by hand
       (imminent, fully detailed: composed smoke, typed-Rust CR
       applier, natsbox command menu, in-cluster NATS).
    2. Status reflect-back + inventory (enrich `AgentStatus`,
       operator aggregates into `.status.aggregate`).
    3. Helm chart packaging (ArgoCD deferred — user's clusters have
       it already, bringing it into the smoke adds no validation
       value).
    4. Zitadel + OpenBao + per-device auth.
    5. Frontend (web / CLI / TUI — deferred).

  Chapters 2-5 are sketched; they expand to their own docs as each
  becomes the active chapter.

- **EDIT** `ROADMAP/iot_platform/v0_walking_skeleton.md` — add a
  SHIPPED banner at the top pointing at v0_1_plan.md. Keep the
  707-line design diary intact as archaeology; don't rewrite
  history.

- Incorporates the post-v0 architectural principles that emerged
  from review (no yaml in framework paths, minimal ad-hoc
  topologies, cross-boundary types in harmony-reconciler-contracts,
  verify before blaming upstream).
2026-04-21 18:18:20 -04:00