Chapter 1 + Chapter 2 are both green end-to-end on x86_64 and aarch64. Chapter 3 (helm packaging) is next. Design sketches kept as the historical record — the running code is the source of truth for 'how'.
IoT Platform v0.1 and beyond — forward plan
Authoritative forward plan for the NationTech decentralized-infra /
IoT platform, written after the v0 walking skeleton shipped
(see v0_walking_skeleton.md for the historical diary). Organized as
five chapters in execution order.
State of the world (as of 2026-04-21)
Green, end-to-end:
- CRD → operator → NATS JetStream KV write path (`smoke-a1.sh`).
- Agent watches KV, reconciles podman containers (`smoke-a1.sh`).
- VM-as-device provisioning: cloud-init + iot-agent install + NATS smoke (`smoke-a3.sh`), x86_64 (native KVM) and aarch64 (TCG).
- Power-cycle / reboot resilience (`smoke-a3.sh` phase 5).
- aarch64 cross-compile of the agent (no Harmony modules need to feature-gate aarch64).
- Operator installed via a harmony Score (typed Rust, no yaml).
- `harmony-reconciler-contracts` crate — cross-boundary types (NATS bucket names + key helpers, `AgentStatus`, `Id` re-export). A hedged key-helper sketch follows below.
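For orientation, a hedged sketch of the kind of key helpers the contracts crate centralizes. The bucket names and key shapes match the natsbox commands in the Chapter 1 hand-off menu; the constant and function names here are illustrative, not copied from the crate.

```rust
// Illustrative only: the real definitions live in harmony-reconciler-contracts.
// Bucket names and key shapes mirror what the smoke scripts query via natsbox.
pub const DESIRED_STATE_BUCKET: &str = "desired-state";
pub const AGENT_STATUS_BUCKET: &str = "agent-status";

/// Key the operator writes a device's desired deployment under: `<device>.<deployment>`.
pub fn desired_state_key(device: &str, deployment: &str) -> String {
    format!("{device}.{deployment}")
}

/// Key the agent heartbeats its status under: `status.<device>`.
pub fn agent_status_key(device: &str) -> String {
    format!("status.{device}")
}
```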
Chapter 1 shipped (as of 2026-04-21): composed end-to-end demo (`smoke-a4.sh`) — operator in k3d + in-cluster NATS + ARM VM + typed-Rust CR applier + hand-off menu + `--auto` regression. Green on x86_64 (native KVM) and aarch64 (TCG).
Chapter 2 shipped (as of 2026-04-22): `AgentStatus` enriched with per-deployment phase, recent-events ring, and optional inventory snapshot. Operator aggregator watches the `agent-status` bucket and patches `.status.aggregate` (succeeded / failed / pending / unreported + `last_error` + `recent_events` + `last_heartbeat_at`). `smoke-a4 --auto` now asserts `.status.aggregate.succeeded == 1` after apply. Green on x86_64 and aarch64. A hedged sketch of the aggregate shape follows below.
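For reference, a hedged sketch of that aggregate shape, written under the assumption that the fields listed above map one-to-one onto a struct (chrono used with its serde feature); the actual type lives in the operator crate and may differ.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// Sketch of the `.status.aggregate` subtree the operator patches onto the CR.
/// Field names follow the summary above; exact types are assumptions.
#[derive(Serialize, Deserialize, Debug, Default, Clone)]
pub struct StatusAggregate {
    pub succeeded: u32,
    pub failed: u32,
    pub pending: u32,
    pub unreported: u32,
    pub last_error: Option<String>,
    #[serde(default)]
    pub recent_events: Vec<String>,
    pub last_heartbeat_at: Option<DateTime<Utc>>,
}
```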
Not yet wired (real v0.1 work still to go):
- Helm packaging of the operator (Chapter 3).
- Zitadel + OpenBao auth (per-device credentials, SSO for operator users). Placeholder `CredentialSource` trait on the agent side (Chapter 4).
- Any frontend (Chapter 5).
Verified during planning (so future implementation doesn't have to re-litigate):
- Upgrade already works. `reconciler.rs::apply` byte-compares serialized score payloads; drift triggers re-reconcile. `PodmanTopology::ensure_service_running` removes then re-creates containers on spec drift. No "stale + new" window.
- The polymorphism stays. `IotScore` is an externally-tagged enum; adding `OkdApplyV0` later is additive (see the enum sketch below).
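A minimal sketch of why the additive claim holds, assuming serde's default external tagging; the `PodmanV0Score` fields here are placeholders, not the real definition.

```rust
use serde::{Deserialize, Serialize};

// Placeholder payload; the real PodmanV0Score is richer.
#[derive(Serialize, Deserialize, Debug)]
pub struct PodmanV0Score {
    pub image: String,
    pub port: u16,
}

// Externally tagged (serde's default): serialized as {"PodmanV0": {...}}.
// Adding `OkdApplyV0(...)` later only introduces a new tag; existing
// `PodmanV0` payloads in the KV bucket keep deserializing unchanged.
#[derive(Serialize, Deserialize, Debug)]
pub enum IotScore {
    PodmanV0(PodmanV0Score),
    // OkdApplyV0(OkdApplyV0Score),
}

fn main() -> serde_json::Result<()> {
    let score = IotScore::PodmanV0(PodmanV0Score {
        image: "nginx:latest".into(),
        port: 8080,
    });
    println!("{}", serde_json::to_string_pretty(&score)?);
    Ok(())
}
```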
Surprises since v0 started (for context, none architectural):
- Arch `edk2-aarch64-202602-2` shipped empty firmware blobs; `202508-1` ships unpadded edk2 that needs 64 MiB pflash padding. Fixed via runtime discovery + padding in `modules/kvm/firmware.rs`.
- MTTCG isn't the default for cross-arch TCG on QEMU 10.2; force it via a `qemu:commandline` override. `pauth-impdef=on` is likewise a `qemu:commandline` opt-in.
- `ensure_vm` is idempotent on "domain exists" — re-applying a changed XML requires a manual `undefine --nvram --remove-all-storage`. Noted as a follow-up in the code comments.
Chapter 1 — Hands-on end-to-end demo [SHIPPED 2026-04-21]
Goal: the user runs one command, watches operator + NATS + ARM
VM come up, then drives a CRD through the full loop by hand:
kubectl apply it (manually or via a typed Rust applier), watch the
operator log "acquired," check the NATS KV store with natsbox,
SSH/console into the VM, curl the running nginx container from
the workstation.
User-facing requirements (explicit)
- No yaml fixtures. Sample `Deployment` CRs constructed in typed Rust using `DeploymentSpec` + `PodmanV0Score`. Same discipline as the `install` Score that replaced `gen-crd | kubectl apply`.
- ArgoCD deferred. User's production clusters have it; bringing it into the smoke harness adds setup overhead without validating anything `helm install` doesn't. Chapter 3 produces the chart; ArgoCD integration is a later operational concern.
- Operator logs every CR it acquires — `controller.rs` already does `tracing::info!(%ns, %name, "reconcile")`; verify the output reads well in the command-menu hand-off.
- natsbox debugging is first-class. Script prints exact natsbox one-liners at hand-off so the user can inspect KV state.
- In-cluster NATS. Not a side-by-side podman container (as smoke-a1 does today). Expose to the libvirt VM via k3d loadbalancer port mapping.
Design decisions
- Rust CR applier. New binary `examples/iot_apply_deployment/`. CLI flags `--name --namespace --target-device --image --port --delete`. Constructs the `Deployment` CR via `kube::Api<Deployment>` + typed `DeploymentSpec`; calls `api.apply(...)`. Can also `--print` the CR JSON to stdout so `kubectl apply -f -` still works from the terminal. A hedged sketch follows this list.
- smoke-a4.sh orchestration stays bash for now. User agreed this is test-harness scope, not framework path; converting it to Rust is "not as important right now."
- Hand-off is the default mode, not `--keep`. The whole point of Chapter 1 is that the user drives the last stage interactively. `smoke-a4.sh` brings everything up, applies nothing, prints the command menu, waits on `INT`/`TERM` to tear down. `--auto` runs the full apply/curl/upgrade/delete regression for CI.
- In-cluster NATS path. Preferred: use `harmony::modules::nats` if it has a lightweight single-node / no-supercluster mode. Fallback: typed `K8sResourceScore` applying a minimal Deployment + NodePort Service. 15-min research task before committing.
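A hedged sketch of the applier's core, assuming a kube-derive CRD named `Deployment` in group `iot.nationtech.io` (the API version and spec fields below are placeholders) and using server-side apply to stand in for the plan's `api.apply(...)` call; CLI-flag parsing is omitted.

```rust
use kube::{
    api::{Patch, PatchParams},
    Api, Client, CustomResource,
};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Placeholder spec: the real DeploymentSpec lives in the operator crate.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "iot.nationtech.io",
    version = "v1alpha1", // assumed API version
    kind = "Deployment",
    namespaced
)]
pub struct DeploymentSpec {
    pub target_device: String,
    pub image: String,
    pub port: u16,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // In the real binary these come from --name/--namespace/--target-device/--image/--port.
    let (name, namespace) = ("demo-nginx", "default");
    let cr = Deployment::new(
        name,
        DeploymentSpec {
            target_device: "vm-arm-01".into(),
            image: "nginx:latest".into(),
            port: 8080,
        },
    );

    // --print path: dump the CR JSON so `kubectl apply -f -` still works.
    println!("{}", serde_json::to_string_pretty(&cr)?);

    // Apply path: server-side apply is an idempotent create-or-update.
    let client = Client::try_default().await?;
    let api: Api<Deployment> = Api::namespaced(client, namespace);
    api.patch(name, &PatchParams::apply("iot-apply-deployment"), &Patch::Apply(&cr))
        .await?;
    Ok(())
}
```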
Composed smoke phases (smoke-a4.sh)
- k3d cluster up with `-p "4222:4222@loadbalancer"` so host port 4222 forwards into the cluster. Reachable from the libvirt VM via the gateway IP (typically `192.168.122.1:4222`).
- NATS in-cluster via the chosen path (harmony module or direct `K8sResourceScore`). Wait for readiness.
- Install CRD via the operator's `install` subcommand (typed Rust).
- Spawn operator as a host-side process (same pattern as smoke-a1). Operator connects to `nats://localhost:4222`.
- Provision ARM VM via `example_iot_vm_setup` (same entry point smoke-a3 uses). Agent configured to connect to `nats://<libvirt_gateway>:4222` — discover the gateway IP via `virsh net-dumpxml default`, as smoke-a3 already does.
- Sanity: `kubectl wait ... crd Established`, operator logged "KV bucket ready", agent logged "watching KV keys", `status.<device>` present in the `agent-status` bucket.
- Hand off. Print the command menu below. Exit 0 with a cleanup trap on `INT`/`TERM`.
Command menu at hand-off
- `kubectl get deployments.iot.nationtech.io -A -w` — watch CR reconcile reactively.
- `cargo run -q -p example_iot_apply_deployment -- --image nginx:latest --target-device $TARGET_DEVICE` — apply an nginx deployment via typed Rust.
- `cargo run -q -p example_iot_apply_deployment -- --print --image nginx:latest --target-device $TARGET_DEVICE | kubectl apply -f -` — same thing, through kubectl.
- `ssh -i $SSH_KEY iot-admin@$VM_IP` — connect to the VM.
- `virsh console $VM_NAME --force` — serial console alternative.
- `podman --url unix://$VM_IP:... ps` or ssh + `podman ps` — list containers on the VM from the workstation.
- `podman run --rm docker.io/natsio/nats-box nats --server nats://localhost:4222 kv ls desired-state` — list desired-state keys (from the host).
- `podman run --rm ... nats kv get desired-state '<device>.<deployment>' --raw` — dump a specific desired state.
- `podman run --rm ... nats kv get agent-status 'status.<device>' --raw` — dump the heartbeat.
- `curl http://$VM_IP:8080/` — hit the deployed nginx.
`--auto` path (for regression)
- Apply `nginx:latest`, wait for the container on the VM, `curl` 200.
- Apply `nginx:1.26` (upgrade), wait for the container id to change, `curl` 200 against the new container.
- Apply `--delete`, wait for the container to be gone from the VM.
Files
- NEW `examples/iot_apply_deployment/Cargo.toml` + `src/main.rs` — typed applier.
- NEW `iot/scripts/smoke-a4.sh`.
- NO yaml fixtures. Rust CLI flags cover the shape.
- Optional: factor shared smoke phases (NATS up, k3d up, operator spawn, VM provision) into `iot/scripts/lib/` if the duplication across a1/a3/a4 becomes obvious. Don't force it.
NATS exposure — implementation-time notes
- k3d `@loadbalancer` port mapping binds the host's `0.0.0.0:4222` by default; libvirt VMs on `virbr0` can reach it via the gateway IP. No special NAT config required.
- Fallback if there's an environmental snag: keep the side-by-side podman container behind an opt-in `NATS_MODE=podman` flag. Don't default to that — the user explicitly asked for in-cluster.
Verification
- Fresh host: `ARCH=aarch64 ./iot/scripts/smoke-a4.sh` completes in 8-15 min and prints the command menu. `ARCH=aarch64 ./iot/scripts/smoke-a4.sh --auto` PASSes end-to-end, including the upgrade id-change assertion.
- x86_64 (`ARCH=x86-64`) completes in 2-5 min.
Explicitly out of scope
- `AgentStatus`/`DeploymentStatus` enrichment — Chapter 2.
- Helm chart, ArgoCD, auth, frontend — later chapters.
- Lifting the applier into a reusable `ApplyDeploymentScore` — only if a second consumer appears.
Chapter 2 — Status reflect-back + inventory [SHIPPED 2026-04-22]
Landed on feat/iot-status-reflect. Design notes preserved below
as the authoritative record of what was built + why; the
running code is the source of truth for how.
Goal: CRD .status reflects fleet reality. Per-device
success/failure counts, recent event lines, inventory snapshot.
NATS always holds current status for every device.
Sketch
- Enrich `AgentStatus` (`harmony-reconciler-contracts/src/status.rs`) — a hedged struct sketch follows this list:
  - `deployments: BTreeMap<String, DeploymentPhase>` keyed by deployment name. Phase: `Running | Failed | Pending` with `last_error: Option<String>` and `last_event_at: DateTime<Utc>`.
  - `recent_events: Vec<EventEntry>` — bounded ring buffer of the last N reconcile outcomes (success + failure) with timestamp, severity, short message. Serves the "few log lines from the most recent failure/success" requirement.
  - `inventory: Option<InventorySnapshot>` — CPU cores, RAM, disk, kernel, arch, agent version. Populated once + on change.
  - All new fields `#[serde(default)]` for forward compat.
- Agent populates from its reconciler state + event ring. Inventory snapshot reuses `harmony::inventory::Inventory::from_localhost()`.
- Operator watches the `agent-status` bucket, aggregates into the CRD's `.status.aggregate`:
  - Per-deployment phase counts: `{succeeded, failed, pending}`.
  - De-duplicated last-N events across all devices for that deployment.
  - Ref to the most-recent failing device + its `last_error`.
- CRD schema evolution: add the `.status.aggregate` subtree. `observed_score_string` stays for change detection or becomes a condition.
- Smoke updates: a1 and a4 assert `.status.aggregate.succeeded` transitions after reconcile. New test: kill a container out-of-band, assert `.failed` increments within 30s.
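A hedged sketch of the enriched wire type following the field list above; grouping `last_error`/`last_event_at` with the phase into one per-deployment entry is one plausible reading. The shipped definitions in `harmony-reconciler-contracts/src/status.rs` are authoritative.

```rust
use std::collections::BTreeMap;

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, Clone)]
pub enum DeploymentPhase {
    Running,
    Failed,
    Pending,
}

/// Per-deployment entry, keyed by deployment name in `AgentStatus::deployments`.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct DeploymentStatus {
    pub phase: DeploymentPhase,
    pub last_error: Option<String>,
    pub last_event_at: DateTime<Utc>,
}

/// One entry in the bounded recent-events ring.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct EventEntry {
    pub at: DateTime<Utc>,
    pub severity: String,
    pub message: String,
}

/// Populated once and on change; fields follow the inventory bullet above.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct InventorySnapshot {
    pub cpu_cores: u32,
    pub ram_bytes: u64,
    pub disk_bytes: u64,
    pub kernel: String,
    pub arch: String,
    pub agent_version: String,
}

#[derive(Serialize, Deserialize, Debug, Clone, Default)]
pub struct AgentStatus {
    // Every new field defaults so pre-enrichment payloads still deserialize.
    #[serde(default)]
    pub deployments: BTreeMap<String, DeploymentStatus>,
    #[serde(default)]
    pub recent_events: Vec<EventEntry>,
    #[serde(default)]
    pub inventory: Option<InventorySnapshot>,
}
```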
Out of scope in this chapter
- Full journald log streaming — bounded event ring covers the user's reflect-back requirement; full streaming is a later concern.
- Multi-device regression test — wait until a second VM or real Pi is around.
Chapter 3 — Helm chart (ArgoCD deferred)
Goal: operator ships as a versioned helm chart with CRD version-locked inside.
User clarified this session: ArgoCD exists in production; all it does is apply resources from the chart. Standing up ArgoCD in the smoke adds setup overhead with no incremental validation value.
Chapter 3 produces the chart + validates the `helm install` / `helm upgrade` lifecycles. ArgoCD consumption is a downstream operational concern for the user.
Sketch
- Chart location: `iot/iot-operator-v0/chart/` (or a sibling repo — defer the decision to implementation time).
- Templates: Namespace, SA, ClusterRole, ClusterRoleBinding, Deployment (operator pod), CRD.
- CRD yaml in the chart is generated at chart-publish time from the Rust `Deployment::crd()`. One-off release artifact, not framework path — consistent with "no yaml in framework code." A hedged generator sketch follows this list.
- Values: operator image tag, NATS URL, log level.
- Smoke: `helm install` into k3d → CR apply → same assertions as Chapter 1.
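A hedged sketch of the chart-publish step, reusing a placeholder spec like the one in the Chapter 1 applier sketch; the output path, API version, and binary shape are assumptions.

```rust
use kube::{CustomResource, CustomResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Placeholder spec: the published chart would import the operator's real type.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "iot.nationtech.io", version = "v1alpha1", kind = "Deployment", namespaced)]
pub struct DeploymentSpec {
    pub target_device: String,
    pub image: String,
    pub port: u16,
}

fn main() -> anyhow::Result<()> {
    // Render the CRD from the typed Rust definition at chart-publish time only;
    // nothing generates yaml at runtime.
    let yaml = serde_yaml::to_string(&Deployment::crd())?;
    std::fs::write("chart/templates/deployments.iot.nationtech.io.yaml", yaml)?;
    Ok(())
}
```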
Open questions
- Chart repo: subdir vs. separate git repo.
- CRD install mechanism: chart hook vs. templates directory. Drives CRD upgrade story.
Chapter 4 — Auth: Zitadel + OpenBao + per-device identity
Goal: per-device granular NATS credentials; SSO for operator users; OpenBao policy per device; JWT bootstrap from Zitadel.
Zitadel + OpenBao are already ~99% integrated in harmony; this chapter is wiring the IoT-specific flows.
Sketch
- Agent's `CredentialSource` trait (already abstract in the agent's `config.rs`) gets a Zitadel-JWT-backed implementation. Mints short-lived NATS creds via the OpenBao auth callout. A hedged trait sketch follows this list.
- Remove the shared-credentials `toml-shared` variant (v0 demo leftover).
- Availability: the auth-callout caches policies and tolerates OpenBao outages.
- SSO for operator users (separate flow): Zitadel groups → Kubernetes RBAC subjects on the `Deployment` CRD.
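A hedged sketch of the agent-side abstraction, assuming an async trait; the trait name comes from the plan, while the method signature, `NatsCredentials`, and `ZitadelJwtSource` are placeholders for illustration.

```rust
use async_trait::async_trait;
use chrono::{DateTime, Utc};

/// Credentials the agent uses to connect to NATS; fields are illustrative.
pub struct NatsCredentials {
    pub jwt: String,
    pub seed: String,
    pub expires_at: Option<DateTime<Utc>>,
}

#[async_trait]
pub trait CredentialSource: Send + Sync {
    /// Return credentials valid for at least the next reconcile cycle.
    async fn credentials(&self) -> anyhow::Result<NatsCredentials>;
}

/// Chapter 4 implementation sketch: exchange the device's Zitadel JWT for
/// short-lived NATS creds minted through the OpenBao auth callout.
pub struct ZitadelJwtSource {
    pub zitadel_issuer: String,
    pub device_id: String,
}

#[async_trait]
impl CredentialSource for ZitadelJwtSource {
    async fn credentials(&self) -> anyhow::Result<NatsCredentials> {
        // Sketch only:
        // 1. Present the device JWT to the OpenBao auth-callout endpoint.
        // 2. Cache the returned short-lived creds locally so a transient
        //    OpenBao outage does not stall reconciliation (availability bullet above).
        anyhow::bail!("wired up in Chapter 4")
    }
}
```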
Chapter 5 — Frontend (last)
Goal: operator-friendly UI for the decentralized platform.
Form factor undecided: Leptos web dashboard, CLI extension to
harmony_cli, or a TUI. Minimum viable product: read-only view of
fleet state (devices + deployments + aggregated status) powered by
the CRD .status from Chapter 2. Aspiration: write operations with
auth from Chapter 4.
Principles — what we've learned and want to keep doing
- No yaml in framework code paths. Every kube-rs type is typed; every Score apply goes through typed Rust. Yaml generation happens only at chart-publish time, never at runtime.
- Scores describe desired state; topologies expose capabilities. Prefer adding capability traits over thickening a single topology.
- Minimal topologies for ad-hoc Score execution. `K8sAnywhereTopology` has too many opinions (cert-manager install, tenant-manager bootstrap, helm probes) for narrow apply-a-CRD use cases. See ROADMAP §12.6 — a lean shared `K8sBareTopology` is the durable fix.
- Cross-boundary wire types in `harmony-reconciler-contracts`, everything else in its natural crate.
- Never ship untested code. Every commit that changes runtime behavior is verified against a smoke script before landing. Cargo check + unit tests aren't enough.
- Prove claims about upstream before blaming upstream. The
Arch edk2 investigation showed this matters; see
memory/feedback_prove_before_blaming_upstream.md.