harmony/ROADMAP/iot_platform/context_conversation.md
Jean-Gabriel Gill-Couture 65ef540b97
2026-04-17 20:15:10 -04:00


Conversation Summary: IoT Platform Architecture

For agents implementing this system: This document captures the full decision trail that led to the final iot-platform-v0-walking-skeleton.md plan. Understanding why decisions were made is as important as understanding what was decided — especially for judgment calls during implementation where the plan doesn't spell something out explicitly.


The original ask

Sylvain (CTO of NationTech) wanted to build an IoT platform with these specific requirements:

  • SSO via Zitadel
  • Secrets via OpenBao
  • Per-device identity, devices belong to groups
  • Full CI/CD integration
  • A "mini-kubelet" with NATS as the storage backend — each device is a node, reads its own resources, reconciles in a loop, reports status back to NATS KV
  • Central operator with CRDs for deployments, device groups, devices — operator writes to NATS on CRD change and reports deployment status back
  • CI/CD pipeline publishes hydrated Helm charts to Harbor registry; ArgoCD applies them; operator picks them up and pushes to NATS
  • Devices run containers declared via Harmony Scores
  • Strong consistency assumed free (NATS provides it)
  • Zitadel/OpenBao integration already ~99% done in Harmony
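The "mini-kubelet" loop above is the heart of the design: each device reads only its own desired state, reconciles, and reports status back. A minimal sketch of that shape, with a mock in-memory store standing in for NATS KV and illustrative types in place of real Harmony Scores:

```rust
use std::collections::HashMap;

// Illustrative stand-in for the desired state a device reads from NATS KV;
// the real agent deserializes a typed Harmony Score instead.
#[derive(Clone, PartialEq, Debug)]
struct DesiredState {
    containers: Vec<String>,
}

// The mini-kubelet shape: read own desired state, diff, act, report status.
trait StateStore {
    fn read_desired(&self, device: &str) -> Option<DesiredState>;
    fn report_status(&mut self, device: &str, status: &str);
}

struct MockStore {
    desired: HashMap<String, DesiredState>,
    status: HashMap<String, String>,
}

impl StateStore for MockStore {
    fn read_desired(&self, device: &str) -> Option<DesiredState> {
        self.desired.get(device).cloned()
    }
    fn report_status(&mut self, device: &str, status: &str) {
        self.status.insert(device.to_string(), status.to_string());
    }
}

// One pass of the reconcile loop; the real agent runs this periodically.
fn reconcile<S: StateStore>(store: &mut S, device: &str, actual: &mut DesiredState) {
    if let Some(desired) = store.read_desired(device) {
        if *actual != desired {
            // Real agent: create/stop containers via podman here.
            *actual = desired;
        }
        store.report_status(device, "Ready");
    }
}
```

The trait boundary is doing the same job the plan's §6 boundaries do: the loop's shape survives even if the store behind it changes.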

Original constraints: keep it simple throughout, make it production-ready, don't go down rabbit holes, maintain deadline and cost discipline.


Phase 1: Initial architecture research and design

Claude researched NATS, Zitadel, and OpenBao integration patterns in depth using primary sources. Key findings that shaped the design:

NATS auth callout with bearer-token JWTs is the right identity primitive. Devices don't hold NATS signing material. An auth callout service mints a scoped per-connection user JWT with bearer_token: true (skips nonce signing, per the nats-io/jwt source) after verifying a Zitadel token the device presents at CONNECT time. This is cleaner than distributing long-lived NATS NKeys to devices. ADR-26 is the authoritative spec.
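Concretely, the callout-minted user JWT carries its grant in the nats claims block, roughly as below. Field names follow the nats-io/jwt user-claims schema; the key values and subject names are illustrative placeholders, not the actual subject grammar:

```json
{
  "iat": 1767000000,
  "exp": 1767000300,
  "iss": "A...ACCOUNT_SIGNING_KEY",
  "sub": "U...EPHEMERAL_USER_KEY",
  "nats": {
    "type": "user",
    "version": 2,
    "bearer_token": true,
    "pub": { "allow": ["iot.device.pi-01.status"] },
    "sub": { "allow": ["iot.device.pi-01.>"] }
  }
}
```

With bearer_token set, the server accepts the JWT without requiring the client to sign the connection nonce, which is what lets devices connect without holding any NKey material.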

Zitadel JWT Profile grant is the device auth path. Service accounts with public keys registered in Zitadel, devices sign self-JWTs with their private key, exchange for access tokens. Zitadel Discussion #8406 documents exactly this pattern with working Go code for an IoT/TPM case. Key gotcha: external-OpenSSL keys need ParsePKCS8PrivateKey, not PKCS1.

OpenBao's JWT auth method + templated policies with token_policies_template_claims (PR #618) lets one policy resolve per-device based on JWT claims. One policy for N devices instead of N policies.
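Sketched as an OpenBao policy, with the templated claim doing the per-device scoping. The mount accessor, claim name, and secret path here are illustrative — the exact template keys depend on how the JWT auth mount maps claims:

```hcl
# One policy for N devices: the device_id claim from the verified JWT
# resolves the path at evaluation time.
path "secret/data/devices/{{identity.entity.aliases.auth_jwt_1a2b3c4d.metadata.device_id}}/*" {
  capabilities = ["read"]
}
```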

The three systems compose cleanly:

  • Zitadel = identity (who or what)
  • OpenBao = policy + secrets (what they can access)
  • NATS = transport + subject-level authorization (where messages go)

A ~900-line architecture document captured this with primary-source citations. It remains the reference for implementation detail on auth flows.


Phase 2: Planning iterations and scope calibration

Claude produced three planning documents in sequence, each refining the approach:

  1. Issue breakdown (22 issues, 9 tracks) — human-executable parallel tracks
  2. Autonomous agent harness plan — contract-first with phase gates, for agent-driven execution
  3. Walking skeleton plan — thin end-to-end thread shipping Tuesday

Sylvain's critical intervention: when Claude produced the parallel-tracks plan targeting day 14, Sylvain pushed back. The real approach should be walking skeleton (Cockburn) / tracer bullet — ship a naive end-to-end loop first, let architecture emerge from running code, harden from there. This reduces the risk of reaching day 14 with nothing integrated.

Claude acknowledged overreach: the three documents shouldn't all exist. Walking skeleton supersedes the other two. The first two became reference material only.


Phase 3: The "start from scratch objectively" challenge

Sylvain asked Claude to reconsider the architecture from scratch, with access to NationTech's full resource context (k8s clusters, ArgoCD, Harbor, Zitadel, OpenBao, Harmony ownership) but without emotional attachment to the previous design.

Claude's initial recommendation: k3s on each Pi + ArgoCD + external-secrets-operator. Boring, CNCF-standard, maintained by the ecosystem rather than by NationTech. Argued the custom NATS-mini-kubelet approach was "building a platform when you could buy one."

Sylvain's decisive pushback reframed this correctly. Claude had under-weighted several things:

  1. End-customer engineers are mechanical/electrical/chemical, not Kubernetes-literate. They debug with systemctl, journalctl, ps. A k3s device forces them to learn kubectl/CRDs/CNI — a real productivity tax on a team that shouldn't have to pay it. A single Rust binary + podman is inspectable with tools they already know.

  2. The platform bet is strategic, not technical. NationTech's positioning as "no vendor lock-in, decentralized, open-source enterprise cloud" gains credibility from having a product (Harmony), not from being "extraordinary plumbers for off-the-shelf CNCF." Building a custom platform on this bet is how you become a platform company instead of an integration shop.

  3. NationTech is its own largest customer. Multiple OKD clusters already need coordination; manually connecting to each to make deployments is a major operational pain that hinders growth. The same architecture (agent reconciling against NATS KV) eventually manages podman on Pis, kubectl apply on OKD clusters, and VM-level operations. One abstraction, three instantiations.

  4. NATS is architecturally superior for federation. ArgoCD doesn't naturally federate — it manages clusters from one place. A NATS supercluster with strict ordering across regions supports "operator in multiple clusters, ArgoCD instances all over, deployments coming from everywhere." For the long-term decentralized control plane, NATS is the correct substrate.

  5. Rancher code quality (k3s provenance) is real data, not nostalgia. Sylvain has direct experience; Claude had over-indexed on CNCF graduation as a quality proxy.

  6. Harmony daemon-mode Interpret is already solved. Claude had repeatedly flagged "does Score::interpret() work in a loop?" as a major unknown. Reality: s.clone().interpret().await is exactly the TUI's daemon pattern, and harmony_agent runs this in production for distributed CNPG PostgreSQL management. The concern was unfounded.

Result: Claude updated. The custom NATS-based platform is correct for this context. The k3s alternative genuinely doesn't fit. The walking skeleton plan stands.

Remaining real risks (acknowledged, not architecture-invalidating):

  • Platform scope creep → walking skeleton discipline
  • Bus factor → normal Harmony collaboration patterns with Jean-Gabriel
  • Customers #2-N for the federation story → business question, not technical

Phase 4: Strategic alignment and scope clarifications

Sylvain provided specific clarifications that shaped the final plan:

Balena was considered and rejected. It's the closest viable alternative and open-source, but it requires a custom OS (balenaOS — lock-in of a different kind), lacks native SSO + secrets integration, and positions NationTech as a Balena integrator rather than a platform company. Harmony (AGPL) and Balena have similar license profiles, so NationTech can still deliver honest no-lock-in positioning.

Three-way relationship structure: NationTech → Partner (custom software shop, engineering-quality-focused, does coaching) → End-customer (whose field-deployed Pi 5 devices run the partner's application). Tuesday's demo is for the Partner. Production deployment may involve direct end-customer contact later.

The partner relationship is healthy and collaborative. They want NationTech to succeed, and demo failure modes are tolerable. Platform partnership is an active topic between the teams — they explicitly value having a platform partner they trust for landing their own customers.

Other potential customers exist but aren't paying. NationTech is managing their OKD clusters via other means for now. They can wait. NationTech's own OKD coordination pain is the largest driver.


Phase 5: Technical nitty-gritty corrections

Sylvain corrected several technical details Claude had gotten wrong or overdesigned:

No harmony-podman-score new crate. It's a new module in harmony/src/modules/podman/ following existing Harmony module conventions. Corrected in the plan.

Use podman-api Rust crate, not shell-out. Strongly typed API preferred. Requires systemctl --user enable --now podman.socket on the device. podlet crate worth evaluating later when Quadlet comes back in scope (v0.1+).

Graceful shutdown is just podman stop with 5-min timeout then SIGKILL. Not kubelet-style pod termination. Claude had overcomplicated this.

Score envelope was overdesigned — drop it. The "ScoreEnvelope with format/encoding/content_hash/data" pattern reminded Sylvain of SOAP. Use an adjacently tagged serde enum instead: #[serde(tag = "type", content = "data")]. The Rust type name is the discriminator. The agent deserializes directly into the typed Score. No double deserialization, no opaque bytes, no format version strings.

Change detection via string comparison, not content hash. Comparing serialized Score strings is cheap enough at this scale (a couple times per minute). Removes hashing-algorithm risk. More deterministic.

Agent config is flat TOML for v0. Long-term target is zero-config — device boots (PXE if budget allows), has a Zitadel URL + initial token, fetches real config from OpenBao, connects to NATS. OpenBao as source of truth for NATS credentials. v0 uses simple shared NATS credentials directly in TOML.
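A sketch of what the flat v0 file might look like — key names are illustrative, not an agreed schema:

```toml
# /etc/iot-agent/config.toml — v0: flat, no nesting, no indirection.
device_id = "pi-01"
nats_url = "nats://nats.example.internal:4222"
# v0 only: shared NATS credentials referenced directly from disk.
# v0.2 replaces this with the Zitadel-token bootstrap + OpenBao-sourced config.
nats_creds_path = "/etc/iot-agent/shared.creds"
reconcile_interval_secs = 30
```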

OpenBao outage must not break NATS reconnect if token is still valid. The auth callout in v0.2 should validate Zitadel tokens against JWKS directly; OpenBao lookup for group permissions should be cached in the callout. Availability-favoring design — reboot isn't more of a security event than a passing minute, and NATS rejects on actual token expiry anyway. No degradation of real security posture.
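The caching behavior can be sketched as follows — an illustrative in-memory cache, not the callout's actual design. The point is the failure branch: an OpenBao error serves the stale entry instead of rejecting the connection, while revocation stays with token expiry, which NATS enforces anyway:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Availability-favoring cache for group permissions looked up in OpenBao.
// All names are illustrative.
struct PermissionCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, Vec<String>)>,
}

impl PermissionCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    // Return cached permissions, calling `fetch` (the OpenBao lookup) only
    // when the entry is missing or past its TTL. If the fetch fails, keep
    // serving the stale entry rather than failing the connection.
    fn get(
        &mut self,
        group: &str,
        fetch: impl FnOnce() -> Result<Vec<String>, String>,
    ) -> Option<Vec<String>> {
        let fresh = self
            .entries
            .get(group)
            .filter(|(at, _)| at.elapsed() < self.ttl)
            .map(|(_, perms)| perms.clone());
        if let Some(perms) = fresh {
            return Some(perms);
        }
        match fetch() {
            Ok(perms) => {
                self.entries
                    .insert(group.to_string(), (Instant::now(), perms.clone()));
                Some(perms)
            }
            // OpenBao outage: serve stale if we have anything at all.
            Err(_) => self.entries.get(group).map(|(_, p)| p.clone()),
        }
    }
}
```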


Phase 6: aarch64 discovery

Late in the conversation, the single most important technical issue surfaced. Claude had been flagging "does Harmony Interpret work in daemon mode?" as the biggest Friday risk. Sylvain corrected that this was a non-issue.

The real issue: Harmony doesn't currently compile on aarch64. When harmony_agent was cross-compiled for ARM64, an upstream dependency had to be pulled out. Sylvain's 80%-confidence read: a single sub-dependency, used by only a few modules, is feature-gatable; those modules become unavailable on ARM (acceptable — the device doesn't need every Harmony feature).

This replaces the Friday-evening "§6 decision" in the plan. It becomes the first-hour investigation. Fallback paths exist: build agent against minimum aarch64-clean Harmony subset, or (worst case) pure Rust without Harmony Score traits for v0, adopt them in v0.1.


Phase 7: Final plan adjustments

The walking skeleton plan was updated with all agreed decisions in a single coherent revision. Key decisions baked into the final doc:

Section-by-section:

  • §1 Strategic framing now explicitly names NationTech as largest customer and describes the decentralized cloud vision (heating buildings, sovereign, etc.) so collaborators reading the plan understand this is long-term investment, not a side project.

  • §5.4 agent scope: kubelet compatibility explicitly NOT a goal, kubelet architecture as north star only, v0 absolutely minimal. One paragraph, no enforced limits — discipline through inherent minimalism.

  • §5.5 Score message format: adjacently-tagged serde enum, no envelope, no content hash, string comparison for change detection.

  • §6.7 agent config: flat TOML for v0. v0.2 narrows to Zitadel-token-bootstrap model.

  • §7 aarch64 investigation is the Friday-evening critical path. Fallbacks documented.

  • §8 Hour 1-2 field readiness: heavy power-cycle testing, network-out-during-boot, agent crash loop. SD card wear / thermal / PoE explicitly ruled out per partner conversation.

  • Agent task cards: A1 uses new Score format. A2 targets harmony/src/modules/rpi/. A3 commits to podman-api crate. Graceful shutdown simplified.

  • §12 v0.2 roadmap includes availability-favoring auth callout design (cached OpenBao permissions, NATS handles token expiry).

  • §13 partner conversation: technical strategy only; "others in your network" question dropped per Sylvain ("overstepping into sales, not your concern").

Explicitly removed:

  • OKD-as-device future spike (kept in strategic framing only, not execution)
  • Three-level Jean-Gabriel review process (normal Harmony collaboration applies)
  • ScoreEnvelope wrapping
  • Content hash in Score messages
  • iot-contracts crate in v0 (extract v0.1)
  • Thesis document Sunday dispatch (moved to v0.1 Week 2)

Key principles for implementing agents

Drawing these out as they're load-bearing for judgment calls:

  1. The walking skeleton is the plan. Ship Tuesday with something crude but working. Not production-ready, not complete. Working end-to-end thread from git push to container running on Pi.

  2. Inherent discipline over enforced limits. The plan doesn't have line-count budgets or anti-scope lists because Sylvain argued (correctly) that walking-skeleton discipline makes them redundant. If you find yourself wanting to add PLEG event streams, per-workload worker pools, or housekeeping sweeps to v0 — don't. Periodic relist is enough.

  3. Architectural boundaries (§6) must survive v0 even under deadline pressure. Score enum polymorphic from day one. Credentials behind a trait. Topology generalizable. CRD spec forward-compatible. NATS subject grammar matches long-term. These cost little now and save big later. Don't take shortcuts here to save 20 minutes.

  4. Scope cuts (§4) are real, not aspirational. Zitadel/OpenBao deferred to v0.2. One device for Tuesday. No groups. No rollout state machine. No API. No TUI. No observability beyond journalctl. Fighting these cuts is the plan's biggest risk.

  5. Availability favored over strict security posture. The auth callout caches OpenBao lookups. Token expiry is the authoritative revocation mechanism, not real-time policy lookup. A disconnected OpenBao doesn't brick the fleet.

  6. The podman-api crate is the happy path. Shell-out to podman is fallback-only. Strong typing wins when available.

  7. Sylvain owns the critical code himself. Agent A1 (operator), A2 (Pi provisioning), A3 (installer), A4 (demo script) are agent-dispatched. The agent binary itself and the PodmanV0Score implementation are Sylvain's work. The auth callout (v0.2) will also be human-written. Don't propose that agents take over these pieces.

  8. The partner relationship is strategic. Tuesday demo conversation is half the Tuesday deliverable. Framing the v0.1/v0.2/v0.3 roadmap to them matters as much as the running code.

  9. End-customer debuggability is a UX constraint. Mechanical/electrical/chemical engineers will touch these devices. systemctl status iot-agent must tell them what's happening. journalctl -u iot-agent must be parseable by humans. Error messages must be understandable without Kubernetes knowledge.

  10. NATS is the long-term architectural commitment. Everything on NATS — not as a queue, as a coordination fabric. The "decentralized cluster management" future depends on this choice. Implementation decisions that weaken this (e.g., "let's just put a database in the middle") should be pushed back on.


What failed or went wrong in the planning process

Noted for meta-awareness — avoid repeating:

  • Claude overproduced. Three planning documents when two would do. Under deadline pressure, planning documents are distractions from execution. Sylvain eventually said this directly.

  • Claude under-weighted end-customer UX. Initial k3s recommendation treated "Kubernetes is easy" as universal when it's only easy for people who already know Kubernetes.

  • Claude under-weighted strategic positioning. Platform-building vs. integration-consulting is a business choice; Claude treated it as purely technical.

  • Claude repeatedly flagged the Harmony daemon-mode concern despite it being already solved. A better first question would have been "does this work today?" rather than "what if this doesn't work?"

  • Claude's initial Zitadel/OpenBao integration estimate was too large because Claude didn't fully internalize "integration is 99% done in Harmony." The remaining work is wiring, not implementing.

  • Claude started with the ScoreEnvelope pattern before understanding Harmony's native serde patterns. The "SOAP" reaction was deserved.


What's in the final plan

The final document iot-platform-v0-walking-skeleton.md (~700 lines) contains:

  • Strategic framing (§1)
  • Walking skeleton vs. alternatives comparison (§2)
  • Tuesday demo definition (§3)
  • Scope cuts with milestones (§4)
  • End-to-end architecture (§5)
  • Architecture boundaries to preserve (§6)
  • Friday aarch64 investigation and fallbacks (§7)
  • Hour-by-hour Friday-Tuesday plan (§8)
  • Four agent task cards (§9)
  • Anti-patterns prevented (§10)
  • Failure-mode decision tree (§11)
  • Post-Tuesday roadmap v0.1→v0.4+ (§12)
  • Partner conversation structure for Tuesday (§13)

Companion documents for deep-dive reference:

  • iot-platform-architecture.md — full architecture with primary-source citations, useful for v0.2+ when auth is implemented

What agents should do when uncertain

The plan cannot anticipate everything. When an agent hits an ambiguity, the decision hierarchy is:

  1. Does this preserve the end-to-end thread for Tuesday? If yes, proceed. If it breaks the thread, stop and escalate.
  2. Does this preserve architectural boundaries §6? If unsure, favor the boundary.
  3. Does this add scope beyond §4's in-scope list? If yes, don't do it, regardless of how easy it seems.
  4. Is this security-critical? If yes, don't add new code — flag for human review. Especially relevant for v0.2 auth callout work.
  5. Would this be more elegant but take an extra hour? Don't do it. Ship Tuesday.
  6. Is the end-customer engineer's debuggability harmed by this choice? If yes, don't do it.
  7. Is this on the path to the OKD-cluster-as-device future? Don't optimize for this in v0. The abstractions are correct; don't over-invest.

The walking skeleton's entire value is shipping Tuesday. Every decision that serves that goal is correct. Every decision that defers it (no matter how well-intentioned) is wrong.