harmony/ROADMAP/iot_platform/context_conversation.md
Jean-Gabriel Gill-Couture 65ef540b97
feat: scaffold IoT walking skeleton — podman module, operator, and agent
- Add PodmanV0Score/IotScore (adjacent-tagged serde) and PodmanV0Interpret stub
- Gate virt behind kvm feature and podman-api behind podman feature
- Scaffold iot-operator-v0 (kube-rs operator stub) and iot-agent-v0 (NATS KV watch)
- Add PodmanV0 to InterpretName enum
- Fix aarch64 cross-compilation by making kvm/podman optional features
- Align async-nats across workspace, add workspace deps for tracing/toml/tracing-subscriber
- Remove unused deps (serde_yaml from agent, schemars from operator)
- Add Send+Sync to CredentialSource, fix &PathBuf → &Path, remove dead_code allow
- Update 5 KVM example Cargo.tomls with explicit features = ["kvm"]
2026-04-17 20:15:10 -04:00


# Conversation Summary: IoT Platform Architecture
**For agents implementing this system:** This document captures the full decision trail that led to the final `iot-platform-v0-walking-skeleton.md` plan. Understanding *why* decisions were made is as important as understanding *what* was decided — especially for judgment calls during implementation where the plan doesn't spell something out explicitly.

---
## The original ask
Sylvain (CTO of NationTech) wanted to build an IoT platform with these specific requirements:
- SSO via Zitadel
- Secrets via OpenBao
- Per-device identity, devices belong to groups
- Full CI/CD integration
- A "mini-kubelet" with NATS as the storage backend — each device is a node, reads its own resources, reconciles in a loop, reports status back to NATS KV
- Central operator with CRDs for deployments, device groups, devices — operator writes to NATS on CRD change and reports deployment status back
- CI/CD pipeline publishes hydrated Helm charts to Harbor registry; ArgoCD applies them; operator picks them up and pushes to NATS
- Devices run containers declared via Harmony Scores
- Strong consistency assumed free (NATS provides it)
- Zitadel/OpenBao integration already ~99% done in Harmony
Original constraints: simplicity above all, production readiness, no rabbit holes, and deadline and cost discipline.

---
## Phase 1: Initial architecture research and design
Claude researched NATS, Zitadel, and OpenBao integration patterns in depth using primary sources. Key findings that shaped the design:
**NATS auth callout with bearer-token JWTs is the right identity primitive.** Devices don't hold NATS signing material. An auth callout service mints a scoped per-connection user JWT with `bearer_token: true` (skips nonce signing, per the `nats-io/jwt` source) after verifying a Zitadel token the device presents at CONNECT time. This is cleaner than distributing long-lived NATS NKeys to devices. ADR-26 is the authoritative spec.
**Zitadel JWT Profile grant is the device auth path.** Service accounts with public keys registered in Zitadel, devices sign self-JWTs with their private key, exchange for access tokens. Zitadel Discussion #8406 documents exactly this pattern with working Go code for an IoT/TPM case. Key gotcha: external-OpenSSL keys need `ParsePKCS8PrivateKey`, not PKCS1.
**OpenBao's JWT auth method + templated policies** with `token_policies_template_claims` (PR #618) lets one policy resolve per-device based on JWT claims. One policy for N devices instead of N policies.
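For illustration, the standard identity-templating form looks like the sketch below — the path layout and template source are assumptions, and the claim-based variant enabled by `token_policies_template_claims` follows the same one-policy-for-N-devices shape:

```hcl
# Illustrative only — the path layout and template source are assumptions.
# One policy serves every device: the path resolves per-entity at runtime.
path "secret/data/devices/{{identity.entity.name}}/*" {
  capabilities = ["read"]
}
```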
The three systems compose cleanly:
- Zitadel = identity (who or what)
- OpenBao = policy + secrets (what they can access)
- NATS = transport + subject-level authorization (where messages go)
A ~900-line architecture document captured this with primary-source citations. It remains the reference for implementation detail on auth flows.

---
## Phase 2: Planning iterations and scope calibration
Claude produced three planning documents in sequence, each refining the approach:
1. **Issue breakdown (22 issues, 9 tracks)** — human-executable parallel tracks
2. **Autonomous agent harness plan** — contract-first with phase gates, for agent-driven execution
3. **Walking skeleton plan** — thin end-to-end thread shipping Tuesday
**Sylvain's critical intervention:** when Claude produced the parallel-tracks plan targeting day 14, Sylvain pushed back. The real approach should be **walking skeleton** (Cockburn) / **tracer bullet** — ship a naive end-to-end loop first, let architecture emerge from running code, harden from there. This reduces the risk of reaching day 14 with nothing integrated.
Claude acknowledged overreach: the three documents shouldn't all exist. Walking skeleton supersedes the other two. The first two became reference material only.

---
## Phase 3: The "start from scratch objectively" challenge
Sylvain asked Claude to reconsider the architecture from scratch, with access to NationTech's full resource context (k8s clusters, ArgoCD, Harbor, Zitadel, OpenBao, Harmony ownership) but without emotional attachment to the previous design.
**Claude's initial recommendation:** k3s on each Pi + ArgoCD + external-secrets-operator. Boring, CNCF-standard, maintained by the ecosystem rather than by NationTech. Argued the custom NATS-mini-kubelet approach was "building a platform when you could buy one."
**Sylvain's decisive pushback** reframed this correctly. Claude had under-weighted several things:
1. **End-customer engineers are mechanical/electrical/chemical, not Kubernetes-literate.** They debug with `systemctl`, `journalctl`, `ps`. A k3s device forces them to learn kubectl/CRDs/CNI — a real productivity tax on a team that shouldn't have to pay it. A single Rust binary + podman is inspectable with tools they already know.
2. **The platform bet is strategic, not technical.** NationTech's positioning as "no vendor lock-in, decentralized, open-source enterprise cloud" gains credibility from having a product (Harmony), not from being "extraordinary plumbers for off-the-shelf CNCF." Building a custom platform on this bet is how you become a platform company instead of an integration shop.
3. **NationTech is its own largest customer.** Multiple OKD clusters already need coordination; manually connecting to each to make deployments is a major operational pain that hinders growth. The same architecture (agent reconciling against NATS KV) eventually manages podman on Pis, `kubectl apply` on OKD clusters, and VM-level operations. One abstraction, three instantiations.
4. **NATS is architecturally superior for federation.** ArgoCD doesn't naturally federate — it manages clusters *from one place*. A NATS supercluster with strict ordering across regions supports "operator in multiple clusters, ArgoCD instances all over, deployments coming from everywhere." For the long-term decentralized control plane, NATS is the correct substrate.
5. **Rancher code quality (k3s provenance) is real data, not nostalgia.** Sylvain has direct experience; Claude had over-indexed on CNCF graduation as a quality proxy.
6. **Harmony daemon-mode `Interpret` is already solved.** Claude had repeatedly flagged "does `Score::interpret()` work in a loop?" as a major unknown. Reality: `s.clone().interpret().await` is exactly the TUI's daemon pattern, and `harmony_agent` runs this in production for distributed CNPG PostgreSQL management. The concern was unfounded.
**Result:** Claude updated. The custom NATS-based platform is correct for this context. The k3s alternative genuinely doesn't fit. The walking skeleton plan stands.
Remaining real risks (acknowledged, not architecture-invalidating):
- Platform scope creep → walking skeleton discipline
- Bus factor → normal Harmony collaboration patterns with Jean-Gabriel
- Customers #2-N for the federation story → business question, not technical
---
## Phase 4: Strategic alignment and scope clarifications
Sylvain provided specific clarifications that shaped the final plan:
**Balena was considered and rejected.** It's the closest viable alternative, open-source, but it requires a custom OS (balenaOS — lock-in of a different kind), lacks native SSO + secrets integration, and positions NationTech as a Balena integrator rather than a platform company. Harmony (AGPL) and Balena have similar license profiles, so NationTech can still deliver honest no-lock-in positioning.
**Three-way relationship structure:** NationTech → Partner (custom software shop, engineering-quality-focused, does coaching) → End-customer (whose field-deployed Pi 5 devices run the partner's application). Tuesday's demo is for the Partner. Production deployment may involve direct end-customer contact later.
**Partner relationship is healthy and collaborative.** They want NationTech to succeed. Demo failure modes tolerable. Platform partnership is an active topic between the teams — they explicitly value having a platform partner they trust for landing their own customers.
**Other potential customers exist but aren't paying.** NationTech is managing their OKD clusters via other means for now. They can wait. NationTech's own OKD coordination pain is the largest driver.

---
## Phase 5: Technical nitty-gritty corrections
Sylvain corrected several technical details Claude had gotten wrong or overdesigned:
**No `harmony-podman-score` new crate.** It's a new module in `harmony/src/modules/podman/` following existing Harmony module conventions. Corrected in the plan.
**Use `podman-api` Rust crate, not shell-out.** Strongly typed API preferred. Requires `systemctl --user enable --now podman.socket` on the device. `podlet` crate worth evaluating later when Quadlet comes back in scope (v0.1+).
**Graceful shutdown is just `podman stop` with 5-min timeout then SIGKILL.** Not kubelet-style pod termination. Claude had overcomplicated this.
**Score envelope was overdesigned — drop it.** The "ScoreEnvelope with format/encoding/content_hash/data" pattern reminded Sylvain of SOAP. Use an adjacently-tagged serde enum instead: `#[serde(tag = "type", content = "data")]`. The Rust type name is the discriminator. The agent deserializes directly into the typed Score. No double-deserialization, no opaque bytes, no format version strings.
**Change detection via string comparison, not content hash.** Comparing serialized Score strings is cheap enough at this scale (a couple times per minute). Removes hashing-algorithm risk. More deterministic.
**Agent config is flat TOML for v0.** Long-term target is zero-config — device boots (PXE if budget allows), has a Zitadel URL + initial token, fetches real config from OpenBao, connects to NATS. OpenBao as source of truth for NATS credentials. v0 uses simple shared NATS credentials directly in TOML.
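A sketch of what the flat v0 TOML might look like — every field name here is illustrative, not the real schema:

```toml
# Hypothetical v0 agent config. Field names are illustrative.
device_id = "pi-001"
nats_url = "nats://nats.internal.example:4222"
# v0: shared credentials straight in config. v0.2 replaces this with the
# Zitadel-token bootstrap (fetch real config from OpenBao at boot).
nats_creds_file = "/etc/iot-agent/nats.creds"
reconcile_interval_secs = 30
```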
**OpenBao outage must not break NATS reconnect if token is still valid.** The auth callout in v0.2 should validate Zitadel tokens against JWKS directly; OpenBao lookup for group permissions should be cached in the callout. Availability-favoring design — reboot isn't more of a security event than a passing minute, and NATS rejects on actual token expiry anyway. No degradation of real security posture.

---
## Phase 6: aarch64 discovery
Late in the conversation, the single most important technical issue surfaced. Claude had been flagging "does Harmony `Interpret` work in daemon mode?" as the biggest Friday risk. Sylvain corrected that this was a non-issue.
**The real issue:** Harmony doesn't currently compile on aarch64. When `harmony_agent` was cross-compiled for ARM64, an upstream dependency had to be pulled out. Sylvain's 80% confidence: single sub-dependency used by only a few modules, feature-gatable, those modules become unavailable on ARM (acceptable — device doesn't need every Harmony feature).
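The feature-gating shape is the standard Cargo optional-dependency pattern. The feature names below mirror the scaffold commit (`kvm`, `podman`), but crate versions are placeholders:

```toml
# Illustrative Cargo.toml sketch — versions are placeholders.
[features]
default = []
kvm = ["dep:virt"]          # virt-backed modules, unavailable on aarch64
podman = ["dep:podman-api"]

[dependencies]
virt = { version = "0.4", optional = true }
podman-api = { version = "0.10", optional = true }
```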
**This replaces the Friday-evening "§6 decision" in the plan.** It becomes the first-hour investigation. Fallback paths exist: build agent against minimum aarch64-clean Harmony subset, or (worst case) pure Rust without Harmony Score traits for v0, adopt them in v0.1.

---
## Phase 7: Final plan adjustments
The walking skeleton plan was updated with all agreed decisions in a single coherent revision. Key decisions baked into the final doc:
**Section-by-section:**
- **§1 Strategic framing** now explicitly names NationTech as largest customer and describes the decentralized cloud vision (heating buildings, sovereign, etc.) so collaborators reading the plan understand this is long-term investment, not a side project.
- **§5.4 agent scope:** kubelet compatibility explicitly NOT a goal, kubelet architecture as north star only, v0 absolutely minimal. One paragraph, no enforced limits — discipline through inherent minimalism.
- **§5.5 Score message format:** adjacently-tagged serde enum, no envelope, no content hash, string comparison for change detection.
- **§6.7 agent config:** flat TOML for v0. v0.2 narrows to Zitadel-token-bootstrap model.
- **§7 aarch64 investigation** is the Friday-evening critical path. Fallbacks documented.
- **§8 Hour 1-2 field readiness:** heavy power-cycle testing, network-out-during-boot, agent crash loop. SD card wear / thermal / PoE explicitly ruled out per partner conversation.
- **Agent task cards:** A1 uses new Score format. A2 targets `harmony/src/modules/rpi/`. A3 commits to `podman-api` crate. Graceful shutdown simplified.
- **§12 v0.2 roadmap** includes availability-favoring auth callout design (cached OpenBao permissions, NATS handles token expiry).
- **§13 partner conversation:** technical strategy only; "others in your network" question dropped per Sylvain ("overstepping into sales, not your concern").
**Explicitly removed:**
- OKD-as-device future spike (kept in strategic framing only, not execution)
- Three-level Jean-Gabriel review process (normal Harmony collaboration applies)
- ScoreEnvelope wrapping
- Content hash in Score messages
- `iot-contracts` crate in v0 (extract v0.1)
- Thesis document Sunday dispatch (moved to v0.1 Week 2)
---
## Key principles for implementing agents
Drawing these out as they're load-bearing for judgment calls:
1. **The walking skeleton is the plan. Ship Tuesday with something crude but working.** Not production-ready, not complete. Working end-to-end thread from git push to container running on Pi.
2. **Inherent discipline over enforced limits.** The plan doesn't have line-count budgets or anti-scope lists because Sylvain argued (correctly) that walking-skeleton discipline makes them redundant. If you find yourself wanting to add PLEG event streams, per-workload worker pools, or housekeeping sweeps to v0 — don't. Periodic relist is enough.
3. **Architectural boundaries (§6) must survive v0 even under deadline pressure.** Score enum polymorphic from day one. Credentials behind a trait. Topology generalizable. CRD spec forward-compatible. NATS subject grammar matches long-term. These cost little now and save big later. Don't take shortcuts here to save 20 minutes.
4. **Scope cuts (§4) are real, not aspirational.** Zitadel/OpenBao deferred to v0.2. One device for Tuesday. No groups. No rollout state machine. No API. No TUI. No observability beyond journalctl. Pressure to walk back these cuts is the plan's biggest risk.
5. **Availability favored over strict security posture.** The auth callout caches OpenBao lookups. Token expiry is the authoritative revocation mechanism, not real-time policy lookup. A disconnected OpenBao doesn't brick the fleet.
6. **The `podman-api` crate is the happy path.** Shell-out to `podman` is fallback-only. Strong typing wins when available.
7. **Sylvain owns the critical code himself.** Agent A1 (operator), A2 (Pi provisioning), A3 (installer), A4 (demo script) are agent-dispatched. The agent binary itself and the `PodmanV0Score` implementation are Sylvain's work. The auth callout (v0.2) will also be human-written. Don't propose that agents take over these pieces.
8. **The partner relationship is strategic.** Tuesday demo conversation is half the Tuesday deliverable. Framing the v0.1/v0.2/v0.3 roadmap to them matters as much as the running code.
9. **End-customer debuggability is a UX constraint.** Mechanical/electrical/chemical engineers will touch these devices. `systemctl status iot-agent` must tell them what's happening. `journalctl -u iot-agent` must be parseable by humans. Error messages must be understandable without Kubernetes knowledge.
10. **NATS is the long-term architectural commitment.** Everything on NATS — not as a queue, as a coordination fabric. The "decentralized cluster management" future depends on this choice. Implementation decisions that weaken this (e.g., "let's just put a database in the middle") should be pushed back on.
---
## What failed or went wrong in the planning process
Noted for meta-awareness — avoid repeating:
- **Claude overproduced.** Three planning documents when two would do. Under deadline pressure, planning documents are distractions from execution. Sylvain eventually said this directly.
- **Claude under-weighted end-customer UX.** Initial k3s recommendation treated "Kubernetes is easy" as universal when it's only easy for people who already know Kubernetes.
- **Claude under-weighted strategic positioning.** Platform-building vs. integration-consulting is a business choice; Claude treated it as purely technical.
- **Claude repeatedly flagged the Harmony daemon-mode concern** despite it being already solved. A better first question would have been "does this work today?" rather than "what if this doesn't work?"
- **Claude's initial Zitadel/OpenBao integration estimate was too large** because Claude didn't fully internalize "integration is 99% done in Harmony." The remaining work is wiring, not implementing.
- **Claude started with the ScoreEnvelope pattern** before understanding Harmony's native serde patterns. The "SOAP" reaction was deserved.
---
## What's in the final plan
The final document `iot-platform-v0-walking-skeleton.md` (~700 lines) contains:
- Strategic framing (§1)
- Walking skeleton vs. alternatives comparison (§2)
- Tuesday demo definition (§3)
- Scope cuts with milestones (§4)
- End-to-end architecture (§5)
- Architecture boundaries to preserve (§6)
- Friday aarch64 investigation and fallbacks (§7)
- Hour-by-hour Friday-Tuesday plan (§8)
- Four agent task cards (§9)
- Anti-patterns prevented (§10)
- Failure-mode decision tree (§11)
- Post-Tuesday roadmap v0.1→v0.4+ (§12)
- Partner conversation structure for Tuesday (§13)
Companion documents for deep-dive reference:
- `iot-platform-architecture.md` — full architecture with primary-source citations, useful for v0.2+ when auth is implemented
---
## What agents should do when uncertain
The plan cannot anticipate everything. When an agent hits an ambiguity, the decision hierarchy is:
1. **Does this preserve the end-to-end thread for Tuesday?** If yes, proceed. If it breaks the thread, stop and escalate.
2. **Does this preserve architectural boundaries §6?** If unsure, favor the boundary.
3. **Does this add scope beyond §4's in-scope list?** If yes, don't do it, regardless of how easy it seems.
4. **Is this security-critical?** If yes, don't add new code — flag for human review. Especially relevant for v0.2 auth callout work.
5. **Would this be more elegant but take an extra hour?** Don't do it. Ship Tuesday.
6. **Is the end-customer engineer's debuggability harmed by this choice?** If yes, don't do it.
7. **Is this on the path to the OKD-cluster-as-device future?** Don't optimize for this in v0. The abstractions are correct; don't over-invest.
The walking skeleton's entire value is shipping Tuesday. Every decision that serves that goal is correct. Every decision that defers it (no matter how well-intentioned) is wrong.