IoT Platform v0 — Walking Skeleton
Approach: Walking skeleton (Cockburn). Thin end-to-end thread through every architectural component. Naive first, architecture emerges from running code, hardening follows real-world feedback.
1. Strategic framing
Near-term product: IoT platform for an internal partner (a custom software shop with strong engineering practices — tests, CI/CD, coaching). They are developing an application for their end-customer whose field devices are Raspberry Pi 5s with 8/16 GB RAM, ARM64. The end-customer's engineers are mechanical/electrical/chemical, not Kubernetes-literate; on-device debuggability using standard Linux tools is a genuine UX concern.
Long-term product: This is the foundation for NationTech's decentralized enterprise cloud orchestration. NationTech itself is effectively our largest customer for this platform — we already run multiple OKD clusters in different locations and need to coordinate deployments, updates, and observability across them without connecting into each one manually. An "agent" reconciling against NATS KV looks the same whether it runs podman on a Pi, kubectl apply on an OKD cluster, or a VM-level operation. The abstraction has been chosen to support all three eventually; v0 demonstrates it on the simplest target (podman on Pi).
Why this matters for collaborators reading this plan: this is not a side project or a one-off customer integration. NationTech's positioning as a no-vendor-lock-in, decentralized, open-source cloud solution is gaining traction specifically because we have a product (Harmony) and not just bespoke integration work. This IoT platform extends that thesis. Resource investment is long-term.
Deadlines:
- Tuesday (day 4): internal partner sees `git push` → container running on Pi. Confidence-building, low-stakes.
- Day 14 (~2 weeks): solid product foundation, before other NationTech projects claim attention.
- 2 months (partner's deadline): hardened production delivery for the partner's end-customer.
Hour budget:
- Friday evening (now): 3-4 hours focused
- Saturday: light supervision of agents, 2-3 hours
- Sunday: light supervision of agents, 2-3 hours
- Monday: 8 focused hours
- Tuesday morning: ship + polish, 4 hours
- Week 2 (Wed-Fri): v0.1 hardening, ~4 hours/day
- Week 3: v0.2 auth layer, ~4 hours/day
- Remaining weeks: partner-driven hardening as their application development reveals needs
Sustainable hours non-negotiable.
Terminology used consistently below: NationTech = us. Partner = the software shop we're directly working with. End-customer = the partner's customer whose field devices we're managing.
2. Walking skeleton vs. parallel-tracks: the honest choice
I considered both. For this context, walking skeleton wins on every axis:
| Axis | Walking skeleton | Parallel-tracks autonomous |
|---|---|---|
| Partner sees progress | Tuesday (day 4) | Day 11+ |
| Integration risk | Discovered day 3 | Discovered day 11 |
| Weekend pressure | Natural stopping points | Merge-gate pressure |
| Adapts to "OKD cluster as device" future | Trivial — new Score variant later | Expensive mid-architecture pivot |
| Risk of day-14 slip | Low (partner has seen it work) | High (integration bugs in final days) |
| Hours sustainability | Good | Poor |
3. The demo (= the product for Tuesday)
Partner edits YAML in git ──push→ ArgoCD syncs to k8s ──apply→ Operator writes to NATS ──store→ NATS KV watch ──pull→ Agent on Pi reads Score ──run→ podman container running
Success criterion for Tuesday:
- `git push` on a workload repo.
- Within 2 minutes, a container is running on a Raspberry Pi 5 in our lab.
- Partner can `curl` the container (on the Pi's IP) and get hello-world.
- Partner can edit the YAML (change image or port), push, watch the container transition within 2 minutes.
Invisible to partner but critical:
- Pi is pre-provisioned with agent installed.
- ArgoCD is pre-configured with the partner's workload repo.
- Agent uses a shared NATS credential from a TOML file.
Partner-explicit framing for Tuesday conversation:
- "This proves the mechanism end-to-end."
- "v0.1 next week: Harmony Score polished, second Pi added, status aggregation."
- "v0.2 week 2: real authentication via Zitadel + OpenBao, no more shared creds."
- "Here's what you can start building against today."
4. Scope cuts — explicit deferrals
Each cut has a target milestone. This is the foundation for the "here's what's coming" partner conversation.
| Deferred | v0 replacement | Milestone |
|---|---|---|
| Zitadel device auth | Shared NATS credential in agent TOML | v0.2 |
| OpenBao | Shared credentials in agent TOML | v0.2 |
| Auth callout service | Direct NATS user/pass | v0.2 |
| Scoping tests | None (single-tenant demo) | v0.2 |
| Multiple Pi devices | One Pi for Tuesday; second added v0.1 | v0.1 |
| Quadlet interpretation | `podman-api` crate direct control | v0.1 considers Quadlet |
| Status aggregation in CRD | Agent writes status, operator doesn't aggregate | v0.1 |
| Inventory reporting | Not in v0 | v0.1 |
| Log streaming via NATS | `journalctl` over SSH | v0.1 |
| API service | None | v0.2+ |
| TUI for IoT | `kubectl` + `nats` CLI | v0.2+ |
| Rollout state machine | All-at-once (one Pi for Tuesday, moot) | v0.1+ |
| Failure injection harness | None formal | v0.1 |
| Observability (Prom+Grafana) | `journalctl` + `kubectl logs` | v0.1+ |
| OKD-cluster-as-device | Not in v0; not in v0.x at all | Strategic roadmap, separate |
What's kept in v0 despite cost:
- Harmony Score on device. Friday builds a minimal podman Score as a module in `harmony/src/modules/podman/`. Adds 1-2 hours Friday but proves the abstraction works in daemon mode.
- Real `kube-rs` operator (not a cron script). The operator's shape matters for long-term stability.
- NATS KV transport. Proven now so we don't switch later.
- CRD-based partner API. `kubectl apply -f deployment.yaml` is the partner's long-term interface.
- Pi provisioning via Harmony Score when achievable in <1hr (§7 Hour 1); manual runbook as fallback.
5. The thread end-to-end
5.1 Partner's git repo
iot-workload-hello/
├── deployment.yaml # Deployment CR
├── README.md # "Edit, git push, done."
deployment.yaml:
apiVersion: iot.nationtech.io/v1alpha1
kind: Deployment
metadata:
name: hello-world
namespace: iot-demo
spec:
targetDevices:
- pi-demo-01
score:
type: PodmanV0 # Rust enum discriminator (serde adjacently-tagged)
data:
services:
- name: hello
image: docker.io/library/nginx:alpine
ports: ["8080:80"]
rollout:
strategy: Immediate
5.2 Central cluster setup
Existing k8s cluster. Namespaces:
- `iot-system` — operator, NATS (single-node for v0)
- `iot-demo` — `Deployment` CRs
ArgoCD application pre-configured to sync iot-workload-hello repo into iot-demo namespace.
5.3 Raspberry Pi 5 setup
One Pi 5 in the lab, provisioned via Harmony Pi-provisioning Score (if achievable in <1hr Friday) or manually via SD card flash (fallback).
Base OS: Ubuntu Server 24.04 LTS ARM64 (ships Podman 4.9 in repos). Raspberry Pi OS 64-bit bookworm acceptable fallback.
Installed:
- `podman` (4.4+, ARM64) with `systemctl --user enable --now podman.socket` (required for the `podman-api` crate)
- `iot-agent` binary (cross-compiled to aarch64 via the existing Harmony aarch64 toolchain)
- `/etc/iot-agent/config.toml` with NATS URL + shared credential
- systemd unit `iot-agent.service`
5.4 What the code does
Operator:
- Watches `Deployment` CRs cluster-wide.
- For each CR, for each `device_id` in `spec.targetDevices`, writes `desired-state.<device_id>.<deployment-name>` in the `desired-state` JetStream KV bucket with the Score message (see §5.5).
- Updates `.status.observedScoreString` (the last-written Score as a stored string, used for change detection via string comparison).
- On deletion, removes corresponding KV entries.
Agent on Pi:
- Connect to NATS (TOML-configured user/pass).
- Watch `desired-state.<my-device-id>.>` KV keys.
- For each entry: deserialize the Score message, dispatch to Harmony `Score::interpret(&topology)` via the `s.clone().interpret().await` pattern (already the TUI's daemon-mode pattern, battle-tested in `harmony_agent` for CNPG management).
- For v0, only the `PodmanV0` Score variant exists. It interprets against a `PiDeviceTopology` (arch=aarch64, runtime=podman) and uses the `podman-api` crate to manage containers via the Podman REST API (over the user socket activated in the §5.3 setup).
- Change detection via serialized string comparison (not content hash). Cheap at this scale (a couple of writes per minute expected), removes hashing-algorithm risk, deterministic.
- Status writer: every 30s, write current state to `status.<my-device-id>`.
Kubelet compatibility is explicitly NOT a goal. Kubelet architecture serves as a north star for proven reconcile-loop patterns; the v0 implementation stays absolutely minimal. No PLEG event stream in v0, no per-workload worker pool, no housekeeping sweep — just a single reconcile loop with periodic relist. Scope discipline through inherent minimalism, not enforced limits.
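The "single reconcile loop with periodic relist" can be sketched with the I/O stubbed out. All names are illustrative, and the `ticks` bound exists only so the sketch terminates — the real agent loops forever and blocks on KV watch events:

```rust
use std::time::{Duration, Instant};

// Structure-only sketch of the agent loop (§5.4): the periodic relist is
// ground truth; watch events would only accelerate the next pass.
// `ticks` bounds the loop for demonstration — the real agent never exits.
fn run_loop(mut relist: impl FnMut(), relist_every: Duration, mut ticks: u32) {
    let mut last_relist: Option<Instant> = None;
    while ticks > 0 {
        let due = match last_relist {
            None => true, // first pass: always reconcile
            Some(t) => t.elapsed() >= relist_every,
        };
        if due {
            relist(); // query podman + NATS KV, reconcile differences
            last_relist = Some(Instant::now());
        }
        // real agent: block here on a KV watch event or the periodic tick
        ticks -= 1;
    }
}
```

No PLEG, no worker pool — the entire control flow fits in one function, which is the point of the v0 minimalism.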
5.5 Score message on NATS
Adjacently tagged serde enum. One Rust type per Score variant, #[serde(tag = "type", content = "data")] for clean discriminator/payload separation:
#[derive(Serialize, Deserialize, Clone)]
#[serde(tag = "type", content = "data")]
pub enum Score {
PodmanV0(PodmanV0Score),
// Future: OkdApplyV0(OkdApplyScore), KubectlApplyV0(...), etc.
}
JSON wire format:
{
"type": "PodmanV0",
"data": { /* PodmanV0Score fields */ }
}
No envelope. No encoding field. No format version string. The Rust type name is the discriminator; serde handles polymorphism cleanly. Adding a new Score variant (for OKD management later) is enum Score { ..., OkdApplyV0(OkdApplyScore) } — additive, not breaking.
5.6 What's deliberately dumb in v0
- Polling instead of event-driven PLEG. Agent polls podman-api every 30s as ground truth; KV watch events are accelerators.
- No idempotency beyond string-equality. Current score matches stored → no-op, mismatch → stop old container, run new. Brief downtime on updates. Fine for v0.
- Graceful shutdown = `podman stop` with 5 min timeout, then SIGKILL. Sufficient.
- No auth between operator and NATS. Same k8s cluster, same namespace. Network trust.
- No state persistence beyond podman itself. Agent restart = re-read NATS, re-query podman, reconcile differences.
- No multi-service coordination. A Score with three services starts them all immediately, no dependency ordering.
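The string-equality idempotency rule reduces to a small decision function. A sketch under assumed names (none of these are Harmony types):

```rust
/// What the agent does for one deployment key. Illustrative sketch of the
/// v0 logic — not Harmony's actual API.
#[derive(Debug, PartialEq)]
enum ReconcileAction {
    /// Stored score string matches desired — no-op.
    NoOp,
    /// Nothing applied yet — start fresh.
    Start,
    /// Score string changed — stop old container, run new (brief downtime).
    Replace,
}

fn decide(desired_score: &str, applied_score: Option<&str>) -> ReconcileAction {
    match applied_score {
        Some(applied) if applied == desired_score => ReconcileAction::NoOp,
        Some(_) => ReconcileAction::Replace,
        None => ReconcileAction::Start,
    }
}
```

Because the comparison is on the exact serialized string, any change — image, port, whitespace — triggers a replace; that over-triggering is accepted in v0.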
6. Architecture boundaries we keep even in v0
Decisions that cost little now and save real time later.
6.1 Score enum polymorphic from day 1
Even with one variant (PodmanV0), the enum shape is already polymorphic. Adding OkdApplyV0 later is trivial.
6.2 Score + Interpret traits used consistently
Use Harmony's existing traits. Cost: ~1 hour Friday. Benefit: agent is structurally ready for a second Score type in v0.3+.
6.3 Credentials behind a trait
trait CredentialSource: Send + Sync {
async fn nats_connect_options(&self) -> Result<ConnectOptions>;
}
v0: TomlFileCredentialSource reading /etc/iot-agent/config.toml.
v0.2: ZitadelBootstrappedCredentialSource — same trait, swapped via config.
30 minutes Friday. Saves 3 hours of refactor in v0.2.
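A simplified, synchronous sketch of the swap pattern — the real trait is async and returns `async-nats` `ConnectOptions`; the types below are illustrative stand-ins:

```rust
// Stand-in for the credentials the real trait would bake into ConnectOptions.
struct NatsCredentials {
    user: String,
    pass: String,
}

// Simplified (sync) version of the §6.3 trait. The agent's main loop only
// ever sees this trait, so v0.2 swaps the implementation, not the loop.
trait CredentialSource: Send + Sync {
    fn credentials(&self) -> Result<NatsCredentials, String>;
}

/// v0-style source with values held directly (the real v0 impl reads them
/// from /etc/iot-agent/config.toml; v0.2 swaps in a Zitadel-backed impl).
struct StaticCredentialSource {
    user: String,
    pass: String,
}

impl CredentialSource for StaticCredentialSource {
    fn credentials(&self) -> Result<NatsCredentials, String> {
        Ok(NatsCredentials {
            user: self.user.clone(),
            pass: self.pass.clone(),
        })
    }
}
```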
6.4 Device topology generalizes
PiDeviceTopology for v0. Trait interface supports other topologies — OKD cluster as OkdClusterTopology later. The v0 Score validates at compile time that its topology requirements match (arch=aarch64, runtime=podman). The OKD Score will validate different requirements (has_kube_api, has_argo). Same pattern.
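A runtime-check sketch of the same idea (names illustrative; Harmony's actual trait surface may differ):

```rust
// Minimal device topology description — just the two facts the v0 Score cares about.
struct PiDeviceTopology {
    arch: &'static str,    // e.g. "aarch64"
    runtime: &'static str, // e.g. "podman"
}

// Each Score variant declares its requirements; the agent checks them
// against the topology before interpreting. An OKD Score would check
// different facts (has_kube_api, has_argo) through the same shape.
trait TopologyRequirements {
    fn satisfied_by(&self, topo: &PiDeviceTopology) -> bool;
}

struct PodmanV0Requirements;

impl TopologyRequirements for PodmanV0Requirements {
    fn satisfied_by(&self, topo: &PiDeviceTopology) -> bool {
        topo.arch == "aarch64" && topo.runtime == "podman"
    }
}
```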
6.5 CRD spec forward-compatible
spec:
targetDevices: [id1, id2] # v0. v1 adds targetGroups.
score: {type: ..., data: ...} # polymorphic enum
rollout:
strategy: Immediate # v0. v1 adds Progressive.
6.6 NATS subject grammar matches long-term
Even with one Pi, use desired-state.<device_id>.<n> and status.<device_id>. Don't take shortcuts.
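Hypothetical helpers make the grammar concrete (the deployment-name segment is an explicit parameter here; per §9.A1, in the v0 KV layout the `desired-state` prefix is the bucket name and `<device_id>.<name>` is the key within it):

```rust
// Subject-grammar helpers — illustrative names, not existing code.
fn desired_state_key(device_id: &str, deployment_name: &str) -> String {
    format!("desired-state.{device_id}.{deployment_name}")
}

fn status_key(device_id: &str) -> String {
    format!("status.{device_id}")
}

/// Per-device watch pattern: everything under this device's prefix.
fn device_watch_pattern(device_id: &str) -> String {
    format!("desired-state.{device_id}.>")
}
```

Centralizing the grammar in one place means adding a second Pi (or a hundred) changes no subjects, only `device_id` values.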
6.7 Agent config is TOML, flat for v0
[agent]
device_id = "pi-demo-01"
[credentials]
type = "toml-shared"
nats_user = "iot-agent"
nats_pass = "dev-shared-password"
[nats]
urls = ["nats://central:4222"]
v0.2 adds a [zitadel] section enabling the bootstrap-via-token flow (see §11 roadmap). Additive, not breaking. Target long-term state: device boots → PXE or minimal TOML → Zitadel URL + token → fetches real config from OpenBao → connects to NATS. OpenBao outage doesn't break reconnect because the NATS auth callout validates tokens against Zitadel JWKS directly (with cached group permissions); NATS rejects only when the token actually expires.
7. Friday evening critical path — the aarch64 investigation
The previous walking skeleton draft had a "§6 decision point" on whether Harmony's Interpret works in daemon mode. That's resolved — the TUI does s.clone().interpret().await as a daemon pattern, and harmony_agent manages distributed CNPG in production using exactly this. Not a concern.
The real concern that replaces it: Harmony does not currently compile on aarch64. When harmony_agent was cross-compiled for ARM64, an upstream dependency had to be pulled out. This was likely a single sub-dependency used by only a few modules, feature-gatable so those modules become unavailable on ARM (acceptable — the device doesn't need every Harmony feature). Estimated as a quick fix (~80% confidence, per Sylvain's recollection).
Friday evening investigation (30-60 min, first):
- `cargo build --target aarch64-unknown-linux-musl -p harmony` on the workspace. Capture the error.
- Identify the offending crate and the module(s) in Harmony that depend on it.
- Apply the feature gate: add a `cfg(not(target_arch = "aarch64"))` attribute to the offending module, or introduce a Cargo feature flag (`--features x86-only`) that the ARM build skips.
- Verify: `cargo build --target aarch64-unknown-linux-musl -p harmony --features <minimal>` succeeds.
- Run the unit tests that exist for the feature-gated modules on x86_64 to confirm nothing broke on the primary platform.
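If the fix goes the Cargo-feature route, the shape might look like this — the feature, crate, and version below are illustrative placeholders, not the actual offending dependency:

```toml
# Cargo.toml sketch — feature-gating a native dependency that fails to
# cross-compile. "native-virt" and "libvirt-sys" are hypothetical names.
[features]
default = ["native-virt"]          # x86_64 builds keep full functionality
native-virt = ["dep:libvirt-sys"]  # aarch64 builds pass --no-default-features

[dependencies]
libvirt-sys = { version = "0.1", optional = true }  # hypothetical version
```

The gated modules then carry `#[cfg(feature = "native-virt")]`, so the ARM build simply omits them instead of failing to link.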
Budget: 2 hours max Friday night. If not resolved in 2 hours:
- Fallback A: Build the agent against only the crates that do compile on aarch64 (`harmony_agent`, `harmony_types`, whatever subset). Implement the `PodmanV0` Score directly in the agent crate using its own trait impls for now. Reunify with the main Harmony codebase in v0.1 after the compile fix is properly done.
- Fallback B (only if Fallback A also blocks): Write the v0 agent as pure Rust without Harmony Score traits. Adopt them in v0.1 after the aarch64 fix lands. This is the walking-skeleton-surfaces-real-issue scenario from §11.
Document findings in the Friday night log regardless of outcome. v0.1 work includes proper fix if we took a shortcut.
This is the single most important investigation of the weekend. Do it before anything else Friday. Every downstream decision (can the agent use Harmony Score traits? what's agent A3 cross-compiling?) depends on it.
8. Hour-by-hour plan
Friday evening (3-4 hours)
Goal by end of Friday night: aarch64 path clear; operator running in central cluster writes to NATS on CR apply; agent crate compiling on laptop, talking to NATS; Pi provisioning plan chosen.
Hour 1 — aarch64 investigation + decisions + dispatches
Your work:
- aarch64 investigation per §7 (30-60 min, first thing).
- Write a 1-page `v0-demo.md`: demo script, success criteria, fallback plan.
- Decide Pi OS: Ubuntu 24.04 ARM64 (default) vs Raspberry Pi OS 64-bit. Don't agonize beyond 10 min.
Dispatch agent A1 (operator): "Create Rust crate iot/iot-operator-v0/ using kube-rs implementing a Deployment CRD controller that writes to NATS KV. Exact spec in task card §9.A1. Self-verify: kubectl apply → nats kv get shows entry. Under 300 lines main.rs. No auth."
Dispatch agent A2 (Pi provisioning, fallback-aware): "Attempt Harmony-based Raspberry Pi 5 provisioning Score. Target: fresh Pi flashed via SD card, boots, static IP, Ubuntu 24.04 ARM64 with Podman 4.9, podman user socket enabled, user iot-agent with linger enabled, /etc/iot-agent/ ready. If Harmony doesn't have Pi primitives, document the gap and produce a manual provisioning runbook instead (rpi-imager + cloud-init). Hard time limit: 90 min. Self-verify: ssh iot-agent@<pi-ip> 'podman --version' returns 4.4+."
Hour 2 — your work: agent crate
Start writing the agent yourself. Core customer-experience code; you own its shape.
Crate in harmony/src/modules/iot_agent/ or a new binary in the Harmony workspace (follow existing conventions — Harmony modules live in harmony/src/modules/):
- Under 500 lines for v0.
- Dependencies: `async-nats`, `serde`, `serde_yaml`, `tokio`, `tracing`, `anyhow`, `podman-api`, plus Harmony workspace deps.
- Main loop per §5.4.
- `CredentialSource` trait (§6.3) with `TomlFileCredentialSource` impl.
- Score enum (§5.5) with `PodmanV0` variant.
- `PodmanV0Score` implements Harmony's `Score` + `Interpret` traits. The Score lives in `harmony/src/modules/podman/` (new module) following existing Harmony module conventions.
- `podman-api` crate for container operations — no shell-out.
Hour 3 — local integration
- Review agent A1's operator. Deploy to the central cluster `iot-system` namespace.
- Deploy NATS to `iot-system` if not already there (single-node JetStream).
- Review agent A2's Pi provisioning. If the Harmony Score succeeded, note it for the demo; if manual runbook, accept and move on.
- Agent compiles on laptop. Connects to central NATS.
Hour 4 — first partial handshake
- `kubectl apply` a `Deployment` CR targeting `pi-demo-01`.
- Verify: `nats kv get desired-state pi-demo-01.test-deploy` shows the entry.
- Run the agent locally on the laptop with `DEVICE_ID=pi-demo-01`; confirm it reads the KV entry and prints what it would do.
- First success: local end-to-end without actual podman execution. Good for tonight.
Stop by 10 PM.
Saturday (2-3 hours, light supervision)
Goal: local end-to-end working — laptop agent starts a podman container when CR is applied. Pi provisioning rehearsed if Harmony path succeeded.
Morning check-in (30 min).
Dispatch agent A3 (installer) Saturday morning. Task card §9.A3.
Your work (2 hours):
- Finish agent's happy path: start container via podman-api, remove on CR deletion, transition on Score string change.
- End-to-end test on laptop: agent + central NATS + central operator. Expect a design bug. Budget an extra hour.
Dispatch agent A4 (demo script) Saturday afternoon. Task card §9.A4.
Sunday (2-3 hours, light supervision)
Goal: Demo script works against the real Pi in the lab. Not polished; works.
Your work:
- Run agent A4's demo against real Pi. Fix what breaks.
- First clean Pi success = shipping-confidence milestone.
- Run 3 more clean-room. Document failure modes.
No A5 agent task this time — thesis doc deferred to Week 2 per §2 decision.
Monday (8 focused hours)
Goal: Demo runs reliably. Tuesday ship-ready.
Hour 1-2 — field deployment readiness:
This is the most important class of failures for Pi-in-field deployment. The partner's devices will be power-cycled and their networks will flap. These matter more than polish.
- Power cycle test: unplug Pi, wait 30s, plug back in. Target: boot-to-reconciled within 90s. Run this 5+ times. This is the most important test of the weekend.
- Network-out-during-boot: boot Pi with NATS blocked (iptables or network shutdown). Agent starts, waits, reconciles when NATS is reachable. No crash loop.
- Agent crash loop: corrupt the config, let the systemd restart loop kick in. Back-off works; device not bricked; `systemctl status` shows the failure clearly.
Not tested (ruled out per explicit partner conversation): SD card wear (partner's app has few IOPS), thermal throttling (environment doesn't cause sustained high temps), PoE-specific failure modes (not relevant here).
Hour 3-4 — demo polish:
- `./demo.sh` is one command, no manual steps.
- Output is clean: clear PASS/FAIL with per-phase timings.
- `kubectl get deployments.iot.nationtech.io` output is readable.
Hour 5-6 — partner-facing polish:
- README in workload repo: 4 lines. "Edit this, git push, done."
- ArgoCD auto-sync enabled on partner's repo.
- CR `.status` updates within ~10s of agent report.
Hour 7 — failure demo prep:
Partner will ask "what if X fails." Prep answers, demonstrate 1-2 live:
- "Pi loses network" → container keeps running, reconnects when network returns.
- "Central cluster down" → already-running containers keep running.
- "Agent crashes" → systemd restarts, re-reads NATS, no data loss.
- "NATS down" → agent shows clear "not connected" status; recovers on NATS return.
Hour 8 — rest case: One full clean-room demo run. Time it. Write Tuesday runbook. Stop by 9 PM.
Tuesday morning (4 hours) — ship
- Hour 1: Final clean-room run on our infra. Passes → proceed.
- Hour 2: Setup for partner demo (on-site or screenshare).
- Hour 3: Demo walkthrough. Git push → container. Edit → transition. One failure demo. Q&A.
- Hour 4: Handoff + v0.1/v0.2 plan conversation. Align on next-week milestone.
9. Agent task cards
Each card is self-contained. Hand the entire card to an agent.
Mandatory verification for every agent task — must pass before completion:
# Native CI check (check + fmt + clippy + test)
./build/check.sh
# Cross-compilation — aarch64 builds must succeed for all IoT-critical crates.
# Note: harmony is built with --no-default-features to exclude KVM (libvirt cannot cross-compile to aarch64).
# The 5 KVM examples (kvm_vm_examples, kvm_okd_ha_cluster, opnsense_vm_integration,
# opnsense_pair_integration, example_linux_vm) are x86_64-only by design.
cargo build --target x86_64-unknown-linux-gnu -p harmony -p harmony_agent -p iot-agent-v0 -p iot-operator-v0
cargo build --target aarch64-unknown-linux-gnu -p harmony --no-default-features -p harmony_agent -p iot-agent-v0 -p iot-operator-v0
All three must exit 0. Note: cargo test --target aarch64-unknown-linux-gnu cannot run on x86_64 (exec format error) — that's expected. Test execution is only for the host architecture via ./build/check.sh. If any check fails, fix the issue before marking the task complete. Include the output in the PR description.
A1: Operator skeleton (Friday)
Goal: kube-rs operator that watches Deployment CRs and writes the Score to NATS KV.
Deliverable: Crate iot/iot-operator-v0/:
- `Cargo.toml`: `kube`, `k8s-openapi`, `async-nats`, `serde`, `serde_yaml`, `serde_json`, `tokio`, `tracing`, `tracing-subscriber`, `anyhow`.
- `src/main.rs` under 300 lines.
- `deploy/operator.yaml` — Deployment, ServiceAccount, ClusterRole, ClusterRoleBinding.
- `deploy/crd.yaml` — `Deployment` CRD for `iot.nationtech.io/v1alpha1`.
Behavior:
- Connect to NATS on startup (`NATS_URL` env, no auth).
- Ensure JetStream KV bucket `desired-state` exists (create if not).
- Watch `Deployment` CRs cluster-wide.
- On reconcile: for each `device_id` in `spec.targetDevices`, write key `<device_id>.<n>` in the `desired-state` bucket with the serialized Score (adjacently-tagged JSON per §5.5).
- On CR delete: remove corresponding KV keys.
- Update `.status.observedScoreString` with the string that was written (for human inspection and change detection).
- Log every reconcile.
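The reconcile fan-out is a pure function over the CR: one KV entry per target device, all sharing the same serialized Score string. A sketch (helper name hypothetical):

```rust
// Compute the KV writes for one reconcile pass: key `<device_id>.<name>`
// in the desired-state bucket, value = the serialized Score, identical
// for every target device. Illustrative helper, not existing code.
fn kv_writes(
    deployment_name: &str,
    target_devices: &[&str],
    score_json: &str,
) -> Vec<(String, String)> {
    target_devices
        .iter()
        .map(|device_id| {
            (
                format!("{device_id}.{deployment_name}"),
                score_json.to_string(),
            )
        })
        .collect()
}
```

Keeping this pure makes the operator's only side effects the KV puts themselves, which is what the self-verification below observes.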
CRD schema:
spec:
targetDevices: [string] # required, at least 1
score:
type: string # required, e.g. "PodmanV0"
data: object # required, Score-type-specific
rollout:
strategy: string # v0: only "Immediate"
status:
observedScoreString: string
conditions: [stdCondition]
Self-verification:
cd iot/iot-operator-v0
cargo build && cargo clippy -- -D warnings
# Test against k3d:
k3d cluster create iot-test --wait
kubectl apply -f deploy/crd.yaml
docker run -d --rm -p 4222:4222 -p 8222:8222 --name nats nats:latest -js
NATS_URL=nats://localhost:4222 RUST_LOG=info cargo run &
OP_PID=$!
sleep 3
kubectl apply -f - <<EOF
apiVersion: iot.nationtech.io/v1alpha1
kind: Deployment
metadata:
name: test-deploy
namespace: default
spec:
targetDevices: [test-device-01]
score:
type: PodmanV0
data:
services:
- name: hello
image: nginx:alpine
ports: ["8080:80"]
rollout:
strategy: Immediate
EOF
sleep 5
nats --server nats://localhost:4222 kv get desired-state test-device-01.test-deploy
# Must print the Score JSON with type="PodmanV0"
kubectl get deployment.iot.nationtech.io test-deploy -o jsonpath='{.status.observedScoreString}'
# Must print the stored string
kill $OP_PID
k3d cluster delete iot-test
docker stop nats
Forbidden:
- Code outside `iot/iot-operator-v0/`.
- Zitadel, OpenBao, auth callout dependencies.
- Parsing `score.data`.
- Rollout logic beyond KV writes.
Completion: Push branch, open PR with verification output in description.
A2: Pi provisioning (Friday)
Goal: Harmony Score to provision a Raspberry Pi 5 ready for IoT agent. Fallback to documented manual procedure if Harmony doesn't have Pi primitives.
Deliverable (primary path): New module harmony/src/modules/rpi/ implementing Score + Interpret for Pi provisioning. Follow existing Harmony KVM Score patterns.
Deliverable (fallback path): docs/iot/pi-provisioning-manual.md with: rpi-imager steps, cloud-init YAML, package install list, user setup commands, verification steps.
Target state regardless of path:
- Ubuntu Server 24.04 LTS ARM64 (or Raspberry Pi OS 64-bit if Ubuntu fails).
- Static IP on lab network.
- Packages: `podman`, `systemd-container`, `openssh-server`, `curl`, `jq`.
- `systemctl --user enable --now podman.socket` for user `iot-agent`.
- User `iot-agent` with linger enabled (`loginctl enable-linger iot-agent`).
- `/etc/iot-agent/` (owned by iot-agent, 0750).
- `/var/lib/iot-agent/`.
Self-verification:
ssh iot-agent@<pi-ip> 'podman --version'
# Must be 4.4+ (target 4.9+)
ssh iot-agent@<pi-ip> 'systemctl --user is-active podman.socket'
# Must print "active"
ssh iot-agent@<pi-ip> 'loginctl show-user iot-agent | grep Linger=yes'
ssh iot-agent@<pi-ip> 'uname -m'
# Must print aarch64
Time limit: 90 min agent time.
Forbidden: Docker. x86_64 base images. Hard-coded credentials.
A3: Agent installer (Saturday)
Goal: Deploy agent to Pi via SSH using aarch64-cross-compiled binary.
Prerequisites: Agent binary exists (Sylvain writes Friday).
Deliverable: iot/iot-agent-v0/scripts/install.sh:
- Args: `--host <ip>`, `--device-id <id>`, `--nats-url <url>`, `--nats-user <u>`, `--nats-pass <p>`.
- Cross-builds for aarch64 using the existing Harmony aarch64 toolchain.
- `scp` binary to Pi, `sudo mv` to `/usr/local/bin/iot-agent`.
- Templates `/etc/iot-agent/config.toml` from args.
- Installs `/etc/systemd/system/iot-agent.service`.
- `systemctl daemon-reload && systemctl enable --now iot-agent`.
- Waits up to 15s for "connected to NATS" in the journal.
systemd unit:
[Unit]
Description=IoT Agent
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=iot-agent
ExecStart=/usr/local/bin/iot-agent
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
Environment=RUST_LOG=info
[Install]
WantedBy=multi-user.target
Self-verification:
./install.sh --host <pi-ip> --device-id pi-demo-01 \
--nats-url nats://central:4222 \
--nats-user iot-agent --nats-pass dev-shared-password
ssh iot-agent@<pi-ip> 'sudo systemctl status iot-agent' # active (running)
ssh iot-agent@<pi-ip> 'sudo journalctl -u iot-agent --since "2 minutes ago"' | grep "connected to NATS"
Time limit: 2 hours agent time.
A4: End-to-end demo script (Saturday)
Goal: One command runs full demo flow.
Deliverable: iot/scripts/demo.sh:
- Verifies Pi reachable + agent running.
- Applies `scripts/demo-deployment.yaml`.
- Waits up to 120s for the container on the Pi (ssh + `podman ps`).
- `curl http://<pi-ip>:8080` — expects the nginx page.
- Deletes the CR, waits up to 60s for removal.
- Prints PASS or FAIL with per-phase timings.
- Cleans up on failure.
Self-verification:
./iot/scripts/demo.sh
# Ends with "PASS", total < 5 min
Time limit: 2 hours agent time.
10. Anti-patterns the plan prevents
- Premature contract extraction. No `iot-contracts` crate in v0. Inline types. Extract in v0.1 when they've proven their shape through use.
- Quadlet under deadline. Direct `podman-api` for v0. Quadlet evaluation in v0.1+ (possibly via the `podlet` crate for code generation). User systemd quirks are a real cost under deadline pressure.
- Agent-driven refactors. If an agent suggests "I could clean this up," say no. v0 ships first.
- Harmony rewrite. Use what fits. If something doesn't fit cleanly, document and work around.
- Second device in v0. One Pi Tuesday. Second in v0.1.
- Dashboards/TUI/API for v0. `kubectl` and `nats` CLI are v0 operator UX. Partner UX is `git push`.
- OKD-cluster-as-device in v0 or v0.x. Strategic roadmap, not execution plan. Keep focus.
- Weekend overwork. 2-3 hrs/day max Sat/Sun. Monday is where the hours are.
11. If the demo doesn't work Monday night
Flaky polish (reconnect timing, status lag): Ship Tuesday with minor caveats. Partner tolerates.
End-to-end happy path unreliable: Push ship by half a day or full day. Broken demo hurts partner trust more than 1-day slip. Communicate early Tuesday morning.
Genuine architectural flaw (e.g., NATS KV watches lose events under load): The walking skeleton has done its job — problem discovered cheap, not in week 3. Regroup Tuesday morning, push by 2-3 days, present to partner as "we found a design issue, here's the fix." They respect honesty.
12. Post-Tuesday milestones
v0.1 (Wed-Fri week 2): Hardening informed by v0 deployment.
- Harmony aarch64 compile properly fixed if we took a Friday shortcut.
- `iot-contracts` crate extracted (consolidate inline types).
- Second Pi added, regression-tested.
- Status aggregation in operator (CRD `.status.aggregate`).
- Inventory reporting from agent.
- Basic journald log streaming prototype.
- Field-readiness test suite running automated against a VM (power cycle, network-out, agent crash loop).
- Thesis document: 2 pages covering 3-year platform vision, written after seeing v0 run.
v0.2 (Mon-Fri week 3): Auth layer.
- Zitadel service accounts per device.
- Device-side JWT Profile client.
- OpenBao JWT auth method configured.
- Auth callout service implementing the bearer-token NATS JWT minting pattern from the architecture doc.
- Availability-favoring design: the auth callout caches OpenBao policy lookups; on OpenBao failure, cached permissions are used; NATS rejects only on actual token expiry. A reboot doesn't force re-verification any more than a passing minute does.
- Scoping test suite.
- Shared credentials removed.
- Bootstrap flow: device has Zitadel URL + initial token on disk → fetches NATS config from OpenBao → connects to NATS. Device TOML narrows to minimal bootstrap-only config.
v0.3 (week 4+): Scale + partner-driven features.
- Multiple workloads per device.
- Progressive rollout.
- Real log streaming.
- API service.
- Observability (Prometheus + Grafana).
- Automated field-readiness tests running on real Pi in CI.
v0.4+ (weeks 5-8, partner's 2-month target): Production hardening.
- TPM-backed device keys.
- Scale testing with partner's real fleet size.
- Runbook maturation.
- First non-demo production deployment for end-customer.
13. Tuesday partner conversation
Don't just demo. Frame. Prepared talking points:
"Here's what we shipped today."
- Git push → container on Pi.
- CRD surface they can start building against.
- No auth yet — shared credentials for internal use.
"Here's what's coming next week (v0.1)."
- Harmony Score integration polished.
- Second Pi, multi-device demo.
- Status visibility.
"Here's what's coming week 2 (v0.2)."
- Real authentication: Zitadel + OpenBao.
- Per-device scoped credentials.
- Production-grade security.
"Here's how we're doing it."
- Walking skeleton: ship early, harden based on real use.
- We want them to start building against v0 today.
- Feedback from their real use shapes v0.1/v0.2 priorities.
"Here's what we need from them."
- Early feedback on the CRD surface — does it fit how they want to deploy?
- Access to a test Pi from their fleet (if available) for v0.1/v0.2 testing.
- Rough timeline for their application development so we can sequence hardening with them.
This conversation is as much a part of Tuesday's deliverable as the running demo. Don't skip it.