harmony/ROADMAP/iot_platform/v0_walking_skeleton.md
Jean-Gabriel Gill-Couture 65ef540b97
feat: scaffold IoT walking skeleton — podman module, operator, and agent
- Add PodmanV0Score/IotScore (adjacent-tagged serde) and PodmanV0Interpret stub
- Gate virt behind kvm feature and podman-api behind podman feature
- Scaffold iot-operator-v0 (kube-rs operator stub) and iot-agent-v0 (NATS KV watch)
- Add PodmanV0 to InterpretName enum
- Fix aarch64 cross-compilation by making kvm/podman optional features
- Align async-nats across workspace, add workspace deps for tracing/toml/tracing-subscriber
- Remove unused deps (serde_yaml from agent, schemars from operator)
- Add Send+Sync to CredentialSource, fix &PathBuf → &Path, remove dead_code allow
- Update 5 KVM example Cargo.tomls with explicit features = ["kvm"]
2026-04-17 20:15:10 -04:00


IoT Platform v0 — Walking Skeleton

Approach: Walking skeleton (Cockburn). Thin end-to-end thread through every architectural component. Naive first, architecture emerges from running code, hardening follows real-world feedback.

1. Strategic framing

Near-term product: IoT platform for an internal partner (a custom software shop with strong engineering practices — tests, CI/CD, coaching). They are developing an application for their end-customer whose field devices are Raspberry Pi 5s with 8/16 GB RAM, ARM64. The end-customer's engineers are mechanical/electrical/chemical, not Kubernetes-literate; on-device debuggability using standard Linux tools is a genuine UX concern.

Long-term product: This is the foundation for NationTech's decentralized enterprise cloud orchestration. NationTech itself is effectively our largest customer for this platform — we already run multiple OKD clusters in different locations and need to coordinate deployments, updates, and observability across them without connecting into each one manually. An "agent" reconciling against NATS KV looks the same whether it runs podman on a Pi, kubectl apply on an OKD cluster, or a VM-level operation. The abstraction has been chosen to support all three eventually; v0 demonstrates it on the simplest target (podman on Pi).

Why this matters for collaborators reading this plan: this is not a side project or a one-off customer integration. NationTech's positioning as a no-vendor-lock-in, decentralized, open-source cloud solution is gaining traction specifically because we have a product (Harmony) and not just bespoke integration work. This IoT platform extends that thesis. Resource investment is long-term.

Deadlines:

  • Tuesday (day 4): internal partner sees git push → container running on Pi. Confidence-building, low-stakes.
  • Day 14 (~2 weeks): solid product foundation, before other NationTech projects claim attention.
  • 2 months (partner's deadline): hardened production delivery for the partner's end-customer.

Hour budget:

  • Friday evening (now): 3-4 hours focused
  • Saturday: light supervision of agents, 2-3 hours
  • Sunday: light supervision of agents, 2-3 hours
  • Monday: 8 focused hours
  • Tuesday morning: ship + polish, 4 hours
  • Week 2 (Wed-Fri): v0.1 hardening, ~4 hours/day
  • Week 3: v0.2 auth layer, ~4 hours/day
  • Remaining weeks: partner-driven hardening as their application development reveals needs

Sustainable hours non-negotiable.

Terminology used consistently below: NationTech = us. Partner = the software shop we're directly working with. End-customer = the partner's customer whose field devices we're managing.


2. Walking skeleton vs. parallel-tracks: the honest choice

I considered both. For this context, walking skeleton wins on every axis:

| Axis | Walking skeleton | Parallel-tracks autonomous |
| --- | --- | --- |
| Partner sees progress | Tuesday (day 4) | Day 11+ |
| Integration risk | Discovered day 3 | Discovered day 11 |
| Weekend pressure | Natural stopping points | Merge-gate pressure |
| Adapts to "OKD cluster as device" future | Trivial — new Score variant later | Expensive mid-architecture pivot |
| Risk of day-14 slip | Low (partner has seen it work) | High (integration bugs in final days) |
| Hours sustainability | Good | Poor |

3. The demo (= the product for Tuesday)

Partner edits       ArgoCD         Operator        NATS KV      Agent on        podman
YAML in git  ──push→ syncs ──apply→ writes ──store→ watch ──pull→ Pi reads ──run→ container
                     to k8s          to NATS                       Score              running

Success criterion for Tuesday:

  1. git push on a workload repo.
  2. Within 2 minutes, a container is running on a Raspberry Pi 5 in our lab.
  3. Partner can curl the container (on the Pi's IP) and get hello-world.
  4. Partner can edit the YAML (change image or port), push, watch the container transition within 2 minutes.

Invisible to partner but critical:

  • Pi is pre-provisioned with agent installed.
  • ArgoCD is pre-configured with the partner's workload repo.
  • Agent uses a shared NATS credential from a TOML file.

Partner-explicit framing for Tuesday conversation:

  • "This proves the mechanism end-to-end."
  • "v0.1 next week: Harmony Score polished, second Pi added, status aggregation."
  • "v0.2 week 2: real authentication via Zitadel + OpenBao, no more shared creds."
  • "Here's what you can start building against today."

4. Scope cuts — explicit deferrals

Each cut has a target milestone. This is the foundation for the "here's what's coming" partner conversation.

| Deferred | v0 replacement | Milestone |
| --- | --- | --- |
| Zitadel device auth | Shared NATS credential in agent TOML | v0.2 |
| OpenBao | Shared credentials in agent TOML | v0.2 |
| Auth callout service | Direct NATS user/pass | v0.2 |
| Scoping tests | None (single-tenant demo) | v0.2 |
| Multiple Pi devices | One Pi for Tuesday; second added v0.1 | v0.1 |
| Quadlet interpretation | podman-api crate direct control | v0.1 considers Quadlet |
| Status aggregation in CRD | Agent writes status, operator doesn't aggregate | v0.1 |
| Inventory reporting | Not in v0 | v0.1 |
| Log streaming via NATS | journalctl over SSH | v0.1 |
| API service | None | v0.2+ |
| TUI for IoT | kubectl + nats CLI | v0.2+ |
| Rollout state machine | All-at-once (one Pi for Tuesday, moot) | v0.1+ |
| Failure injection harness | None formal | v0.1 |
| Observability (Prom+Grafana) | journalctl + kubectl logs | v0.1+ |
| OKD-cluster-as-device | Not in v0; not in v0.x at all | Strategic roadmap, separate |

What's kept in v0 despite cost:

  • Harmony Score on device. Friday builds a minimal podman Score as a module in harmony/src/modules/podman/. Adds 1-2 hours Friday but proves the abstraction works in daemon mode.
  • Real kube-rs operator (not a cron script). The operator's shape matters for long-term stability.
  • NATS KV transport. Proven now so we don't switch later.
  • CRD-based partner API. kubectl apply -f deployment.yaml is the partner's long-term interface.
  • Pi provisioning via Harmony Score when achievable in <1hr (§7 Hour 1); manual runbook as fallback.

5. The thread end-to-end

5.1 Partner's git repo

iot-workload-hello/
├── deployment.yaml      # Deployment CR
├── README.md            # "Edit, git push, done."

deployment.yaml:

apiVersion: iot.nationtech.io/v1alpha1
kind: Deployment
metadata:
  name: hello-world
  namespace: iot-demo
spec:
  targetDevices:
    - pi-demo-01
  score:
    type: PodmanV0                    # Rust enum discriminator (serde adjacently-tagged)
    data:
      services:
        - name: hello
          image: docker.io/library/nginx:alpine
          ports: ["8080:80"]
  rollout:
    strategy: Immediate

5.2 Central cluster setup

Existing k8s cluster. Namespaces:

  • iot-system — operator, NATS (single-node for v0)
  • iot-demo — Deployment CRs

ArgoCD application pre-configured to sync iot-workload-hello repo into iot-demo namespace.

5.3 Raspberry Pi 5 setup

One Pi 5 in the lab, provisioned via Harmony Pi-provisioning Score (if achievable in <1hr Friday) or manually via SD card flash (fallback).

Base OS: Ubuntu Server 24.04 LTS ARM64 (ships Podman 4.9 in repos). Raspberry Pi OS 64-bit bookworm acceptable fallback.

Installed:

  • podman (4.4+, ARM64) with systemctl --user enable --now podman.socket (required for podman-api crate)
  • iot-agent binary (cross-compiled to aarch64 via existing Harmony aarch64 toolchain)
  • /etc/iot-agent/config.toml with NATS URL + shared credential
  • systemd unit iot-agent.service

5.4 What the code does

Operator:

  1. Watches Deployment CRs cluster-wide.
  2. For each, for each device_id in spec.targetDevices, writes desired-state.<device_id>.<deployment-name> in desired-state JetStream KV bucket with the Score message (see §5.5).
  3. Updates .status.observedScoreString (the last-written Score as stored string, used for change detection via string comparison).
  4. On deletion, removes corresponding KV entries.

Agent on Pi:

  1. Connect to NATS (TOML-configured user/pass).
  2. Watch desired-state.<my-device-id>.> KV keys.
  3. For each entry: deserialize Score message, dispatch to Harmony Score::interpret(&topology) via s.clone().interpret().await pattern (already the TUI's daemon-mode pattern, battle-tested in harmony_agent for CNPG management).
  4. For v0, only PodmanV0 Score variant exists. Interprets against a PiDeviceTopology (arch=aarch64, runtime=podman) and uses the podman-api crate to manage containers via the Podman REST API (over the user socket activated at §5.3 setup).
  5. Change detection via serialized string comparison (not content hash). Cheap at this scale (a couple times per minute expected), removes hashing-algorithm risk, deterministic.
  6. Status writer: every 30s, write current state to status.<my-device-id>.

Kubelet compatibility is explicitly NOT a goal. Kubelet architecture serves as a north star for proven reconcile-loop patterns; the v0 implementation stays absolutely minimal. No PLEG event stream in v0, no per-workload worker pool, no housekeeping sweep — just a single reconcile loop with periodic relist. Scope discipline through inherent minimalism, not enforced limits.
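The step-5 change detection reduces to a pure decision over serialized strings. A minimal sketch, with illustrative names (the real agent wires this into the KV watch and the 30s relist; `ReconcileAction` is not the agent's actual API):

```rust
/// What the agent should do for one workload, decided purely by comparing
/// the serialized desired Score against the one currently applied.
/// Names are illustrative, not the real agent's API.
#[derive(Debug, PartialEq)]
pub enum ReconcileAction {
    /// Nothing applied yet: start the workload.
    Create,
    /// Applied score differs: stop old container, run new one.
    Replace,
    /// Serialized strings are equal: no-op.
    NoOp,
}

pub fn decide(applied: Option<&str>, desired: &str) -> ReconcileAction {
    match applied {
        None => ReconcileAction::Create,
        Some(current) if current == desired => ReconcileAction::NoOp,
        Some(_) => ReconcileAction::Replace,
    }
}
```

Because the comparison is on the exact stored string, any serialization change counts as a change, which is the deterministic behavior v0 wants.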

5.5 Score message on NATS

Adjacently tagged serde enum. One Rust type per Score variant, #[serde(tag = "type", content = "data")] for clean discriminator/payload separation:

#[derive(Serialize, Deserialize, Clone)]
#[serde(tag = "type", content = "data")]
pub enum Score {
    PodmanV0(PodmanV0Score),
    // Future: OkdApplyV0(OkdApplyScore), KubectlApplyV0(...), etc.
}

JSON wire format:

{
  "type": "PodmanV0",
  "data": { /* PodmanV0Score fields */ }
}

No envelope. No encoding field. No format version string. The Rust type name is the discriminator; serde handles polymorphism cleanly. Adding a new Score variant (for OKD management later) is enum Score { ..., OkdApplyV0(OkdApplyScore) } — additive, not breaking.
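A hand-rolled sketch to pin down the wire shape (the helper below is illustrative only; in the real code serde's derive with `tag = "type", content = "data"` produces exactly this layout):

```rust
/// Render the adjacently tagged layout by hand, purely to show the wire
/// shape. Real code derives Serialize/Deserialize and lets serde emit this.
pub fn wire_message(variant: &str, data_json: &str) -> String {
    format!("{{\"type\":\"{variant}\",\"data\":{data_json}}}")
}
```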

5.6 What's deliberately dumb in v0

  • Polling instead of event-driven PLEG. Agent polls podman-api every 30s as ground truth; KV watch events are accelerators.
  • No idempotency beyond string-equality. Current score matches stored → no-op, mismatch → stop old container, run new. Brief downtime on updates. Fine for v0.
  • Graceful shutdown = podman stop with 5min timeout, then SIGKILL. Sufficient.
  • No auth between operator and NATS. Same k8s cluster, same namespace. Network trust.
  • No state persistence beyond podman itself. Agent restart = re-read NATS, re-query podman, reconcile differences.
  • No multi-service coordination. A Score with three services starts them all immediately, no dependency ordering.

6. Architecture boundaries we keep even in v0

Decisions that cost little now and save real time later.

6.1 Score enum polymorphic from day 1

Even with one variant (PodmanV0), the enum shape is already polymorphic. Adding OkdApplyV0 later is trivial.

6.2 Score + Interpret traits used consistently

Use Harmony's existing traits. Cost: ~1 hour Friday. Benefit: agent is structurally ready for a second Score type in v0.3+.

6.3 Credentials behind a trait

trait CredentialSource: Send + Sync {
    async fn nats_connect_options(&self) -> Result<ConnectOptions>;
}

v0: TomlFileCredentialSource reading /etc/iot-agent/config.toml. v0.2: ZitadelBootstrappedCredentialSource — same trait, swapped via config.

30 minutes Friday. Saves 3 hours of refactor in v0.2.
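A minimal synchronous sketch of that boundary, with illustrative names (the real trait is async and returns async-nats `ConnectOptions`, and real code parses TOML with the toml crate, not this naive line splitter):

```rust
use std::collections::HashMap;

/// Simplified stand-in for the async trait: anything that can yield NATS
/// credentials. v0 reads a flat file; v0.2 swaps in a Zitadel-bootstrapped
/// implementation behind the same boundary. Names are illustrative.
pub trait CredentialSource {
    fn nats_user_pass(&self) -> Result<(String, String), String>;
}

/// Naive flat key=value reader, for illustration only.
pub struct FileCredentialSource {
    pub contents: String,
}

impl CredentialSource for FileCredentialSource {
    fn nats_user_pass(&self) -> Result<(String, String), String> {
        let kv: HashMap<&str, &str> = self
            .contents
            .lines()
            .filter_map(|l| l.split_once('='))
            .map(|(k, v)| (k.trim(), v.trim().trim_matches('"')))
            .collect();
        match (kv.get("nats_user"), kv.get("nats_pass")) {
            (Some(u), Some(p)) => Ok((u.to_string(), p.to_string())),
            _ => Err("missing nats_user/nats_pass".into()),
        }
    }
}
```

The agent only ever sees `dyn CredentialSource`, so the v0.2 swap is a config change, not a refactor.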

6.4 Device topology generalizes

PiDeviceTopology for v0. Trait interface supports other topologies — OKD cluster as OkdClusterTopology later. The v0 Score validates at compile time that its topology requirements match (arch=aarch64, runtime=podman). The OKD Score will validate different requirements (has_kube_api, has_argo). Same pattern.

6.5 CRD spec forward-compatible

spec:
  targetDevices: [id1, id2]     # v0. v1 adds targetGroups.
  score: {type: ..., data: ...}  # polymorphic enum
  rollout:
    strategy: Immediate          # v0. v1 adds Progressive.

6.6 NATS subject grammar matches long-term

Even with one Pi, use desired-state.<device_id>.<deployment-name> and status.<device_id>. Don't take shortcuts.
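Sketched as helpers (function names hypothetical, but the grammar is the one above):

```rust
/// Key for one device/deployment pair in the desired-state bucket.
pub fn desired_state_key(device_id: &str, deployment: &str) -> String {
    format!("desired-state.{device_id}.{deployment}")
}

/// Key a device writes its status under.
pub fn status_key(device_id: &str) -> String {
    format!("status.{device_id}")
}

/// Filter an agent watches: every deployment for its own device id.
pub fn device_watch_filter(device_id: &str) -> String {
    format!("desired-state.{device_id}.>")
}
```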

6.7 Agent config is TOML, flat for v0

[agent]
device_id = "pi-demo-01"

[credentials]
type = "toml-shared"
nats_user = "iot-agent"
nats_pass = "dev-shared-password"

[nats]
urls = ["nats://central:4222"]

v0.2 adds a [zitadel] section enabling the bootstrap-via-token flow (see §11 roadmap). Additive, not breaking. Target long-term state: device boots → PXE or minimal TOML → Zitadel URL + token → fetches real config from OpenBao → connects to NATS. OpenBao outage doesn't break reconnect because the NATS auth callout validates tokens against Zitadel JWKS directly (with cached group permissions); NATS rejects only when the token actually expires.


7. Friday evening critical path — the aarch64 investigation

The previous walking skeleton draft had a "§6 decision point" on whether Harmony's Interpret works in daemon mode. That's resolved — the TUI does s.clone().interpret().await as a daemon pattern, and harmony_agent manages distributed CNPG in production using exactly this. Not a concern.

The real concern that replaces it: Harmony does not currently compile on aarch64. When harmony_agent was cross-compiled for ARM64, an upstream dependency had to be pulled out. This was likely a single sub-dependency used by only a few modules, feature-gatable so those modules become unavailable on ARM (acceptable — the device doesn't need every Harmony feature). Estimated as a quick fix (~80% confidence, per Sylvain's recollection).

Friday evening investigation (30-60 min, first):

  1. cargo build --target aarch64-unknown-linux-musl -p harmony on the workspace. Capture the error.
  2. Identify the offending crate and the module(s) in Harmony that depend on it.
  3. Apply feature-gate: add a cfg(not(target_arch = "aarch64")) attribute to the offending module, or introduce a Cargo feature flag (--features x86-only) that the ARM build skips.
  4. Verify: cargo build --target aarch64-unknown-linux-musl -p harmony --features <minimal> succeeds.
  5. Run the unit tests that exist for the feature-gated modules on x86_64 to confirm we haven't broken anything on the primary platform.
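As a sketch of the cfg-attribute option in step 3 (module and function names are illustrative, not Harmony's real layout):

```rust
// Gate a module (and everything that needs the offending crate) off the
// aarch64 build. The alternative is a Cargo feature flag plus
// #[cfg(feature = "kvm")]; the mechanics are identical.
#[cfg(not(target_arch = "aarch64"))]
pub mod virt {
    /// Stand-in for functionality whose dependency only builds on x86_64.
    pub fn backend_name() -> &'static str {
        "kvm"
    }
}

/// Callers branch on the same cfg so the rest of the crate compiles on
/// both targets.
pub fn describe_backend() -> &'static str {
    #[cfg(not(target_arch = "aarch64"))]
    return virt::backend_name();
    #[cfg(target_arch = "aarch64")]
    return "unavailable on aarch64";
}
```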

Budget: 2 hours max Friday night. If not resolved in 2 hours:

  • Fallback A: Build the agent against only the crates that do compile on aarch64 (harmony_agent, harmony_types, whatever subset). Implement the PodmanV0 Score directly in the agent crate using its own trait impls for now. Reunify with the main Harmony codebase in v0.1 after the compile fix is properly done.
  • Fallback B: (only if Fallback A also blocks) Write the v0 agent as pure Rust without Harmony Score traits. Adopt them in v0.1 after the aarch64 fix lands. This is the walking-skeleton-surfaces-real-issue scenario from §10.

Document findings in the Friday night log regardless of outcome. v0.1 work includes proper fix if we took a shortcut.

This is the single most important investigation of the weekend. Do it before anything else Friday. Every downstream decision (can the agent use Harmony Score traits? what's agent A3 cross-compiling?) depends on it.


8. Hour-by-hour plan

Friday evening (3-4 hours)

Goal by end of Friday night: aarch64 path clear; operator running in central cluster writes to NATS on CR apply; agent crate compiling on laptop, talking to NATS; Pi provisioning plan chosen.

Hour 1 — aarch64 investigation + decisions + dispatches

Your work:

  • aarch64 investigation per §7 (30-60 min, first thing).
  • Write 1-page v0-demo.md: demo script, success criteria, fallback plan.
  • Decide Pi OS: Ubuntu 24.04 ARM64 (default) vs Raspberry Pi OS 64-bit. Don't agonize beyond 10 min.

Dispatch agent A1 (operator): "Create Rust crate iot/iot-operator-v0/ using kube-rs implementing a Deployment CRD controller that writes to NATS KV. Exact spec in task card §9.A1. Self-verify: kubectl apply → nats kv get shows entry. Under 300 lines main.rs. No auth."

Dispatch agent A2 (Pi provisioning, fallback-aware): "Attempt Harmony-based Raspberry Pi 5 provisioning Score. Target: fresh Pi flashed via SD card, boots, static IP, Ubuntu 24.04 ARM64 with Podman 4.9, podman user socket enabled, user iot-agent with linger enabled, /etc/iot-agent/ ready. If Harmony doesn't have Pi primitives, document the gap and produce a manual provisioning runbook instead (rpi-imager + cloud-init). Hard time limit: 90 min. Self-verify: ssh iot-agent@<pi-ip> 'podman --version' returns 4.4+."

Hour 2 — your work: agent crate

Start writing the agent yourself. Core customer-experience code; you own its shape.

Crate in harmony/src/modules/iot_agent/ or a new binary in the Harmony workspace (follow existing conventions — Harmony modules live in harmony/src/modules/):

  • Under 500 lines for v0.
  • Dependencies: async-nats, serde, serde_json, tokio, tracing, anyhow, podman-api, plus Harmony workspace deps.
  • Main loop per §5.4.
  • CredentialSource trait (§6.3) with TomlFileCredentialSource impl.
  • Score enum (§5.5) with PodmanV0 variant.
  • PodmanV0Score implements Harmony's Score + Interpret traits. Score lives in harmony/src/modules/podman/ (new module) following existing Harmony module conventions. podman-api crate for container operations — no shell-out.

Hour 3 — local integration

  • Review agent A1's operator. Deploy to central cluster iot-system namespace.
  • Deploy NATS to iot-system if not already (single-node JetStream).
  • Review agent A2's Pi provisioning. If Harmony Score succeeded, note for demo; if manual runbook, accept and move on.
  • Agent compiles on laptop. Connects to central NATS.

Hour 4 — first partial handshake

  • kubectl apply a Deployment CR targeting pi-demo-01.
  • Verify: nats kv get desired-state pi-demo-01.test-deploy shows entry.
  • Run agent locally on laptop with DEVICE_ID=pi-demo-01, confirm it reads the KV entry and prints what it would do.
  • First success: local end-to-end without actual podman execution. Good for tonight.

Stop by 10 PM.

Saturday (2-3 hours, light supervision)

Goal: local end-to-end working — laptop agent starts a podman container when CR is applied. Pi provisioning rehearsed if Harmony path succeeded.

Morning check-in (30 min).

Dispatch agent A3 (installer) Saturday morning. Task card §9.A3.

Your work (2 hours):

  • Finish agent's happy path: start container via podman-api, remove on CR deletion, transition on Score string change.
  • End-to-end test on laptop: agent + central NATS + central operator. Expect a design bug. Budget an extra hour.

Dispatch agent A4 (demo script) Saturday afternoon. Task card §9.A4.

Sunday (2-3 hours, light supervision)

Goal: Demo script works against the real Pi in the lab. Not polished; works.

Your work:

  • Run agent A4's demo against real Pi. Fix what breaks.
  • First clean Pi success = shipping-confidence milestone.
  • Run 3 more clean-room. Document failure modes.

No A5 agent task this time — thesis doc deferred to Week 2 per §2 decision.

Monday (8 focused hours)

Goal: Demo runs reliably. Tuesday ship-ready.

Hour 1-2 — field deployment readiness:

This is the most important class of failures for Pi-in-field deployment: the partner's devices will be power-cycled, and their networks will flap. These matter more than polish.

  • Power cycle test: unplug Pi, wait 30s, plug back in. Target: boot-to-reconciled within 90s. Run this 5+ times. This is the most important test of the weekend.
  • Network-out-during-boot: boot Pi with NATS blocked (iptables or network shutdown). Agent starts, waits, reconciles when NATS is reachable. No crash loop.
  • Agent crash loop: corrupt the config, let systemd restart loop kick in. Back-off works; device not bricked; systemctl status shows the failure clearly.

Not tested (ruled out per explicit partner conversation): SD card wear (partner's app has few IOPS), thermal throttling (environment doesn't cause sustained high temps), PoE-specific failure modes (not relevant here).

Hour 3-4 — demo polish:

  • ./demo.sh is one command, no manual steps.
  • Output is clean: clear PASS/FAIL with per-phase timings.
  • kubectl get deployments.iot.nationtech.io output is readable.

Hour 5-6 — partner-facing polish:

  • README in workload repo: 4 lines. "Edit this, git push, done."
  • ArgoCD auto-sync enabled on partner's repo.
  • CR .status updates within ~10s of agent report.

Hour 7 — failure demo prep:

Partner will ask "what if X fails." Prep answers, demonstrate 1-2 live:

  • "Pi loses network" → container keeps running, reconnects when network returns.
  • "Central cluster down" → already-running containers keep running.
  • "Agent crashes" → systemd restarts, re-reads NATS, no data loss.
  • "NATS down" → agent shows clear "not connected" status; recovers on NATS return.

Hour 8 — rest case: One full clean-room demo run. Time it. Write Tuesday runbook. Stop by 9 PM.

Tuesday morning (4 hours) — ship

  • Hour 1: Final clean-room run on our infra. Passes → proceed.
  • Hour 2: Setup for partner demo (on-site or screenshare).
  • Hour 3: Demo walkthrough. Git push → container. Edit → transition. One failure demo. Q&A.
  • Hour 4: Handoff + v0.1/v0.2 plan conversation. Align on next-week milestone.


9. Agent task cards

Each card is self-contained. Hand the entire card to an agent.

Mandatory verification for every agent task — must pass before completion:

# Native CI check (check + fmt + clippy + test)
./build/check.sh

# Cross-compilation — aarch64 builds must succeed for all IoT-critical crates.
# Note: harmony is built with --no-default-features to exclude KVM (libvirt cannot cross-compile to aarch64).
# The 5 KVM examples (kvm_vm_examples, kvm_okd_ha_cluster, opnsense_vm_integration,
# opnsense_pair_integration, example_linux_vm) are x86_64-only by design.
cargo build --target x86_64-unknown-linux-gnu -p harmony -p harmony_agent -p iot-agent-v0 -p iot-operator-v0
cargo build --target aarch64-unknown-linux-gnu -p harmony --no-default-features -p harmony_agent -p iot-agent-v0 -p iot-operator-v0

All three must exit 0. Note: cargo test --target aarch64-unknown-linux-gnu cannot run on x86_64 (exec format error) — that's expected. Test execution is only for the host architecture via ./build/check.sh. If any check fails, fix the issue before marking the task complete. Include the output in the PR description.

A1: Operator skeleton (Friday)

Goal: kube-rs operator that watches Deployment CRs and writes the Score to NATS KV.

Deliverable: Crate iot/iot-operator-v0/:

  • Cargo.toml: kube, k8s-openapi, async-nats, serde, serde_yaml, serde_json, tokio, tracing, tracing-subscriber, anyhow.
  • src/main.rs under 300 lines.
  • deploy/operator.yaml — Deployment, ServiceAccount, ClusterRole, ClusterRoleBinding.
  • deploy/crd.yaml — Deployment CRD for iot.nationtech.io/v1alpha1.

Behavior:

  1. Connect to NATS on startup (NATS_URL env, no auth).
  2. Ensure JetStream KV bucket desired-state exists (create if not).
  3. Watch Deployment CRs cluster-wide.
  4. On reconcile: for each device_id in spec.targetDevices, write key <device_id>.<deployment-name> in desired-state bucket with the serialized Score (adjacently-tagged JSON per §5.5).
  5. On CR delete: remove corresponding KV keys.
  6. Update .status.observedScoreString with the string that was written (for human inspection and change detection).
  7. Log every reconcile.
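The fan-out in step 4 is a pure mapping from the CR spec to bucket keys. A sketch, with a hypothetical helper name:

```rust
/// Keys (within the desired-state bucket) that one Deployment CR fans
/// out to: one per target device. Helper name is illustrative.
pub fn kv_keys(target_devices: &[&str], deployment: &str) -> Vec<String> {
    target_devices
        .iter()
        .map(|device_id| format!("{device_id}.{deployment}"))
        .collect()
}
```

Delete handling is the inverse: the same mapping yields the keys to remove.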

CRD schema:

spec:
  targetDevices: [string]       # required, at least 1
  score:
    type: string                # required, e.g. "PodmanV0"
    data: object                # required, Score-type-specific
  rollout:
    strategy: string            # v0: only "Immediate"
status:
  observedScoreString: string
  conditions: [stdCondition]

Self-verification:

cd iot/iot-operator-v0
cargo build && cargo clippy -- -D warnings

# Test against k3d:
k3d cluster create iot-test --wait
kubectl apply -f deploy/crd.yaml
docker run -d --rm -p 4222:4222 -p 8222:8222 --name nats nats:latest -js
NATS_URL=nats://localhost:4222 RUST_LOG=info cargo run &
OP_PID=$!

sleep 3
kubectl apply -f - <<EOF
apiVersion: iot.nationtech.io/v1alpha1
kind: Deployment
metadata:
  name: test-deploy
  namespace: default
spec:
  targetDevices: [test-device-01]
  score:
    type: PodmanV0
    data:
      services:
        - name: hello
          image: nginx:alpine
          ports: ["8080:80"]
  rollout:
    strategy: Immediate
EOF

sleep 5
nats --server nats://localhost:4222 kv get desired-state test-device-01.test-deploy
# Must print the Score JSON with type="PodmanV0"

kubectl get deployment.iot.nationtech.io test-deploy -o jsonpath='{.status.observedScoreString}'
# Must print the stored string

kill $OP_PID
k3d cluster delete iot-test
docker stop nats

Forbidden:

  • Code outside iot/iot-operator-v0/.
  • Zitadel, OpenBao, auth callout dependencies.
  • Parsing score.data.
  • Rollout logic beyond KV writes.

Completion: Push branch, open PR with verification output in description.

A2: Pi provisioning (Friday)

Goal: Harmony Score to provision a Raspberry Pi 5 ready for IoT agent. Fallback to documented manual procedure if Harmony doesn't have Pi primitives.

Deliverable (primary path): New module harmony/src/modules/rpi/ implementing Score + Interpret for Pi provisioning. Follow existing Harmony KVM Score patterns.

Deliverable (fallback path): docs/iot/pi-provisioning-manual.md with: rpi-imager steps, cloud-init YAML, package install list, user setup commands, verification steps.

Target state regardless of path:

  • Ubuntu Server 24.04 LTS ARM64 (or Raspberry Pi OS 64-bit if Ubuntu fails).
  • Static IP on lab network.
  • Packages: podman, systemd-container, openssh-server, curl, jq.
  • systemctl --user enable --now podman.socket for user iot-agent.
  • User iot-agent with linger enabled (loginctl enable-linger iot-agent).
  • /etc/iot-agent/ (owned by iot-agent, 0750).
  • /var/lib/iot-agent/.

Self-verification:

ssh iot-agent@<pi-ip> 'podman --version'
# Must be 4.4+ (target 4.9+)
ssh iot-agent@<pi-ip> 'systemctl --user is-active podman.socket'
# Must print "active"
ssh iot-agent@<pi-ip> 'loginctl show-user iot-agent | grep Linger=yes'
ssh iot-agent@<pi-ip> 'uname -m'
# Must print aarch64

Time limit: 90 min agent time.

Forbidden: Docker. x86_64 base images. Hard-coded credentials.

A3: Agent installer (Saturday)

Goal: Deploy agent to Pi via SSH using aarch64-cross-compiled binary.

Prerequisites: Agent binary exists (Sylvain writes Friday).

Deliverable: iot/iot-agent-v0/scripts/install.sh:

  1. Args: --host <ip>, --device-id <id>, --nats-url <url>, --nats-user <u>, --nats-pass <p>.
  2. Cross-builds for aarch64 using existing Harmony aarch64 toolchain.
  3. scp binary to Pi, sudo mv to /usr/local/bin/iot-agent.
  4. Templates /etc/iot-agent/config.toml from args.
  5. Installs /etc/systemd/system/iot-agent.service.
  6. systemctl daemon-reload && systemctl enable --now iot-agent.
  7. Waits up to 15s for "connected to NATS" in journal.

systemd unit:

[Unit]
Description=IoT Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=iot-agent
ExecStart=/usr/local/bin/iot-agent
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
Environment=RUST_LOG=info

[Install]
WantedBy=multi-user.target

Self-verification:

./install.sh --host <pi-ip> --device-id pi-demo-01 \
    --nats-url nats://central:4222 \
    --nats-user iot-agent --nats-pass dev-shared-password
ssh iot-agent@<pi-ip> 'sudo systemctl status iot-agent'  # active (running)
ssh iot-agent@<pi-ip> 'sudo journalctl -u iot-agent --since "2 minutes ago"' | grep "connected to NATS"

Time limit: 2 hours agent time.

A4: End-to-end demo script (Saturday)

Goal: One command runs full demo flow.

Deliverable: iot/scripts/demo.sh:

  1. Verifies Pi reachable + agent running.
  2. Applies scripts/demo-deployment.yaml.
  3. Waits up to 120s for container on Pi (ssh + podman ps).
  4. curl http://<pi-ip>:8080 — expects nginx page.
  5. Deletes CR, waits up to 60s for removal.
  6. Prints PASS or FAIL with per-phase timings.
  7. Cleans up on failure.

Self-verification:

./iot/scripts/demo.sh
# Ends with "PASS", total < 5 min

Time limit: 2 hours agent time.


10. Anti-patterns the plan prevents

  • Premature contract extraction. No iot-contracts crate in v0. Inline types. Extract v0.1 when they've proven their shape through use.
  • Quadlet under deadline. Direct podman-api for v0. Quadlet evaluation in v0.1+ (possibly via podlet crate for code generation). User systemd quirks are a real cost under deadline pressure.
  • Agent-driven refactors. If an agent suggests "I could clean this up," say no. v0 ships first.
  • Harmony rewrite. Use what fits. If something doesn't fit cleanly, document and work around.
  • Second device in v0. One Pi Tuesday. Second in v0.1.
  • Dashboards/TUI/API for v0. kubectl and nats CLI are v0 operator UX. Partner UX is git push.
  • OKD-cluster-as-device in v0 or v0.x. Strategic roadmap, not execution plan. Keep focus.
  • Weekend overwork. 2-3 hrs/day max Sat/Sun. Monday is where the hours are.

11. If the demo doesn't work Monday night

Flaky polish (reconnect timing, status lag): Ship Tuesday with minor caveats. Partner tolerates.

End-to-end happy path unreliable: Push ship by half a day or full day. Broken demo hurts partner trust more than 1-day slip. Communicate early Tuesday morning.

Genuine architectural flaw (e.g., NATS KV watches lose events under load): The walking skeleton has done its job — problem discovered cheap, not in week 3. Regroup Tuesday morning, push by 2-3 days, present to partner as "we found a design issue, here's the fix." They respect honesty.


12. Post-Tuesday milestones

v0.1 (Wed-Fri week 2): Hardening informed by v0 deployment.

  • Harmony aarch64 compile properly fixed if we took a Friday shortcut.
  • iot-contracts crate extracted (consolidate inline types).
  • Second Pi added, regression-tested.
  • Status aggregation in operator (CRD .status.aggregate).
  • Inventory reporting from agent.
  • Basic journald log streaming prototype.
  • Field-readiness test suite running automated against a VM (power cycle, network-out, agent crash loop).
  • Thesis document: 2 pages covering 3-year platform vision, written after seeing v0 run.

v0.2 (Mon-Fri week 3): Auth layer.

  • Zitadel service accounts per device.
  • Device-side JWT Profile client.
  • OpenBao JWT auth method configured.
  • Auth callout service implementing the bearer-token NATS JWT minting pattern from the architecture doc.
  • Availability-favoring design: auth callout caches OpenBao policy lookups; on OpenBao failure, cached permissions are used; NATS rejects only on actual token expiry. A reboot doesn't force re-verification more than a passing minute does.
  • Scoping test suite.
  • Shared credentials removed.
  • Bootstrap flow: device has Zitadel URL + initial token on disk → fetches NATS config from OpenBao → connects to NATS. Device TOML narrows to minimal bootstrap-only config.

v0.3 (week 4+): Scale + partner-driven features.

  • Multiple workloads per device.
  • Progressive rollout.
  • Real log streaming.
  • API service.
  • Observability (Prometheus + Grafana).
  • Automated field-readiness tests running on real Pi in CI.

v0.4+ (weeks 5-8, partner's 2-month target): Production hardening.

  • TPM-backed device keys.
  • Scale testing with partner's real fleet size.
  • Runbook maturation.
  • First non-demo production deployment for end-customer.

13. Tuesday partner conversation

Don't just demo. Frame. Prepared talking points:

"Here's what we shipped today."

  • Git push → container on Pi.
  • CRD surface they can start building against.
  • No auth yet — shared credentials for internal use.

"Here's what's coming next week (v0.1)."

  • Harmony Score integration polished.
  • Second Pi, multi-device demo.
  • Status visibility.

"Here's what's coming week 2 (v0.2)."

  • Real authentication: Zitadel + OpenBao.
  • Per-device scoped credentials.
  • Production-grade security.

"Here's how we're doing it."

  • Walking skeleton: ship early, harden based on real use.
  • We want them to start building against v0 today.
  • Feedback from their real use shapes v0.1/v0.2 priorities.

"Here's what we need from them."

  • Early feedback on the CRD surface — does it fit how they want to deploy?
  • Access to a test Pi from their fleet (if available) for v0.1/v0.2 testing.
  • Rough timeline for their application development so we can sequence hardening with them.

This conversation is as much a part of Tuesday's deliverable as the running demo. Don't skip it.