harmony/ROADMAP/fleet_platform/demo_runbook.md
Jean-Gabriel Gill-Couture 4053ac52de docs(fleet): demo runbook (operator + developer flow, single page)
Hands-on walkthrough for the 48-hour customer demo:

- Operator: build/push the callout image → fleet-staging-deploy →
  capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster,
  cross-compile the agent for aarch64, run fleet-rpi-setup once
  per device with --bootstrap-token. Each Pi's agent connects to
  NATS over WSS using the JWT-bearer token minted from its
  per-device keyfile.
- Deploy a container to a labeled subset via
  example_harmony_apply_deployment with --env / --volume / --restart
  flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth
  callout's logs.

Also captures what's deliberately NOT in the demo (compose
auto-translation, UI, Tailscale backdoor, device-join-request
flow, OpenBao, K8s OIDC) so the customer call has clean expectation-
setting.

The runbook is the closing piece of the 48h-demo work plan;
sequenced after the eight feat / refactor commits that built the
underlying functionality.
2026-05-03 15:43:10 -04:00


Fleet Platform Demo Runbook

48-hour-demo edition. Covers the operator side (NationTech) and the customer-developer side (two developers onboarding two Pis and applying a container deployment to them). Hands-on, no UI yet.

Roles

  • NationTech operator — runs fleet-staging-deploy once against the customer's OKD cluster.
  • Customer developer — runs fleet-sso-login to prove auth works, then runs fleet-rpi-setup for each Pi, then applies their workload via the existing harmony-apply-deployment example.

Prerequisites

Cluster (operator-side)

  • OKD ≥ 4.10 (HAProxy ingress, edge-TLS).
  • Wildcard DNS *.<base-domain> pointing at the cluster ingress IP (e.g. *.customer1.nationtech.io).
  • Wildcard cert that the HAProxy router serves for that domain (the default OKD pattern).
  • cert-manager and cloudnative-pg operators installed (the Zitadel chart depends on them via K8sAnywhereTopology's ensure_ready).
  • Access to a container registry the cluster can pull from. Customer may have their own; the default in fleet-staging-deploy is quay.io/nationtech/harmony-nats-callout:demo.

Driver machine (operator + developers)

  • kubectl with kubeconfig wired up.
  • cargo (Rust toolchain).
  • podman (used to build the agent image / fleet-callout image).
  • ssh into the Pis from the developers' machines.

Pis

  • Pi OS Lite booted, SSH server enabled, developer's SSH pubkey in ~/.ssh/authorized_keys. fleet-rpi-setup handles the rest.

Operator: deploy the staging stack

# 1. Build the callout image and push it to a registry the cluster can pull from
#    (the default is shown; substitute the customer's registry if they have one).
cargo build --release -p harmony-nats-callout
podman build -t quay.io/nationtech/harmony-nats-callout:demo \
  -f nats/callout/Dockerfile .
podman push quay.io/nationtech/harmony-nats-callout:demo

# 2. Deploy the central stack.
cargo run -p example-fleet-staging-deploy -- \
  --base-domain customer1.nationtech.io \
  --kube-context customer1-prod \
  --callout-image quay.io/nationtech/harmony-nats-callout:demo \
  --nats-auth-pass "$(openssl rand -hex 16)" \
  --nats-system-pass "$(openssl rand -hex 16)"

Expected output ends with a "next steps" panel containing the project ID, the harmony-cli client_id, the NATS WSS URL, and the exact follow-up commands. Save those — both developers will need them.

Developer: prove SSO works

cargo run -p example-fleet-sso-login -- \
  --base-domain customer1.nationtech.io \
  --client-id <CLI_CLIENT_ID printed by staging deploy>

The browser opens, the developer logs in to Zitadel, and the CLI prints Welcome <name> <email> and persists the session to ~/.local/share/harmony/sso-session.json.

Two developers each do this once with their own Zitadel accounts.

Operator (or developer with an admin PAT): onboard a Pi

# Extract the Zitadel admin PAT once (it's in a K8s secret on the
# staging cluster).
PAT=$(kubectl --context customer1-prod \
  -n zitadel get secret iam-admin-pat \
  -o jsonpath='{.data.pat}' | base64 -d)

# Cross-compile the agent for aarch64 (one-time per agent rev).
cargo build --release --target aarch64-unknown-linux-gnu -p harmony-fleet-agent

# Onboard Pi #1 — sensor on the floor with arch=aarch64, group=group-a.
cargo run -p example-fleet-rpi-setup -- \
  --pi-host 192.168.1.42 \
  --pi-user pi \
  --device-id sensor-floor-01 \
  --labels "group=group-a,arch=aarch64,role=sensor" \
  --bootstrap-token "$PAT" \
  --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
  --zitadel-project-id <PROJECT_ID printed by staging deploy> \
  --nats-url wss://nats.customer1.nationtech.io/ \
  --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent

# Onboard Pi #2 — different group label so we can target by selector.
cargo run -p example-fleet-rpi-setup -- \
  --pi-host 192.168.1.43 \
  --pi-user pi \
  --device-id sensor-shelf-02 \
  --labels "group=group-b,arch=aarch64,role=sensor" \
  --bootstrap-token "$PAT" \
  --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
  --zitadel-project-id <PROJECT_ID> \
  --nats-url wss://nats.customer1.nationtech.io/ \
  --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent

Each Pi onboarding does the following on the device:

  • Installs podman + systemd-container.
  • Creates the fleet-agent user (with subuid/subgid for rootless podman + linger).
  • Drops the per-device Zitadel JSON key at /etc/fleet-agent/zitadel-key.json (mode 0640, owner fleet-agent).
  • Renders /etc/fleet-agent/config.toml with type = "zitadel-jwt" pointing at the keyfile.
  • Starts fleet-agent.service under systemd.
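
A quick post-onboarding check can confirm the keyfile permissions listed above. This is a minimal sketch, assuming GNU stat (as shipped on Debian-based Pi OS); the helper name check_mode is hypothetical and is not part of fleet-rpi-setup.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE has exactly the expected octal mode.
# Assumes GNU stat (coreutils); BSD/macOS stat uses different flags.
check_mode() {
    file=$1
    expected=$2
    actual=$(stat -c '%a' "$file") || return 2
    [ "$actual" = "$expected" ]
}

# On the Pi, after onboarding (path and mode taken from the list above):
#   check_mode /etc/fleet-agent/zitadel-key.json 640 && echo "keyfile mode OK"
```

A non-zero exit would suggest the keyfile was modified after onboarding or that the setup run did not complete.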

The agent connects to NATS over WSS using the JWT-bearer token it mints from its keyfile. Together, async-nats's auto-reconnect and the auth callback re-mint the token on every reconnect attempt, so the "never lose connectivity" property holds across:

  • Token expiry (12h Zitadel default; the token is re-minted ~5 minutes before it expires).
  • NATS pod restart (chart upgrade, drain, etc.).
  • Pi network blip (DHCP renewal, Wi-Fi roam).
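
As a worked example of the token-expiry timing in the first bullet, the arithmetic below assumes exactly a 12-hour lifetime and a 5-minute margin (the runbook gives the margin only approximately). The function name remint_after is illustrative and does not exist in the agent.

```shell
#!/bin/sh
# Seconds after issuance at which a token would be re-minted,
# given its lifetime and a safety margin (both in seconds).
remint_after() {
    lifetime=$1
    margin=$2
    echo $(( lifetime - margin ))
}

remint_after $(( 12 * 3600 )) $(( 5 * 60 ))   # prints 42900 (11h55m)
```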

Verify the fleet from the operator side

kubectl --context customer1-prod -n fleet-system get device.fleet.nationtech.io
# NAME                LABELS
# sensor-floor-01     arch=aarch64,group=group-a,role=sensor
# sensor-shelf-02     arch=aarch64,group=group-b,role=sensor

kubectl --context customer1-prod -n fleet-system logs deployment/fleet-callout
# ... received auth callout request
# ... Zitadel JWT validated, generating user JWT  device_id=sensor-floor-01  role=device

Developer: deploy a container to a labeled subset

# Apply the customer's backend (single service + sqlite volume + envs)
# to every device with group=group-a.
cargo run -p example_harmony_apply_deployment -- \
  --namespace fleet-demo \
  --name customer-backend \
  --selector group=group-a \
  --image registry.example.com/customer/backend:1.4 \
  --port 8080:8080 \
  --env DATABASE_URL=sqlite:///data/app.db \
  --env LOG_LEVEL=info \
  --volume /var/lib/customer-backend:/data \
  --restart unless-stopped

The operator sees one Deployment CR materialized, NATS KV gets a desired-state.<device-id>.customer-backend entry per matched device, and each Pi's agent reconciles podman to match. The container's data persists across agent restarts and Pi reboots because the bind mount survives both.
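
To make the selector-to-key fan-out concrete, the sketch below derives the desired-state keys for the two Pis onboarded earlier. It is an illustration only: it handles a single key=value equality selector, and labels_match and desired_state_keys are hypothetical names, not the operator's actual resolution logic.

```shell
#!/bin/sh
# Succeeds if the comma-separated LABELS string ($1) contains the
# key=value pair given as SELECTOR ($2).
labels_match() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# For each "device-id labels" line on stdin that matches the selector,
# print the KV key in the desired-state.<device-id>.<name> shape.
desired_state_keys() {
    selector=$1
    deployment=$2
    while read -r device labels; do
        if labels_match "$labels" "$selector"; then
            echo "desired-state.$device.$deployment"
        fi
    done
}

printf '%s\n' \
    'sensor-floor-01 group=group-a,arch=aarch64,role=sensor' \
    'sensor-shelf-02 group=group-b,arch=aarch64,role=sensor' \
  | desired_state_keys group=group-a customer-backend
# prints: desired-state.sensor-floor-01.customer-backend
```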

kubectl get device shows the agents heartbeating; their per-deployment state shows up on Device.status.aggregate (Chapter 2 reflect-back already in place).

Translating a docker-compose to a Deployment CR

For the call: walk through the customer's compose file once and paste the equivalent --env/--volume/--port flags. Bind mounts only; named volumes need a separate decision per service. Most compose shapes translate mechanically; depends_on / startup ordering does not (PodmanV0 has no ordering primitive, and designing one is out of scope for the demo).
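
The manual mapping can be sketched as follows. The compose service in the comments and the compose_to_flags helper, including its simplified "kind value" input format, are hypothetical; only the flags themselves come from the deployment command in this runbook.

```shell
#!/bin/sh
# Hypothetical compose service (illustrative only, not the customer's file):
#
#   services:
#     backend:
#       image: registry.example.com/customer/backend:1.4
#       restart: unless-stopped
#       ports:       ["8080:8080"]
#       environment: ["LOG_LEVEL=info"]
#       volumes:     ["/var/lib/customer-backend:/data"]   # bind mount
#
# Emit one flag per "kind value" line read from stdin. Named volumes
# (source side without a leading "/") are reported on stderr, not translated.
compose_to_flags() {
    while read -r kind value; do
        case "$kind:$value" in
            image:*)    echo "--image $value" ;;
            restart:*)  echo "--restart $value" ;;
            port:*)     echo "--port $value" ;;
            env:*)      echo "--env $value" ;;
            volume:/*)  echo "--volume $value" ;;
            volume:*)   echo "named volume needs a decision: $value" >&2 ;;
        esac
    done
}

printf '%s\n' \
    'port 8080:8080' \
    'env LOG_LEVEL=info' \
    'volume /var/lib/customer-backend:/data' \
  | compose_to_flags
```

Keys such as depends_on are intentionally ignored here, matching the limitation noted above.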

Cross-device security model (worth showing)

  • Pi A's NATS connection has a user JWT permissioned to device-state.sensor-floor-01.> and device-commands.sensor-floor-01.>.
  • Pi A cannot publish or subscribe to sensor-shelf-02's subjects because the auth callout never grants them.
  • An admin user (Zitadel role fleet-admin) gets > on both publish + subscribe — they observe every device.
  • A user with no fleet role is rejected at NATS connect time.

This is the same security model the local examples/fleet_auth_callout suite (3 cargo tests sharing a OnceCell k3d cluster) verifies in CI.
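
The effect of the per-device grants can be illustrated with a small matcher for the trailing ">" wildcard, which in NATS covers one or more remaining subject tokens. This is a sketch of the matching rule only, not the callout's code; subject_allowed is a hypothetical helper, and the ".status" and ".restart" suffixes in the examples are made up.

```shell
#!/bin/sh
# Succeeds if SUBJECT ($2) is covered by PATTERN ($1), where PATTERN is a
# literal subject, the bare ">" wildcard, or a prefix ending in ".>".
# The single-token "*" wildcard is deliberately not handled here.
subject_allowed() {
    pattern=$1
    subject=$2
    case "$pattern" in
        '>')   [ -n "$subject" ] ;;
        *'.>') prefix=${pattern%'>'}      # strip ">" but keep the trailing dot
               case "$subject" in
                   "$prefix"?*) return 0 ;;
                   *)           return 1 ;;
               esac ;;
        *)     [ "$subject" = "$pattern" ] ;;
    esac
}

# Pi A's grant covers its own subjects only:
#   subject_allowed 'device-state.sensor-floor-01.>' 'device-state.sensor-floor-01.status'   # succeeds
#   subject_allowed 'device-state.sensor-floor-01.>' 'device-state.sensor-shelf-02.status'   # fails
# An admin's ">" grant covers everything:
#   subject_allowed '>' 'device-commands.sensor-shelf-02.restart'                            # succeeds
```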

What's NOT in the demo

  • Compose-to-Deployment auto-translation (low priority — manual translation during the call works).
  • A web UI for harmony fleet apply (post-demo).
  • Tailscale/Headscale-based SSH backdoor to the Pis (separate daemon, out of scope).
  • Device-join-request + admin-approve flow (would replace bootstrap-PAT pattern; out of scope).
  • OpenBao for non-NATS secrets (env-var-only is fine for demo).
  • K8s OIDC integration so kubectl accepts Zitadel JWTs (post-demo).

Re-run idempotency

Every harness in this runbook is idempotent.

  • fleet-staging-deploy rides helm-upgrade-by-default, the ZitadelSetupScore search-then-create loop, and a persisted issuer NKey in a K8s secret.
  • fleet-rpi-setup byte-compares the rendered TOML against the device's existing config and only reapplies on drift; the keyfile drop + agent restart only happen when something actually changed.
  • harmony-apply-deployment is a kube::Api::patch(...) apply, so re-running with the same fields is a server-side no-op.
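
The compare-then-apply pattern described for fleet-rpi-setup can be sketched in a few lines. This is a local illustration using cmp, not the harness's implementation (which is Rust and operates on the remote device); apply_if_drifted is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: copy RENDERED ($1) over TARGET ($2) only when the
# bytes differ or TARGET is missing. Prints "changed" or "unchanged".
apply_if_drifted() {
    rendered=$1
    target=$2
    if [ -f "$target" ] && cmp -s "$rendered" "$target"; then
        echo unchanged
    else
        cp "$rendered" "$target"
        echo changed      # a real harness would restart the agent at this point
    fi
}
```

Running it twice with the same rendered file reports "changed" and then "unchanged", which is the property that makes re-runs safe.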