harmony/ROADMAP/fleet_platform/demo_runbook.md
Jean-Gabriel Gill-Couture 4053ac52de docs(fleet): demo runbook (operator + developer flow, single page)
Hands-on walkthrough for the 48-hour customer demo:

- Operator: build/push the callout image → fleet-staging-deploy →
  capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster,
  cross-compile the agent for aarch64, run fleet-rpi-setup once
  per device with --bootstrap-token. Each Pi's agent connects to
  NATS over WSS using the JWT-bearer token minted from its
  per-device keyfile.
- Deploy a container to a labeled subset via
  example_harmony_apply_deployment with --env / --volume / --restart
  flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth
  callout's logs.

Also captures what's deliberately NOT in the demo (compose
auto-translation, UI, Tailscale backdoor, device-join-request
flow, OpenBao, K8s OIDC) so the customer call has clean expectation-
setting.

The runbook is the closing piece of the 48h-demo work plan;
sequenced after the eight feat / refactor commits that built the
underlying functionality.
2026-05-03 15:43:10 -04:00


# Fleet Platform Demo Runbook
48-hour-demo edition. Covers the operator-side (NationTech) and the
customer-developer-side (two devs onboarding two Pis, applying a
container deployment to them). Hands-on, no UI yet.
## Roles
- **NationTech operator** — runs `fleet-staging-deploy` once against the
customer's OKD cluster.
- **Customer developer** — runs `fleet-sso-login` to prove auth works,
then runs `fleet-rpi-setup` for each Pi, then applies their workload
via the existing `harmony-apply-deployment` example.
## Prerequisites
### Cluster (operator-side)
- OKD ≥ 4.10 (HAProxy ingress, edge-TLS).
- Wildcard DNS `*.<base-domain>` pointing at the cluster ingress IP
(e.g. `*.customer1.nationtech.io`).
- Wildcard cert that the HAProxy router serves for that domain (the
default OKD pattern).
- `cert-manager`, `cloudnative-pg` operators installed (Zitadel chart
depends on them via `K8sAnywhereTopology`'s ensure_ready).
- Access to a container registry the cluster can pull from. Customer
may have their own; the default in `fleet-staging-deploy` is
`quay.io/nationtech/harmony-nats-callout:demo`.
### Driver machine (operator + developers)
- `kubectl` with kubeconfig wired up.
- `cargo` (Rust toolchain).
- `podman` (used to build the agent image / fleet-callout image).
- `ssh` into the Pis from the developers' machines.
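
Before starting, it can help to confirm the driver machine actually has these tools on `PATH`. The following is a minimal sketch; `missing_tools` is a hypothetical helper written for this runbook, not part of Harmony.

```bash
# Hypothetical preflight helper (not part of Harmony).
# Prints every tool from the argument list that is not on PATH and
# returns non-zero if at least one is missing.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tool list taken from the prerequisites above:
#   missing_tools kubectl cargo podman ssh || echo "install the tools above first"
```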
### Pis
- Pi OS Lite booted, SSH server enabled, developer's SSH pubkey in
`~/.ssh/authorized_keys`. `fleet-rpi-setup` handles the rest.
## Operator: deploy the staging stack
```bash
# 1. Build the callout image and push it to a registry the cluster can pull
#    from (the default NationTech registry is shown here).
cargo build --release -p harmony-nats-callout
podman build -t quay.io/nationtech/harmony-nats-callout:demo \
  -f nats/callout/Dockerfile .
podman push quay.io/nationtech/harmony-nats-callout:demo

# 2. Deploy the central stack.
cargo run -p example-fleet-staging-deploy -- \
  --base-domain customer1.nationtech.io \
  --kube-context customer1-prod \
  --callout-image quay.io/nationtech/harmony-nats-callout:demo \
  --nats-auth-pass "$(openssl rand -hex 16)" \
  --nats-system-pass "$(openssl rand -hex 16)"
```
Expected output ends with a "next steps" panel containing the project
ID, the `harmony-cli` client_id, the NATS WSS URL, and the exact
follow-up commands. Save those — both developers will need them.
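
One way to hand those values to both developers is a small env file they can source. This is only a sketch: the `write_demo_env` helper, the file layout, and the `FLEET_*` variable names are all assumptions made for illustration, and `fleet-staging-deploy` does not produce such a file itself.

```bash
# Hypothetical helper: store the values from the "next steps" panel in an
# env file. The FLEET_* names are illustrative, not a Harmony convention.
write_demo_env() {
  local file="$1" project_id="$2" cli_client_id="$3" nats_url="$4"
  # Refuse values containing a single quote so the quoting below stays valid.
  case "$project_id$cli_client_id$nats_url" in
    *"'"*) echo "value contains a single quote" >&2; return 1 ;;
  esac
  {
    printf "FLEET_PROJECT_ID='%s'\n" "$project_id"
    printf "FLEET_CLI_CLIENT_ID='%s'\n" "$cli_client_id"
    printf "FLEET_NATS_URL='%s'\n" "$nats_url"
  } > "$file"
}

# Usage, with the values copied by hand from the panel:
#   write_demo_env ./fleet-demo.env "<PROJECT_ID>" "<CLI_CLIENT_ID>" "<NATS WSS URL>"
#   . ./fleet-demo.env
```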
## Developer: prove SSO works
```bash
cargo run -p example-fleet-sso-login -- \
  --base-domain customer1.nationtech.io \
  --client-id <CLI_CLIENT_ID printed by staging deploy>
```
Browser opens, developer logs into Zitadel, CLI prints
`Welcome <name> <email>` and persists `~/.local/share/harmony/sso-session.json`.
Two developers each do this once with their own Zitadel accounts.
## Operator (or developer with an admin PAT): onboard a Pi
```bash
# Extract the Zitadel admin PAT once (it's in a K8s secret on the
# staging cluster).
PAT=$(kubectl --context customer1-prod \
  -n zitadel get secret iam-admin-pat \
  -o jsonpath='{.data.pat}' | base64 -d)

# Cross-compile the agent for aarch64 (one-time per agent rev).
cargo build --release --target aarch64-unknown-linux-gnu -p harmony-fleet-agent

# Onboard Pi #1 — sensor on the floor with arch=aarch64, group=group-a.
cargo run -p example-fleet-rpi-setup -- \
  --pi-host 192.168.1.42 \
  --pi-user pi \
  --device-id sensor-floor-01 \
  --labels "group=group-a,arch=aarch64,role=sensor" \
  --bootstrap-token "$PAT" \
  --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
  --zitadel-project-id <PROJECT_ID printed by staging deploy> \
  --nats-url wss://nats.customer1.nationtech.io/ \
  --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent

# Onboard Pi #2 — different group label so we can target by selector.
cargo run -p example-fleet-rpi-setup -- \
  --pi-host 192.168.1.43 \
  --pi-user pi \
  --device-id sensor-shelf-02 \
  --labels "group=group-b,arch=aarch64,role=sensor" \
  --bootstrap-token "$PAT" \
  --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
  --zitadel-project-id <PROJECT_ID> \
  --nats-url wss://nats.customer1.nationtech.io/ \
  --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent
```
Each Pi onboarding does the following on the device:
- Installs podman + systemd-container.
- Creates the `fleet-agent` user (with subuid/subgid for rootless
podman + linger).
- Drops the per-device Zitadel JSON key at
`/etc/fleet-agent/zitadel-key.json` (mode 0640, owner fleet-agent).
- Renders `/etc/fleet-agent/config.toml` with `type = "zitadel-jwt"`
pointing at the keyfile.
- Starts `fleet-agent.service` under systemd.
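
To spot-check the keyfile step after onboarding, the mode and owner listed above can be verified on the Pi. A minimal sketch, assuming GNU `stat` (its `-c` format option; BSD/macOS `stat` uses different flags); `check_keyfile` is a hypothetical helper, not something `fleet-rpi-setup` installs.

```bash
# Hypothetical check: compare a file's octal mode and owner against the
# expected values. Returns 2 if stat fails, 1 on mismatch, 0 on match.
check_keyfile() {
  local path="$1" want_mode="$2" want_owner="$3" got
  got=$(stat -c '%a %U' "$path") || return 2
  if [ "$got" != "$want_mode $want_owner" ]; then
    echo "unexpected mode/owner on $path: $got" >&2
    return 1
  fi
}

# On the Pi, with the path, mode, and owner from the list above:
#   check_keyfile /etc/fleet-agent/zitadel-key.json 640 fleet-agent && echo ok
```
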
The agent connects to NATS over WSS using the JWT-bearer token it
mints from its keyfile. async-nats's auto-reconnect, combined with the
auth callback, re-mints the token on every reconnect attempt, so the
"never lose connectivity" property holds across:
- Token expiry (12h Zitadel default → re-minted ~5 minutes before).
- NATS pod restart (chart upgrade, drain, etc.).
- Pi network blip (DHCP renewal, Wi-Fi roam).
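
The expiry timing is simple arithmetic: with a 12h (43200 s) lifetime and a 5-minute (300 s) margin, a fresh token is due for re-minting after 42900 s. The sketch below only illustrates that calculation; `seconds_until_remint` is a hypothetical helper and says nothing about how the agent itself schedules the refresh.

```bash
# Hypothetical helper: seconds to wait before re-minting a token, given the
# issue time, lifetime, safety margin, and current time (all in seconds).
# Clamps to 0 once the re-mint point has passed.
seconds_until_remint() {
  local issued_at="$1" ttl="$2" margin="$3" now="$4" delay
  delay=$(( $issued_at + $ttl - $margin - $now ))
  if [ "$delay" -lt 0 ]; then
    delay=0
  fi
  echo "$delay"
}
```

For example, `seconds_until_remint 0 43200 300 0` prints `42900`.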
## Verify the fleet from the operator side
```bash
kubectl --context customer1-prod -n fleet-system get device.fleet.nationtech.io
# NAME              LABELS
# sensor-floor-01   arch=aarch64,group=group-a,role=sensor
# sensor-shelf-02   arch=aarch64,group=group-b,role=sensor

kubectl --context customer1-prod -n fleet-system logs deployment/fleet-callout
# ... received auth callout request
# ... Zitadel JWT validated, generating user JWT device_id=sensor-floor-01 role=device
```
## Developer: deploy a container to a labeled subset
```bash
# Apply the customer's backend (single service + sqlite volume + envs)
# to every device with group=group-a.
cargo run -p example_harmony_apply_deployment -- \
  --namespace fleet-demo \
  --name customer-backend \
  --selector group=group-a \
  --image registry.example.com/customer/backend:1.4 \
  --port 8080:8080 \
  --env DATABASE_URL=sqlite:///data/app.db \
  --env LOG_LEVEL=info \
  --volume /var/lib/customer-backend:/data \
  --restart unless-stopped
```
The operator sees one Deployment CR materialized, NATS KV gets a
`desired-state.<device-id>.customer-backend` entry per matched
device, and each Pi's agent reconciles podman to match. The
container's data persists across agent restarts and Pi reboots
because the bind mount survives both.
`kubectl get device` shows the agents heartbeating; their per-deployment
state shows up on `Device.status.aggregate` (Chapter 2 reflect-back
already in place).
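
To predict which KV entries a given selector should produce, the label match can be sketched in plain shell. This assumes a single `key=value` equality selector matched against each device's comma-separated labels; the operator's real selector semantics may be richer. Both function names are hypothetical.

```bash
# Return 0 if the comma-separated label list contains the key=value selector.
# Assumption: equality match on exactly one label.
labels_match() {
  local labels="$1" selector="$2"
  case ",$labels," in
    *",$selector,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read "device-id labels" lines on stdin and print the expected
# desired-state KV key for every device that matches the selector.
desired_state_keys() {
  local selector="$1" deployment="$2" device labels
  while read -r device labels; do
    if labels_match "$labels" "$selector"; then
      echo "desired-state.$device.$deployment"
    fi
  done
}
```

Fed the two devices onboarded above with selector `group=group-a` and deployment `customer-backend`, this prints only `desired-state.sensor-floor-01.customer-backend`.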
### Translating a docker-compose to a Deployment CR
For the call: walk through the customer's compose file once and paste
the equivalent `--env`/`--volume`/`--port` flags. Only bind mounts
translate directly; named volumes need a separate decision per service.
Most compose shapes translate mechanically; `depends_on` / startup
ordering does not (PodmanV0 has no ordering primitive, and designing
one is out of scope for the demo).
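
The manual translation rule can be sketched as two small helpers: one for compose short-syntax volumes, one for environment entries. They are hypothetical aids for the call, not Harmony tooling. As a conservative assumption, the volume helper accepts only absolute host paths and flags everything else (named, relative, or anonymous volumes) for a manual decision.

```bash
# Emit a --volume flag for a compose short-syntax volume (SOURCE:TARGET)
# when SOURCE is an absolute host path, i.e. a bind mount. Anything else
# is reported on stderr for a manual decision.
volume_flag() {
  local spec="$1" src
  case "$spec" in
    *:*) src=${spec%%:*} ;;
    *) echo "no host source, needs a decision: $spec" >&2; return 1 ;;
  esac
  case "$src" in
    /*) echo "--volume $spec" ;;
    *) echo "not an absolute bind mount, needs a decision: $spec" >&2; return 1 ;;
  esac
}

# Emit an --env flag for a KEY=VALUE entry (values are not shell-quoted here).
env_flag() {
  case "$1" in
    *=*) echo "--env $1" ;;
    *) echo "not KEY=VALUE: $1" >&2; return 1 ;;
  esac
}
```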
## Cross-device security model (worth showing)
- Pi A's NATS connection has a user JWT permissioned to
`device-state.sensor-floor-01.>` and `device-commands.sensor-floor-01.>`.
- Pi A *cannot* publish or subscribe to `sensor-shelf-02`'s
subjects, because the auth callout never grants them.
- An admin user (Zitadel role `fleet-admin`) gets `>` on both
publish + subscribe — they observe every device.
- A user with no fleet role is rejected at NATS connect time.
This is the same security model the local `examples/fleet_auth_callout`
suite (3 cargo tests sharing a OnceCell k3d cluster) verifies in CI.
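
The per-device scoping follows from how NATS treats `>`: it matches one or more trailing tokens. The sketch below models only that rule (it does not handle the single-token `*` wildcard) and is not the callout's implementation; both function names are hypothetical.

```bash
# Print the subject patterns a device is permissioned to, per the list above.
device_subjects() {
  echo "device-state.$1.>"
  echo "device-commands.$1.>"
}

# Return 0 if the subject is covered by the pattern. Supports a bare ">",
# a "prefix.>" pattern, or an exact literal; "*" is not modelled.
subject_allowed() {
  local subject="$1" pattern="$2" prefix
  case "$pattern" in
    ">")
      [ -n "$subject" ] ;;
    *".>")
      prefix=${pattern%\>}   # strip ">" but keep the trailing dot
      case "$subject" in
        "$prefix"?*) return 0 ;;
        *) return 1 ;;
      esac ;;
    *)
      [ "$subject" = "$pattern" ] ;;
  esac
}
```

With a made-up `heartbeat` token, `subject_allowed device-state.sensor-shelf-02.heartbeat 'device-state.sensor-floor-01.>'` fails, which is the Pi A versus Pi B isolation described above.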
## What's NOT in the demo
- Compose-to-Deployment auto-translation (low priority — manual
translation during the call works).
- A web UI for `harmony fleet apply` (post-demo).
- Tailscale/Headscale-based SSH backdoor to the Pis (separate daemon,
out of scope).
- Device-join-request + admin-approve flow (would replace
bootstrap-PAT pattern; out of scope).
- OpenBao for non-NATS secrets (env-var-only is fine for demo).
- K8s OIDC integration so kubectl accepts Zitadel JWTs (post-demo).
## Re-run idempotency
Every harness in this runbook is idempotent.
- `fleet-staging-deploy` rides helm-upgrade-by-default, the
ZitadelSetupScore search-then-create loop, and a persisted issuer
NKey in a K8s secret.
- `fleet-rpi-setup` byte-compares the rendered TOML against the
device's existing config and only reapplies on drift; the keyfile
drop + agent restart only happen when something actually changed.
- `harmony-apply-deployment` is a `kube::Api::patch(...)` apply, so
re-running with the same fields is a server-side no-op.
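
The byte-compare pattern behind `fleet-rpi-setup` can be illustrated locally with `cmp`. This is a sketch of the general technique, not the harness's code; `apply_if_drifted` is a hypothetical name, and the real tool works against the device's config rather than two local files.

```bash
# Copy the rendered config over the target only when the bytes differ.
# Prints "changed" or "unchanged" so a caller can decide whether a restart
# is needed.
apply_if_drifted() {
  local rendered="$1" target="$2"
  if [ -f "$target" ] && cmp -s "$rendered" "$target"; then
    echo unchanged
  else
    cp "$rendered" "$target" && echo changed
  fi
}
```

Running it twice with the same rendered file reports `changed` and then `unchanged`, which is the re-run behaviour described above.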