Hands-on walkthrough for the 48-hour customer demo:

- Operator: build/push the callout image → fleet-staging-deploy → capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster, cross-compile the agent for aarch64, run fleet-rpi-setup once per device with --bootstrap-token. Each Pi's agent connects to NATS over WSS using the JWT-bearer token minted from its per-device keyfile.
- Deploy a container to a labeled subset via example_harmony_apply_deployment with --env / --volume / --restart flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth callout's logs.

Also captures what's deliberately NOT in the demo (compose auto-translation, UI, Tailscale backdoor, device-join-request flow, OpenBao, K8s OIDC) so the customer call has clean expectation-setting.

The runbook is the closing piece of the 48h-demo work plan, sequenced after the eight feat / refactor commits that built the underlying functionality.
222 lines
8.4 KiB
Markdown
# Fleet Platform Demo Runbook

48-hour-demo edition. Covers the operator side (NationTech) and the
customer-developer side (two devs onboarding two Pis and applying a
container deployment to them). Hands-on, no UI yet.

## Roles

- **NationTech operator** — runs `fleet-staging-deploy` once against the
  customer's OKD cluster.
- **Customer developer** — runs `fleet-sso-login` to prove auth works,
  then runs `fleet-rpi-setup` for each Pi, then applies their workload
  via the existing `harmony-apply-deployment` example.

## Prerequisites

### Cluster (operator-side)

- OKD ≥ 4.10 (HAProxy ingress, edge-TLS).
- Wildcard DNS `*.<base-domain>` pointing at the cluster ingress IP
  (e.g. `*.customer1.nationtech.io`).
- Wildcard cert that the HAProxy router serves for that domain (the
  default OKD pattern).
- `cert-manager`, `cloudnative-pg` operators installed (Zitadel chart
  depends on them via `K8sAnywhereTopology`'s `ensure_ready`).
- Access to a container registry the cluster can pull from. Customer
  may have their own; the default in `fleet-staging-deploy` is
  `quay.io/nationtech/harmony-nats-callout:demo`.

### Driver machine (operator + developers)

- `kubectl` with kubeconfig wired up.
- `cargo` (Rust toolchain).
- `podman` (used to build the agent image / fleet-callout image).
- `ssh` into the Pis from the developers' machines.

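A quick preflight on the driver machine can catch a missing tool before the demo starts. The sketch below is an illustration, not part of the repo: `missing_tools` is a hypothetical helper name, and it only checks that each command is on `PATH`, not that the kubeconfig or the Rust toolchain is actually usable.

```bash
# Hypothetical helper (not part of the repo): print every tool from the
# argument list that is not found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, `missing_tools kubectl cargo podman ssh` prints nothing when all four prerequisites are installed.
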
### Pis

- Pi OS Lite booted, SSH server enabled, developer's SSH pubkey in
  `~/.ssh/authorized_keys`. `fleet-rpi-setup` handles the rest.

## Operator: deploy the staging stack

```bash
# 1. Build the callout image and push it to a registry the cluster can
#    pull from (the default location is shown here).
cargo build --release -p harmony-nats-callout
podman build -t quay.io/nationtech/harmony-nats-callout:demo \
  -f nats/callout/Dockerfile .
podman push quay.io/nationtech/harmony-nats-callout:demo

# 2. Deploy the central stack.
cargo run -p example-fleet-staging-deploy -- \
  --base-domain customer1.nationtech.io \
  --kube-context customer1-prod \
  --callout-image quay.io/nationtech/harmony-nats-callout:demo \
  --nats-auth-pass "$(openssl rand -hex 16)" \
  --nats-system-pass "$(openssl rand -hex 16)"
```

Expected output ends with a "next steps" panel containing the project
ID, the `harmony-cli` client_id, the NATS WSS URL, and the exact
follow-up commands. Save those — both developers will need them.

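The deploy command above generates both NATS passwords inline, so a second run would pass different values. Whether `fleet-staging-deploy` tolerates that is not stated in this runbook; one cautious option is to generate each password once and reuse it. The sketch below assumes a hypothetical helper `load_or_create_secret` and a local file path of your choosing, and it reads `/dev/urandom` rather than calling `openssl rand -hex 16` so that it has no external dependency.

```bash
# Hypothetical helper: print the secret stored at $1, creating it first
# (32 hex characters, file mode 0600) if the file is missing or empty.
load_or_create_secret() {
  local path=$1
  if [ ! -s "$path" ]; then
    ( umask 077
      head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n' > "$path" )
  fi
  cat "$path"
}
```

The flag would then read `--nats-auth-pass "$(load_or_create_secret ./nats-auth.pass)"`, where the file name is only an example.
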
## Developer: prove SSO works

```bash
cargo run -p example-fleet-sso-login -- \
  --base-domain customer1.nationtech.io \
  --client-id <CLI_CLIENT_ID printed by staging deploy>
```

A browser opens, the developer logs in to Zitadel, and the CLI prints
`Welcome <name> <email>` and persists the session to
`~/.local/share/harmony/sso-session.json`.

Each of the two developers does this once with their own Zitadel account.

## Operator (or developer with an admin PAT): onboard a Pi

```bash
# Extract the Zitadel admin PAT once (it's in a K8s secret on the
# staging cluster).
PAT=$(kubectl --context customer1-prod \
  -n zitadel get secret iam-admin-pat \
  -o jsonpath='{.data.pat}' | base64 -d)

# Cross-compile the agent for aarch64 (one-time per agent rev).
cargo build --release --target aarch64-unknown-linux-gnu -p harmony-fleet-agent

# Onboard Pi #1 — sensor on the floor with arch=aarch64, group=group-a.
cargo run -p example-fleet-rpi-setup -- \
  --pi-host 192.168.1.42 \
  --pi-user pi \
  --device-id sensor-floor-01 \
  --labels "group=group-a,arch=aarch64,role=sensor" \
  --bootstrap-token "$PAT" \
  --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
  --zitadel-project-id <PROJECT_ID printed by staging deploy> \
  --nats-url wss://nats.customer1.nationtech.io/ \
  --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent

# Onboard Pi #2 — different group label so we can target by selector.
cargo run -p example-fleet-rpi-setup -- \
  --pi-host 192.168.1.43 \
  --pi-user pi \
  --device-id sensor-shelf-02 \
  --labels "group=group-b,arch=aarch64,role=sensor" \
  --bootstrap-token "$PAT" \
  --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
  --zitadel-project-id <PROJECT_ID> \
  --nats-url wss://nats.customer1.nationtech.io/ \
  --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent
```

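The two onboarding invocations differ only in host, device ID, and labels, so they can be generated from a small table. The following is a sketch under that assumption: `rpi_setup_cmd` and `print_onboarding_plan` are hypothetical helpers that only print the commands for review (they do not run them), and `PROJECT_ID` stands in for the value printed by the staging deploy. All flags are the ones used above.

```bash
# Hypothetical helper: print (not run) the fleet-rpi-setup command for one
# device. Arguments: <pi-host> <device-id> <labels>.
# The token is printed as a literal "$PAT" so the secret stays out of the output.
rpi_setup_cmd() {
  local host=$1 device_id=$2 labels=$3
  printf '%s ' cargo run -p example-fleet-rpi-setup -- \
    --pi-host "$host" --pi-user pi \
    --device-id "$device_id" --labels "$labels" \
    --bootstrap-token '"$PAT"' \
    --zitadel-issuer-url https://zitadel.customer1.nationtech.io \
    --zitadel-project-id "${PROJECT_ID:-<PROJECT_ID>}" \
    --nats-url wss://nats.customer1.nationtech.io/ \
    --agent-binary ./target/aarch64-unknown-linux-gnu/release/fleet-agent
  printf '\n'
}

# Device table: host, device ID, labels (values taken from the commands above).
print_onboarding_plan() {
  local host device_id labels
  while read -r host device_id labels; do
    rpi_setup_cmd "$host" "$device_id" "$labels"
  done <<'EOF'
192.168.1.42 sensor-floor-01 group=group-a,arch=aarch64,role=sensor
192.168.1.43 sensor-shelf-02 group=group-b,arch=aarch64,role=sensor
EOF
}
```

Calling `print_onboarding_plan` emits one command line per Pi, which can be reviewed and then pasted into a shell where `PAT` is set.
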
Each Pi onboarding does the following on the device:

- Installs podman + systemd-container.
- Creates the `fleet-agent` user (with subuid/subgid for rootless
  podman + linger).
- Drops the per-device Zitadel JSON key at
  `/etc/fleet-agent/zitadel-key.json` (mode 0640, owner fleet-agent).
- Renders `/etc/fleet-agent/config.toml` with `type = "zitadel-jwt"`
  pointing at the keyfile.
- Starts `fleet-agent.service` under systemd.

The agent connects to NATS over WSS using the JWT-bearer token it
mints from its keyfile. async-nats's auto-reconnect, together with the
auth callback, re-mints the token on every reconnect attempt, so the
"never lose connectivity" property holds across:

- Token expiry (12h Zitadel default; re-minted ~5 minutes before expiry).
- NATS pod restart (chart upgrade, drain, etc.).
- Pi network blip (DHCP renewal, Wi-Fi roam).

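As a worked example of the timing in the token-expiry bullet: with a 12-hour token and a 5-minute safety margin, the re-mint falls 11 h 55 min after issuance. The helper name `remint_delay` is hypothetical, and how the agent actually schedules the refresh is not specified in this runbook.

```bash
# Hypothetical helper: seconds to wait after issuance before re-minting,
# given the token lifetime and the safety margin (both in seconds).
remint_delay() {
  local lifetime_s=$1 margin_s=$2
  echo $(( lifetime_s - margin_s ))
}

remint_delay $(( 12 * 3600 )) $(( 5 * 60 ))   # prints 42900
```
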
## Verify the fleet from the operator side

```bash
kubectl --context customer1-prod -n fleet-system get device.fleet.nationtech.io
# NAME              LABELS
# sensor-floor-01   arch=aarch64,group=group-a,role=sensor
# sensor-shelf-02   arch=aarch64,group=group-b,role=sensor

kubectl --context customer1-prod -n fleet-system logs deployment/fleet-callout
# ... received auth callout request
# ... Zitadel JWT validated, generating user JWT device_id=sensor-floor-01 role=device
```

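To confirm that both Pis registered without reading the table by eye, the expected device IDs can be compared against what the cluster reports. A minimal sketch, assuming a hypothetical helper `missing_devices` that reads the observed names on stdin:

```bash
# Hypothetical helper: read observed device names on stdin (one per line)
# and print each expected name (given as arguments) that was not observed.
missing_devices() {
  local observed expected
  observed=$(cat)
  for expected in "$@"; do
    printf '%s\n' "$observed" | grep -qxF -- "$expected" || printf '%s\n' "$expected"
  done
}
```

It could be fed from `kubectl`, whose `-o name` output has the form `<resource>/<name>`: `kubectl --context customer1-prod -n fleet-system get device.fleet.nationtech.io -o name | sed 's|.*/||' | missing_devices sensor-floor-01 sensor-shelf-02`. Empty output means both devices are present.
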
## Developer: deploy a container to a labeled subset

```bash
# Apply the customer's backend (single service + sqlite volume + envs)
# to every device with group=group-a.
cargo run -p example_harmony_apply_deployment -- \
  --namespace fleet-demo \
  --name customer-backend \
  --selector group=group-a \
  --image registry.example.com/customer/backend:1.4 \
  --port 8080:8080 \
  --env DATABASE_URL=sqlite:///data/app.db \
  --env LOG_LEVEL=info \
  --volume /var/lib/customer-backend:/data \
  --restart unless-stopped
```

The operator sees one Deployment CR materialized, NATS KV gets a
`desired-state.<device-id>.customer-backend` entry per matched
device, and each Pi's agent reconciles podman to match. The
container's data persists across agent restarts and Pi reboots
because the bind mount survives both.

`kubectl get device` shows the agents heartbeating; their per-deployment
state shows up on `Device.status.aggregate` (Chapter 2 reflect-back
already in place).

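To make the fan-out concrete, the sketch below lists the KV keys that the apply above would be expected to produce for a given selector. It is an illustration only: the real matching happens inside the platform, `desired_state_keys` is a hypothetical name, and only a single `key=value` selector is handled, as in the example command.

```bash
# Hypothetical illustration: read "<device-id> <labels>" lines on stdin and
# print the desired-state key for each device whose comma-separated labels
# contain the selector exactly. Arguments: <deployment-name> <key=value>.
desired_state_keys() {
  local name=$1 selector=$2 device_id labels
  while read -r device_id labels; do
    case ",$labels," in
      *",$selector,"*) printf 'desired-state.%s.%s\n' "$device_id" "$name" ;;
    esac
  done
}
```

Given the two devices onboarded earlier and the selector `group=group-a`, only `sensor-floor-01` matches, so a single key is printed for `customer-backend`.
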
### Translating a docker-compose to a Deployment CR

For the call: walk through the customer's compose file once and paste
the equivalent `--env`/`--volume`/`--port` flags. Bind mounts only;
named volumes need a separate decision per service. Most compose
shapes translate mechanically; `depends_on` / startup ordering does
not (PodmanV0 has no ordering primitive — design out of scope for
the demo).

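The bind-mount rule can be applied mechanically while reading the compose file. The sketch below assumes a hypothetical helper `volume_flag` that takes one compose `volumes:` entry in short syntax (`source:target`). It treats an absolute source path as a bind mount and flags anything else for a manual decision. Relative sources such as `./data` are flagged too, since this runbook does not define how a Pi-side path would be resolved.

```bash
# Hypothetical helper: turn one compose short-syntax volume entry into a
# --volume flag when it is an absolute-path bind mount; otherwise report it
# on stderr and return non-zero so the caller can decide by hand.
volume_flag() {
  local spec=$1
  case "$spec" in
    /*:/*) printf -- '--volume %s\n' "$spec" ;;
    *)     printf 'needs a manual decision: %s\n' "$spec" >&2
           return 1 ;;
  esac
}
```

For instance, `volume_flag /var/lib/customer-backend:/data` yields the flag used in the deploy example, while a named volume such as `appdata:/data` is reported instead of translated.
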
## Cross-device security model (worth showing)

- Pi A's NATS connection has a user JWT permissioned to
  `device-state.sensor-floor-01.>` and `device-commands.sensor-floor-01.>`.
- Pi A *cannot* publish to or subscribe to `sensor-shelf-02`'s
  subjects — the auth callout never grants them.
- An admin user (Zitadel role `fleet-admin`) gets `>` on both
  publish + subscribe — they observe every device.
- A user with no fleet role is rejected at NATS connect time.

This is the same security model the local `examples/fleet_auth_callout`
suite (3 cargo tests sharing a OnceCell k3d cluster) verifies in CI.

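The subject scoping listed above relies on the NATS `>` wildcard, which matches one or more trailing tokens. The sketch below illustrates why Pi A's grant does not cover Pi B's subjects. It is not the auth callout's code: `subject_allowed` is a hypothetical helper, it handles only literal patterns and a trailing `>` (not `*`), and the `heartbeat` and `restart` suffixes are made-up subject names.

```bash
# Hypothetical illustration of NATS-style matching for a trailing ">".
# Arguments: <pattern> <subject>. Returns 0 when the subject is covered.
subject_allowed() {
  local pattern=$1 subject=$2 prefix rest
  case "$pattern" in
    '>')
      [ -n "$subject" ] ;;
    *'.>')
      prefix=${pattern%'>'}            # keeps the trailing dot
      rest=${subject#"$prefix"}        # unchanged if the prefix is absent
      [ "$rest" != "$subject" ] && [ -n "$rest" ] ;;
    *)
      [ "$pattern" = "$subject" ] ;;
  esac
}

subject_allowed 'device-state.sensor-floor-01.>' \
  'device-state.sensor-floor-01.heartbeat' && echo "own subject: allowed"
subject_allowed 'device-state.sensor-floor-01.>' \
  'device-state.sensor-shelf-02.heartbeat' || echo "other device: denied"
```

An admin grant of `>` passes for any non-empty subject, which corresponds to the `fleet-admin` bullet.
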
## What's NOT in the demo

- Compose-to-Deployment auto-translation (low priority — manual
  translation during the call works).
- A web UI for `harmony fleet apply` (post-demo).
- Tailscale/Headscale-based SSH backdoor to the Pis (separate daemon,
  out of scope).
- Device-join-request + admin-approve flow (would replace
  bootstrap-PAT pattern; out of scope).
- OpenBao for non-NATS secrets (env-var-only is fine for demo).
- K8s OIDC integration so kubectl accepts Zitadel JWTs (post-demo).

## Re-run idempotency

Every harness in this runbook is idempotent.

- `fleet-staging-deploy` rides helm-upgrade-by-default, the
  ZitadelSetupScore search-then-create loop, and a persisted issuer
  NKey in a K8s secret.
- `fleet-rpi-setup` byte-compares the rendered TOML against the
  device's existing config and only reapplies on drift; the keyfile
  drop + agent restart only happen when something actually changed.
- `harmony-apply-deployment` is a `kube::Api::patch(...)` apply, so
  re-running with the same fields is a server-side no-op.

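The compare-then-apply pattern described for `fleet-rpi-setup` can be sketched with local files. This is an illustration under assumptions, not the harness's implementation: `apply_if_changed` is a hypothetical helper, and the real tool compares against the config on the device rather than a local path.

```bash
# Hypothetical helper: copy $1 (freshly rendered file) over $2 (file in
# place) only when the two differ byte for byte. Prints what it did.
apply_if_changed() {
  local rendered=$1 target=$2
  if [ -f "$target" ] && cmp -s "$rendered" "$target"; then
    echo unchanged
  else
    cp "$rendered" "$target"
    echo applied
  fi
}
```

A caller would restart the agent only when the helper prints `applied`, which is what keeps a second run free of side effects.
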