harmony/examples/fleet_e2e_demo/RUNBOOK.md
2026-05-20 13:41:40 -04:00

Local fleet rehearsal runbook

End-to-end walkthrough of the IoT fleet platform on your laptop: k3d-hosted control plane (Zitadel + NATS + auth callout) plus two libvirt VMs running the fleet-agent. Mirrors the production topology closely enough that you can watch the auth callout flow, the JetStream KV traffic, and the per-device permission boundary in a real cluster.

This is not the integration-test harness (that runs unattended). It is a step-by-step sequence with inspection points in between. Run each section, look at what happened, then continue.

0. Prerequisites

  • Linux host with KVM (the user running the commands must be in the libvirt / kvm group; check with id).
  • podman, qemu-system-x86_64 (and qemu-system-aarch64 if you pick --arch aarch64), mdbook (optional), kubectl, nats CLI (optional, for the manual subscribe step). Most other tooling (k3d, ansible venv, cloud images) is auto-provisioned under ~/.local/share/harmony/.
  • /etc/hosts: 127.0.0.1 sso.fleet.local so you can hit Zitadel from your browser through the cluster's HTTP_PORT (see examples/fleet_auth_callout/src/lib.rs for the constant).
  • Free TCP ports 8080 and 30422 on the host.

Source map for the things you'll inspect:

| Component | File |
| --- | --- |
| Bring-up flow | examples/fleet_e2e_demo/src/lib.rs |
| Per-device Zitadel + agent install | same, provision_device() |
| NATS Score (auth-callout mode) | fleet/harmony-fleet-deploy/src/nats.rs::FleetNatsScore::callout |
| Shared agent config schema | fleet/harmony-fleet-auth/src/agent_config.rs |
| Auth callout deployment Score | harmony/src/modules/nats_auth_callout/mod.rs |
| Callout decision logic | nats/callout/src/handler.rs::decide |
| Per-device permissions template | nats/callout/src/permissions.rs::device_default |
| Agent NATS auth (JWT-bearer mint) | fleet/harmony-fleet-auth/src/credentials.rs |
| Agent KV publishers + direct pulse | fleet/harmony-fleet-agent/src/fleet_publisher.rs |
| Walking-skeleton tests | examples/fleet_e2e_demo/tests/e2e_walking_skeleton.rs |

The NATS server's helm values are rendered from typed Rust structs via serde_yaml::to_string (see FleetNatsScore::values_yaml), not by format!() string interpolation. The same goes for the agent's /etc/fleet-agent/config.toml: typed AgentConfig → toml::to_string → ConfigMap. Per ADR-023 principle 2, the e2e demo composes the same *Score types the production deploy uses.

1. Provision the VMs

Each VM is one libvirt domain on the default network (192.168.122.0/24). Run fleet_vm_setup once per VM. Pass --only-vm so it stops at the cloud-init step (the agent install happens later from the e2e bring-up — keeps the two phases legible).

# VM 0
cargo run --release -p example-fleet-vm-setup -- \
  --arch aarch64 \
  --vm-name vm-device-00 \
  --only-vm

# VM 1
cargo run --release -p example-fleet-vm-setup -- \
  --arch aarch64 \
  --vm-name vm-device-01 \
  --only-vm

Use --arch x86_64 for native KVM speed; aarch64 runs under qemu-system-aarch64 TCG emulation on x86_64 hosts and is slower but matches Pi targets.

Inspect:

virsh list --all
virsh domifaddr vm-device-00
virsh domifaddr vm-device-01

Note the IPs — you'll pass them in step 2. Confirm SSH works:

ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
    fleet-admin@<vm0-ip> uptime

The keypair lives under ~/.local/share/harmony/fleet/ssh/, generated on first run.

2. Bring up the control-plane stack

This single command does everything: k3d cluster, Zitadel, ZitadelSetupScore (project + roles + 2 device machine users + fleet-ops admin), NATS with auth_callout, callout image build & sideload, callout Deployment, and finally FleetDeviceSetupScore over SSH for each VM (packages, agent binary, JWT keyfile, systemd unit).

FLEET_E2E_VM_0_IP=<vm0-ip> FLEET_E2E_VM_1_IP=<vm1-ip> \
  cargo run --release -p example-fleet-e2e-demo -- --num-devices 2

The bring-up logs each step as [e2e-demo X/9]. Read along with examples/fleet_e2e_demo/src/lib.rs::bring_up_full_stack to see what's happening at each line. The command stops at STACK READY and waits for Ctrl-C. The cluster stays up after Ctrl-C; the process is just the foreground holder.

Inspect:

export KUBECONFIG=$(k3d kubeconfig write fleet-auth-callout)

# All workloads up?
kubectl get pods -n fleet-system
kubectl get pods -n zitadel

# Callout config the deployment is using:
kubectl get deployment -n fleet-system fleet-callout \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | jq

Open Zitadel in the browser: http://sso.fleet.local:8080/ui/console (login with root@zitadel.local / the bootstrap password printed during step [e2e-demo 3/9]). Click into the fleet project → Users to see the two device-vm-device-0X machine users with device role grants and the fleet-ops admin.

3. Watch the auth callout in action

The callout is the security boundary: every NATS connect attempt hits $SYS.REQ.USER.AUTH, the callout validates the Zitadel JWT in connect_opts.auth_token, applies the decision tree in nats/callout/src/handler.rs::decide, and signs back a user JWT with role-scoped permissions.

Tail it while the agents reconnect:

kubectl logs -n fleet-system -l app=fleet-callout -f

You'll see one set of lines per (re)connect:

received auth callout request user_nkey=U…
Zitadel JWT validated, generating user JWT device_id=vm-device-00 role=device
sending auth response

The device_id field is the value AFTER device_id_prefix_strip runs (Zitadel emits client_id=device-vm-device-00; the callout strips device- so permissions are interpolated against the bare device id the agent uses for KV keys). See nats/callout/src/zitadel.rs::extract_device_id for the strip.
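The strip step can be pictured with a minimal sketch. The helper name strip_device_prefix and its fallback behavior are assumptions made for illustration; the real logic lives in extract_device_id and may differ.

```rust
/// Hypothetical helper (not the callout's actual code): derive the bare
/// device id from a Zitadel client_id by removing a configured prefix.
fn strip_device_prefix<'a>(client_id: &'a str, prefix: &str) -> &'a str {
    // Only a leading occurrence is removed, so the "device-" inside
    // "vm-device-00" is left alone.
    // Assumption: when the prefix is absent, the client_id is returned unchanged.
    client_id.strip_prefix(prefix).unwrap_or(client_id)
}
```

With the prefix set to device-, the client_id device-vm-device-00 yields vm-device-00, which is the id the agent uses in its KV keys. A plain string replace would be wrong here, because it would also remove the second device- inside the id.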

Force a reconnect to make a callout fire on demand:

ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
    fleet-admin@<vm0-ip> 'sudo systemctl restart fleet-agent'

Watch the callout pod log emit one fresh request/response.

4. Watch the agent

ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
    fleet-admin@<vm0-ip> 'sudo journalctl -u fleet-agent -f'

What good looks like, in order:

| Log line | Where it comes from |
| --- | --- |
| minted fresh Zitadel access token audience=… | credentials.rs::zitadel_mint: RFC 7523 JWT-bearer flow, signed with the per-device machine key under /etc/fleet-agent/zitadel-key.json |
| connected successfully server=4222 | NATS accepted the JWT minted by the callout |
| fleet publisher ready | KV buckets opened; device-info write succeeded |
| watching KV keys filter=vm-device-00.> | desired-state subscriber is up |

The absence of Permissions Violation lines is the success signal. Such a line means the JWT's permissions don't match what the agent tried to publish (you'd hit one if device_id_prefix_strip were misconfigured, for example).

5. Observe fleet traffic as admin

The harness mints a fleet-ops admin machine user with the fleet-admin role; the callout maps that role to pub/sub allow: [">"]. The integration test admin_jwt_reads_any_device_subject exercises this — easiest path to see it live is to run it with output. The test is #[ignore]d on cargo test so a developer box doesn't burn a 10-minute Zitadel bring-up by accident; --ignored opts in:

FLEET_E2E_VM_0_IP=<vm0-ip> FLEET_E2E_VM_1_IP=<vm1-ip> \
  cargo test -p example-fleet-e2e-demo \
    --test e2e_walking_skeleton \
    admin_jwt_reads_any_device_subject \
    -- --test-threads=1 --nocapture --ignored

It subscribes admin to device-state.> (the direct, non-JetStream fan-out subject the agent emits a pulse on every 30s — see fleet_publisher.rs::publish_state_pulse) and asserts a message arrives within 30s.
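To see why a single subscription on device-state.> receives pulses from every device, here is a simplified model of NATS subject wildcards (* matches exactly one token, > matches one or more trailing tokens). It is a sketch for reasoning about the test, not the server's implementation, and subject_matches is a name invented here.

```rust
/// Simplified model of NATS subject matching, for illustration only.
/// `*` matches exactly one token; `>` matches one or more trailing tokens.
fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut sub = subject.split('.');
    loop {
        match (pat.next(), sub.next()) {
            // `>` needs at least one remaining subject token and must be
            // the final token of the pattern.
            (Some(">"), Some(_)) => return pat.next().is_none(),
            // `*` consumes exactly one subject token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(p), Some(s)) if p == s => continue,
            // Both sides exhausted at the same time: full match.
            (None, None) => return true,
            // Length mismatch or differing literal token.
            _ => return false,
        }
    }
}
```

Under this model, device-state.> matches device-state.vm-device-00 and device-state.vm-device-01, but not the bare subject device-state, since > requires at least one further token.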

Inspect KV state directly using a bare admin client. The underlying mechanism is in examples/fleet_e2e_demo/tests/e2e_walking_skeleton.rs::admin_nats_client: mint a JWT-bearer token from stack.admin_machine_key, hand it to async_nats as auth_token. The test both_devices_heartbeat_within_60s then reads device-info keys directly:

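// `admin` is the client returned by admin_nats_client (JWT-bearer token passed as auth_token).
let js = async_nats::jetstream::new(admin);
let bucket = js.get_key_value(BUCKET_DEVICE_INFO).await?;
// Look up the device-info entry for a single device by its key.
let entry = bucket.entry(&device_info_key("vm-device-00")).await?;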

To do it from a shell, port-forward NATS and use the nats CLI with admin creds — but creds for an auth-callout server take a JWT-bearer token, which the nats CLI doesn't speak natively; running the test is the path of least friction.

6. Verify cross-device isolation (currently #[ignore])

cross_device_isolation_enforced_in_vm is an empty placeholder test in e2e_walking_skeleton.rs, marked #[ignore = "requires E2eHandles::device_machine_key plumbing"]. The plumbing it is waiting on is straightforward. The existing DeviceHandle struct (examples/fleet_e2e_demo/src/lib.rs:106) exposes device_id, vm_ip, and labels, but not the per-device Zitadel machine key the test needs to mint a device-role JWT and try cross-device subjects. provision_device already creates the key (line ~324, machine_key_json), so wiring it through as DeviceHandle.machine_key and implementing the test body (mint a JWT-bearer token for vm-device-00, subscribe to device-commands.vm-device-01, expect Permissions Violation) is a single follow-up commit. It was left untouched because nothing in this branch's scope required it.

You can verify the boundary manually right now, even without the test wired up: tail the callout pod, then SSH onto vm-device-00 and run the agent with a tampered config that points it at vm-device-01's keyfile. The callout will issue a JWT for vm-device-01 (because the JWT-bearer assertion is signed with that user's key); the agent on vm-device-00 will then publish on $KV.device-info.info.vm-device-00, which is NOT in the JWT's allow list — NATS rejects with Permissions Violation. This is the same gate the test would automate.

The permissions template is in nats/callout/src/permissions.rs::device_default — every allowed subject contains {device_id} and is interpolated per-request, so device A's JWT physically cannot publish to device B's subjects.
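The per-request interpolation can be sketched as follows. The template list, the publish_allow function, and the empty result for unknown roles are all assumptions for illustration; only the role names, the {device_id} placeholder, and the example subjects come from this runbook. The real template is in device_default and may contain different subjects.

```rust
/// Hypothetical subject templates; the real list lives in
/// nats/callout/src/permissions.rs::device_default and may differ.
const DEVICE_PUBLISH_TEMPLATES: &[&str] = &[
    "$KV.device-info.info.{device_id}",
    "device-state.{device_id}",
];

/// Illustrative role-to-permissions mapping (not the callout's actual code).
fn publish_allow(role: &str, device_id: &str) -> Vec<String> {
    match role {
        // The admin role is mapped to the full wildcard.
        "fleet-admin" => vec![">".to_string()],
        // Device permissions are rendered per request from the templates.
        "device" => DEVICE_PUBLISH_TEMPLATES
            .iter()
            .map(|t| t.replace("{device_id}", device_id))
            .collect(),
        // Assumption: unknown roles get no publish permissions at all.
        _ => Vec::new(),
    }
}
```

Rendering for vm-device-01 produces subjects that all end in vm-device-01, so a publish on $KV.device-info.info.vm-device-00 under that JWT has no matching allow entry. That is the rejection the manual check in this section provokes.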

7. Drive the desired-state loop

(Not yet covered by a walking-skeleton test, but the agent's reconciler is wired and observable.) From an admin client, write a desired state for vm-device-00:

// pseudocode — see harmony-reconciler-contracts for the exact types
let kv = jetstream.create_key_value(kv::Config {
    bucket: BUCKET_DESIRED_STATE.into(),
    history: 1,
    ..Default::default()
}).await?;
kv.put(
    &desired_state_key("vm-device-00", &dn("hello-web")),
    payload.into(),
).await?;

What happens, observable from the agent's journal:

  1. Agent's KV watcher (filter vm-device-00.>) fires.
  2. Reconciler computes the diff and runs the podman create.
  3. write_deployment_state(&state) fires:
    • puts state.vm-device-00.hello-web into the device-state KV bucket (operator-side watch picks it up)
    • publishes the same payload on direct subject device-state.vm-device-00 (admin observers see it live)

You can subscribe to the latter with admin and watch reconcile events stream in real time.
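The names involved in this loop follow a small naming scheme, sketched below. The function names are hypothetical; the formats mirror the filter, KV key, and direct subject quoted in the steps above, and the real helpers in harmony-reconciler-contracts may be shaped differently.

```rust
/// Watch filter the agent uses on the desired-state bucket (per the journal line).
/// Hypothetical helper name.
fn desired_state_filter(device_id: &str) -> String {
    format!("{device_id}.>")
}

/// Key written into the device-state KV bucket after a reconcile.
/// Hypothetical helper name.
fn deployment_state_key(device_id: &str, deployment: &str) -> String {
    format!("state.{device_id}.{deployment}")
}

/// Direct (non-JetStream) subject that carries the same payload.
/// Hypothetical helper name.
fn state_pulse_subject(device_id: &str) -> String {
    format!("device-state.{device_id}")
}
```

For vm-device-00 and hello-web these give vm-device-00.>, state.vm-device-00.hello-web, and device-state.vm-device-00. An admin subscription on device-state.> therefore sees reconcile events from every device.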

8. Teardown

The cluster persists across runs (re-running fleet_e2e_demo converges drift, doesn't recreate). When you want a clean slate:

k3d cluster delete fleet-auth-callout

virsh destroy vm-device-00; virsh undefine vm-device-00 --remove-all-storage
virsh destroy vm-device-01; virsh undefine vm-device-01 --remove-all-storage

Cached assets (cloud images, k3d binary, ansible venv, SSH key, fleet secrets) live under ~/.local/share/harmony/ and survive cluster/VM destruction by design, so the first run after a teardown reuses them.