Local fleet rehearsal runbook
End-to-end walkthrough of the IoT fleet platform on your laptop: k3d-hosted control plane (Zitadel + NATS + auth callout) plus two libvirt VMs running the fleet-agent. Mirrors the production topology closely enough that you can watch the auth callout flow, the JetStream KV traffic, and the per-device permission boundary in a real cluster.
This is not the integration-test harness (that runs unattended). It is a step-by-step sequence with inspection points in between. Run each section, look at what happened, then continue.
0. Prerequisites
- Linux host with KVM (the user running the commands must be in the libvirt/kvm group; check with id).
- podman, qemu-system-x86_64 (and qemu-system-aarch64 if you pick --arch aarch64), mdbook (optional), kubectl, nats CLI (optional, for the manual subscribe step). Most other tooling (k3d, ansible venv, cloud images) is auto-provisioned under ~/.local/share/harmony/.
- /etc/hosts: 127.0.0.1 sso.fleet.local so you can hit Zitadel from your browser through the cluster's HTTP_PORT (see examples/fleet_auth_callout/src/lib.rs for the constant).
- Free TCP ports 8080 and 30422 on the host.
Source map for the things you'll inspect:
| Component | File |
|---|---|
| Bring-up flow | examples/fleet_e2e_demo/src/lib.rs |
| Per-device Zitadel + agent install | same, provision_device() |
| NATS Score (auth-callout mode) | fleet/harmony-fleet-deploy/src/nats.rs::FleetNatsScore::callout |
| Shared agent config schema | fleet/harmony-fleet-auth/src/agent_config.rs |
| Auth callout deployment Score | harmony/src/modules/nats_auth_callout/mod.rs |
| Callout decision logic | nats/callout/src/handler.rs::decide |
| Per-device permissions template | nats/callout/src/permissions.rs::device_default |
| Agent NATS auth (JWT-bearer mint) | fleet/harmony-fleet-auth/src/credentials.rs |
| Agent KV publishers + direct pulse | fleet/harmony-fleet-agent/src/fleet_publisher.rs |
| Walking-skeleton tests | examples/fleet_e2e_demo/tests/e2e_walking_skeleton.rs |
The NATS server's helm values are rendered from typed Rust structs
via serde_yaml::to_string (see FleetNatsScore::values_yaml),
not by format!() string interpolation. Same with the agent's
/etc/fleet-agent/config.toml — typed AgentConfig →
toml::to_string → ConfigMap. Per ADR-023 principle 2 the e2e
demo composes the same *Score types the production deploy uses.
1. Provision the VMs
Each VM is one libvirt domain on the default network
(192.168.122.0/24). Run fleet_vm_setup once per VM. Pass
--only-vm so it stops at the cloud-init step (the agent install
happens later from the e2e bring-up — keeps the two phases legible).
# VM 0
cargo run --release -p example-fleet-vm-setup -- \
--arch aarch64 \
--vm-name vm-device-00 \
--only-vm
# VM 1
cargo run --release -p example-fleet-vm-setup -- \
--arch aarch64 \
--vm-name vm-device-01 \
--only-vm
Use --arch x86_64 for native KVM speed; aarch64 runs under
qemu-system-aarch64 TCG emulation on x86_64 hosts and is slower but
matches Pi targets.
Inspect:
virsh list --all
virsh domifaddr vm-device-00
virsh domifaddr vm-device-01
Note the IPs — you'll pass them in step 2. Confirm SSH works:
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
fleet-admin@<vm0-ip> uptime
The keypair lives under ~/.local/share/harmony/fleet/ssh/,
generated on first run.
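If you would rather script the IP lookup than copy it by hand, the sketch below pulls the first IPv4 address out of `virsh domifaddr` output. It assumes the usual four-column table layout (Name, MAC address, Protocol, Address); the function name `first_ipv4` is hypothetical and not part of the repository.

```rust
/// Extract the first IPv4 address from `virsh domifaddr <domain>` output.
/// Assumes the usual table layout: Name, MAC address, Protocol, Address.
fn first_ipv4(domifaddr_output: &str) -> Option<String> {
    domifaddr_output
        .lines()
        .map(|line| line.split_whitespace().collect::<Vec<_>>())
        // Data rows have four columns; the protocol column reads "ipv4".
        // The header row splits into five tokens and the rule into one.
        .find(|cols| cols.len() == 4 && cols[2] == "ipv4")
        // The address column is CIDR-formatted, so drop the "/24" suffix.
        .and_then(|cols| cols[3].split('/').next().map(str::to_owned))
}
```

The returned string is what you would export as FLEET_E2E_VM_0_IP or FLEET_E2E_VM_1_IP in step 2.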
2. Bring up the control-plane stack
This single command does everything: k3d cluster, Zitadel,
ZitadelSetupScore (project + roles + 2 device machine users +
fleet-ops admin), NATS with auth_callout, callout image build &
sideload, callout Deployment, and finally FleetDeviceSetupScore
over SSH for each VM (packages, agent binary, JWT keyfile,
systemd unit).
FLEET_E2E_VM_0_IP=<vm0-ip> FLEET_E2E_VM_1_IP=<vm1-ip> \
cargo run --release -p example-fleet-e2e-demo -- --num-devices 2
The bring-up logs each step as [e2e-demo X/9]. Read along with
examples/fleet_e2e_demo/src/lib.rs::bring_up_full_stack to see
what's happening at each line. The command stops at STACK READY and
waits for Ctrl-C; the process is only a foreground holder, so the
cluster stays up after you interrupt it.
Inspect:
export KUBECONFIG=$(k3d kubeconfig write fleet-auth-callout)
# All workloads up?
kubectl get pods -n fleet-system
kubectl get pods -n zitadel
# Callout config the deployment is using:
kubectl get deployment -n fleet-system fleet-callout \
-o jsonpath='{.spec.template.spec.containers[0].env}' | jq
Open Zitadel in the browser: http://sso.fleet.local:8080/ui/console
(login with root@zitadel.local / the bootstrap password printed
during step [e2e-demo 3/9]). Click into the fleet project →
Users to see the two device-vm-device-0X machine users with
device role grants and the fleet-ops admin.
3. Watch the auth callout in action
The callout is the security boundary: every NATS connect attempt
hits $SYS.REQ.USER.AUTH, the callout validates the Zitadel JWT
in connect_opts.auth_token, applies the decision tree in
nats/callout/src/handler.rs::decide, and signs back a user JWT
with role-scoped permissions.
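To make the shape of that decision concrete, here is a minimal sketch of a role-to-permissions mapping. It is not the code in handler.rs::decide: `Role`, `Decision`, and `decide_sketch` are hypothetical names, JWT validation is reduced to a boolean, and the device subject is illustrative only. The admin mapping to `>` is the one this runbook describes in step 5.

```rust
/// Roles named in this runbook; real claim parsing lives in the callout.
#[derive(Debug, PartialEq)]
enum Role {
    Device,
    FleetAdmin,
}

/// Outcome of a decision: an allow list for the signed user JWT, or a denial.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow(Vec<String>),
    Deny(&'static str),
}

/// Hypothetical stand-in for the decision step. `token_valid` represents the
/// result of Zitadel JWT validation, which is out of scope for this sketch.
fn decide_sketch(token_valid: bool, role: Option<Role>, device_id: &str) -> Decision {
    if !token_valid {
        return Decision::Deny("invalid token");
    }
    match role {
        // Admin may publish and subscribe everywhere: allow [">"].
        Some(Role::FleetAdmin) => Decision::Allow(vec![">".to_string()]),
        // Devices get subjects scoped to their own id (illustrative subject).
        Some(Role::Device) => Decision::Allow(vec![format!("device-state.{device_id}")]),
        // A valid token without a known role is still rejected.
        None => Decision::Deny("no recognized role"),
    }
}
```

The point of the sketch is the ordering: validation comes first, and a token with no recognized role is denied rather than given a default allow list.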
Tail it while the agents reconnect:
kubectl logs -n fleet-system -l app=fleet-callout -f
You'll see one set of lines per (re)connect:
received auth callout request user_nkey=U…
Zitadel JWT validated, generating user JWT device_id=vm-device-00 role=device
sending auth response
The device_id field is the value AFTER device_id_prefix_strip
runs (Zitadel emits client_id=device-vm-device-00; the callout
strips device- so permissions are interpolated against the bare
device id the agent uses for KV keys). See
nats/callout/src/zitadel.rs::extract_device_id for the strip.
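The strip itself can be pictured as a leading-prefix removal. The sketch below is an assumption about the behaviour, not a copy of extract_device_id; in particular, returning the input unchanged when the prefix is absent is a choice made here for illustration, and `strip_device_prefix` is a hypothetical name.

```rust
/// Strip a configured prefix from a Zitadel client_id to get the bare device id.
/// Returns the input unchanged when the prefix is absent (an assumption of
/// this sketch; the real callout may reject such a client_id instead).
fn strip_device_prefix<'a>(client_id: &'a str, prefix: &str) -> &'a str {
    // strip_prefix removes only a leading match, never an interior one.
    client_id.strip_prefix(prefix).unwrap_or(client_id)
}
```

Removing only the leading occurrence matters for ids like these: a global replace of device- on device-vm-device-00 would yield vm-00, which no longer matches the KV keys the agent writes.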
Force a reconnect to make a callout fire on demand:
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
fleet-admin@<vm0-ip> 'sudo systemctl restart fleet-agent'
Watch the callout pod log emit one fresh request/response.
4. Watch the agent
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
fleet-admin@<vm0-ip> 'sudo journalctl -u fleet-agent -f'
What good looks like, in order:
| Log line | Where it comes from |
|---|---|
| minted fresh Zitadel access token audience=… | credentials.rs::zitadel_mint: RFC 7523 JWT-bearer flow, signed with the per-device machine key under /etc/fleet-agent/zitadel-key.json |
| connected successfully server=4222 | NATS accepted the JWT minted by the callout |
| fleet publisher ready | KV buckets opened; device-info write succeeded |
| watching KV keys filter=vm-device-00.> | desired-state subscriber is up |
The absence of Permissions Violation lines is the success signal.
Such a line means the JWT's permissions don't match what the agent
tried to publish (you'd hit them if device_id_prefix_strip were
misconfigured, for example).
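If you capture the journal to a file, the same check can be automated. The sketch below looks for the log lines listed in this section in order and fails on any Permissions Violation line. The marker strings are copied from this runbook; the function name `check_journal` and the case-sensitive substring matching are assumptions.

```rust
/// Markers expected in the agent journal, in order (from this section).
const EXPECTED: [&str; 4] = [
    "minted fresh Zitadel access token",
    "connected successfully",
    "fleet publisher ready",
    "watching KV keys",
];

/// Returns Ok(()) when every marker appears in order and no line reports a
/// permissions violation; otherwise describes the first problem found.
fn check_journal(journal: &str) -> Result<(), String> {
    let mut next = 0; // index of the next marker we are waiting for
    for line in journal.lines() {
        if line.contains("Permissions Violation") {
            return Err(format!("permissions violation: {line}"));
        }
        if next < EXPECTED.len() && line.contains(EXPECTED[next]) {
            next += 1;
        }
    }
    if next == EXPECTED.len() {
        Ok(())
    } else {
        Err(format!("missing log line: {}", EXPECTED[next]))
    }
}
```

Lines that match none of the markers are ignored, so unrelated journal output between the expected entries does not cause a failure.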
5. Observe fleet traffic as admin
The harness mints a fleet-ops admin machine user with the
fleet-admin role; the callout maps that role to
pub/sub allow: [">"]. The integration test
admin_jwt_reads_any_device_subject exercises this — easiest path
to see it live is to run it with output. The test is
#[ignore]d on cargo test so a developer box doesn't burn a
10-minute Zitadel bring-up by accident; --ignored opts in:
FLEET_E2E_VM_0_IP=<vm0-ip> FLEET_E2E_VM_1_IP=<vm1-ip> \
cargo test -p example-fleet-e2e-demo \
--test e2e_walking_skeleton \
admin_jwt_reads_any_device_subject \
-- --test-threads=1 --nocapture --ignored
It subscribes admin to device-state.> (the direct, non-JetStream
fan-out subject on which the agent emits a pulse every 30s; see
fleet_publisher.rs::publish_state_pulse) and asserts that a message
arrives within 30s.
Inspect KV state directly using a bare admin client. The
underlying mechanism is in
examples/fleet_e2e_demo/tests/e2e_walking_skeleton.rs::admin_nats_client:
mint a JWT-bearer token from stack.admin_machine_key, hand it to
async_nats as auth_token. The test
both_devices_heartbeat_within_60s then reads device-info keys
directly:
let js = async_nats::jetstream::new(admin);
let bucket = js.get_key_value(BUCKET_DEVICE_INFO).await?;
let entry = bucket.entry(&device_info_key("vm-device-00")).await?;
Doing the same from a shell would mean port-forwarding NATS and
using the nats CLI with admin creds. However, creds for an
auth-callout server take a JWT-bearer token, which the nats CLI
doesn't speak natively, so running the test is the path of least
friction.
6. Verify cross-device isolation (currently #[ignore])
cross_device_isolation_enforced_in_vm is an empty test marked
#[ignore = "requires E2eHandles::device_machine_key plumbing"]
in e2e_walking_skeleton.rs — the test is a placeholder. The
plumbing it's waiting on is straightforward: the existing
DeviceHandle struct (examples/fleet_e2e_demo/src/lib.rs:106)
exposes device_id + vm_ip + labels but not the per-device
Zitadel machine key the test would need to mint a device-role
JWT and try cross-device subjects. provision_device already
creates the key (line ~324, machine_key_json) — wiring it through
into DeviceHandle.machine_key and implementing the test body
(mint JWT-bearer for vm-device-00, sub to
device-commands.vm-device-01, expect Permissions Violation)
is a single follow-up commit. I haven't touched it because nothing
in this branch's scope required it.
You can verify the boundary manually right now, even without
the test wired up: tail the callout pod, then SSH onto vm-device-00
and run the agent with a tampered config that points it at
vm-device-01's keyfile. The callout will issue a JWT for
vm-device-01 (because the JWT-bearer assertion is signed with
that user's key); the agent on vm-device-00 will then publish on
$KV.device-info.info.vm-device-00, which is NOT in the JWT's
allow list — NATS rejects with Permissions Violation. This is
the same gate the test would automate.
The permissions template is in
nats/callout/src/permissions.rs::device_default — every allowed
subject contains {device_id} and is interpolated per-request, so
device A's JWT physically cannot publish to device B's subjects.
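The isolation argument can be checked on paper with a small model of template interpolation plus NATS subject matching. This is a sketch under stated assumptions: the templates passed in by the caller are illustrative (the real list is in device_default), and `interpolate`, `subject_matches`, and `may_publish` are hypothetical helpers rather than callout or server code. The wildcard rules follow standard NATS semantics: `*` matches exactly one token and `>` matches one or more trailing tokens.

```rust
/// Interpolate `{device_id}` into each template. The templates are supplied
/// by the caller; the real list lives in permissions.rs::device_default.
fn interpolate(templates: &[&str], device_id: &str) -> Vec<String> {
    templates.iter().map(|t| t.replace("{device_id}", device_id)).collect()
}

/// NATS subject matching: `*` matches exactly one token, `>` matches one or
/// more trailing tokens, and any other token must match literally.
fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut sub = subject.split('.');
    loop {
        match (pat.next(), sub.next()) {
            (Some(">"), Some(_)) => return true,
            (Some("*"), Some(_)) => continue,
            (Some(p), Some(s)) if p == s => continue,
            (None, None) => return true,
            // Length mismatch or a literal token that differs.
            _ => return false,
        }
    }
}

/// True when at least one allow-list entry matches the subject.
fn may_publish(allow: &[String], subject: &str) -> bool {
    allow.iter().any(|p| subject_matches(p, subject))
}
```

With an allow list interpolated for vm-device-01, a publish on $KV.device-info.info.vm-device-00 matches no entry, which is the rejection the manual tampering experiment above provokes.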
7. Drive the desired-state loop
(Not yet covered by a walking-skeleton test, but the agent's reconciler is wired and observable.) From an admin client, write a desired state for vm-device-00:
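// pseudocode: see harmony-reconciler-contracts for the exact types
use async_nats::jetstream::kv;

// `jetstream` is a JetStream context built from the admin client.
// Open the desired-state bucket, keeping one revision per key.
let kv = jetstream.create_key_value(kv::Config {
    bucket: BUCKET_DESIRED_STATE.into(),
    history: 1,
    ..Default::default()
}).await?;
// Write the desired state for the hello-web deployment on vm-device-00.
kv.put(
    &desired_state_key("vm-device-00", &dn("hello-web")),
    payload.into(),
).await?;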
What happens, observable from the agent's journal:
- Agent's KV watcher (filter vm-device-00.>) fires.
- Reconciler computes the diff and runs the podman create.
- write_deployment_state(&state) fires:
  - puts state.vm-device-00.hello-web into the device-state KV bucket (operator-side watch picks it up)
  - publishes the same payload on direct subject device-state.vm-device-00 (admin observers see it live)
You can subscribe to the latter with admin and watch reconcile events stream in real time.
8. Teardown
The cluster persists across runs (re-running fleet_e2e_demo
converges drift, doesn't recreate). When you want a clean slate:
k3d cluster delete fleet-auth-callout
virsh destroy vm-device-00; virsh undefine vm-device-00 --remove-all-storage
virsh destroy vm-device-01; virsh undefine vm-device-01 --remove-all-storage
Cached assets (cloud images, k3d binary, ansible venv, SSH key,
fleet secrets) live under ~/.local/share/harmony/ and survive
cluster/VM destruction by design, so the first run after a teardown
reuses them.