# Local fleet rehearsal runbook

End-to-end walkthrough of the IoT fleet platform on your laptop:
k3d-hosted control plane (Zitadel + NATS + auth callout) plus two
libvirt VMs running the fleet-agent. Mirrors the production topology
closely enough that you can watch the auth callout flow, the
JetStream KV traffic, and the per-device permission boundary in a
real cluster.

This is not the integration-test harness (that runs unattended). It
is a step-by-step sequence with inspection points in between. Run
each section, look at what happened, then continue.

## 0. Prerequisites

- Linux host with KVM. The user running the commands must be in the
  `libvirt` / `kvm` group; check with `id`.
- `podman`, `qemu-system-x86_64` (and `qemu-system-aarch64` if you
  pick `--arch aarch64`), `mdbook` (optional), `kubectl`, `nats` CLI
  (optional, for the manual subscribe step). Most other tooling
  (k3d, ansible venv, cloud images) is auto-provisioned under
  `~/.local/share/harmony/`.
- `/etc/hosts`: `127.0.0.1 sso.fleet.local` so you can hit Zitadel
  from your browser through the cluster's HTTP_PORT (see
  `examples/fleet_auth_callout/src/lib.rs` for the constant).
- Free TCP ports `8080` and `30422` on the host.

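The `/etc/hosts` entry and the free-port requirement are easy to get wrong silently, so a small pre-flight check can help. The sketch below is not part of the repository; the helper names `hosts_maps` and `port_is_free` are made up for illustration, and the check only looks at the loopback address.

```rust
use std::net::TcpListener;

/// Returns true if `hosts` (the text of an /etc/hosts file) maps `name`
/// to `ip`. Comments (`# ...`) and blank lines are ignored.
/// Hypothetical helper, not taken from the fleet codebase.
fn hosts_maps(hosts: &str, ip: &str, name: &str) -> bool {
    hosts.lines().any(|line| {
        // Drop any trailing comment, then split into whitespace-separated fields.
        let line = line.split('#').next().unwrap_or("");
        let mut fields = line.split_whitespace();
        // The first field is the address; the rest are hostnames and aliases.
        fields.next() == Some(ip) && fields.any(|h| h == name)
    })
}

/// Returns true if a listener can currently be bound on 127.0.0.1:`port`.
/// Hypothetical helper; a successful bind is released again immediately.
fn port_is_free(port: u16) -> bool {
    TcpListener::bind(("127.0.0.1", port)).is_ok()
}
```

A caller would read `/etc/hosts` with `std::fs::read_to_string`, pass the text to `hosts_maps(&text, "127.0.0.1", "sso.fleet.local")`, and call `port_is_free(8080)` and `port_is_free(30422)` before starting the bring-up.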
Source map for the things you'll inspect:

| Component | File |
| --- | --- |
| Bring-up flow | `examples/fleet_e2e_demo/src/lib.rs` |
| Per-device Zitadel + agent install | same, `provision_device()` |
| NATS Score (auth-callout mode) | `fleet/harmony-fleet-deploy/src/nats.rs::FleetNatsScore::callout` |
| Shared agent config schema | `fleet/harmony-fleet-auth/src/agent_config.rs` |
| Auth callout deployment Score | `harmony/src/modules/nats_auth_callout/mod.rs` |
| Callout decision logic | `nats/callout/src/handler.rs::decide` |
| Per-device permissions template | `nats/callout/src/permissions.rs::device_default` |
| Agent NATS auth (JWT-bearer mint) | `fleet/harmony-fleet-auth/src/credentials.rs` |
| Agent KV publishers + direct pulse | `fleet/harmony-fleet-agent/src/fleet_publisher.rs` |
| Walking-skeleton tests | `examples/fleet_e2e_demo/tests/e2e_walking_skeleton.rs` |

The NATS server's helm values are rendered from typed Rust structs
via `serde_yaml::to_string` (see `FleetNatsScore::values_yaml`),
not by `format!()` string interpolation. Same with the agent's
`/etc/fleet-agent/config.toml`: typed `AgentConfig` →
`toml::to_string` → ConfigMap. Per ADR-023 principle 2, the e2e
demo composes the same `*Score` types the production deploy uses.

## 1. Provision the VMs

Each VM is one libvirt domain on the default network
(`192.168.122.0/24`). Run `fleet_vm_setup` once per VM. Pass
`--only-vm` so it stops at the cloud-init step (the agent install
happens later from the e2e bring-up, which keeps the two phases legible).

```bash
# VM 0
cargo run --release -p example-fleet-vm-setup -- \
  --arch aarch64 \
  --vm-name vm-device-00 \
  --only-vm

# VM 1
cargo run --release -p example-fleet-vm-setup -- \
  --arch aarch64 \
  --vm-name vm-device-01 \
  --only-vm
```

Use `--arch x86_64` for native KVM speed; `aarch64` runs under
qemu-system-aarch64 TCG emulation on x86_64 hosts and is slower but
matches Pi targets.

**Inspect:**

```bash
virsh list --all
virsh domifaddr vm-device-00
virsh domifaddr vm-device-01
```

Note the IPs; you'll pass them in step 2. Confirm SSH works:

```bash
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
  fleet-admin@<vm0-ip> uptime
```

The keypair lives under `~/.local/share/harmony/fleet/ssh/` and is
generated on first run.

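If you script this step rather than reading the IPs off the terminal, the address has to be pulled out of the `virsh domifaddr` table. A minimal sketch follows, assuming the usual four-column output (`Name`, `MAC address`, `Protocol`, `Address`) with the address in CIDR form; `first_ipv4` is a hypothetical helper, not something the repository provides.

```rust
use std::net::Ipv4Addr;

/// Extracts the first IPv4 address from `virsh domifaddr <domain>` output.
/// Assumes data rows of the form: <iface> <mac> ipv4 <addr>/<prefix>.
fn first_ipv4(domifaddr_output: &str) -> Option<Ipv4Addr> {
    domifaddr_output.lines().find_map(|line| {
        let fields: Vec<&str> = line.split_whitespace().collect();
        match fields.as_slice() {
            // Header and separator rows never have exactly these four fields.
            [_, _, "ipv4", addr] => addr.split('/').next()?.parse::<Ipv4Addr>().ok(),
            _ => None,
        }
    })
}
```

The function returns `None` while the guest has not yet obtained a DHCP lease, which is a reasonable signal to wait and retry.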
## 2. Bring up the control-plane stack

This single command does everything: k3d cluster, Zitadel,
ZitadelSetupScore (project + roles + 2 device machine users +
`fleet-ops` admin), NATS with `auth_callout`, callout image build &
sideload, callout Deployment, and finally `FleetDeviceSetupScore`
over SSH for each VM (packages, agent binary, JWT keyfile,
systemd unit).

```bash
FLEET_E2E_VM_0_IP=<vm0-ip> FLEET_E2E_VM_1_IP=<vm1-ip> \
  cargo run --release -p example-fleet-e2e-demo -- --num-devices 2
```

The bring-up logs each step as `[e2e-demo X/9]`. Read along with
`examples/fleet_e2e_demo/src/lib.rs::bring_up_full_stack` to see
what's happening at each line. The command stops at `STACK READY`
and waits for Ctrl-C (the cluster stays up after Ctrl-C; the process
is just the foreground holder).

**Inspect:**

```bash
export KUBECONFIG=$(k3d kubeconfig write fleet-auth-callout)

# All workloads up?
kubectl get pods -n fleet-system
kubectl get pods -n zitadel

# Callout config the deployment is using:
kubectl get deployment -n fleet-system fleet-callout \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
```

Open Zitadel in the browser: <http://sso.fleet.local:8080/ui/console>
(log in with `root@zitadel.local` / the bootstrap password printed
during step `[e2e-demo 3/9]`). Click into the `fleet` project →
`Users` to see the two `device-vm-device-0X` machine users with
`device` role grants and the `fleet-ops` admin.

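The bring-up command takes one `FLEET_E2E_VM_<n>_IP` variable per device, so a missing or mistyped variable is a likely failure mode. The sketch below shows one way to validate them up front. It is an assumption about a sensible pre-flight check, not a copy of how `fleet_e2e_demo` reads its environment, and `vm_ips` is a hypothetical name.

```rust
use std::net::Ipv4Addr;

/// Collects FLEET_E2E_VM_<i>_IP for i in 0..num_devices via `lookup`.
/// Pass `|k| std::env::var(k).ok()` to read the real process environment.
/// Hypothetical helper; the demo's own parsing may differ.
fn vm_ips(
    num_devices: usize,
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<Vec<Ipv4Addr>, String> {
    (0..num_devices)
        .map(|i| {
            let key = format!("FLEET_E2E_VM_{i}_IP");
            let raw = lookup(&key).ok_or_else(|| format!("{key} is not set"))?;
            // Reject placeholders such as "<vm0-ip>" early, with a clear message.
            raw.trim()
                .parse::<Ipv4Addr>()
                .map_err(|e| format!("{key}={raw:?} is not an IPv4 address: {e}"))
        })
        .collect()
}
```

Taking the lookup as a closure keeps the function testable without mutating the process environment.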
## 3. Watch the auth callout in action

The callout is the security boundary: every NATS connect attempt
hits `$SYS.REQ.USER.AUTH`, the callout validates the Zitadel JWT
in `connect_opts.auth_token`, applies the decision tree in
`nats/callout/src/handler.rs::decide`, and signs back a user JWT
with role-scoped permissions.

Tail it while the agents reconnect:

```bash
kubectl logs -n fleet-system -l app=fleet-callout -f
```

You'll see one set of lines per (re)connect:

```
received auth callout request user_nkey=U…
Zitadel JWT validated, generating user JWT device_id=vm-device-00 role=device
sending auth response
```

The `device_id` field is the value AFTER `device_id_prefix_strip`
runs (Zitadel emits `client_id=device-vm-device-00`; the callout
strips `device-` so permissions are interpolated against the bare
device id the agent uses for KV keys). See
`nats/callout/src/zitadel.rs::extract_device_id` for the strip.

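The prefix strip described above can be sketched in a few lines. This is an illustration only: the real `extract_device_id` may handle a missing prefix differently, and `strip_device_prefix` is a hypothetical stand-in. Note that only the leading `device-` is removed, even though `vm-device-00` itself contains the same substring.

```rust
/// Strips a leading `prefix` from a Zitadel `client_id` to get the bare
/// device id. Returns None if the prefix is absent or nothing remains.
/// Hypothetical sketch; the callout's actual behaviour may differ.
fn strip_device_prefix<'a>(client_id: &'a str, prefix: &str) -> Option<&'a str> {
    client_id
        .strip_prefix(prefix)
        .filter(|device_id| !device_id.is_empty())
}
```

Using `str::strip_prefix` rather than `str::replace` matters here: `replace` would also rewrite the `device-` inside `vm-device-00` and yield `vm-00`.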
**Force a reconnect to make a callout fire on demand:**

```bash
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
  fleet-admin@<vm0-ip> 'sudo systemctl restart fleet-agent'
```

Watch the callout pod log emit one fresh request/response.

## 4. Watch the agent

```bash
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
  fleet-admin@<vm0-ip> 'sudo journalctl -u fleet-agent -f'
```

What good looks like, in order:

| Log line | Where it comes from |
| --- | --- |
| `minted fresh Zitadel access token audience=…` | `credentials.rs::zitadel_mint`: RFC 7523 JWT-bearer flow, signed with the per-device machine key under `/etc/fleet-agent/zitadel-key.json` |
| `connected successfully server=4222` | NATS accepted the JWT minted by the callout |
| `fleet publisher ready` | KV buckets opened; `device-info` write succeeded |
| `watching KV keys filter=vm-device-00.>` | desired-state subscriber is up |

The absence of `Permissions Violation` lines is the success signal.
Those lines mean the JWT's permissions don't match what the agent
tried to publish (you'd hit them if `device_id_prefix_strip` were
misconfigured, for example).

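The first log line in the table comes from an RFC 7523 JWT-bearer exchange: the agent posts a signed assertion to the token endpoint as a form-encoded body with a fixed `grant_type`. The sketch below builds only that body, to show the wire shape. The function names are hypothetical, the assertion is assumed to be signed already, and the scope value Zitadel expects is deployment-specific and not shown in this runbook.

```rust
/// Encodes `s` for an application/x-www-form-urlencoded body:
/// alphanumerics and `-._*` pass through, space becomes '+',
/// every other byte becomes %XX.
fn form_encode(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'*' => {
                out.push(b as char)
            }
            b' ' => out.push('+'),
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Builds the token request body for the RFC 7523 JWT-bearer grant.
/// `assertion` is the already-signed JWT; `scope` is caller-supplied.
/// Hypothetical helper, not the code in `credentials.rs`.
fn jwt_bearer_body(assertion: &str, scope: &str) -> String {
    format!(
        "grant_type={}&assertion={}&scope={}",
        form_encode("urn:ietf:params:oauth:grant-type:jwt-bearer"),
        form_encode(assertion),
        form_encode(scope),
    )
}
```

The access token in the response is what the agent then hands to NATS as `auth_token`, which is the value the callout validates in step 3.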
## 5. Observe fleet traffic as admin

The harness mints a `fleet-ops` admin machine user with the
`fleet-admin` role; the callout maps that role to
`pub/sub allow: [">"]`. The integration test
`admin_jwt_reads_any_device_subject` exercises this, and the easiest
way to see it live is to run it with output. The test is
`#[ignore]`d on `cargo test` so a developer box doesn't burn a
10-minute Zitadel bring-up by accident; `--ignored` opts in:

```bash
FLEET_E2E_VM_0_IP=<vm0-ip> FLEET_E2E_VM_1_IP=<vm1-ip> \
  cargo test -p example-fleet-e2e-demo \
    --test e2e_walking_skeleton \
    admin_jwt_reads_any_device_subject \
    -- --test-threads=1 --nocapture --ignored
```

It subscribes admin to `device-state.>` (the direct, non-JetStream
fan-out subject the agent emits a pulse on every 30s; see
`fleet_publisher.rs::publish_state_pulse`) and asserts a message
arrives within 30s.

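Why `[">"]` covers every device subject, and why `device-state.>` catches each device's pulse, follows from NATS subject wildcards: subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens. The sketch below re-implements that matching rule for illustration. It is not the server's code, and `subject_matches` is a hypothetical name.

```rust
/// Returns true if `subject` matches the NATS-style `pattern`.
/// `*` matches exactly one token; `>` matches one or more trailing
/// tokens and is only valid as the last token of the pattern.
fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut sub = subject.split('.');
    loop {
        match (pat.next(), sub.next()) {
            // `>` needs at least one remaining subject token and must be last.
            (Some(">"), Some(_)) => return pat.next().is_none(),
            (Some("*"), Some(_)) => continue,
            (Some(p), Some(s)) if p == s => continue,
            // Both exhausted at the same time: every token matched.
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

With this rule, the admin's `>` matches `device-state.vm-device-00`, while a device-scoped entry such as `device-commands.vm-device-00` matches only that exact subject.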
**Inspect KV state directly** using a bare admin client. The
underlying mechanism is in
`examples/fleet_e2e_demo/tests/e2e_walking_skeleton.rs::admin_nats_client`:
mint a JWT-bearer token from `stack.admin_machine_key`, then hand it
to `async_nats` as `auth_token`. The test
`both_devices_heartbeat_within_60s` then reads `device-info` keys
directly:

```rust
let js = async_nats::jetstream::new(admin);
let bucket = js.get_key_value(BUCKET_DEVICE_INFO).await?;
let entry = bucket.entry(&device_info_key("vm-device-00")).await?;
```

Doing this from a shell would mean port-forwarding NATS and using
the `nats` CLI with admin creds. However, creds for an auth-callout
server take a JWT-bearer token, which the `nats` CLI doesn't speak
natively, so running the test is the path of least friction.

## 6. Verify cross-device isolation (currently `#[ignore]`)

`cross_device_isolation_enforced_in_vm` is an empty test marked
`#[ignore = "requires E2eHandles::device_machine_key plumbing"]`
in `e2e_walking_skeleton.rs`; the test is a placeholder. The
plumbing it's waiting on is straightforward: the existing
`DeviceHandle` struct (`examples/fleet_e2e_demo/src/lib.rs:106`)
exposes `device_id` + `vm_ip` + `labels` but not the per-device
Zitadel machine key the test would need to mint a `device`-role
JWT and try cross-device subjects. `provision_device` already
creates the key (line ~324, `machine_key_json`). Wiring it through
into `DeviceHandle.machine_key` and implementing the test body
(mint JWT-bearer for vm-device-00, subscribe to
`device-commands.vm-device-01`, expect `Permissions Violation`)
is a single follow-up commit. I haven't touched it because nothing
in this branch's scope required it.

**You can verify the boundary manually right now**, even without
the test wired up: tail the callout pod, then SSH onto vm-device-00
and run the agent with a tampered config that points it at
vm-device-01's keyfile. The callout will issue a JWT for
`vm-device-01` (because the JWT-bearer assertion is signed with
that user's key); the agent on vm-device-00 will then publish on
`$KV.device-info.info.vm-device-00`, which is NOT in the JWT's
allow list, so NATS rejects it with `Permissions Violation`. This is
the same gate the test would automate.

The permissions template is in
`nats/callout/src/permissions.rs::device_default`. Every allowed
subject contains `{device_id}` and is interpolated per-request, so
device A's JWT cannot publish to device B's subjects.

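The per-request interpolation can be pictured as a plain string substitution over a list of subject templates. The sketch below is a simplified stand-in: the three template entries are assumptions pieced together from subjects mentioned in this runbook, not the contents of `device_default`, and it ignores the publish/subscribe split and wildcards by checking exact matches only.

```rust
/// Illustrative template only; the real list lives in
/// `permissions.rs::device_default` and may differ.
const DEVICE_TEMPLATE: &[&str] = &[
    "$KV.device-info.info.{device_id}",
    "device-state.{device_id}",
    "device-commands.{device_id}",
];

/// Renders the allow list for one device by substituting `{device_id}`.
fn render_allow_list(template: &[&str], device_id: &str) -> Vec<String> {
    template
        .iter()
        .map(|t| t.replace("{device_id}", device_id))
        .collect()
}

/// Exact-match check against a rendered allow list (no wildcard handling).
fn is_allowed(allow: &[String], subject: &str) -> bool {
    allow.iter().any(|a| a == subject)
}
```

This mirrors the manual check described above: a list rendered for `vm-device-01` contains `$KV.device-info.info.vm-device-01` but not `$KV.device-info.info.vm-device-00`, so an agent publishing under the other device's id is rejected.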
## 7. Drive the desired-state loop

(Not yet covered by a walking-skeleton test, but the agent's
reconciler is wired and observable.) From an admin client, write a
desired state for vm-device-00:

```rust
// pseudocode: see harmony-reconciler-contracts for the exact types
let kv = jetstream.create_key_value(kv::Config {
    bucket: BUCKET_DESIRED_STATE.into(),
    history: 1,
    ..Default::default()
}).await?;
kv.put(
    &desired_state_key("vm-device-00", &dn("hello-web")),
    payload.into(),
).await?;
```

What happens, observable from the agent's journal:

1. Agent's KV watcher (filter `vm-device-00.>`) fires.
2. Reconciler computes the diff and runs the podman create.
3. `write_deployment_state(&state)` fires:
   - puts `state.vm-device-00.hello-web` into the `device-state`
     KV bucket (operator-side watch picks it up)
   - publishes the same payload on direct subject
     `device-state.vm-device-00` (admin observers see it live)

You can subscribe to the latter with admin and watch reconcile
events stream in real time.

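The two writes in step 3 land on different subjects, which is why the permission template has to cover both. JetStream KV stores a key on the subject `$KV.<bucket>.<key>`, whereas the direct pulse uses a plain subject. The sketch below spells out that mapping; `kv_subject` and `deployment_state_key` are hypothetical helpers whose key layout is inferred from the example key above, not the contract crate's functions.

```rust
/// Subject on which JetStream KV stores `key` in `bucket`.
fn kv_subject(bucket: &str, key: &str) -> String {
    format!("$KV.{bucket}.{key}")
}

/// Hypothetical key layout inferred from `state.vm-device-00.hello-web`.
fn deployment_state_key(device_id: &str, deployment: &str) -> String {
    format!("state.{device_id}.{deployment}")
}
```

For the example in this section, the KV put travels on `$KV.device-state.state.vm-device-00.hello-web`, while the live pulse goes out on `device-state.vm-device-00`.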
## 8. Teardown

The cluster persists across runs (re-running `fleet_e2e_demo`
converges drift rather than recreating it). When you want a clean
slate:

```bash
k3d cluster delete fleet-auth-callout

virsh destroy vm-device-00; virsh undefine vm-device-00 --remove-all-storage
virsh destroy vm-device-01; virsh undefine vm-device-01 --remove-all-storage
```

Cached assets (cloud images, k3d binary, ansible venv, SSH key,
fleet secrets) live under `~/.local/share/harmony/` and survive
cluster/VM destruction by design, so the first run after a teardown
reuses them.