Caller must pass `UserPassCredentials` to `FleetNatsScore::user_pass` — no more `e2e-admin`/`e2e-device` defaults shipped in the library. The deploy binary reads `HARMONY_FLEET_*` env vars (default namespace `harmony-fleet-system`) and fails fast when NATS creds aren't set. Also: `style/dist/` gitignored, `manual_mint/mint.py` moved next to `nats/callout/` with README + secrets gitignore (the real RSA key that was sitting untracked has been removed), `architecture_review.md` moved to `docs/adr/drafts/024-`, three low-value ROADMAP docs deleted. Updates pre-merge checklist (§1.6, §1.8, §3.1, §5).
harmony-fleet-e2e
End-to-end test harness for the fleet stack. Brings up NATS (in k3d)
plus one or more fleet-agent instances — either as in-cluster Pods
(cheap, no podman) or on real libvirt VMs (expensive, real podman,
matches the production Raspberry Pi target).
Per ADR-023 P2, the harness composes the same *Score types
production uses (FleetNatsScore, FleetAgentScore,
ProvisionVmScore, FleetDeviceSetupScore). The only thing this
crate owns is the test-fixture wiring: per-binary OnceCell bring-up,
RAII cleanup of namespaces + VMs, and admin-side KV helpers.
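The RAII cleanup idea can be sketched with a small guard type. This is a minimal illustration, not the crate's actual code: `CleanupGuard` and `keep_requested` are hypothetical names, and the real guards presumably delete a k8s namespace or a libvirt VM rather than run an arbitrary closure. The sketch also assumes that `FLEET_E2E_KEEP=1` is what suppresses the cleanup.

```rust
/// Hypothetical guard: runs `cleanup` when dropped, unless `keep` is set.
pub struct CleanupGuard<F: FnMut()> {
    name: String,
    keep: bool,
    cleanup: F,
}

impl<F: FnMut()> CleanupGuard<F> {
    pub fn new(name: impl Into<String>, keep: bool, cleanup: F) -> Self {
        Self { name: name.into(), keep, cleanup }
    }
}

impl<F: FnMut()> Drop for CleanupGuard<F> {
    fn drop(&mut self) {
        if self.keep {
            // Leave the resource in place so it can be inspected by hand.
            eprintln!("keeping {} for debugging", self.name);
        } else {
            (self.cleanup)();
        }
    }
}

/// Assumption: the keep flag is the literal value "1" in FLEET_E2E_KEEP.
pub fn keep_requested() -> bool {
    std::env::var("FLEET_E2E_KEEP").map(|v| v == "1").unwrap_or(false)
}
```

Because the cleanup lives in `Drop`, it also runs when a test panics and unwinds, which is the main reason to prefer a guard over an explicit teardown call at the end of the test.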
File map
src/
├── lib.rs # entry, re-exports
├── stack.rs # Pod-target stack (NATS + Pod agents, num_devices=0 = infra-only)
├── images.rs # cargo build + podman build + k3d image import (Pod path)
├── namespace.rs # k8s namespace RAII guard
├── kv_admin.rs # admin KV helpers: put/delete desired state + wait_for_phase
└── vm/ # VM-target harness
    ├── stack.rs # VmStack = infra Stack + Vec<VmDevice>
    ├── device.rs # one libvirt VM: ProvisionVmScore + FleetDeviceSetupScore
    ├── agent_build.rs # build the agent for the requested guest arch (aarch64 cross / x86_64 native)
    └── network.rs # libvirt default-network gateway IP discovery
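A helper such as `wait_for_phase` in `kv_admin.rs` is, in essence, a poll with a deadline. The sketch below shows that pattern in synchronous form. `poll_until` is a hypothetical name, and the real helper's signature, its async runtime, and the way it reads the KV bucket are not shown in this README, so treat all of them as assumptions.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `probe` every `interval` until it yields a
/// value or `timeout` elapses. The probe is always tried at least once.
pub fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {:?}", timeout));
        }
        std::thread::sleep(interval);
    }
}
```

A phase wait would pass a probe that reads the device's KV entry and returns `Some(phase)` only once the expected phase appears.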
Tests in tests/ map 1:1 to scenarios:
| File | What it asserts | Cost |
|---|---|---|
| `ping.rs` | Pod agent replies to `Verb::Ping` over NATS | ~30 s (k3d + image build) |
| `operator.rs` | Operator adds Fleet Deployment finalizers and reconciles desired-state KV create/delete | ~30 s (k3d + image build) |
| `vm_ping.rs` | VM agent replies to `Verb::Ping` over NATS | ~75 s (x86 KVM) / ~7 min (aarch64 TCG) |
| `vm_isolation.rs` | VM agent does NOT react to another device's KV key | ~75 s (x86 KVM) / ~8 min (aarch64 TCG) |
| `vm_deploy_lifecycle.rs` | deploy → upgrade → delete podman deployment, KV phases + `podman ps` ground truth | ~90 s (x86 KVM) / ~7-8 min (aarch64 TCG) |
Env gates
Every test in this crate is gated so cargo test --workspace stays cheap.
| Var | Purpose |
|---|---|
| `HARMONY_FLEET_E2E=1` | Enable the Pod-target test (`ping.rs`). Needs k3d + podman on PATH. |
| `HARMONY_FLEET_VM_E2E=1` | Enable the VM-target tests (`vm_*`). Needs libvirt + qemu (+ aarch64 cross-toolchain when running the default arch). |
| `FLEET_E2E_KEEP=1` | Leave the k8s namespace + libvirt VM in place on test exit (debug). |
| `FLEET_E2E_VM_ARCH=x86_64` | Boot an x86_64 KVM guest instead of an aarch64 TCG guest. Defaults to aarch64 (production target); x86 runs ~3-4× faster, which is useful for iteration. |
| `RUST_LOG=...` | Standard tracing filter; default is `info`. |
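A gate of this kind usually comes down to an early return at the top of each test. The sketch below shows one way to express it. The function names are hypothetical, and treating only the literal value `1` as "enabled" is an assumption taken from the `VAR=1` convention in the table, not from the crate's source.

```rust
use std::env;

/// Pure check, kept separate from the environment so it is easy to test.
/// Assumption: only the exact value "1" enables a gate.
pub fn gate_value_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads `var` from the process environment and applies the check above.
pub fn gate_enabled(var: &str) -> bool {
    gate_value_enabled(env::var(var).ok().as_deref())
}
```

A test would start with something like `if !gate_enabled("HARMONY_FLEET_E2E") { return; }`, so an ungated `cargo test --workspace` reports the test as passed without bringing anything up.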
Running tests
Pod-target (cheap, fast iteration)
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping -- --nocapture
VM-target — pick aarch64 (prod parity) or x86_64 (fast iteration)
The same three tests run against either guest arch — flip
FLEET_E2E_VM_ARCH. Defaults to aarch64 (Raspberry Pi target).
| Path | Guest CPU | Wall-clock for `vm_ping` (warm caches) | Use when |
|---|---|---|---|
| `FLEET_E2E_VM_ARCH=x86_64` | native KVM | ~75 s | dev iteration loop |
| (default, aarch64) | qemu TCG emulation | ~7 min | pre-push / CI / arch-drift catch |
CI must run aarch64 — even though x86 covers the logic, a new crate dep with a broken aarch64 build or a podman call that segfaults under TCG will only surface on the real target.
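The arch switch can be modelled as a small enum with aarch64 as the fallback. This is an illustrative sketch only: `GuestArch` and its methods are hypothetical names, the x86_64 target triple assumes an x86_64 Linux host, and rejecting unknown values (rather than silently falling back to the default) is a design assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuestArch {
    Aarch64,
    X86_64,
}

impl GuestArch {
    /// Maps the value of FLEET_E2E_VM_ARCH to an arch; unset means aarch64.
    pub fn from_env_value(value: Option<&str>) -> Result<Self, String> {
        match value {
            None | Some("") | Some("aarch64") => Ok(Self::Aarch64),
            Some("x86_64") => Ok(Self::X86_64),
            Some(other) => Err(format!("unsupported FLEET_E2E_VM_ARCH: {other}")),
        }
    }

    /// Rust target triple the agent would be built for.
    pub fn rust_target(self) -> &'static str {
        match self {
            Self::Aarch64 => "aarch64-unknown-linux-gnu",
            // Assumption: the host itself is x86_64 Linux (native build).
            Self::X86_64 => "x86_64-unknown-linux-gnu",
        }
    }

    /// On an x86_64 host only the x86_64 guest runs under KVM;
    /// the aarch64 guest falls back to TCG emulation.
    pub fn uses_kvm(self) -> bool {
        matches!(self, Self::X86_64)
    }
}
```

Failing on an unrecognized value keeps a typo such as `x86-64` from quietly selecting the slow aarch64 path.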
# ---- dev iteration loop (x86_64 KVM, ~3× faster end-to-end) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture
# ---- pre-push / CI (aarch64 — production target) ----
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture
# ---- all three sequentially (each is a separate binary → its own VM bring-up) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info cargo test -p harmony-fleet-e2e \
--test vm_ping --test vm_isolation --test vm_deploy_lifecycle -- --nocapture --test-threads=1
# ---- everything in the crate at once (pod + vm, gates honored per-test) ----
HARMONY_FLEET_E2E=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e -- --nocapture --test-threads=1
Wall-clock breakdown (measured on this host)
vm_ping from cold libvirt + cold cargo cache (one-time pain) to a
green test:
| Step | aarch64 TCG | x86_64 KVM | Speedup |
|---|---|---|---|
| Agent build (cold) | 85 s (cross) | 72 s (native) | 1.2× |
| qemu start → DHCP | 48 s | 9 s | 5.3× |
| sshd accepts | 9 s | <1 s | ≥10× |
| Ansible Python detect | 15 s | 1 s | 15× |
| `apt install podman` + `systemd-container` | 261 s | 23 s | 11.3× |
| FleetDeviceSetup steps 3-7 + restart | ~50 s | ~4 s | ~12× |
| `wait_until_ready` ping retry | ~2 s | <1 s | 2× |
| Total test future (`finished in …s`) | 440 s | 149 s | 2.95× |
The single biggest swing is `apt install podman` inside the guest:
4 min 21 s on TCG vs 23 s on KVM. The whole-test speedup is only
2.95× because the cold cargo cross-build and the cargo native build
are comparable (~80 s either way); the in-guest work is where the x86
path pulls ahead. Warm-cache iteration is closer to 6× because the
cargo build vanishes.
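The ratios quoted above are plain quotients of the measured wall-clock times. The sketch below reproduces three of them. The helper names are hypothetical, and the warm-cache figure uses the approximate `vm_ping` timings quoted earlier (~7 min and ~75 s), so it is a rough estimate rather than a measurement.

```rust
/// Speedup of KVM over TCG: ratio of wall-clock seconds.
pub fn speedup(tcg_secs: f64, kvm_secs: f64) -> f64 {
    tcg_secs / kvm_secs
}

/// Rounds to two decimal places for display.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}

/// Cold whole-test speedup from the table: 440 s vs 149 s.
pub fn cold_total_speedup() -> f64 {
    round2(speedup(440.0, 149.0))
}

/// In-guest apt install: 261 s (4 min 21 s) vs 23 s.
pub fn apt_install_speedup() -> f64 {
    round2(speedup(261.0, 23.0))
}

/// Warm caches, using the approximate vm_ping timings: ~420 s vs ~75 s.
pub fn warm_speedup_estimate() -> f64 {
    round2(speedup(420.0, 75.0))
}
```

These evaluate to roughly 2.95, 11.35 and 5.6 respectively, which matches the table and the "closer to 6×" warm-cache figure.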
Debugging a failed bring-up
# Leave the VM + namespace alive; inspect by hand.
FLEET_E2E_KEEP=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=debug \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
# After the test exits, the harness logs the cleanup commands you'd run:
# kubectl delete namespace e2e-<uuid>
# virsh destroy fleet-e2e-vm-<run>-<i>
# virsh undefine --nvram --remove-all-storage fleet-e2e-vm-<run>-<i>
# Tail the VM agent's journal:
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
fleet-admin@<vm-ip> -- 'journalctl -u fleet-agent -f'
Host prerequisites
The Pod path needs: k3d, podman, cargo, kubectl.
The VM path adds:
# Arch
sudo pacman -S libvirt qemu-full libisoburn python podman \
aarch64-linux-gnu-gcc
rustup target add aarch64-unknown-linux-gnu
# Debian / Ubuntu
sudo apt install libvirt-daemon-system qemu-kvm xorriso python3 python3-venv \
podman gcc-aarch64-linux-gnu
rustup target add aarch64-unknown-linux-gnu
# One-time libvirt setup
sudo usermod -aG libvirt "$USER" # then re-login
sudo virsh net-start default
sudo virsh net-autostart default
fleet/scripts/smoke-a3-arm.sh is the bash equivalent of vm_ping.rs
and a useful sanity check when the Rust path misbehaves — same
underlying Scores, fewer moving parts.
How the VM tests reach NATS
NATS runs in k3d. The harness publishes it as a NodePort Service
on host port 30423. The test process connects directly to
nats://127.0.0.1:30423; the VM connects to the same NodePort via
the libvirt default-network gateway (typically 192.168.122.1) —
vm::network::libvirt_default_gateway_ip discovers the IP at
bring-up.
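The two connection paths differ only in the host part of the URL. A minimal sketch follows, with hypothetical helper names. The gateway IP is passed in as a parameter because the signature of `vm::network::libvirt_default_gateway_ip` is not shown in this README.

```rust
use std::net::Ipv4Addr;

/// NodePort on which the harness publishes NATS (from the text above).
pub const NATS_NODE_PORT: u16 = 30423;

/// URL used by the test process running on the host.
pub fn host_nats_url() -> String {
    format!("nats://127.0.0.1:{}", NATS_NODE_PORT)
}

/// URL used from inside a VM: same NodePort, reached through the libvirt
/// default-network gateway instead of loopback.
pub fn vm_nats_url(gateway: Ipv4Addr) -> String {
    format!("nats://{}:{}", gateway, NATS_NODE_PORT)
}
```

With the typical gateway this gives `nats://192.168.122.1:30423` for the VM side.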
What's deliberately not tested here
- Operator-side aggregation. The operator's KV-watch → CR-status reflection is covered by the operator crate's own suite. These tests bypass the operator and talk to NATS directly to keep the failure surface narrow — when an agent test fails, you know it's the agent.
- Real Zitadel auth. All VM tests run against the `FleetNatsScore::user_pass` mode. The Zitadel-JWT path is exercised by `examples/fleet_e2e_demo` (currently `#[ignore]`'d pending a CI runner with full bring-up capacity).