harmony/fleet/harmony-fleet-e2e/README.md
Jean-Gabriel Gill-Couture ba685baddb
doc: fleet e2e x86 arch support
2026-05-20 22:47:52 -04:00

# harmony-fleet-e2e
End-to-end test harness for the fleet stack. Brings up NATS (in k3d)
plus one or more `fleet-agent` instances, either as in-cluster Pods
(cheap, no podman) or on real libvirt VMs (expensive, real podman,
matches the production Raspberry Pi target).

Per ADR-023 P2, the harness composes the **same `*Score` types
production uses** (`FleetNatsScore`, `FleetAgentScore`,
`ProvisionVmScore`, `FleetDeviceSetupScore`). The only thing this
crate owns is the test-fixture wiring: per-binary `OnceCell` bring-up,
RAII cleanup of namespaces + VMs, and admin-side KV helpers.
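
The RAII cleanup mentioned above can be pictured as a guard whose `Drop` impl runs the teardown unless a keep flag is set. This is a minimal sketch, not the crate's actual code: `CleanupGuard` and its closure-based API are hypothetical, and the `keep` flag stands in for `FLEET_E2E_KEEP=1`.

```rust
/// Hypothetical RAII guard (illustration only; not the harness's real type).
/// Runs `cleanup` once when the guard is dropped, unless `keep` is true.
pub struct CleanupGuard<F: FnMut()> {
    keep: bool,
    cleanup: F,
}

impl<F: FnMut()> CleanupGuard<F> {
    /// `keep` would be derived from `FLEET_E2E_KEEP` in a real test.
    pub fn new(keep: bool, cleanup: F) -> Self {
        Self { keep, cleanup }
    }
}

impl<F: FnMut()> Drop for CleanupGuard<F> {
    fn drop(&mut self) {
        // Skip teardown when the caller asked to keep resources for debugging.
        if !self.keep {
            (self.cleanup)();
        }
    }
}
```

In a test, the closure would delete the namespace or destroy the VM. Because `Drop` also runs while a panicking test unwinds, a failed assertion still triggers the teardown.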
## File map
```
src/
├── lib.rs        # entry, re-exports
├── stack.rs      # Pod-target stack (NATS + Pod agents, num_devices=0 = infra-only)
├── images.rs     # cargo build + podman build + k3d image import (Pod path)
├── namespace.rs  # k8s namespace RAII guard
├── kv_admin.rs   # admin KV helpers: put/delete desired state + wait_for_phase
└── vm/           # VM-target harness
    ├── stack.rs        # VmStack = infra Stack + Vec<VmDevice>
    ├── device.rs       # one libvirt VM: ProvisionVmScore + FleetDeviceSetupScore
    ├── agent_build.rs  # build the agent for the requested guest arch (aarch64 cross / x86_64 native)
    └── network.rs      # libvirt default-network gateway IP discovery
```
Tests in `tests/` map 1:1 to scenarios:
| File | What it asserts | Cost |
|---|---|---|
| `ping.rs` | Pod agent replies to `Verb::Ping` over NATS | ~30 s (k3d + image build) |
| `vm_ping.rs` | VM agent replies to `Verb::Ping` over NATS | ~75 s (x86 KVM) / ~7 min (aarch64 TCG) |
| `vm_isolation.rs` | VM agent does NOT react to another device's KV key | ~75 s (x86 KVM) / ~8 min (aarch64 TCG) |
| `vm_deploy_lifecycle.rs` | deploy → upgrade → delete podman deployment, KV phases + `podman ps` ground truth | ~90 s (x86 KVM) / ~7-8 min (aarch64 TCG) |
## Env gates
Every test in this crate is gated so `cargo test --workspace` stays cheap.
| Var | Purpose |
|---|---|
| `HARMONY_FLEET_E2E=1` | Enable the Pod-target test (`ping.rs`). Needs k3d + podman on PATH. |
| `HARMONY_FLEET_VM_E2E=1` | Enable the VM-target tests (`vm_*`). Needs libvirt + qemu (+ aarch64 cross-toolchain when running the default arch). |
| `FLEET_E2E_KEEP=1` | Leave the k8s namespace + libvirt VM in place on test exit (debug). |
| `FLEET_E2E_VM_ARCH=x86_64` | Boot an x86_64 KVM guest instead of an aarch64 TCG guest. Default `aarch64` (production target). x86 runs ~3-4× faster — useful for iteration. |
| `RUST_LOG=...` | Standard tracing filter; default is `info`. |
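
The gates in the table could be read along these lines. This is a sketch under assumptions: `gate_enabled`, `parse_guest_arch`, `vm_tests_enabled`, and `GuestArch` are hypothetical names, and treating only the literal value `1` as "enabled" is a guess about the harness's behavior, not something this README states.

```rust
use std::env;

/// Guest architectures named in the table (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuestArch {
    Aarch64,
    X86_64,
}

/// A gate counts as enabled only when the variable is exactly "1" (assumption).
pub fn gate_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Unset falls back to aarch64, the documented default.
pub fn parse_guest_arch(value: Option<&str>) -> Result<GuestArch, String> {
    match value {
        None | Some("aarch64") => Ok(GuestArch::Aarch64),
        Some("x86_64") => Ok(GuestArch::X86_64),
        Some(other) => Err(format!("unsupported FLEET_E2E_VM_ARCH: {other}")),
    }
}

/// Thin wrapper that reads the real environment.
pub fn vm_tests_enabled() -> bool {
    gate_enabled(env::var("HARMONY_FLEET_VM_E2E").ok().as_deref())
}
```

A gated test would return early when `vm_tests_enabled()` is false, which is what keeps `cargo test --workspace` cheap. Keeping the parsing functions pure (taking `Option<&str>`) makes them testable without mutating the process environment.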
## Running tests
### Pod-target (cheap, fast iteration)
```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping -- --nocapture
```
### VM-target — pick aarch64 (prod parity) or x86_64 (fast iteration)
The same three tests run against either guest arch; set
`FLEET_E2E_VM_ARCH` to switch. Defaults to `aarch64` (Raspberry Pi target).

| Path | Guest CPU | Wall-clock for `vm_ping` (warm caches) | Use when |
|---|---|---|---|
| `FLEET_E2E_VM_ARCH=x86_64` | native KVM | **~75 s** | dev iteration loop |
| (default, `aarch64`) | qemu TCG emulation | **~7 min** | pre-push / CI / arch-drift catch |

CI **must** run aarch64. Even though x86 covers the logic, a new
crate dep with a broken aarch64 build or a podman call that segfaults
under TCG will only surface on the real target.
```bash
# ---- dev iteration loop (x86_64 KVM, ~3× faster end-to-end) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture
# ---- pre-push / CI (aarch64 — production target) ----
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture
# ---- all three sequentially (each is a separate binary → its own VM bring-up) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info cargo test -p harmony-fleet-e2e \
--test vm_ping --test vm_isolation --test vm_deploy_lifecycle -- --nocapture --test-threads=1
# ---- everything in the crate at once (pod + vm, gates honored per-test) ----
HARMONY_FLEET_E2E=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e -- --nocapture --test-threads=1
```
### Wall-clock breakdown (measured on this host)
`vm_ping` from cold libvirt + cold cargo cache (one-time pain) to a
green test:
| Step | aarch64 TCG | x86_64 KVM | Speedup |
|---|---|---|---|
| Agent build (cold) | 85 s (cross) | 72 s (native) | 1.2× |
| qemu start → DHCP | 48 s | 9 s | 5.3× |
| sshd accepts | 9 s | <1 s | 10× |
| Ansible Python detect | 15 s | 1 s | 15× |
| `apt install podman + systemd-container` | **261 s** | **23 s** | **11.3×** |
| FleetDeviceSetup steps 3-7 + restart | ~50 s | ~4 s | ~12× |
| `wait_until_ready` ping retry | ~2 s | <1 s | 2× |
| **Total test future (`finished in …s`)** | **440 s** | **149 s** | **2.95×** |

The single biggest swing is `apt install podman` inside the guest:
4 min 21 s on TCG vs 23 s on KVM. The whole-test speedup is only 2.95×
because the cold cargo cross-build and native build take comparable
time (~80 s either way); the in-guest work is where the x86 path pulls
ahead. **Warm-cache iteration is closer to 6× because the cargo
build vanishes.**
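
The speedup figures are plain ratios of wall-clock seconds. As a small worked check (the helper name `speedup` is illustrative; the inputs are the numbers quoted in this README, with the warm-cache case using the ~7 min vs ~75 s `vm_ping` figures):

```rust
/// Speedup of the fast path over the slow path: slow seconds / fast seconds.
pub fn speedup(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}

/// (label, aarch64 TCG seconds, x86_64 KVM seconds), copied from the README.
pub const SAMPLES: [(&str, f64, f64); 3] = [
    ("apt install podman", 261.0, 23.0), // about 11.3x
    ("cold total", 440.0, 149.0),        // about 2.95x
    ("warm vm_ping", 420.0, 75.0),       // 5.6x, i.e. "closer to 6x"
];

/// One line per sample, formatted to two decimals.
pub fn report() -> Vec<String> {
    SAMPLES
        .iter()
        .map(|(label, slow, fast)| format!("{label}: {:.2}x", speedup(*slow, *fast)))
        .collect()
}
```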
### Debugging a failed bring-up
```bash
# Leave the VM + namespace alive; inspect by hand.
FLEET_E2E_KEEP=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=debug \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
# After the test exits, the harness logs the cleanup commands you'd run:
# kubectl delete namespace e2e-<uuid>
# virsh destroy fleet-e2e-vm-<run>-<i>
# virsh undefine --nvram --remove-all-storage fleet-e2e-vm-<run>-<i>
# Tail the VM agent's journal:
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
fleet-admin@<vm-ip> -- 'journalctl -u fleet-agent -f'
```
## Host prerequisites
The Pod path needs: `k3d`, `podman`, `cargo`, `kubectl`.
The VM path adds:
```bash
# Arch
sudo pacman -S libvirt qemu-full libisoburn python podman \
aarch64-linux-gnu-gcc
rustup target add aarch64-unknown-linux-gnu
# Debian / Ubuntu
sudo apt install libvirt-daemon-system qemu-kvm xorriso python3 python3-venv \
podman gcc-aarch64-linux-gnu
rustup target add aarch64-unknown-linux-gnu
# One-time libvirt setup
sudo usermod -aG libvirt "$USER" # then re-login
sudo virsh net-start default
sudo virsh net-autostart default
```
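
Before a long bring-up it can help to confirm the required tools are actually on `PATH`. The helper below is a hypothetical preflight sketch, not part of the harness: `missing_tools` and `POD_PATH_TOOLS` are made-up names, and the check only tests that a file with that name exists in a `PATH` directory (it does not verify the executable bit).

```rust
use std::env;

/// Tools the Pod path needs, as listed in this README.
pub const POD_PATH_TOOLS: [&str; 4] = ["k3d", "podman", "cargo", "kubectl"];

/// Returns the subset of `tools` not found in any directory of `path_var`.
/// `path_var` uses the platform's PATH syntax (colon-separated on Unix).
pub fn missing_tools<'a>(path_var: &str, tools: &[&'a str]) -> Vec<&'a str> {
    tools
        .iter()
        .copied()
        .filter(|tool| !env::split_paths(path_var).any(|dir| dir.join(tool).is_file()))
        .collect()
}
```

A caller would pass the real search path, for example `missing_tools(&std::env::var("PATH").unwrap_or_default(), &POD_PATH_TOOLS)`, and fail fast with the returned list if it is non-empty.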
`fleet/scripts/smoke-a3-arm.sh` is the bash equivalent of `vm_ping.rs`
and a useful sanity check when the Rust path misbehaves: same
underlying Scores, fewer moving parts.
## How the VM tests reach NATS
NATS runs in k3d. The harness publishes it as a `NodePort` Service
on host port `30423`. The test process connects directly to
`nats://127.0.0.1:30423`; the VM connects to the same NodePort via
the libvirt default-network gateway (typically `192.168.122.1`).
`vm::network::libvirt_default_gateway_ip` discovers the IP at
bring-up.
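
One way to picture the gateway discovery is to read the `<ip address='…'>` element from the output of `virsh net-dumpxml default` and build the NATS URL from it. This is only a sketch: how `libvirt_default_gateway_ip` really works is not described here, and `gateway_ip_from_net_xml` and `nats_url` are hypothetical helpers that use naive string matching rather than an XML parser.

```rust
/// Extracts the `address` attribute of the first `<ip ...>` element in a
/// libvirt network XML document. Returns None when no such element exists.
pub fn gateway_ip_from_net_xml(xml: &str) -> Option<&str> {
    let tag_start = xml.find("<ip ")?;
    let tag = &xml[tag_start..];
    let tag = &tag[..tag.find('>')?];
    let value_start = tag.find("address=")? + "address=".len();
    let rest = &tag[value_start..];
    // The attribute value may be wrapped in single or double quotes.
    let quote = rest.chars().next().filter(|c| *c == '\'' || *c == '"')?;
    let value = &rest[1..];
    Some(&value[..value.find(quote)?])
}

/// Builds the URL a client would dial for a given host and NodePort.
pub fn nats_url(host: &str, port: u16) -> String {
    format!("nats://{host}:{port}")
}
```

With the typical default network, the VM side would end up with `nats://192.168.122.1:30423`, while the test process on the host uses `nats://127.0.0.1:30423`.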
## What's deliberately not tested here
- **Operator-side aggregation.** The operator's KV-watch CR-status
reflection is covered by the operator crate's own suite. These
tests bypass the operator and talk to NATS directly to keep the
failure surface narrow: when an agent test fails, you know
it's the agent.
- **Real Zitadel auth.** All VM tests run against the
`FleetNatsScore::user_pass` mode. The Zitadel-JWT path is
exercised by `examples/fleet_e2e_demo` (currently `#[ignore]`'d
pending a CI runner with full bring-up capacity).