# harmony-fleet-e2e

End-to-end test harness for the fleet stack. Brings up NATS (in k3d)
plus one or more `fleet-agent` instances, either as in-cluster Pods
(cheap, no podman) or on real libvirt VMs (expensive, real podman,
matches the production Raspberry Pi target).

Per ADR-023 P2, the harness composes the **same `*Score` types
production uses** (`FleetNatsScore`, `FleetAgentScore`,
`ProvisionVmScore`, `FleetDeviceSetupScore`). The only thing this
crate owns is the test-fixture wiring: per-binary `OnceCell` bring-up,
RAII cleanup of namespaces + VMs, and admin-side KV helpers.

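
To illustrate what "per-binary bring-up" means, here is a minimal sketch using the standard library's `OnceLock` as a stand-in for the `OnceCell` the crate uses. The `SharedStack` type, the `shared_stack` function, and the `BRING_UPS` counter are hypothetical names chosen for illustration, and the real bring-up is presumably asynchronous and far more expensive than this placeholder.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the shared infra stack; not the crate's real type.
#[derive(Debug)]
pub struct SharedStack {
    pub nats_url: String,
}

/// Counts how many times the (placeholder) bring-up actually ran.
pub static BRING_UPS: AtomicUsize = AtomicUsize::new(0);

/// One cell per test binary: every test in the binary shares the same stack.
static STACK: OnceLock<SharedStack> = OnceLock::new();

/// Returns the shared stack, initializing it on first use only.
pub fn shared_stack() -> &'static SharedStack {
    STACK.get_or_init(|| {
        BRING_UPS.fetch_add(1, Ordering::SeqCst);
        // Placeholder for the expensive work (k3d, NATS, agents).
        SharedStack {
            nats_url: "nats://127.0.0.1:30423".to_string(),
        }
    })
}
```

Because each file in `tests/` compiles to its own binary, a static like this is initialized once per test file, which is why each VM test pays for its own bring-up.
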
## File map

```
src/
├── lib.rs          # entry, re-exports
├── stack.rs        # Pod-target stack (NATS + Pod agents, num_devices=0 = infra-only)
├── images.rs       # cargo build + podman build + k3d image import (Pod path)
├── namespace.rs    # k8s namespace RAII guard
├── kv_admin.rs     # admin KV helpers: put/delete desired state + wait_for_phase
└── vm/             # VM-target harness
    ├── stack.rs        # VmStack = infra Stack + Vec<VmDevice>
    ├── device.rs       # one libvirt VM: ProvisionVmScore + FleetDeviceSetupScore
    ├── agent_build.rs  # build the agent for the requested guest arch (aarch64 cross / x86_64 native)
    └── network.rs      # libvirt default-network gateway IP discovery
```

Tests in `tests/` map 1:1 to scenarios:

| File | What it asserts | Cost |
|---|---|---|
| `ping.rs` | Pod agent replies to `Verb::Ping` over NATS | ~30 s (k3d + image build) |
| `vm_ping.rs` | VM agent replies to `Verb::Ping` over NATS | ~75 s (x86 KVM) / ~7 min (aarch64 TCG) |
| `vm_isolation.rs` | VM agent does NOT react to another device's KV key | ~75 s (x86 KVM) / ~8 min (aarch64 TCG) |
| `vm_deploy_lifecycle.rs` | deploy → upgrade → delete podman deployment, KV phases + `podman ps` ground truth | ~90 s (x86 KVM) / ~7-8 min (aarch64 TCG) |

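
The file map mentions a `wait_for_phase` helper in `kv_admin.rs`. Its signature is not shown here, so the following is only a generic sketch of the poll-until-deadline pattern such a helper might follow. The `wait_for` function and the `"Running"` phase name in the usage note are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Polls `probe` until it yields a value or `timeout` elapses.
/// Hypothetical helper; not the crate's actual `wait_for_phase`.
pub fn wait_for<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        // Probe first so an already-satisfied condition returns immediately.
        if let Some(value) = probe() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {timeout:?}"));
        }
        std::thread::sleep(interval);
    }
}
```

A lifecycle test could call something like `wait_for(timeout, interval, || read_phase())`, where `read_phase` (also hypothetical) returns `Some("Running")` once the KV entry reports the expected phase.
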
## Env gates

Every test in this crate is gated so `cargo test --workspace` stays cheap.

| Var | Purpose |
|---|---|
| `HARMONY_FLEET_E2E=1` | Enable the Pod-target test (`ping.rs`). Needs k3d + podman on PATH. |
| `HARMONY_FLEET_VM_E2E=1` | Enable the VM-target tests (`vm_*`). Needs libvirt + qemu (+ aarch64 cross-toolchain when running the default arch). |
| `FLEET_E2E_KEEP=1` | Leave the k8s namespace + libvirt VM in place on test exit (debug). |
| `FLEET_E2E_VM_ARCH=x86_64` | Boot an x86_64 KVM guest instead of an aarch64 TCG guest. Default `aarch64` (production target). x86 runs ~3-4× faster, which is useful for iteration. |
| `RUST_LOG=...` | Standard tracing filter; default is `info`. |

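
As an illustration of how such a gate can keep ungated runs cheap, here is a minimal sketch. The function names are hypothetical, and treating only the exact value `1` as "enabled" is an assumption; the harness may accept other values.

```rust
/// Returns true when a gate variable holds exactly "1".
/// Assumption: any other value, or an unset variable, leaves the gate closed.
pub fn gate_enabled(value: Option<&str>) -> bool {
    matches!(value, Some("1"))
}

/// Reads the named gate variable from the process environment.
pub fn gate_enabled_from_env(var: &str) -> bool {
    gate_enabled(std::env::var(var).ok().as_deref())
}
```

A gated test would then begin with an early return, for example `if !gate_enabled_from_env("HARMONY_FLEET_E2E") { return; }`, so that it passes trivially when the gate is closed.
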
## Running tests

### Pod-target (cheap, fast iteration)

```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping -- --nocapture
```

### VM-target: pick aarch64 (prod parity) or x86_64 (fast iteration)

The same three tests run against either guest arch; flip
`FLEET_E2E_VM_ARCH` to switch. Defaults to `aarch64` (Raspberry Pi target).

| Path | Guest CPU | Wall-clock for `vm_ping` (warm caches) | Use when |
|---|---|---|---|
| `FLEET_E2E_VM_ARCH=x86_64` | native KVM | **~75 s** | dev iteration loop |
| (default, `aarch64`) | qemu TCG emulation | **~7 min** | pre-push / CI / arch-drift catch |

CI **must** run aarch64. Even though x86 covers the logic, a new
crate dep with a broken aarch64 build or a podman call that segfaults
under TCG will only surface when running the aarch64 guest.

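
The arch switch described above can be pictured as a small parser over the value of `FLEET_E2E_VM_ARCH`. This is a sketch only: the `GuestArch` type and its methods are hypothetical names, and rejecting unknown values with an error is an assumption about how the harness behaves.

```rust
/// Guest architectures named in this README. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuestArch {
    Aarch64,
    X86_64,
}

impl GuestArch {
    /// Parses the value of FLEET_E2E_VM_ARCH; unset means the aarch64 default.
    pub fn from_env_value(value: Option<&str>) -> Result<Self, String> {
        match value {
            None | Some("aarch64") => Ok(GuestArch::Aarch64),
            Some("x86_64") => Ok(GuestArch::X86_64),
            // Assumption: unknown values are an error rather than a fallback.
            Some(other) => Err(format!("unsupported FLEET_E2E_VM_ARCH: {other}")),
        }
    }

    /// True when the guest runs under qemu TCG emulation (per the table above).
    pub fn is_emulated(self) -> bool {
        self == GuestArch::Aarch64
    }

    /// True when the agent build needs the aarch64 cross-toolchain.
    pub fn needs_cross_toolchain(self) -> bool {
        self == GuestArch::Aarch64
    }
}
```

Keeping `aarch64` as the default means a bare `HARMONY_FLEET_VM_E2E=1` run always exercises the production architecture, and the faster x86_64 path has to be requested explicitly.
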
```bash
# ---- dev iteration loop (x86_64 KVM, ~3× faster end-to-end) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture

# ---- pre-push / CI (aarch64, production target) ----
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture

# ---- all three sequentially (each is a separate binary → its own VM bring-up) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info cargo test -p harmony-fleet-e2e \
  --test vm_ping --test vm_isolation --test vm_deploy_lifecycle -- --nocapture --test-threads=1

# ---- everything in the crate at once (pod + vm, gates honored per-test) ----
HARMONY_FLEET_E2E=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
  cargo test -p harmony-fleet-e2e -- --nocapture --test-threads=1
```

### Wall-clock breakdown (measured on this host)

`vm_ping` from cold libvirt + cold cargo cache (one-time pain) to a
green test:

| Step | aarch64 TCG | x86_64 KVM | Speedup |
|---|---|---|---|
| Agent build (cold) | 85 s (cross) | 72 s (native) | 1.2× |
| qemu start → DHCP | 48 s | 9 s | 5.3× |
| sshd accepts | 9 s | <1 s | ≥10× |
| Ansible Python detect | 15 s | 1 s | 15× |
| `apt install podman + systemd-container` | **261 s** | **23 s** | **11.3×** |
| FleetDeviceSetup steps 3-7 + restart | ~50 s | ~4 s | ~12× |
| `wait_until_ready` ping retry | ~2 s | <1 s | 2× |
| **Total test future (`finished in …s`)** | **440 s** | **149 s** | **2.95×** |

The single biggest swing is `apt install podman` inside the guest:
4 min 21 s on TCG vs 23 s on KVM. The whole-test speedup is only 2.95×
because the cold cargo cross-build and the cold native build are
comparable (~80 s either way); the gain comes from the in-guest work,
where the x86 path collapses. **Warm-cache iteration is closer to 6×
because the cargo build vanishes.**

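
The ratios quoted above can be checked with a few lines of arithmetic. The sketch below uses only figures from this README; the `speedup` helper is a hypothetical name, and reading "~7 min" as 420 s is an approximation.

```rust
/// Speedup of the x86_64 KVM path over the aarch64 TCG path.
pub fn speedup(aarch64_secs: f64, x86_64_secs: f64) -> f64 {
    aarch64_secs / x86_64_secs
}

/// Formats a ratio with the given number of decimals, e.g. "2.95x".
pub fn fmt_speedup(ratio: f64, decimals: usize) -> String {
    format!("{ratio:.decimals$}x")
}

/// Cold-cache total test future: 440 s on TCG vs 149 s on KVM.
pub fn cold_total() -> f64 {
    speedup(440.0, 149.0)
}

/// In-guest `apt install`: 261 s on TCG vs 23 s on KVM.
pub fn apt_install() -> f64 {
    speedup(261.0, 23.0)
}

/// Warm-cache `vm_ping`: ~7 min (taken as 420 s) vs ~75 s.
pub fn warm_vm_ping() -> f64 {
    speedup(420.0, 75.0)
}
```

The warm-cache ratio of 5.6 comes from the `vm_ping` wall-clock figures in the VM-target table and is what "closer to 6×" rounds from.
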
### Debugging a failed bring-up

```bash
# Leave the VM + namespace alive; inspect by hand.
FLEET_E2E_KEEP=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=debug \
  cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture

# After the test exits, the harness logs the cleanup commands you'd run:
#   kubectl delete namespace e2e-<uuid>
#   virsh destroy fleet-e2e-vm-<run>-<i>
#   virsh undefine --nvram --remove-all-storage fleet-e2e-vm-<run>-<i>

# Tail the VM agent's journal:
ssh -i ~/.local/share/harmony/fleet/ssh/id_ed25519 \
  fleet-admin@<vm-ip> -- 'journalctl -u fleet-agent -f'
```

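
The interaction between RAII cleanup and `FLEET_E2E_KEEP` can be sketched as a guard whose `Drop` either performs the cleanup or just reports the commands. This is not the crate's implementation: `CleanupGuard` and `cleanup_commands` are hypothetical, and the sketch records strings into a log instead of invoking `kubectl` or `virsh`.

```rust
use std::cell::RefCell;

/// The manual cleanup commands listed above, for one namespace and one VM.
pub fn cleanup_commands(namespace: &str, vm_name: &str) -> Vec<String> {
    vec![
        format!("kubectl delete namespace {namespace}"),
        format!("virsh destroy {vm_name}"),
        format!("virsh undefine --nvram --remove-all-storage {vm_name}"),
    ]
}

/// Hypothetical guard: on drop it records what it ran, or what was kept.
pub struct CleanupGuard<'a> {
    pub namespace: String,
    pub vm_name: String,
    /// Mirrors FLEET_E2E_KEEP=1: skip cleanup and only report the commands.
    pub keep: bool,
    pub log: &'a RefCell<Vec<String>>,
}

impl Drop for CleanupGuard<'_> {
    fn drop(&mut self) {
        let prefix = if self.keep { "kept, run manually" } else { "running" };
        for cmd in cleanup_commands(&self.namespace, &self.vm_name) {
            // A real guard would execute `cmd` here when `keep` is false.
            self.log.borrow_mut().push(format!("{prefix}: {cmd}"));
        }
    }
}
```

Tying cleanup to `Drop` means resources are released even when a test panics partway through, while the keep flag turns the same code path into a printed checklist.
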
## Host prerequisites

The Pod path needs: `k3d`, `podman`, `cargo`, `kubectl`.

The VM path adds:

```bash
# Arch
sudo pacman -S libvirt qemu-full libisoburn python podman \
  aarch64-linux-gnu-gcc
rustup target add aarch64-unknown-linux-gnu

# Debian / Ubuntu
sudo apt install libvirt-daemon-system qemu-kvm xorriso python3 python3-venv \
  podman gcc-aarch64-linux-gnu
rustup target add aarch64-unknown-linux-gnu

# One-time libvirt setup
sudo usermod -aG libvirt "$USER"   # then re-login
sudo virsh net-start default
sudo virsh net-autostart default
```

`fleet/scripts/smoke-a3-arm.sh` is the bash equivalent of `vm_ping.rs`
and a useful sanity check when the Rust path misbehaves: same
underlying Scores, fewer moving parts.

## How the VM tests reach NATS

NATS runs in k3d. The harness publishes it as a `NodePort` Service
on host port `30423`. The test process connects directly to
`nats://127.0.0.1:30423`; the VM connects to the same NodePort via
the libvirt default-network gateway (typically `192.168.122.1`);
`vm::network::libvirt_default_gateway_ip` discovers the IP at
bring-up.

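
The two connection paths differ only in the host part of the URL. The sketch below builds both URLs and shows one plausible way to find the gateway, by scanning the XML that `virsh net-dumpxml default` prints for its `<ip address='...'>` element. How `libvirt_default_gateway_ip` actually discovers the address is not documented here, so the parsing approach and the function names are assumptions.

```rust
use std::net::Ipv4Addr;

/// Host port of the NATS NodePort Service, as stated above.
pub const NATS_NODE_PORT: u16 = 30423;

/// Extracts the first `<ip address='...'>` value from libvirt network XML.
/// Assumption: single-quoted attributes, as virsh prints them.
pub fn gateway_from_net_xml(xml: &str) -> Option<Ipv4Addr> {
    let marker = "<ip address='";
    let start = xml.find(marker)? + marker.len();
    let len = xml[start..].find('\'')?;
    xml[start..start + len].parse().ok()
}

/// Builds the NATS URL for a given host address.
pub fn nats_url(host: Ipv4Addr) -> String {
    format!("nats://{}:{}", host, NATS_NODE_PORT)
}
```

The test process would use `nats_url(Ipv4Addr::LOCALHOST)`, while the agent inside the VM would be configured with `nats_url(gateway)`, since `127.0.0.1` inside the guest refers to the guest itself.
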
## What's deliberately not tested here

- **Operator-side aggregation.** The operator's KV-watch → CR-status
  reflection is covered by the operator crate's own suite. These
  tests bypass the operator and talk to NATS directly to keep the
  failure surface narrow: when an agent test fails, you know
  it's the agent.
- **Real Zitadel auth.** All VM tests run against the
  `FleetNatsScore::user_pass` mode. The Zitadel-JWT path is
  exercised by `examples/fleet_e2e_demo` (currently `#[ignore]`'d
  pending a CI runner with full bring-up capacity).