feat: fleet e2e x86 vm support #288

Closed
johnride wants to merge 3 commits from feat/fleet-e2e-x86 into feat/fleet-e2e
11 changed files with 294 additions and 174 deletions


@@ -29,26 +29,33 @@ why the negative path is intentionally untested (inquire has no
 stdin mock; covering it would need a `Config` type with a manual
 non-prompting `InteractiveParseObj` impl — separate refactor).
-### 1.2 — Manual end-to-end verification per fleet component
+### 1.2 — End-to-end verification per fleet component
-The user-stated bar: every component of the fleet stack deploys
-reliably manually. Not yet a single automated suite. Run through
-this matrix on a developer box with libvirt + k3d + podman
-available. Mark date + initials when each row passes.
+Rows the `harmony-fleet-e2e` crate now covers as automated tests:
+
+| Component | How to run | Status |
+|---|---|---|
+| Pod-target agent + NATS in k3d | `HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping` | ✓ automated |
+| ARM VM bring-up + agent (aarch64 cloud image, AAVMF firmware) | `HARMONY_FLEET_VM_E2E=1 cargo test -p harmony-fleet-e2e --test vm_ping` | ✓ automated |
+| x86 VM bring-up + agent (KVM, fast path) | `HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 cargo test … --test vm_ping` | ✓ automated |
+| Device-setup over SSH (FleetDeviceSetupScore) | Exercised by every `vm_*` test bring-up | ✓ automated |
+| Ping (operator → agent over NATS request/reply) | Both `ping` (Pod) and `vm_ping` (VM) | ✓ automated |
+| Agent KV isolation (own filter only) | `vm_isolation` | ✓ automated |
+| Podman deployment lifecycle (deploy → upgrade → delete) | `vm_deploy_lifecycle` (+ `podman ps` ground-truth via SSH) | ✓ automated |
+Verified at least once each on the dev host (aarch64 ~7 min,
+x86_64 ~2.5 min); see `fleet/harmony-fleet-e2e/README.md` for
+copy-paste commands and the wall-clock breakdown.
+Rows still **manual** (no Rust automation yet — verify by hand
+before merge and record date + initials):
 | Component | How to deploy | What "works" looks like | Owner | Last verified |
 |---|---|---|---|---|
-| x86 VM (cloud-init Ubuntu) | `cargo run -p example_fleet_vm_setup` | `virsh list` shows running VM with SSH key trust | | |
-| ARM VM (aarch64 + AAVMF firmware) | `cargo run -p example_fleet_vm_setup --features aarch64` (or `fleet/scripts/smoke-a3-arm.sh`) | aarch64 VM boots, fleet-agent comes up on it | | |
 | Zitadel (full setup) | `cargo run -p example_fleet_staging_install -- --base-domain <…>` | Zitadel admin UI reachable, persisted admin password set, IAM PAT secret created | | |
 | NATS + auth callout | `cargo run -p example_fleet_auth_callout` (deploy phase) | NATS pod running on k3d; callout pod healthy; JWKS fetch logs visible | | |
 | Operator | `cargo run -p example_fleet_server_install` | Operator pod up, Deployment CRD registered, NATS KV buckets created | | |
-| Agent on x86 VM | follow `examples/fleet_e2e_demo/RUNBOOK.md` | Agent connects to NATS, publishes DeviceInfo to KV | | |
-| Agent on ARM VM | same + arm64 target | same | | |
 | Enrollment via Zitadel SSO | `cargo run -p example-fleet-sso-login` + `fleet-device-enroll --device-id …` | Device JWT minted, machine user provisioned, agent connects with bearer-token JWT | | |
-| Device-setup over SSH (FleetDeviceSetupScore) | from `examples/fleet_e2e_demo::apply_setup` flow | agent binary installed, systemd unit enabled, agent running | | |
-| Ping (operator → agent over NATS request/reply) | `HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping` | green test, ping round-trip | | |
-| Podman deployment | apply a `Deployment` CRD with `PodmanV0Score` payload, watch agent reconcile | `podman ps` on the device shows the requested container | | |
 Outputs of each manual run go into a follow-up issue / PR
 description, not committed here — this matrix is the index, not
@@ -64,45 +71,48 @@ For each item below, the question is: **does the code on this
branch honor the principle?** branch honor the principle?**
- **P1. Deploy with Scores, not handrolled manifests.** - **P1. Deploy with Scores, not handrolled manifests.**
- `fleet/harmony-fleet-e2e/src/stack.rs`: already cleaned in - `fleet/harmony-fleet-e2e/src/stack.rs` + `vm/*` confirmed
the ADR-023 refactor. Re-confirm no `k8s_openapi::api::*` handroll-free: only `*Score` types are composed; the only
structs survive in test/example code. `k8s_openapi` use is the readiness-poll `Deployment` get
- `fleet/harmony-fleet-deploy/src/agent.rs`: builds (cluster query, not a manifest build).
`Deployment` / `ConfigMap` / `Service` manually inside - `fleet/harmony-fleet-deploy/src/agent.rs` still builds
`interpret`. **Technically** within ADR-023's letter (it's `Deployment` / `ConfigMap` manually inside `interpret`. ADR-023
inside a Score's interpret body) but is the right letter is honored (manifests are inside a Score's interpret
abstraction to compose `K8sResourceScore` instead? body, not in test/CLI code), so accepted for this branch. A
*Flagged for review.* future cleanup could compose `K8sResourceScore` instead —
track in a follow-up issue, not a blocker.
- **P2. E2E uses the same Scores as production.** - **P2. E2E uses the same Scores as production.**
- `harmony-fleet-e2e` is the test of this. Confirm `stack.rs` - ✓ verified by both Pod (`stack.rs`) and VM (`vm/*.rs`)
composes the same Scores as `example_fleet_server_install`. harnesses: they compose `FleetNatsScore` + `FleetAgentScore`
+ `ProvisionVmScore` + `FleetDeviceSetupScore` exactly as
`example_fleet_server_install` / `example_fleet_vm_setup` do.
- **P3. One Score per deployable component.** - **P3. One Score per deployable component.**
- `harmony/src/modules/fleet/setup_score.rs` is 1049 lines and - `harmony/src/modules/fleet/setup_score.rs` (1049 lines) is a
composes Zitadel + NATS + callout + operator. ADR-023 says *device-side composition* (podman + user + linger + config +
"composition is the user-facing primitive; don't build systemd unit), not a multi-service deploy. Acceptable under
monolithic deploy-everything Scores." Confirm this file is a P3; the file is on the deferred move-to-`*-deploy` list (§1.7
composition of primitives, not a megascore that bypasses ADR-024 scope).
them.
- **The 3 open code review comments still apply** (see §3.1).
- **P4. Deploy returns only after smoke-test success.** - **P4. Deploy returns only after smoke-test success.**
- This is *not* enforced today — see §3.2. Track as known - Not enforced framework-wide; see §3.2. The e2e harness now
debt, not a merge blocker (ADR-023 left it open). has `VmStack::wait_until_ready` (ping retry until subscribed)
as a per-test stand-in. Track as known debt, not a blocker.
- **P5. Deploy logic lives in a `*-deploy` crate.** - **P5. Deploy logic lives in a `*-deploy` crate.**
- Confirm: `harmony-fleet-deploy` is the canonical home. The - `harmony-fleet-deploy` is the canonical home. New
`harmony/src/modules/fleet/` directory should shrink, not `companion/` module added there. The `harmony/src/modules/
grow, in follow-ups. ADR-024 proposes pulling more out. fleet/` directory should still shrink — see §1.7.
- **P6. Topologies compile-time, selected at runtime.** - **P6. Topologies compile-time, selected at runtime.**
- No `Box<dyn Topology>` plugin loaders introduced. Confirm - `rg 'Box<dyn Topology'` clean across the new code.
with `rg 'Box<dyn Topology'` on the new code.
- **P7. Extend Scores with companions, not API changes.** - **P7. Extend Scores with companions, not API changes.**
- Confirm no new methods were added to `Score` / `Interpret` - ✓ first concrete companion landed:
traits. `harmony-fleet-deploy::companion::AgentObservation` — derives
the agent's KV watch scope from typed `AgentConfig` without
touching `Score` / `Interpret`.
- **P8. CLI hybrid, staged (B today, C later).** - **P8. CLI hybrid, staged (B today, C later).**
- Confirm new binaries follow the `harmony-*` naming pattern - `harmony-fleet-deploy` binary follows the naming pattern
and use `harmony_cli`. and uses `harmony_cli`. No plugin discovery introduced.
- **P9. thiserror everywhere, anyhow only at binary glue.** - **P9. thiserror everywhere, anyhow only at binary glue.**
- Confirm new library code uses `thiserror`. Scan for - ✓ new code (`vm/*.rs`, `kv_admin.rs`, `companion/`) uses
`anyhow::Error` returns in non-`main.rs` files. typed errors via `thiserror`. `anyhow` only at test glue.
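The per-test readiness stand-in mentioned under P4 (ping retry until the agent is subscribed) amounts to a bounded retry loop. The following is a minimal synchronous sketch, not the harness code: `retry_until_ready` is a hypothetical helper, and the probe closure stands in for the NATS ping. The real `VmStack::wait_until_ready` is not shown in this diff and is presumably async.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: call `probe` until it succeeds or `timeout` elapses.
/// Returns the number of attempts on success, or the last error on timeout.
pub fn retry_until_ready<E>(
    mut probe: impl FnMut() -> Result<(), E>,
    interval: Duration,
    timeout: Duration,
) -> Result<u32, E> {
    let start = Instant::now();
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match probe() {
            Ok(()) => return Ok(attempts),
            // Give up only after a failed probe once the deadline has passed.
            Err(e) if start.elapsed() >= timeout => return Err(e),
            Err(_) => sleep(interval),
        }
    }
}

fn main() {
    // Simulated agent that only answers on the third ping.
    let mut calls = 0;
    let result = retry_until_ready(
        || {
            calls += 1;
            if calls >= 3 { Ok(()) } else { Err("no responders") }
        },
        Duration::from_millis(10),
        Duration::from_secs(1),
    );
    println!("{result:?}");
}
```

A stand-in like this only proves readiness for the test that calls it; the framework-wide contract from §3.2 would still have to live in the deploy path itself.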
Capability-naming rules from `CLAUDE.md`: Capability-naming rules from `CLAUDE.md`:
@@ -137,28 +147,9 @@ properly.
 ### 1.5 — Operator frontend dead-code warnings
-`cargo test` (and `cargo check`) emit ~34 warnings about
-unused trait + structs in
-`fleet/harmony-fleet-operator/src/service/{mod, mock}.rs`:
-`FleetService`, `DeviceDetail`, `DeploymentDetail`, etc. all
-marked `never used`. The maud+htmx frontend was committed as
-"initial commit, still much work to do." The views currently
-inline mock data instead of going through the `FleetService`
-trait.
-Decision needed before merge:
-- (a) Wire the trait into the views (real fix; preferred but
-more code).
-- (b) Add `#[allow(dead_code)]` at module level with a TODO that
-references this checklist.
-- (c) Delete the unused service abstraction and rebuild it when
-the views need real data.
-`cargo clippy` does not flag these — only `cargo check` does,
-because the dead-code lint emits during the bin compilation
-path, not the lib compilation path. So the warnings are
-real but easy to miss.
+✓ resolved. `MockFleetService` is now wired into the views;
+`cargo check -p harmony-fleet-operator --all-targets` is 0
+warnings. The "(a) wire the trait into the views" path landed.
### 1.6 — Untracked items decision ### 1.6 — Untracked items decision
@@ -255,35 +246,21 @@ For anyone landing on the PR cold:
## §3 — Known issues and deferred items ## §3 — Known issues and deferred items
### 3.1 — Code review comments on `harmony-fleet-deploy` (unaddressed) ### 3.1 — Code review comments on `harmony-fleet-deploy`
Three PR comments from the user remain open. They are real ✓ resolved (commit `34807511 feat: refactor fleet agent config
architectural problems, not nits: into a strongly typed struct, remove brittle string processing`):
- **`fleet/harmony-fleet-deploy/src/agent.rs::PodTarget`** is a - `PodTarget` now carries the typed `harmony_fleet_auth::
stringly-typed duplicate of `harmony-fleet-agent`'s AgentConfig` directly — no more stringly-typed duplicate.
`AgentConfig`. The deploy crate should depend on the agent's - `render_config_map` uses `toml::to_string(&cfg)`; tested to
config types (or a shared types crate) and use them directly round-trip TOML-special characters (`"`, `\`).
instead of redeclaring the schema as ad-hoc `String` fields. - `render_user_pass_values` is now `FleetNatsValues` + `serde_yaml
YAML-mud-pit in Rust clothing. ::to_string`; YAML-special characters escape correctly.
- **`fleet/harmony-fleet-deploy/src/agent.rs::render_config_map`** builds the agent's `config.toml` via `format!()` with Remaining follow-up (not a merge blocker): `harmony/src/modules/
manual quote-escaping. Any label value containing `"`, `\`, or nats/helm_chart.rs::NatsHelmChartScore::values_yaml` still takes
newline produces broken TOML. Fix is `toml::to_string(&typed_struct)?` once the type plumbing from the comment above is a raw `String`. Lifting that to typed values is a future cleanup.
in place.
- **`fleet/harmony-fleet-deploy/src/nats.rs::render_user_pass_values`** builds Helm values YAML via `format!()` with raw-string interpolation. Same class of bug. Fix: typed
`FleetNatsValues` struct (or a `serde_yaml::Value` tree) +
`serde_yaml::to_string`. The same anti-pattern is in
`harmony/src/modules/nats/helm_chart.rs::NatsHelmChartScore::values_yaml` (raw `String` field); lifting that to take typed
values is the harder follow-up, but worth scoping.
The user's framing of all three: *"it felt like a cheap
non-programmer crappy deployment patchwork script converted to
rust instead of a properly engineered deployment."* Fixing these
is a small PR (probably 200 lines including the typed structs
and tests). Should land before customer-facing v0.1, but not
necessarily before this branch merges to master.
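To make the bug class from the review comments concrete, here is a small std-only sketch. It is an illustration, not the crate's code: `render_naive` mimics the flagged `format!()` pattern, and `toml_basic_string` is a hypothetical hand-written escaper included only to show what a serializer has to do. The fix that actually landed uses `toml::to_string` on the typed config, which is preferable to maintaining an escaper by hand.

```rust
/// Naive rendering, as in the flagged `format!()` pattern: no escaping at all.
fn render_naive(key: &str, value: &str) -> String {
    format!("{key} = \"{value}\"")
}

/// Hypothetical helper: escape `value` as a TOML basic string.
fn toml_basic_string(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for c in value.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX escape form.
            c if c.is_control() => out.push_str(&format!("\\u{:04X}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    let label = "rack \"A\"\\north";
    // The naive line has unbalanced quotes and a stray backslash.
    println!("naive:   {}", render_naive("label", label));
    // The escaped line keeps the value inside one quoted string.
    println!("escaped: label = {}", toml_basic_string(label));
}
```

The same reasoning applies to the Helm values path: any value interpolated into YAML by `format!()` needs the equivalent treatment, which is what `serde_yaml::to_string` provides.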
### 3.2 — Smoke-test contract (ADR-023 principle 4) deferred ### 3.2 — Smoke-test contract (ADR-023 principle 4) deferred
@@ -324,12 +301,13 @@ message; no caller in this repo should hit it.
### 3.5 — Bash smoke scripts vs Rust harness ### 3.5 — Bash smoke scripts vs Rust harness
`fleet/scripts/smoke-a{1,3,3-arm,4}.sh` are the only end-to-end The Rust harness now covers what `smoke-a3.sh` and
harnesses that actually exercise the stack today. ADR-023 `smoke-a3-arm.sh` exercised — both aarch64 (production) and
principle 2 says "E2E uses the same Scores as production." The x86_64 (fast iteration) VM bring-up, podman deploy lifecycle,
bash scripts violate that. Migrate to `harmony-fleet-e2e`-based and ping. The bash scripts remain as operational reference but
Rust harnesses over time. Not a merge blocker — they're useful the new Rust path is the primary route. `smoke-a1.sh` / `smoke-
operational tools today. a4.sh` (which exercise other paths) still don't have Rust
equivalents — track for a follow-up PR.
--- ---
@@ -345,12 +323,13 @@ re-deriving from git log:
- **ADR-024 is the proposal for an Alternative-B capability - **ADR-024 is the proposal for an Alternative-B capability
decomposition**, extracted from `ROADMAP/fleet_platform/architecture_review.md` §§4-5. Marked `Status: Draft` because decomposition**, extracted from `ROADMAP/fleet_platform/architecture_review.md` §§4-5. Marked `Status: Draft` because
JG is not yet convinced. JG is not yet convinced.
- **The deploy crate's three review comments tie back to one - **The deploy crate's three review comments are resolved** (see
root cause**: values were authored as untyped strings, so the §3.1) by lifting `PodTarget` / `FleetNatsScore` values onto
speculative enum variants (`FleetAgentTarget::Vm` / typed structs serialised via `toml::to_string` /
`FleetNatsAuth::Callout`), the fixture-data defaults, and the `serde_yaml::to_string`. The speculative enum variants
PR-cycle text in error messages are all *consequences*. Fix (`FleetAgentTarget::Vm` / `FleetNatsAuth::Callout`) and
the type plumbing and the rest collapses. PR-cycle text in error messages remain — separate from the
three review comments, still flagged for review.
- **`harmony_config` test code now uses `tokio::sync::Mutex`** for - **`harmony_config` test code now uses `tokio::sync::Mutex`** for
the `ENV_LOCK` that guards process env vars across `#[tokio::test]` awaits. Was `std::sync::Mutex` held across `.await` — the `ENV_LOCK` that guards process env vars across `#[tokio::test]` awaits. Was `std::sync::Mutex` held across `.await` —
silent deadlock waiting to happen. silent deadlock waiting to happen.
@@ -372,20 +351,25 @@ re-deriving from git log:
## §5 — Working order ## §5 — Working order
When in doubt, do tasks roughly in this order: What's left between here and `git push origin master`:
1. **Now**: §1.2 (manual component verification). Block on 1. **Still manual, must verify before merge** — the four
anything that's broken there. remaining §1.2 rows (Zitadel, NATS+callout, Operator, Zitadel
2. **Now-ish**: §1.3 (drift review) and §1.4 (clippy-allow enrollment). Mark the matrix with date + initials.
audit). Either fix or file follow-ups. 2. **JG review calls** — §1.4 (clippy-allow audit), §1.6
3. **Before merge**: §1.5 (operator frontend dead code), §1.6 (untracked items: `dev.sh`, `style/dist/`, `manual_mint/`),
(untracked items), §3.4 (one-line note in merge commit §1.7 (ADR-024 accept/edit/reject/keep-as-draft), §1.8 (doc
message about the `harmony_secret` semantic). cleanup remainder).
4. **At review time**: JG decides on §1.7 (ADR-024) and §1.8 3. **Merge commit body** — §3.4 (one-line note about the
(doc cleanup remainder). `harmony_secret` default-store semantic change).
5. **After merge** (follow-up PRs): §3.1 (deploy crate type
plumbing), §3.2 (smoke-test contract), §3.3 (CI for e2e), After merge (follow-up PRs, not blockers):
§3.5 (bash → Rust harnesses), §1.8 (doc cohesion PR).
- §3.2 — smoke-test contract design.
- §3.3 — CI runner with libvirt + k3d + podman so the 5
`#[ignore]`'d tests come back online.
- §3.5 — Rust equivalents for `smoke-a1.sh` / `smoke-a4.sh`.
- ADR-024 migration if §1.7 lands as accept.
This list shrinks as items resolve. Edit in place; don't append This list shrinks as items resolve. Edit in place; don't append
a changelog. a changelog.


@@ -80,6 +80,12 @@ nats --server nats://localhost:30423 --user admin --password e2e-admin \
request "device-commands.vm-device-00-<uuid8>.ping" "" request "device-commands.vm-device-00-<uuid8>.ping" ""
``` ```
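Or, if you don't want to install the `nats` binary, define this alias and use `natsbox` in place of the `nats --server … --user … --password …` prefix: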
```
alias natsbox='podman run --network=host --rm docker.io/natsio/nats-box:latest nats --server nats://localhost:30423 --user admin --password e2e-admin'
```
You should see something like `{"device_id":"vm-device-00-<uuid8>","agent_version":"0.1.0","uptime_s":12}`. You should see something like `{"device_id":"vm-device-00-<uuid8>","agent_version":"0.1.0","uptime_s":12}`.
### Cleaning up ### Cleaning up


@@ -29,6 +29,8 @@ use harmony_fleet_deploy::{FleetAgentScore, FleetNatsScore, FleetOperatorScore,
name = "harmony-fleet-deploy", name = "harmony-fleet-deploy",
about = "Deploy the harmony fleet stack to a Kubernetes cluster" about = "Deploy the harmony fleet stack to a Kubernetes cluster"
)] )]
// TODO all env vars should be prefixed with HARMONY and k8s namespaces should begin with
// `harmony-` also
struct CliConfig { struct CliConfig {
/// Namespace every component lands in. Production override comes /// Namespace every component lands in. Production override comes
/// from `FLEET_NAMESPACE`. /// from `FLEET_NAMESPACE`.


@@ -92,6 +92,12 @@ impl FleetNatsScore {
/// callout. The defaults are deliberately weak (`admin/e2e-admin`, /// callout. The defaults are deliberately weak (`admin/e2e-admin`,
/// `device/e2e-device`); override with [`with_user_pass`]. /// `device/e2e-device`); override with [`with_user_pass`].
pub fn user_pass(namespace: impl Into<String>, node_port: u16) -> Self { pub fn user_pass(namespace: impl Into<String>, node_port: u16) -> Self {
// TODO this should be behind a feature flag, this code should not exist in the
// production build
//
// Actually to make it simpler I would hardcode the dev credentials in the e2e crate
// and not the deployment crate. The e2e crate can easily use the score and pass it the
// proper config or use `.with_user_pass(...)`
Self { Self {
namespace: namespace.into(), namespace: namespace.into(),
release_name: "fleet-nats".to_string(), release_name: "fleet-nats".to_string(),


@@ -23,7 +23,7 @@ src/
└── vm/ # VM-target harness └── vm/ # VM-target harness
├── stack.rs # VmStack = infra Stack + Vec<VmDevice> ├── stack.rs # VmStack = infra Stack + Vec<VmDevice>
├── device.rs # one libvirt VM: ProvisionVmScore + FleetDeviceSetupScore ├── device.rs # one libvirt VM: ProvisionVmScore + FleetDeviceSetupScore
├── agent_build.rs # cross-build the agent for aarch64-unknown-linux-gnu ├── agent_build.rs # build the agent for the requested guest arch (aarch64 cross / x86_64 native)
└── network.rs # libvirt default-network gateway IP discovery └── network.rs # libvirt default-network gateway IP discovery
``` ```
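As a rough illustration of what `agent_build.rs` has to get right per guest arch, the sketch below computes where the agent binary is expected to land. The two paths come from the doc comments in `agent_build.rs` in this PR; the `GuestArch` enum and `expected_agent_path` function are hypothetical names used only here (the real code uses `harmony::topology::VmArchitecture`).

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `harmony::topology::VmArchitecture`.
#[derive(Debug, Clone, Copy)]
enum GuestArch {
    Aarch64,
    X86_64,
}

/// Expected on-disk location of the agent binary for each build path.
fn expected_agent_path(arch: GuestArch, workspace_root: &Path) -> PathBuf {
    let target_dir = workspace_root.join("target");
    let profile_dir = match arch {
        // Cross-build: cargo nests the output under the target triple.
        GuestArch::Aarch64 => target_dir.join("aarch64-unknown-linux-gnu").join("release"),
        // Native build without `--target`: cargo's default output directory.
        GuestArch::X86_64 => target_dir.join("release"),
    };
    profile_dir.join("harmony-fleet-agent")
}

fn main() {
    let root = Path::new("/workspace");
    for arch in [GuestArch::Aarch64, GuestArch::X86_64] {
        println!("{arch:?}: {}", expected_agent_path(arch, root).display());
    }
}
```

Keeping the two outputs in separate directories is what lets the cross-build and the native build coexist in one `target/` tree without clobbering each other.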
@@ -32,9 +32,9 @@ Tests in `tests/` map 1:1 to scenarios:
 | File | What it asserts | Cost |
 |---|---|---|
 | `ping.rs` | Pod agent replies to `Verb::Ping` over NATS | ~30 s (k3d + image build) |
-| `vm_ping.rs` | VM agent replies to `Verb::Ping` over NATS | aarch64 VM bring-up |
-| `vm_isolation.rs` | VM agent does NOT react to another device's KV key | shared VM |
-| `vm_deploy_lifecycle.rs` | deploy → upgrade → delete podman deployment, KV phases + `podman ps` ground truth | shared VM + image pulls |
+| `vm_ping.rs` | VM agent replies to `Verb::Ping` over NATS | ~75 s (x86 KVM) / ~7 min (aarch64 TCG) |
+| `vm_isolation.rs` | VM agent does NOT react to another device's KV key | ~75 s (x86 KVM) / ~8 min (aarch64 TCG) |
+| `vm_deploy_lifecycle.rs` | deploy → upgrade → delete podman deployment, KV phases + `podman ps` ground truth | ~90 s (x86 KVM) / ~7-8 min (aarch64 TCG) |
## Env gates ## Env gates
@@ -43,8 +43,9 @@ Every test in this crate is gated so `cargo test --workspace` stays cheap.
 | Var | Purpose |
 |---|---|
 | `HARMONY_FLEET_E2E=1` | Enable the Pod-target test (`ping.rs`). Needs k3d + podman on PATH. |
-| `HARMONY_FLEET_VM_E2E=1` | Enable the VM-target tests (`vm_*`). Needs libvirt + qemu + aarch64 cross-toolchain. |
+| `HARMONY_FLEET_VM_E2E=1` | Enable the VM-target tests (`vm_*`). Needs libvirt + qemu (+ aarch64 cross-toolchain when running the default arch). |
 | `FLEET_E2E_KEEP=1` | Leave the k8s namespace + libvirt VM in place on test exit (debug). |
+| `FLEET_E2E_VM_ARCH=x86_64` | Boot an x86_64 KVM guest instead of an aarch64 TCG guest. Default `aarch64` (production target). x86 runs ~3-4× faster — useful for iteration. |
 | `RUST_LOG=...` | Standard tracing filter; default is `info`. |
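The gates in this table reduce to two small decisions: is the suite enabled, and which guest arch should boot. A minimal sketch of that logic follows, under stated assumptions: `GuestArch`, `parse_guest_arch`, and `gate_enabled` are hypothetical names, and rejecting unknown arch values with an error is a choice made for this sketch rather than something the README specifies.

```rust
use std::env;

/// Hypothetical stand-in for the harness's architecture type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuestArch {
    Aarch64,
    X86_64,
}

/// Map the raw `FLEET_E2E_VM_ARCH` value to an arch; unset means aarch64.
pub fn parse_guest_arch(raw: Option<&str>) -> Result<GuestArch, String> {
    match raw {
        None | Some("aarch64") => Ok(GuestArch::Aarch64),
        Some("x86_64") => Ok(GuestArch::X86_64),
        // Assumption: unknown values are rejected rather than silently defaulted.
        Some(other) => Err(format!("unsupported FLEET_E2E_VM_ARCH value: {other:?}")),
    }
}

/// A gate is on only when the variable is exactly `1`.
pub fn gate_enabled(raw: Option<&str>) -> bool {
    raw == Some("1")
}

fn main() {
    let vm_gate = gate_enabled(env::var("HARMONY_FLEET_VM_E2E").ok().as_deref());
    let arch = parse_guest_arch(env::var("FLEET_E2E_VM_ARCH").ok().as_deref());
    println!("vm tests enabled: {vm_gate}, guest arch: {arch:?}");
}
```

Taking `Option<&str>` instead of reading the environment inside the functions keeps the parsing testable without mutating process-wide env vars.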
## Running tests ## Running tests
@@ -55,25 +56,69 @@ Every test in this crate is gated so `cargo test --workspace` stays cheap.
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping -- --nocapture HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping -- --nocapture
``` ```
### VM-target (expensive, real podman + aarch64 boot) ### VM-target — pick aarch64 (prod parity) or x86_64 (fast iteration)
The same three tests run against either guest arch — flip
`FLEET_E2E_VM_ARCH`. Defaults to `aarch64` (Raspberry Pi target).
| Path | Guest CPU | Wall-clock for `vm_ping` (warm caches) | Use when |
|---|---|---|---|
| `FLEET_E2E_VM_ARCH=x86_64` | native KVM | **~75 s** | dev iteration loop |
| (default, `aarch64`) | qemu TCG emulation | **~7 min** | pre-push / CI / arch-drift catch |
CI **must** run aarch64 — even though x86 covers the logic, a new
crate dep with a broken aarch64 build or a podman call that segfaults
under TCG will only surface on the real target.
```bash ```bash
# One scenario at a time. Each test binary brings up its own VM # ---- dev iteration loop (x86_64 KVM, ~3× faster end-to-end) ----
# (cargo runs each integration test file as a separate binary, so the HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
# per-binary `shared_vm_stack` OnceCell does not amortize across binaries). cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture
# All three sequentially: # ---- pre-push / CI (aarch64 — production target) ----
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info cargo test -p harmony-fleet-e2e \ HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_ping -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_isolation -- --nocapture
HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e --test vm_deploy_lifecycle -- --nocapture
# ---- all three sequentially (each is a separate binary → its own VM bring-up) ----
HARMONY_FLEET_VM_E2E=1 FLEET_E2E_VM_ARCH=x86_64 RUST_LOG=info cargo test -p harmony-fleet-e2e \
--test vm_ping --test vm_isolation --test vm_deploy_lifecycle -- --nocapture --test-threads=1 --test vm_ping --test vm_isolation --test vm_deploy_lifecycle -- --nocapture --test-threads=1
# Everything in the crate at once (skips disabled, runs enabled): # ---- everything in the crate at once (pod + vm, gates honored per-test) ----
HARMONY_FLEET_E2E=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \ HARMONY_FLEET_E2E=1 HARMONY_FLEET_VM_E2E=1 RUST_LOG=info \
cargo test -p harmony-fleet-e2e -- --nocapture --test-threads=1 cargo test -p harmony-fleet-e2e -- --nocapture --test-threads=1
``` ```
### Wall-clock breakdown (measured on this host)
`vm_ping` from cold libvirt + cold cargo cache (one-time pain) to a
green test:
| Step | aarch64 TCG | x86_64 KVM | Speedup |
|---|---|---|---|
| Agent build (cold) | 85 s (cross) | 72 s (native) | 1.2× |
| qemu start → DHCP | 48 s | 9 s | 5.3× |
| sshd accepts | 9 s | <1 s | 10× |
| Ansible Python detect | 15 s | 1 s | 15× |
| `apt install podman + systemd-container` | **261 s** | **23 s** | **11.3×** |
| FleetDeviceSetup steps 3-7 + restart | ~50 s | ~4 s | ~12× |
| `wait_until_ready` ping retry | ~2 s | <1 s | 2× |
| **Total test future (`finished in …s`)** | **440 s** | **149 s** | **2.95×** |
The single biggest swing is `apt install podman` inside the guest:
4 min 21 s on TCG vs 23 s on KVM. The whole-test speedup is only 2.95×
because the cold cargo cross-build and the cold native build cost about
the same (~80 s either way); the in-guest work is where the x86 path
collapses. **Warm-cache iteration is closer to 6× because the cargo
build vanishes.**
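The speedup column is plain division, which the short sketch below reproduces for a few rows. The cold numbers are copied from the breakdown table; the warm-cache row uses the README's rounded `vm_ping` figures (~7 min taken as 420 s, and ~75 s), so that ratio is approximate.

```rust
/// Speedup of x86_64 KVM over aarch64 TCG for one step, given seconds for each.
fn speedup(aarch64_s: f64, x86_64_s: f64) -> f64 {
    aarch64_s / x86_64_s
}

fn main() {
    // (step, aarch64 TCG seconds, x86_64 KVM seconds)
    let rows = [
        ("agent build (cold)", 85.0, 72.0),
        ("qemu start -> DHCP", 48.0, 9.0),
        ("apt install podman", 261.0, 23.0),
        ("total test future", 440.0, 149.0),
        // Approximate: rounded warm-cache figures for `vm_ping`.
        ("vm_ping, warm caches", 420.0, 75.0),
    ];
    for (step, a, x) in rows {
        println!("{step:<22} {:>6.2}x", speedup(a, x));
    }
}
```

The cold total lands just under 3× while the warm-cache row lands between 5× and 6×, which matches the claim that the build step is what dilutes the cold-run speedup.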
### Debugging a failed bring-up ### Debugging a failed bring-up
```bash ```bash
@@ -138,6 +183,3 @@ bring-up.
`FleetNatsScore::user_pass` mode. The Zitadel-JWT path is `FleetNatsScore::user_pass` mode. The Zitadel-JWT path is
exercised by `examples/fleet_e2e_demo` (currently `#[ignore]`'d exercised by `examples/fleet_e2e_demo` (currently `#[ignore]`'d
pending a CI runner with full bring-up capacity). pending a CI runner with full bring-up capacity).
- **x86_64 VM bring-up.** Locked to aarch64 because that's the
production target. An x86_64 fast-path can be added by widening
`VmStackOptions::arch`; out of scope today.


@@ -1,26 +1,31 @@
//! Cross-build the fleet agent binary for an aarch64 Linux guest. //! Build the fleet agent binary for a target VM architecture.
//! //!
//! Mirrors `fleet/scripts/smoke-a3-arm.sh` phase 2 in Rust: ensure //! Two paths:
//! the `aarch64-unknown-linux-gnu` rustup target is installed, then
//! `cargo build --release --target aarch64-unknown-linux-gnu -p
//! harmony-fleet-agent`. Returns the path to the resulting binary
//! so `FleetDeviceSetupScore` can upload it.
//! //!
//! Prereq the harness intentionally does **not** install for the //! - **aarch64** — cross-build via `cargo build --release --target
//! operator: a working aarch64 GNU cross-toolchain on the host //! aarch64-unknown-linux-gnu -p harmony-fleet-agent`. Requires the
//! (Arch: `aarch64-linux-gnu-gcc`; Debian/Ubuntu: //! `aarch64-unknown-linux-gnu` rustup target *and* a GNU cross-linker
//! `gcc-aarch64-linux-gnu`). Without it, `cargo build` fails with //! on the host (Arch: `aarch64-linux-gnu-gcc`; Debian/Ubuntu:
//! a link error we surface verbatim. //! `gcc-aarch64-linux-gnu`). Mirrors `fleet/scripts/smoke-a3-arm.sh`
//! phase 2.
//! - **x86_64** — native host build via `cargo build --release -p
//! harmony-fleet-agent`. No `--target`, no rustup add, no
//! cross-linker. The same binary the Pod-target path consumes,
//! reused here for the faster-but-non-Pi VM smoke.
//!
//! The aarch64 path matches the production Raspberry Pi target byte
//! for byte; the x86_64 path is for fast-iteration tests where the
//! arch difference doesn't matter.
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use std::process::Stdio; use std::process::Stdio;
use harmony::topology::VmArchitecture;
use thiserror::Error; use thiserror::Error;
use tokio::process::Command; use tokio::process::Command;
/// Rust target triple used for the on-VM agent. aarch64-Linux-GNU /// Rust target triple for the aarch64 cross-build.
/// matches the Ubuntu 24.04 cloud image the harness boots. pub const AGENT_AARCH64_TARGET_TRIPLE: &str = "aarch64-unknown-linux-gnu";
pub const AGENT_TARGET_TRIPLE: &str = "aarch64-unknown-linux-gnu";
#[derive(Debug, Error)] #[derive(Debug, Error)]
pub enum AgentBuildError { pub enum AgentBuildError {
@@ -30,24 +35,36 @@ pub enum AgentBuildError {
#[source] #[source]
source: std::io::Error, source: std::io::Error,
}, },
#[error("`rustup target add {AGENT_TARGET_TRIPLE}` failed (rc={rc}): {stderr}")] #[error("`rustup target add {AGENT_AARCH64_TARGET_TRIPLE}` failed (rc={rc}): {stderr}")]
RustupAdd { rc: i32, stderr: String }, RustupAdd { rc: i32, stderr: String },
#[error( #[error(
"`cargo build` for harmony-fleet-agent (target {AGENT_TARGET_TRIPLE}) failed (rc={rc}). \ "`cargo build` for harmony-fleet-agent (target {target}) failed (rc={rc}). \
The most common cause is a missing aarch64 GNU cross-linker — install one (Arch: \ For the aarch64 cross-build, the most common cause is a missing GNU cross-linker \
`aarch64-linux-gnu-gcc`; Debian/Ubuntu: `gcc-aarch64-linux-gnu`) and re-run." (Arch: `aarch64-linux-gnu-gcc`; Debian/Ubuntu: `gcc-aarch64-linux-gnu`)."
)] )]
CargoBuild { rc: i32 }, CargoBuild { target: String, rc: i32 },
#[error("agent binary not produced at expected path {path}")] #[error("agent binary not produced at expected path {path}")]
MissingArtifact { path: String }, MissingArtifact { path: String },
} }
/// Build (or rebuild, cargo-cached) the aarch64 agent binary and /// Build the fleet agent for the requested guest architecture and
/// return its on-disk path. Cheap on warm cache; first run is the /// return its on-disk path. Routes to the arch-specific builder.
/// expensive one. pub async fn build_agent_for(
arch: VmArchitecture,
workspace_root: &Path,
) -> Result<PathBuf, AgentBuildError> {
match arch {
VmArchitecture::Aarch64 => build_agent_for_aarch64(workspace_root).await,
VmArchitecture::X86_64 => build_agent_for_x86_64(workspace_root).await,
}
}
/// Cross-build for aarch64-Linux-GNU. The on-disk path lives under
/// `target/aarch64-unknown-linux-gnu/release/` so it doesn't collide
/// with the host's native build.
pub async fn build_agent_for_aarch64(workspace_root: &Path) -> Result<PathBuf, AgentBuildError> { pub async fn build_agent_for_aarch64(workspace_root: &Path) -> Result<PathBuf, AgentBuildError> {
let rustup = Command::new("rustup") let rustup = Command::new("rustup")
.args(["target", "add", AGENT_TARGET_TRIPLE]) .args(["target", "add", AGENT_AARCH64_TARGET_TRIPLE])
.stdout(Stdio::null()) .stdout(Stdio::null())
.stderr(Stdio::piped()) .stderr(Stdio::piped())
.output() .output()
@@ -64,22 +81,19 @@ pub async fn build_agent_for_aarch64(workspace_root: &Path) -> Result<PathBuf, A
} }
tracing::info!( tracing::info!(
target = AGENT_TARGET_TRIPLE, target = AGENT_AARCH64_TARGET_TRIPLE,
"cargo build --release -p harmony-fleet-agent (cross-build)", "cargo build --release -p harmony-fleet-agent (cross-build aarch64)",
); );
let build = Command::new("cargo") let build = Command::new("cargo")
.args([ .args([
"build", "build",
"--release", "--release",
"--target", "--target",
AGENT_TARGET_TRIPLE, AGENT_AARCH64_TARGET_TRIPLE,
"-p", "-p",
"harmony-fleet-agent", "harmony-fleet-agent",
]) ])
.current_dir(workspace_root) .current_dir(workspace_root)
// Inherit stderr so cargo's progress + any linker error
// lands on the test runner's console exactly as it would
// on the command line.
.stderr(Stdio::inherit()) .stderr(Stdio::inherit())
.stdout(Stdio::inherit()) .stdout(Stdio::inherit())
.status() .status()
@@ -90,13 +104,51 @@ pub async fn build_agent_for_aarch64(workspace_root: &Path) -> Result<PathBuf, A
})?; })?;
if !build.success() { if !build.success() {
return Err(AgentBuildError::CargoBuild { return Err(AgentBuildError::CargoBuild {
target: AGENT_AARCH64_TARGET_TRIPLE.to_string(),
rc: build.code().unwrap_or(-1),
});
}
let bin = workspace_root
.join("target")
.join(AGENT_AARCH64_TARGET_TRIPLE)
.join("release")
.join("harmony-fleet-agent");
if !bin.exists() {
return Err(AgentBuildError::MissingArtifact {
path: bin.display().to_string(),
});
}
Ok(bin)
}
/// Native build for x86_64. No `rustup target add` and no `--target`
/// flag: this assumes the test harness runs on an x86_64 host, so
/// cargo's default output at `target/release/harmony-fleet-agent` is
/// exactly what we want. On a non-x86_64 host the native build targets
/// the host arch instead, and the binary won't run in the x86_64 guest.
pub async fn build_agent_for_x86_64(workspace_root: &Path) -> Result<PathBuf, AgentBuildError> {
tracing::info!("cargo build --release -p harmony-fleet-agent (native x86_64)");
let build = Command::new("cargo")
.args(["build", "--release", "-p", "harmony-fleet-agent"])
.current_dir(workspace_root)
.stderr(Stdio::inherit())
.stdout(Stdio::inherit())
.status()
.await
.map_err(|source| AgentBuildError::Spawn {
cmd: "cargo".to_string(),
source,
})?;
if !build.success() {
return Err(AgentBuildError::CargoBuild {
target: "x86_64-unknown-linux-gnu (native)".to_string(),
rc: build.code().unwrap_or(-1), rc: build.code().unwrap_or(-1),
}); });
} }
let bin = workspace_root let bin = workspace_root
.join("target") .join("target")
.join(AGENT_TARGET_TRIPLE)
.join("release") .join("release")
.join("harmony-fleet-agent"); .join("harmony-fleet-agent");
if !bin.exists() { if !bin.exists() {
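The two builders differ mainly in where cargo writes the artifact: a cross-build lands under `target/<triple>/release/`, while a native build lands under `target/release/`. A minimal sketch of that path logic follows, assuming a simplified `Arch` enum standing in for `VmArchitecture`. The helpers `cargo_target` and `agent_artifact_path` are hypothetical and not part of this PR.

```rust
use std::path::{Path, PathBuf};

/// Simplified stand-in for the PR's `VmArchitecture` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arch {
    Aarch64,
    X86_64,
}

/// Triple to pass to `cargo build --target`, or `None` for a native build.
/// Mirrors the diff: aarch64 is cross-built, x86_64 is built natively
/// (which is only correct on an x86_64 host).
pub fn cargo_target(arch: Arch) -> Option<&'static str> {
    match arch {
        Arch::Aarch64 => Some("aarch64-unknown-linux-gnu"),
        Arch::X86_64 => None,
    }
}

/// Hypothetical helper: where cargo places the release binary for `arch`.
pub fn agent_artifact_path(workspace_root: &Path, arch: Arch) -> PathBuf {
    let mut path = workspace_root.join("target");
    // Cross-builds get an extra `<triple>/` directory; native builds do not.
    if let Some(triple) = cargo_target(arch) {
        path.push(triple);
    }
    path.join("release").join("harmony-fleet-agent")
}
```

Computing the path in one place like this would let both builders share the `MissingArtifact` check; whether that refactor is worth it here is a judgment call for the author.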


@@ -22,10 +22,13 @@ pub mod device;
pub mod network; pub mod network;
pub mod stack; pub mod stack;
pub use agent_build::{AGENT_TARGET_TRIPLE, AgentBuildError, build_agent_for_aarch64}; pub use agent_build::{
AGENT_AARCH64_TARGET_TRIPLE, AgentBuildError, build_agent_for, build_agent_for_aarch64,
build_agent_for_x86_64,
};
pub use device::{VmDevice, VmDeviceError, VmDeviceOptions}; pub use device::{VmDevice, VmDeviceError, VmDeviceOptions};
pub use network::{NetworkLookupError, libvirt_default_gateway_ip}; pub use network::{NetworkLookupError, libvirt_default_gateway_ip};
pub use stack::{ pub use stack::{
LIBVIRT_NETWORK, LIBVIRT_URI, VM_NAME_PREFIX, VmBringUpError, VmReadyError, VmStack, ENV_VM_ARCH, LIBVIRT_NETWORK, LIBVIRT_URI, VM_NAME_PREFIX, VmBringUpError, VmReadyError,
VmStackOptions, shared_vm_stack, VmStack, VmStackOptions, shared_vm_stack,
}; };


@@ -27,7 +27,7 @@ use tokio::sync::OnceCell;
use uuid::Uuid; use uuid::Uuid;
use crate::stack::{BringUpError, NATS_NODE_PORT, Stack, StackOptions, shared_stack}; use crate::stack::{BringUpError, NATS_NODE_PORT, Stack, StackOptions, shared_stack};
use crate::vm::agent_build::{AgentBuildError, build_agent_for_aarch64}; use crate::vm::agent_build::{AgentBuildError, build_agent_for};
use crate::vm::device::{VmDevice, VmDeviceError, VmDeviceOptions}; use crate::vm::device::{VmDevice, VmDeviceError, VmDeviceOptions};
use crate::vm::network::{NetworkLookupError, libvirt_default_gateway_ip}; use crate::vm::network::{NetworkLookupError, libvirt_default_gateway_ip};
@@ -82,11 +82,34 @@ impl Default for VmStackOptions {
} }
} }
/// Env var that lets tests pick a guest arch at runtime without a
/// recompile. Accepts `aarch64`/`arm64` and `x86_64`/`x86-64`/`x86`/`amd64`,
/// case-insensitively. When unset, defaults to aarch64 (production target).
pub const ENV_VM_ARCH: &str = "FLEET_E2E_VM_ARCH";
impl VmStackOptions {
/// Read env overrides (today: just [`ENV_VM_ARCH`]) and apply
/// them on top of [`Default`]. Returns the canonical "what the
/// test asked for" struct, so tests don't have to re-implement
/// env parsing. Panics if the variable is set to an unrecognized value.
pub fn from_env() -> Self {
let mut opts = Self::default();
if let Ok(raw) = std::env::var(ENV_VM_ARCH) {
match raw.to_ascii_lowercase().as_str() {
"aarch64" | "arm64" => opts.arch = VmArchitecture::Aarch64,
"x86_64" | "x86-64" | "x86" | "amd64" => opts.arch = VmArchitecture::X86_64,
other => panic!("{ENV_VM_ARCH}={other:?} not recognized — use aarch64 or x86_64"),
}
}
opts
}
}
#[derive(Debug, Error)] #[derive(Debug, Error)]
pub enum VmBringUpError { pub enum VmBringUpError {
#[error("infra bring-up: {0}")] #[error("infra bring-up: {0}")]
Infra(#[from] BringUpError), Infra(#[from] BringUpError),
#[error("aarch64 agent cross-build: {0}")] #[error("agent build: {0}")]
AgentBuild(#[from] AgentBuildError), AgentBuild(#[from] AgentBuildError),
#[error("libvirt gateway IP discovery: {0}")] #[error("libvirt gateway IP discovery: {0}")]
GatewayIp(#[from] NetworkLookupError), GatewayIp(#[from] NetworkLookupError),
@@ -154,9 +177,11 @@ impl VmStack {
// place. // place.
let infra = shared_stack(StackOptions::infra_only()).await?; let infra = shared_stack(StackOptions::infra_only()).await?;
// 2. Cross-build the aarch64 agent binary once for all VMs. // 2. Build the agent binary for the requested guest arch.
// aarch64 cross-builds; x86_64 takes the host's native
// output.
let workspace_root = workspace_root_from_env(); let workspace_root = workspace_root_from_env();
let agent_binary = build_agent_for_aarch64(&workspace_root).await?; let agent_binary = build_agent_for(opts.arch, &workspace_root).await?;
// 3. Discover the libvirt gateway IP so the VM can reach // 3. Discover the libvirt gateway IP so the VM can reach
// the host's NATS NodePort. // the host's NATS NodePort.
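`VmStackOptions::from_env` panics on an unrecognized value, which is reasonable for a test harness. If a fallible variant were ever wanted, the same alias table could return a `Result` instead. The sketch below is an illustration only: `Arch`, `UnknownArch`, and `parse_arch` are hypothetical names, not part of this PR.

```rust
use std::fmt;

/// Simplified stand-in for the PR's `VmArchitecture` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arch {
    Aarch64,
    X86_64,
}

/// Hypothetical error carrying the rejected raw value.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownArch(pub String);

impl fmt::Display for UnknownArch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "FLEET_E2E_VM_ARCH={:?} not recognized, use aarch64 or x86_64",
            self.0
        )
    }
}

impl std::error::Error for UnknownArch {}

/// Parse the raw env value. `None` (variable unset) falls back to aarch64,
/// matching the default described in the diff. Aliases are the ones the
/// diff's `from_env` accepts, compared case-insensitively.
pub fn parse_arch(raw: Option<&str>) -> Result<Arch, UnknownArch> {
    let raw = match raw {
        Some(value) => value,
        None => return Ok(Arch::Aarch64),
    };
    match raw.to_ascii_lowercase().as_str() {
        "aarch64" | "arm64" => Ok(Arch::Aarch64),
        "x86_64" | "x86-64" | "x86" | "amd64" => Ok(Arch::X86_64),
        _ => Err(UnknownArch(raw.to_string())),
    }
}
```

A caller could feed it with `parse_arch(std::env::var("FLEET_E2E_VM_ARCH").ok().as_deref())`. Taking the raw value as a parameter keeps the parsing testable without mutating the process environment.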


@@ -51,7 +51,7 @@ async fn vm_agent_drives_full_deploy_lifecycle() -> anyhow::Result<()> {
) )
.try_init(); .try_init();
let stack = shared_vm_stack(VmStackOptions::default()).await?; let stack = shared_vm_stack(VmStackOptions::from_env()).await?;
stack.print_debug_info(); stack.print_debug_info();
stack.wait_until_ready(Duration::from_secs(60)).await?; stack.wait_until_ready(Duration::from_secs(60)).await?;


@@ -50,7 +50,7 @@ async fn agent_ignores_other_devices_keys() -> anyhow::Result<()> {
) )
.try_init(); .try_init();
let stack = shared_vm_stack(VmStackOptions::default()).await?; let stack = shared_vm_stack(VmStackOptions::from_env()).await?;
stack.print_debug_info(); stack.print_debug_info();
stack.wait_until_ready(Duration::from_secs(60)).await?; stack.wait_until_ready(Duration::from_secs(60)).await?;


@@ -37,7 +37,7 @@ async fn agent_on_vm_replies_to_ping() -> anyhow::Result<()> {
) )
.try_init(); .try_init();
let stack = shared_vm_stack(VmStackOptions::default()).await?; let stack = shared_vm_stack(VmStackOptions::from_env()).await?;
stack.print_debug_info(); stack.print_debug_info();
// `FleetDeviceSetupScore` returns when the systemd unit is // `FleetDeviceSetupScore` returns when the systemd unit is