Jean-Gabriel Gill-Couture 1d453dd9aa feat(e2e-demo): VM-based rehearsal harness + /etc/hosts injection
Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's
existing pieces (Zitadel + auth callout deploy) with per-device
machine-user provisioning (one ZitadelSetupScore call per VM) and
FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness
expects pre-provisioned libvirt VMs (one per device) reachable via
`FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via
ProvisionVmScore is a follow-up, which keeps the harness observable
in pieces during tomorrow's cold-start debugging.
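As a rough illustration of the `FLEET_E2E_VM_<i>_IP` convention, the
sketch below resolves one address per device. The helper names
(`vm_ip_var`, `collect_vm_ips`), the zero-based index, and the error
strings are assumptions made for this example, not the harness's
actual code. The lookup is passed in as a closure so the logic can be
exercised without touching the process environment.

```rust
use std::net::IpAddr;

/// Builds the variable name for device `i`, following the
/// `FLEET_E2E_VM_<i>_IP` pattern. Zero-based indexing is an assumption.
fn vm_ip_var(i: usize) -> String {
    format!("FLEET_E2E_VM_{i}_IP")
}

/// Resolves one IP per device through `lookup` (hypothetical helper).
/// Fails on the first variable that is missing or not a valid IP.
fn collect_vm_ips<F>(count: usize, lookup: F) -> Result<Vec<IpAddr>, String>
where
    F: Fn(&str) -> Option<String>,
{
    (0..count)
        .map(|i| {
            let name = vm_ip_var(i);
            let raw = lookup(&name).ok_or_else(|| format!("{name} is not set"))?;
            // Tolerate stray whitespace from shell exports.
            raw.trim()
                .parse::<IpAddr>()
                .map_err(|e| format!("{name}={raw:?} is not an IP address: {e}"))
        })
        .collect()
}
```

Against the real environment this would be called as
`collect_vm_ips(2, |k| std::env::var(k).ok())`.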

Constituent helpers in `fleet_auth_callout::lib.rs` flipped from
private to `pub` (deploy_zitadel, wait_for_zitadel_ready,
ensure_issuer_seed, build_and_load_callout_image, etc.) so the new
harness composes them rather than re-implementing.

`bring_up_full_stack`:
1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d).
2. Deploy Zitadel + Postgres.
3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the
   chart-provisioned `iam-admin-pat` secret. (The last wait is new and
   load-bearing: without it ZitadelSetupScore races the chart's
   setup job and fails on the first cold run.)
4. ZitadelSetupScore for project + API app + roles + admin
   machine-user (admin gets fleet-admin role grant).
5. Issuer NKey from a persisted secret + NATS deploy with
   auth_callout block + callout pod.
6. For each device i: per-device ZitadelSetupScore (machine-user
   with `device` role grant), pull the JSON keyfile from cache,
   render the agent's TOML with the keyfile path. (FleetDeviceSetupScore
   invocation is wired structurally; the SSH-and-apply step is
   gated behind the VM provisioning follow-up.)
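
The waits in step 3 (Zitadel HTTP, then the `iam-admin-pat` secret)
both amount to polling a probe until it succeeds or a deadline passes.
A minimal sketch of that pattern follows. `wait_until` and its
parameters are hypothetical names for illustration, and the real
harness may well be async or use a different retry policy.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `probe` until it yields a value or `timeout` elapses.
/// Hypothetical helper: name, signature, and error text are assumptions.
fn wait_until<T, F>(
    what: &str,
    timeout: Duration,
    interval: Duration,
    mut probe: F,
) -> Result<T, String>
where
    F: FnMut() -> Option<T>,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Probe first so an already-ready resource returns immediately.
        if let Some(value) = probe() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {timeout:?} waiting for {what}"));
        }
        sleep(interval);
    }
}
```

Gating ZitadelSetupScore on such a wait is what removes the race with
the chart's setup job: the score only starts once the probe has seen
the secret.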

`HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so
VMs on a libvirt NAT can resolve `sso.fleet.local` to the host
gateway. Managed-block markers in /etc/hosts make the merge
idempotent across re-runs and removable when entries are dropped
from the score. Four new unit tests cover the merge invariants
(insert, replace, strip, byte-stable).
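
To make the managed-block idea concrete, here is a minimal sketch of
such a merge. It reuses the names `HostsEntry` and `merge_hosts_file`
from this commit, but the struct fields, the marker strings, and the
function signature are guesses for illustration and may not match the
real implementation.

```rust
// Marker text is hypothetical; the real markers are not shown here.
const BEGIN: &str = "# BEGIN managed-hosts (hypothetical marker)";
const END: &str = "# END managed-hosts (hypothetical marker)";

/// Assumed shape: one IP mapped to one hostname.
struct HostsEntry {
    ip: String,
    hostname: String,
}

/// Drops any previous managed block, then appends a fresh one.
/// With no entries, the block is removed entirely.
fn merge_hosts_file(existing: &str, entries: &[HostsEntry]) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            l if l == BEGIN => in_block = true,
            l if l == END => in_block = false,
            _ if in_block => {} // stale managed line: skip it
            _ => out.push(line.to_string()),
        }
    }
    if !entries.is_empty() {
        out.push(BEGIN.to_string());
        for e in entries {
            out.push(format!("{} {}", e.ip, e.hostname));
        }
        out.push(END.to_string());
    }
    let mut merged = out.join("\n");
    if !merged.is_empty() {
        merged.push('\n');
    }
    merged
}
```

Because the old block is always stripped before the new one is
written, running the merge twice with the same entries yields the same
bytes, and passing an empty entry list restores the unmanaged file.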

Test skeleton in `tests/e2e_walking_skeleton.rs`:
- `both_devices_heartbeat_within_60s` — implemented; reads from
  device-info KV via admin token.
- `admin_jwt_reads_any_device_subject` — implemented; subscribes
  to `device-state.>` as admin.
- `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending
  per-device-key plumbing through E2eHandles.
- `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending
  the NATS-pod-restart driver.

The two `#[ignore]`d tests cover the load-bearing reconnect and
isolation invariants. Wiring them is the morning-of-rehearsal
priority since those are the customer-facing claims.

Out of scope of this commit (called out in the roadmap doc):
- ProvisionVmScore integration (today the operator runs fleet_vm_setup
  out-of-band).
- Operator install via Helm (smoke-a4 runs operator host-side; this
  harness inherits that pattern).
- Full SSH-based agent install via FleetDeviceSetupScore — Score
  built, invocation gated.
2026-05-03 17:07:40 -04:00