The agent's periodic reconcile destroys-and-recreates any service whose ContainerSpec has env or volumes, every 30s tick. Root cause: matches_spec returns false unconditionally for those fields because podman's list endpoint doesn't surface them; the original author chose to declare "any spec with state is drifted" as a fail-safe. That fail-safe weaponizes the polling reconciler into a loop. Tags the offending line with a multi-paragraph FIXME explaining the symptom, the root cause, the proposed fix (containers.inspect + structural compare + an integration test), and the demo-time workaround (keep demo specs trivial — the hello-web nginx demo already is). Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's known-risks section so it's visible at planning time. Out of scope for tonight; in scope for delivery alongside the upcoming health-check support on ContainerSpec.
V0 Demo End-to-End — VM-Based Rehearsal
48-hour customer demo prep. The PO assessment from
memory/feedback_* and the prior planning discussion concluded that
shipping the customer demo against an untested OKD path is reckless.
This doc plans the VM-based rehearsal that proves the JWT-auth
chain end-to-end before we touch a real cluster.
Why VM, not OKD
Smoke-a4 already greens the chain k3d + in-cluster NATS + libvirt ARM VM + agent + apply CR + reconcile podman + status reflect-back
on x86_64 and aarch64. Zero new infra; we extend the existing
harness with Zitadel + auth callout + agent JWT auth.
Same Helm charts, same Scores, same agent code paths as production. Only the cluster topology differs (k3d/traefik vs OKD/HAProxy). The remaining OKD-specific deltas — Route annotations, edge-TLS, real DNS — are small and testable in isolation after the VM smoke is green.
Compared to validating directly against OKD:
- Local + reproducible: same `cargo run` runs on any dev machine with podman + libvirt + k3d.
- Fast iteration: bring-up is ~12-15 min cold, ~30s warm. We fix integration bugs in minutes, not "wait for cluster admin" hours.
- CI-able: greens in a single `cargo test` invocation, so we prevent regressions post-demo.
What this rehearsal proves
- `ZitadelScore`'s `FirstInstance.Org.Machine.Pat` block actually causes the chart to provision the `iam-admin-pat` secret (we added the Helm config, never confirmed the secret materialises).
- `ZitadelSetupScore::ensure_machine_user` reaches a working JSON keyfile when called outside its k3d unit tests.
- The agent's `CredentialSource::ZitadelJwt` mints a token, that token actually authenticates against the auth callout, and the callout admits it into the `DEVICES` account.
- async-nats's auto-reconnect-with-auth-callback fires fresh tokens on real NATS pod restart: the load-bearing "never lose connectivity to a device" guarantee.
- The full operator → NATS KV → agent → podman → status-back-to-CR loop survives the credential-source rewrite.
- Container env / volumes / restart policy land on the real podman instance, not just in unit tests.
What it does NOT prove (deferred, accepted)
- OKD HAProxy edge-TLS termination on the Zitadel and NATS-WSS Routes. Tested separately in a follow-up smoke once the VM smoke is green.
- Real DNS resolution from a customer LAN. We inject `/etc/hosts` entries on each VM so `sso.fleet.local` resolves to the libvirt host.
- Browser-driven device-code SSO (`fleet_sso_login` is compile-only today). Out of scope for this rehearsal: admin verification uses an injected machine-user token via JWT-bearer (same as `examples/fleet_auth_callout`).
- Customer's docker-compose translation. Manual at the call.
Architecture
k3d cluster (host)
┌─────────────────────────────────────────────────┐
│ Zitadel + Postgres http://sso.fleet.local │
│ │ (host:8080) │
│ │ project + roles + per-device users │
│ ▼ │
│ ZitadelSetupScore cache → keyfiles (per VM) │
│ │
│ NATS (auth_callout) nats://<host>:30422 │
│ ▲ │
│ │ JWT-bearer via callout │
│ fleet-callout pod │
│ │
│ fleet-operator → KV writes desired-state │
│ ▲ │
│ │ kube apply Deployment CR │
└──────┼──────────────────────────────────────────┘
│
┌──────┼──────────────────────────────────────────┐
│ libvirt default NAT (host = 192.168.122.1) │
└──────┼──────────────────────────────────────────┘
▼
┌──────────────┐ ┌──────────────┐
│ device-A │ │ device-B │ (cloud-init Ubuntu VMs)
│ fleet-agent │ │ fleet-agent │
│ + Zitadel │ │ + Zitadel │
│ JWT key │ │ JWT key │
│ + podman │ │ + podman │
└──────────────┘ └──────────────┘
Bring-up sequence
- Ensure k3d cluster `fleet-e2e-demo` (port mappings 8080→80, 30422→30422; same as fleet_auth_callout).
- Reuse `fleet_auth_callout::bring_up_stack` constituent functions:
  - Deploy Zitadel + Postgres
  - Wait for `iam-admin-pat` secret to materialise
  - Provision project `fleet`, API app, roles `fleet-admin` + `device`
- Install fleet operator from its Helm chart (Chapter 3 ships this).
- Generate issuer NKey, deploy NATS with `auth_callout` block, deploy `NatsAuthCalloutScore` (image side-loaded into k3d).
- For each device i in 1..=num_devices:
  - Mint Zitadel machine user `device-${device_id_i}` with the `device` role grant via `ZitadelSetupScore`. Cache the JSON key.
  - Provision libvirt VM via `ProvisionVmScore` (cloud-init Ubuntu, x86_64).
  - SSH in via `LinuxHostTopology`. Inject `/etc/hosts`: `<host_ip> sso.fleet.local`.
  - Run `FleetDeviceSetupScore` with `FleetDeviceAuth::ZitadelJwt { machine_key_json, ... }`.
- Mint admin Zitadel machine user with `fleet-admin` role (one-off for verification, separate from the per-device users).
- Hand off / run tests.
Idempotent across re-runs:
- k3d cluster create skipped if exists.
- ZitadelSetupScore is search-then-create.
- VM creation: `ProvisionVmScore` reports NOOP if domain exists.
- FleetDeviceSetupScore byte-compares the rendered TOML.
Tests
Real #[tokio::test] functions sharing a OnceCell-bringup. Run
sequentially (--test-threads=1 because they share the cluster +
VMs):
| # | Name | What it asserts |
|---|---|---|
| 1 | `both_devices_heartbeat_within_60s` | Device CRs for A and B materialise with their labels. |
| 2 | `deployment_targets_only_matching_device` | Apply CR with group=group-a selector → A reconciles, B doesn't. |
| 3 | `deployment_status_aggregates_back_to_cr` | .status.aggregate.succeeded == 1 within 60s. |
| 4 | `env_vars_and_volume_propagate_to_container` | SSH into A, podman inspect confirms env + bind mount. |
| 5 | `admin_jwt_reads_any_device_subject` | Admin token sees A's heartbeat. |
| 6 | `cross_device_isolation_enforced_in_vm` | A's per-device JWT cannot subscribe to B's command subject. |
| 7 | `agent_recovers_from_nats_pod_restart` | Kill NATS pod, both agents reconnect with fresh tokens within 30s. |
Test 7 is the load-bearing one — it's the only one that exercises the auto-reconnect + auth-callback re-mint path under realistic disturbance. Asserted by: kill nats-0 pod via kube API, wait for new pod ready, then publish a message from admin and verify both agents pick it up.
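The shared bring-up that all seven tests rely on can be sketched with a once-initialised static. This is a synchronous stand-in using `std::sync::OnceLock`; the real tests are async and may well use an async once-cell instead. The `Stack` struct, `shared_stack`, the `BRING_UPS` counter, and the loopback NATS URL are all hypothetical placeholders.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical handle to the shared stack; a real harness would hold
/// kube clients, VM addresses, and credentials here.
#[derive(Debug)]
pub struct Stack {
    pub nats_url: String,
}

/// Counts how many times the expensive bring-up actually ran.
pub static BRING_UPS: AtomicUsize = AtomicUsize::new(0);
static STACK: OnceLock<Stack> = OnceLock::new();

/// Returns the shared stack, running the bring-up only on the first call.
pub fn shared_stack() -> &'static Stack {
    STACK.get_or_init(|| {
        BRING_UPS.fetch_add(1, Ordering::SeqCst);
        // Placeholder for the slow cold bring-up; the URL is an assumption.
        Stack {
            nats_url: "nats://127.0.0.1:30422".to_string(),
        }
    })
}
```

Every test calls `shared_stack()` first; only the first caller pays the bring-up cost, which is why the suite has to run with `--test-threads=1` against the single shared cluster.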
Implementation order
- ✏️ Roadmap doc (this file).
- 🆕 `examples/fleet_e2e_demo/` crate skeleton.
- ♻️ Refactor `fleet_auth_callout::bring_up_stack` constituent functions to be `pub` so they're individually re-usable.
- ➕ `/etc/hosts` injection step in `FleetDeviceSetupScore`.
- ➕ Operator install via Helm in the new harness.
- 🔗 Compose `bring_up_full_stack(num_devices)`.
- 🧪 Write the 7 tests.
- 🚦 Cold-start the bring-up. Fix what breaks (expected: ≥3 things).
- 🧪 Run tests. Fix what breaks (expected: ≥1 thing).
- 💥 Run test 7 in isolation; verify reconnect timing.
- 📝 Update `demo_runbook.md` with VM-rehearsal commands.
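For the `/etc/hosts` injection step, the core requirement is that re-running it never duplicates the line. A minimal sketch of that logic as a pure function over the file contents follows; `ensure_hosts_entry` is a hypothetical name, and this version only appends a missing mapping. It does not rewrite a stale mapping that points the same hostname at a different IP.

```rust
/// Returns `contents` with an `<ip> <hostname>` mapping present.
/// If an active (non-comment) line already maps `ip` to `hostname`,
/// the input is returned unchanged, which makes re-runs a no-op.
pub fn ensure_hosts_entry(contents: &str, ip: &str, hostname: &str) -> String {
    let already_present = contents.lines().any(|line| {
        let mut fields = line.split_whitespace();
        // First field must be the IP; names after a '#' are comments.
        fields.next() == Some(ip)
            && fields
                .take_while(|f| !f.starts_with('#'))
                .any(|name| name == hostname)
    });
    if already_present {
        return contents.to_string();
    }
    let mut out = contents.to_string();
    // Make sure the appended entry starts on its own line.
    if !out.is_empty() && !out.ends_with('\n') {
        out.push('\n');
    }
    out.push_str(&format!("{ip} {hostname}\n"));
    out
}
```

Keeping the logic pure (string in, string out) means it can be unit-tested without touching a VM; the Score would then only need to read the file, call this, and write back when the result differs.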
Known risks / debugging traps
- `iam-admin-pat` secret timing. Chart's setup job runs on first install but may take 30-90s after Helm reports the chart Ready. Need a wait-for-secret loop before invoking ZitadelSetupScore. (Today the `bring_up_stack` in `fleet_auth_callout` doesn't have this; it works because we re-run after the secret has settled. First-cold-run will likely fail.)
- Per-device machine keys are returned ONCE. ZitadelClientConfig caches them locally. If the cache file is missing/corrupt mid-bring-up, devices fail at TOML render. Persist the cache atomically.
- VM /etc/hosts mutation. Cloud-init can do this, but FleetDeviceSetupScore doesn't currently touch /etc/hosts. Add a step before package install (low risk: idempotent line-in-file).
- k3d port collision. Existing `harmony` and `harmony-example` clusters from prior sessions may collide on host ports. Either pick unique ports or fail loudly when in use.
- NATS pod restart test is non-deterministic. async-nats's reconnect timing depends on backoff schedule. Assert via "publish succeeds within 30s after restart" rather than literal reconnect events; the latter is implementation-detail-dependent.
- Bring-up time. Cold: ~15 min (Zitadel + Postgres dominate). Set test runner timeout accordingly. Warm: ~30s. The OnceCell pattern means the cost is amortised across the test suite.
- Agent reconciler is non-idempotent for env / volume specs. `harmony/src/modules/podman/topology.rs::matches_spec` returns false (forcing destroy + recreate) for any `ContainerSpec` with non-empty env or volumes. This was a deliberate "fail-safe" choice by the original author, because podman's list endpoint doesn't surface env/mount data. With the periodic reconcile firing every 30s, this becomes a destroy-and-recreate loop for any non-trivial Deployment. Demo workaround: keep demo specs free of env + volumes (the hello-web nginx demo already is). Real fix (out of scope for the demo, in scope for delivery): switch the drift check to `containers.get(name).inspect()`, which returns env + mounts, do a structural compare, and lock it in with an integration test asserting that the container ID is stable across two consecutive applies. There is a FIXME tag at the offending line.
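The structural compare proposed in the last risk could look roughly like the sketch below. The types are hypothetical, trimmed-down stand-ins, not the real `ContainerSpec` or the podman inspect payload. It also rests on two assumptions: the image string is directly comparable between spec and inspect output, and `mounts` has already been filtered down to bind mounts.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, trimmed-down stand-in for the desired container spec.
pub struct ContainerSpec {
    pub image: String,
    pub env: BTreeMap<String, String>,
    /// (host_path, container_path) bind mounts.
    pub volumes: BTreeSet<(String, String)>,
}

/// Hypothetical view of what an inspect call reports for a running
/// container, with mounts already reduced to bind mounts.
pub struct Inspected {
    pub image: String,
    pub env: BTreeMap<String, String>,
    pub mounts: BTreeSet<(String, String)>,
}

/// True when the running container already satisfies the spec,
/// meaning no destroy + recreate is needed.
pub fn matches_spec(spec: &ContainerSpec, actual: &Inspected) -> bool {
    // Images inject their own variables (PATH and so on), so the spec's
    // env only has to be a subset of what the container reports.
    let env_ok = spec
        .env
        .iter()
        .all(|(key, value)| actual.env.get(key) == Some(value));
    spec.image == actual.image && env_ok && spec.volumes == actual.mounts
}
```

The subset check on env is the part that avoids false drift: a strict equality check would flag every container whose image defines its own variables, which would recreate the same loop. The trade-off is that removing a variable from the spec goes undetected in this sketch.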
Success criteria for the rehearsal day
Tomorrow's all-day testing is "green" if:
- Cold `cargo run -p example-fleet-e2e-demo` brings up the full stack and prints credentials in under 20 minutes.
- `cargo test -p example-fleet-e2e-demo --test e2e_walking_skeleton` greens all 7 tests on a clean machine.
- `cargo test ... --test e2e_walking_skeleton agent_recovers_from_nats_pod_restart` greens reliably 5 runs in a row.

Anything below this and we don't show up to the customer call with a "staging deployed" promise; we reframe to "architecture walkthrough + local k3d security-model demo + pilot scheduled in 1-2 weeks."
What follows after greens
Once the VM rehearsal is green, the residual deltas to ship to real OKD are:
- Replace `K8sAnywhereTopology` (which falls back to k3d via `HARMONY_USE_LOCAL_K3D`) with a real-OKD profile. The Score code doesn't change; only the topology bootstrap.
- Verify Route annotations actually edge-TLS for both Zitadel and NATS-WSS in the customer's cluster. ~30 min smoke.
- Push the callout image to a registry the customer's cluster pulls from. Mechanical.
- Real wildcard DNS for `*.<base-domain>` pointed at the cluster ingress.
None of those four require new code; they're configuration. The heavy lifting (the JWT auth chain, the agent's reconnect loop, the operator → KV → agent → podman → status loop) is what the VM rehearsal proves.