harmony/ROADMAP/fleet_platform/v0_demo_e2e.md
Jean-Gabriel Gill-Couture 34cfa0423b docs(podman): FIXME diagnosis for the reconcile-loop bug
The agent's periodic reconcile destroys-and-recreates any service
whose ContainerSpec has env or volumes, every 30s tick. Root cause:
matches_spec returns false unconditionally for those fields because
podman's list endpoint doesn't surface them; the original author
chose to declare "any spec with state is drifted" as a fail-safe.
That fail-safe weaponizes the polling reconciler into a loop.

Tags the offending line with a multi-paragraph FIXME explaining
the symptom, the root cause, the proposed fix (containers.inspect
+ structural compare + an integration test), and the demo-time
workaround (keep demo specs trivial — the hello-web nginx demo
already is).

Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's
known-risks section so it's visible at planning time.

Out of scope for tonight; in scope for delivery alongside the
upcoming health-check support on ContainerSpec.
2026-05-05 01:59:51 -04:00

V0 Demo End-to-End — VM-Based Rehearsal

48-hour customer demo prep. The PO assessment from memory/feedback_* and the prior planning discussion concluded that shipping the customer demo against an untested OKD path is reckless. This doc plans the VM-based rehearsal that proves the JWT-auth chain end-to-end before we touch a real cluster.

Why VM, not OKD

Smoke-a4 already greens the chain: k3d + in-cluster NATS + libvirt ARM VM + agent + apply CR + reconcile podman + status reflect-back, on x86_64 and aarch64. Zero new infra; we extend the existing harness with Zitadel + auth callout + agent JWT auth.

Same Helm charts, same Scores, same agent code paths as production. Only the cluster topology differs (k3d/traefik vs OKD/HAProxy). The remaining OKD-specific deltas — Route annotations, edge-TLS, real DNS — are small and testable in isolation after the VM smoke is green.

Compared to validating directly against OKD:

  • Local + reproducible: same cargo run runs on any dev machine with podman + libvirt + k3d.
  • Fast iteration: bring-up is ~12-15 min cold, ~30s warm. We fix integration bugs in minutes, not "wait for cluster admin" hours.
  • CI-able: greens in a single cargo test invocation, so we prevent regressions post-demo.

What this rehearsal proves

  • ZitadelScore's FirstInstance.Org.Machine.Pat block actually causes the chart to provision the iam-admin-pat secret (we added the Helm config, never confirmed the secret materialises).
  • ZitadelSetupScore::ensure_machine_user reaches a working JSON keyfile when called outside its k3d unit tests.
  • The agent's CredentialSource::ZitadelJwt mints a token, that token actually authenticates against the auth callout, and the callout admits it into the DEVICES account.
  • async-nats's auto-reconnect-with-auth-callback fires fresh tokens on real NATS pod restart — the load-bearing "never lose connectivity to a device" guarantee.
  • The full operator → NATS KV → agent → podman → status-back-to-CR loop survives the credential-source rewrite.
  • Container env / volumes / restart policy land on the real podman instance, not just in unit tests.

What it does NOT prove (deferred, accepted)

  • OKD HAProxy edge-TLS termination on the Zitadel and NATS-WSS Routes. Tested separately in a follow-up smoke once the VM smoke is green.
  • Real DNS resolution from a customer LAN. We inject /etc/hosts entries on each VM so sso.fleet.local resolves to the libvirt host.
  • Browser-driven device-code SSO (fleet_sso_login is compile-only today). Out of scope for this rehearsal — admin verification uses an injected machine-user token via JWT-bearer (same as examples/fleet_auth_callout).
  • Customer's docker-compose translation. Manual at the call.

Architecture

                   k3d cluster (host)
   ┌─────────────────────────────────────────────────┐
   │  Zitadel + Postgres   http://sso.fleet.local    │
   │      │                     (host:8080)          │
   │      │  project + roles + per-device users      │
   │      ▼                                           │
   │  ZitadelSetupScore cache  → keyfiles (per VM)   │
   │                                                  │
   │  NATS (auth_callout)   nats://<host>:30422      │
   │      ▲                                           │
   │      │  JWT-bearer via callout                   │
   │  fleet-callout pod                               │
   │                                                  │
   │  fleet-operator → KV writes desired-state       │
   │      ▲                                           │
   │      │  kube apply Deployment CR                 │
   └──────┼──────────────────────────────────────────┘
          │
   ┌──────┼──────────────────────────────────────────┐
   │   libvirt default NAT (host = 192.168.122.1)    │
   └──────┼──────────────────────────────────────────┘
          ▼
   ┌──────────────┐    ┌──────────────┐
   │  device-A    │    │  device-B    │   (cloud-init Ubuntu VMs)
   │  fleet-agent │    │  fleet-agent │
   │  + Zitadel   │    │  + Zitadel   │
   │   JWT key    │    │   JWT key    │
   │  + podman    │    │  + podman    │
   └──────────────┘    └──────────────┘

Bring-up sequence

  1. Ensure k3d cluster fleet-e2e-demo (port mappings 8080→80, 30422→30422; same as fleet_auth_callout).
  2. Reuse fleet_auth_callout::bring_up_stack constituent functions:
    • Deploy Zitadel + Postgres
    • Wait for iam-admin-pat secret to materialise
    • Provision project fleet, API app, roles fleet-admin + device
  3. Install fleet operator from its Helm chart (Chapter 3 ships this).
  4. Generate issuer NKey, deploy NATS with auth_callout block, deploy NatsAuthCalloutScore (image side-loaded into k3d).
  5. For each device i in 1..=num_devices:
    • Mint Zitadel machine user device-${device_id_i} with the device role grant via ZitadelSetupScore. Cache the JSON key.
    • Provision libvirt VM via ProvisionVmScore (cloud-init Ubuntu, x86_64).
    • SSH in via LinuxHostTopology. Inject /etc/hosts: <host_ip> sso.fleet.local.
    • Run FleetDeviceSetupScore with FleetDeviceAuth::ZitadelJwt { machine_key_json, ... }.
  6. Mint admin Zitadel machine user with fleet-admin role (one-off for verification — separate from the per-device users).
  7. Hand off / run tests.
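Several of the steps above wait on something that settles asynchronously (the iam-admin-pat secret, a pod becoming ready). A minimal sketch of a bounded wait is shown here, under stated assumptions: `wait_for` is a hypothetical helper, not part of the existing harness, and it blocks the thread with std only, whereas the real bring-up would more likely use an async equivalent.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `probe` until it yields `Some(value)` or
/// `timeout` elapses. Returns `None` on timeout so the caller can fail
/// loudly instead of continuing with a half-ready stack.
fn wait_for<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Probe first, so a condition that is already true returns at once.
        if let Some(value) = probe() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

A caller would pass a closure that looks up the secret and returns its value once present, with a timeout comfortably above the 30-90s window noted under the known risks.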

Idempotent across re-runs:

  • k3d cluster create skipped if exists.
  • ZitadelSetupScore is search-then-create.
  • VM creation: ProvisionVmScore reports NOOP if domain exists.
  • FleetDeviceSetupScore byte-compares the rendered TOML.
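The /etc/hosts injection from step 5 needs the same re-run safety. The sketch below shows one way to keep it idempotent as a pure function over the file content; `ensure_line` is a hypothetical name, and how the content is read from and written back to the VM is left out.

```rust
/// Hypothetical helper: returns `content` with `entry` present.
/// If some line already equals `entry` (ignoring surrounding whitespace),
/// the input comes back unchanged, so a second run is a no-op.
fn ensure_line(content: &str, entry: &str) -> String {
    let entry = entry.trim();
    if content.lines().any(|line| line.trim() == entry) {
        return content.to_string();
    }
    let mut out = content.to_string();
    // Make sure the new entry starts on its own line.
    if !out.is_empty() && !out.ends_with('\n') {
        out.push('\n');
    }
    out.push_str(entry);
    out.push('\n');
    out
}
```

Because the function is pure, the caller can compare input and output and skip the write entirely when nothing changed, which mirrors the byte-compare approach used for the rendered TOML.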

Tests

Real #[tokio::test] functions sharing a single OnceCell bring-up. Run sequentially (--test-threads=1, because they share the cluster + VMs):

| # | Name | What it asserts |
|---|------|-----------------|
| 1 | both_devices_heartbeat_within_60s | Device CRs for A and B materialise with their labels. |
| 2 | deployment_targets_only_matching_device | Apply CR with group=group-a selector → A reconciles, B doesn't. |
| 3 | deployment_status_aggregates_back_to_cr | .status.aggregate.succeeded == 1 within 60s. |
| 4 | env_vars_and_volume_propagate_to_container | SSH into A, podman inspect confirms env + bind mount. |
| 5 | admin_jwt_reads_any_device_subject | Admin token sees A's heartbeat. |
| 6 | cross_device_isolation_enforced_in_vm | A's per-device JWT cannot subscribe to B's command subject. |
| 7 | agent_recovers_from_nats_pod_restart | Kill NATS pod, both agents reconnect with fresh tokens within 30s. |

Test 7 is the load-bearing one — it's the only one that exercises the auto-reconnect + auth-callback re-mint path under realistic disturbance. Asserted by: kill nats-0 pod via kube API, wait for new pod ready, then publish a message from admin and verify both agents pick it up.
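The shared bring-up that all seven tests rely on can be sketched as follows. This is a synchronous analogue using `std::sync::OnceLock`; the doc names a OnceCell pattern under tokio, so the actual code likely differs. `Stack`, `bring_up`, `shared_stack` and the 127.0.0.1 host are assumptions made for illustration; only port 30422 comes from the plan above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical handle describing the running stack.
#[derive(Debug)]
struct Stack {
    nats_url: String,
}

static STACK: OnceLock<Stack> = OnceLock::new();
/// Counts how often the expensive bring-up actually ran.
static BRING_UPS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the real multi-minute bring-up.
fn bring_up() -> Stack {
    BRING_UPS.fetch_add(1, Ordering::SeqCst);
    Stack {
        nats_url: "nats://127.0.0.1:30422".to_string(),
    }
}

/// Every test calls this; only the first caller pays for the bring-up,
/// later callers get the same `&'static Stack` back.
fn shared_stack() -> &'static Stack {
    STACK.get_or_init(bring_up)
}
```

The initialised value lives for the whole test process, which is why the cold cost is paid once per `cargo test` invocation rather than once per test.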

Implementation order

  1. ✏️ Roadmap doc (this file).
  2. 🆕 examples/fleet_e2e_demo/ crate skeleton.
  3. ♻️ Refactor fleet_auth_callout::bring_up_stack constituent functions to be pub so they're individually re-usable.
  4. /etc/hosts injection step in FleetDeviceSetupScore.
  5. Operator install via Helm in the new harness.
  6. 🔗 Compose bring_up_full_stack(num_devices).
  7. 🧪 Write the 7 tests.
  8. 🚦 Cold-start the bring-up. Fix what breaks (expected: ≥3 things).
  9. 🧪 Run tests. Fix what breaks (expected: ≥1 thing).
  10. 💥 Run test 7 in isolation; verify reconnect timing.
  11. 📝 Update demo_runbook.md with VM-rehearsal commands.

Known risks / debugging traps

  • iam-admin-pat secret timing. Chart's setup job runs on first install but may take 30-90s after Helm reports the chart Ready. Need a wait-for-secret loop before invoking ZitadelSetupScore. (Today the bring_up_stack in fleet_auth_callout doesn't have this — it works because we re-run after the secret has settled. First-cold-run will likely fail.)
  • Per-device machine keys are returned ONCE. ZitadelClientConfig caches them locally. If the cache file is missing/corrupt mid-bring-up, devices fail at TOML render. Persist the cache atomically.
  • VM /etc/hosts mutation. Cloud-init can do this, but FleetDeviceSetupScore doesn't currently touch /etc/hosts. Add a step before package install (low risk: idempotent line-in-file).
  • k3d port collision. Existing harmony and harmony-example clusters from prior sessions may collide on host ports. Either pick unique ports or fail loudly when in use.
  • NATS pod restart test is non-deterministic. async-nats's reconnect timing depends on backoff schedule. Assert via "publish succeeds within 30s after restart" rather than literal reconnect events; the latter is implementation-detail-dependent.
  • Bring-up time. Cold: ~15 min (Zitadel + Postgres dominate). Set test runner timeout accordingly. Warm: ~30s. The OnceCell pattern means the cost is amortised across the test suite.
  • Agent reconciler is non-idempotent for env / volume specs. harmony/src/modules/podman/topology.rs::matches_spec returns false (forcing destroy + recreate) for any ContainerSpec with non-empty env or volumes. The original author chose this deliberately as a "fail-safe", because podman's list endpoint doesn't surface env/mount data. With the periodic reconcile firing every 30s, it becomes a destroy-and-recreate loop for any non-trivial Deployment. Demo workaround: keep demo specs free of env + volumes (the hello-web nginx demo already is). Real fix (out of scope for the demo, in scope for delivery): switch the drift check to containers.get(name).inspect(), which returns env + mounts; do a structural compare; and lock it in with an integration test asserting that the container ID is stable across two consecutive applies. The offending line carries a FIXME tag.
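The structural compare proposed in the last item could look roughly like the sketch below. It is not the code in topology.rs: the `ContainerSpec` and `Inspected` shapes here are simplified, hypothetical stand-ins, and the subset semantics for env and mounts are an assumption (images commonly inject variables such as PATH that the spec never mentions).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical desired state, reduced to the fields the drift check needs.
struct ContainerSpec {
    image: String,
    env: BTreeMap<String, String>,
    /// (host_path, container_path) pairs.
    volumes: BTreeSet<(String, String)>,
}

/// Hypothetical view of what an inspect call would report.
struct Inspected {
    image: String,
    env: BTreeMap<String, String>,
    mounts: BTreeSet<(String, String)>,
}

/// True when the running container already satisfies the spec, so the
/// reconciler can leave it alone instead of destroying and recreating it.
fn matches_spec(spec: &ContainerSpec, actual: &Inspected) -> bool {
    spec.image == actual.image
        // Every desired variable must be present with the desired value;
        // extra variables on the container are tolerated.
        && spec.env.iter().all(|(k, v)| actual.env.get(k) == Some(v))
        // Every desired mount must exist; extra mounts are tolerated.
        && spec.volumes.is_subset(&actual.mounts)
}
```

One consequence of the subset choice: removing a variable or volume from the spec would not register as drift. Whether that trade-off is acceptable is a design decision for the real fix.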

Success criteria for the rehearsal day

Tomorrow's all-day testing is "green" if:

  1. Cold cargo run -p example-fleet-e2e-demo brings up the full stack and prints credentials in under 20 minutes.
  2. cargo test -p example-fleet-e2e-demo --test e2e_walking_skeleton greens all 7 tests on a clean machine.
  3. cargo test ... --test e2e_walking_skeleton agent_recovers_from_nats_pod_restart greens reliably 5 runs in a row.

Anything below this and we don't show up to the customer call with a "staging deployed" promise; we reframe to "architecture walkthrough + local k3d security-model demo + pilot scheduled in 1-2 weeks."

What follows after greens

Once the VM rehearsal is green, the residual deltas to ship to real OKD are:

  1. Replace K8sAnywhereTopology (which falls back to k3d via HARMONY_USE_LOCAL_K3D) with a real-OKD profile. The Score code doesn't change; only the topology bootstrap.
  2. Verify that the Route annotations actually yield edge-TLS termination for both Zitadel and NATS-WSS in the customer's cluster. ~30 min smoke.
  3. Push the callout image to a registry the customer's cluster pulls from. Mechanical.
  4. Real wildcard DNS for *.<base-domain> pointed at the cluster ingress.

None of those four require new code; they're configuration. The heavy lifting (the JWT auth chain, the agent's reconnect loop, the operator → KV → agent → podman → status loop) is what the VM rehearsal proves.