feat/nats-auth-callout-e2e #279
Reference in New Issue
Block a user
No description provided.
Delete Branch "feat/nats-auth-callout-e2e"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
New first step (1/7): read /etc/fleet-agent/config.toml off the device and compare against the rendered desired config. Three branches: - missing → info, first install - matches → warn, converge anyway - differs → warn + unified diff (similar::TextDiff with 2-line context radius, '-/+' marker style) + inquire::Confirm prompt defaulting to N. Aborts with InterpretError if declined. Existing 6 steps renumbered to 2/7-7/7. The diff replaces the previous "dump both full configs" approach which was unreadable even for one-line differences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>ZitadelScore: - Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on this secret existing; without the chart values, the prior code path was non-functional. - New `external_port: Option<u32>` field. Controls Zitadel's emitted issuer URL when the host port mapping isn't 80/443 (k3d typically maps 8080:80). Without it, JWT-bearer audience validation 500s with `Errors.Internal` because the assertion's `aud` doesn't match the chart-default issuer at port 80. ZitadelSetupScore is extended for the JWT-bearer flow needed by the NATS auth callout: - API apps (resource servers — required for project-id audience scope) - Project roles (`POST .../projects/{id}/roles`, idempotent) - Machine users with KEY_TYPE_JSON keys (provisioned + cached device-side; Zitadel does not expose the key material on subsequent reads, so the local cache is the source of truth) - User grants (project + role keys) Cache (ZitadelClientConfig) gains projects, machine_user_ids, machine_keys, and user_grants — keyed for idempotency across re-runs. Backwards compatible with existing harmony_sso example: the new fields have `#[serde(default)]` and prior callers just need empty vecs. Refresh upgrade-by-default in helm chart (separate commit) lets ExternalPort changes propagate to existing releases on re-run.harmony-nats-callout becomes a deployable service, not just a library: - New [[bin]] target with env+secret-file driven config and SIGINT/SIGTERM-aware shutdown. - Dockerfile (single-stage archlinux:base, non-root, matches harmony-fleet-operator convention). - Refactored handler into a pure `decide()` function so the entire authorization decision tree is unit-testable without async-nats. - New `roles` module with role resolution + a `validate_device_id` security gate that rejects NATS subject metacharacters in device_id (.>* whitespace) — closes a real escalation path through the `{device_id}` placeholder in the per-device permissions block. - Configurable role claim path + admin/device role names; admin wins when both are present (privilege-escalation invariant). 57 unit tests cover every reachable branch of the security decision tree; 4 e2e tests in nats/integration-test-callout exercise real NATS in podman with: device pubsub on own subjects, cross-device subject isolation, admin-can-read-anything, and JWT-without-role rejection. harmony/src/modules/nats_auth_callout/: - New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment + Secret. fsGroup + 0o440 secret mode so the non-root container can read its mounted seed/password without leaving them in env vars. - `render_auth_callout_block` helper produces the YAML for NATS Helm `config.merge.authorization.auth_callout` so both halves stay in sync. examples/fleet_auth_callout/: - `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres -> CoreDNS rewrite -> project + roles + machine users with JWT keys -> NATS Helm with auth_callout block -> callout image build + sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs (issuer NKey persisted in a K8s secret so user JWTs survive restarts). - `mint_access_token()` RFC 7523 JWT-bearer client. Uses Host header with port so Zitadel emits a matching issuer. - main.rs prints URLs/creds/keyIds and waits for Ctrl-C. - Three #[tokio::test] functions sharing one cluster via OnceCell: admin_can_read_any_device_subject, device_can_only_access_own_subjects, unknown_role_is_rejected. All green on real k3d.The fleet agent's NATS connection is the load-bearing piece of the "never lose connectivity to a device" guarantee. This commit makes that hold even when Zitadel access tokens expire across NATS pod restarts and network partitions. New `[credentials]` config variants (externally-tagged): type = "toml-shared" { nats_user, nats_pass } # v0/dev type = "zitadel-jwt" { key_path, oidc_issuer_url, audience, ... } A `CredentialSource` enum dispatches per variant: - TomlShared returns the same user/pass each call. - ZitadelJwt mints an access token from Zitadel via the JWT-bearer flow (RFC 7523). The keyfile at `key_path` is the only durable secret on the device; the bearer token is short-lived and refreshed in-memory when the cached value is within 5 minutes of expiry. Two concurrent refreshes are race-safe — the second writer's mint is wasted but produces a correct token. The agent's `connect_nats` is rewritten on top of async-nats's `with_auth_callback`, which is invoked on every (re)connect attempt: - async-nats reconnects automatically on disconnect (default behaviour of ConnectOptions) — we don't need a watchdog. - Each reconnect attempt invokes the callback, which calls `next_credential()`. If the cached token is expired, a fresh one is minted before the reconnect proceeds. So a Pi that loses NATS while its token has just expired will pick up a brand-new token on the next reconnect attempt with no operator intervention. - An `event_callback` surfaces Connected / Disconnected / SlowConsumer / ServerError events into tracing — operators can see exactly when reconnects happen, which is non-negotiable for an out-of-warranty device fleet. A subtle constraint drove the trait shape: async-nats's `with_auth_callback` requires the returned future to be `Send + Sync`, which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does not satisfy. The credential source is therefore an enum (concrete dispatch) rather than `dyn CredentialSource`. Two variants is small enough that enum dispatch beats trait-object plumbing. Out of scope, tracked for follow-up: a separate daemon for SSH access to the Pi via Tailscale/Headscale ("secure backdoor"), and the device-join-request + admin-approve flow that would replace the current admin-PAT bootstrap pattern.The Pi onboarding flow can now mint a per-device Zitadel machine user on the operator's machine and ship the resulting JWT key to the Pi — the agent then authenticates to NATS via JWT-bearer instead of shared nats_user/nats_pass. `FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous flat `nats_user` / `nats_pass` fields. Two variants: - TomlShared { nats_user, nats_pass } — legacy / dev fallback. - ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } — per-device JWT-bearer. The Score: * Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json (mode 0640, owner fleet-agent — matches the agent's secret-mount conventions). * Renders [credentials] type = "zitadel-jwt" pointing at that keyfile + the issuer + audience the agent's CredentialSource needs. A change to either the keyfile content or the TOML triggers an agent restart, same as binary / unit drift. `fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path. The bootstrap PAT is held in the CLI's memory only; it never lands on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id, --zitadel-device-role (default `device`), --danger-accept-invalid-certs. `zitadel_bootstrap` is a slim ManagementAPI client that, idempotently per device: 1. Find-or-create machine user `device-${device_id}`. 2. Find-or-skip a project role grant (defaults to `device`). 3. Always mint a fresh JSON key and return its content. (Zitadel doesn't expose the private half of an existing key, so reusing isn't possible — stale keys remain valid until expiry, which is fine because each setup run overwrites the on-device keyfile.) Three new render_toml tests cover the zitadel-jwt path; eleven existing agent tests still pass. Out of scope, tracked: device-join-request + admin-approve flow that would replace bootstrap-PAT entirely (closer to the OKD node-approval pattern). Long-lived admin PAT is acceptable for the demo per product call.Adds `examples/fleet_staging_deploy/` — the operator-side, run-once- per-customer harness that brings up the fleet platform's central services on a real OKD/K8s cluster. Complements the existing `fleet_auth_callout` (k3d local-dev harness, kept unchanged) and `fleet_rpi_setup` (per-device onboarding). `FleetDomainConfig` is the single source of truth for hostnames: base_domain = "customer1.nationtech.io" → zitadel.<base> (Zitadel HTTPS via OKD HAProxy edge-TLS) → nats.<base> (NATS WSS through the same ingress) Nothing is hardcoded; the operator supplies one --base-domain flag and the deploy is fully parameterized. Re-running is idempotent (rides the helm-upgrade-by-default + ZitadelSetupScore search-then- create + persisted issuer-NKey-secret idempotency layers). NATS values render under config.merge.{auth_callout, accounts, system_account}, with WSS via `websocket: { enabled, port: 8443, ingress: { className: openshift-default, ... } }` and the OKD-flavored HAProxy edge-TLS annotations: route.openshift.io/termination: edge haproxy.router.openshift.io/timeout: "1h" (Switch to `reencrypt` when the customer wants pod-to-edge TLS; gateway-api migration is on their roadmap, separate from the demo.) bring_up_staging(): - Deploys ZitadelScore (external_secure: true, no external_port → 443). - Waits for HTTPS .well-known. - Provisions the project + API app + roles via ZitadelSetupScore hitting Zitadel through the public ingress (port 443, TLS verified). No machine users provisioned — fleet_rpi_setup mints them on demand per device, so the staging deploy stays device-count-agnostic. - Persists / reads the issuer NKey seed in the `callout-issuer-seed` K8s secret (so re-runs don't invalidate user JWTs already in flight on customer Pis). - Deploys NATS via NatsHelmChartScore with the WSS values. - Deploys NatsAuthCalloutScore (oidc_audience = project_id; external_secure path means no danger_accept_invalid_certs). main.rs ends by printing the exact `cargo run -p example-fleet-rpi-setup ...` invocation the operator runs against a Pi, with the project_id and zitadel/nats URLs filled in. Three unit tests cover the domain config + NATS values rendering (WSS + edge-TLS annotations + auth_callout under merge).Two bugs surfaced when the agent went live against NATS JetStream KV in the VM-based e2e rehearsal: 1. The default `device` role only allowed flat `device-state.<id>` / `device-commands.<id>` subjects. The agent's actual data plane is JetStream KV, which puts every operation on `$KV.<bucket>.<key>` subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`. With the old role config, the very first KV publish died with `Permissions Violation for Publish to "$JS.API.INFO"`. The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four per-device data subjects derived from harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>, heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct `device-state.<id>` / `device-commands.<id>` subjects are kept so non-JetStream callers of NatsAuthCalloutScore still work. A new unit test (`device_role_covers_reconciler_contract_kv_subjects`) imports the contract crate as a dev-dep and asserts each contract- produced subject is matched, plus that cross-device subjects are *not* matched. This locks the role config to the contract surface so future renames break the test before they break prod. 2. Zitadel's `client_id` claim for a machine user equals the userName verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the user as `device-{device_id}`, so the JWT carries `device-vm-device-00` while the agent's KV keys use the bare `vm-device-00`. The callout was interpolating the prefixed string into permissions, producing rules that never matched what the agent actually publishes. Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`, defaults empty so existing deployments are unaffected). When set, the validator strips the prefix from the extracted claim before permission interpolation. The fleet_auth_callout example wires it to `device-` so the e2e harness stays end-to-end correct without reaching into either naming convention. Verified end-to-end: both VM agents now publish DeviceInfo / heartbeat through JetStream KV with no permission errors and zero service restarts since the rollout.The agent's data plane was JetStream-KV-only, so live observers that don't want to consume the JS stream had no signal to subscribe to. The walking-skeleton e2e admin test was failing as a result — admin subscribes to `device-state.>` (the per-device direct subject) and saw nothing in 30s. This commit adds a small core-NATS publish on `device-state.<id>` alongside the existing KV writes: - `FleetPublisher::publish_state_pulse()` emits a tiny `{device_id, kind: "heartbeat", at}` payload on `device-state.<device_id>`, called from the heartbeat loop so observers see traffic on the same 30s cadence as the KV heartbeat write — but on a non-JetStream subject anyone can sub to. - `write_deployment_state()` now fans out the same payload it puts in the KV bucket on the direct subject, so live admin tooling picks up reconcile transitions immediately without watching the KV stream. Also threads `device_id_prefix_strip = "device-"` through the fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore construction (parallel to fleet_auth_callout's `bring_up_stack`), and was missing the prefix-strip line, so the deployed callout was interpolating permissions against `device-vm-device-00` instead of the bare device id the agent uses. Locks the regression with a unit test (`device_id_prefix_strip_lands_as_env_value`) on the deployment manifest builder. Verified end-to-end in the VM rehearsal: test both_devices_heartbeat_within_60s ... ok test admin_jwt_reads_any_device_subject ... okExtend DiscoverHostForRoleScore with three new interactive prompts after the installation-disk selection: - "Configure a network bond?" (only when host has >= 2 NICs), followed by a multi-select of bond members (min 2) and a bond-mode picker (LACP / active-backup / balance-rr / balance-xor / broadcast / balance-tlb / balance-alb). - "Blacklist any remaining interface?", with candidates limited to NICs not already claimed by the bond. The answers are persisted as a JSON-encoded NetworkConfig on a new host_role_mapping.network_config column. HostConfig now exposes network_config alongside installation_device so downstream scores can honor the user's intent. Also adds a new harmony_host_discovery example that discovers a single host on 192.168.40.0/24:25000.- PhysicalHost::summary() becomes terser and more informative: - Storage: "400 GB [8 GB, 477 GB]" (was "400 GB Storage (2 Disks [8 GB, 477 GB])"). Single-disk collapses to just the total. - Network: list every NIC as "[ip, mac]" with a count prefix (e.g. "3 NICs: [192.168.40.10, 98:fa:9b:03:17:6f], [00:e0:ed:7a:ec:4d], ..."). Single-NIC form drops the count and "s": "NIC: [ip, mac]". NICs without an IPv4 render as "[mac]". - Promote the inventory agent's Chipset { vendor, name } into a "system-product-name" label during host conversion (both MDNS and CIDR flows), so summary()'s first field shows "LENOVO 3136" instead of falling back to the HostCategory string ("Server"). Extracted into build_discovered_host_labels() to keep the two conversion sites in sync. When the chipset is blank, the old category fallback still applies. - Print a blank line before every interactive inquire prompt in the discovery flow (role pick, disk pick, bond confirm/multi-select/mode, blacklist confirm/multi-select) so prompts stand out from the preceding log output on the terminal.- SqliteInventoryRepository::save() now compares the incoming serde_json bytes against the latest stored `data` blob for this host_id. If byte-identical, the insert is skipped with an info log "Host '<id>' unchanged, skipping save". Genuine changes still produce a new version row, preserving the audit trail. Eliminates the unbounded row growth from repeated discovery (mDNS is continuous, CIDR scans often re-run). Addresses the long-standing FIXME in modules/inventory; the comment is now removed. - Reworded the caller-side log that fires after repo.save() from "Saved [new] host id X, summary: ..." to "Discovered host X, summary: ...". The old text claimed "Saved" even when the repo had actually skipped the insert, producing contradictory log lines on re-runs. - Harmonized every host-specific inquire prompt in the discovery flow behind a new print_host_header() helper: each prompt is now preceded by a blank line and a "Host: <summary>" banner, and the redundant host name inside the question text is stripped (disk prompt, bond confirm). The node-selection prompt is unchanged -- it picks *which* host, so there is no current host yet.- host_role_mapping now holds at most one row per host_id. SqliteInventoryRepository::save_role_mapping wraps a DELETE of any prior rows for the host and the INSERT of the new one in a single transaction, self-healing pre-existing duplicate rows along the way. - Before re-prompting for disk and networking, the discovery flow looks up the current role mapping via the new InventoryRepository::get_role_mapping(host_id) method. If one exists, the operator sees a summary (role, install disk, bond mode + interfaces, blacklist) and picks between "Update" and "Cancel"; cancelling skips the host entirely and continues the selection loop without touching the DB. New HostRoleMapping domain type carries the returned row back to the caller. - Network interfaces are sorted by name at the hwinfo-to-domain conversion step (both MDNS and CIDR flows), so f0 always appears before f1 in every downstream consumer — host summary, bond multi-select, blacklist multi-select. This also makes the byte-equality dedup in save() robust against the agent returning NICs in different sysfs-walk order across reboots. - PhysicalHost::summary() split into summary_parts_through_storage() + append_network_summary(), with a new public summary_short() variant that omits the NIC list. print_host_header() in the discovery prompts now uses summary_short() so the "Host: ..." banner fits on one line; full summaries still render in the node picker, logs, and Display impl. - Fix CPU summary rendering when the agent reports an empty model: single-CPU renders as "6c/6t", multi-CPU as "2x CPU (12c/24t)", no stray double-space in the pipe-separated summary. - Regenerate .sqlx offline cache for the new DELETE and SELECT queries.- Switch SqliteInventoryRepository to DELETE journal mode with create_if_missing, so `.sqlite-wal` / `.sqlite-shm` files no longer appear next to the DB. Existing WAL-mode DBs are checkpointed and converted on next open. - Print a blank line after prompt_network_config returns so the save logs don't stomp on the last answered question.Addresses review feedback on the previous HAProxy field-default fixes: the eight match blocks in `configure_service` that mapped loose strings ("get", "tcp", "roundrobin", ...) to generated OPNsense enum variants were poor Rust — they duplicated the wire-value knowledge that the codegen already has, and any new enum variant in OPNsense meant editing every call site by hand. - `opnsense-codegen/src/codegen.rs::generate_enum` now emits `impl From<&str>` and `impl From<String>` for every generated enum, right after the existing serde module. Lowercase-matches wire values; unknown inputs fall through to the `Other(String)` variant the codegen already emits for forward-compat round-tripping. - `opnsense-api/src/generated/haproxy.rs` regenerated — 153 enums, 306 new impl blocks. No hand edits; re-run via `cargo run -p opnsense-codegen -- generate --xml opnsense-codegen/vendor/plugins/net/haproxy/src/opnsense/mvc/app/models/OPNsense/HAProxy/HAProxy.xml --output-dir opnsense-api/src/generated --module-name haproxy`. - `opnsense-config/src/modules/load_balancer.rs::configure_service` replaces eight string-match blocks with one-liners: `HealthcheckType::from(hc.check_type.as_str())` etc. - Drive-by: fixed a pre-existing typo at `harmony/src/infra/opnsense/load_balancer.rs:185` and the matching reverse at `:149` — `SSL::SNI` was mapped to `"sslni"`, but the OPNsense wire value is `"sslsni"`. Before this refactor the typo silently hit `HealthcheckSsl::Other("sslni")`; the cleaner conversion made the bug obvious so it's fixed here rather than left for a follow-up. Verification: - `cargo check -p harmony -p opnsense-config -p opnsense-api` clean - `cargo test -p harmony --lib okd::load_balancer` 6/6 pass - `cargo test -p opnsense-codegen` 22/22 pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The new tests lock the corners we burned cycles on while debugging the live system: * cache freshness boundary (within-leeway, outside-leeway, no-cache, non-zitadel variant) * assertion claim shape (iss/sub/aud/exp/iat) and the 60-second lifetime constant Zitadel enforces server-side * scope string content (plural-projects-roles + singular-project-id URN + openid base) * token URL strips trailing slashes (the //oauth/v2/token 404 waiting to bite the next operator) * MachineKeyFile JSON parsing under Zitadel's wire shape Refactor: build_assertion now delegates to build_assertion_claims + build_assertion_header (pure, no signing). Lets the claim/header shape be unit-tested without an RSA private-key fixture; the sign-and-decode end-to-end is still covered by the e2e harness. No new deps. wiremock not needed — every meaningful assertion is on pure logic.