feat/nats-auth-callout-e2e #279

Merged

johnride merged 64 commits from feat/nats-auth-callout-e2e into feat/iot-walking-skeleton

2026-05-05 13:46:15 +00:00

Author	SHA1	Message	Date
Jean-Gabriel Gill-Couture	29896bfeab	fix(zitadel,operator): user-grant search endpoint + operator keyfile mode Two bugs uncovered while running the full e2e walk end to end: 1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search which Zitadel rejects with 405 Method Not Allowed (the original author's note in the comment hinted at this). The cache previously masked it: first apply created the grant + cached the id; second apply hit the cache and skipped the broken search. The live-query refactor (`f4d6fb94`) removed the cache short-circuit, surfacing the bug as "Create user grant failed: User grant already exists" on every re-apply. Fix: switch to the collection endpoint /management/v1/users/grants/_search with a userIdQuery filter, matching the Zitadel API that's actually wired up. Now returns the existing grant on re-apply and the create_user_grant fallback is correctly skipped. 2. Operator keyfile mounted as 0o400 owned by root. The operator pod runs as non-root (image USER directive — no fixed runAsUser because we want SCC compatibility). Result: operator boots, tries to load the JSON keyfile from the Secret volume, hits EACCES, fails the credential factory, retries forever. Fix: mode 0o444. World-read inside the pod is fine — single container, no other consumers, the Secret namespace is locked down, and the file never escapes pod-fs. The proper fsGroup-based alternative requires pinning a UID/GID, which conflicts with our SCC-friendly choice of leaving runAsUser unset. Also fixes a stale `git rm` from commit `4194baac` (harmony-fleet-auth extraction) — the agent's local credentials.rs was deleted from disk but never staged. 
Verified end to end: * STACK READY in 2 min on warm cluster * Operator pod: "minted fresh Zitadel access token", "NATS connected", "starting Deployment controller", "watching device-info KV" * 2 Device CRs auto-created with full label set * `kubectl apply -f` of a Deployment CR with targetSelector.matchLabels: { group: group-a } produced: - status.aggregate { matched=1, succeeded=1, failed=0 } - HTTP 200 from nginx on vm-device-00:8080 - connection refused from vm-device-01:8080 (correctly excluded)	2026-05-05 06:55:24 -04:00
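The `targetSelector.matchLabels` check exercised in the verification above can be sketched as a plain subset test over label maps. This is a minimal illustration under assumptions: `matches_labels` and `count_matched` are hypothetical names, and the operator's real selector logic may differ.

```rust
use std::collections::BTreeMap;

/// Returns true when every selector pair is present, with the same value,
/// in the device's label set. An empty selector matches every device.
/// (Hypothetical helper, not the operator's actual API.)
pub fn matches_labels(
    selector: &BTreeMap<String, String>,
    device_labels: &BTreeMap<String, String>,
) -> bool {
    selector
        .iter()
        .all(|(key, value)| device_labels.get(key) == Some(value))
}

/// Counts the devices a selector matches, the figure an aggregate status
/// would report as `matched`.
pub fn count_matched(
    selector: &BTreeMap<String, String>,
    devices: &[BTreeMap<String, String>],
) -> usize {
    devices
        .iter()
        .filter(|labels| matches_labels(selector, labels))
        .count()
}
```

With two devices labelled `group: group-a` and `group: group-b`, a selector of `{ group: group-a }` matches exactly one, which lines up with the `matched=1` aggregate reported in the commit.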
Jean-Gabriel Gill-Couture	34cfa0423b	docs(podman): FIXME diagnosis for the reconcile-loop bug The agent's periodic reconcile destroys-and-recreates any service whose ContainerSpec has env or volumes, every 30s tick. Root cause: matches_spec returns false unconditionally for those fields because podman's list endpoint doesn't surface them; the original author chose to declare "any spec with state is drifted" as a fail-safe. That fail-safe weaponizes the polling reconciler into a loop. Tags the offending line with a multi-paragraph FIXME explaining the symptom, the root cause, the proposed fix (containers.inspect + structural compare + an integration test), and the demo-time workaround (keep demo specs trivial — the hello-web nginx demo already is). Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's known-risks section so it's visible at planning time. Out of scope for tonight; in scope for delivery alongside the upcoming health-check support on ContainerSpec.	2026-05-05 01:59:51 -04:00
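The difference between the fail-safe `matches_spec` and the structural compare proposed in the FIXME of 34cfa0423b can be sketched as follows. `SpecSketch` is a hypothetical, simplified stand-in for `ContainerSpec`, and the "inspected" value stands for whatever a per-container inspect call would return; neither reflects the real types.

```rust
/// Hypothetical, simplified stand-in for the desired container spec.
#[derive(Debug, Clone, PartialEq)]
pub struct SpecSketch {
    pub image: String,
    pub env: Vec<(String, String)>,
    /// (host_path, container_path) bind mounts.
    pub volumes: Vec<(String, String)>,
}

/// Fail-safe behaviour described in the commit: the list endpoint only
/// surfaces the image, so any spec carrying env or volumes is reported as
/// drifted, and a polling reconciler recreates it on every tick.
pub fn matches_spec_fail_safe(desired: &SpecSketch, listed_image: &str) -> bool {
    desired.image == listed_image && desired.env.is_empty() && desired.volumes.is_empty()
}

/// Structural compare against inspected state. Env is checked as a subset
/// because a running container also carries image-defined variables;
/// volumes are compared as order-insensitive sets.
pub fn matches_spec_structural(desired: &SpecSketch, inspected: &SpecSketch) -> bool {
    let env_ok = desired.env.iter().all(|pair| inspected.env.contains(pair));
    let mut want = desired.volumes.clone();
    let mut have = inspected.volumes.clone();
    want.sort();
    have.sort();
    desired.image == inspected.image && env_ok && want == have
}
```

Treating env as a subset is a design assumption made for this sketch; a production compare would likely need an integration test against real inspect output, as the FIXME itself suggests.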
Jean-Gabriel Gill-Couture	8a609c5342	feat(operator): NATS auth via shared harmony-fleet-auth + e2e wiring The operator was opening a bare async_nats::connect with no auth, which would fail closed against a callout-protected NATS. Wires it through the same JWT-bearer flow the agent uses, sharing the recently-extracted harmony-fleet-auth crate. Operator side ------------- * main.rs: read FLEET_OPERATOR_CREDENTIALS_TOML (TOML snippet, same shape as the agent's [credentials] block — single CredentialsSection struct, just a different byte source). Empty string bypasses (callout-less dev only, with a loud warning). * chart.rs: ChartOptions gains an optional OperatorCredentials field. When set, build_chart's Deployment mounts a Secret as both envFrom (TOML payload → FLEET_OPERATOR_CREDENTIALS_TOML) and a volume mount for the JSON keyfile at the configured key_path (defaults to /etc/fleet-operator/zitadel-key.json). On-disk helm chart still emits credentials: None — those are environment-specific and out of scope for a redistributable chart. * Public manifest builders (build_service_account, build_cluster_role, build_cluster_role_binding, build_operator_deployment, operator_secret) so the e2e bring-up can apply each resource via K8sResourceScore without re-implementing the manifests. * mod chart now lives in lib.rs so external consumers (the e2e bring-up) can reach into it. E2e bring-up ------------ * Bring-up gains a separate `fleet-operator` machine user with the fleet-admin role grant — distinct from the manual-admin `fleet-ops` user so audit logs can tell automated operator actions apart from human ones. * New steps 8/10 (build + sideload operator image) and 9/10 (apply CRDs + RBAC + Secret + Deployment + wait for Ready). Devices step becomes 10/10. * Reuses harmony_fleet_operator's manifest builders + operator_secret via K8sResourceScore — no duplicated YAML, no shell-out. 
Tests ----- * All existing tests pass (harmony-fleet-auth: 18, harmony-fleet-agent: 7, harmony-fleet-operator: 2). E2e walking-skeleton is exercised by the next phase's clean rerun.	2026-05-05 01:58:14 -04:00
Jean-Gabriel Gill-Couture	84a25dbb07	test(fleet-auth): cover assertion claims, scope, token URL, cache, keyfile Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The new tests lock the corners we burned cycles on while debugging the live system: * cache freshness boundary (within-leeway, outside-leeway, no-cache, non-zitadel variant) * assertion claim shape (iss/sub/aud/exp/iat) and the 60-second lifetime constant Zitadel enforces server-side * scope string content (plural-projects-roles + singular-project-id URN + openid base) * token URL strips trailing slashes (the //oauth/v2/token 404 waiting to bite the next operator) * MachineKeyFile JSON parsing under Zitadel's wire shape Refactor: build_assertion now delegates to build_assertion_claims + build_assertion_header (pure, no signing). Lets the claim/header shape be unit-tested without an RSA private-key fixture; the sign-and-decode end-to-end is still covered by the e2e harness. No new deps. wiremock not needed — every meaningful assertion is on pure logic.	2026-05-05 01:50:28 -04:00
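Two of the corners covered by the tests in 84a25dbb07, the trailing-slash handling in the token URL and the 60-second assertion lifetime, can be sketched as pure functions. The struct name `AssertionClaimsSketch` is hypothetical, and mapping the machine user id to `iss`/`sub` and the issuer URL to `aud` is an assumption here rather than something the commit states.

```rust
/// Lifetime of the signed assertion, which the commit says Zitadel enforces
/// server-side.
pub const ASSERTION_LIFETIME_SECS: u64 = 60;

/// Builds the token endpoint, stripping trailing slashes from the issuer so
/// the result never contains `//oauth/v2/token`.
pub fn build_token_url(issuer_url: &str) -> String {
    format!("{}/oauth/v2/token", issuer_url.trim_end_matches('/'))
}

/// Hypothetical claim container; the real crate's types may differ.
#[derive(Debug, PartialEq)]
pub struct AssertionClaimsSketch {
    pub iss: String,
    pub sub: String,
    pub aud: String,
    pub iat: u64,
    pub exp: u64,
}

/// Pure claim construction with no signing, so it can be unit-tested without
/// an RSA key fixture. ASSUMPTION: iss = sub = machine user id, aud = issuer.
pub fn build_assertion_claims(
    user_id: &str,
    issuer_url: &str,
    now_unix: u64,
) -> AssertionClaimsSketch {
    AssertionClaimsSketch {
        iss: user_id.to_string(),
        sub: user_id.to_string(),
        aud: issuer_url.to_string(),
        iat: now_unix,
        exp: now_unix + ASSERTION_LIFETIME_SECS,
    }
}
```

Keeping both helpers free of I/O mirrors the refactor the commit describes: the shape is asserted in unit tests, while sign-and-decode stays in the e2e harness.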
Jean-Gabriel Gill-Couture	4194baacad	refactor(fleet): extract NATS credential plumbing into harmony-fleet-auth The agent's `credentials.rs` + `CredentialsSection` enum graduate into a workspace crate (`fleet/harmony-fleet-auth/`) so the operator can consume the same code path. Single struct, single factory, single auth-callback wiring. The only thing that varies between consumers is where the `[credentials]` TOML bytes come from — the agent reads them from a config file on disk, the operator (next commit) will read them from an env var. Public surface of the new crate: CredentialsSection — the deserializable CredentialSource / NatsCredential — the runtime objects MachineKeyFile / CachedToken — helper types credential_source_from_config — factory connect_options_with_credentials — async-nats wiring Agent consumes via `pub use harmony_fleet_auth::CredentialsSection` in its own `config.rs` so existing call sites keep working. Existing 5 tests in the new crate + 7 in the agent all green. This commit is structurally a move; behavior unchanged. Operator wiring, additional unit tests, and the JWT-mint refactor (split build_assertion / build_scope / build_token_url for testability) follow in the next commits.	2026-05-05 01:48:42 -04:00
Jean-Gabriel Gill-Couture	612d934ad4	docs(fleet): manual JWT-bearer mint + NATS write recipe Working PyJWT script + nats CLI commands for talking to a callout-protected NATS by hand. Distills what we learned debugging the auth chain: which scope claims matter, why the audience is the project id (not the API app's clientId), how to read OIDC_AUDIENCE off the live callout instead of trusting the cache, and the failure modes — including the PyJWT vs jwt package collision that costs 30 minutes the first time you hit it. Cross-linked from fleet-zitadel-faq.md.	2026-05-05 01:43:36 -04:00
Jean-Gabriel Gill-Couture	f4d6fb9431	fix(zitadel): always live-query Zitadel for IDs instead of trusting cache ZitadelClientConfig was used as both a key store (machine keys — which Zitadel cannot return after creation, so caching is required) AND a lookup cache (project_id, machine_user_ids, user_grants). The latter introduced a silent drift class: - ZitadelSetupScore writes the cache incrementally as it creates each resource. - If Zitadel is reset between runs (Postgres recreated, IDs reissued), the cache still holds the old IDs. - ensure_project / ensure_app / ensure_machine_user / user_grant short-circuited on cache hit and never consulted Zitadel — so downstream Scores got the stale ID. - The legacy `project_id` field was further `is_none`-guarded so it preserved the very first id ever seen, surviving any number of Zitadel resets. Net effect in the wild: the deployed callout's `OIDC_AUDIENCE` silently pointed at a project that no longer existed, while agents kept working only because their TOML config carried the matching stale id. A manual mint script reading `project_id` from the cache would produce tokens that pass signature validation but fail the audience check — exactly the symptom that surfaced this bug. Fix: drop the cache-hit short-circuit in every ensure_* path and always live-query. The cache now only holds machine key material (its only legitimate role) and a record of last-known IDs that get refreshed on every apply. Cost: ~1 extra HTTP per project / app / user / grant per Score apply — these are not hot paths. Also: stop is_none-guarding `config.project_id` so the legacy field tracks live state for older single-project consumers.	2026-05-05 01:11:18 -04:00
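The "always live-query, then refresh the cache" pattern from f4d6fb9431 can be sketched with the lookup and creation steps passed in as closures. `IdCacheSketch` and `ensure_id` are hypothetical names; the real `ensure_*` paths talk to Zitadel over HTTP and are not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical record of last-known IDs. It is written on every apply but
/// never used to short-circuit a lookup.
#[derive(Default)]
pub struct IdCacheSketch {
    pub last_known: HashMap<String, String>,
}

/// Resolves a resource id by asking the live system first, creating the
/// resource only when the live lookup finds nothing, and then overwriting
/// whatever the cache held for that name.
pub fn ensure_id<F, C>(cache: &mut IdCacheSketch, name: &str, find_live: F, create: C) -> String
where
    F: FnOnce(&str) -> Option<String>,
    C: FnOnce(&str) -> String,
{
    let id = find_live(name).unwrap_or_else(|| create(name));
    cache.last_known.insert(name.to_string(), id.clone());
    id
}
```

A stale cache entry therefore survives at most until the next apply, at the cost of one lookup per resource, which is the trade-off the commit accepts.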
Jean-Gabriel Gill-Couture	3069f5b9ae	Merge remote-tracking branch 'origin' into feat/nats-auth-callout-e2e	2026-05-04 15:38:52 -04:00
Jean-Gabriel Gill-Couture	c6284c09bc	feat(fleet-agent): emit state pulse on direct device-state.<id> subject The agent's data plane was JetStream-KV-only, so live observers that don't want to consume the JS stream had no signal to subscribe to. The walking-skeleton e2e admin test was failing as a result — admin subscribes to `device-state.>` (the per-device direct subject) and saw nothing in 30s. This commit adds a small core-NATS publish on `device-state.<id>` alongside the existing KV writes: - `FleetPublisher::publish_state_pulse()` emits a tiny `{device_id, kind: "heartbeat", at}` payload on `device-state.<device_id>`, called from the heartbeat loop so observers see traffic on the same 30s cadence as the KV heartbeat write — but on a non-JetStream subject anyone can sub to. - `write_deployment_state()` now fans out the same payload it puts in the KV bucket on the direct subject, so live admin tooling picks up reconcile transitions immediately without watching the KV stream. Also threads `device_id_prefix_strip = "device-"` through the fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore construction (parallel to fleet_auth_callout's `bring_up_stack`), and was missing the prefix-strip line, so the deployed callout was interpolating permissions against `device-vm-device-00` instead of the bare device id the agent uses. Locks the regression with a unit test (`device_id_prefix_strip_lands_as_env_value`) on the deployment manifest builder. Verified end-to-end in the VM rehearsal: test both_devices_heartbeat_within_60s ... ok test admin_jwt_reads_any_device_subject ... ok	2026-05-04 09:36:26 -04:00
Jean-Gabriel Gill-Couture	54308fd7a4	chore: formatting	2026-05-04 09:03:35 -04:00
Jean-Gabriel Gill-Couture	d4fd4859ec	fix(callout): align device permissions with KV key formats and machine-user prefix Two bugs surfaced when the agent went live against NATS JetStream KV in the VM-based e2e rehearsal: 1. The default `device` role only allowed flat `device-state.<id>` / `device-commands.<id>` subjects. The agent's actual data plane is JetStream KV, which puts every operation on `$KV.<bucket>.<key>` subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`. With the old role config, the very first KV publish died with `Permissions Violation for Publish to "$JS.API.INFO"`. The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four per-device data subjects derived from harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>, heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct `device-state.<id>` / `device-commands.<id>` subjects are kept so non-JetStream callers of NatsAuthCalloutScore still work. A new unit test (`device_role_covers_reconciler_contract_kv_subjects`) imports the contract crate as a dev-dep and asserts each contract-produced subject is matched, plus that cross-device subjects are not matched. This locks the role config to the contract surface so future renames break the test before they break prod. 2. Zitadel's `client_id` claim for a machine user equals the userName verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the user as `device-{device_id}`, so the JWT carries `device-vm-device-00` while the agent's KV keys use the bare `vm-device-00`. The callout was interpolating the prefixed string into permissions, producing rules that never matched what the agent actually publishes. Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`, defaults empty so existing deployments are unaffected). When set, the validator strips the prefix from the extracted claim before permission interpolation. 
The fleet_auth_callout example wires it to `device-` so the e2e harness stays end-to-end correct without reaching into either naming convention. Verified end-to-end: both VM agents now publish DeviceInfo / heartbeat through JetStream KV with no permission errors and zero service restarts since the rollout.	2026-05-03 17:49:48 -04:00
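The prefix strip and the permission interpolation it feeds can be sketched as below, together with a small NATS-style subject matcher to show why the unstripped claim never matched. The `{device_id}` placeholder syntax, the function names, and the example bucket name are assumptions for illustration, not the callout's actual template format.

```rust
/// Strips the configured prefix from the extracted claim. An empty prefix
/// leaves the claim untouched, matching the "defaults empty" behaviour.
pub fn effective_device_id<'a>(claim: &'a str, prefix_strip: &str) -> &'a str {
    if prefix_strip.is_empty() {
        return claim;
    }
    claim.strip_prefix(prefix_strip).unwrap_or(claim)
}

/// Fills a permission template. The `{device_id}` placeholder is a
/// hypothetical syntax chosen for this sketch.
pub fn interpolate(template: &str, device_id: &str) -> String {
    template.replace("{device_id}", device_id)
}

/// Minimal NATS-style subject match: `*` matches exactly one token and `>`
/// matches one or more remaining tokens. Assumes `>` only appears last.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pattern_tokens = pattern.split('.');
    let mut subject_tokens = subject.split('.');
    loop {
        match (pattern_tokens.next(), subject_tokens.next()) {
            (Some(">"), Some(_)) => return true,
            (Some("*"), Some(_)) => continue,
            (Some(p), Some(s)) if p == s => continue,
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

With the prefix set to `device-`, the claim `device-vm-device-00` yields a rule for `vm-device-00`, so a publish on that device's key matches while another device's key does not.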
Jean-Gabriel Gill-Couture	050d4697d2	chore: cargo fmt setup_score.rs	2026-05-03 17:49:22 -04:00
Jean-Gabriel Gill-Couture	7dd5f1504f	chore: cargo fmt sweep across modified files No behavior changes; only re-flowing existing expressions.	2026-05-03 17:49:15 -04:00
Jean-Gabriel Gill-Couture	6607fe7494	fix(e2e-demo): point agent_binary default at the real cargo target name The cargo bin target is `harmony-fleet-agent`, not `fleet-agent` — the latter never existed under target/release. Smoke-a4 happened to work because callers passed --agent-binary explicitly; the harness defaults didn't.	2026-05-03 17:49:09 -04:00
Jean-Gabriel Gill-Couture	a4b9e7ac9f	fix(fleet-agent): request projects:roles scope so role claim is emitted Zitadel only includes the project-roles block in an access token when the JWT-bearer request asks for it via the `urn:zitadel:iam:org:projects:roles` scope (PLURAL "projects"). Without it the agent's token has a valid signature/audience but no roles, so the NATS auth callout rejects with "no authorized role in token" even though the machine user has a "device" grant. Discovered while running the VM-based e2e rehearsal: agents could mint a token, connect to NATS, then immediately fail authorization. The plural-projects vs. singular-project distinction is a Zitadel convention; both scopes are required, and the comment now spells out what each one does.	2026-05-03 17:49:04 -04:00
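The scope string this fix describes can be sketched as a pure builder. The plural `projects:roles` URN is quoted from the commit; the singular project-id audience URN shown here follows Zitadel's reserved-scope convention as best I know it and should be treated as an assumption, as should the function name.

```rust
/// Builds the space-separated scope for the JWT-bearer token request.
/// - `openid`: base OIDC scope.
/// - plural `projects:roles`: asks Zitadel to emit the project-roles block.
/// - singular `project:id:<id>:aud`: adds the project to the token audience
///   (ASSUMED exact URN form; verify against the Zitadel documentation).
pub fn build_scope(project_id: &str) -> String {
    [
        "openid".to_string(),
        "urn:zitadel:iam:org:projects:roles".to_string(),
        format!("urn:zitadel:iam:org:project:id:{project_id}:aud"),
    ]
    .join(" ")
}
```

Dropping the plural entry reproduces the failure mode in the commit: the token still validates, but it carries no roles for the callout to authorize.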
Jean-Gabriel Gill-Couture	49f9834eb2	feat(e2e-demo): apply FleetDeviceSetupScore over SSH per VM Wires the previously-built FleetDeviceSetupScore through to a LinuxHostTopology against each pre-provisioned VM. Mirrors the fleet_rpi_setup pattern but synthesizes inline so the harness drives N VMs in sequence without re-deriving the CLI plumbing. Each VM gets: - An /etc/hosts entry mapping `sso.fleet.local` → libvirt host IP via the new HostsEntry support, so the in-VM agent's HTTP client to Zitadel can resolve the issuer. - The per-device Zitadel machine key dropped at /etc/fleet-agent/zitadel-key.json. - Agent TOML with `type = "zitadel-jwt"` pointing at the keyfile. - Agent service started under systemd. SSH user assumed `fleet-admin` (matches what fleet_vm_setup + smoke-a4 cloud-init create). Private key from the harmony fleet keypair (ensure_fleet_ssh_keypair). After this commit, `cargo run -p example-fleet-e2e-demo` is the single command that turns a fresh k3d + 2 booted VMs into a fully-converged stack: Zitadel + NATS callout + 2 agents speaking JWT-bearer to NATS. Tomorrow's morning: prove it actually does that on a clean machine.	2026-05-03 17:08:52 -04:00
Jean-Gabriel Gill-Couture	1d453dd9aa	feat(e2e-demo): VM-based rehearsal harness + /etc/hosts injection Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's existing pieces (Zitadel + auth callout deploy) with per-device machine-user provisioning (one ZitadelSetupScore call per VM) and FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness expects pre-provisioned libvirt VMs (one per device) reachable via `FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via ProvisionVmScore is a follow-up — keeping the harness observable in pieces during the cold-start debugging tomorrow. Constituent helpers in `fleet_auth_callout::lib.rs` flipped from private to `pub` (deploy_zitadel, wait_for_zitadel_ready, ensure_issuer_seed, build_and_load_callout_image, etc.) so the new harness composes them rather than re-implementing. `bring_up_full_stack`: 1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d). 2. Deploy Zitadel + Postgres. 3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the chart-provisioned `iam-admin-pat` secret. (Last step is new and load-bearing — without it ZitadelSetupScore races the chart's setup job and fails on first cold-run.) 4. ZitadelSetupScore for project + API app + roles + admin machine-user (admin gets fleet-admin role grant). 5. Issuer NKey from a persisted secret + NATS deploy with auth_callout block + callout pod. 6. For each device i: per-device ZitadelSetupScore (machine-user with `device` role grant), pull the JSON keyfile from cache, render the agent's TOML with the keyfile path. (FleetDeviceSetupScore invocation is wired structurally; the SSH-and-apply step is gated behind the VM provisioning follow-up.) `HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so VMs on a libvirt NAT can resolve `sso.fleet.local` to the host gateway. Managed-block markers in /etc/hosts make the merge idempotent across re-runs and removable when entries are dropped from the score. 
Four new unit tests cover the merge invariants (insert, replace, strip, byte-stable). Tests skeleton in `tests/e2e_walking_skeleton.rs`: - `both_devices_heartbeat_within_60s` — implemented; reads from device-info KV via admin token. - `admin_jwt_reads_any_device_subject` — implemented; subscribes to `device-state.>` as admin. - `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending per-device-key plumbing through E2eHandles. - `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending the NATS-pod-restart driver. The two `#[ignore]`d tests cover the load-bearing reconnect and isolation invariants. Wiring them is the morning-of-rehearsal priority since those are the customer-facing claims. Out of scope of this commit (called out in the roadmap doc): - ProvisionVmScore integration (today operator runs fleet_vm_setup out-of-band). - Operator install via Helm (smoke-a4 runs operator host-side; this harness inherits that pattern). - Full SSH-based agent install via FleetDeviceSetupScore — Score built, invocation gated.	2026-05-03 17:07:40 -04:00
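The managed-block merge that makes the `/etc/hosts` edit idempotent can be sketched as follows. The marker strings and the function signature are hypothetical; only the behaviour (insert, replace, strip, byte-stable re-run) is taken from the commit message.

```rust
// Hypothetical marker lines; the real score may use different text.
const BEGIN: &str = "# BEGIN harmony-managed";
const END: &str = "# END harmony-managed";

/// Rewrites the managed block of a hosts file. Lines outside the markers are
/// preserved in order; any previous managed block is dropped and, if
/// `entries` is non-empty, a fresh block is appended at the end.
pub fn merge_hosts_file(existing: &str, entries: &[(&str, &str)]) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            BEGIN => in_block = true,
            END => in_block = false,
            _ if in_block => {} // drop stale managed entries
            _ => out.push(line.to_string()),
        }
    }
    if !entries.is_empty() {
        out.push(BEGIN.to_string());
        for (ip, host) in entries {
            out.push(format!("{ip} {host}"));
        }
        out.push(END.to_string());
    }
    let mut merged = out.join("\n");
    if !merged.is_empty() {
        merged.push('\n');
    }
    merged
}
```

Because the block is always regenerated from the entry list, running the merge twice with the same entries yields byte-identical output, and passing an empty list removes the block entirely.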
Jean-Gabriel Gill-Couture	fdcc7040dd	docs(fleet): chapter 6 — VM-based customer demo rehearsal plan Adds ROADMAP/fleet_platform/v0_demo_e2e.md and threads it from v0_1_plan.md. The VM rehearsal extends smoke-a4 (already-green k3d + libvirt VM + agent + apply CR + reconcile loop) with Zitadel + auth callout + agent JWT auth. Two devices + one admin, real cargo tests sharing a OnceCell-bringup. Plan calls out: - The 7 tests, including the load-bearing `agent_recovers_from_nats_pod_restart` (asserts the auto-reconnect + auth-callback re-mint path under realistic disturbance). - Five known risks / debugging traps to expect on first cold-start (iam-admin-pat secret timing, /etc/hosts injection, k3d port collisions, etc.). - Success criteria for the rehearsal day: cold cargo run greens in <20 min, all 7 tests green on a clean machine, the NATS-restart test reliably greens 5 runs in a row. - Anything below the success criteria → reframe the customer call to "architecture walkthrough + local k3d demo + pilot in 1-2 weeks." Avoids burning the relationship to keep a deadline. Once VM rehearsal is green the residual OKD deltas are configuration (Route annotations, image registry, real DNS, cert) — no new code.	2026-05-03 16:59:43 -04:00
Jean-Gabriel Gill-Couture	e3e6d33dc8	fix(fleet_vm_setup): adopt FleetDeviceAuth::TomlShared shape The VM smoke harness still uses shared NATS creds for v0 (no Zitadel JWT path through libvirt — the customer-facing Pi flow has it via fleet_rpi_setup --bootstrap-token). Rewriting the FleetDeviceSetupConfig literal against the new `auth: FleetDeviceAuth` field.	2026-05-03 15:44:18 -04:00
Jean-Gabriel Gill-Couture	4053ac52de	docs(fleet): demo runbook (operator + developer flow, single page) Hands-on walkthrough for the 48-hour customer demo: - Operator: build/push the callout image → fleet-staging-deploy → capture project_id + cli_client_id from the printed panel. - Developer: fleet-sso-login proves Zitadel SSO works end-to-end. - Pi onboarding: extract iam-admin-pat from the staging cluster, cross-compile the agent for aarch64, run fleet-rpi-setup once per device with --bootstrap-token. Each Pi's agent connects to NATS over WSS using the JWT-bearer token minted from its per-device keyfile. - Deploy a container to a labeled subset via example_harmony_apply_deployment with --env / --volume / --restart flags (env + bind mounts + restart policy that work_item #1 added). - Observe the cross-device security model holding via the auth callout's logs. Also captures what's deliberately NOT in the demo (compose auto-translation, UI, Tailscale backdoor, device-join-request flow, OpenBao, K8s OIDC) so the customer call has clean expectation-setting. The runbook is the closing piece of the 48h-demo work plan; sequenced after the eight feat / refactor commits that built the underlying functionality.	2026-05-03 15:43:10 -04:00
Jean-Gabriel Gill-Couture	5396ef8bf2	feat(example): fleet-sso-login — Zitadel device-code CLI login Adds `examples/fleet_sso_login/` — the developer-side CLI that proves the SSO works end-to-end against a deployed staging instance. RFC 8628 device-code flow: - POSTs `/oauth/v2/device_authorization` with the harmony-cli client_id. - Prints `verification_uri_complete` so the developer opens one URL in the browser; Zitadel handles the auth (username/password, MFA, whatever the customer has wired into Zitadel's auth chain). - Polls `/oauth/v2/token` honouring the standard `authorization_pending` / `slow_down` polling protocol. - On success: decodes the access token's claims, prints `Welcome <name> <email>`, persists the session (issuer + client_id + access_token + claims) at $DATA_DIR/harmony/sso-session.json with mode 0600. For the demo this proves the SSO chain end-to-end. The actual `harmony fleet apply` operation (which would consume the persisted token through a fleet-platform API gateway) is post-demo — clusters typically don't accept Zitadel JWTs as kube-apiserver bearer tokens without an OIDC integration the customer would have to opt into. `fleet_staging_deploy` now also provisions a `harmony-cli` Device Code OIDC application alongside the existing API app, captures its client_id from the ZitadelClientConfig cache, and prints both the client_id and the exact `cargo run -p example-fleet-sso-login ...` invocation in the operator's "next steps" panel.	2026-05-03 15:41:54 -04:00
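The `authorization_pending` / `slow_down` polling protocol mentioned above can be sketched as a pure decision function over the error code returned by the token endpoint. `PollStep` and `next_poll_step` are hypothetical names; the 5-second back-off on `slow_down` comes from RFC 8628, not from the commit.

```rust
use std::time::Duration;

/// What the poller should do after an error response from the token endpoint.
#[derive(Debug, PartialEq)]
pub enum PollStep {
    /// Wait this long, then poll again.
    Retry(Duration),
    /// Stop polling and surface the error code (e.g. `access_denied`).
    Fail(String),
}

/// Maps an RFC 8628 error code to the next polling step.
/// - `authorization_pending`: keep the current interval.
/// - `slow_down`: increase the interval by 5 seconds for all later polls.
/// - anything else: terminal.
pub fn next_poll_step(error_code: &str, interval: Duration) -> PollStep {
    match error_code {
        "authorization_pending" => PollStep::Retry(interval),
        "slow_down" => PollStep::Retry(interval + Duration::from_secs(5)),
        other => PollStep::Fail(other.to_string()),
    }
}
```

A caller would sleep for the returned duration and carry it forward as the new interval, so repeated `slow_down` responses keep widening the gap between polls.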
Jean-Gabriel Gill-Couture	8d8e700786	feat(example): fleet-staging-deploy — operator-side OKD bringup Adds `examples/fleet_staging_deploy/` — the operator-side, run-once-per-customer harness that brings up the fleet platform's central services on a real OKD/K8s cluster. Complements the existing `fleet_auth_callout` (k3d local-dev harness, kept unchanged) and `fleet_rpi_setup` (per-device onboarding). `FleetDomainConfig` is the single source of truth for hostnames: base_domain = "customer1.nationtech.io" → zitadel.<base> (Zitadel HTTPS via OKD HAProxy edge-TLS) → nats.<base> (NATS WSS through the same ingress) Nothing is hardcoded; the operator supplies one --base-domain flag and the deploy is fully parameterized. Re-running is idempotent (rides the helm-upgrade-by-default + ZitadelSetupScore search-then-create + persisted issuer-NKey-secret idempotency layers). NATS values render under config.merge.{auth_callout, accounts, system_account}, with WSS via `websocket: { enabled, port: 8443, ingress: { className: openshift-default, ... } }` and the OKD-flavored HAProxy edge-TLS annotations: route.openshift.io/termination: edge haproxy.router.openshift.io/timeout: "1h" (Switch to `reencrypt` when the customer wants pod-to-edge TLS; gateway-api migration is on their roadmap, separate from the demo.) bring_up_staging(): - Deploys ZitadelScore (external_secure: true, no external_port → 443). - Waits for HTTPS .well-known. - Provisions the project + API app + roles via ZitadelSetupScore hitting Zitadel through the public ingress (port 443, TLS verified). No machine users provisioned — fleet_rpi_setup mints them on demand per device, so the staging deploy stays device-count-agnostic. - Persists / reads the issuer NKey seed in the `callout-issuer-seed` K8s secret (so re-runs don't invalidate user JWTs already in flight on customer Pis). - Deploys NATS via NatsHelmChartScore with the WSS values. 
- Deploys NatsAuthCalloutScore (oidc_audience = project_id; external_secure path means no danger_accept_invalid_certs). main.rs ends by printing the exact `cargo run -p example-fleet-rpi-setup ...` invocation the operator runs against a Pi, with the project_id and zitadel/nats URLs filled in. Three unit tests cover the domain config + NATS values rendering (WSS + edge-TLS annotations + auth_callout under merge).	2026-05-03 15:38:56 -04:00
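The "single source of truth for hostnames" idea behind `FleetDomainConfig` can be sketched as a small struct that derives every hostname from one base domain. `FleetDomainSketch` and its method names are hypothetical, and the URL schemes are assumptions inferred from the HTTPS and WSS wording in the commit.

```rust
/// Hypothetical stand-in for `FleetDomainConfig`: one input, derived outputs.
pub struct FleetDomainSketch {
    pub base_domain: String,
}

impl FleetDomainSketch {
    /// Normalizes stray leading or trailing dots so derived names stay valid.
    pub fn new(base_domain: &str) -> Self {
        Self {
            base_domain: base_domain.trim_matches('.').to_string(),
        }
    }

    /// `zitadel.<base>`
    pub fn zitadel_host(&self) -> String {
        format!("zitadel.{}", self.base_domain)
    }

    /// `nats.<base>`
    pub fn nats_host(&self) -> String {
        format!("nats.{}", self.base_domain)
    }

    /// ASSUMED form of the issuer URL: HTTPS on the default port 443.
    pub fn zitadel_issuer_url(&self) -> String {
        format!("https://{}", self.zitadel_host())
    }

    /// ASSUMED form of the agent-facing NATS URL: WSS through the ingress.
    pub fn nats_wss_url(&self) -> String {
        format!("wss://{}", self.nats_host())
    }
}
```

Deriving every name from `base_domain` means a second customer deploy differs only in the one `--base-domain` value the operator supplies.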
Jean-Gabriel Gill-Couture	ab98cbabf9	feat(fleet): per-device Zitadel bootstrap in fleet_rpi_setup The Pi onboarding flow can now mint a per-device Zitadel machine user on the operator's machine and ship the resulting JWT key to the Pi — the agent then authenticates to NATS via JWT-bearer instead of shared nats_user/nats_pass. `FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous flat `nats_user` / `nats_pass` fields. Two variants: - TomlShared { nats_user, nats_pass } — legacy / dev fallback. - ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } — per-device JWT-bearer. The Score: * Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json (mode 0640, owner fleet-agent — matches the agent's secret-mount conventions). * Renders [credentials] type = "zitadel-jwt" pointing at that keyfile + the issuer + audience the agent's CredentialSource needs. A change to either the keyfile content or the TOML triggers an agent restart, same as binary / unit drift. `fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path. The bootstrap PAT is held in the CLI's memory only; it never lands on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id, --zitadel-device-role (default `device`), --danger-accept-invalid-certs. `zitadel_bootstrap` is a slim ManagementAPI client that, idempotently per device: 1. Find-or-create machine user `device-${device_id}`. 2. Find-or-skip a project role grant (defaults to `device`). 3. Always mint a fresh JSON key and return its content. (Zitadel doesn't expose the private half of an existing key, so reusing isn't possible — stale keys remain valid until expiry, which is fine because each setup run overwrites the on-device keyfile.) Three new render_toml tests cover the zitadel-jwt path; eleven existing agent tests still pass. Out of scope, tracked: device-join-request + admin-approve flow that would replace bootstrap-PAT entirely (closer to the OKD node-approval pattern). 
Long-lived admin PAT is acceptable for the demo per product call.	2026-05-03 15:22:13 -04:00
Jean-Gabriel Gill-Couture	b4d3d7d02c	fix(linux): SshCredentials default_ubuntu_aws missing sudo_password The merge of feat/prepare-rpi added a `sudo_password: Option<String>` field to SshCredentials but the `default_ubuntu_aws` constructor on the destination branch was authored before that field existed. Add the missing field as `None` (matches the prepare-rpi semantics: passwordless sudo expected unless explicitly configured).	2026-05-03 15:17:03 -04:00
Jean-Gabriel Gill-Couture	c785f13abd	merge: feat/prepare-rpi (Pi onboarding harness + linux-host capabilities)	2026-05-03 15:15:22 -04:00
Jean-Gabriel Gill-Couture	74ee7fc9f2	feat(agent): Zitadel JWT credential source + auto-reconnect The fleet agent's NATS connection is the load-bearing piece of the "never lose connectivity to a device" guarantee. This commit makes that hold even when Zitadel access tokens expire across NATS pod restarts and network partitions. New `[credentials]` config variants (internally tagged by `type`): type = "toml-shared" { nats_user, nats_pass } # v0/dev type = "zitadel-jwt" { key_path, oidc_issuer_url, audience, ... } A `CredentialSource` enum dispatches per variant: - TomlShared returns the same user/pass each call. - ZitadelJwt mints an access token from Zitadel via the JWT-bearer flow (RFC 7523). The keyfile at `key_path` is the only durable secret on the device; the bearer token is short-lived and refreshed in-memory when the cached value is within 5 minutes of expiry. Two concurrent refreshes are race-safe — the second writer's mint is wasted but produces a correct token. The agent's `connect_nats` is rewritten on top of async-nats's `with_auth_callback`, which is invoked on every (re)connect attempt: - async-nats reconnects automatically on disconnect (default behaviour of ConnectOptions) — we don't need a watchdog. - Each reconnect attempt invokes the callback, which calls `next_credential()`. If the cached token is expired, a fresh one is minted before the reconnect proceeds. So a Pi that loses NATS while its token has just expired will pick up a brand-new token on the next reconnect attempt with no operator intervention. - An `event_callback` surfaces Connected / Disconnected / SlowConsumer / ServerError events into tracing — operators can see exactly when reconnects happen, which is non-negotiable for an out-of-warranty device fleet. A subtle constraint drove the trait shape: async-nats's `with_auth_callback` requires the returned future to be `Send + Sync`, which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does not satisfy. 
The credential source is therefore an enum (concrete dispatch) rather than `dyn CredentialSource`. Two variants is small enough that enum dispatch beats trait-object plumbing. Out of scope, tracked for follow-up: a separate daemon for SSH access to the Pi via Tailscale/Headscale ("secure backdoor"), and the device-join-request + admin-approve flow that would replace the current admin-PAT bootstrap pattern.	2026-05-03 15:15:01 -04:00
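The enum-dispatch credential source with its refresh-before-expiry rule can be sketched synchronously as below. All type names carry a `Sketch` suffix because they are hypothetical; the real source is async and mints tokens over HTTP, which is replaced here by a caller-supplied `mint` closure so the freshness logic can be read in isolation.

```rust
use std::time::{Duration, SystemTime};

/// Refresh when the cached token is within 5 minutes of expiry.
pub const REFRESH_LEEWAY: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, Clone, PartialEq)]
pub enum NatsCredentialSketch {
    UserPass { user: String, pass: String },
    Token(String),
}

pub struct CachedTokenSketch {
    pub token: String,
    pub expires_at: SystemTime,
}

/// Concrete enum dispatch instead of a trait object, as the commit describes.
pub enum CredentialSourceSketch {
    TomlShared { nats_user: String, nats_pass: String },
    ZitadelJwt { cached: Option<CachedTokenSketch> },
}

impl CredentialSourceSketch {
    /// Returns the credential for the next (re)connect attempt. `mint` is
    /// only called when there is no cached token or it is too close to expiry.
    pub fn next_credential(
        &mut self,
        now: SystemTime,
        mint: impl FnOnce() -> CachedTokenSketch,
    ) -> NatsCredentialSketch {
        match self {
            Self::TomlShared { nats_user, nats_pass } => NatsCredentialSketch::UserPass {
                user: nats_user.clone(),
                pass: nats_pass.clone(),
            },
            Self::ZitadelJwt { cached } => {
                let fresh = cached.as_ref().map_or(false, |c| {
                    c.expires_at
                        .duration_since(now)
                        .map_or(false, |left| left > REFRESH_LEEWAY)
                });
                if !fresh {
                    *cached = Some(mint());
                }
                let token = cached.as_ref().expect("token was just ensured");
                NatsCredentialSketch::Token(token.token.clone())
            }
        }
    }
}
```

Calling `next_credential` from a reconnect callback gives the behaviour the commit describes: a connection dropped near token expiry picks up a newly minted token on the next attempt, while a still-fresh token is reused without contacting the identity provider.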
Jean-Gabriel Gill-Couture	a0a5faa3d0	chore: remove accidentally-committed scratch + agent worktrees The previous commit swept in `.claude/worktrees/*` (ephemeral agent worktree submodules) and a few scratch files that landed at the repo root during prior sessions. None of them are project artifacts. Removing them from the index and adding to .gitignore so future `git add -A` doesn't re-include them. Files on disk are unchanged.	2026-05-03 15:08:19 -04:00
Jean-Gabriel Gill-Couture	6d55892736	feat(podman): env vars + bind-mount volumes + restart policy The IoT walking-skeleton's PodmanV0Score and the underlying ContainerSpec capability were name+image+ports only. Real customer workloads (the demo target's docker-compose for example) need at minimum: - Environment variables for runtime config + secrets injected at deploy time. - Bind-mount volumes so the container can persist data across recreates (sqlite db files, config dirs). - Restart policy so the container survives device reboot or crash. PodmanService and ContainerSpec gain `env: Vec<(String, String)>`, `volumes: Vec<VolumeMount>`, and `restart_policy: RestartPolicy`. All three default to empty / `unless-stopped` via #[serde(default)] so any Deployment CR written before this change still deserializes — that includes the existing smoke harnesses and any field-side state. VolumeMount is bind-only in v0 (host_path -> container_path, optional read_only). Named/anonymous volumes can be added behind the same field later by inspecting host_path's shape; the customer's compose file is expected to use bind mounts only. RestartPolicy mirrors podman/docker convention — `no`, `unless-stopped` (default, matching docker-compose), `on-failure`, `always`. Serialized kebab-case so docker-compose translation is mechanical. PodmanTopology::ensure_service_running now passes env / mounts / restart policy to the podman API. matches_spec conservatively forces recreate whenever the spec carries non-empty env / volumes or a non- default restart policy: the podman list endpoint doesn't surface those fields, so a structural compare isn't possible from ListContainer alone. Recreating an unchanged container is cheap (~hundreds of ms); the alternative (silent stale-config window) isn't acceptable for fleet-managed devices. example_harmony_apply_deployment grows --env, --volume, and --restart flags so an operator can drive the new shape from the CLI when authoring a Deployment CR. 
Tests: - legacy CR JSON without the new fields deserializes (wire-compat). - env ordering survives roundtrip (drift-detection invariant). - restart policy serializes kebab-case (compose-translation contract). - podman_v0_score_roundtrip exercises env + volumes + restart.	2026-05-03 15:08:01 -04:00
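The conservative recreate rule in `matches_spec` can be illustrated with a std-only sketch. The struct layout, the `read_only: bool` field, and the helper names `forces_recreate` and `as_kebab` are assumptions for this example; the real types derive serde traits, which are left out here to keep the block dependency-free.

```rust
/// Restart policy names follow the podman/docker convention from the commit.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum RestartPolicy {
    No,
    #[default]
    UnlessStopped,
    OnFailure,
    Always,
}

impl RestartPolicy {
    /// Kebab-case wire form, so docker-compose translation stays mechanical.
    pub fn as_kebab(self) -> &'static str {
        match self {
            RestartPolicy::No => "no",
            RestartPolicy::UnlessStopped => "unless-stopped",
            RestartPolicy::OnFailure => "on-failure",
            RestartPolicy::Always => "always",
        }
    }
}

/// Bind mount only (host_path -> container_path), as in v0.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct VolumeMount {
    pub host_path: String,
    pub container_path: String,
    pub read_only: bool,
}

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct ContainerSpec {
    pub name: String,
    pub image: String,
    pub env: Vec<(String, String)>,
    pub volumes: Vec<VolumeMount>,
    pub restart_policy: RestartPolicy,
}

/// True when the spec carries fields the list endpoint cannot report back,
/// so a structural comparison is impossible and a recreate is forced.
pub fn forces_recreate(spec: &ContainerSpec) -> bool {
    !spec.env.is_empty()
        || !spec.volumes.is_empty()
        || spec.restart_policy != RestartPolicy::default()
}
```

A spec with only name, image, and ports keeps the cheap comparison path; adding any env var, volume, or non-default restart policy switches to recreate-on-apply.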
Jean-Gabriel Gill-Couture	6c45fb22ba	feat(nats-callout): production callout + harmony module + e2e demo harmony-nats-callout becomes a deployable service, not just a library: - New [[bin]] target with env+secret-file driven config and SIGINT/SIGTERM-aware shutdown. - Dockerfile (single-stage archlinux:base, non-root, matches harmony-fleet-operator convention). - Refactored handler into a pure `decide()` function so the entire authorization decision tree is unit-testable without async-nats. - New `roles` module with role resolution + a `validate_device_id` security gate that rejects NATS subject metacharacters in device_id (.>* whitespace) — closes a real escalation path through the `{device_id}` placeholder in the per-device permissions block. - Configurable role claim path + admin/device role names; admin wins when both are present (privilege-escalation invariant). 57 unit tests cover every reachable branch of the security decision tree; 4 e2e tests in nats/integration-test-callout exercise real NATS in podman with: device pubsub on own subjects, cross-device subject isolation, admin-can-read-anything, and JWT-without-role rejection. harmony/src/modules/nats_auth_callout/: - New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment + Secret. fsGroup + 0o440 secret mode so the non-root container can read its mounted seed/password without leaving them in env vars. - `render_auth_callout_block` helper produces the YAML for NATS Helm `config.merge.authorization.auth_callout` so both halves stay in sync. examples/fleet_auth_callout/: - `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres -> CoreDNS rewrite -> project + roles + machine users with JWT keys -> NATS Helm with auth_callout block -> callout image build + sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs (issuer NKey persisted in a K8s secret so user JWTs survive restarts). - `mint_access_token()` RFC 7523 JWT-bearer client. 
Uses Host header with port so Zitadel emits a matching issuer. - main.rs prints URLs/creds/keyIds and waits for Ctrl-C. - Three #[tokio::test] functions sharing one cluster via OnceCell: admin_can_read_any_device_subject, device_can_only_access_own_subjects, unknown_role_is_rejected. All green on real k3d.	2026-05-03 15:01:44 -04:00
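The two security invariants named in this commit (rejecting NATS subject metacharacters in `device_id`, and admin winning when both roles are present) could look roughly like the sketch below. The signatures, the `Role` enum, and the error type are assumptions for illustration; only the function name `validate_device_id` and the rejected character set come from the commit message.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    Admin,
    Device,
}

/// Rejects characters that would change the meaning of a NATS subject once
/// the id is substituted into a `{device_id}` placeholder: the token
/// separator `.`, the wildcards `>` and `*`, and any whitespace.
pub fn validate_device_id(device_id: &str) -> Result<(), String> {
    match device_id
        .chars()
        .find(|&c| matches!(c, '.' | '>' | '*') || c.is_whitespace())
    {
        Some(bad) => Err(format!("device_id contains forbidden character {bad:?}")),
        None => Ok(()),
    }
}

/// Resolves the effective role from the roles found in the JWT claim.
/// Admin is checked first, so it wins when both are present; a token with
/// neither role resolves to `None` and should be rejected by the caller.
pub fn resolve_role(claimed: &[&str], admin_role: &str, device_role: &str) -> Option<Role> {
    if claimed.iter().any(|r| *r == admin_role) {
        Some(Role::Admin)
    } else if claimed.iter().any(|r| *r == device_role) {
        Some(Role::Device)
    } else {
        None
    }
}
```

Without the validation step, an id such as `dev.>` substituted into a per-device subject template would widen the permission to a wildcard subscription.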
Jean-Gabriel Gill-Couture	b8bc2217fd	feat(zitadel): ExternalPort + machine-user/role/key/grant provisioning ZitadelScore: - Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on this secret existing; without the chart values, the prior code path was non-functional. - New `external_port: Option<u32>` field. Controls Zitadel's emitted issuer URL when the host port mapping isn't 80/443 (k3d typically maps 8080:80). Without it, JWT-bearer audience validation 500s with `Errors.Internal` because the assertion's `aud` doesn't match the chart-default issuer at port 80. ZitadelSetupScore is extended for the JWT-bearer flow needed by the NATS auth callout: - API apps (resource servers — required for project-id audience scope) - Project roles (`POST .../projects/{id}/roles`, idempotent) - Machine users with KEY_TYPE_JSON keys (provisioned + cached device-side; Zitadel does not expose the key material on subsequent reads, so the local cache is the source of truth) - User grants (project + role keys) Cache (ZitadelClientConfig) gains projects, machine_user_ids, machine_keys, and user_grants — keyed for idempotency across re-runs. Backwards compatible with existing harmony_sso example: the new fields have `#[serde(default)]` and prior callers just need empty vecs. Refresh upgrade-by-default in helm chart (separate commit) lets ExternalPort changes propagate to existing releases on re-run.	2026-05-03 15:01:22 -04:00
Jean-Gabriel Gill-Couture	36974bda32	refactor(helm): upgrade-by-default for unpinned releases Helm releases without a pinned `chart_version` previously short-circuited to a NOOP when already installed, which silently dropped any `values_yaml` / `values_overrides` changes the caller had made. Now we fall through to `helm upgrade --install` whenever: - the release isn't installed (unchanged), or - it's installed and either unpinned or pinned-and-matching. Helm itself becomes the source of truth for "did anything actually change" — no-op upgrades are cheap and changed values get applied automatically without the caller having to opt in via a flag. `install_only=true` keeps the prior skip-if-installed shortcut so bootstrap operators (cert-manager, prometheus-operator, CRDs) that should not be touched on re-runs continue to behave the same. Pinned-version safety net is unchanged: a different version installed than what the score requests is an error, never a silent change.	2026-05-03 15:01:07 -04:00
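The decision table described in this commit can be written down as a small pure function. This is a sketch under assumptions: `plan_release` and `HelmAction` are hypothetical names, and checking the pinned-version mismatch before the `install_only` shortcut is a guess at the ordering, which the commit message does not spell out.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HelmAction {
    /// Run `helm upgrade --install` and let Helm decide whether anything changed.
    UpgradeInstall,
    /// Leave the existing release untouched.
    Skip,
}

pub fn plan_release(
    installed_version: Option<&str>,
    pinned_version: Option<&str>,
    install_only: bool,
) -> Result<HelmAction, String> {
    // Not installed yet: always install.
    let Some(installed) = installed_version else {
        return Ok(HelmAction::UpgradeInstall);
    };
    // Safety net: a pinned version that differs from what is installed is an
    // error, never a silent change. (Assumption: this runs before the
    // install_only shortcut.)
    if let Some(pinned) = pinned_version {
        if pinned != installed {
            return Err(format!(
                "installed version {installed} does not match pinned version {pinned}"
            ));
        }
    }
    // Bootstrap operators keep the skip-if-installed behaviour.
    if install_only {
        return Ok(HelmAction::Skip);
    }
    // Installed and either unpinned or pinned-and-matching.
    Ok(HelmAction::UpgradeInstall)
}
```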
Jean-Gabriel Gill-Couture	95a75d50a8	feat: Improve name of disable dad and system reserved score to show pool name Some checks failed Run Check Script / check (push) Failing after -44h57m34s Details Compile and package harmony_composer / package_harmony_composer (push) Failing after -44h56m7s Details	2026-05-03 07:17:40 -04:00
Jean-Gabriel Gill-Couture	7fa1ca2683	feat: default for ubuntu aws linux topology Some checks failed Run Check Script / check (pull_request) Failing after 12m51s Details	2026-05-01 08:53:03 -04:00
Jean-Gabriel Gill-Couture	af67992b6e	refactor: production auth callout service with real integration tests nats-jwt: - Add NkeyPub newtype with prefix validation - Add ClaimType and Algorithm typed enums - Add impl_nats_claims! macro eliminating 4x duplicated impl blocks - Add AuthorizationRequestClaimsBuilder (completing all builder types) - Fix AuthorizationResponseBuilder: add issuer() builder method, stop mutating iss in sign() - Tighten trait bounds: encode<T: Serialize>, decode_unverified<T: DeserializeOwned> - Remove dead error variants Expired/NotYetValid - Add builder tests for all 4 claims types - Deduplicate is_zero helper harmony-nats-callout (rewritten): - AuthCalloutService: production service connecting to NATS, subscribing to .REQ.USER.AUTH, dispatching auth requests - AuthCalloutConfig with builder pattern - handler.rs: pure auth request handler (decode → validate → mint → respond) extracted from test - Fix ZitadelValidator: validate() is now async (was blocking_read deadlock in async contexts) - Remove dead fields kid_map, jwks_uri - Make danger_accept_invalid_certs configurable - permissions: InterpolatedPermissions named struct instead of 4-tuple integration-test-callout: - Converted to lib+test crate: src/lib.rs exports test utilities - Tests now exercise the REAL AuthCalloutService (not inline handler) - Extracted MockOidcServer, NatsServer, CalloutContext into library - Replace yasna with rsa crate for DER parsing - Add Drop to NatsServer for container cleanup - Add module constants for all magic values - README updated with new architecture diagram	2026-04-29 00:45:05 -04:00
Jean-Gabriel Gill-Couture	48ec80ed66	docs: add integration test README with auth flow diagram	2026-04-28 23:21:23 -04:00
Jean-Gabriel Gill-Couture	f848d94808	refactor: remove dead operator-mode code from nats crates - Remove operator-mode files: account_manager, authorizer, service, config, main.rs, plan.md from callout crate - Remove operator/activation claims from nats-jwt (builder and claims) - Inline PermissionsConfig into permissions.rs (config.rs removed) - Remove harmony-nats-callout dep from integration test (unused) - Remove unused imports in algorithm.rs tests - Clean up callout Cargo.toml (remove bin, unused deps)	2026-04-28 23:20:37 -04:00
Jean-Gabriel Gill-Couture	65daa76658	feat: NATS auth callout e2e integration test - nats-jwt crate: JWT builder types for user claims, authorization request/response, account claims, algorithm encode/decode - harmony-nats-callout crate: Zitadel OIDC JWT validator, callout service scaffold, account manager (WIP) - integration-test-callout: end-to-end test validating the full auth callout flow — device connects with Zitadel JWT → callout validates JWT → returns per-device user JWT with scoped permissions → device can pub/sub on its own subjects only - Mock OIDC server for test (JWKS + openid-configuration) - Negative test: device A cannot subscribe to device B's subjects - Added UserClaimsBuilder::audience() for account-scoped user JWTs	2026-04-28 23:15:18 -04:00
Jean-Gabriel Gill-Couture	50debfd163	chore: Some code review comments inlined	2026-04-28 16:41:15 -04:00
stremblay	be4b9acaad	Merge pull request 'fix(opnsense): valid HAProxy config + From<&str> codegen cleanup' (#273) from fix/haproxy-issues into master Some checks failed Run Check Script / check (push) Successful in 2m8s Details Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m13s Details Reviewed-on: #273	2026-04-22 17:01:53 +00:00
Sylvain Tremblay	ead76e710f	fix(opnsense): lowercase match arms in generated From<&str> All checks were successful Run Check Script / check (pull_request) Successful in 2m6s Details Two regressions from `fc16e9f` that ./build/check.sh catches: 1. `opnsense-api`'s `test_haproxy_deser` example references `resp.haproxy` on the response wrapper. The regen auto-derived the field name as `op_nsenseha_proxy` from the struct name. Need to pass `--api-key haproxy` to keep the wrapper key stable. 2. For enums whose wire values aren't all-lowercase (e.g. `"SSLv3"`, `"CONNECT"`), the emitted `From<&str>` matched `s.to_lowercase()` against the original-case wire value, which clippy flags as unreachable ("match arm has differing case"). Lowercase the wire value in the emitted match arm so case-insensitive matching actually works; serialization still emits the original-case wire value because the serde module is unaffected. Regenerated `haproxy.rs` via `cargo run -p opnsense-codegen -- generate --xml ... --module-name haproxy --api-key haproxy`. `./build/check.sh` now passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:52:26 -04:00
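The shape of the generated conversion after this fix might look like the sketch below: the input is lowercased, so every match arm must itself be lowercase, while serialization keeps the original-case wire value. `SslVersion`, its variants, the `"TLSv1.2"` wire value, and `wire_value` are invented for illustration; only `"SSLv3"` and the `Other(String)` fallback come from the commit messages.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum SslVersion {
    Sslv3,
    Tlsv12,
    /// Forward-compat fallback: unknown inputs round-trip unchanged.
    Other(String),
}

impl From<&str> for SslVersion {
    fn from(s: &str) -> Self {
        // The scrutinee is lowercased, so an arm written as "SSLv3" could
        // never match; the arms carry the lowercased wire value instead.
        match s.to_lowercase().as_str() {
            "sslv3" => SslVersion::Sslv3,
            "tlsv1.2" => SslVersion::Tlsv12,
            _ => SslVersion::Other(s.to_string()),
        }
    }
}

impl From<String> for SslVersion {
    fn from(s: String) -> Self {
        SslVersion::from(s.as_str())
    }
}

impl SslVersion {
    /// Original-case value sent on the wire (stands in for the serde module).
    pub fn wire_value(&self) -> &str {
        match self {
            SslVersion::Sslv3 => "SSLv3",
            SslVersion::Tlsv12 => "TLSv1.2",
            SslVersion::Other(raw) => raw,
        }
    }
}
```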
Sylvain Tremblay	fc16e9fac9	refactor(opnsense): use From<&str> for wire-value conversions Some checks failed Run Check Script / check (pull_request) Failing after 54s Details Addresses review feedback on the previous HAProxy field-default fixes: the eight match blocks in `configure_service` that mapped loose strings ("get", "tcp", "roundrobin", ...) to generated OPNsense enum variants were poor Rust — they duplicated the wire-value knowledge that the codegen already has, and any new enum variant in OPNsense meant editing every call site by hand. - `opnsense-codegen/src/codegen.rs::generate_enum` now emits `impl From<&str>` and `impl From<String>` for every generated enum, right after the existing serde module. Lowercase-matches wire values; unknown inputs fall through to the `Other(String)` variant the codegen already emits for forward-compat round-tripping. - `opnsense-api/src/generated/haproxy.rs` regenerated — 153 enums, 306 new impl blocks. No hand edits; re-run via `cargo run -p opnsense-codegen -- generate --xml opnsense-codegen/vendor/plugins/net/haproxy/src/opnsense/mvc/app/models/OPNsense/HAProxy/HAProxy.xml --output-dir opnsense-api/src/generated --module-name haproxy`. - `opnsense-config/src/modules/load_balancer.rs::configure_service` replaces eight string-match blocks with one-liners: `HealthcheckType::from(hc.check_type.as_str())` etc. - Drive-by: fixed a pre-existing typo at `harmony/src/infra/opnsense/load_balancer.rs:185` and the matching reverse at `:149` — `SSL::SNI` was mapped to `"sslni"`, but the OPNsense wire value is `"sslsni"`. Before this refactor the typo silently hit `HealthcheckSsl::Other("sslni")`; the cleaner conversion made the bug obvious so it's fixed here rather than left for a follow-up. 
Verification: - `cargo check -p harmony -p opnsense-config -p opnsense-api` clean - `cargo test -p harmony --lib okd::load_balancer` 6/6 pass - `cargo test -p opnsense-codegen` 22/22 pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:31:35 -04:00
Sylvain Tremblay	a196268c1e	revert(okd): bind load balancer on 0.0.0.0 again All checks were successful Run Check Script / check (pull_request) Successful in 2m17s Details Reverting `5e72777`. The HAProxy startup failure that motivated the bind-to-FW-IP change was environment-specific on the sttest basement firewall: OPNsense's "HTTP → HTTPS redirect" service (lighttpd bound to `[::]:80`, dual-stack) was holding IPv4 port 80 via v4-mapped addresses — invisible in `sockstat -l4` but still enough to make `0.0.0.0:80` return EADDRINUSE to HAProxy. Disabling the HTTP redirect on that firewall resolves the conflict. Other OPNsense deployments already ship with the redirect off (or HAProxy on non-conflicting ports), so `0.0.0.0` remains the correct default. This reverts commit `5e72777`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:10:57 -04:00
Sylvain Tremblay	5a17bc229e	fix: formatting All checks were successful Run Check Script / check (pull_request) Successful in 2m19s Details	2026-04-22 11:29:33 -04:00
Sylvain Tremblay	5e72777c15	fix(okd): bind load balancer services on firewall IP, not 0.0.0.0 Some checks failed Run Check Script / check (pull_request) Failing after 56s Details Binding HAProxy on 0.0.0.0 collided with OPNsense's own listeners (HTTP→HTTPS redirect on :80, WebUI, etc.), preventing the HAProxy service from starting once the LoadBalancer score was applied. Use `topology.load_balancer.get_ip()` to bind each frontend on the firewall's LAN interface IP instead. The `LoadBalancer` capability was already in scope, so no new trait imports are needed. The previous `0.0.0.0` rationale (avoiding CARP VIP rebind races) is noted in a comment: HA CARP setups still need OPNsense's `net.inet.ip.nonlocal_bind` or HAProxy `transparent` bind — not addressed here. Test module: added an inline `DummyLoadBalancer` stub (mirrors the existing `DummyRouter` pattern) so `OKDLoadBalancerScore::new` no longer hits `DummyInfra::get_ip`'s `unimplemented!()` panic. Renamed `test_all_services_bind_on_unspecified_address` → `test_all_services_bind_on_firewall_ip`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 11:07:59 -04:00
Sylvain Tremblay	21035d2c56	fix(opnsense): set HAProxy healthcheck/server fields explicitly `configure_service` was relying on `..Default::default()` for most fields of the generated HAProxy structs. That leaked OPNsense's model defaults into the wire payload for fields Harmony never meant to default: - `http_host` → `localhost` (sent `Host: localhost` on every check) - `http_method` → `options` (sent OPTIONS instead of the declared method) - `http_version` → `http10` (wanted NONE) - `sslVerify` on real servers → `1` (broke self-signed backends) - Healthcheck `ssl` was never propagated, so SSL-required checks like kube-apiserver `/readyz` on 6443 stayed plain HTTP and never succeeded Set every field explicitly from `LbHealthCheck`/`LbServer`: map `http_method` through `HealthcheckHttpMethod`, pass `None` for `http_version` (serializes as `""` = NONE), clear `http_host` to an empty string, propagate `hc.ssl` through `HealthcheckSsl`, and pin `ssl`/`sslVerify` to `false` on the server struct so intent is declared at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 11:07:47 -04:00
Sylvain Tremblay	83d9af211a	fix(opnsense): distinguish unreachable API from missing HAProxy plugin `LoadBalancerConfig::is_installed` previously collapsed every error from the settings endpoint into `false`, so a timeout, DNS failure, or auth rejection all looked identical to "os-haproxy not installed" — the `LoadBalancer` score would then attempt to install the plugin on top of an unreachable firewall and fail in cascade further down the pipeline. Return `Result<bool, Error>` and treat only HTTP 404 (controller not found) as "not installed". Every other error is propagated so `ensure_initialized` fails the score immediately with a message pointing at the real problem. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 09:53:22 -04:00
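The error-handling change can be reduced to one `match`. In this sketch, `ApiError` and the idea of passing the probe result in as a parameter are assumptions made to keep the example free of HTTP dependencies; the rule itself (only HTTP 404 means "not installed", everything else propagates) is taken from the commit message.

```rust
/// Hypothetical error type for the settings-endpoint probe.
#[derive(Debug, PartialEq, Eq)]
pub enum ApiError {
    Http { status: u16 },
    Timeout,
    Dns(String),
}

/// Interprets the outcome of probing the HAProxy settings endpoint.
pub fn is_installed(probe: Result<(), ApiError>) -> Result<bool, ApiError> {
    match probe {
        // Endpoint answered: the plugin's controller exists.
        Ok(()) => Ok(true),
        // Controller not found: the plugin is genuinely absent.
        Err(ApiError::Http { status: 404 }) => Ok(false),
        // Timeouts, DNS failures, auth rejections: surface the real problem
        // instead of pretending the plugin is missing.
        Err(other) => Err(other),
    }
}
```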
stremblay	503f9eb357	Merge pull request 'feat: capture network intent at host discovery' (#267) from feat/discover-networking into master Some checks failed Run Check Script / check (push) Successful in 2m8s Details Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m12s Details Reviewed-on: #267	2026-04-21 16:20:48 +00:00
Sylvain Tremblay	84a083a012	refactor(discovery): use shared LaggProtocol for bond mode All checks were successful Run Check Script / check (pull_request) Successful in 2m6s Details Replace the Linux-specific BondMode enum with harmony_types' LaggProtocol, which is already used by the OPNsense LAGG score. "Capabilities are industry concepts, not tools" — the kernel mode numbers (BalanceRr/ActiveBackup/…) were the wrong abstraction; LaggProtocol's Lacp / Failover / LoadBalance / RoundRobin span Linux bonding and BSD lagg uniformly. LaggProtocol now derives Deserialize so NetworkConfig can round-trip through SQLite. Make SqliteInventoryRepository::get_role_mapping tolerate a network_config blob it cannot deserialize: log a warning and fall back to NetworkConfig::default() so the operator still sees the existing mapping prompt and can pick "Update" to overwrite the bad row. This self-heals DBs that were written with the old BondMode variant names and gives the repo real resilience for future NetworkConfig evolutions.	2026-04-21 12:05:43 -04:00
Sylvain Tremblay	adb05a0b91	chore(discovery): drop sqlite WAL sidecars, add blank line after prompts All checks were successful Run Check Script / check (pull_request) Successful in 2m4s Details - Switch SqliteInventoryRepository to DELETE journal mode with create_if_missing, so `.sqlite-wal` / `.sqlite-shm` files no longer appear next to the DB. Existing WAL-mode DBs are checkpointed and converted on next open. - Print a blank line after prompt_network_config returns so the save logs don't stomp on the last answered question.	2026-04-21 11:29:03 -04:00
Sylvain Tremblay	0556b2ea0d	feat(discovery): replace role mappings, sort NICs, polish host header - host_role_mapping now holds at most one row per host_id. SqliteInventoryRepository::save_role_mapping wraps a DELETE of any prior rows for the host and the INSERT of the new one in a single transaction, self-healing pre-existing duplicate rows along the way. - Before re-prompting for disk and networking, the discovery flow looks up the current role mapping via the new InventoryRepository::get_role_mapping(host_id) method. If one exists, the operator sees a summary (role, install disk, bond mode + interfaces, blacklist) and picks between "Update" and "Cancel"; cancelling skips the host entirely and continues the selection loop without touching the DB. New HostRoleMapping domain type carries the returned row back to the caller. - Network interfaces are sorted by name at the hwinfo-to-domain conversion step (both MDNS and CIDR flows), so f0 always appears before f1 in every downstream consumer — host summary, bond multi-select, blacklist multi-select. This also makes the byte-equality dedup in save() robust against the agent returning NICs in different sysfs-walk order across reboots. - PhysicalHost::summary() split into summary_parts_through_storage() + append_network_summary(), with a new public summary_short() variant that omits the NIC list. print_host_header() in the discovery prompts now uses summary_short() so the "Host: ..." banner fits on one line; full summaries still render in the node picker, logs, and Display impl. - Fix CPU summary rendering when the agent reports an empty model: single-CPU renders as "6c/6t", multi-CPU as "2x CPU (12c/24t)", no stray double-space in the pipe-separated summary. - Regenerate .sqlx offline cache for the new DELETE and SELECT queries.	2026-04-21 11:19:23 -04:00
Sylvain Tremblay	18fc87a597	feat(discovery): dedup identical host saves and harmonize prompt headers - SqliteInventoryRepository::save() now compares the incoming serde_json bytes against the latest stored `data` blob for this host_id. If byte-identical, the insert is skipped with an info log "Host '<id>' unchanged, skipping save". Genuine changes still produce a new version row, preserving the audit trail. Eliminates the unbounded row growth from repeated discovery (mDNS is continuous, CIDR scans often re-run). Addresses the long-standing FIXME in modules/inventory; the comment is now removed. - Reworded the caller-side log that fires after repo.save() from "Saved [new] host id X, summary: ..." to "Discovered host X, summary: ...". The old text claimed "Saved" even when the repo had actually skipped the insert, producing contradictory log lines on re-runs. - Harmonized every host-specific inquire prompt in the discovery flow behind a new print_host_header() helper: each prompt is now preceded by a blank line and a "Host: <summary>" banner, and the redundant host name inside the question text is stripped (disk prompt, bond confirm). The node-selection prompt is unchanged -- it picks which host, so there is no current host yet.	2026-04-21 10:56:46 -04:00
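The byte-equality dedup in `save()` can be illustrated with an in-memory stand-in for the SQLite repository. `InMemoryInventory` and `version_count` are hypothetical; the real implementation compares serialized `serde_json` bytes against the latest stored `data` blob for the host.

```rust
use std::collections::HashMap;

/// In-memory stand-in: one append-only list of data blobs per host_id.
#[derive(Default)]
pub struct InMemoryInventory {
    versions: HashMap<String, Vec<Vec<u8>>>,
}

impl InMemoryInventory {
    /// Appends a new version row unless the incoming bytes are identical to
    /// the latest stored blob. Returns true when a row was written.
    pub fn save(&mut self, host_id: &str, data: &[u8]) -> bool {
        let rows = self.versions.entry(host_id.to_string()).or_default();
        if rows.last().map(|latest| latest.as_slice()) == Some(data) {
            // Unchanged: skip the insert so repeated discovery does not grow
            // the table without bound.
            return false;
        }
        // Genuine change: keep the old row and add a new one (audit trail).
        rows.push(data.to_vec());
        true
    }

    pub fn version_count(&self, host_id: &str) -> usize {
        self.versions.get(host_id).map_or(0, |rows| rows.len())
    }
}
```

Because the comparison is on raw bytes, it only works if serialization is deterministic, which is why a stable NIC ordering matters to this check.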
Sylvain Tremblay	bdba4dda27	feat(discovery): tighten host summary and readability of prompts - PhysicalHost::summary() becomes terser and more informative: - Storage: "400 GB [8 GB, 477 GB]" (was "400 GB Storage (2 Disks [8 GB, 477 GB])"). Single-disk collapses to just the total. - Network: list every NIC as "[ip, mac]" with a count prefix (e.g. "3 NICs: [192.168.40.10, 98:fa:9b:03:17:6f], [00:e0:ed:7a:ec:4d], ..."). Single-NIC form drops the count and "s": "NIC: [ip, mac]". NICs without an IPv4 render as "[mac]". - Promote the inventory agent's Chipset { vendor, name } into a "system-product-name" label during host conversion (both MDNS and CIDR flows), so summary()'s first field shows "LENOVO 3136" instead of falling back to the HostCategory string ("Server"). Extracted into build_discovered_host_labels() to keep the two conversion sites in sync. When the chipset is blank, the old category fallback still applies. - Print a blank line before every interactive inquire prompt in the discovery flow (role pick, disk pick, bond confirm/multi-select/mode, blacklist confirm/multi-select) so prompts stand out from the preceding log output on the terminal.	2026-04-21 10:35:48 -04:00
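The network part of the new summary format can be sketched as below. `Nic`, `network_summary`, and the empty-string result for a host with no NICs are assumptions; the three rendering rules (count prefix for several NICs, singular `NIC:` for one, `[mac]` when there is no IPv4) follow the commit message.

```rust
pub struct Nic {
    pub mac: String,
    pub ipv4: Option<String>,
}

/// Renders the NIC list in the terse summary style.
pub fn network_summary(nics: &[Nic]) -> String {
    let rendered: Vec<String> = nics
        .iter()
        .map(|nic| match &nic.ipv4 {
            Some(ip) => format!("[{ip}, {}]", nic.mac),
            // NICs without an IPv4 address show only the MAC.
            None => format!("[{}]", nic.mac),
        })
        .collect();
    match rendered.len() {
        // Assumption: a host without NICs contributes nothing.
        0 => String::new(),
        // Single-NIC form drops the count and the plural "s".
        1 => format!("NIC: {}", rendered[0]),
        n => format!("{n} NICs: {}", rendered.join(", ")),
    }
}
```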
Sylvain Tremblay	bf4f300383	feat(discovery): capture bond, blacklist and bond-mode intent per host Extend DiscoverHostForRoleScore with three new interactive prompts after the installation-disk selection: - "Configure a network bond?" (only when host has >= 2 NICs), followed by a multi-select of bond members (min 2) and a bond-mode picker (LACP / active-backup / balance-rr / balance-xor / broadcast / balance-tlb / balance-alb). - "Blacklist any remaining interface?", with candidates limited to NICs not already claimed by the bond. The answers are persisted as a JSON-encoded NetworkConfig on a new host_role_mapping.network_config column. HostConfig now exposes network_config alongside installation_device so downstream scores can honor the user's intent. Also adds a new harmony_host_discovery example that discovers a single host on 192.168.40.0/24:25000.	2026-04-21 10:13:58 -04:00
stremblay	52db82865d	Merge pull request 'feat(monitoring): Datadog 15-key-metrics dashboard + Ceph "what's wrong" drilldown' (#266) from feat/datadog-k8s-metrics into master Some checks failed Run Check Script / check (push) Successful in 2m5s Details Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m4s Details Reviewed-on: #266	2026-04-21 11:21:28 +00:00
Sylvain Tremblay	349c2a1358	feat: improve ceph dashboard All checks were successful Run Check Script / check (pull_request) Successful in 2m17s Details	2026-04-20 15:58:52 -04:00
Sylvain Tremblay	c2718e843b	feat: improve ceph dashboard - list alerts and WHY its NOT green All checks were successful Run Check Script / check (pull_request) Successful in 2m6s Details	2026-04-20 15:47:12 -04:00
Sylvain Tremblay	391c44b369	feat: add the datadog-15-k8s-metrics dashboard	2026-04-20 15:29:54 -04:00
stremblay	bae162a3e4	Merge pull request 'feat(monitoring): Ceph alerts integrated with OKD's native alerting stack' (#265) from feat/ceph-alerts into master Some checks failed Run Check Script / check (push) Successful in 2m14s Details Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m11s Details Reviewed-on: #265	2026-04-20 18:14:44 +00:00
Sylvain Tremblay	8acd9de275	feat: score to create ceph alerts in the okd default alerting stack All checks were successful Run Check Script / check (pull_request) Successful in 2m10s Details	2026-04-20 13:52:36 -04:00
johnride	ef418f2f96	Merge pull request 'feat: Disable ipv4 address conflict detection score. This is useful when setting up bonds as the wrong mac may get a dhcp offer and then the system will perceive it as a conflict when it sets up the bond correctly' (#263) from feat/disableDadScore into master Some checks failed Run Check Script / check (push) Successful in 2m11s Details Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m12s Details Reviewed-on: #263	2026-04-20 17:44:29 +00:00
Sylvain Tremblay	126390bb63	feat: split storage dashboard in two : ceph + persistent storage	2026-04-20 13:09:34 -04:00
Sylvain Tremblay	7265d8a4f3	fix: fix ceph dashboard for root volumes not populated	2026-04-20 12:01:25 -04:00
Jean-Gabriel Gill-Couture	54ef3f70bd	feat: Refactor dad score into reusable node file score using machine config All checks were successful Run Check Script / check (pull_request) Successful in 2m17s Details	2026-04-19 05:01:09 -04:00
Jean-Gabriel Gill-Couture	6267c2757f	feat: Disable ipv4 address conflict detection score. This is useful when setting up bonds as the wrong mac may get a dhcp offer and then the system will perceive it as a conflict when it sets up the bond correctly All checks were successful Run Check Script / check (pull_request) Successful in 2m37s Details	2026-04-15 15:48:22 -04:00
