feat/nats-auth-callout-e2e #279

Merged
johnride merged 64 commits from feat/nats-auth-callout-e2e into feat/iot-walking-skeleton 2026-05-05 13:46:15 +00:00

64 Commits

Author SHA1 Message Date
29896bfeab fix(zitadel,operator): user-grant search endpoint + operator keyfile mode
Some checks failed
Run Check Script / check (pull_request) Failing after 2m15s
Two bugs uncovered while running the full e2e walk end to end:

1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search
   which Zitadel rejects with 405 Method Not Allowed (the original
   author's note in the comment hinted at this). The cache previously
   masked it: first apply created the grant + cached the id; second
   apply hit the cache and skipped the broken search. The live-query
   refactor (f4d6fb94) removed the cache short-circuit, surfacing
   the bug as "Create user grant failed: User grant already exists"
   on every re-apply.

   Fix: switch to the collection endpoint
   /management/v1/users/grants/_search with a userIdQuery filter,
   matching the Zitadel API that's actually wired up. Now returns
   the existing grant on re-apply and the create_user_grant fallback
   is correctly skipped.

2. Operator keyfile mounted as 0o400 owned by root. The operator pod
   runs as non-root (image USER directive — no fixed runAsUser
   because we want SCC compatibility). Result: operator boots,
   tries to load the JSON keyfile from the Secret volume, hits
   EACCES, fails the credential factory, retries forever.

   Fix: mode 0o444. World-read inside the pod is fine — single
   container, no other consumers, the Secret namespace is locked
   down, and the file never escapes pod-fs. The proper fsGroup-based
   alternative requires pinning a UID/GID, which conflicts with our
   SCC-friendly choice of leaving runAsUser unset.
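
A minimal sketch of why the 0o400 mount fails and 0o444 works, assuming the classic owner/group/other permission check with no capabilities or ACLs in play. `can_read` and the example UIDs/GIDs are hypothetical and only illustrate the reasoning above; they are not code from this PR.

```rust
/// Simplified POSIX read check for an unprivileged process.
/// Assumption: no CAP_DAC_OVERRIDE, no ACLs, no SELinux denial.
fn can_read(mode: u32, file_uid: u32, file_gid: u32, proc_uid: u32, proc_gids: &[u32]) -> bool {
    if proc_uid == file_uid {
        // Owner class: only the owner-read bit counts.
        mode & 0o400 != 0
    } else if proc_gids.contains(&file_gid) {
        // Group class: only the group-read bit counts.
        mode & 0o040 != 0
    } else {
        // Everyone else: the world-read bit.
        mode & 0o004 != 0
    }
}
```

With a root-owned file and a hypothetical non-root UID of 1000, `can_read(0o400, 0, 0, 1000, &[1000])` is false (the EACCES case), while `can_read(0o444, 0, 0, 1000, &[1000])` is true. The fsGroup alternative corresponds to `can_read(0o440, 0, 2000, 1000, &[2000])`, which only works once a GID is pinned.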

Also fixes a stale `git rm` from commit 4194baac
(harmony-fleet-auth extraction) — the agent's local credentials.rs
was deleted from disk but never staged.

Verified end to end:
  * STACK READY in 2 min on warm cluster
  * Operator pod: "minted fresh Zitadel access token", "NATS connected",
    "starting Deployment controller", "watching device-info KV"
  * 2 Device CRs auto-created with full label set
  * `kubectl apply -f` of a Deployment CR with
    targetSelector.matchLabels: { group: group-a } produced:
      - status.aggregate { matched=1, succeeded=1, failed=0 }
      - HTTP 200 from nginx on vm-device-00:8080
      - connection refused from vm-device-01:8080 (correctly excluded)
2026-05-05 06:55:24 -04:00
34cfa0423b docs(podman): FIXME diagnosis for the reconcile-loop bug
The agent's periodic reconcile destroys and recreates any service
whose ContainerSpec has env or volumes, every 30s tick. Root cause:
matches_spec returns false unconditionally for those fields because
podman's list endpoint doesn't surface them; the original author
chose to declare "any spec with state is drifted" as a fail-safe.
That fail-safe weaponizes the polling reconciler into a loop.

Tags the offending line with a multi-paragraph FIXME explaining
the symptom, the root cause, the proposed fix (containers.inspect
+ structural compare + an integration test), and the demo-time
workaround (keep demo specs trivial — the hello-web nginx demo
already is).
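
A rough sketch of the difference between the fail-safe and the proposed structural compare. The `ContainerSpec` here is a trimmed, hypothetical stand-in (the real struct has more fields), and the function names other than the idea of `matches_spec` are assumptions made for illustration.

```rust
/// Hypothetical, trimmed-down spec: the real ContainerSpec carries more fields.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
struct ContainerSpec {
    image: String,
    env: Vec<(String, String)>,
    volumes: Vec<(String, String)>, // (host_path, container_path)
}

/// Fail-safe variant: the list endpoint cannot show env or volumes,
/// so any spec that carries them is declared drifted on every tick.
fn matches_spec_failsafe(desired: &ContainerSpec, listed_image: &str) -> bool {
    if !desired.env.is_empty() || !desired.volumes.is_empty() {
        return false;
    }
    desired.image == listed_image
}

/// Structural variant: compare against a full inspect result.
/// Desired env and mounts must be present; extra env injected by the
/// image (PATH and friends) is tolerated.
fn matches_spec_structural(desired: &ContainerSpec, inspected: &ContainerSpec) -> bool {
    desired.image == inspected.image
        && desired.env.iter().all(|kv| inspected.env.contains(kv))
        && desired.volumes.iter().all(|m| inspected.volumes.contains(m))
}
```

Under the fail-safe, a spec with a single env var never matches, which is what turns a 30s poll into a recreate loop. The structural variant only reports drift when a desired value is actually missing or different.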

Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's
known-risks section so it's visible at planning time.

Out of scope for tonight; in scope for delivery alongside the
upcoming health-check support on ContainerSpec.
2026-05-05 01:59:51 -04:00
8a609c5342 feat(operator): NATS auth via shared harmony-fleet-auth + e2e wiring
The operator was opening a bare async_nats::connect with no auth,
which would fail closed against a callout-protected NATS. Wires it
through the same JWT-bearer flow the agent uses, sharing the
recently-extracted harmony-fleet-auth crate.

Operator side
-------------
* main.rs: read FLEET_OPERATOR_CREDENTIALS_TOML (TOML snippet, same
  shape as the agent's [credentials] block — single
  CredentialsSection struct, just a different byte source). Empty
  string bypasses (callout-less dev only, with a loud warning).
* chart.rs: ChartOptions gains an optional OperatorCredentials field.
  When set, build_chart's Deployment mounts a Secret as both
  envFrom (TOML payload → FLEET_OPERATOR_CREDENTIALS_TOML) and a
  volume mount for the JSON keyfile at the configured key_path
  (defaults to /etc/fleet-operator/zitadel-key.json). On-disk helm
  chart still emits credentials: None — those are environment-
  specific and out of scope for a redistributable chart.
* Public manifest builders (build_service_account, build_cluster_role,
  build_cluster_role_binding, build_operator_deployment,
  operator_secret) so the e2e bring-up can apply each resource via
  K8sResourceScore without re-implementing the manifests.
* mod chart now lives in lib.rs so external consumers (the e2e
  bring-up) can reach into it.

E2e bring-up
------------
* Bring-up gains a separate `fleet-operator` machine user with the
  fleet-admin role grant — distinct from the manual-admin
  `fleet-ops` user so audit logs can tell automated operator
  actions apart from human ones.
* New steps 8/10 (build + sideload operator image) and 9/10 (apply
  CRDs + RBAC + Secret + Deployment + wait for Ready). Devices step
  becomes 10/10.
* Reuses harmony_fleet_operator's manifest builders + operator_secret
  via K8sResourceScore — no duplicated YAML, no shell-out.

Tests
-----
* All existing tests pass (harmony-fleet-auth: 18, harmony-fleet-agent:
  7, harmony-fleet-operator: 2). E2e walking-skeleton is exercised
  by the next phase's clean rerun.
2026-05-05 01:58:14 -04:00
84a25dbb07 test(fleet-auth): cover assertion claims, scope, token URL, cache, keyfile
Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The
new tests lock the corners we burned cycles on while debugging
the live system:

  * cache freshness boundary (within-leeway, outside-leeway,
    no-cache, non-zitadel variant)
  * assertion claim shape (iss/sub/aud/exp/iat) and the 60-second
    lifetime constant Zitadel enforces server-side
  * scope string content (plural-projects-roles + singular-project-id
    URN + openid base)
  * token URL strips trailing slashes (the //oauth/v2/token 404
    waiting to bite the next operator)
  * MachineKeyFile JSON parsing under Zitadel's wire shape

Refactor: build_assertion now delegates to build_assertion_claims
+ build_assertion_header (pure, no signing). Lets the claim/header
shape be unit-tested without an RSA private-key fixture; the
sign-and-decode end-to-end is still covered by the e2e harness.

No new deps. wiremock not needed — every meaningful assertion is
on pure logic.
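
Two of the corners listed above are small enough to sketch as pure functions. This is an illustration under assumptions, not the crate's code: `build_token_url` is named in the commit log, but its signature here is a guess, and `is_fresh` is a hypothetical helper whose strict-greater-than boundary is also an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Join issuer and token path without producing `//oauth/v2/token`.
fn build_token_url(issuer: &str) -> String {
    format!("{}/oauth/v2/token", issuer.trim_end_matches('/'))
}

/// A cached token is fresh only while more than `leeway` remains
/// before expiry. An already-expired token is never fresh.
fn is_fresh(expires_at: SystemTime, now: SystemTime, leeway: Duration) -> bool {
    match expires_at.duration_since(now) {
        Ok(remaining) => remaining > leeway,
        Err(_) => false, // expires_at is in the past
    }
}
```

Because both functions take all inputs as arguments (including `now`), the boundary cases can be tested without a clock, a network, or an RSA fixture.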
2026-05-05 01:50:28 -04:00
4194baacad refactor(fleet): extract NATS credential plumbing into harmony-fleet-auth
The agent's `credentials.rs` + `CredentialsSection` enum graduate
into a workspace crate (`fleet/harmony-fleet-auth/`) so the
operator can consume the same code path. Single struct, single
factory, single auth-callback wiring. The only thing that varies
between consumers is where the `[credentials]` TOML bytes come
from — the agent reads them from a config file on disk, the
operator (next commit) will read them from an env var.

Public surface of the new crate:
  CredentialsSection                    — the deserializable
  CredentialSource / NatsCredential     — the runtime objects
  MachineKeyFile / CachedToken          — helper types
  credential_source_from_config         — factory
  connect_options_with_credentials      — async-nats wiring

Agent consumes via `pub use harmony_fleet_auth::CredentialsSection`
in its own `config.rs` so existing call sites keep working.
Existing 5 tests in the new crate + 7 in the agent all green.

This commit is structurally a move; behavior unchanged. Operator
wiring, additional unit tests, and the JWT-mint refactor (split
build_assertion / build_scope / build_token_url for testability)
follow in the next commits.
2026-05-05 01:48:42 -04:00
612d934ad4 docs(fleet): manual JWT-bearer mint + NATS write recipe
Working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.

Cross-linked from fleet-zitadel-faq.md.
2026-05-05 01:43:36 -04:00
f4d6fb9431 fix(zitadel): always live-query Zitadel for IDs instead of trusting cache
ZitadelClientConfig was used as both a key store (machine keys —
which Zitadel cannot return after creation, so caching is required)
AND a lookup cache (project_id, machine_user_ids, user_grants).
The latter introduced a silent drift class:

- ZitadelSetupScore writes the cache incrementally as it creates
  each resource.
- If Zitadel is reset between runs (Postgres recreated, IDs
  reissued), the cache still holds the old IDs.
- ensure_project / ensure_app / ensure_machine_user / user_grant
  short-circuited on cache hit and never consulted Zitadel — so
  downstream Scores got the stale ID.
- The legacy `project_id` field was further `is_none`-guarded so it
  preserved the very first id ever seen, surviving any number of
  Zitadel resets.

Net effect in the wild: the deployed callout's `OIDC_AUDIENCE`
silently pointed at a project that no longer existed, while
agents kept working only because their TOML config carried the
matching stale id. A manual mint script reading `project_id` from
the cache would produce tokens that pass signature validation but
fail the audience check — exactly the symptom that surfaced this
bug.

Fix: drop the cache-hit short-circuit in every ensure_* path and
always live-query. The cache now only holds machine key material
(its only legitimate role) and a record of last-known IDs that
get refreshed on every apply. Cost: ~1 extra HTTP per project /
app / user / grant per Score apply — these are not hot paths.

Also: stop is_none-guarding `config.project_id` so the legacy
field tracks live state for older single-project consumers.
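
The drift class can be reproduced in a few lines. The sketch below is hypothetical: the real ensure_* functions talk to Zitadel over HTTP, whereas here the lookup and create calls are passed in as closures so the two behaviours can be compared side by side.

```rust
/// Old behaviour: a cache hit short-circuits and the server is never asked.
fn ensure_project_cached(
    cache: &mut Option<String>,
    find: impl FnOnce() -> Option<String>,
    create: impl FnOnce() -> String,
) -> String {
    if let Some(id) = cache.as_ref() {
        return id.clone(); // stale after a Zitadel reset
    }
    let id = find().unwrap_or_else(create);
    *cache = Some(id.clone());
    id
}

/// New behaviour: always live-query, then refresh the last-known record.
fn ensure_project_live(
    cache: &mut Option<String>,
    find: impl FnOnce() -> Option<String>,
    create: impl FnOnce() -> String,
) -> String {
    let id = find().unwrap_or_else(create);
    *cache = Some(id.clone());
    id
}
```

Starting from a cache that holds `"old"` while the server now reports `"new"`, the cached variant keeps returning `"old"`; the live variant returns `"new"` and overwrites the cached record.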
2026-05-05 01:11:18 -04:00
3069f5b9ae Merge remote-tracking branch 'origin' into feat/nats-auth-callout-e2e
Some checks failed
Run Check Script / check (pull_request) Failing after -44h57m27s
2026-05-04 15:38:52 -04:00
c6284c09bc feat(fleet-agent): emit state pulse on direct device-state.<id> subject
Some checks failed
Run Check Script / check (pull_request) Failing after -44h56m12s
The agent's data plane was JetStream-KV-only, so live observers
that don't want to consume the JS stream had no signal to subscribe
to. The walking-skeleton e2e admin test was failing as a result —
admin subscribes to `device-state.>` (the per-device direct
subject) and saw nothing in 30s.

This commit adds a small core-NATS publish on `device-state.<id>`
alongside the existing KV writes:

- `FleetPublisher::publish_state_pulse()` emits a tiny
  `{device_id, kind: "heartbeat", at}` payload on
  `device-state.<device_id>`, called from the heartbeat loop so
  observers see traffic on the same 30s cadence as the KV
  heartbeat write — but on a non-JetStream subject anyone can sub
  to.
- `write_deployment_state()` now fans out the same payload it puts
  in the KV bucket on the direct subject, so live admin tooling
  picks up reconcile transitions immediately without watching the
  KV stream.
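
For orientation, the subject and payload shape described above could look roughly like this. The helper names are hypothetical, the `at` field is assumed to be an RFC 3339 string (the commit does not say), and the JSON is built by hand without escaping purely to keep the sketch dependency-free.

```rust
/// Direct (non-JetStream) subject for one device.
fn state_subject(device_id: &str) -> String {
    format!("device-state.{}", device_id)
}

/// Heartbeat pulse payload. Assumption: inputs contain no characters
/// that need JSON escaping; real code would use a serializer.
fn heartbeat_payload(device_id: &str, at: &str) -> String {
    format!(
        r#"{{"device_id":"{}","kind":"heartbeat","at":"{}"}}"#,
        device_id, at
    )
}
```

An admin subscribed to `device-state.>` would then receive one such message per device per heartbeat tick.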

Also threads `device_id_prefix_strip = "device-"` through the
fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore
construction (parallel to fleet_auth_callout's `bring_up_stack`),
and was missing the prefix-strip line, so the deployed callout was
interpolating permissions against `device-vm-device-00` instead of
the bare device id the agent uses.

Locks the regression with a unit test
(`device_id_prefix_strip_lands_as_env_value`) on the deployment
manifest builder.

Verified end-to-end in the VM rehearsal:
  test both_devices_heartbeat_within_60s ... ok
  test admin_jwt_reads_any_device_subject ... ok
2026-05-04 09:36:26 -04:00
54308fd7a4 chore: formatting
Some checks failed
Run Check Script / check (pull_request) Failing after -44h56m9s
2026-05-04 09:03:35 -04:00
d4fd4859ec fix(callout): align device permissions with KV key formats and machine-user prefix
Some checks failed
Run Check Script / check (pull_request) Failing after -44h57m23s
Two bugs surfaced when the agent went live against NATS JetStream KV
in the VM-based e2e rehearsal:

1. The default `device` role only allowed flat `device-state.<id>` /
   `device-commands.<id>` subjects. The agent's actual data plane is
   JetStream KV, which puts every operation on `$KV.<bucket>.<key>`
   subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`.
   With the old role config, the very first KV publish died with
   `Permissions Violation for Publish to "$JS.API.INFO"`.

   The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four
   per-device data subjects derived from
   harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>,
   heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct
   `device-state.<id>` / `device-commands.<id>` subjects are kept so
   non-JetStream callers of NatsAuthCalloutScore still work.

   A new unit test (`device_role_covers_reconciler_contract_kv_subjects`)
   imports the contract crate as a dev-dep and asserts each contract-
   produced subject is matched, plus that cross-device subjects are
   *not* matched. This locks the role config to the contract surface so
   future renames break the test before they break prod.

2. Zitadel's `client_id` claim for a machine user equals the userName
   verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the
   user as `device-{device_id}`, so the JWT carries
   `device-vm-device-00` while the agent's KV keys use the bare
   `vm-device-00`. The callout was interpolating the prefixed string
   into permissions, producing rules that never matched what the
   agent actually publishes.

   Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`,
   defaults empty so existing deployments are unaffected). When set,
   the validator strips the prefix from the extracted claim before
   permission interpolation. The fleet_auth_callout example wires it
   to `device-` so the e2e harness stays end-to-end correct without
   reaching into either naming convention.
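
Both fixes come down to string handling, so they can be sketched without NATS at all. This is an illustration, not the callout's code: `strip_device_prefix` and `subject_matches` are hypothetical names, the matcher assumes `>` only appears as the last token of a pattern, and the `$KV.device-info.info.<id>` template is an assumed combination of the bucket and key formats mentioned in this PR.

```rust
/// Remove the machine-user prefix from the `client_id` claim.
/// An empty prefix leaves the claim untouched.
fn strip_device_prefix<'a>(client_id: &'a str, prefix: &str) -> &'a str {
    client_id.strip_prefix(prefix).unwrap_or(client_id)
}

/// NATS-style subject match: `*` matches exactly one token,
/// a trailing `>` matches one or more remaining tokens.
fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut p = pattern.split('.');
    let mut s = subject.split('.');
    loop {
        match (p.next(), s.next()) {
            (Some(">"), Some(_)) => return true,
            (Some("*"), Some(_)) => continue,
            (Some(a), Some(b)) if a == b => continue,
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

With the prefix stripped, a rule interpolated as `$KV.device-info.info.vm-device-00` matches what the agent publishes and rejects `vm-device-01`. Without stripping, the rule would be built from `device-vm-device-00` and match nothing.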

Verified end-to-end: both VM agents now publish DeviceInfo /
heartbeat through JetStream KV with no permission errors and zero
service restarts since the rollout.
2026-05-03 17:49:48 -04:00
050d4697d2 chore: cargo fmt setup_score.rs 2026-05-03 17:49:22 -04:00
7dd5f1504f chore: cargo fmt sweep across modified files
No behavior changes; only re-flowing existing expressions.
2026-05-03 17:49:15 -04:00
6607fe7494 fix(e2e-demo): point agent_binary default at the real cargo target name
The cargo bin target is `harmony-fleet-agent`, not `fleet-agent` —
the latter never existed under target/release. Smoke-a4 happened to
work because callers passed --agent-binary explicitly; the harness
defaults didn't.
2026-05-03 17:49:09 -04:00
a4b9e7ac9f fix(fleet-agent): request projects:roles scope so role claim is emitted
Zitadel only includes the project-roles block in an access token when
the JWT-bearer request asks for it via the
`urn:zitadel:iam:org:projects:roles` scope (PLURAL "projects"). Without
it the agent's token has a valid signature/audience but no roles, so
the NATS auth callout rejects with "no authorized role in token" even
though the machine user has a "device" grant.

Discovered while running the VM-based e2e rehearsal: agents could mint
a token, connect to NATS, then immediately fail authorization. The
plural-projects vs. singular-project distinction is a Zitadel
convention; both scopes are required, and the comment now spells out
what each one does.
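
A sketch of the resulting scope string, for reference. `build_scope` is named elsewhere in this PR, but the signature is a guess, and the exact form of the singular project-id URN (`urn:zitadel:iam:org:project:id:<id>:aud`) is an assumption based on Zitadel's reserved scopes rather than something this commit spells out.

```rust
/// Space-separated scopes for the JWT-bearer token request.
/// - `openid`: base scope
/// - plural `projects:roles`: asks Zitadel to emit the role claims
/// - singular `project:id:<id>:aud`: adds the project to the audience (assumed form)
fn build_scope(project_id: &str) -> String {
    format!(
        "openid urn:zitadel:iam:org:projects:roles urn:zitadel:iam:org:project:id:{}:aud",
        project_id
    )
}
```

Dropping the plural scope from this string reproduces the symptom described above: a token that validates but carries no roles.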
2026-05-03 17:49:04 -04:00
49f9834eb2 feat(e2e-demo): apply FleetDeviceSetupScore over SSH per VM
Wires the previously-built FleetDeviceSetupScore through to a
LinuxHostTopology against each pre-provisioned VM. Mirrors the
fleet_rpi_setup pattern but synthesizes inline so the harness drives
N VMs in sequence without re-deriving the CLI plumbing.

Each VM gets:
- An /etc/hosts entry mapping `sso.fleet.local` → libvirt host IP
  via the new HostsEntry support, so the in-VM agent's HTTP client
  to Zitadel can resolve the issuer.
- The per-device Zitadel machine key dropped at
  /etc/fleet-agent/zitadel-key.json.
- Agent TOML with `type = "zitadel-jwt"` pointing at the keyfile.
- Agent service started under systemd.

SSH user assumed `fleet-admin` (matches what fleet_vm_setup +
smoke-a4 cloud-init create). Private key from the harmony fleet
keypair (ensure_fleet_ssh_keypair).

After this commit, `cargo run -p example-fleet-e2e-demo` is the
single command that turns a fresh k3d + 2 booted VMs into a
fully-converged stack: Zitadel + NATS callout + 2 agents speaking
JWT-bearer to NATS. Tomorrow morning: prove it actually does
that on a clean machine.
2026-05-03 17:08:52 -04:00
1d453dd9aa feat(e2e-demo): VM-based rehearsal harness + /etc/hosts injection
Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's
existing pieces (Zitadel + auth callout deploy) with per-device
machine-user provisioning (one ZitadelSetupScore call per VM) and
FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness
expects pre-provisioned libvirt VMs (one per device) reachable via
`FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via
ProvisionVmScore is a follow-up — keeping the harness observable in
pieces during the cold-start debugging tomorrow.

Constituent helpers in `fleet_auth_callout::lib.rs` flipped from
private to `pub` (deploy_zitadel, wait_for_zitadel_ready,
ensure_issuer_seed, build_and_load_callout_image, etc.) so the new
harness composes them rather than re-implementing.

`bring_up_full_stack`:
1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d).
2. Deploy Zitadel + Postgres.
3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the
   chart-provisioned `iam-admin-pat` secret. (Last step is new and
   load-bearing — without it ZitadelSetupScore races the chart's
   setup job and fails on first cold-run.)
4. ZitadelSetupScore for project + API app + roles + admin
   machine-user (admin gets fleet-admin role grant).
5. Issuer NKey from a persisted secret + NATS deploy with
   auth_callout block + callout pod.
6. For each device i: per-device ZitadelSetupScore (machine-user
   with `device` role grant), pull the JSON keyfile from cache,
   render the agent's TOML with the keyfile path. (FleetDeviceSetupScore
   invocation is wired structurally; the SSH-and-apply step is
   gated behind the VM provisioning follow-up.)

`HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so
VMs on a libvirt NAT can resolve `sso.fleet.local` to the host
gateway. Managed-block markers in /etc/hosts make the merge
idempotent across re-runs and removable when entries are dropped
from the score. Four new unit tests cover the merge invariants
(insert, replace, strip, byte-stable).
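
The managed-block idea can be sketched as follows. This is not the PR's `merge_hosts_file`: the marker strings and the signature are hypothetical, and the sketch always re-appends the block at the end of the file, which the real implementation may or may not do.

```rust
// Hypothetical marker lines; the real markers are not shown in this PR.
const BEGIN: &str = "# BEGIN fleet-managed";
const END: &str = "# END fleet-managed";

/// Drop any previous managed block, then append a fresh one
/// (or nothing at all when `entries` is empty).
fn merge_hosts_file(existing: &str, entries: &[(&str, &str)]) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            l if l == BEGIN => in_block = true,
            l if l == END => in_block = false,
            _ if in_block => {} // old managed entry: discard
            _ => out.push(line.to_string()),
        }
    }
    if !entries.is_empty() {
        out.push(BEGIN.to_string());
        for (ip, host) in entries {
            out.push(format!("{} {}", ip, host));
        }
        out.push(END.to_string());
    }
    let mut text = out.join("\n");
    if !text.is_empty() {
        text.push('\n');
    }
    text
}
```

Running the merge twice with the same entries yields identical bytes, changing an IP replaces the old line instead of appending a second one, and passing no entries removes the block while leaving unmanaged lines alone.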

Test skeleton in `tests/e2e_walking_skeleton.rs`:
- `both_devices_heartbeat_within_60s` — implemented; reads from
  device-info KV via admin token.
- `admin_jwt_reads_any_device_subject` — implemented; subscribes
  to `device-state.>` as admin.
- `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending
  per-device-key plumbing through E2eHandles.
- `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending
  the NATS-pod-restart driver.

The two `#[ignore]`d tests cover the load-bearing reconnect and
isolation invariants. Wiring them is the morning-of-rehearsal
priority since those are the customer-facing claims.

Out of scope of this commit (called out in the roadmap doc):
- ProvisionVmScore integration (today the operator runs fleet_vm_setup
  out-of-band).
- Operator install via Helm (smoke-a4 runs operator host-side; this
  harness inherits that pattern).
- Full SSH-based agent install via FleetDeviceSetupScore — Score
  built, invocation gated.
2026-05-03 17:07:40 -04:00
fdcc7040dd docs(fleet): chapter 6 — VM-based customer demo rehearsal plan
Adds ROADMAP/fleet_platform/v0_demo_e2e.md and threads it from
v0_1_plan.md. The VM rehearsal extends smoke-a4 (already-green k3d
+ libvirt VM + agent + apply CR + reconcile loop) with Zitadel +
auth callout + agent JWT auth. Two devices + one admin, real
cargo tests sharing a OnceCell-bringup.

Plan calls out:
- The 7 tests, including the load-bearing
  `agent_recovers_from_nats_pod_restart` (asserts the auto-reconnect
  + auth-callback re-mint path under realistic disturbance).
- Five known risks / debugging traps to expect on first cold-start
  (iam-admin-pat secret timing, /etc/hosts injection, k3d port
  collisions, etc.).
- Success criteria for the rehearsal day: cold cargo run greens in
  <20 min, all 7 tests green on a clean machine, the NATS-restart
  test reliably greens 5 runs in a row.
- Anything below the success criteria → reframe the customer call
  to "architecture walkthrough + local k3d demo + pilot in 1-2
  weeks." Avoids burning the relationship to keep a deadline.

Once VM rehearsal is green the residual OKD deltas are configuration
(Route annotations, image registry, real DNS, cert) — no new code.
2026-05-03 16:59:43 -04:00
e3e6d33dc8 fix(fleet_vm_setup): adopt FleetDeviceAuth::TomlShared shape
The VM smoke harness still uses shared NATS creds for v0 (no Zitadel
JWT path through libvirt — the customer-facing Pi flow has it via
fleet_rpi_setup --bootstrap-token). This commit rewrites the
FleetDeviceSetupConfig literal against the new `auth: FleetDeviceAuth` field.
2026-05-03 15:44:18 -04:00
4053ac52de docs(fleet): demo runbook (operator + developer flow, single page)
Hands-on walkthrough for the 48-hour customer demo:

- Operator: build/push the callout image → fleet-staging-deploy →
  capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster,
  cross-compile the agent for aarch64, run fleet-rpi-setup once
  per device with --bootstrap-token. Each Pi's agent connects to
  NATS over WSS using the JWT-bearer token minted from its
  per-device keyfile.
- Deploy a container to a labeled subset via
  example_harmony_apply_deployment with --env / --volume / --restart
  flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth
  callout's logs.

Also captures what's deliberately NOT in the demo (compose
auto-translation, UI, Tailscale backdoor, device-join-request
flow, OpenBao, K8s OIDC) so the customer call has clean expectation-
setting.

The runbook is the closing piece of the 48h-demo work plan;
sequenced after the eight feat / refactor commits that built the
underlying functionality.
2026-05-03 15:43:10 -04:00
5396ef8bf2 feat(example): fleet-sso-login — Zitadel device-code CLI login
Adds `examples/fleet_sso_login/` — the developer-side CLI that proves
the SSO works end-to-end against a deployed staging instance. RFC 8628
device-code flow:

- POSTs `/oauth/v2/device_authorization` with the harmony-cli client_id.
- Prints `verification_uri_complete` so the developer opens one URL in
  the browser; Zitadel handles the auth (username/password, MFA,
  whatever the customer has wired into Zitadel's auth chain).
- Polls `/oauth/v2/token` honouring the standard `authorization_pending`
  / `slow_down` polling protocol.
- On success: decodes the access token's claims, prints
  `Welcome <name> <email>`, persists the session (issuer + client_id +
  access_token + claims) at $DATA_DIR/harmony/sso-session.json with
  mode 0600.
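
The polling rules from RFC 8628 reduce to a small decision function. The sketch below is hypothetical (the example's actual types are not shown in this PR); it takes the `error` field of the token response, if any, and returns what the poll loop should do next.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PollStep {
    /// Sleep for this long, then poll again with it as the new interval.
    Wait(Duration),
    /// No error field: the response carries the tokens.
    Done,
    /// Terminal error such as `access_denied` or `expired_token`.
    Fail(String),
}

fn next_poll_step(error: Option<&str>, interval: Duration) -> PollStep {
    match error {
        None => PollStep::Done,
        Some("authorization_pending") => PollStep::Wait(interval),
        // RFC 8628 section 3.5: increase the interval by 5 seconds,
        // for this and all subsequent requests.
        Some("slow_down") => PollStep::Wait(interval + Duration::from_secs(5)),
        Some(other) => PollStep::Fail(other.to_string()),
    }
}
```

The caller is expected to carry the returned duration forward as the new interval, so a `slow_down` permanently widens the polling gap.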

For the demo this proves the SSO chain end-to-end. The actual
`harmony fleet apply` operation (which would consume the persisted
token through a fleet-platform API gateway) is post-demo — clusters
typically don't accept Zitadel JWTs as kube-apiserver bearer tokens
without an OIDC integration the customer would have to opt into.

`fleet_staging_deploy` now also provisions a `harmony-cli` Device
Code OIDC application alongside the existing API app, captures its
client_id from the ZitadelClientConfig cache, and prints both the
client_id and the exact `cargo run -p example-fleet-sso-login ...`
invocation in the operator's "next steps" panel.
2026-05-03 15:41:54 -04:00
8d8e700786 feat(example): fleet-staging-deploy — operator-side OKD bringup
Adds `examples/fleet_staging_deploy/` — the operator-side, run-once-
per-customer harness that brings up the fleet platform's central
services on a real OKD/K8s cluster. Complements the existing
`fleet_auth_callout` (k3d local-dev harness, kept unchanged) and
`fleet_rpi_setup` (per-device onboarding).

`FleetDomainConfig` is the single source of truth for hostnames:

  base_domain = "customer1.nationtech.io"
  → zitadel.<base>     (Zitadel HTTPS via OKD HAProxy edge-TLS)
  → nats.<base>        (NATS WSS through the same ingress)

Nothing is hardcoded; the operator supplies one --base-domain flag
and the deploy is fully parameterized. Re-running is idempotent
(rides the helm-upgrade-by-default + ZitadelSetupScore search-then-
create + persisted issuer-NKey-secret idempotency layers).

NATS values render under config.merge.{auth_callout, accounts,
system_account}, with WSS via `websocket: { enabled, port: 8443,
ingress: { className: openshift-default, ... } }` and the OKD-flavored
HAProxy edge-TLS annotations:

  route.openshift.io/termination: edge
  haproxy.router.openshift.io/timeout: "1h"

(Switch to `reencrypt` when the customer wants pod-to-edge TLS;
gateway-api migration is on their roadmap, separate from the demo.)

bring_up_staging():
- Deploys ZitadelScore (external_secure: true, no external_port → 443).
- Waits for HTTPS .well-known.
- Provisions the project + API app + roles via ZitadelSetupScore
  hitting Zitadel through the public ingress (port 443, TLS verified).
  No machine users provisioned — fleet_rpi_setup mints them on demand
  per device, so the staging deploy stays device-count-agnostic.
- Persists / reads the issuer NKey seed in the
  `callout-issuer-seed` K8s secret (so re-runs don't invalidate
  user JWTs already in flight on customer Pis).
- Deploys NATS via NatsHelmChartScore with the WSS values.
- Deploys NatsAuthCalloutScore (oidc_audience = project_id;
  external_secure path means no danger_accept_invalid_certs).

main.rs ends by printing the exact `cargo run -p
example-fleet-rpi-setup ...` invocation the operator runs against a
Pi, with the project_id and zitadel/nats URLs filled in.

Three unit tests cover the domain config + NATS values rendering
(WSS + edge-TLS annotations + auth_callout under merge).
2026-05-03 15:38:56 -04:00
ab98cbabf9 feat(fleet): per-device Zitadel bootstrap in fleet_rpi_setup
The Pi onboarding flow can now mint a per-device Zitadel machine user
on the operator's machine and ship the resulting JWT key to the Pi —
the agent then authenticates to NATS via JWT-bearer instead of shared
nats_user/nats_pass.

`FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous
flat `nats_user` / `nats_pass` fields. Two variants:

- TomlShared { nats_user, nats_pass } — legacy / dev fallback.
- ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } —
  per-device JWT-bearer. The Score:
    * Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json
      (mode 0640, owner fleet-agent — matches the agent's secret-mount
      conventions).
    * Renders [credentials] type = "zitadel-jwt" pointing at that
      keyfile + the issuer + audience the agent's CredentialSource
      needs.
  A change to either the keyfile content or the TOML triggers an
  agent restart, same as binary / unit drift.
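
A sketch of how the two variants might render into the agent's `[credentials]` block. It is illustrative only: the enum here carries `key_path` instead of `machine_key_json` (the Score writes the key to disk first), it omits the fields the commit elides with "...", `render_credentials` is a hypothetical name, and values are written without TOML escaping.

```rust
/// Trimmed, hypothetical version of the auth enum: only the fields
/// named in the commit message are modelled.
enum FleetDeviceAuth {
    TomlShared { nats_user: String, nats_pass: String },
    ZitadelJwt { key_path: String, oidc_issuer_url: String, audience: String },
}

/// Render the `[credentials]` section for the agent's TOML config.
/// Assumption: no value contains quotes or backslashes.
fn render_credentials(auth: &FleetDeviceAuth) -> String {
    match auth {
        FleetDeviceAuth::TomlShared { nats_user, nats_pass } => format!(
            "[credentials]\ntype = \"toml-shared\"\nnats_user = \"{}\"\nnats_pass = \"{}\"\n",
            nats_user, nats_pass
        ),
        FleetDeviceAuth::ZitadelJwt { key_path, oidc_issuer_url, audience } => format!(
            "[credentials]\ntype = \"zitadel-jwt\"\nkey_path = \"{}\"\noidc_issuer_url = \"{}\"\naudience = \"{}\"\n",
            key_path, oidc_issuer_url, audience
        ),
    }
}
```

Because the output is a pure function of the enum, comparing the rendered string with what is already on the device is enough to decide whether a restart is needed.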

`fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path.
The bootstrap PAT is held in the CLI's memory only; it never lands
on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id,
--zitadel-device-role (default `device`), --danger-accept-invalid-certs.

`zitadel_bootstrap` is a slim ManagementAPI client that, idempotently
per device:
1. Find-or-create machine user `device-${device_id}`.
2. Find-or-skip a project role grant (defaults to `device`).
3. Always mint a fresh JSON key and return its content. (Zitadel
   doesn't expose the private half of an existing key, so reusing
   isn't possible — stale keys remain valid until expiry, which is
   fine because each setup run overwrites the on-device keyfile.)

Three new render_toml tests cover the zitadel-jwt path; eleven
existing agent tests still pass.

Out of scope, tracked: device-join-request + admin-approve flow that
would replace bootstrap-PAT entirely (closer to the OKD
node-approval pattern). Long-lived admin PAT is acceptable for the
demo per product call.
2026-05-03 15:22:13 -04:00
b4d3d7d02c fix(linux): SshCredentials default_ubuntu_aws missing sudo_password
The merge of feat/prepare-rpi added a `sudo_password: Option<String>`
field to SshCredentials but the `default_ubuntu_aws` constructor on
the destination branch was authored before that field existed. Add
the missing field as `None` (matches the prepare-rpi semantics:
passwordless sudo expected unless explicitly configured).
2026-05-03 15:17:03 -04:00
c785f13abd merge: feat/prepare-rpi (Pi onboarding harness + linux-host capabilities) 2026-05-03 15:15:22 -04:00
74ee7fc9f2 feat(agent): Zitadel JWT credential source + auto-reconnect
The fleet agent's NATS connection is the load-bearing piece of the
"never lose connectivity to a device" guarantee. This commit makes
that hold even when Zitadel access tokens expire across NATS pod
restarts and network partitions.

New `[credentials]` config variants (internally tagged via `type`):

  type = "toml-shared"   { nats_user, nats_pass }   # v0/dev
  type = "zitadel-jwt"   { key_path, oidc_issuer_url, audience, ... }

A `CredentialSource` enum dispatches per variant:

- TomlShared returns the same user/pass each call.
- ZitadelJwt mints an access token from Zitadel via the JWT-bearer
  flow (RFC 7523). The keyfile at `key_path` is the only durable
  secret on the device; the bearer token is short-lived and refreshed
  in-memory when the cached value is within 5 minutes of expiry.
  Two concurrent refreshes are race-safe — the second writer's mint
  is wasted but produces a correct token.

The agent's `connect_nats` is rewritten on top of async-nats's
`with_auth_callback`, which is invoked on every (re)connect attempt:

- async-nats reconnects automatically on disconnect (default
  behaviour of ConnectOptions) — we don't need a watchdog.
- Each reconnect attempt invokes the callback, which calls
  `next_credential()`. If the cached token is expired, a fresh one
  is minted before the reconnect proceeds. So a Pi that loses NATS
  while its token has just expired will pick up a brand-new token
  on the next reconnect attempt with no operator intervention.
- An `event_callback` surfaces Connected / Disconnected / SlowConsumer
  / ServerError events into tracing — operators can see exactly when
  reconnects happen, which is non-negotiable for an out-of-warranty
  device fleet.

A subtle constraint drove the trait shape: async-nats's
`with_auth_callback` requires the returned future to be `Send + Sync`,
which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does
not satisfy. The credential source is therefore an enum (concrete
dispatch) rather than `dyn CredentialSource`. Two variants is small
enough that enum dispatch beats trait-object plumbing.
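
The enum-dispatch shape and the refresh rule can be sketched synchronously. Everything here is a simplification under stated assumptions: the real `next_credential()` is async and talks to Zitadel, whereas this version stubs the mint with a counter, assumes a one-hour token lifetime, and takes `now` as a parameter so the behaviour is deterministic.

```rust
use std::time::{Duration, Instant};

const REFRESH_LEEWAY: Duration = Duration::from_secs(5 * 60);
// Assumption for the stub only; real lifetimes come from the token response.
const STUB_TOKEN_LIFETIME: Duration = Duration::from_secs(60 * 60);

struct CachedToken {
    value: String,
    expires_at: Instant,
}

#[derive(Debug, PartialEq)]
enum NatsCredential {
    UserPass(String, String),
    Token(String),
}

enum CredentialSource {
    TomlShared { user: String, pass: String },
    ZitadelJwt { cached: Option<CachedToken>, mints: u32 },
}

impl CredentialSource {
    /// Called on every (re)connect attempt.
    fn next_credential(&mut self, now: Instant) -> NatsCredential {
        match self {
            CredentialSource::TomlShared { user, pass } => {
                NatsCredential::UserPass(user.clone(), pass.clone())
            }
            CredentialSource::ZitadelJwt { cached, mints } => {
                let fresh = cached.as_ref().map_or(false, |t| {
                    t.expires_at.saturating_duration_since(now) > REFRESH_LEEWAY
                });
                if !fresh {
                    // Stub mint: a counter stands in for the JWT-bearer exchange.
                    *mints += 1;
                    *cached = Some(CachedToken {
                        value: format!("token-{}", mints),
                        expires_at: now + STUB_TOKEN_LIFETIME,
                    });
                }
                let token = cached.as_ref().expect("fresh or just minted");
                NatsCredential::Token(token.value.clone())
            }
        }
    }
}
```

A reconnect 54 minutes after the first mint reuses the cached token (6 minutes remain); one at 56 minutes mints a new one, since only 4 minutes remain.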

Out of scope, tracked for follow-up: a separate daemon for SSH access
to the Pi via Tailscale/Headscale ("secure backdoor"), and the
device-join-request + admin-approve flow that would replace the
current admin-PAT bootstrap pattern.
2026-05-03 15:15:01 -04:00
a0a5faa3d0 chore: remove accidentally-committed scratch + agent worktrees
The previous commit swept in `.claude/worktrees/*` (ephemeral agent
worktree submodules) and a few scratch files that landed at the repo
root during prior sessions. None of them are project artifacts.
Removing them from the index and adding to .gitignore so future
`git add -A` doesn't re-include them.

Files on disk are unchanged.
2026-05-03 15:08:19 -04:00
6d55892736 feat(podman): env vars + bind-mount volumes + restart policy
The IoT walking-skeleton's PodmanV0Score and the underlying
ContainerSpec capability were name+image+ports only. Real customer
workloads (the demo target's docker-compose for example) need at
minimum:

- Environment variables for runtime config + secrets injected at
  deploy time.
- Bind-mount volumes so the container can persist data across
  recreates (sqlite db files, config dirs).
- Restart policy so the container survives device reboot or crash.

PodmanService and ContainerSpec gain `env: Vec<(String, String)>`,
`volumes: Vec<VolumeMount>`, and `restart_policy: RestartPolicy`. All
three default to empty / `unless-stopped` via #[serde(default)] so any
Deployment CR written before this change still deserializes — that
includes the existing smoke harnesses and any field-side state.

VolumeMount is bind-only in v0 (host_path -> container_path, optional
read_only). Named/anonymous volumes can be added behind the same field
later by inspecting host_path's shape; the customer's compose file is
expected to use bind mounts only.

RestartPolicy mirrors podman/docker convention — `no`,
`unless-stopped` (default, matching docker-compose), `on-failure`,
`always`. Serialized kebab-case so docker-compose translation is
mechanical.
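
The wire strings can be illustrated with a hand-rolled, std-only
version of the enum. The commit itself relies on serde for this; the
`as_wire` helper and the `FromStr` error type below are assumptions
made for the sketch.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RestartPolicy {
    No,
    UnlessStopped,
    OnFailure,
    Always,
}

impl Default for RestartPolicy {
    /// `unless-stopped` is the default named in the commit message.
    fn default() -> Self {
        RestartPolicy::UnlessStopped
    }
}

impl RestartPolicy {
    /// Kebab-case wire value, matching the podman/docker spelling.
    fn as_wire(self) -> &'static str {
        match self {
            RestartPolicy::No => "no",
            RestartPolicy::UnlessStopped => "unless-stopped",
            RestartPolicy::OnFailure => "on-failure",
            RestartPolicy::Always => "always",
        }
    }
}

impl fmt::Display for RestartPolicy {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_wire())
    }
}

impl FromStr for RestartPolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "no" => Ok(RestartPolicy::No),
            "unless-stopped" => Ok(RestartPolicy::UnlessStopped),
            "on-failure" => Ok(RestartPolicy::OnFailure),
            "always" => Ok(RestartPolicy::Always),
            other => Err(format!("unknown restart policy: {}", other)),
        }
    }
}
```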

PodmanTopology::ensure_service_running now passes env / mounts /
restart policy to the podman API. matches_spec conservatively forces
recreate whenever the spec carries non-empty env / volumes or a non-
default restart policy: the podman list endpoint doesn't surface those
fields, so a structural compare isn't possible from ListContainer
alone. Recreating an unchanged container is cheap (~hundreds of ms);
the alternative (silent stale-config window) isn't acceptable for
fleet-managed devices.
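
A rough sketch of that conservative rule, assuming simplified types.
`must_recreate` is a hypothetical name (the real function is
`matches_spec`), the struct fields are trimmed for illustration, and
the image comparison is an assumption about what the list endpoint
does expose.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum RestartPolicy {
    No,
    UnlessStopped,
    OnFailure,
    Always,
}

struct VolumeMount {
    host_path: String,
    container_path: String,
    read_only: bool,
}

struct ContainerSpec {
    image: String,
    env: Vec<(String, String)>,
    volumes: Vec<VolumeMount>,
    restart_policy: RestartPolicy,
}

/// Returns true when the container listing cannot prove the running
/// container still matches the spec.
fn must_recreate(spec: &ContainerSpec, running_image: &str) -> bool {
    // Assumption: the image is visible in the listing and can be compared.
    if spec.image != running_image {
        return true;
    }
    // env, volumes, and restart policy are not in the listing, so any
    // non-default value forces a recreate.
    !spec.env.is_empty()
        || !spec.volumes.is_empty()
        || spec.restart_policy != RestartPolicy::UnlessStopped
}
```

A legacy spec (no env, no volumes, default restart policy) keeps the
cheap path and is only recreated when the image differs.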

example_harmony_apply_deployment grows --env, --volume, and --restart
flags so an operator can drive the new shape from the CLI when
authoring a Deployment CR.

Tests:
- legacy CR JSON without the new fields deserializes (wire-compat).
- env ordering survives roundtrip (drift-detection invariant).
- restart policy serializes kebab-case (compose-translation contract).
- podman_v0_score_roundtrip exercises env + volumes + restart.
2026-05-03 15:08:01 -04:00
6c45fb22ba feat(nats-callout): production callout + harmony module + e2e demo
harmony-nats-callout becomes a deployable service, not just a library:
- New [[bin]] target with env+secret-file driven config and
  SIGINT/SIGTERM-aware shutdown.
- Dockerfile (single-stage archlinux:base, non-root, matches
  harmony-fleet-operator convention).
- Refactored handler into a pure `decide()` function so the entire
  authorization decision tree is unit-testable without async-nats.
- New `roles` module with role resolution + a `validate_device_id`
  security gate that rejects NATS subject metacharacters in device_id
  (.>* whitespace) — closes a real escalation path through the
  `{device_id}` placeholder in the per-device permissions block.
- Configurable role claim path + admin/device role names; admin wins
  when both are present (privilege-escalation invariant).
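
The two security rules in this list can be sketched without any NATS
dependency. The function signatures, the error strings, the rejection
of an empty id, and the `device.{device_id}.>` template are assumptions
for illustration; only the rejected characters and the admin-wins rule
come from the commit message.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Role {
    Admin,
    Device,
}

/// Rejects NATS subject metacharacters (`.`, `>`, `*`) and whitespace.
fn validate_device_id(device_id: &str) -> Result<(), String> {
    if device_id.is_empty() {
        // Assumption: an empty id is treated as invalid too.
        return Err("device_id is empty".to_string());
    }
    let bad = device_id
        .chars()
        .find(|&c| matches!(c, '.' | '>' | '*') || c.is_whitespace());
    match bad {
        Some(c) => Err(format!("device_id contains forbidden character {:?}", c)),
        None => Ok(()),
    }
}

/// Admin wins when both roles are present; no known role means rejection.
fn resolve_role(claimed: &[&str], admin_role: &str, device_role: &str) -> Option<Role> {
    if claimed.iter().any(|r| *r == admin_role) {
        Some(Role::Admin)
    } else if claimed.iter().any(|r| *r == device_role) {
        Some(Role::Device)
    } else {
        None
    }
}

/// Fills the `{device_id}` placeholder only after the id passes validation.
fn interpolate(template: &str, device_id: &str) -> Result<String, String> {
    validate_device_id(device_id)?;
    Ok(template.replace("{device_id}", device_id))
}
```

Without the gate, an id such as `>` would turn a per-device template
into a wildcard subscription, which is the escalation path the commit
closes.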

57 unit tests cover every reachable branch of the security decision
tree; 4 e2e tests in nats/integration-test-callout exercise real NATS
in podman with: device pubsub on own subjects, cross-device subject
isolation, admin-can-read-anything, and JWT-without-role rejection.

harmony/src/modules/nats_auth_callout/:
- New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment +
  Secret. fsGroup + 0o440 secret mode so the non-root container can
  read its mounted seed/password without leaving them in env vars.
- `render_auth_callout_block` helper produces the YAML for NATS Helm
  `config.merge.authorization.auth_callout` so both halves stay in
  sync.

examples/fleet_auth_callout/:
- `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres ->
  CoreDNS rewrite -> project + roles + machine users with JWT keys
  -> NATS Helm with auth_callout block -> callout image build +
  sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs
  (issuer NKey persisted in a K8s secret so user JWTs survive
  restarts).
- `mint_access_token()` is an RFC 7523 JWT-bearer client. It sends a
  Host header that includes the port so Zitadel emits a matching issuer.
- main.rs prints URLs/creds/keyIds and waits for Ctrl-C.
- Three #[tokio::test] functions sharing one cluster via OnceCell:
  admin_can_read_any_device_subject, device_can_only_access_own_subjects,
  unknown_role_is_rejected. All green on real k3d.
2026-05-03 15:01:44 -04:00
b8bc2217fd feat(zitadel): ExternalPort + machine-user/role/key/grant provisioning
ZitadelScore:
- Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's
  FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on
  this secret existing; without the chart values, the prior code path
  was non-functional.
- New `external_port: Option<u32>` field. Controls Zitadel's emitted
  issuer URL when the host port mapping isn't 80/443 (k3d typically
  maps 8080:80). Without it, JWT-bearer audience validation 500s with
  `Errors.Internal` because the assertion's `aud` doesn't match the
  chart-default issuer at port 80.

ZitadelSetupScore is extended for the JWT-bearer flow needed by the
NATS auth callout:
- API apps (resource servers — required for project-id audience scope)
- Project roles (`POST .../projects/{id}/roles`, idempotent)
- Machine users with KEY_TYPE_JSON keys (provisioned + cached
  device-side; Zitadel does not expose the key material on subsequent
  reads, so the local cache is the source of truth)
- User grants (project + role keys)

Cache (ZitadelClientConfig) gains projects, machine_user_ids,
machine_keys, and user_grants — keyed for idempotency across re-runs.
Backwards compatible with existing harmony_sso example: the new fields
have `#[serde(default)]` and prior callers just need empty vecs.

The upgrade-by-default change to Helm releases (separate commit) lets
ExternalPort changes propagate to existing releases on re-run.
2026-05-03 15:01:22 -04:00
36974bda32 refactor(helm): upgrade-by-default for unpinned releases
Helm releases without a pinned `chart_version` previously short-circuited
to a NOOP when already installed, which silently dropped any
`values_yaml` / `values_overrides` changes the caller had made. Now we
fall through to `helm upgrade --install` whenever:

- the release isn't installed (unchanged), or
- it's installed and either unpinned or pinned-and-matching.

Helm itself becomes the source of truth for "did anything actually
change" — no-op upgrades are cheap and changed values get applied
automatically without the caller having to opt in via a flag.

`install_only=true` keeps the prior skip-if-installed shortcut so
bootstrap operators (cert-manager, prometheus-operator, CRDs) that
should not be touched on re-runs continue to behave the same.

Pinned-version safety net is unchanged: a different version installed
than what the score requests is an error, never a silent change.
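
The decision described above can be written as a small pure function.
This is a sketch under assumptions: `HelmAction` and `plan_release` are
hypothetical names, and checking the version mismatch before
`install_only` is a guess at the ordering, which the commit message
does not spell out.

```rust
#[derive(Debug, PartialEq, Eq)]
enum HelmAction {
    /// Run `helm upgrade --install`.
    UpgradeInstall,
    /// Leave the existing release untouched.
    Skip,
    /// Installed version differs from the pinned one: an error.
    VersionMismatch { installed: String, requested: String },
}

fn plan_release(
    installed_version: Option<&str>,
    pinned_version: Option<&str>,
    install_only: bool,
) -> HelmAction {
    // Not installed yet: always install.
    let installed = match installed_version {
        None => return HelmAction::UpgradeInstall,
        Some(v) => v,
    };
    // Pinned-version safety net (assumed to take precedence over install_only).
    if let Some(requested) = pinned_version {
        if requested != installed {
            return HelmAction::VersionMismatch {
                installed: installed.to_string(),
                requested: requested.to_string(),
            };
        }
    }
    // Bootstrap operators keep the skip-if-installed shortcut.
    if install_only {
        return HelmAction::Skip;
    }
    // Unpinned, or pinned and matching: let Helm decide what changed.
    HelmAction::UpgradeInstall
}
```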
2026-05-03 15:01:07 -04:00
95a75d50a8 feat: Improve name of disable dad and system reserved score to show pool name
Some checks failed
Run Check Script / check (push) Failing after -44h57m34s
Compile and package harmony_composer / package_harmony_composer (push) Failing after -44h56m7s
2026-05-03 07:17:40 -04:00
7fa1ca2683 feat: default for ubuntu aws linux topology
Some checks failed
Run Check Script / check (pull_request) Failing after 12m51s
2026-05-01 08:53:03 -04:00
af67992b6e refactor: production auth callout service with real integration tests
nats-jwt:
- Add NkeyPub newtype with prefix validation
- Add ClaimType and Algorithm typed enums
- Add impl_nats_claims! macro eliminating 4x duplicated impl blocks
- Add AuthorizationRequestClaimsBuilder (completing all builder types)
- Fix AuthorizationResponseBuilder: add issuer() builder method, stop
  mutating iss in sign()
- Tighten trait bounds: encode<T: Serialize>, decode_unverified<T:
  DeserializeOwned>
- Remove dead error variants Expired/NotYetValid
- Add builder tests for all 4 claims types
- Deduplicate is_zero helper

harmony-nats-callout (rewritten):
- AuthCalloutService: production service connecting to NATS, subscribing
  to .REQ.USER.AUTH, dispatching auth requests
- AuthCalloutConfig with builder pattern
- handler.rs: pure auth request handler (decode → validate → mint →
  respond) extracted from test
- Fix ZitadelValidator: validate() is now async (was blocking_read
  deadlock in async contexts)
- Remove dead fields kid_map, jwks_uri
- Make danger_accept_invalid_certs configurable
- permissions: InterpolatedPermissions named struct instead of 4-tuple

integration-test-callout:
- Converted to lib+test crate: src/lib.rs exports test utilities
- Tests now exercise the REAL AuthCalloutService (not inline handler)
- Extracted MockOidcServer, NatsServer, CalloutContext into library
- Replace yasna with rsa crate for DER parsing
- Add Drop to NatsServer for container cleanup
- Add module constants for all magic values
- README updated with new architecture diagram
2026-04-29 00:45:05 -04:00
48ec80ed66 docs: add integration test README with auth flow diagram 2026-04-28 23:21:23 -04:00
f848d94808 refactor: remove dead operator-mode code from nats crates
- Remove operator-mode files: account_manager, authorizer, service, config,
  main.rs, plan.md from callout crate
- Remove operator/activation claims from nats-jwt (builder and claims)
- Inline PermissionsConfig into permissions.rs (config.rs removed)
- Remove harmony-nats-callout dep from integration test (unused)
- Remove unused imports in algorithm.rs tests
- Clean up callout Cargo.toml (remove bin, unused deps)
2026-04-28 23:20:37 -04:00
65daa76658 feat: NATS auth callout e2e integration test
- nats-jwt crate: JWT builder types for user claims, authorization
  request/response, account claims, algorithm encode/decode
- harmony-nats-callout crate: Zitadel OIDC JWT validator, callout
  service scaffold, account manager (WIP)
- integration-test-callout: end-to-end test validating the full
  auth callout flow — device connects with Zitadel JWT → callout
  validates JWT → returns per-device user JWT with scoped
  permissions → device can pub/sub on its own subjects only
- Mock OIDC server for test (JWKS + openid-configuration)
- Negative test: device A cannot subscribe to device B's subjects
- Added UserClaimsBuilder::audience() for account-scoped user JWTs
2026-04-28 23:15:18 -04:00
50debfd163 chore: Some code review comments inlined 2026-04-28 16:41:15 -04:00
be4b9acaad Merge pull request 'fix(opnsense): valid HAProxy config + From<&str> codegen cleanup' (#273) from fix/haproxy-issues into master
Some checks failed
Run Check Script / check (push) Successful in 2m8s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m13s
Reviewed-on: #273
2026-04-22 17:01:53 +00:00
ead76e710f fix(opnsense): lowercase match arms in generated From<&str>
All checks were successful
Run Check Script / check (pull_request) Successful in 2m6s
Two regressions from fc16e9f that ./build/check.sh catches:

1. `opnsense-api`'s `test_haproxy_deser` example references
   `resp.haproxy` on the response wrapper. The regen auto-derived the
   field name as `op_nsenseha_proxy` from the struct name. Need to pass
   `--api-key haproxy` to keep the wrapper key stable.

2. For enums whose wire values aren't all-lowercase (e.g. `"SSLv3"`,
   `"CONNECT"`), the emitted `From<&str>` matched `s.to_lowercase()`
   against the original-case wire value, which clippy flags as
   unreachable ("match arm has differing case"). Lowercase the wire
   value in the emitted match arm so case-insensitive matching actually
   works; serialization still emits the original-case wire value
   because the serde module is unaffected.
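
A hand-written sketch of what the corrected output might look like for
one enum. `SslVersion`, its `Tlsv12` variant, and `wire_value` are
illustrative names rather than the generated code; `"SSLv3"` is the
wire value quoted above.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum SslVersion {
    Sslv3,
    Tlsv12,
    /// Forward-compat fallback for unknown wire values.
    Other(String),
}

impl From<&str> for SslVersion {
    fn from(s: &str) -> Self {
        // The arms are lowercase because the input is lowercased first;
        // an arm spelled "SSLv3" here could never match.
        match s.to_lowercase().as_str() {
            "sslv3" => SslVersion::Sslv3,
            "tlsv12" => SslVersion::Tlsv12,
            _ => SslVersion::Other(s.to_string()),
        }
    }
}

impl From<String> for SslVersion {
    fn from(s: String) -> Self {
        SslVersion::from(s.as_str())
    }
}

impl SslVersion {
    /// Serialization side: still emits the original-case wire value.
    fn wire_value(&self) -> &str {
        match self {
            SslVersion::Sslv3 => "SSLv3",
            SslVersion::Tlsv12 => "TLSv12",
            SslVersion::Other(s) => s.as_str(),
        }
    }
}
```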

Regenerated `haproxy.rs` via
`cargo run -p opnsense-codegen -- generate --xml ... --module-name haproxy --api-key haproxy`.

`./build/check.sh` now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:52:26 -04:00
fc16e9fac9 refactor(opnsense): use From<&str> for wire-value conversions
Some checks failed
Run Check Script / check (pull_request) Failing after 54s
Addresses review feedback on the previous HAProxy field-default fixes:
the eight match blocks in `configure_service` that mapped loose strings
("get", "tcp", "roundrobin", ...) to generated OPNsense enum variants
were poor Rust — they duplicated the wire-value knowledge that the
codegen already has, and any new enum variant in OPNsense meant editing
every call site by hand.

- `opnsense-codegen/src/codegen.rs::generate_enum` now emits
  `impl From<&str>` and `impl From<String>` for every generated enum,
  right after the existing serde module. Lowercase-matches wire values;
  unknown inputs fall through to the `Other(String)` variant the codegen
  already emits for forward-compat round-tripping.
- `opnsense-api/src/generated/haproxy.rs` regenerated — 153 enums, 306
  new impl blocks. No hand edits; re-run via
  `cargo run -p opnsense-codegen -- generate --xml
  opnsense-codegen/vendor/plugins/net/haproxy/src/opnsense/mvc/app/models/OPNsense/HAProxy/HAProxy.xml
  --output-dir opnsense-api/src/generated --module-name haproxy`.
- `opnsense-config/src/modules/load_balancer.rs::configure_service`
  replaces eight string-match blocks with one-liners:
  `HealthcheckType::from(hc.check_type.as_str())` etc.
- Drive-by: fixed a pre-existing typo at
  `harmony/src/infra/opnsense/load_balancer.rs:185` and the matching
  reverse at `:149` — `SSL::SNI` was mapped to `"sslni"`, but the
  OPNsense wire value is `"sslsni"`. Before this refactor the typo
  silently hit `HealthcheckSsl::Other("sslni")`; the cleaner conversion
  made the bug obvious so it's fixed here rather than left for a
  follow-up.

Verification:
- `cargo check -p harmony -p opnsense-config -p opnsense-api` clean
- `cargo test -p harmony --lib okd::load_balancer` 6/6 pass
- `cargo test -p opnsense-codegen` 22/22 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:31:35 -04:00
a196268c1e revert(okd): bind load balancer on 0.0.0.0 again
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
Reverting 5e72777. The HAProxy startup failure that motivated the
bind-to-FW-IP change was environment-specific on the sttest basement
firewall: OPNsense's "HTTP → HTTPS redirect" service (lighttpd bound to
`[::]:80`, dual-stack) was holding IPv4 port 80 via v4-mapped addresses
— invisible in `sockstat -l4` but still enough to make `0.0.0.0:80`
return EADDRINUSE to HAProxy.

Disabling the HTTP redirect on that firewall resolves the conflict.
Other OPNsense deployments already ship with the redirect off (or
HAProxy on non-conflicting ports), so `0.0.0.0` remains the correct
default.

This reverts commit 5e72777.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:10:57 -04:00
5a17bc229e fix: formatting
All checks were successful
Run Check Script / check (pull_request) Successful in 2m19s
2026-04-22 11:29:33 -04:00
5e72777c15 fix(okd): bind load balancer services on firewall IP, not 0.0.0.0
Some checks failed
Run Check Script / check (pull_request) Failing after 56s
Binding HAProxy on 0.0.0.0 collided with OPNsense's own listeners
(HTTP→HTTPS redirect on :80, WebUI, etc.), preventing the HAProxy
service from starting once the LoadBalancer score was applied.

Use `topology.load_balancer.get_ip()` to bind each frontend on the
firewall's LAN interface IP instead. The `LoadBalancer` capability was
already in scope, so no new trait imports are needed.

The previous `0.0.0.0` rationale (avoiding CARP VIP rebind races) is
noted in a comment: HA CARP setups still need OPNsense's
`net.inet.ip.nonlocal_bind` or HAProxy `transparent` bind — not
addressed here.

Test module: added an inline `DummyLoadBalancer` stub (mirrors the
existing `DummyRouter` pattern) so `OKDLoadBalancerScore::new` no longer
hits `DummyInfra::get_ip`'s `unimplemented!()` panic. Renamed
`test_all_services_bind_on_unspecified_address` →
`test_all_services_bind_on_firewall_ip`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:07:59 -04:00
21035d2c56 fix(opnsense): set HAProxy healthcheck/server fields explicitly
`configure_service` was relying on `..Default::default()` for most fields
of the generated HAProxy structs. That leaked OPNsense's *model defaults*
into the wire payload for fields Harmony never meant to default:

- `http_host` → `localhost` (sent `Host: localhost` on every check)
- `http_method` → `options` (sent OPTIONS instead of the declared method)
- `http_version` → `http10` (wanted NONE)
- `sslVerify` on real servers → `1` (broke self-signed backends)
- Healthcheck `ssl` was never propagated, so SSL-required checks like
  kube-apiserver `/readyz` on 6443 stayed plain HTTP and never succeeded

Set every field explicitly from `LbHealthCheck`/`LbServer`: map
`http_method` through `HealthcheckHttpMethod`, pass `None` for
`http_version` (serializes as `""` = NONE), clear `http_host` to an empty
string, propagate `hc.ssl` through `HealthcheckSsl`, and pin
`ssl`/`sslVerify` to `false` on the server struct so intent is declared
at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:07:47 -04:00
83d9af211a fix(opnsense): distinguish unreachable API from missing HAProxy plugin
`LoadBalancerConfig::is_installed` previously collapsed every error from
the settings endpoint into `false`, so a timeout, DNS failure, or auth
rejection all looked identical to "os-haproxy not installed" — the
`LoadBalancer` score would then attempt to install the plugin on top of
an unreachable firewall and fail in cascade further down the pipeline.

Return `Result<bool, Error>` and treat only HTTP 404 (controller not
found) as "not installed". Every other error is propagated so
`ensure_initialized` fails the score immediately with a message pointing
at the real problem.
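
In outline, the new contract looks like the sketch below. `ApiError`
and its variants are hypothetical stand-ins for the real error type,
and the function takes the settings response as an argument so the
mapping can be shown without a network call.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ApiError {
    Http { status: u16 },
    Timeout,
    Dns(String),
}

/// Only HTTP 404 (controller not found) means "plugin not installed".
/// Every other failure is propagated to the caller.
fn is_installed(settings_response: Result<String, ApiError>) -> Result<bool, ApiError> {
    match settings_response {
        Ok(_) => Ok(true),
        Err(ApiError::Http { status: 404 }) => Ok(false),
        Err(other) => Err(other),
    }
}
```

A caller can then use `?` on the result, so an unreachable firewall
fails the score before any plugin install is attempted.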

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 09:53:22 -04:00
503f9eb357 Merge pull request 'feat: capture network intent at host discovery' (#267) from feat/discover-networking into master
Some checks failed
Run Check Script / check (push) Successful in 2m8s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m12s
Reviewed-on: #267
2026-04-21 16:20:48 +00:00
84a083a012 refactor(discovery): use shared LaggProtocol for bond mode
All checks were successful
Run Check Script / check (pull_request) Successful in 2m6s
Replace the Linux-specific BondMode enum with harmony_types'
  LaggProtocol, which is already used by the OPNsense LAGG score.
  "Capabilities are industry concepts, not tools" — the kernel mode
  numbers (BalanceRr/ActiveBackup/…) were the wrong abstraction;
  LaggProtocol's Lacp / Failover / LoadBalance / RoundRobin span
  Linux bonding and BSD lagg uniformly. LaggProtocol now derives
  Deserialize so NetworkConfig can round-trip through SQLite.

  Make SqliteInventoryRepository::get_role_mapping tolerate a
  network_config blob it cannot deserialize: log a warning and
  fall back to NetworkConfig::default() so the operator still sees
  the existing mapping prompt and can pick "Update" to overwrite
  the bad row. This self-heals DBs that were written with the old
  BondMode variant names and gives the repo real resilience for
  future NetworkConfig evolutions.
2026-04-21 12:05:43 -04:00
adb05a0b91 chore(discovery): drop sqlite WAL sidecars, add blank line after prompts
All checks were successful
Run Check Script / check (pull_request) Successful in 2m4s
- Switch SqliteInventoryRepository to DELETE journal mode with
    create_if_missing, so `.sqlite-wal` / `.sqlite-shm` files no longer
    appear next to the DB. Existing WAL-mode DBs are checkpointed and
    converted on next open.

  - Print a blank line after prompt_network_config returns so the save
    logs don't stomp on the last answered question.
2026-04-21 11:29:03 -04:00
0556b2ea0d feat(discovery): replace role mappings, sort NICs, polish host header
- host_role_mapping now holds at most one row per host_id.
    SqliteInventoryRepository::save_role_mapping wraps a DELETE of any
    prior rows for the host and the INSERT of the new one in a single
    transaction, self-healing pre-existing duplicate rows along the way.

  - Before re-prompting for disk and networking, the discovery flow
    looks up the current role mapping via the new
    InventoryRepository::get_role_mapping(host_id) method. If one
    exists, the operator sees a summary (role, install disk, bond
    mode + interfaces, blacklist) and picks between "Update" and
    "Cancel"; cancelling skips the host entirely and continues the
    selection loop without touching the DB. New HostRoleMapping
    domain type carries the returned row back to the caller.

  - Network interfaces are sorted by name at the hwinfo-to-domain
    conversion step (both MDNS and CIDR flows), so f0 always appears
    before f1 in every downstream consumer — host summary, bond
    multi-select, blacklist multi-select. This also makes the
    byte-equality dedup in save() robust against the agent returning
    NICs in different sysfs-walk order across reboots.

  - PhysicalHost::summary() split into summary_parts_through_storage()
    + append_network_summary(), with a new public summary_short()
    variant that omits the NIC list. print_host_header() in the
    discovery prompts now uses summary_short() so the "Host: ..."
    banner fits on one line; full summaries still render in the node
    picker, logs, and Display impl.

  - Fix CPU summary rendering when the agent reports an empty model:
    single-CPU renders as "6c/6t", multi-CPU as "2x CPU (12c/24t)",
    no stray double-space in the pipe-separated summary.

  - Regenerate .sqlx offline cache for the new DELETE and SELECT
    queries.
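
The NIC ordering point can be illustrated with a minimal sketch. The
`NetworkInterface` struct here is a trimmed, hypothetical stand-in for
the domain type, and `sort_interfaces` is an illustrative name.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct NetworkInterface {
    name: String,
    mac: String,
}

/// Sorts by interface name so every consumer sees the same order,
/// regardless of the order in which the agent walked sysfs.
fn sort_interfaces(mut nics: Vec<NetworkInterface>) -> Vec<NetworkInterface> {
    nics.sort_by(|a, b| a.name.cmp(&b.name));
    nics
}
```

Two reports that differ only in NIC order become equal after sorting,
which is what lets a byte-level comparison of the serialized host
treat them as unchanged. Note that this is a plain lexicographic sort,
so `f10` would sort before `f2`.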
2026-04-21 11:19:23 -04:00
18fc87a597 feat(discovery): dedup identical host saves and harmonize prompt headers
- SqliteInventoryRepository::save() now compares the incoming
    serde_json bytes against the latest stored `data` blob for this
    host_id. If byte-identical, the insert is skipped with an info log
    "Host '<id>' unchanged, skipping save". Genuine changes still
    produce a new version row, preserving the audit trail. Eliminates
    the unbounded row growth from repeated discovery (mDNS is
    continuous, CIDR scans often re-run). Addresses the long-standing
    FIXME in modules/inventory; the comment is now removed.

  - Reworded the caller-side log that fires after repo.save() from
    "Saved [new] host id X, summary: ..." to "Discovered host X,
    summary: ...". The old text claimed "Saved" even when the repo had
    actually skipped the insert, producing contradictory log lines on
    re-runs.

  - Harmonized every host-specific inquire prompt in the discovery
    flow behind a new print_host_header() helper: each prompt is now
    preceded by a blank line and a "Host: <summary>" banner, and the
    redundant host name inside the question text is stripped (disk
    prompt, bond confirm). The node-selection prompt is unchanged --
    it picks *which* host, so there is no current host yet.
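
The dedup rule in the first bullet can be modelled in memory. This is
a sketch only: `HostStore` is a hypothetical stand-in for the SQLite
table, with rows kept in insertion order so the last matching row is
the latest version.

```rust
/// In-memory stand-in for the versioned host table: (host_id, data) rows.
struct HostStore {
    rows: Vec<(String, Vec<u8>)>,
}

impl HostStore {
    fn new() -> Self {
        HostStore { rows: Vec::new() }
    }

    /// Returns true when a new version row was written, false when the
    /// incoming bytes match the latest stored blob for this host.
    fn save(&mut self, host_id: &str, data: &[u8]) -> bool {
        let latest = self
            .rows
            .iter()
            .rev()
            .find(|(id, _)| id.as_str() == host_id)
            .map(|(_, stored)| stored.as_slice());
        if latest == Some(data) {
            return false; // unchanged, skip the insert
        }
        self.rows.push((host_id.to_string(), data.to_vec()));
        true
    }

    /// Number of stored versions for a host.
    fn versions(&self, host_id: &str) -> usize {
        self.rows.iter().filter(|(id, _)| id.as_str() == host_id).count()
    }
}
```

Only the latest row is compared, so a host that changes and later
changes back still gets a new version row, which keeps the audit trail
intact.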
2026-04-21 10:56:46 -04:00
bdba4dda27 feat(discovery): tighten host summary and readability of prompts
- PhysicalHost::summary() becomes terser and more informative:
    - Storage: "400 GB [8 GB, 477 GB]" (was "400 GB Storage (2 Disks [8 GB, 477 GB])").
      Single-disk collapses to just the total.
    - Network: list every NIC as "[ip, mac]" with a count prefix
      (e.g. "3 NICs: [192.168.40.10, 98:fa:9b:03:17:6f], [00:e0:ed:7a:ec:4d], ...").
      Single-NIC form drops the count and "s": "NIC: [ip, mac]".
      NICs without an IPv4 render as "[mac]".

  - Promote the inventory agent's Chipset { vendor, name } into a
    "system-product-name" label during host conversion (both MDNS and CIDR
    flows), so summary()'s first field shows "LENOVO 3136" instead of
    falling back to the HostCategory string ("Server"). Extracted into
    build_discovered_host_labels() to keep the two conversion sites in
    sync. When the chipset is blank, the old category fallback still
    applies.

  - Print a blank line before every interactive inquire prompt in the
    discovery flow (role pick, disk pick, bond confirm/multi-select/mode,
    blacklist confirm/multi-select) so prompts stand out from the
    preceding log output on the terminal.
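
The network part of the summary format can be sketched as below. The
`Nic` struct and function names are hypothetical, and returning an
empty string for a host with no NICs is an assumption; the bracketed
forms and the count prefix follow the examples in the first bullet.

```rust
struct Nic {
    ipv4: Option<String>,
    mac: String,
}

/// "[ip, mac]" when an IPv4 address is known, "[mac]" otherwise.
fn nic_entry(nic: &Nic) -> String {
    match &nic.ipv4 {
        Some(ip) => format!("[{}, {}]", ip, nic.mac),
        None => format!("[{}]", nic.mac),
    }
}

/// Single NIC drops the count; several NICs get a count prefix.
fn network_summary(nics: &[Nic]) -> String {
    let entries: Vec<String> = nics.iter().map(nic_entry).collect();
    match entries.len() {
        0 => String::new(), // assumption: nothing to show
        1 => format!("NIC: {}", entries[0]),
        n => format!("{} NICs: {}", n, entries.join(", ")),
    }
}
```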
2026-04-21 10:35:48 -04:00
bf4f300383 feat(discovery): capture bond, blacklist and bond-mode intent per host
Extend DiscoverHostForRoleScore with three new interactive prompts after
  the installation-disk selection:

  - "Configure a network bond?" (only when host has >= 2 NICs), followed by
    a multi-select of bond members (min 2) and a bond-mode picker
    (LACP / active-backup / balance-rr / balance-xor / broadcast /
    balance-tlb / balance-alb).
  - "Blacklist any remaining interface?", with candidates limited to NICs
    not already claimed by the bond.

  The answers are persisted as a JSON-encoded NetworkConfig on a new
  host_role_mapping.network_config column. HostConfig now exposes
  network_config alongside installation_device so downstream scores can
  honor the user's intent.

  Also adds a new harmony_host_discovery example that discovers a single
  host on 192.168.40.0/24:25000.
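
The prompt preconditions can be sketched as three small helpers. All
names here are hypothetical; only the rules (bond offered with at
least 2 NICs, at least 2 bond members, blacklist candidates limited to
NICs outside the bond) come from the description.

```rust
/// The bond prompt is only shown when the host has at least 2 NICs.
fn bond_offered(nic_count: usize) -> bool {
    nic_count >= 2
}

/// A bond needs at least 2 members.
fn validate_bond(members: &[String]) -> Result<(), String> {
    if members.len() < 2 {
        return Err(format!("a bond needs at least 2 members, got {}", members.len()));
    }
    Ok(())
}

/// Blacklist candidates are the NICs not already claimed by the bond.
fn blacklist_candidates<'a>(all: &'a [String], bond: &[String]) -> Vec<&'a str> {
    all.iter()
        .filter(|name| !bond.contains(*name))
        .map(|name| name.as_str())
        .collect()
}
```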
2026-04-21 10:13:58 -04:00
52db82865d Merge pull request 'feat(monitoring): Datadog 15-key-metrics dashboard + Ceph "what's wrong" drilldown' (#266) from feat/datadog-k8s-metrics into master
Some checks failed
Run Check Script / check (push) Successful in 2m5s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m4s
Reviewed-on: #266
2026-04-21 11:21:28 +00:00
349c2a1358 feat: improve ceph dashboard
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
2026-04-20 15:58:52 -04:00
c2718e843b feat: improve ceph dashboard - list alerts and WHY it's NOT green
All checks were successful
Run Check Script / check (pull_request) Successful in 2m6s
2026-04-20 15:47:12 -04:00
391c44b369 feat: add the datadog-15-k8s-metrics dashboard 2026-04-20 15:29:54 -04:00
bae162a3e4 Merge pull request 'feat(monitoring): Ceph alerts integrated with OKD's native alerting stack' (#265) from feat/ceph-alerts into master
Some checks failed
Run Check Script / check (push) Successful in 2m14s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m11s
Reviewed-on: #265
2026-04-20 18:14:44 +00:00
8acd9de275 feat: score to create ceph alerts in the okd default alerting stack
All checks were successful
Run Check Script / check (pull_request) Successful in 2m10s
2026-04-20 13:52:36 -04:00
ef418f2f96 Merge pull request 'feat: Disable ipv4 address conflict detection score. This is useful when setting up bonds as the wrong mac may get a dhcp offer and then the system will perceive it as a conflict when it sets up the bond correctly' (#263) from feat/disableDadScore into master
Some checks failed
Run Check Script / check (push) Successful in 2m11s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m12s
Reviewed-on: #263
2026-04-20 17:44:29 +00:00
126390bb63 feat: split storage dashboard in two : ceph + persistent storage 2026-04-20 13:09:34 -04:00
7265d8a4f3 fix: fix ceph dashboard for root volumes not populated 2026-04-20 12:01:25 -04:00
54ef3f70bd feat: Refactor dad score into reusable node file score using machine config
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
2026-04-19 05:01:09 -04:00
6267c2757f feat: Disable ipv4 address conflict detection score. This is useful when setting up bonds as the wrong mac may get a dhcp offer and then the system will perceive it as a conflict when it sets up the bond correctly
All checks were successful
Run Check Script / check (pull_request) Successful in 2m37s
2026-04-15 15:48:22 -04:00