feat/nats-auth-callout-e2e #279

johnride · 2026-05-01T13:37:44Z

No description provided.

johnride added 6 commits 2026-05-01 13:37:44 +00:00

chore: Some code review comments inlined 50debfd163

feat: NATS auth callout e2e integration test 65daa76658

- nats-jwt crate: JWT builder types for user claims, authorization
  request/response, account claims, algorithm encode/decode
- harmony-nats-callout crate: Zitadel OIDC JWT validator, callout
  service scaffold, account manager (WIP)
- integration-test-callout: end-to-end test validating the full
  auth callout flow — device connects with Zitadel JWT → callout
  validates JWT → returns per-device user JWT with scoped
  permissions → device can pub/sub on its own subjects only
- Mock OIDC server for test (JWKS + openid-configuration)
- Negative test: device A cannot subscribe to device B's subjects
- Added UserClaimsBuilder::audience() for account-scoped user JWTs

refactor: remove dead operator-mode code from nats crates f848d94808

- Remove operator-mode files: account_manager, authorizer, service, config,
  main.rs, plan.md from callout crate
- Remove operator/activation claims from nats-jwt (builder and claims)
- Inline PermissionsConfig into permissions.rs (config.rs removed)
- Remove harmony-nats-callout dep from integration test (unused)
- Remove unused imports in algorithm.rs tests
- Clean up callout Cargo.toml (remove bin, unused deps)

docs: add integration test README with auth flow diagram 48ec80ed66

refactor: production auth callout service with real integration tests af67992b6e

nats-jwt:
- Add NkeyPub newtype with prefix validation
- Add ClaimType and Algorithm typed enums
- Add impl_nats_claims! macro eliminating 4x duplicated impl blocks
- Add AuthorizationRequestClaimsBuilder (completing all builder types)
- Fix AuthorizationResponseBuilder: add issuer() builder method, stop
  mutating iss in sign()
- Tighten trait bounds: encode<T: Serialize>, decode_unverified<T:
  DeserializeOwned>
- Remove dead error variants Expired/NotYetValid
- Add builder tests for all 4 claims types
- Deduplicate is_zero helper

harmony-nats-callout (rewritten):
- AuthCalloutService: production service connecting to NATS, subscribing
  to $SYS.REQ.USER.AUTH, dispatching auth requests
- AuthCalloutConfig with builder pattern
- handler.rs: pure auth request handler (decode → validate → mint →
  respond) extracted from test
- Fix ZitadelValidator: validate() is now async (was blocking_read
  deadlock in async contexts)
- Remove dead fields kid_map, jwks_uri
- Make danger_accept_invalid_certs configurable
- permissions: InterpolatedPermissions named struct instead of 4-tuple

integration-test-callout:
- Converted to lib+test crate: src/lib.rs exports test utilities
- Tests now exercise the REAL AuthCalloutService (not inline handler)
- Extracted MockOidcServer, NatsServer, CalloutContext into library
- Replace yasna with rsa crate for DER parsing
- Add Drop to NatsServer for container cleanup
- Add module constants for all magic values
- README updated with new architecture diagram

feat: default for ubuntu aws linux topology 7fa1ca2683

Run Check Script / check (pull_request) Failing after 12m51s

johnride added 35 commits 2026-05-04 12:59:58 +00:00

fix(linux): make ensure_user_unit_active actually report NOOP 97ec4848fd

systemctl --user enable --now is systemd-level idempotent, but the
prior implementation always returned ChangeReport::CHANGED. This made
every re-run of any score that touches a user-scoped unit (notably
FleetDeviceSetupScore's podman.socket step) lie about its change
count, defeating the noop detection the rest of the score honors.

Probe is-enabled --quiet && is-active --quiet first; only call
enable --now (and report CHANGED) when the unit isn't already in the
desired state. Mirrors the existing ensure_linger pattern in the
same file.
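
The probe-then-act shape described above can be sketched as follows. This is a minimal illustration, not the code in the commit: the `run` closure is a hypothetical stand-in for the SSH executor (it returns `Ok(true)` when the remote command exits 0), and the local `ChangeReport` enum only mirrors the name used in the message.

```rust
/// Outcome of an idempotent "ensure" step (local stand-in for the real type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ChangeReport {
    Noop,
    Changed,
}

/// Probe first, act only when the unit is not already enabled and active.
///
/// `run` is a hypothetical executor: it receives a shell command line and
/// returns `Ok(true)` when that command exits with status 0.
pub fn ensure_user_unit_active<F>(unit: &str, mut run: F) -> Result<ChangeReport, String>
where
    F: FnMut(&str) -> Result<bool, String>,
{
    let probe = format!(
        "systemctl --user is-enabled --quiet {unit} && systemctl --user is-active --quiet {unit}"
    );
    if run(&probe)? {
        // Already in the desired state: report NOOP and touch nothing.
        return Ok(ChangeReport::Noop);
    }
    let enable = format!("systemctl --user enable --now {unit}");
    if run(&enable)? {
        Ok(ChangeReport::Changed)
    } else {
        Err(format!("failed to enable {unit}"))
    }
}
```

With this shape, a re-run against an already-converged host issues only the probe and never reports a change.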

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(examples): add fleet_rpi_setup for onboarding a physical Pi 2cb884976a

Sibling of fleet_vm_setup with the libvirt provisioning step removed:
the operator has already booted Pi OS Lite themselves (rpi-imager,
preloaded SSH key, passwordless sudo on the admin user), so the
example goes straight to applying FleetDeviceSetupScore over SSH.

Defaults match the typical rpi-imager flow (--pi-user pi,
--ssh-key ~/.ssh/id_ed25519); --ssh-key supports tilde expansion.
The harmony dep is pulled in without the kvm feature since no VM is
created here. RUST_LOG defaults to info so the score's per-step
traces show up without the operator having to set the env var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(linux): drop redundant envelope from parseable ansible errors bf1a438b90

When stdout already parses into UNREACHABLE!/FAILED! + msg, the
trailing (ansible-exit=..., stderr=..., stdout=...) envelope just
duplicated the same text. Strip it when stderr is empty and the
verb is recognized; keep it when it adds debug signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(fleet): structured progress + recap for FleetDeviceSetupScore c3b25c8298

Replace the opaque change-log with tagged per-step info traces and
a human-readable Outcome.details recap (Device ID / NATS / Labels /
User / Agent binary -> remote / Service). User and Service lines
carry their own ✅/🔄 state markers; final line is ✅ for noop and
🎉 for runs that applied changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor(examples/fleet_rpi_setup): adopt harmony_cli::run convention 96f17f3ca0

Drop the bespoke framed renderer, failure hint catalog, and custom
env_logger setup. Score output now flows through harmony_cli's
standard reporter (bullet list under "🚀 All done!"), matching the
other examples. cli_logger::init() at the top of main so early
logs (ensure_ansible_venv) get the same formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(cli): surface Outcome.details on NOOP outcomes too 3cbe62c807

cli_reporter only accumulated details for SUCCESS, dropping the
recap on idempotent re-runs that legitimately return NOOP with
populated details. FleetDeviceSetupScore is the first score to
exercise this path; the filter was over-restrictive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor(fleet): inline agent binary local->remote on one recap line 12249a7f88

Folds the "-> /usr/local/bin/fleet-agent" continuation into the
"Agent binary:" line. Removes the hardcoded-indent fragility (bullet
prefix shifts in cli_reporter would have broken alignment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(linux): add FileFetcher capability for reading remote files 54d1f0733b

Mirrors FileDelivery in the opposite direction: returns Some(content)
or None if the file doesn't exist. AnsibleHostConfigurator implements
it via two SSH calls (sudo test -e + sudo cat), routed through sudo
to handle root- or service-owned config files. Added to the
LinuxHostConfiguration umbrella so any score with that bound gets it.

Enables scores to pre-flight-compare desired state against current
state before committing to a destructive change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(fleet): pre-flight config diff + confirm in FleetDeviceSetupScore 413077bcd0

New first step (1/7): read /etc/fleet-agent/config.toml off the
device and compare against the rendered desired config. Three
branches:

  - missing  → info, first install
  - matches  → warn, converge anyway
  - differs  → warn + unified diff (similar::TextDiff with 2-line
    context radius, '-/+' marker style) + inquire::Confirm prompt
    defaulting to N. Aborts with InterpretError if declined.

Existing 6 steps renumbered to 2/7-7/7. The diff replaces the
previous "dump both full configs" approach which was unreadable
even for one-line differences.
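
The three branches can be modelled as a small pure function. The sketch below is illustrative only: `Preflight`, `classify`, and `should_apply` are hypothetical names, and the diff rendering and the interactive prompt are left out (the `confirmed` flag stands in for the operator's answer).

```rust
/// Result of comparing the on-device config with the rendered one.
#[derive(Debug, PartialEq, Eq)]
pub enum Preflight {
    /// No config on the device yet: first install.
    Missing,
    /// Device config equals the desired config: converge anyway.
    Matches,
    /// Device config differs: show a diff and ask for confirmation.
    Differs,
}

/// `current` is `None` when the remote file does not exist.
pub fn classify(current: Option<&str>, desired: &str) -> Preflight {
    match current {
        None => Preflight::Missing,
        Some(c) if c == desired => Preflight::Matches,
        Some(_) => Preflight::Differs,
    }
}

/// Only the `Differs` branch can abort, and only when the operator declines.
pub fn should_apply(state: &Preflight, confirmed: bool) -> Result<(), String> {
    match state {
        Preflight::Missing | Preflight::Matches => Ok(()),
        Preflight::Differs if confirmed => Ok(()),
        Preflight::Differs => Err("operator declined config change".to_string()),
    }
}
```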

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(secret): add SudoPassword Secret type 414fe1cf7b

Sudo password for a Linux bootstrap admin user. Stored under key
"SudoPassword" via SecretManager when a host doesn't have
passwordless sudo configured. Same shape as the other single-field
Secret types in this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(linux): support optional sudo_password on SshCredentials c2b8403f14

Lets callers populate creds.sudo_password when the bootstrap admin
doesn't have passwordless sudo. None = current behavior unchanged.

Wire-level injection:
- ansible runs: when Some, write to a tempfile::NamedTempFile and
  pass ANSIBLE_BECOME_PASSWORD_FILE=<path> via Command::env. Path
  in env, never value in argv. File deletes on drop.
- direct ssh_exec sudo paths (ensure_linger, ensure_user_unit_active,
  fetch_file): new sudo_exec helper that uses `sudo -S` with the
  password piped via the new ssh_exec stdin parameter, otherwise
  plain sudo. ensure_user_unit_active's && chain folded into one
  sudo+sh -c call since `sudo -S` only reads stdin once.

ssh_executor.rs: ssh_exec gains an optional stdin: Option<&str>; on
Some, writes via channel.data() then channel.eof() so the remote
reader doesn't hang. Existing 4 call sites pass None.
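
A rough sketch of how such a helper might build its command line and stdin payload is shown below. The function names are hypothetical and the real `sudo_exec` signature is not shown in this PR; the `-p ''` flag (empty prompt) is an assumption added for illustration, while `-S` (read the password from stdin) is the flag the message names.

```rust
/// Quote a string for POSIX sh by wrapping it in single quotes.
pub fn sh_quote(s: &str) -> String {
    format!("'{}'", s.replace('\'', r"'\''"))
}

/// Returns the remote command line plus an optional stdin payload.
///
/// The whole script runs under a single `sudo ... sh -c`, so `sudo -S`
/// consumes the password from stdin exactly once even for `&&` chains.
pub fn sudo_exec_plan(script: &str, sudo_password: Option<&str>) -> (String, Option<String>) {
    let wrapped = format!("sh -c {}", sh_quote(script));
    match sudo_password {
        // Password goes to stdin only, never into the command line.
        Some(pw) => (format!("sudo -S -p '' {wrapped}"), Some(format!("{pw}\n"))),
        // Passwordless sudo: plain invocation, nothing written to stdin.
        None => (format!("sudo {wrapped}"), None),
    }
}
```

Keeping the password out of the returned command string matches the "path in env, never value in argv" rule applied to the ansible path.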

fleet_vm_setup updated to set sudo_password: None (behavior
identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(examples/fleet_rpi_setup): auto-fetch sudo password via SecretManager be9073e461

Probe `sudo -n true` over SSH before constructing the topology. If
the probe succeeds (passwordless sudo, the typical rpi-imager
default), proceed silently. If it fails, fetch the password through
SecretManager::get_or_prompt::<SudoPassword>() — first run prompts
the operator, subsequent runs reuse the cached value (same flow
SshKeyPair etc. use).

Adds harmony_secret dep, env.sh with the standard
HARMONY_SECRET_NAMESPACE / HARMONY_SECRET_STORE / HARMONY_DATABASE_URL
/ RUST_LOG variables, and a doc snippet at the top of main.rs
pointing at it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(linux): clarify sudo_password scope, TODO for SSH password auth 34e2f832ec

The new sudo_password field is strictly for privilege escalation on
the remote host (sudo -S, ansible become) — not for SSH login. SSH
auth is still key-only. Adds a TODO on SshCredentials pointing at
where SSH password support would land if/when we want it, and a
matching note on the SudoPassword Secret type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat: add little script to call the fleet_rpi_setup example b86f8f11f9

Run Check Script / check (pull_request) Failing after -44h57m30s

refactor(helm): upgrade-by-default for unpinned releases 36974bda32

Helm releases without a pinned `chart_version` previously short-circuited
to a NOOP when already installed, which silently dropped any
`values_yaml` / `values_overrides` changes the caller had made. Now we
fall through to `helm upgrade --install` whenever:

- the release isn't installed (unchanged), or
- it's installed and either unpinned or pinned-and-matching.

Helm itself becomes the source of truth for "did anything actually
change" — no-op upgrades are cheap and changed values get applied
automatically without the caller having to opt in via a flag.

`install_only=true` keeps the prior skip-if-installed shortcut so
bootstrap operators (cert-manager, prometheus-operator, CRDs) that
should not be touched on re-runs continue to behave the same.

Pinned-version safety net is unchanged: a different version installed
than what the score requests is an error, never a silent change.
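
The resulting decision table can be sketched as a pure function. Names here are hypothetical, and one detail is an assumption: the message does not say whether `install_only` or the pinned-version check is evaluated first, so this sketch checks `install_only` first.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HelmAction {
    /// Run `helm upgrade --install`.
    UpgradeInstall,
    /// Leave the installed release untouched (NOOP).
    Skip,
}

/// `installed_version` is `None` when the release is not installed;
/// `pinned_version` is `None` when the score has no `chart_version`.
pub fn decide(
    installed_version: Option<&str>,
    pinned_version: Option<&str>,
    install_only: bool,
) -> Result<HelmAction, String> {
    let installed = match installed_version {
        None => return Ok(HelmAction::UpgradeInstall),
        Some(v) => v,
    };
    // Assumption: the install_only shortcut is evaluated before the pin check.
    if install_only {
        return Ok(HelmAction::Skip);
    }
    match pinned_version {
        // Safety net: a version mismatch is an error, never a silent change.
        Some(pinned) if pinned != installed => Err(format!(
            "installed chart version {installed} differs from pinned {pinned}"
        )),
        // Unpinned, or pinned and matching: let Helm decide what changed.
        _ => Ok(HelmAction::UpgradeInstall),
    }
}
```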

feat(zitadel): ExternalPort + machine-user/role/key/grant provisioning b8bc2217fd

ZitadelScore:
- Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's
  FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on
  this secret existing; without the chart values, the prior code path
  was non-functional.
- New `external_port: Option<u32>` field. Controls Zitadel's emitted
  issuer URL when the host port mapping isn't 80/443 (k3d typically
  maps 8080:80). Without it, JWT-bearer audience validation 500s with
  `Errors.Internal` because the assertion's `aud` doesn't match the
  chart-default issuer at port 80.

ZitadelSetupScore is extended for the JWT-bearer flow needed by the
NATS auth callout:
- API apps (resource servers — required for project-id audience scope)
- Project roles (`POST .../projects/{id}/roles`, idempotent)
- Machine users with KEY_TYPE_JSON keys (provisioned + cached
  device-side; Zitadel does not expose the key material on subsequent
  reads, so the local cache is the source of truth)
- User grants (project + role keys)

Cache (ZitadelClientConfig) gains projects, machine_user_ids,
machine_keys, and user_grants — keyed for idempotency across re-runs.
Backwards compatible with existing harmony_sso example: the new fields
have `#[serde(default)]` and prior callers just need empty vecs.

The helm upgrade-by-default change (separate commit) lets
ExternalPort changes propagate to existing releases on re-run.

feat(nats-callout): production callout + harmony module + e2e demo 6c45fb22ba

harmony-nats-callout becomes a deployable service, not just a library:
- New [[bin]] target with env+secret-file driven config and
  SIGINT/SIGTERM-aware shutdown.
- Dockerfile (single-stage archlinux:base, non-root, matches
  harmony-fleet-operator convention).
- Refactored handler into a pure `decide()` function so the entire
  authorization decision tree is unit-testable without async-nats.
- New `roles` module with role resolution + a `validate_device_id`
  security gate that rejects NATS subject metacharacters in device_id
  (.>* whitespace) — closes a real escalation path through the
  `{device_id}` placeholder in the per-device permissions block.
- Configurable role claim path + admin/device role names; admin wins
  when both are present (privilege-escalation invariant).

57 unit tests cover every reachable branch of the security decision
tree; 4 e2e tests in nats/integration-test-callout exercise real NATS
in podman with: device pubsub on own subjects, cross-device subject
isolation, admin-can-read-anything, and JWT-without-role rejection.
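
The two security rules named above (the device_id gate and "admin wins") are small enough to illustrate directly. This is a sketch under assumptions, not the crate's code: the signatures are guesses, and the real `roles` module may reject more than the characters listed in the message.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    Admin,
    Device,
}

/// Reject NATS subject metacharacters before the id is substituted into
/// a `{device_id}` placeholder: token separator, wildcards, whitespace.
pub fn validate_device_id(id: &str) -> Result<(), String> {
    match id
        .chars()
        .find(|&c| matches!(c, '.' | '>' | '*') || c.is_whitespace())
    {
        Some(bad) => Err(format!("device_id contains forbidden character {bad:?}")),
        None => Ok(()),
    }
}

/// Resolve the effective role from the token's role claim.
/// Admin is checked first, so it wins when both roles are present.
pub fn resolve_role(roles: &[&str], admin_role: &str, device_role: &str) -> Option<Role> {
    if roles.iter().any(|r| *r == admin_role) {
        Some(Role::Admin)
    } else if roles.iter().any(|r| *r == device_role) {
        Some(Role::Device)
    } else {
        None
    }
}
```

Without the gate, an id such as `>` or `a.*` would widen the per-device subject pattern once interpolated, which is the escalation path the commit closes.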

harmony/src/modules/nats_auth_callout/:
- New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment +
  Secret. fsGroup + 0o440 secret mode so the non-root container can
  read its mounted seed/password without leaving them in env vars.
- `render_auth_callout_block` helper produces the YAML for NATS Helm
  `config.merge.authorization.auth_callout` so both halves stay in
  sync.

examples/fleet_auth_callout/:
- `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres ->
  CoreDNS rewrite -> project + roles + machine users with JWT keys
  -> NATS Helm with auth_callout block -> callout image build +
  sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs
  (issuer NKey persisted in a K8s secret so user JWTs survive
  restarts).
- `mint_access_token()` RFC 7523 JWT-bearer client. Uses Host header
  with port so Zitadel emits a matching issuer.
- main.rs prints URLs/creds/keyIds and waits for Ctrl-C.
- Three #[tokio::test] functions sharing one cluster via OnceCell:
  admin_can_read_any_device_subject, device_can_only_access_own_subjects,
  unknown_role_is_rejected. All green on real k3d.

feat(podman): env vars + bind-mount volumes + restart policy 6d55892736

The IoT walking-skeleton's PodmanV0Score and the underlying
ContainerSpec capability were name+image+ports only. Real customer
workloads (the demo target's docker-compose for example) need at
minimum:

- Environment variables for runtime config + secrets injected at
  deploy time.
- Bind-mount volumes so the container can persist data across
  recreates (sqlite db files, config dirs).
- Restart policy so the container survives device reboot or crash.

PodmanService and ContainerSpec gain `env: Vec<(String, String)>`,
`volumes: Vec<VolumeMount>`, and `restart_policy: RestartPolicy`. All
three default to empty / `unless-stopped` via #[serde(default)] so any
Deployment CR written before this change still deserializes — that
includes the existing smoke harnesses and any field-side state.

VolumeMount is bind-only in v0 (host_path -> container_path, optional
read_only). Named/anonymous volumes can be added behind the same field
later by inspecting host_path's shape; the customer's compose file is
expected to use bind mounts only.

RestartPolicy mirrors podman/docker convention — `no`,
`unless-stopped` (default, matching docker-compose), `on-failure`,
`always`. Serialized kebab-case so docker-compose translation is
mechanical.

PodmanTopology::ensure_service_running now passes env / mounts /
restart policy to the podman API. matches_spec conservatively forces
recreate whenever the spec carries non-empty env / volumes or a non-
default restart policy: the podman list endpoint doesn't surface those
fields, so a structural compare isn't possible from ListContainer
alone. Recreating an unchanged container is cheap (~hundreds of ms);
the alternative (silent stale-config window) isn't acceptable for
fleet-managed devices.
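
The conservative rule can be illustrated with simplified stand-in types. The structs below only echo the field names from this message (ports omitted, no serde), and `needs_forced_recreate` is a hypothetical helper name rather than the actual `matches_spec` implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
    No,
    UnlessStopped,
    OnFailure,
    Always,
}

impl Default for RestartPolicy {
    fn default() -> Self {
        RestartPolicy::UnlessStopped
    }
}

impl RestartPolicy {
    /// Kebab-case wire form, matching the podman/docker spelling.
    pub fn as_str(self) -> &'static str {
        match self {
            RestartPolicy::No => "no",
            RestartPolicy::UnlessStopped => "unless-stopped",
            RestartPolicy::OnFailure => "on-failure",
            RestartPolicy::Always => "always",
        }
    }
}

pub struct VolumeMount {
    pub host_path: String,
    pub container_path: String,
    pub read_only: bool,
}

pub struct ContainerSpec {
    pub name: String,
    pub image: String,
    pub env: Vec<(String, String)>,
    pub volumes: Vec<VolumeMount>,
    pub restart_policy: RestartPolicy,
}

/// True when the spec carries fields the list endpoint cannot report,
/// so a structural comparison is impossible and we recreate to be safe.
pub fn needs_forced_recreate(spec: &ContainerSpec) -> bool {
    !spec.env.is_empty()
        || !spec.volumes.is_empty()
        || spec.restart_policy != RestartPolicy::default()
}
```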

example_harmony_apply_deployment grows --env, --volume, and --restart
flags so an operator can drive the new shape from the CLI when
authoring a Deployment CR.

Tests:
- legacy CR JSON without the new fields deserializes (wire-compat).
- env ordering survives roundtrip (drift-detection invariant).
- restart policy serializes kebab-case (compose-translation contract).
- podman_v0_score_roundtrip exercises env + volumes + restart.

chore: remove accidentally-committed scratch + agent worktrees a0a5faa3d0

The previous commit swept in `.claude/worktrees/*` (ephemeral agent
worktree submodules) and a few scratch files that landed at the repo
root during prior sessions. None of them are project artifacts.
Removing them from the index and adding to .gitignore so future
`git add -A` doesn't re-include them.

Files on disk are unchanged.

feat(agent): Zitadel JWT credential source + auto-reconnect 74ee7fc9f2

The fleet agent's NATS connection is the load-bearing piece of the
"never lose connectivity to a device" guarantee. This commit makes
that hold even when Zitadel access tokens expire across NATS pod
restarts and network partitions.

New `[credentials]` config variants (internally tagged via `type`):

  type = "toml-shared"   { nats_user, nats_pass }   # v0/dev
  type = "zitadel-jwt"   { key_path, oidc_issuer_url, audience, ... }

A `CredentialSource` enum dispatches per variant:

- TomlShared returns the same user/pass each call.
- ZitadelJwt mints an access token from Zitadel via the JWT-bearer
  flow (RFC 7523). The keyfile at `key_path` is the only durable
  secret on the device; the bearer token is short-lived and refreshed
  in-memory when the cached value is within 5 minutes of expiry.
  Two concurrent refreshes are race-safe — the second writer's mint
  is wasted but produces a correct token.
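
The refresh rule for the cached bearer token can be sketched as below. `CachedToken` and `needs_refresh` are hypothetical names; whether the boundary at exactly five minutes counts as "within" is an assumption (this sketch treats it as needing a refresh).

```rust
use std::time::{Duration, SystemTime};

/// Refresh when the token is within this margin of its expiry.
pub const REFRESH_MARGIN: Duration = Duration::from_secs(5 * 60);

pub struct CachedToken {
    pub value: String,
    pub expires_at: SystemTime,
}

/// Decide whether a new token must be minted before the next (re)connect.
pub fn needs_refresh(cached: Option<&CachedToken>, now: SystemTime) -> bool {
    match cached {
        // Nothing cached yet: mint one.
        None => true,
        Some(token) => match token.expires_at.duration_since(now) {
            // Still valid: refresh only inside the safety margin.
            Ok(remaining) => remaining <= REFRESH_MARGIN,
            // `expires_at` is in the past: already expired.
            Err(_) => true,
        },
    }
}
```

Passing `now` in explicitly keeps the check pure, so the expiry edge cases can be unit-tested without sleeping.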

The agent's `connect_nats` is rewritten on top of async-nats's
`with_auth_callback`, which is invoked on every (re)connect attempt:

- async-nats reconnects automatically on disconnect (default
  behaviour of ConnectOptions) — we don't need a watchdog.
- Each reconnect attempt invokes the callback, which calls
  `next_credential()`. If the cached token is expired, a fresh one
  is minted before the reconnect proceeds. So a Pi that loses NATS
  while its token has just expired will pick up a brand-new token
  on the next reconnect attempt with no operator intervention.
- An `event_callback` surfaces Connected / Disconnected / SlowConsumer
  / ServerError events into tracing — operators can see exactly when
  reconnects happen, which is non-negotiable for an out-of-warranty
  device fleet.

A subtle constraint drove the trait shape: async-nats's
`with_auth_callback` requires the returned future to be `Send + Sync`,
which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does
not satisfy. The credential source is therefore an enum (concrete
dispatch) rather than `dyn CredentialSource`. Two variants is small
enough that enum dispatch beats trait-object plumbing.

Out of scope, tracked for follow-up: a separate daemon for SSH access
to the Pi via Tailscale/Headscale ("secure backdoor"), and the
device-join-request + admin-approve flow that would replace the
current admin-PAT bootstrap pattern.

merge: feat/prepare-rpi (Pi onboarding harness + linux-host capabilities) c785f13abd

fix(linux): SshCredentials default_ubuntu_aws missing sudo_password b4d3d7d02c

The merge of feat/prepare-rpi added a `sudo_password: Option<String>`
field to SshCredentials but the `default_ubuntu_aws` constructor on
the destination branch was authored before that field existed. Add
the missing field as `None` (matches the prepare-rpi semantics:
passwordless sudo expected unless explicitly configured).

feat(fleet): per-device Zitadel bootstrap in fleet_rpi_setup ab98cbabf9

The Pi onboarding flow can now mint a per-device Zitadel machine user
on the operator's machine and ship the resulting JWT key to the Pi —
the agent then authenticates to NATS via JWT-bearer instead of shared
nats_user/nats_pass.

`FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous
flat `nats_user` / `nats_pass` fields. Two variants:

- TomlShared { nats_user, nats_pass } — legacy / dev fallback.
- ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } —
  per-device JWT-bearer. The Score:
    * Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json
      (mode 0640, owner fleet-agent — matches the agent's secret-mount
      conventions).
    * Renders [credentials] type = "zitadel-jwt" pointing at that
      keyfile + the issuer + audience the agent's CredentialSource
      needs.
  A change to either the keyfile content or the TOML triggers an
  agent restart, same as binary / unit drift.

`fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path.
The bootstrap PAT is held in the CLI's memory only; it never lands
on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id,
--zitadel-device-role (default `device`), --danger-accept-invalid-certs.

`zitadel_bootstrap` is a slim ManagementAPI client that, idempotently
per device:
1. Find-or-create machine user `device-${device_id}`.
2. Find-or-skip a project role grant (defaults to `device`).
3. Always mint a fresh JSON key and return its content. (Zitadel
   doesn't expose the private half of an existing key, so reusing
   isn't possible — stale keys remain valid until expiry, which is
   fine because each setup run overwrites the on-device keyfile.)

Three new render_toml tests cover the zitadel-jwt path; eleven
existing agent tests still pass.

Out of scope, tracked: device-join-request + admin-approve flow that
would replace bootstrap-PAT entirely (closer to the OKD
node-approval pattern). Long-lived admin PAT is acceptable for the
demo per product call.

feat(example): fleet-staging-deploy — operator-side OKD bringup 8d8e700786

Adds `examples/fleet_staging_deploy/` — the operator-side, run-once-
per-customer harness that brings up the fleet platform's central
services on a real OKD/K8s cluster. Complements the existing
`fleet_auth_callout` (k3d local-dev harness, kept unchanged) and
`fleet_rpi_setup` (per-device onboarding).

`FleetDomainConfig` is the single source of truth for hostnames:

  base_domain = "customer1.nationtech.io"
  → zitadel.<base>     (Zitadel HTTPS via OKD HAProxy edge-TLS)
  → nats.<base>        (NATS WSS through the same ingress)

Nothing is hardcoded; the operator supplies one --base-domain flag
and the deploy is fully parameterized. Re-running is idempotent
(rides the helm-upgrade-by-default + ZitadelSetupScore search-then-
create + persisted issuer-NKey-secret idempotency layers).
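
As an illustration of the single-source-of-truth idea, a domain config of this kind might look like the sketch below. Only `FleetDomainConfig` and `base_domain` come from the message; the method names and the URL helpers are hypothetical.

```rust
/// Derives every public hostname from one operator-supplied base domain.
pub struct FleetDomainConfig {
    pub base_domain: String,
}

impl FleetDomainConfig {
    pub fn new(base_domain: &str) -> Self {
        Self {
            base_domain: base_domain.to_string(),
        }
    }

    /// zitadel.<base>
    pub fn zitadel_host(&self) -> String {
        format!("zitadel.{}", self.base_domain)
    }

    /// nats.<base>
    pub fn nats_host(&self) -> String {
        format!("nats.{}", self.base_domain)
    }

    /// Zitadel is served over HTTPS on the default port (443).
    pub fn zitadel_url(&self) -> String {
        format!("https://{}", self.zitadel_host())
    }

    /// NATS is reached over WSS through the same ingress.
    pub fn nats_wss_url(&self) -> String {
        format!("wss://{}", self.nats_host())
    }
}
```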

NATS values render under config.merge.{auth_callout, accounts,
system_account}, with WSS via `websocket: { enabled, port: 8443,
ingress: { className: openshift-default, ... } }` and the OKD-flavored
HAProxy edge-TLS annotations:

  route.openshift.io/termination: edge
  haproxy.router.openshift.io/timeout: "1h"

(Switch to `reencrypt` when the customer wants pod-to-edge TLS;
gateway-api migration is on their roadmap, separate from the demo.)

bring_up_staging():
- Deploys ZitadelScore (external_secure: true, no external_port → 443).
- Waits for HTTPS .well-known.
- Provisions the project + API app + roles via ZitadelSetupScore
  hitting Zitadel through the public ingress (port 443, TLS verified).
  No machine users provisioned — fleet_rpi_setup mints them on demand
  per device, so the staging deploy stays device-count-agnostic.
- Persists / reads the issuer NKey seed in the
  `callout-issuer-seed` K8s secret (so re-runs don't invalidate
  user JWTs already in flight on customer Pis).
- Deploys NATS via NatsHelmChartScore with the WSS values.
- Deploys NatsAuthCalloutScore (oidc_audience = project_id;
  external_secure path means no danger_accept_invalid_certs).

main.rs ends by printing the exact `cargo run -p
example-fleet-rpi-setup ...` invocation the operator runs against a
Pi, with the project_id and zitadel/nats URLs filled in.

Three unit tests cover the domain config + NATS values rendering
(WSS + edge-TLS annotations + auth_callout under merge).

feat(example): fleet-sso-login — Zitadel device-code CLI login 5396ef8bf2

Adds `examples/fleet_sso_login/` — the developer-side CLI that proves
the SSO works end-to-end against a deployed staging instance. RFC 8628
device-code flow:

- POSTs `/oauth/v2/device_authorization` with the harmony-cli client_id.
- Prints `verification_uri_complete` so the developer opens one URL in
  the browser; Zitadel handles the auth (username/password, MFA,
  whatever the customer has wired into Zitadel's auth chain).
- Polls `/oauth/v2/token` honouring the standard `authorization_pending`
  / `slow_down` polling protocol.
- On success: decodes the access token's claims, prints
  `Welcome <name> <email>`, persists the session (issuer + client_id +
  access_token + claims) at $DATA_DIR/harmony/sso-session.json with
  mode 0600.
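
The polling protocol from RFC 8628 reduces to a small state step, sketched here without any HTTP. `PollStep` and `next_poll_step` are hypothetical names; the input is the `error` string from a non-success token response.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PollStep {
    /// Sleep for this many seconds, then poll the token endpoint again.
    Wait { interval_secs: u64 },
    /// Terminal error (for example `access_denied` or `expired_token`).
    Fail(String),
}

/// Map the `error` field of a token response to the next polling action.
pub fn next_poll_step(error: &str, interval_secs: u64) -> PollStep {
    match error {
        // User has not finished the browser step yet: keep the interval.
        "authorization_pending" => PollStep::Wait { interval_secs },
        // RFC 8628 section 3.5: increase the interval by 5 seconds.
        "slow_down" => PollStep::Wait {
            interval_secs: interval_secs + 5,
        },
        other => PollStep::Fail(other.to_string()),
    }
}
```

The increased interval from `slow_down` should be carried into all later polls, so the caller would feed the returned value back in as the next `interval_secs`.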

For the demo this proves the SSO chain end-to-end. The actual
`harmony fleet apply` operation (which would consume the persisted
token through a fleet-platform API gateway) is post-demo — clusters
typically don't accept Zitadel JWTs as kube-apiserver bearer tokens
without an OIDC integration the customer would have to opt into.

`fleet_staging_deploy` now also provisions a `harmony-cli` Device
Code OIDC application alongside the existing API app, captures its
client_id from the ZitadelClientConfig cache, and prints both the
client_id and the exact `cargo run -p example-fleet-sso-login ...`
invocation in the operator's "next steps" panel.

docs(fleet): demo runbook (operator + developer flow, single page) 4053ac52de

Hands-on walkthrough for the 48-hour customer demo:

- Operator: build/push the callout image → fleet-staging-deploy →
  capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster,
  cross-compile the agent for aarch64, run fleet-rpi-setup once
  per device with --bootstrap-token. Each Pi's agent connects to
  NATS over WSS using the JWT-bearer token minted from its
  per-device keyfile.
- Deploy a container to a labeled subset via
  example_harmony_apply_deployment with --env / --volume / --restart
  flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth
  callout's logs.

Also captures what's deliberately NOT in the demo (compose
auto-translation, UI, Tailscale backdoor, device-join-request
flow, OpenBao, K8s OIDC) so the customer call has clean expectation-
setting.

The runbook is the closing piece of the 48h-demo work plan;
sequenced after the eight feat / refactor commits that built the
underlying functionality.

fix(fleet_vm_setup): adopt FleetDeviceAuth::TomlShared shape e3e6d33dc8

The VM smoke harness still uses shared NATS creds for v0 (no Zitadel
JWT path through libvirt — the customer-facing Pi flow has it via
fleet_rpi_setup --bootstrap-token). Rewriting the FleetDeviceSetupConfig
literal against the new `auth: FleetDeviceAuth` field.

docs(fleet): chapter 6 — VM-based customer demo rehearsal plan fdcc7040dd

Adds ROADMAP/fleet_platform/v0_demo_e2e.md and threads it from
v0_1_plan.md. The VM rehearsal extends smoke-a4 (already-green k3d
+ libvirt VM + agent + apply CR + reconcile loop) with Zitadel +
auth callout + agent JWT auth. Two devices + one admin, real
cargo tests sharing a OnceCell-bringup.

Plan calls out:
- The 7 tests, including the load-bearing
  `agent_recovers_from_nats_pod_restart` (asserts the auto-reconnect
  + auth-callback re-mint path under realistic disturbance).
- Five known risks / debugging traps to expect on first cold-start
  (iam-admin-pat secret timing, /etc/hosts injection, k3d port
  collisions, etc.).
- Success criteria for the rehearsal day: cold cargo run greens in
  <20 min, all 7 tests green on a clean machine, the NATS-restart
  test reliably greens 5 runs in a row.
- Anything below the success criteria → reframe the customer call
  to "architecture walkthrough + local k3d demo + pilot in 1-2
  weeks." Avoids burning the relationship to keep a deadline.

Once VM rehearsal is green the residual OKD deltas are configuration
(Route annotations, image registry, real DNS, cert) — no new code.

feat(e2e-demo): VM-based rehearsal harness + /etc/hosts injection 1d453dd9aa

Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's
existing pieces (Zitadel + auth callout deploy) with per-device
machine-user provisioning (one ZitadelSetupScore call per VM) and
FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness
expects pre-provisioned libvirt VMs (one per device) reachable via
`FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via
ProvisionVmScore is a follow-up, which keeps the harness observable in
pieces during tomorrow's cold-start debugging.

Constituent helpers in `fleet_auth_callout::lib.rs` flipped from
private to `pub` (deploy_zitadel, wait_for_zitadel_ready,
ensure_issuer_seed, build_and_load_callout_image, etc.) so the new
harness composes them rather than re-implementing.

`bring_up_full_stack`:
1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d).
2. Deploy Zitadel + Postgres.
3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the
   chart-provisioned `iam-admin-pat` secret. (Last step is new and
   load-bearing — without it ZitadelSetupScore races the chart's
   setup job and fails on first cold-run.)
4. ZitadelSetupScore for project + API app + roles + admin
   machine-user (admin gets fleet-admin role grant).
5. Issuer NKey from a persisted secret + NATS deploy with
   auth_callout block + callout pod.
6. For each device i: per-device ZitadelSetupScore (machine-user
   with `device` role grant), pull the JSON keyfile from cache,
   render the agent's TOML with the keyfile path. (FleetDeviceSetupScore
   invocation is wired structurally; the SSH-and-apply step is
   gated behind the VM provisioning follow-up.)

`HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so
VMs on a libvirt NAT can resolve `sso.fleet.local` to the host
gateway. Managed-block markers in /etc/hosts make the merge
idempotent across re-runs and removable when entries are dropped
from the score. Four new unit tests cover the merge invariants
(insert, replace, strip, byte-stable).
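
A managed-block merge of this kind can be sketched as follows. The marker strings, the struct fields, and the signature are all assumptions made for illustration; the real `merge_hosts_file` may differ, for example in where it places the block.

```rust
/// Hypothetical marker lines delimiting the managed block.
pub const BEGIN: &str = "# BEGIN harmony-managed hosts";
pub const END: &str = "# END harmony-managed hosts";

pub struct HostsEntry {
    pub ip: String,
    pub hostname: String,
}

/// Drop any previous managed block, then append a fresh one.
/// With no entries, the block is stripped entirely.
pub fn merge_hosts_file(existing: &str, entries: &[HostsEntry]) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            l if l == BEGIN => in_block = true,
            l if l == END => in_block = false,
            // Lines inside the old managed block are discarded.
            _ if in_block => {}
            // Everything else is preserved verbatim.
            _ => out.push(line.to_string()),
        }
    }
    if !entries.is_empty() {
        out.push(BEGIN.to_string());
        for entry in entries {
            out.push(format!("{} {}", entry.ip, entry.hostname));
        }
        out.push(END.to_string());
    }
    let mut text = out.join("\n");
    if !text.is_empty() {
        text.push('\n');
    }
    text
}
```

Because the block is always rebuilt from the entries alone, merging the output again with the same entries yields identical bytes, which is what lets a re-run report NOOP.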

Test skeleton in `tests/e2e_walking_skeleton.rs`:
- `both_devices_heartbeat_within_60s` — implemented; reads from
  device-info KV via admin token.
- `admin_jwt_reads_any_device_subject` — implemented; subscribes
  to `device-state.>` as admin.
- `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending
  per-device-key plumbing through E2eHandles.
- `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending
  the NATS-pod-restart driver.

The two `#[ignore]`d tests cover the load-bearing reconnect and
isolation invariants. Wiring them is the morning-of-rehearsal
priority since those are the customer-facing claims.

Out of scope of this commit (called out in the roadmap doc):
- ProvisionVmScore integration (today operator runs fleet_vm_setup
  out-of-band).
- Operator install via Helm (smoke-a4 runs operator host-side; this
  harness inherits that pattern).
- Full SSH-based agent install via FleetDeviceSetupScore — Score
  built, invocation gated.

feat(e2e-demo): apply FleetDeviceSetupScore over SSH per VM 49f9834eb2

Wires the previously-built FleetDeviceSetupScore through to a
LinuxHostTopology against each pre-provisioned VM. Mirrors the
fleet_rpi_setup pattern but synthesizes inline so the harness drives
N VMs in sequence without re-deriving the CLI plumbing.

Each VM gets:
- An /etc/hosts entry mapping `sso.fleet.local` → libvirt host IP
  via the new HostsEntry support, so the in-VM agent's HTTP client
  to Zitadel can resolve the issuer.
- The per-device Zitadel machine key dropped at
  /etc/fleet-agent/zitadel-key.json.
- Agent TOML with `type = "zitadel-jwt"` pointing at the keyfile.
- Agent service started under systemd.

The SSH user is assumed to be `fleet-admin` (matches what fleet_vm_setup
+ smoke-a4 cloud-init create). The private key comes from the harmony
fleet keypair (ensure_fleet_ssh_keypair).

After this commit, `cargo run -p example-fleet-e2e-demo` is the
single command that turns a fresh k3d + 2 booted VMs into a
fully-converged stack: Zitadel + NATS callout + 2 agents speaking
JWT-bearer to NATS. Tomorrow morning: prove it actually does
that on a clean machine.

fix(fleet-agent): request projects:roles scope so role claim is emitted a4b9e7ac9f

Zitadel only includes the project-roles block in an access token when
the JWT-bearer request asks for it via the
`urn:zitadel:iam:org:projects:roles` scope (PLURAL "projects"). Without
it the agent's token has a valid signature/audience but no roles, so
the NATS auth callout rejects with "no authorized role in token" even
though the machine user has a "device" grant.

Discovered while running the VM-based e2e rehearsal: agents could mint
a token, connect to NATS, then immediately fail authorization. The
plural-projects vs. singular-project distinction is a Zitadel
convention; both scopes are required, and the comment now spells out
what each one does.

fix(e2e-demo): point agent_binary default at the real cargo target name 6607fe7494

The cargo bin target is `harmony-fleet-agent`, not `fleet-agent` —
the latter never existed under target/release. Smoke-a4 happened to
work because callers passed --agent-binary explicitly; the harness
defaults didn't.

chore: cargo fmt sweep across modified files 7dd5f1504f

No behavior changes; only re-flowing existing expressions.

chore: cargo fmt setup_score.rs 050d4697d2

fix(callout): align device permissions with KV key formats and machine-user prefix

Run Check Script / check (pull_request) Failing after -44h57m23s

d4fd4859ec

Two bugs surfaced when the agent went live against NATS JetStream KV
in the VM-based e2e rehearsal:

1. The default `device` role only allowed flat `device-state.<id>` /
   `device-commands.<id>` subjects. The agent's actual data plane is
   JetStream KV, which puts every operation on `$KV.<bucket>.<key>`
   subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`.
   With the old role config, the very first KV publish died with
   `Permissions Violation for Publish to "$JS.API.INFO"`.

   The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four
   per-device data subjects derived from
   harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>,
   heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct
   `device-state.<id>` / `device-commands.<id>` subjects are kept so
   non-JetStream callers of NatsAuthCalloutScore still work.

   A new unit test (`device_role_covers_reconciler_contract_kv_subjects`)
   imports the contract crate as a dev-dep and asserts each contract-
   produced subject is matched, plus that cross-device subjects are
   *not* matched. This locks the role config to the contract surface so
   future renames break the test before they break prod.

2. Zitadel's `client_id` claim for a machine user equals the userName
   verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the
   user as `device-{device_id}`, so the JWT carries
   `device-vm-device-00` while the agent's KV keys use the bare
   `vm-device-00`. The callout was interpolating the prefixed string
   into permissions, producing rules that never matched what the
   agent actually publishes.

   Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`,
   defaults empty so existing deployments are unaffected). When set,
   the validator strips the prefix from the extracted claim before
   permission interpolation. The fleet_auth_callout example wires it
   to `device-` so the e2e harness stays end-to-end correct without
   reaching into either naming convention.
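
A rough sketch of the two fixes together, under stated assumptions:
the function names, the `{device_id}` placeholder and the
`device-info` bucket in the usage note are illustrative, not taken
from the callout crate. The matcher follows NATS wildcard semantics
(`*` for one token, `>` for one or more trailing tokens) and assumes
`>` only appears as the last token.

```rust
/// Strip a configured prefix from the claim; an empty prefix is a no-op.
pub fn strip_device_prefix<'a>(claim: &'a str, prefix: &str) -> &'a str {
    claim.strip_prefix(prefix).unwrap_or(claim)
}

/// Interpolate the device id into permission templates.
pub fn device_permissions(templates: &[&str], device_id: &str) -> Vec<String> {
    templates
        .iter()
        .map(|t| t.replace("{device_id}", device_id))
        .collect()
}

/// NATS-style subject match over dot-separated tokens.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut sub = subject.split('.');
    loop {
        match (pat.next(), sub.next()) {
            // `>` needs at least one remaining subject token.
            (Some(">"), Some(_)) => return true,
            // `*` consumes exactly one token.
            (Some("*"), Some(_)) => {}
            (Some(p), Some(s)) if p == s => {}
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

With the prefix set to `device-`, the claim `device-vm-device-00`
becomes `vm-device-00`, so a template such as
`$KV.device-info.info.{device_id}` matches what the agent publishes
and rejects the same key for `vm-device-01`.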

Verified end-to-end: both VM agents now publish DeviceInfo /
heartbeat through JetStream KV with no permission errors and zero
service restarts since the rollout.

johnride added 1 commit 2026-05-04 13:03:40 +00:00

chore: formatting

Run Check Script / check (pull_request) Failing after -44h56m9s

54308fd7a4

johnride added 1 commit 2026-05-04 19:31:17 +00:00

feat(fleet-agent): emit state pulse on direct device-state.<id> subject

Run Check Script / check (pull_request) Failing after -44h56m12s

c6284c09bc

The agent's data plane was JetStream-KV-only, so live observers
that don't want to consume the JS stream had no signal to subscribe
to. The walking-skeleton e2e admin test was failing as a result:
the admin subscribed to `device-state.>` (the per-device direct
subject) and saw nothing within 30s.

This commit adds a small core-NATS publish on `device-state.<id>`
alongside the existing KV writes:

- `FleetPublisher::publish_state_pulse()` emits a tiny
  `{device_id, kind: "heartbeat", at}` payload on
  `device-state.<device_id>`, called from the heartbeat loop so
  observers see traffic on the same 30s cadence as the KV
  heartbeat write — but on a non-JetStream subject anyone can sub
  to.
- `write_deployment_state()` now fans out the same payload it puts
  in the KV bucket on the direct subject, so live admin tooling
  picks up reconcile transitions immediately without watching the
  KV stream.

Also threads `device_id_prefix_strip = "device-"` through the
fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore
construction (parallel to fleet_auth_callout's `bring_up_stack`),
and was missing the prefix-strip line, so the deployed callout was
interpolating permissions against `device-vm-device-00` instead of
the bare device id the agent uses.

Locks the regression with a unit test
(`device_id_prefix_strip_lands_as_env_value`) on the deployment
manifest builder.

Verified end-to-end in the VM rehearsal:
  test both_devices_heartbeat_within_60s ... ok
  test admin_jwt_reads_any_device_subject ... ok

johnride added 28 commits 2026-05-04 19:39:26 +00:00

feat: Disable ipv4 address conflict detection score. This is useful when setting up bonds as the wrong mac may get a dhcp offer and then the system will perceive it as a conflict when it sets up the bond correctly

Run Check Script / check (pull_request) Successful in 2m37s

6267c2757f

feat: Refactor dad score into reusable node file score using machine config

Run Check Script / check (pull_request) Successful in 2m17s

54ef3f70bd

fix: fix ceph dashboard for root volumes not populated 7265d8a4f3

feat: split storage dashboard in two : ceph + persistent storage 126390bb63

Merge pull request 'feat: Disable ipv4 address conflict detection score. This is useful when setting up bonds as the wrong mac may get a dhcp offer and then the system will perceive it as a conflict when it sets up the bond correctly' (#263 ) from feat/disableDadScore into master

Run Check Script / check (push) Successful in 2m11s

Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m12s

ef418f2f96

Reviewed-on: #263

feat: score to create ceph alerts in the okd default alerting stack

Run Check Script / check (pull_request) Successful in 2m10s

8acd9de275

Merge pull request 'feat(monitoring): Ceph alerts integrated with OKD's native alerting stack' (#265 ) from feat/ceph-alerts into master

Run Check Script / check (push) Successful in 2m14s

Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m11s

bae162a3e4

Reviewed-on: #265

feat: add the datadog-15-k8s-metrics dashboard 391c44b369

feat: improve ceph dashboard - list alerts and WHY its NOT green

Run Check Script / check (pull_request) Successful in 2m6s

c2718e843b

feat: improve ceph dashboard

Run Check Script / check (pull_request) Successful in 2m17s

349c2a1358

Merge pull request 'feat(monitoring): Datadog 15-key-metrics dashboard + Ceph "what's wrong" drilldown' (#266 ) from feat/datadog-k8s-metrics into master

Run Check Script / check (push) Successful in 2m5s

Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m4s

52db82865d

Reviewed-on: #266

feat(discovery): capture bond, blacklist and bond-mode intent per host bf4f300383

Extend DiscoverHostForRoleScore with three new interactive prompts after
  the installation-disk selection:

  - "Configure a network bond?" (only when host has >= 2 NICs), followed by
    a multi-select of bond members (min 2) and a bond-mode picker
    (LACP / active-backup / balance-rr / balance-xor / broadcast /
    balance-tlb / balance-alb).
  - "Blacklist any remaining interface?", with candidates limited to NICs
    not already claimed by the bond.

  The answers are persisted as a JSON-encoded NetworkConfig on a new
  host_role_mapping.network_config column. HostConfig now exposes
  network_config alongside installation_device so downstream scores can
  honor the user's intent.

  Also adds a new harmony_host_discovery example that discovers a single
  host on 192.168.40.0/24, port 25000.

feat(discovery): tighten host summary and readability of prompts bdba4dda27

- PhysicalHost::summary() becomes terser and more informative:
    - Storage: "400 GB [8 GB, 477 GB]" (was "400 GB Storage (2 Disks [8 GB, 477 GB])").
      Single-disk collapses to just the total.
    - Network: list every NIC as "[ip, mac]" with a count prefix
      (e.g. "3 NICs: [192.168.40.10, 98:fa:9b:03:17:6f], [00:e0:ed:7a:ec:4d], ...").
      Single-NIC form drops the count and "s": "NIC: [ip, mac]".
      NICs without an IPv4 render as "[mac]".

  - Promote the inventory agent's Chipset { vendor, name } into a
    "system-product-name" label during host conversion (both MDNS and CIDR
    flows), so summary()'s first field shows "LENOVO 3136" instead of
    falling back to the HostCategory string ("Server"). Extracted into
    build_discovered_host_labels() to keep the two conversion sites in
    sync. When the chipset is blank, the old category fallback still
    applies.

  - Print a blank line before every interactive inquire prompt in the
    discovery flow (role pick, disk pick, bond confirm/multi-select/mode,
    blacklist confirm/multi-select) so prompts stand out from the
    preceding log output on the terminal.
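
The NIC formatting rules above can be sketched as follows. This is an
illustration only: `network_summary` and the tuple type are
hypothetical stand-ins for whatever `PhysicalHost::summary()` uses
internally, and the empty-list behaviour is an assumption.

```rust
/// One NIC as (optional IPv4 address, MAC address).
pub type Nic<'a> = (Option<&'a str>, &'a str);

/// Renders the network part of a host summary.
pub fn network_summary(nics: &[Nic<'_>]) -> String {
    let rendered: Vec<String> = nics
        .iter()
        .map(|(ip, mac)| match ip {
            Some(ip) => format!("[{ip}, {mac}]"),
            // NICs without an IPv4 render as "[mac]".
            None => format!("[{mac}]"),
        })
        .collect();
    match rendered.len() {
        // Assumption: a host without NICs contributes nothing.
        0 => String::new(),
        // Single-NIC form drops the count and the plural "s".
        1 => format!("NIC: {}", rendered[0]),
        n => format!("{n} NICs: {}", rendered.join(", ")),
    }
}
```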

feat(discovery): dedup identical host saves and harmonize prompt headers 18fc87a597

- SqliteInventoryRepository::save() now compares the incoming
    serde_json bytes against the latest stored `data` blob for this
    host_id. If byte-identical, the insert is skipped with an info log
    "Host '<id>' unchanged, skipping save". Genuine changes still
    produce a new version row, preserving the audit trail. Eliminates
    the unbounded row growth from repeated discovery (mDNS is
    continuous, CIDR scans often re-run). Addresses the long-standing
    FIXME in modules/inventory; the comment is now removed.

  - Reworded the caller-side log that fires after repo.save() from
    "Saved [new] host id X, summary: ..." to "Discovered host X,
    summary: ...". The old text claimed "Saved" even when the repo had
    actually skipped the insert, producing contradictory log lines on
    re-runs.

  - Harmonized every host-specific inquire prompt in the discovery
    flow behind a new print_host_header() helper: each prompt is now
    preceded by a blank line and a "Host: <summary>" banner, and the
    redundant host name inside the question text is stripped (disk
    prompt, bond confirm). The node-selection prompt is unchanged --
    it picks *which* host, so there is no current host yet.
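
The dedup rule in save() can be illustrated with an in-memory
stand-in. This is a sketch, not the SQLite implementation:
`VersionedStore` is a hypothetical type, and the real repository
compares against the latest stored `data` blob with a query.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a versioned host table.
#[derive(Default)]
pub struct VersionedStore {
    versions: HashMap<String, Vec<Vec<u8>>>,
}

impl VersionedStore {
    /// Appends a new version unless `data` is byte-identical to the
    /// latest one. Returns true when a row was written.
    pub fn save(&mut self, host_id: &str, data: &[u8]) -> bool {
        let rows = self.versions.entry(host_id.to_string()).or_default();
        if rows.last().map(|v| v.as_slice()) == Some(data) {
            return false; // unchanged, skip the insert
        }
        rows.push(data.to_vec());
        true
    }

    /// Number of stored versions for a host.
    pub fn version_count(&self, host_id: &str) -> usize {
        self.versions.get(host_id).map_or(0, |rows| rows.len())
    }
}
```

Because the comparison is on raw bytes, it only works if the
serialized form is stable, which is why a deterministic field and NIC
order matters for this kind of check.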

feat(discovery): replace role mappings, sort NICs, polish host header 0556b2ea0d

- host_role_mapping now holds at most one row per host_id.
    SqliteInventoryRepository::save_role_mapping wraps a DELETE of any
    prior rows for the host and the INSERT of the new one in a single
    transaction, self-healing pre-existing duplicate rows along the way.

  - Before re-prompting for disk and networking, the discovery flow
    looks up the current role mapping via the new
    InventoryRepository::get_role_mapping(host_id) method. If one
    exists, the operator sees a summary (role, install disk, bond
    mode + interfaces, blacklist) and picks between "Update" and
    "Cancel"; cancelling skips the host entirely and continues the
    selection loop without touching the DB. New HostRoleMapping
    domain type carries the returned row back to the caller.

  - Network interfaces are sorted by name at the hwinfo-to-domain
    conversion step (both MDNS and CIDR flows), so f0 always appears
    before f1 in every downstream consumer — host summary, bond
    multi-select, blacklist multi-select. This also makes the
    byte-equality dedup in save() robust against the agent returning
    NICs in different sysfs-walk order across reboots.

  - PhysicalHost::summary() split into summary_parts_through_storage()
    + append_network_summary(), with a new public summary_short()
    variant that omits the NIC list. print_host_header() in the
    discovery prompts now uses summary_short() so the "Host: ..."
    banner fits on one line; full summaries still render in the node
    picker, logs, and Display impl.

  - Fix CPU summary rendering when the agent reports an empty model:
    single-CPU renders as "6c/6t", multi-CPU as "2x CPU (12c/24t)",
    no stray double-space in the pipe-separated summary.

  - Regenerate .sqlx offline cache for the new DELETE and SELECT
    queries.

chore(discovery): drop sqlite WAL sidecars, add blank line after prompts

Run Check Script / check (pull_request) Successful in 2m4s

adb05a0b91

- Switch SqliteInventoryRepository to DELETE journal mode with
    create_if_missing, so `.sqlite-wal` / `.sqlite-shm` files no longer
    appear next to the DB. Existing WAL-mode DBs are checkpointed and
    converted on next open.

  - Print a blank line after prompt_network_config returns so the save
    logs don't stomp on the last answered question.

refactor(discovery): use shared LaggProtocol for bond mode

Run Check Script / check (pull_request) Successful in 2m6s

84a083a012

Replace the Linux-specific BondMode enum with harmony_types'
  LaggProtocol, which is already used by the OPNsense LAGG score.
  "Capabilities are industry concepts, not tools" — the kernel mode
  numbers (BalanceRr/ActiveBackup/…) were the wrong abstraction;
  LaggProtocol's Lacp / Failover / LoadBalance / RoundRobin span
  Linux bonding and BSD lagg uniformly. LaggProtocol now derives
  Deserialize so NetworkConfig can round-trip through SQLite.

  Make SqliteInventoryRepository::get_role_mapping tolerate a
  network_config blob it cannot deserialize: log a warning and
  fall back to NetworkConfig::default() so the operator still sees
  the existing mapping prompt and can pick "Update" to overwrite
  the bad row. This self-heals DBs that were written with the old
  BondMode variant names and gives the repo real resilience for
  future NetworkConfig evolutions.

Merge pull request 'feat: capture network intent at host discovery' (#267 ) from feat/discover-networking into master

Run Check Script / check (push) Successful in 2m8s

Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m12s

503f9eb357

Reviewed-on: #267

fix(opnsense): distinguish unreachable API from missing HAProxy plugin 83d9af211a

`LoadBalancerConfig::is_installed` previously collapsed every error from
the settings endpoint into `false`, so a timeout, DNS failure, or auth
rejection all looked identical to "os-haproxy not installed" — the
`LoadBalancer` score would then attempt to install the plugin on top of
an unreachable firewall and fail in cascade further down the pipeline.

Return `Result<bool, Error>` and treat only HTTP 404 (controller not
found) as "not installed". Every other error is propagated so
`ensure_initialized` fails the score immediately with a message pointing
at the real problem.
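
A sketch of the error split described above, assuming a simplified
hypothetical error type (`ApiError`) in place of the real client
error. Only a 404 maps to "not installed"; everything else
propagates.

```rust
/// Simplified, hypothetical error type for the settings endpoint.
#[derive(Debug, PartialEq)]
pub enum ApiError {
    /// The server answered with a non-success HTTP status.
    Status(u16),
    /// Timeout, DNS failure, connection refused, and similar.
    Transport(String),
}

/// Interprets the outcome of a call to the settings endpoint.
pub fn is_installed(settings_response: Result<(), ApiError>) -> Result<bool, ApiError> {
    match settings_response {
        Ok(()) => Ok(true),
        // Controller not found: the plugin is genuinely absent.
        Err(ApiError::Status(404)) => Ok(false),
        // Auth rejection, timeout, DNS failure: surface the real problem.
        Err(other) => Err(other),
    }
}
```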

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(opnsense): set HAProxy healthcheck/server fields explicitly 21035d2c56

`configure_service` was relying on `..Default::default()` for most fields
of the generated HAProxy structs. That leaked OPNsense's *model defaults*
into the wire payload for fields Harmony never meant to default:

- `http_host` → `localhost` (sent `Host: localhost` on every check)
- `http_method` → `options` (sent OPTIONS instead of the declared method)
- `http_version` → `http10` (wanted NONE)
- `sslVerify` on real servers → `1` (broke self-signed backends)
- Healthcheck `ssl` was never propagated, so SSL-required checks like
  kube-apiserver `/readyz` on 6443 stayed plain HTTP and never succeeded

Set every field explicitly from `LbHealthCheck`/`LbServer`: map
`http_method` through `HealthcheckHttpMethod`, pass `None` for
`http_version` (serializes as `""` = NONE), clear `http_host` to an empty
string, propagate `hc.ssl` through `HealthcheckSsl`, and pin
`ssl`/`sslVerify` to `false` on the server struct so intent is declared
at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(okd): bind load balancer services on firewall IP, not 0.0.0.0

Run Check Script / check (pull_request) Failing after 56s

5e72777c15

Binding HAProxy on 0.0.0.0 collided with OPNsense's own listeners
(HTTP→HTTPS redirect on :80, WebUI, etc.), preventing the HAProxy
service from starting once the LoadBalancer score was applied.

Use `topology.load_balancer.get_ip()` to bind each frontend on the
firewall's LAN interface IP instead. The `LoadBalancer` capability was
already in scope, so no new trait imports are needed.

The previous `0.0.0.0` rationale (avoiding CARP VIP rebind races) is
noted in a comment: HA CARP setups still need OPNsense's
`net.inet.ip.nonlocal_bind` or HAProxy `transparent` bind — not
addressed here.

Test module: added an inline `DummyLoadBalancer` stub (mirrors the
existing `DummyRouter` pattern) so `OKDLoadBalancerScore::new` no longer
hits `DummyInfra::get_ip`'s `unimplemented!()` panic. Renamed
`test_all_services_bind_on_unspecified_address` →
`test_all_services_bind_on_firewall_ip`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix: formatting

Run Check Script / check (pull_request) Successful in 2m19s

5a17bc229e

revert(okd): bind load balancer on 0.0.0.0 again

Run Check Script / check (pull_request) Successful in 2m17s

a196268c1e

Reverting 5e72777. The HAProxy startup failure that motivated the
bind-to-FW-IP change was environment-specific on the sttest basement
firewall: OPNsense's "HTTP → HTTPS redirect" service (lighttpd bound to
`[::]:80`, dual-stack) was holding IPv4 port 80 via v4-mapped addresses
— invisible in `sockstat -l4` but still enough to make `0.0.0.0:80`
return EADDRINUSE to HAProxy.

Disabling the HTTP redirect on that firewall resolves the conflict.
Other OPNsense deployments already ship with the redirect off (or
HAProxy on non-conflicting ports), so `0.0.0.0` remains the correct
default.

This reverts commit 5e72777.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor(opnsense): use From<&str> for wire-value conversions

Run Check Script / check (pull_request) Failing after 54s

fc16e9fac9

Addresses review feedback on the previous HAProxy field-default fixes:
the eight match blocks in `configure_service` that mapped loose strings
("get", "tcp", "roundrobin", ...) to generated OPNsense enum variants
were poor Rust — they duplicated the wire-value knowledge that the
codegen already has, and any new enum variant in OPNsense meant editing
every call site by hand.

- `opnsense-codegen/src/codegen.rs::generate_enum` now emits
  `impl From<&str>` and `impl From<String>` for every generated enum,
  right after the existing serde module. Lowercase-matches wire values;
  unknown inputs fall through to the `Other(String)` variant the codegen
  already emits for forward-compat round-tripping.
- `opnsense-api/src/generated/haproxy.rs` regenerated — 153 enums, 306
  new impl blocks. No hand edits; re-run via
  `cargo run -p opnsense-codegen -- generate --xml
  opnsense-codegen/vendor/plugins/net/haproxy/src/opnsense/mvc/app/models/OPNsense/HAProxy/HAProxy.xml
  --output-dir opnsense-api/src/generated --module-name haproxy`.
- `opnsense-config/src/modules/load_balancer.rs::configure_service`
  replaces eight string-match blocks with one-liners:
  `HealthcheckType::from(hc.check_type.as_str())` etc.
- Drive-by: fixed a pre-existing typo at
  `harmony/src/infra/opnsense/load_balancer.rs:185` and the matching
  reverse at `:149` — `SSL::SNI` was mapped to `"sslni"`, but the
  OPNsense wire value is `"sslsni"`. Before this refactor the typo
  silently hit `HealthcheckSsl::Other("sslni")`; the cleaner conversion
  made the bug obvious so it's fixed here rather than left for a
  follow-up.

Verification:
- `cargo check -p harmony -p opnsense-config -p opnsense-api` clean
- `cargo test -p harmony --lib okd::load_balancer` 6/6 pass
- `cargo test -p opnsense-codegen` 22/22 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(opnsense): lowercase match arms in generated From<&str>

Run Check Script / check (pull_request) Successful in 2m6s

ead76e710f

Two regressions from fc16e9f that ./build/check.sh catches:

1. `opnsense-api`'s `test_haproxy_deser` example references
   `resp.haproxy` on the response wrapper. The regen auto-derived the
   field name as `op_nsenseha_proxy` from the struct name. The fix is to pass
   `--api-key haproxy` so the wrapper key stays stable.

2. For enums whose wire values aren't all-lowercase (e.g. `"SSLv3"`,
   `"CONNECT"`), the emitted `From<&str>` matched `s.to_lowercase()`
   against the original-case wire value, which clippy flags as
   unreachable ("match arm has differing case"). Lowercase the wire
   value in the emitted match arm so case-insensitive matching actually
   works; serialization still emits the original-case wire value
   because the serde module is unaffected.
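
For illustration, the shape of the corrected conversion might look
like the sketch below. `WireEnum` is a hypothetical hand-written enum,
not generator output, and `wire_value` stands in for the serde module;
the point is that the match arms are lowercase while the emitted wire
value keeps its original case.

```rust
/// Hypothetical enum with mixed-case wire values.
#[derive(Debug, PartialEq)]
pub enum WireEnum {
    Sslv3,
    Connect,
    /// Forward-compat catch-all for unknown wire values.
    Other(String),
}

impl From<&str> for WireEnum {
    fn from(s: &str) -> Self {
        // Arms are lowercase because the input is lowercased first.
        match s.to_lowercase().as_str() {
            "sslv3" => Self::Sslv3,
            "connect" => Self::Connect,
            _ => Self::Other(s.to_string()),
        }
    }
}

impl From<String> for WireEnum {
    fn from(s: String) -> Self {
        Self::from(s.as_str())
    }
}

impl WireEnum {
    /// Original-case value, as serialization would emit it.
    pub fn wire_value(&self) -> &str {
        match self {
            Self::Sslv3 => "SSLv3",
            Self::Connect => "CONNECT",
            Self::Other(raw) => raw.as_str(),
        }
    }
}
```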

Regenerated `haproxy.rs` via
`cargo run -p opnsense-codegen -- generate --xml ... --module-name haproxy --api-key haproxy`.

`./build/check.sh` now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request 'fix(opnsense): valid HAProxy config + From<&str> codegen cleanup' (#273 ) from fix/haproxy-issues into master

Run Check Script / check (push) Successful in 2m8s

Compile and package harmony_composer / package_harmony_composer (push) Failing after 2m13s

be4b9acaad

Reviewed-on: #273

feat: Improve name of disable dad and system reserved score to show pool name

Run Check Script / check (push) Failing after -44h57m34s

Compile and package harmony_composer / package_harmony_composer (push) Failing after -44h56m7s

95a75d50a8

Merge remote-tracking branch 'origin' into feat/nats-auth-callout-e2e

Run Check Script / check (pull_request) Failing after -44h57m27s

3069f5b9ae

johnride added 7 commits 2026-05-05 13:22:31 +00:00

fix(zitadel): always live-query Zitadel for IDs instead of trusting cache f4d6fb9431

ZitadelClientConfig was used as both a key store (machine keys —
which Zitadel cannot return after creation, so caching is required)
AND a lookup cache (project_id, machine_user_ids, user_grants).
The latter introduced a silent drift class:

- ZitadelSetupScore writes the cache incrementally as it creates
  each resource.
- If Zitadel is reset between runs (Postgres recreated, IDs
  reissued), the cache still holds the old IDs.
- ensure_project / ensure_app / ensure_machine_user / user_grant
  short-circuited on cache hit and never consulted Zitadel — so
  downstream Scores got the stale ID.
- The legacy `project_id` field was further `is_none`-guarded so it
  preserved the very first id ever seen, surviving any number of
  Zitadel resets.

Net effect in the wild: the deployed callout's `OIDC_AUDIENCE`
silently pointed at a project that no longer existed, while
agents kept working only because their TOML config carried the
matching stale id. A manual mint script reading `project_id` from
the cache would produce tokens that pass signature validation but
fail the audience check — exactly the symptom that surfaced this
bug.

Fix: drop the cache-hit short-circuit in every ensure_* path and
always live-query. The cache now only holds machine key material
(its only legitimate role) and a record of last-known IDs that
get refreshed on every apply. Cost: ~1 extra HTTP per project /
app / user / grant per Score apply — these are not hot paths.

Also: stop is_none-guarding `config.project_id` so the legacy
field tracks live state for older single-project consumers.
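
The live-query pattern can be sketched with a toy API, assuming
hypothetical names throughout (`ProjectApi`, `InMemoryApi`,
`ensure_project`); the real ensure_* functions talk to Zitadel over
HTTP. The cache is written on every apply but never read to decide
whether to query.

```rust
use std::collections::HashMap;

/// Hypothetical slice of the management API.
pub trait ProjectApi {
    fn find_project(&self, name: &str) -> Option<String>;
    fn create_project(&mut self, name: &str) -> String;
}

/// Always asks the API first, then refreshes the last-known id.
pub fn ensure_project(
    api: &mut impl ProjectApi,
    cache: &mut HashMap<String, String>,
    name: &str,
) -> String {
    let id = match api.find_project(name) {
        Some(id) => id,
        None => api.create_project(name),
    };
    // The cache records the live id; it is not a lookup shortcut.
    cache.insert(name.to_string(), id.clone());
    id
}

/// Toy backend used to simulate a reset between runs.
#[derive(Default)]
pub struct InMemoryApi {
    projects: HashMap<String, String>,
    issued: u32,
}

impl InMemoryApi {
    /// Drops all projects; ids issued afterwards differ from old ones.
    pub fn reset(&mut self) {
        self.projects.clear();
    }
}

impl ProjectApi for InMemoryApi {
    fn find_project(&self, name: &str) -> Option<String> {
        self.projects.get(name).cloned()
    }

    fn create_project(&mut self, name: &str) -> String {
        self.issued += 1;
        let id = format!("project-{}", self.issued);
        self.projects.insert(name.to_string(), id.clone());
        id
    }
}
```

After a reset the next apply returns the reissued id and overwrites
the stale cache entry, which is the drift the commit removes.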

docs(fleet): manual JWT-bearer mint + NATS write recipe 612d934ad4

Working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.

Cross-linked from fleet-zitadel-faq.md.

refactor(fleet): extract NATS credential plumbing into harmony-fleet-auth 4194baacad

The agent's `credentials.rs` + `CredentialsSection` enum graduate
into a workspace crate (`fleet/harmony-fleet-auth/`) so the
operator can consume the same code path. Single struct, single
factory, single auth-callback wiring. The only thing that varies
between consumers is where the `[credentials]` TOML bytes come
from — the agent reads them from a config file on disk, the
operator (next commit) will read them from an env var.

Public surface of the new crate:
  CredentialsSection                    — the deserializable
  CredentialSource / NatsCredential     — the runtime objects
  MachineKeyFile / CachedToken          — helper types
  credential_source_from_config         — factory
  connect_options_with_credentials      — async-nats wiring

Agent consumes via `pub use harmony_fleet_auth::CredentialsSection`
in its own `config.rs` so existing call sites keep working.
The 5 existing tests in the new crate and the 7 in the agent are all green.

This commit is structurally a move; behavior unchanged. Operator
wiring, additional unit tests, and the JWT-mint refactor (split
build_assertion / build_scope / build_token_url for testability)
follow in the next commits.

test(fleet-auth): cover assertion claims, scope, token URL, cache, keyfile 84a25dbb07

Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The
new tests lock the corners we burned cycles on while debugging
the live system:

  * cache freshness boundary (within-leeway, outside-leeway,
    no-cache, non-zitadel variant)
  * assertion claim shape (iss/sub/aud/exp/iat) and the 60-second
    lifetime constant Zitadel enforces server-side
  * scope string content (plural-projects-roles + singular-project-id
    URN + openid base)
  * token URL strips trailing slashes (the //oauth/v2/token 404
    waiting to bite the next operator)
  * MachineKeyFile JSON parsing under Zitadel's wire shape
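
A sketch of three of the pure helpers those tests target. The
signatures are assumptions rather than the crate's actual ones, the
scope order is a guess, and the singular-project URN is written in
Zitadel's documented
`urn:zitadel:iam:org:project:id:<project_id>:aud` form. The leeway
semantics (a token counts as fresh only while more than the leeway
remains) are also an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Token endpoint for an issuer, tolerant of trailing slashes.
pub fn build_token_url(issuer: &str) -> String {
    format!("{}/oauth/v2/token", issuer.trim_end_matches('/'))
}

/// Scope string: openid base, plural projects:roles, singular project id.
pub fn build_scope(project_id: &str) -> String {
    format!(
        "openid urn:zitadel:iam:org:projects:roles \
         urn:zitadel:iam:org:project:id:{project_id}:aud"
    )
}

/// True while the cached token has more than `leeway` left to live.
pub fn is_fresh(expires_at: SystemTime, now: SystemTime, leeway: Duration) -> bool {
    match expires_at.duration_since(now) {
        Ok(remaining) => remaining > leeway,
        // `expires_at` is already in the past.
        Err(_) => false,
    }
}
```

Without the trim, an issuer configured as `https://sso.fleet.local/`
would produce the `//oauth/v2/token` path mentioned above.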

Refactor: build_assertion now delegates to build_assertion_claims
+ build_assertion_header (pure, no signing). Lets the claim/header
shape be unit-tested without an RSA private-key fixture; the
sign-and-decode end-to-end is still covered by the e2e harness.

No new deps. wiremock not needed — every meaningful assertion is
on pure logic.

feat(operator): NATS auth via shared harmony-fleet-auth + e2e wiring 8a609c5342

The operator was opening a bare async_nats::connect with no auth,
which would fail closed against a callout-protected NATS. Wires it
through the same JWT-bearer flow the agent uses, sharing the
recently-extracted harmony-fleet-auth crate.

Operator side
-------------
* main.rs: read FLEET_OPERATOR_CREDENTIALS_TOML (TOML snippet, same
  shape as the agent's [credentials] block — single
  CredentialsSection struct, just a different byte source). Empty
  string bypasses (callout-less dev only, with a loud warning).
* chart.rs: ChartOptions gains an optional OperatorCredentials field.
  When set, build_chart's Deployment mounts a Secret as both
  envFrom (TOML payload → FLEET_OPERATOR_CREDENTIALS_TOML) and a
  volume mount for the JSON keyfile at the configured key_path
  (defaults to /etc/fleet-operator/zitadel-key.json). On-disk helm
  chart still emits credentials: None — those are environment-
  specific and out of scope for a redistributable chart.
* Public manifest builders (build_service_account, build_cluster_role,
  build_cluster_role_binding, build_operator_deployment,
  operator_secret) so the e2e bring-up can apply each resource via
  K8sResourceScore without re-implementing the manifests.
* mod chart now lives in lib.rs so external consumers (the e2e
  bring-up) can reach into it.

E2e bring-up
------------
* Bring-up gains a separate `fleet-operator` machine user with the
  fleet-admin role grant — distinct from the manual-admin
  `fleet-ops` user so audit logs can tell automated operator
  actions apart from human ones.
* New steps 8/10 (build + sideload operator image) and 9/10 (apply
  CRDs + RBAC + Secret + Deployment + wait for Ready). Devices step
  becomes 10/10.
* Reuses harmony_fleet_operator's manifest builders + operator_secret
  via K8sResourceScore — no duplicated YAML, no shell-out.

Tests
-----
* All existing tests pass (harmony-fleet-auth: 18, harmony-fleet-agent:
  7, harmony-fleet-operator: 2). E2e walking-skeleton is exercised
  by the next phase's clean rerun.

docs(podman): FIXME diagnosis for the reconcile-loop bug 34cfa0423b

The agent's periodic reconcile destroys and recreates any service
whose ContainerSpec has env or volumes on every 30s tick. Root cause:
matches_spec returns false unconditionally for those fields because
podman's list endpoint doesn't surface them; the original author
chose to declare "any spec with state is drifted" as a fail-safe.
That fail-safe weaponizes the polling reconciler into a loop.

Tags the offending line with a multi-paragraph FIXME explaining
the symptom, the root cause, the proposed fix (containers.inspect
+ structural compare + an integration test), and the demo-time
workaround (keep demo specs trivial — the hello-web nginx demo
already is).
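
To make the failure mode concrete, here is a sketch contrasting the
fail-safe with a structural compare. The struct fields and both
function names are hypothetical simplifications; the real
ContainerSpec and matches_spec carry more than this.

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical container spec.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ContainerSpec {
    pub image: String,
    pub env: BTreeMap<String, String>,
    pub volumes: Vec<String>,
}

/// Fail-safe as described: the list endpoint only yields the image,
/// so any spec carrying env or volumes is declared drifted.
pub fn matches_spec_from_list(desired: &ContainerSpec, listed_image: &str) -> bool {
    if !desired.env.is_empty() || !desired.volumes.is_empty() {
        return false; // always drifted, so every tick recreates
    }
    desired.image == listed_image
}

/// Proposed direction: compare against what an inspect call reports.
pub fn matches_spec_from_inspect(desired: &ContainerSpec, inspected: &ContainerSpec) -> bool {
    // Volume order is treated as irrelevant in this sketch.
    let mut want = desired.volumes.clone();
    let mut have = inspected.volumes.clone();
    want.sort();
    have.sort();
    desired.image == inspected.image && desired.env == inspected.env && want == have
}
```

A real comparison would likely need to ignore env entries injected by
the image itself; the sketch compares the maps exactly.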

Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's
known-risks section so it's visible at planning time.

Out of scope for tonight; in scope for delivery alongside the
upcoming health-check support on ContainerSpec.

fix(zitadel,operator): user-grant search endpoint + operator keyfile mode 29896bfeab

Run Check Script / check (pull_request) Failing after 2m15s

Two bugs uncovered while running the full e2e walk end to end:

1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search
   which Zitadel rejects with 405 Method Not Allowed (the original
   author's note in the comment hinted at this). The cache previously
   masked it: first apply created the grant + cached the id; second
   apply hit the cache and skipped the broken search. The live-query
   refactor (f4d6fb94) removed the cache short-circuit, surfacing
   the bug as "Create user grant failed: User grant already exists"
   on every re-apply.

   Fix: switch to the collection endpoint
   /management/v1/users/grants/_search with a userIdQuery filter,
   matching the Zitadel API that's actually wired up. Now returns
   the existing grant on re-apply and the create_user_grant fallback
   is correctly skipped.

2. Operator keyfile was mounted with mode 0o400, owned by root. The operator pod
   runs as non-root (image USER directive — no fixed runAsUser
   because we want SCC compatibility). Result: operator boots,
   tries to load the JSON keyfile from the Secret volume, hits
   EACCES, fails the credential factory, retries forever.

   Fix: mode 0o444. World-read inside the pod is fine — single
   container, no other consumers, the Secret namespace is locked
   down, and the file never escapes pod-fs. The proper fsGroup-based
   alternative requires pinning a UID/GID, which conflicts with our
   SCC-friendly choice of leaving runAsUser unset.
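
The permission arithmetic behind item 2 can be sketched as follows. This is a plain model of POSIX mode bits with hypothetical helper names, not the operator's code; it ignores ACLs, capabilities and root's override, and the UID/GID values in the usage note are illustrative.

```rust
/// Which permission class applies to the process opening the file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Accessor {
    Owner,
    Group,
    Other,
}

/// Classify a process against a file's owner and group (hypothetical helper).
fn classify(file_uid: u32, file_gid: u32, proc_uid: u32, proc_gids: &[u32]) -> Accessor {
    if proc_uid == file_uid {
        Accessor::Owner
    } else if proc_gids.contains(&file_gid) {
        Accessor::Group
    } else {
        Accessor::Other
    }
}

/// True when `mode` grants read access to the given class.
fn can_read(mode: u32, who: Accessor) -> bool {
    let bit = match who {
        Accessor::Owner => 0o400,
        Accessor::Group => 0o040,
        Accessor::Other => 0o004,
    };
    mode & bit != 0
}
```

A root-owned file (uid 0, gid 0) opened by a non-root process, say uid 1000 with gid 1000, falls into the `Other` class. `can_read(0o400, Accessor::Other)` is false, which corresponds to the EACCES described above, while `can_read(0o444, Accessor::Other)` is true.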

Also stages a deletion left over from commit 4194baac
(harmony-fleet-auth extraction): the agent's local credentials.rs
was deleted from disk but the removal was never staged.

Verified end to end:
  * STACK READY in 2 min on warm cluster
  * Operator pod: "minted fresh Zitadel access token", "NATS connected",
    "starting Deployment controller", "watching device-info KV"
  * 2 Device CRs auto-created with full label set
  * `kubectl apply -f` of a Deployment CR with
    targetSelector.matchLabels: { group: group-a } produced:
      - status.aggregate { matched=1, succeeded=1, failed=0 }
      - HTTP 200 from nginx on vm-device-00:8080
      - connection refused from vm-device-01:8080 (correctly excluded)
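
The selector result above can be reproduced with a small sketch of `matchLabels` semantics, where every key/value pair in the selector must be present on the device. The helper names are hypothetical, and the `group-b` label on vm-device-01 in the usage note is an assumption for illustration; the log only states that vm-device-01 was excluded.

```rust
use std::collections::BTreeMap;

type Labels = BTreeMap<String, String>;

/// Build a label map from string pairs (illustrative helper).
fn labels(pairs: &[(&str, &str)]) -> Labels {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// matchLabels semantics: every selector pair must appear on the target.
fn selector_matches(match_labels: &Labels, device: &Labels) -> bool {
    match_labels.iter().all(|(k, v)| device.get(k) == Some(v))
}

/// Names of the devices the selector picks, in input order.
fn matched_devices<'a>(
    match_labels: &Labels,
    devices: &'a [(&'a str, Labels)],
) -> Vec<&'a str> {
    devices
        .iter()
        .filter(|(_, l)| selector_matches(match_labels, l))
        .map(|(name, _)| *name)
        .collect()
}
```

Given a selector `labels(&[("group", "group-a")])` and two devices, vm-device-00 labelled `group: group-a` and vm-device-01 labelled `group: group-b` (assumed), `matched_devices` returns only vm-device-00, consistent with the matched=1 aggregate.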

johnride merged commit 023cd742cd into feat/iot-walking-skeleton

2026-05-05 13:46:15 +00:00

johnride referenced this issue from a commit

2026-05-05 13:46:16 +00:00

Merge pull request 'feat/nats-auth-callout-e2e' (#279) from feat/nats-auth-callout-e2e into feat/iot-walking-skeleton

johnride deleted branch feat/nats-auth-callout-e2e

2026-05-05 13:46:16 +00:00


Reference: NationTech/harmony#279