Two bugs uncovered while running the full e2e walk end to end:
1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search
which Zitadel rejects with 405 Method Not Allowed (the original
author's note in the comment hinted at this). The cache previously
masked it: first apply created the grant + cached the id; second
apply hit the cache and skipped the broken search. The live-query
refactor (f4d6fb94) removed the cache short-circuit, surfacing
the bug as "Create user grant failed: User grant already exists"
on every re-apply.
Fix: switch to the collection endpoint
/management/v1/users/grants/_search with a userIdQuery filter,
matching the Zitadel API that's actually wired up. Now returns
the existing grant on re-apply and the create_user_grant fallback
is correctly skipped.
2. Operator keyfile mounted as 0o400 owned by root. The operator pod
runs as non-root (image USER directive — no fixed runAsUser
because we want SCC compatibility). Result: operator boots,
tries to load the JSON keyfile from the Secret volume, hits
EACCES, fails the credential factory, retries forever.
Fix: mode 0o444. World-read inside the pod is fine — single
container, no other consumers, the Secret namespace is locked
down, and the file never escapes pod-fs. The proper fsGroup-based
alternative requires pinning a UID/GID, which conflicts with our
SCC-friendly choice of leaving runAsUser unset.
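
The access decision behind both modes can be sketched with a simplified model of the POSIX owner/group/other check. `can_read` and the UID/GID values below are illustrative assumptions, not taken from the operator code:

```rust
/// Simplified POSIX read check: owner bits if the UID matches, else group
/// bits if any GID matches, else "other" bits. Ignores ACLs and root's
/// capability override, so it only models a non-root process.
fn can_read(mode: u32, file_uid: u32, file_gid: u32, uid: u32, gids: &[u32]) -> bool {
    if uid == file_uid {
        (mode & 0o400) != 0
    } else if gids.contains(&file_gid) {
        (mode & 0o040) != 0
    } else {
        (mode & 0o004) != 0
    }
}

fn main() {
    // Hypothetical ids: the Secret file is root-owned (0:0); the pod gets an
    // arbitrary non-root UID/GID from the platform.
    let (uid, gid) = (1_000_650_000, 1_000_650_000);

    // Mode 0o400 owned by root: the non-root operator falls through to "other".
    println!("0o400 readable: {}", can_read(0o400, 0, 0, uid, &[gid]));
    // Mode 0o444: the "other" read bit is set.
    println!("0o444 readable: {}", can_read(0o444, 0, 0, uid, &[gid]));
    // fsGroup-style alternative: group-readable file plus a pinned GID (2000
    // is a made-up value) that the pod is a member of.
    println!("0o440 + group readable: {}", can_read(0o440, 0, 2000, uid, &[gid, 2000]));
}
```

The third call shows why the fsGroup route needs a pinned GID: the group bits only help when the pod is known to be a member of the file's group.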
Also fixes a stale `git rm` from commit 4194baac
(harmony-fleet-auth extraction) — the agent's local credentials.rs
was deleted from disk but never staged.
Verified end to end:
* STACK READY in 2 min on warm cluster
* Operator pod: "minted fresh Zitadel access token", "NATS connected",
"starting Deployment controller", "watching device-info KV"
* 2 Device CRs auto-created with full label set
* `kubectl apply -f` of a Deployment CR with
targetSelector.matchLabels: { group: group-a } produced:
- status.aggregate { matched=1, succeeded=1, failed=0 }
- HTTP 200 from nginx on vm-device-00:8080
- connection refused from vm-device-01:8080 (correctly excluded)
The agent's periodic reconcile destroys and recreates any service
whose ContainerSpec has env or volumes on every 30s tick. Root cause:
matches_spec returns false unconditionally for those fields because
podman's list endpoint doesn't surface them; the original author
chose to declare "any spec with state is drifted" as a fail-safe.
That fail-safe weaponizes the polling reconciler into a loop.
Tags the offending line with a multi-paragraph FIXME explaining
the symptom, the root cause, the proposed fix (containers.inspect
+ structural compare + an integration test), and the demo-time
workaround (keep demo specs trivial — the hello-web nginx demo
already is).
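
A minimal sketch of what the proposed structural compare could look like once the fields come from a per-container inspect call. The struct shapes and the subset rule for env are assumptions for illustration; the real ContainerSpec and podman types differ:

```rust
/// Desired state as carried by the spec (field set trimmed for the sketch).
struct DesiredSpec {
    image: String,
    env: Vec<(String, String)>,
    volumes: Vec<(String, String)>, // (host_path, container_path)
}

/// What a per-container inspect call would report (hypothetical shape).
struct Inspected {
    image: String,
    env: Vec<(String, String)>,
    mounts: Vec<(String, String)>,
}

/// Returns true when the running container already satisfies the spec.
/// Env and volumes are checked as subsets, on the assumption that the
/// inspected container also carries variables defined by the image.
fn matches_spec(desired: &DesiredSpec, observed: &Inspected) -> bool {
    desired.image == observed.image
        && desired.env.iter().all(|kv| observed.env.contains(kv))
        && desired.volumes.iter().all(|m| observed.mounts.contains(m))
}

fn pair(a: &str, b: &str) -> (String, String) {
    (a.to_string(), b.to_string())
}

fn main() {
    let desired = DesiredSpec {
        image: "nginx:1.27".to_string(),
        env: vec![pair("MODE", "prod")],
        volumes: vec![pair("/srv/data", "/data")],
    };
    let running = Inspected {
        image: "nginx:1.27".to_string(),
        env: vec![pair("PATH", "/usr/bin"), pair("MODE", "prod")],
        mounts: vec![pair("/srv/data", "/data")],
    };
    // An unchanged spec with env and volumes no longer reports drift.
    println!("matches: {}", matches_spec(&desired, &running));
}
```

With a compare of this kind, a spec that carries state is only recreated when one of its fields actually differs from the running container.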
Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's
known-risks section so it's visible at planning time.
Out of scope for tonight; in scope for delivery alongside the
upcoming health-check support on ContainerSpec.
The operator was opening a bare async_nats::connect with no auth,
which would fail closed against a callout-protected NATS. Wires it
through the same JWT-bearer flow the agent uses, sharing the
recently-extracted harmony-fleet-auth crate.
Operator side
-------------
* main.rs: read FLEET_OPERATOR_CREDENTIALS_TOML (TOML snippet, same
shape as the agent's [credentials] block — single
CredentialsSection struct, just a different byte source). Empty
string bypasses (callout-less dev only, with a loud warning).
* chart.rs: ChartOptions gains an optional OperatorCredentials field.
When set, build_chart's Deployment mounts a Secret as both
envFrom (TOML payload → FLEET_OPERATOR_CREDENTIALS_TOML) and a
volume mount for the JSON keyfile at the configured key_path
(defaults to /etc/fleet-operator/zitadel-key.json). On-disk helm
chart still emits credentials: None — those are environment-
specific and out of scope for a redistributable chart.
* Public manifest builders (build_service_account, build_cluster_role,
build_cluster_role_binding, build_operator_deployment,
operator_secret) so the e2e bring-up can apply each resource via
K8sResourceScore without re-implementing the manifests.
* mod chart now lives in lib.rs so external consumers (the e2e
bring-up) can reach into it.
E2e bring-up
------------
* Bring-up gains a separate `fleet-operator` machine user with the
fleet-admin role grant — distinct from the manual-admin
`fleet-ops` user so audit logs can tell automated operator
actions apart from human ones.
* New steps 8/10 (build + sideload operator image) and 9/10 (apply
CRDs + RBAC + Secret + Deployment + wait for Ready). Devices step
becomes 10/10.
* Reuses harmony_fleet_operator's manifest builders + operator_secret
via K8sResourceScore — no duplicated YAML, no shell-out.
Tests
-----
* All existing tests pass (harmony-fleet-auth: 18, harmony-fleet-agent:
7, harmony-fleet-operator: 2). E2e walking-skeleton is exercised
by the next phase's clean rerun.
Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The
new tests lock the corners we burned cycles on while debugging
the live system:
* cache freshness boundary (within-leeway, outside-leeway,
no-cache, non-zitadel variant)
* assertion claim shape (iss/sub/aud/exp/iat) and the 60-second
lifetime constant Zitadel enforces server-side
* scope string content (plural-projects-roles + singular-project-id
URN + openid base)
* token URL strips trailing slashes (the //oauth/v2/token 404
waiting to bite the next operator)
* MachineKeyFile JSON parsing under Zitadel's wire shape
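
Two of the corners listed above are small enough to sketch as pure functions. The signatures are assumptions (the crate's actual helpers may take different arguments); the scope URNs follow Zitadel's reserved-scope naming:

```rust
/// Joins the issuer URL and the token path without producing `//oauth/...`.
fn build_token_url(issuer_url: &str) -> String {
    format!("{}/oauth/v2/token", issuer_url.trim_end_matches('/'))
}

/// openid base + plural "projects" roles scope + singular "project" id
/// audience scope for the given project.
fn build_scope(project_id: &str) -> String {
    format!(
        "openid urn:zitadel:iam:org:projects:roles urn:zitadel:iam:org:project:id:{project_id}:aud"
    )
}

fn main() {
    // Both spellings of the issuer resolve to the same token endpoint.
    println!("{}", build_token_url("https://sso.fleet.local"));
    println!("{}", build_token_url("https://sso.fleet.local/"));
    // "1234567890" is a placeholder project id.
    println!("{}", build_scope("1234567890"));
}
```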
Refactor: build_assertion now delegates to build_assertion_claims
+ build_assertion_header (pure, no signing). Lets the claim/header
shape be unit-tested without an RSA private-key fixture; the
sign-and-decode end-to-end is still covered by the e2e harness.
No new deps. wiremock not needed — every meaningful assertion is
on pure logic.
The agent's `credentials.rs` + `CredentialsSection` enum graduate
into a workspace crate (`fleet/harmony-fleet-auth/`) so the
operator can consume the same code path. Single struct, single
factory, single auth-callback wiring. The only thing that varies
between consumers is where the `[credentials]` TOML bytes come
from — the agent reads them from a config file on disk, the
operator (next commit) will read them from an env var.
Public surface of the new crate:
CredentialsSection — the deserializable
CredentialSource / NatsCredential — the runtime objects
MachineKeyFile / CachedToken — helper types
credential_source_from_config — factory
connect_options_with_credentials — async-nats wiring
Agent consumes via `pub use harmony_fleet_auth::CredentialsSection`
in its own `config.rs` so existing call sites keep working.
Existing 5 tests in the new crate + 7 in the agent all green.
This commit is structurally a move; behavior unchanged. Operator
wiring, additional unit tests, and the JWT-mint refactor (split
build_assertion / build_scope / build_token_url for testability)
follow in the next commits.
Adds a working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.
Cross-linked from fleet-zitadel-faq.md.
ZitadelClientConfig was used as both a key store (machine keys —
which Zitadel cannot return after creation, so caching is required)
AND a lookup cache (project_id, machine_user_ids, user_grants).
The latter introduced a silent drift class:
- ZitadelSetupScore writes the cache incrementally as it creates
each resource.
- If Zitadel is reset between runs (Postgres recreated, IDs
reissued), the cache still holds the old IDs.
- ensure_project / ensure_app / ensure_machine_user / user_grant
short-circuited on cache hit and never consulted Zitadel — so
downstream Scores got the stale ID.
- The legacy `project_id` field was further `is_none`-guarded so it
preserved the very first id ever seen, surviving any number of
Zitadel resets.
Net effect in the wild: the deployed callout's `OIDC_AUDIENCE`
silently pointed at a project that no longer existed, while
agents kept working only because their TOML config carried the
matching stale id. A manual mint script reading `project_id` from
the cache would produce tokens that pass signature validation but
fail the audience check — exactly the symptom that surfaced this
bug.
Fix: drop the cache-hit short-circuit in every ensure_* path and
always live-query. The cache now only holds machine key material
(its only legitimate role) and a record of last-known IDs that
get refreshed on every apply. Cost: ~1 extra HTTP request per project /
app / user / grant per Score apply — these are not hot paths.
Also: stop is_none-guarding `config.project_id` so the legacy
field tracks live state for older single-project consumers.
The agent's data plane was JetStream-KV-only, so live observers
that don't want to consume the JS stream had no signal to subscribe
to. The walking-skeleton e2e admin test was failing as a result —
admin subscribes to `device-state.>` (the per-device direct
subject) and saw nothing in 30s.
This commit adds a small core-NATS publish on `device-state.<id>`
alongside the existing KV writes:
- `FleetPublisher::publish_state_pulse()` emits a tiny
`{device_id, kind: "heartbeat", at}` payload on
`device-state.<device_id>`, called from the heartbeat loop so
observers see traffic on the same 30s cadence as the KV
heartbeat write, but on a non-JetStream subject that any client can
subscribe to.
- `write_deployment_state()` now fans out the same payload it puts
in the KV bucket on the direct subject, so live admin tooling
picks up reconcile transitions immediately without watching the
KV stream.
Also threads `device_id_prefix_strip = "device-"` through the
fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore
construction (parallel to fleet_auth_callout's `bring_up_stack`),
and was missing the prefix-strip line, so the deployed callout was
interpolating permissions against `device-vm-device-00` instead of
the bare device id the agent uses.
Locks the regression with a unit test
(`device_id_prefix_strip_lands_as_env_value`) on the deployment
manifest builder.
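
The effect of the prefix strip on permission interpolation can be sketched as follows. `effective_device_id` is a hypothetical helper name, not the callout's actual function:

```rust
/// Strips the configured prefix from the claim when present; an empty
/// prefix (the default) leaves the claim untouched.
fn effective_device_id<'a>(client_id_claim: &'a str, prefix_strip: &str) -> &'a str {
    client_id_claim
        .strip_prefix(prefix_strip)
        .unwrap_or(client_id_claim)
}

fn main() {
    let claim = "device-vm-device-00";

    // Without the strip, permissions are interpolated against the prefixed name.
    println!("device-state.{}", effective_device_id(claim, ""));
    // With DEVICE_ID_PREFIX_STRIP=device-, they match the id the agent uses.
    println!("device-state.{}", effective_device_id(claim, "device-"));
}
```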
Verified end-to-end in the VM rehearsal:
test both_devices_heartbeat_within_60s ... ok
test admin_jwt_reads_any_device_subject ... ok
Two bugs surfaced when the agent went live against NATS JetStream KV
in the VM-based e2e rehearsal:
1. The default `device` role only allowed flat `device-state.<id>` /
`device-commands.<id>` subjects. The agent's actual data plane is
JetStream KV, which puts every operation on `$KV.<bucket>.<key>`
subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`.
With the old role config, the very first KV publish died with
`Permissions Violation for Publish to "$JS.API.INFO"`.
The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four
per-device data subjects derived from
harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>,
heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct
`device-state.<id>` / `device-commands.<id>` subjects are kept so
non-JetStream callers of NatsAuthCalloutScore still work.
A new unit test (`device_role_covers_reconciler_contract_kv_subjects`)
imports the contract crate as a dev-dep and asserts each contract-
produced subject is matched, plus that cross-device subjects are
*not* matched. This locks the role config to the contract surface so
future renames break the test before they break prod.
2. Zitadel's `client_id` claim for a machine user equals the userName
verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the
user as `device-{device_id}`, so the JWT carries
`device-vm-device-00` while the agent's KV keys use the bare
`vm-device-00`. The callout was interpolating the prefixed string
into permissions, producing rules that never matched what the
agent actually publishes.
Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`,
defaults empty so existing deployments are unaffected). When set,
the validator strips the prefix from the extracted claim before
permission interpolation. The fleet_auth_callout example wires it
to `device-` so the e2e harness stays end-to-end correct without
reaching into either naming convention.
Verified end-to-end: both VM agents now publish DeviceInfo /
heartbeat through JetStream KV with no permission errors and zero
service restarts since the rollout.
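
The role check from item 1 comes down to NATS subject matching, where `*` matches exactly one token and a trailing `>` matches one or more. A minimal sketch follows; the allow-list entries (in particular the `*` bucket wildcard in the `$KV` patterns) are illustrative assumptions, not the role config shipped in this commit:

```rust
/// NATS-style subject match. Assumes well-formed patterns, i.e. `>` only
/// appears as the last token.
fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut p = pattern.split('.');
    let mut s = subject.split('.');
    loop {
        match (p.next(), s.next()) {
            (Some(">"), Some(_)) => return true, // tail wildcard: one or more tokens
            (Some("*"), Some(_)) => continue,    // single-token wildcard
            (Some(a), Some(b)) if a == b => continue,
            (None, None) => return true,         // both exhausted together
            _ => return false,
        }
    }
}

/// Hypothetical per-device allow-list in the spirit of the `device` role.
fn device_allow_list(device_id: &str) -> Vec<String> {
    vec![
        "$JS.API.>".to_string(),
        "$JS.ACK.>".to_string(),
        format!("$KV.*.info.{device_id}"),
        format!("$KV.*.state.{device_id}.*"),
        format!("device-state.{device_id}"),
    ]
}

fn is_allowed(allow: &[String], subject: &str) -> bool {
    allow.iter().any(|p| subject_matches(p, subject))
}

fn main() {
    let allow = device_allow_list("vm-device-00");
    for subject in [
        "$JS.API.INFO",
        "$KV.device-info.info.vm-device-00",
        "$KV.device-info.info.vm-device-01", // cross-device: must be denied
    ] {
        println!("{subject}: {}", is_allowed(&allow, subject));
    }
}
```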
The cargo bin target is `harmony-fleet-agent`, not `fleet-agent` —
the latter never existed under target/release. Smoke-a4 happened to
work because callers passed --agent-binary explicitly; the harness
defaults didn't.
Zitadel only includes the project-roles block in an access token when
the JWT-bearer request asks for it via the
`urn:zitadel:iam:org:projects:roles` scope (PLURAL "projects"). Without
it the agent's token has a valid signature/audience but no roles, so
the NATS auth callout rejects with "no authorized role in token" even
though the machine user has a "device" grant.
Discovered while running the VM-based e2e rehearsal: agents could mint
a token, connect to NATS, then immediately fail authorization. The
plural-projects vs. singular-project distinction is a Zitadel
convention; both scopes are required, and the comment now spells out
what each one does.
Wires the previously-built FleetDeviceSetupScore through to a
LinuxHostTopology against each pre-provisioned VM. Mirrors the
fleet_rpi_setup pattern but synthesizes inline so the harness drives
N VMs in sequence without re-deriving the CLI plumbing.
Each VM gets:
- An /etc/hosts entry mapping `sso.fleet.local` → libvirt host IP
via the new HostsEntry support, so the in-VM agent's HTTP client
to Zitadel can resolve the issuer.
- The per-device Zitadel machine key dropped at
/etc/fleet-agent/zitadel-key.json.
- Agent TOML with `type = "zitadel-jwt"` pointing at the keyfile.
- Agent service started under systemd.
SSH user assumed `fleet-admin` (matches what fleet_vm_setup +
smoke-a4 cloud-init create). Private key from the harmony fleet
keypair (ensure_fleet_ssh_keypair).
After this commit, `cargo run -p example-fleet-e2e-demo` is the
single command that turns a fresh k3d + 2 booted VMs into a
fully-converged stack: Zitadel + NATS callout + 2 agents speaking
JWT-bearer to NATS. Tomorrow morning: prove it actually does
that on a clean machine.
Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's
existing pieces (Zitadel + auth callout deploy) with per-device
machine-user provisioning (one ZitadelSetupScore call per VM) and
FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness
expects pre-provisioned libvirt VMs (one per device) reachable via
`FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via
ProvisionVmScore is a follow-up. Deferring it keeps the harness
observable in pieces during tomorrow's cold-start debugging.
Constituent helpers in `fleet_auth_callout::lib.rs` flipped from
private to `pub` (deploy_zitadel, wait_for_zitadel_ready,
ensure_issuer_seed, build_and_load_callout_image, etc.) so the new
harness composes them rather than re-implementing.
`bring_up_full_stack`:
1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d).
2. Deploy Zitadel + Postgres.
3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the
chart-provisioned `iam-admin-pat` secret. (Last step is new and
load-bearing — without it ZitadelSetupScore races the chart's
setup job and fails on first cold-run.)
4. ZitadelSetupScore for project + API app + roles + admin
machine-user (admin gets fleet-admin role grant).
5. Issuer NKey from a persisted secret + NATS deploy with
auth_callout block + callout pod.
6. For each device i: per-device ZitadelSetupScore (machine-user
with `device` role grant), pull the JSON keyfile from cache,
render the agent's TOML with the keyfile path. (FleetDeviceSetupScore
invocation is wired structurally; the SSH-and-apply step is
gated behind the VM provisioning follow-up.)
`HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so
VMs on a libvirt NAT can resolve `sso.fleet.local` to the host
gateway. Managed-block markers in /etc/hosts make the merge
idempotent across re-runs and removable when entries are dropped
from the score. Four new unit tests cover the merge invariants
(insert, replace, strip, byte-stable).
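
A minimal sketch of a managed-block merge with those four properties. The marker strings, the signature, and the example gateway IP are assumptions; the real `merge_hosts_file` may differ:

```rust
// Hypothetical marker lines delimiting the block this tool owns.
const BEGIN: &str = "# BEGIN harmony-managed";
const END: &str = "# END harmony-managed";

/// Rebuilds the hosts file: lines outside the managed block are kept as-is,
/// the old block is dropped, and a new block is appended if entries exist.
fn merge_hosts_file(existing: &str, entries: &[(&str, &str)]) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            BEGIN => in_block = true,
            END => in_block = false,
            _ if !in_block => out.push(line.to_string()),
            _ => {} // stale managed entry: drop it
        }
    }
    if !entries.is_empty() {
        out.push(BEGIN.to_string());
        for (ip, host) in entries {
            out.push(format!("{ip} {host}"));
        }
        out.push(END.to_string());
    }
    let mut merged = out.join("\n");
    merged.push('\n');
    merged
}

fn main() {
    let original = "127.0.0.1 localhost\n";
    let entries = [("192.168.122.1", "sso.fleet.local")];

    let once = merge_hosts_file(original, &entries);
    let twice = merge_hosts_file(&once, &entries);
    print!("{once}");
    // Re-running with the same entries is byte-stable.
    assert_eq!(once, twice);
    // Dropping all entries strips the block again.
    assert_eq!(merge_hosts_file(&twice, &[]), original);
}
```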
Test skeleton in `tests/e2e_walking_skeleton.rs`:
- `both_devices_heartbeat_within_60s` — implemented; reads from
device-info KV via admin token.
- `admin_jwt_reads_any_device_subject` — implemented; subscribes
to `device-state.>` as admin.
- `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending
per-device-key plumbing through E2eHandles.
- `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending
the NATS-pod-restart driver.
The two `#[ignore]`d tests cover the load-bearing reconnect and
isolation invariants. Wiring them is the morning-of-rehearsal
priority since those are the customer-facing claims.
Out of scope of this commit (called out in the roadmap doc):
- ProvisionVmScore integration (today the operator runs fleet_vm_setup
out-of-band).
- Operator install via Helm (smoke-a4 runs operator host-side; this
harness inherits that pattern).
- Full SSH-based agent install via FleetDeviceSetupScore — Score
built, invocation gated.
Adds ROADMAP/fleet_platform/v0_demo_e2e.md and threads it from
v0_1_plan.md. The VM rehearsal extends smoke-a4 (already-green k3d
+ libvirt VM + agent + apply CR + reconcile loop) with Zitadel +
auth callout + agent JWT auth. Two devices + one admin, real
cargo tests sharing a OnceCell-bringup.
Plan calls out:
- The 7 tests, including the load-bearing
`agent_recovers_from_nats_pod_restart` (asserts the auto-reconnect
+ auth-callback re-mint path under realistic disturbance).
- Five known risks / debugging traps to expect on first cold-start
(iam-admin-pat secret timing, /etc/hosts injection, k3d port
collisions, etc.).
- Success criteria for the rehearsal day: cold cargo run greens in
<20 min, all 7 tests green on a clean machine, the NATS-restart
test reliably greens 5 runs in a row.
- Anything below the success criteria → reframe the customer call
to "architecture walkthrough + local k3d demo + pilot in 1-2
weeks." Avoids burning the relationship to keep a deadline.
Once VM rehearsal is green the residual OKD deltas are configuration
(Route annotations, image registry, real DNS, cert) — no new code.
The VM smoke harness still uses shared NATS creds for v0 (no Zitadel
JWT path through libvirt — the customer-facing Pi flow has it via
fleet_rpi_setup --bootstrap-token). Rewrites the FleetDeviceSetupConfig
literal against the new `auth: FleetDeviceAuth` field.
Hands-on walkthrough for the 48-hour customer demo:
- Operator: build/push the callout image → fleet-staging-deploy →
capture project_id + cli_client_id from the printed panel.
- Developer: fleet-sso-login proves Zitadel SSO works end-to-end.
- Pi onboarding: extract iam-admin-pat from the staging cluster,
cross-compile the agent for aarch64, run fleet-rpi-setup once
per device with --bootstrap-token. Each Pi's agent connects to
NATS over WSS using the JWT-bearer token minted from its
per-device keyfile.
- Deploy a container to a labeled subset via
example_harmony_apply_deployment with --env / --volume / --restart
flags (env + bind mounts + restart policy that work_item #1 added).
- Observe the cross-device security model holding via the auth
callout's logs.
Also captures what's deliberately NOT in the demo (compose
auto-translation, UI, Tailscale backdoor, device-join-request
flow, OpenBao, K8s OIDC) so the customer call has clean expectation-
setting.
The runbook is the closing piece of the 48h-demo work plan;
sequenced after the eight feat / refactor commits that built the
underlying functionality.
Adds `examples/fleet_sso_login/` — the developer-side CLI that proves
the SSO works end-to-end against a deployed staging instance. RFC 8628
device-code flow:
- POSTs `/oauth/v2/device_authorization` with the harmony-cli client_id.
- Prints `verification_uri_complete` so the developer opens one URL in
the browser; Zitadel handles the auth (username/password, MFA,
whatever the customer has wired into Zitadel's auth chain).
- Polls `/oauth/v2/token` honouring the standard `authorization_pending`
/ `slow_down` polling protocol.
- On success: decodes the access token's claims, prints
`Welcome <name> <email>`, persists the session (issuer + client_id +
access_token + claims) at $DATA_DIR/harmony/sso-session.json with
mode 0600.
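
The polling step can be sketched as a pure decision function over the token endpoint's `error` field. The type and function names are hypothetical; the 5-second bump on `slow_down` is what RFC 8628 prescribes:

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PollAction {
    /// Sleep for `interval`, then poll the token endpoint again.
    Retry { interval: Duration },
    /// The response carried tokens instead of an error.
    Done,
    /// Any other error (access_denied, expired_token, ...) ends the flow.
    Fail(String),
}

/// `error` is the OAuth error code from the token response, if any.
fn next_poll_action(error: Option<&str>, interval: Duration) -> PollAction {
    match error {
        None => PollAction::Done,
        Some("authorization_pending") => PollAction::Retry { interval },
        Some("slow_down") => PollAction::Retry {
            interval: interval + Duration::from_secs(5),
        },
        Some(other) => PollAction::Fail(other.to_string()),
    }
}

fn main() {
    let interval = Duration::from_secs(5);
    println!("{:?}", next_poll_action(Some("authorization_pending"), interval));
    println!("{:?}", next_poll_action(Some("slow_down"), interval));
    println!("{:?}", next_poll_action(Some("access_denied"), interval));
    println!("{:?}", next_poll_action(None, interval));
}
```

Keeping the decision separate from the HTTP call means the protocol handling can be tested without a live Zitadel.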
For the demo this proves the SSO chain end-to-end. The actual
`harmony fleet apply` operation (which would consume the persisted
token through a fleet-platform API gateway) is post-demo — clusters
typically don't accept Zitadel JWTs as kube-apiserver bearer tokens
without an OIDC integration the customer would have to opt into.
`fleet_staging_deploy` now also provisions a `harmony-cli` Device
Code OIDC application alongside the existing API app, captures its
client_id from the ZitadelClientConfig cache, and prints both the
client_id and the exact `cargo run -p example-fleet-sso-login ...`
invocation in the operator's "next steps" panel.
Adds `examples/fleet_staging_deploy/` — the operator-side, run-once-
per-customer harness that brings up the fleet platform's central
services on a real OKD/K8s cluster. Complements the existing
`fleet_auth_callout` (k3d local-dev harness, kept unchanged) and
`fleet_rpi_setup` (per-device onboarding).
`FleetDomainConfig` is the single source of truth for hostnames:
base_domain = "customer1.nationtech.io"
→ zitadel.<base> (Zitadel HTTPS via OKD HAProxy edge-TLS)
→ nats.<base> (NATS WSS through the same ingress)
Nothing is hardcoded; the operator supplies one --base-domain flag
and the deploy is fully parameterized. Re-running is idempotent
(rides the helm-upgrade-by-default + ZitadelSetupScore search-then-
create + persisted issuer-NKey-secret idempotency layers).
NATS values render under config.merge.{auth_callout, accounts,
system_account}, with WSS via `websocket: { enabled, port: 8443,
ingress: { className: openshift-default, ... } }` and the OKD-flavored
HAProxy edge-TLS annotations:
route.openshift.io/termination: edge
haproxy.router.openshift.io/timeout: "1h"
(Switch to `reencrypt` when the customer wants pod-to-edge TLS;
gateway-api migration is on their roadmap, separate from the demo.)
bring_up_staging():
- Deploys ZitadelScore (external_secure: true, no external_port → 443).
- Waits for HTTPS .well-known.
- Provisions the project + API app + roles via ZitadelSetupScore
hitting Zitadel through the public ingress (port 443, TLS verified).
No machine users provisioned — fleet_rpi_setup mints them on demand
per device, so the staging deploy stays device-count-agnostic.
- Persists / reads the issuer NKey seed in the
`callout-issuer-seed` K8s secret (so re-runs don't invalidate
user JWTs already in flight on customer Pis).
- Deploys NATS via NatsHelmChartScore with the WSS values.
- Deploys NatsAuthCalloutScore (oidc_audience = project_id;
external_secure path means no danger_accept_invalid_certs).
main.rs ends by printing the exact `cargo run -p
example-fleet-rpi-setup ...` invocation the operator runs against a
Pi, with the project_id and zitadel/nats URLs filled in.
Three unit tests cover the domain config + NATS values rendering
(WSS + edge-TLS annotations + auth_callout under merge).
The Pi onboarding flow can now mint a per-device Zitadel machine user
on the operator's machine and ship the resulting JWT key to the Pi —
the agent then authenticates to NATS via JWT-bearer instead of shared
nats_user/nats_pass.
`FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous
flat `nats_user` / `nats_pass` fields. Two variants:
- TomlShared { nats_user, nats_pass } — legacy / dev fallback.
- ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } —
per-device JWT-bearer. The Score:
* Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json
(mode 0640, owner fleet-agent — matches the agent's secret-mount
conventions).
* Renders [credentials] type = "zitadel-jwt" pointing at that
keyfile + the issuer + audience the agent's CredentialSource
needs.
A change to either the keyfile content or the TOML triggers an
agent restart, same as binary / unit drift.
`fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path.
The bootstrap PAT is held in the CLI's memory only; it never lands
on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id,
--zitadel-device-role (default `device`), --danger-accept-invalid-certs.
`zitadel_bootstrap` is a slim ManagementAPI client that, idempotently
per device:
1. Find-or-create machine user `device-${device_id}`.
2. Find-or-skip a project role grant (defaults to `device`).
3. Always mint a fresh JSON key and return its content. (Zitadel
doesn't expose the private half of an existing key, so reusing
isn't possible — stale keys remain valid until expiry, which is
fine because each setup run overwrites the on-device keyfile.)
Three new render_toml tests cover the zitadel-jwt path; eleven
existing agent tests still pass.
Out of scope, tracked: device-join-request + admin-approve flow that
would replace bootstrap-PAT entirely (closer to the OKD
node-approval pattern). Long-lived admin PAT is acceptable for the
demo per product call.
The merge of feat/prepare-rpi added a `sudo_password: Option<String>`
field to SshCredentials but the `default_ubuntu_aws` constructor on
the destination branch was authored before that field existed. Add
the missing field as `None` (matches the prepare-rpi semantics:
passwordless sudo expected unless explicitly configured).
The fleet agent's NATS connection is the load-bearing piece of the
"never lose connectivity to a device" guarantee. This commit makes
that hold even when Zitadel access tokens expire across NATS pod
restarts and network partitions.
New `[credentials]` config variants (internally tagged on `type`):
type = "toml-shared" { nats_user, nats_pass } # v0/dev
type = "zitadel-jwt" { key_path, oidc_issuer_url, audience, ... }
A `CredentialSource` enum dispatches per variant:
- TomlShared returns the same user/pass each call.
- ZitadelJwt mints an access token from Zitadel via the JWT-bearer
flow (RFC 7523). The keyfile at `key_path` is the only durable
secret on the device; the bearer token is short-lived and refreshed
in-memory when the cached value is within 5 minutes of expiry.
Two concurrent refreshes are race-safe — the second writer's mint
is wasted but produces a correct token.
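
The refresh rule for the cached bearer token can be sketched as a pure freshness check. The `CachedToken` fields and the `is_fresh` name are assumptions for illustration:

```rust
use std::time::{Duration, SystemTime};

/// Refresh once the cached token is within 5 minutes of expiry.
const REFRESH_LEEWAY: Duration = Duration::from_secs(5 * 60);

struct CachedToken {
    #[allow(dead_code)]
    access_token: String,
    expires_at: SystemTime,
}

/// True when the cached token can be reused; false means "mint a new one".
fn is_fresh(cached: Option<&CachedToken>, now: SystemTime) -> bool {
    match cached {
        None => false,
        Some(token) => match token.expires_at.duration_since(now) {
            Ok(remaining) => remaining > REFRESH_LEEWAY,
            Err(_) => false, // already expired
        },
    }
}

fn main() {
    let now = SystemTime::UNIX_EPOCH + Duration::from_secs(1_000_000);
    let token = CachedToken {
        access_token: "placeholder".to_string(),
        expires_at: now + Duration::from_secs(10 * 60),
    };
    println!("10 min left, fresh: {}", is_fresh(Some(&token), now));
    // Four minutes later only 6 minutes remain, still outside the leeway.
    println!("6 min left, fresh: {}", is_fresh(Some(&token), now + Duration::from_secs(240)));
    // Six minutes later only 4 minutes remain, so a refresh is due.
    println!("4 min left, fresh: {}", is_fresh(Some(&token), now + Duration::from_secs(360)));
}
```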
The agent's `connect_nats` is rewritten on top of async-nats's
`with_auth_callback`, which is invoked on every (re)connect attempt:
- async-nats reconnects automatically on disconnect (default
behaviour of ConnectOptions) — we don't need a watchdog.
- Each reconnect attempt invokes the callback, which calls
`next_credential()`. If the cached token is expired, a fresh one
is minted before the reconnect proceeds. So a Pi that loses NATS
while its token has just expired will pick up a brand-new token
on the next reconnect attempt with no operator intervention.
- An `event_callback` surfaces Connected / Disconnected / SlowConsumer
/ ServerError events into tracing — operators can see exactly when
reconnects happen, which is non-negotiable for an out-of-warranty
device fleet.
A subtle constraint drove the trait shape: async-nats's
`with_auth_callback` requires the returned future to be `Send + Sync`,
which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does
not satisfy. The credential source is therefore an enum (concrete
dispatch) rather than `dyn CredentialSource`. Two variants is small
enough that enum dispatch beats trait-object plumbing.
Out of scope, tracked for follow-up: a separate daemon for SSH access
to the Pi via Tailscale/Headscale ("secure backdoor"), and the
device-join-request + admin-approve flow that would replace the
current admin-PAT bootstrap pattern.
The previous commit swept in `.claude/worktrees/*` (ephemeral agent
worktree submodules) and a few scratch files that landed at the repo
root during prior sessions. None of them are project artifacts.
Removing them from the index and adding to .gitignore so future
`git add -A` doesn't re-include them.
Files on disk are unchanged.
The IoT walking-skeleton's PodmanV0Score and the underlying
ContainerSpec capability were name+image+ports only. Real customer
workloads (the demo target's docker-compose for example) need at
minimum:
- Environment variables for runtime config + secrets injected at
deploy time.
- Bind-mount volumes so the container can persist data across
recreates (sqlite db files, config dirs).
- Restart policy so the container survives device reboot or crash.
PodmanService and ContainerSpec gain `env: Vec<(String, String)>`,
`volumes: Vec<VolumeMount>`, and `restart_policy: RestartPolicy`. All
three default to empty / `unless-stopped` via #[serde(default)] so any
Deployment CR written before this change still deserializes — that
includes the existing smoke harnesses and any field-side state.
VolumeMount is bind-only in v0 (host_path -> container_path, optional
read_only). Named/anonymous volumes can be added behind the same field
later by inspecting host_path's shape; the customer's compose file is
expected to use bind mounts only.
RestartPolicy mirrors podman/docker convention — `no`,
`unless-stopped` (default, matching docker-compose), `on-failure`,
`always`. Serialized kebab-case so docker-compose translation is
mechanical.
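
For illustration, the kebab-case mapping can be written out by hand. This is a dependency-free sketch of the string contract only; the real type relies on serde renaming, and `as_str` / `FromStr` here are not part of it:

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum RestartPolicy {
    No,
    #[default]
    UnlessStopped,
    OnFailure,
    Always,
}

impl RestartPolicy {
    /// The kebab-case spelling shared with podman/docker-compose.
    fn as_str(self) -> &'static str {
        match self {
            RestartPolicy::No => "no",
            RestartPolicy::UnlessStopped => "unless-stopped",
            RestartPolicy::OnFailure => "on-failure",
            RestartPolicy::Always => "always",
        }
    }
}

impl FromStr for RestartPolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "no" => Ok(RestartPolicy::No),
            "unless-stopped" => Ok(RestartPolicy::UnlessStopped),
            "on-failure" => Ok(RestartPolicy::OnFailure),
            "always" => Ok(RestartPolicy::Always),
            other => Err(format!("unknown restart policy: {other}")),
        }
    }
}

fn main() {
    // A spec that omits the field falls back to the default.
    println!("default: {}", RestartPolicy::default().as_str());
    // A compose-style value parses without translation.
    println!("parsed: {:?}", "on-failure".parse::<RestartPolicy>());
}
```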
PodmanTopology::ensure_service_running now passes env / mounts /
restart policy to the podman API. matches_spec conservatively forces
recreate whenever the spec carries non-empty env / volumes or a non-
default restart policy: the podman list endpoint doesn't surface those
fields, so a structural compare isn't possible from ListContainer
alone. Recreating an unchanged container is cheap (~hundreds of ms);
the alternative (silent stale-config window) isn't acceptable for
fleet-managed devices.
example_harmony_apply_deployment grows --env, --volume, and --restart
flags so an operator can drive the new shape from the CLI when
authoring a Deployment CR.
Tests:
- legacy CR JSON without the new fields deserializes (wire-compat).
- env ordering survives roundtrip (drift-detection invariant).
- restart policy serializes kebab-case (compose-translation contract).
- podman_v0_score_roundtrip exercises env + volumes + restart.
harmony-nats-callout becomes a deployable service, not just a library:
- New [[bin]] target with env+secret-file driven config and
SIGINT/SIGTERM-aware shutdown.
- Dockerfile (single-stage archlinux:base, non-root, matches
harmony-fleet-operator convention).
- Refactored handler into a pure `decide()` function so the entire
authorization decision tree is unit-testable without async-nats.
- New `roles` module with role resolution + a `validate_device_id`
security gate that rejects NATS subject metacharacters in device_id
(.>* whitespace) — closes a real escalation path through the
`{device_id}` placeholder in the per-device permissions block.
- Configurable role claim path + admin/device role names; admin wins
when both are present (privilege-escalation invariant).
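
A minimal sketch of the decision tree described above, assuming roles arrive as a plain string slice. The signatures, the `Decision` type, and the hard-coded role names are illustrative; the real `decide()` takes configurable role names and richer inputs:

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Admin,
    Device(String),
    Reject(String),
}

/// Rejects NATS subject metacharacters so a device id can never widen the
/// `{device_id}` placeholder into a wildcard or an extra subject token.
fn validate_device_id(device_id: &str) -> Result<&str, String> {
    let forbidden = |c: char| matches!(c, '.' | '>' | '*') || c.is_whitespace();
    if device_id.chars().any(forbidden) {
        return Err(format!(
            "device_id {device_id:?} contains a NATS subject metacharacter"
        ));
    }
    Ok(device_id)
}

/// Pure decision function: admin wins when both roles are present.
fn decide(roles: &[&str], device_id: &str) -> Decision {
    if roles.contains(&"fleet-admin") {
        return Decision::Admin;
    }
    if roles.contains(&"device") {
        return match validate_device_id(device_id) {
            Ok(id) => Decision::Device(id.to_string()),
            Err(reason) => Decision::Reject(reason),
        };
    }
    Decision::Reject("no authorized role in token".to_string())
}

fn main() {
    println!("{:?}", decide(&["device"], "vm-device-00"));
    // A wildcard id would turn `device-state.{device_id}` into `device-state.*`.
    println!("{:?}", decide(&["device"], "*"));
    println!("{:?}", decide(&["device", "fleet-admin"], "vm-device-00"));
    println!("{:?}", decide(&[], "vm-device-00"));
}
```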
57 unit tests cover every reachable branch of the security decision
tree; 4 e2e tests in nats/integration-test-callout exercise real NATS
in podman with: device pubsub on own subjects, cross-device subject
isolation, admin-can-read-anything, and JWT-without-role rejection.
harmony/src/modules/nats_auth_callout/:
- New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment +
Secret. fsGroup + 0o440 secret mode so the non-root container can
read its mounted seed/password without leaving them in env vars.
- `render_auth_callout_block` helper produces the YAML for NATS Helm
`config.merge.authorization.auth_callout` so both halves stay in
sync.
examples/fleet_auth_callout/:
- `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres ->
CoreDNS rewrite -> project + roles + machine users with JWT keys
-> NATS Helm with auth_callout block -> callout image build +
sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs
(issuer NKey persisted in a K8s secret so user JWTs survive
restarts).
- `mint_access_token()` is an RFC 7523 JWT-bearer client. It sends a Host
header that includes the port so Zitadel emits a matching issuer.
- main.rs prints URLs/creds/keyIds and waits for Ctrl-C.
- Three #[tokio::test] functions sharing one cluster via OnceCell:
admin_can_read_any_device_subject, device_can_only_access_own_subjects,
unknown_role_is_rejected. All green on real k3d.
ZitadelScore:
- Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's
FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on
this secret existing; without the chart values, the prior code path
was non-functional.
- New `external_port: Option<u32>` field. Controls Zitadel's emitted
issuer URL when the host port mapping isn't 80/443 (k3d typically
maps 8080:80). Without it, JWT-bearer audience validation 500s with
`Errors.Internal` because the assertion's `aud` doesn't match the
chart-default issuer at port 80.
ZitadelSetupScore is extended for the JWT-bearer flow needed by the
NATS auth callout:
- API apps (resource servers — required for project-id audience scope)
- Project roles (`POST .../projects/{id}/roles`, idempotent)
- Machine users with KEY_TYPE_JSON keys (provisioned + cached
device-side; Zitadel does not expose the key material on subsequent
reads, so the local cache is the source of truth)
- User grants (project + role keys)
Cache (ZitadelClientConfig) gains projects, machine_user_ids,
machine_keys, and user_grants — keyed for idempotency across re-runs.
Backwards compatible with existing harmony_sso example: the new fields
have `#[serde(default)]` and prior callers just need empty vecs.
The upgrade-by-default change to Helm release handling (separate commit)
lets ExternalPort changes propagate to existing releases on re-run.
Helm releases without a pinned `chart_version` previously short-circuited
to a NOOP when already installed, which silently dropped any
`values_yaml` / `values_overrides` changes the caller had made. Now we
fall through to `helm upgrade --install` whenever:
- the release isn't installed (unchanged), or
- it's installed and either unpinned or pinned-and-matching.
Helm itself becomes the source of truth for "did anything actually
change" — no-op upgrades are cheap and changed values get applied
automatically without the caller having to opt in via a flag.
`install_only=true` keeps the prior skip-if-installed shortcut so
bootstrap operators (cert-manager, prometheus-operator, CRDs) that
should not be touched on re-runs continue to behave the same.
Pinned-version safety net is unchanged: a different version installed
than what the score requests is an error, never a silent change.
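The rules above can be sketched as one pure planning function. The names (`HelmAction`, `plan_release`) are hypothetical, and the precedence of `install_only` over the pinned-version check is an assumption; the commit does not say which wins when both apply.

```rust
/// What to do with a release (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum HelmAction {
    /// Run `helm upgrade --install` and let Helm decide whether anything changed.
    UpgradeInstall,
    /// `install_only` release that already exists: leave it alone.
    SkipInstalled,
    /// Installed version differs from the pinned one: an error, never a silent change.
    VersionMismatch { installed: String, requested: String },
}

/// `installed` is the currently deployed chart version, if any;
/// `pinned` is the score's requested `chart_version`, if any.
pub fn plan_release(installed: Option<&str>, pinned: Option<&str>, install_only: bool) -> HelmAction {
    match (installed, pinned) {
        // Not installed yet: always install (unchanged behaviour).
        (None, _) => HelmAction::UpgradeInstall,
        // Assumption: the install_only shortcut is evaluated before the pin check.
        (Some(_), _) if install_only => HelmAction::SkipInstalled,
        // Pinned and not matching: refuse.
        (Some(have), Some(want)) if have != want => HelmAction::VersionMismatch {
            installed: have.to_string(),
            requested: want.to_string(),
        },
        // Unpinned, or pinned and matching: fall through to upgrade.
        (Some(_), _) => HelmAction::UpgradeInstall,
    }
}
```

The case that changed is the last arm: an installed, unpinned release used to be a NOOP and now reaches `UpgradeInstall`, so changed values are applied.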
nats-jwt:
- Add NkeyPub newtype with prefix validation
- Add ClaimType and Algorithm typed enums
- Add impl_nats_claims! macro eliminating 4x duplicated impl blocks
- Add AuthorizationRequestClaimsBuilder (completing all builder types)
- Fix AuthorizationResponseBuilder: add issuer() builder method, stop
mutating iss in sign()
- Tighten trait bounds: encode<T: Serialize>, decode_unverified<T:
DeserializeOwned>
- Remove dead error variants Expired/NotYetValid
- Add builder tests for all 4 claims types
- Deduplicate is_zero helper
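For illustration, a validating newtype in the spirit of `NkeyPub` might look like the sketch below. The accepted prefixes (operator `O`, account `A`, user `U`), the error type, and the method names are assumptions; the sketch checks only length, prefix, and the base32 alphabet, and does not verify the embedded checksum.

```rust
/// A public NKey that has passed basic shape checks (sketch only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NkeyPub(String);

#[derive(Debug, PartialEq, Eq)]
pub enum NkeyError {
    BadLength(usize),
    BadPrefix(char),
    BadCharacter(char),
}

impl NkeyPub {
    /// Encoded public NKeys are 56 base32 characters with a type prefix.
    pub fn parse(s: &str) -> Result<Self, NkeyError> {
        if s.len() != 56 {
            return Err(NkeyError::BadLength(s.len()));
        }
        let prefix = s.chars().next().expect("length checked above");
        // Assumption: only operator, account, and user keys are accepted here.
        if !matches!(prefix, 'O' | 'A' | 'U') {
            return Err(NkeyError::BadPrefix(prefix));
        }
        // Base32 alphabet: A-Z and 2-7.
        if let Some(bad) = s.chars().find(|&c| !matches!(c, 'A'..='Z' | '2'..='7')) {
            return Err(NkeyError::BadCharacter(bad));
        }
        Ok(Self(s.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

Because the inner `String` is private, holding an `NkeyPub` means the value went through `parse`, so a seed (prefix `S`) cannot be passed where a public key is expected.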
harmony-nats-callout (rewritten):
- AuthCalloutService: production service connecting to NATS, subscribing
to $SYS.REQ.USER.AUTH, dispatching auth requests
- AuthCalloutConfig with builder pattern
- handler.rs: pure auth request handler (decode → validate → mint →
respond) extracted from test
- Fix ZitadelValidator: validate() is now async (was blocking_read
deadlock in async contexts)
- Remove dead fields kid_map, jwks_uri
- Make danger_accept_invalid_certs configurable
- permissions: InterpolatedPermissions named struct instead of 4-tuple
integration-test-callout:
- Converted to lib+test crate: src/lib.rs exports test utilities
- Tests now exercise the REAL AuthCalloutService (not inline handler)
- Extracted MockOidcServer, NatsServer, CalloutContext into library
- Replace yasna with rsa crate for DER parsing
- Add Drop to NatsServer for container cleanup
- Add module constants for all magic values
- README updated with new architecture diagram
- nats-jwt crate: JWT builder types for user claims, authorization
request/response, account claims, algorithm encode/decode
- harmony-nats-callout crate: Zitadel OIDC JWT validator, callout
service scaffold, account manager (WIP)
- integration-test-callout: end-to-end test validating the full
auth callout flow — device connects with Zitadel JWT → callout
validates JWT → returns per-device user JWT with scoped
permissions → device can pub/sub on its own subjects only
- Mock OIDC server for test (JWKS + openid-configuration)
- Negative test: device A cannot subscribe to device B's subjects
- Added UserClaimsBuilder::audience() for account-scoped user JWTs
Two regressions from fc16e9f that ./build/check.sh catches:
1. `opnsense-api`'s `test_haproxy_deser` example references
`resp.haproxy` on the response wrapper. The regen auto-derived the
field name as `op_nsenseha_proxy` from the struct name. Need to pass
`--api-key haproxy` to keep the wrapper key stable.
2. For enums whose wire values aren't all-lowercase (e.g. `"SSLv3"`,
`"CONNECT"`), the emitted `From<&str>` matched `s.to_lowercase()`
against the original-case wire value, which clippy flags as
unreachable ("match arm has differing case"). Lowercase the wire
value in the emitted match arm so case-insensitive matching actually
works; serialization still emits the original-case wire value
because the serde module is unaffected.
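A hand-written sketch of what the fixed emission looks like for a mixed-case enum. `SslVersion`, the `TLSv1.2` variant, and `wire_value()` are hypothetical stand-ins for the generated code and its serde module; only `"SSLv3"` is taken from the commit.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SslVersion {
    Sslv3,
    Tlsv12,
    /// Forward-compat catch-all for values this build does not know.
    Other(String),
}

impl SslVersion {
    /// Stand-in for the serde module: always the original-case wire value.
    pub fn wire_value(&self) -> &str {
        match self {
            Self::Sslv3 => "SSLv3",
            Self::Tlsv12 => "TLSv1.2",
            Self::Other(s) => s.as_str(),
        }
    }
}

impl From<&str> for SslVersion {
    fn from(s: &str) -> Self {
        // The scrutinee is lowercased, so every arm must be lowercase too.
        // The broken emission had `"SSLv3" =>` here, which could never match.
        match s.to_lowercase().as_str() {
            "sslv3" => Self::Sslv3,
            "tlsv1.2" => Self::Tlsv12,
            // Assumption: unknown input keeps its original case.
            _ => Self::Other(s.to_string()),
        }
    }
}

impl From<String> for SslVersion {
    fn from(s: String) -> Self {
        Self::from(s.as_str())
    }
}
```

Parsing is case-insensitive while output stays canonical: `SslVersion::from("sslv3").wire_value()` yields `"SSLv3"`.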
Regenerated `haproxy.rs` via
`cargo run -p opnsense-codegen -- generate --xml ... --module-name haproxy --api-key haproxy`.
`./build/check.sh` now passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review feedback on the previous HAProxy field-default fixes:
the eight match blocks in `configure_service` that mapped loose strings
("get", "tcp", "roundrobin", ...) to generated OPNsense enum variants
were poor Rust — they duplicated the wire-value knowledge that the
codegen already has, and any new enum variant in OPNsense meant editing
every call site by hand.
- `opnsense-codegen/src/codegen.rs::generate_enum` now emits
`impl From<&str>` and `impl From<String>` for every generated enum,
right after the existing serde module. Lowercase-matches wire values;
unknown inputs fall through to the `Other(String)` variant the codegen
already emits for forward-compat round-tripping.
- `opnsense-api/src/generated/haproxy.rs` regenerated — 153 enums, 306
new impl blocks. No hand edits; re-run via
`cargo run -p opnsense-codegen -- generate --xml
opnsense-codegen/vendor/plugins/net/haproxy/src/opnsense/mvc/app/models/OPNsense/HAProxy/HAProxy.xml
--output-dir opnsense-api/src/generated --module-name haproxy`.
- `opnsense-config/src/modules/load_balancer.rs::configure_service`
replaces eight string-match blocks with one-liners:
`HealthcheckType::from(hc.check_type.as_str())` etc.
- Drive-by: fixed a pre-existing typo at
`harmony/src/infra/opnsense/load_balancer.rs:185` and the matching
reverse at `:149` — `SSL::SNI` was mapped to `"sslni"`, but the
OPNsense wire value is `"sslsni"`. Before this refactor the typo
silently hit `HealthcheckSsl::Other("sslni")`; the cleaner conversion
made the bug obvious so it's fixed here rather than left for a
follow-up.
Verification:
- `cargo check -p harmony -p opnsense-config -p opnsense-api` clean
- `cargo test -p harmony --lib okd::load_balancer` 6/6 pass
- `cargo test -p opnsense-codegen` 22/22 pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverting 5e72777. The HAProxy startup failure that motivated the
bind-to-FW-IP change was environment-specific on the sttest basement
firewall: OPNsense's "HTTP → HTTPS redirect" service (lighttpd bound to
`[::]:80`, dual-stack) was holding IPv4 port 80 via v4-mapped addresses
— invisible in `sockstat -l4` but still enough to make `0.0.0.0:80`
return EADDRINUSE to HAProxy.
Disabling the HTTP redirect on that firewall resolves the conflict.
Other OPNsense deployments already ship with the redirect off (or
HAProxy on non-conflicting ports), so `0.0.0.0` remains the correct
default.
This reverts commit 5e72777.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Binding HAProxy on 0.0.0.0 collided with OPNsense's own listeners
(HTTP→HTTPS redirect on :80, WebUI, etc.), preventing the HAProxy
service from starting once the LoadBalancer score was applied.
Use `topology.load_balancer.get_ip()` to bind each frontend on the
firewall's LAN interface IP instead. The `LoadBalancer` capability was
already in scope, so no new trait imports are needed.
The previous `0.0.0.0` rationale (avoiding CARP VIP rebind races) is
noted in a comment: HA CARP setups still need OPNsense's
`net.inet.ip.nonlocal_bind` or HAProxy `transparent` bind — not
addressed here.
Test module: added an inline `DummyLoadBalancer` stub (mirrors the
existing `DummyRouter` pattern) so `OKDLoadBalancerScore::new` no longer
hits `DummyInfra::get_ip`'s `unimplemented!()` panic. Renamed
`test_all_services_bind_on_unspecified_address` →
`test_all_services_bind_on_firewall_ip`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`configure_service` was relying on `..Default::default()` for most fields
of the generated HAProxy structs. That leaked OPNsense's *model defaults*
into the wire payload for fields Harmony never meant to default:
- `http_host` → `localhost` (sent `Host: localhost` on every check)
- `http_method` → `options` (sent OPTIONS instead of the declared method)
- `http_version` → `http10` (wanted NONE)
- `sslVerify` on real servers → `1` (broke self-signed backends)
- Healthcheck `ssl` was never propagated, so SSL-required checks like
kube-apiserver `/readyz` on 6443 stayed plain HTTP and never succeeded
Set every field explicitly from `LbHealthCheck`/`LbServer`: map
`http_method` through `HealthcheckHttpMethod`, pass `None` for
`http_version` (serializes as `""` = NONE), clear `http_host` to an empty
string, propagate `hc.ssl` through `HealthcheckSsl`, and pin
`ssl`/`sslVerify` to `false` on the server struct so intent is declared
at the call site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`LoadBalancerConfig::is_installed` previously collapsed every error from
the settings endpoint into `false`, so a timeout, DNS failure, or auth
rejection all looked identical to "os-haproxy not installed" — the
`LoadBalancer` score would then attempt to install the plugin on top of
an unreachable firewall and fail in cascade further down the pipeline.
Return `Result<bool, Error>` and treat only HTTP 404 (controller not
found) as "not installed". Every other error is propagated so
`ensure_initialized` fails the score immediately with a message pointing
at the real problem.
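The new contract can be sketched as follows. `ApiError` and the shape of the probe result are hypothetical; the real method queries the settings endpoint itself rather than taking a result as an argument.

```rust
/// Hypothetical error type standing in for the client's real one.
#[derive(Debug, PartialEq, Eq)]
pub enum ApiError {
    Http { status: u16 },
    Timeout,
    Dns(String),
}

/// Map the outcome of probing the settings endpoint to "is the plugin installed?".
/// Only HTTP 404 (controller not found) means "not installed";
/// every other failure is returned to the caller unchanged.
pub fn is_installed(probe: Result<(), ApiError>) -> Result<bool, ApiError> {
    match probe {
        Ok(()) => Ok(true),
        Err(ApiError::Http { status: 404 }) => Ok(false),
        Err(other) => Err(other),
    }
}
```

With the old `bool` return, `Err(ApiError::Timeout)` and `Err(ApiError::Http { status: 401 })` were indistinguishable from a missing plugin; here they stop the score before any install is attempted.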
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the Linux-specific BondMode enum with harmony_types'
LaggProtocol, which is already used by the OPNsense LAGG score.
"Capabilities are industry concepts, not tools" — the kernel mode
numbers (BalanceRr/ActiveBackup/…) were the wrong abstraction;
LaggProtocol's Lacp / Failover / LoadBalance / RoundRobin span
Linux bonding and BSD lagg uniformly. LaggProtocol now derives
Deserialize so NetworkConfig can round-trip through SQLite.
Make SqliteInventoryRepository::get_role_mapping tolerate a
network_config blob it cannot deserialize: log a warning and
fall back to NetworkConfig::default() so the operator still sees
the existing mapping prompt and can pick "Update" to overwrite
the bad row. This self-heals DBs that were written with the old
BondMode variant names and gives the repo real resilience for
future NetworkConfig evolutions.
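The fallback behaviour can be sketched without a database. Everything here is illustrative: a tiny string parser stands in for the real serde_json deserialization, `NetworkConfig` is reduced to a single assumed field, and the warning is returned alongside the config so the behaviour is observable.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LaggProtocol {
    Lacp,
    Failover,
    LoadBalance,
    RoundRobin,
}

/// Reduced stand-in for the real NetworkConfig (field name is assumed).
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct NetworkConfig {
    pub bond_protocol: Option<LaggProtocol>,
}

/// Stand-in for deserialization: fails on variant names it does not know,
/// such as the old BondMode names.
fn parse_network_config(blob: &str) -> Result<NetworkConfig, String> {
    let bond_protocol = match blob.trim() {
        "" => None,
        "Lacp" => Some(LaggProtocol::Lacp),
        "Failover" => Some(LaggProtocol::Failover),
        "LoadBalance" => Some(LaggProtocol::LoadBalance),
        "RoundRobin" => Some(LaggProtocol::RoundRobin),
        other => return Err(format!("unknown variant `{}`", other)),
    };
    Ok(NetworkConfig { bond_protocol })
}

/// Never fails: an unreadable blob yields the default config plus a warning,
/// so the caller can still show the mapping prompt and offer "Update".
pub fn load_network_config(blob: &str) -> (NetworkConfig, Option<String>) {
    match parse_network_config(blob) {
        Ok(cfg) => (cfg, None),
        Err(e) => {
            let warning = format!("unreadable network_config ({}); using default", e);
            eprintln!("WARN {}", warning);
            (NetworkConfig::default(), Some(warning))
        }
    }
}
```

A row written with the old `ActiveBackup` name therefore loads as the default config with a warning instead of aborting the lookup.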
- Switch SqliteInventoryRepository to DELETE journal mode with
create_if_missing, so `.sqlite-wal` / `.sqlite-shm` files no longer
appear next to the DB. Existing WAL-mode DBs are checkpointed and
converted on next open.
- Print a blank line after prompt_network_config returns so the save
logs don't stomp on the last answered question.
- host_role_mapping now holds at most one row per host_id.
SqliteInventoryRepository::save_role_mapping wraps a DELETE of any
prior rows for the host and the INSERT of the new one in a single
transaction, self-healing pre-existing duplicate rows along the way.
- Before re-prompting for disk and networking, the discovery flow
looks up the current role mapping via the new
InventoryRepository::get_role_mapping(host_id) method. If one
exists, the operator sees a summary (role, install disk, bond
mode + interfaces, blacklist) and picks between "Update" and
"Cancel"; cancelling skips the host entirely and continues the
selection loop without touching the DB. New HostRoleMapping
domain type carries the returned row back to the caller.
- Network interfaces are sorted by name at the hwinfo-to-domain
conversion step (both MDNS and CIDR flows), so f0 always appears
before f1 in every downstream consumer — host summary, bond
multi-select, blacklist multi-select. This also makes the
byte-equality dedup in save() robust against the agent returning
NICs in different sysfs-walk order across reboots.
- PhysicalHost::summary() split into summary_parts_through_storage()
+ append_network_summary(), with a new public summary_short()
variant that omits the NIC list. print_host_header() in the
discovery prompts now uses summary_short() so the "Host: ..."
banner fits on one line; full summaries still render in the node
picker, logs, and Display impl.
- Fix CPU summary rendering when the agent reports an empty model:
single-CPU renders as "6c/6t", multi-CPU as "2x CPU (12c/24t)",
no stray double-space in the pipe-separated summary.
- Regenerate .sqlx offline cache for the new DELETE and SELECT
queries.
- SqliteInventoryRepository::save() now compares the incoming
serde_json bytes against the latest stored `data` blob for this
host_id. If byte-identical, the insert is skipped with an info log
"Host '<id>' unchanged, skipping save". Genuine changes still
produce a new version row, preserving the audit trail. Eliminates
the unbounded row growth from repeated discovery (mDNS is
continuous, CIDR scans often re-run). Addresses the long-standing
FIXME in modules/inventory; the comment is now removed.
- Reworded the caller-side log that fires after repo.save() from
"Saved [new] host id X, summary: ..." to "Discovered host X,
summary: ...". The old text claimed "Saved" even when the repo had
actually skipped the insert, producing contradictory log lines on
re-runs.
- Harmonized every host-specific inquire prompt in the discovery
flow behind a new print_host_header() helper: each prompt is now
preceded by a blank line and a "Host: <summary>" banner, and the
redundant host name inside the question text is stripped (disk
prompt, bond confirm). The node-selection prompt is unchanged --
it picks *which* host, so there is no current host yet.
- PhysicalHost::summary() becomes terser and more informative:
- Storage: "400 GB [8 GB, 477 GB]" (was "400 GB Storage (2 Disks [8 GB, 477 GB])").
Single-disk collapses to just the total.
- Network: list every NIC as "[ip, mac]" with a count prefix
(e.g. "3 NICs: [192.168.40.10, 98:fa:9b:03:17:6f], [00:e0:ed:7a:ec:4d], ...").
Single-NIC form drops the count and "s": "NIC: [ip, mac]".
NICs without an IPv4 render as "[mac]".
- Promote the inventory agent's Chipset { vendor, name } into a
"system-product-name" label during host conversion (both MDNS and CIDR
flows), so summary()'s first field shows "LENOVO 3136" instead of
falling back to the HostCategory string ("Server"). Extracted into
build_discovered_host_labels() to keep the two conversion sites in
sync. When the chipset is blank, the old category fallback still
applies.
- Print a blank line before every interactive inquire prompt in the
discovery flow (role pick, disk pick, bond confirm/multi-select/mode,
blacklist confirm/multi-select) so prompts stand out from the
preceding log output on the terminal.
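The interaction between NIC sorting and the byte-equality dedup in `save()` described above can be sketched with an in-memory stand-in for the SQLite repository. `Nic`, `encode_host`, and `InMemoryRepo` are hypothetical, and the hand-rolled encoding replaces the real serde_json bytes.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Nic {
    pub name: String,
    pub mac: String,
}

/// Sort NICs by name before encoding, so the bytes do not depend on
/// the order in which the agent happened to enumerate interfaces.
pub fn encode_host(host_id: &str, mut nics: Vec<Nic>) -> Vec<u8> {
    nics.sort_by(|a, b| a.name.cmp(&b.name));
    let mut out = host_id.to_string();
    for nic in &nics {
        out.push_str(&format!("|{}={}", nic.name, nic.mac));
    }
    out.into_bytes()
}

/// Versioned store: one list of blobs per host, newest last.
#[derive(Default)]
pub struct InMemoryRepo {
    versions: HashMap<String, Vec<Vec<u8>>>,
}

impl InMemoryRepo {
    /// Returns true when a new version row was written,
    /// false when the blob was byte-identical to the latest one.
    pub fn save(&mut self, host_id: &str, data: Vec<u8>) -> bool {
        let rows = self.versions.entry(host_id.to_string()).or_default();
        if rows.last() == Some(&data) {
            return false; // unchanged, skip the insert
        }
        rows.push(data);
        true
    }

    pub fn version_count(&self, host_id: &str) -> usize {
        self.versions.get(host_id).map_or(0, |rows| rows.len())
    }
}
```

Without the sort, the same host reported as `[f1, f0]` on one scan and `[f0, f1]` on the next would encode differently and defeat the dedup.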
Extend DiscoverHostForRoleScore with three new interactive prompts after
the installation-disk selection:
- "Configure a network bond?" (only when host has >= 2 NICs), followed by
a multi-select of bond members (min 2) and a bond-mode picker
(LACP / active-backup / balance-rr / balance-xor / broadcast /
balance-tlb / balance-alb).
- "Blacklist any remaining interface?", with candidates limited to NICs
not already claimed by the bond.
The answers are persisted as a JSON-encoded NetworkConfig on a new
host_role_mapping.network_config column. HostConfig now exposes
network_config alongside installation_device so downstream scores can
honor the user's intent.
Also adds a new harmony_host_discovery example that discovers a single
host on 192.168.40.0/24, port 25000.