From c71fde43449354c08a8b15350c0d301ca39782b6 Mon Sep 17 00:00:00 2001 From: Jean-Gabriel Gill-Couture Date: Mon, 1 Jun 2026 14:59:07 -0400 Subject: [PATCH 1/3] docs: ADR on how to handle securely fleet device secrets with openbao + zitadel SSO --- .../device_secret_access_handoff.md | 540 ++++++++++++++++++ docs/adr/025-fleet-device-secret-access.md | 312 ++++++++++ 2 files changed, 852 insertions(+) create mode 100644 ROADMAP/fleet_platform/device_secret_access_handoff.md create mode 100644 docs/adr/025-fleet-device-secret-access.md diff --git a/ROADMAP/fleet_platform/device_secret_access_handoff.md b/ROADMAP/fleet_platform/device_secret_access_handoff.md new file mode 100644 index 00000000..985bbdb1 --- /dev/null +++ b/ROADMAP/fleet_platform/device_secret_access_handoff.md @@ -0,0 +1,540 @@ +# Fleet device secret access — implementation handoff + +**Owner:** TBD (assignee) +**Design:** [`docs/adr/025-fleet-device-secret-access.md`](../../docs/adr/025-fleet-device-secret-access.md) +**Status:** Ready to start +**Written:** 2026-06-01 + +Read ADR-025 first. This document is the work plan, not the design. + +## What we are building, in one paragraph + +Fleet agents already authenticate to NATS using a Zitadel machine-user +JWT. The same identity, with no new device-side secret, will now also +authenticate them to OpenBao via Vault's JWT auth method. Per-deployment +scope (which secrets a device can read) is carried in the JWT itself as +a `deployments` array claim, populated by Zitadel from operator-managed +user metadata. OpenBao's `groups_claim` binds the claim values to +auto-created external groups, each carrying a small per-deployment +policy granting `read` on `harmony-fleet/data//*`. A new +agent-side `lifecycle/` module sits between NATS handling and the +reconciler, refreshing both tokens before any deployment the agent +hasn't seen secrets for is reconciled. 
+ +## Architecture at a glance + +``` + Fleet operator (one per fleet, has Zitadel admin scope) + │ + ├── on new Deployment CR: + │ 1. Zitadel: append deployment to device.metadata.deployments + │ 2. OpenBao: upsert external group + policy + │ fleet-deployment- → read on + │ harmony-fleet/data//* + │ 3. NATS: publish desired-state.. + │ + ▼ + Fleet agent (has Zitadel machine key from FleetDeviceSetupScore) + │ + ├── NATS handler (main.rs) ── delivers KV event ──┐ + │ │ + │ ▼ + │ lifecycle/ ← NEW + │ │ + │ if desired ⊄ cached_scope ──┤ + │ │ + │ secrets/ │ ← NEW + │ ├─ invalidate Zitadel token + │ ├─ invalidate Bao token + │ ├─ re-mint Zitadel access token + │ └─ POST /v1/auth/jwt/login + │ │ + │ ▼ + │ reconciler.apply(…) (existing) +``` + +## Pre-flight (do these once, before any PR) + +- [ ] Read ADR-025 in full. Surface any objection now, not after PR-1 + lands. +- [ ] Confirm in staging: the existing fleet Zitadel project has + a service account usable by the operator with **write user + metadata** scope. If not, file a sub-task to provision it. +- [ ] Confirm OpenBao staging instance has root credentials accessible + to the operator deploy path (it already does via + `OpenbaoSetupScore`; verify versioned KV mount `harmony-fleet` + exists or plan to enable). +- [ ] Decide deployment ID format. Recommended: the existing + `DeploymentName` newtype (`harmony_reconciler_contracts::fleet`). + Validate that its allowed character set is safe for both + OpenBao paths and Vault group names (`[a-zA-Z0-9_-]`). + +## Work breakdown + +Six PRs, sequenced where they must be and parallel where they can be. +Each is sized to land in a few days; none requires the next to be +designed before it can start. 
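The pre-flight item about the deployment ID character set could be checked with a small helper like the one below. This is only a sketch: `is_safe_deployment_id` is a hypothetical name, not something in `harmony_reconciler_contracts`, and it assumes the `[a-zA-Z0-9_-]` set from the checklist is the whole rule (no length limit, no reserved names).

```rust
/// Returns true when `id` is non-empty and every byte is in `[a-zA-Z0-9_-]`,
/// the set the pre-flight checklist treats as safe for OpenBao paths and
/// group names.
///
/// Hypothetical helper for illustration; the real `DeploymentName` newtype
/// may enforce different or additional rules.
pub fn is_safe_deployment_id(id: &str) -> bool {
    !id.is_empty()
        && id
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}
```

A check like this would most naturally live next to the newtype's constructor, so an ID containing `/` or `*` is rejected before it ever reaches a policy path.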
+ +--- + +### PR-1 — Extend `OpenbaoJwtAuth` with `bound_claims_json` + `groups_claim` + +**Crate:** `harmony` (`src/modules/openbao/setup.rs`) +**Depends on:** nothing +**Blocks:** PR-3, PR-5 + +**Goal.** Make the existing JWT auth Score expressive enough to bind +to a Zitadel project audience, require the `fleet-device` role, and +emit per-deployment external-group aliases. + +**Changes.** + +```rust +// fleet/harmony/src/modules/openbao/setup.rs +pub struct OpenbaoJwtAuth { + pub oidc_discovery_url: String, + pub bound_issuer: String, + pub role_name: String, + pub bound_audiences: String, + pub user_claim: String, + pub policies: Vec, + pub ttl: String, + pub max_ttl: String, + + // NEW — both optional, both default to empty + /// JSON string passed to `bao write auth/jwt/role/ bound_claims=…`. + /// Empty means no claim binding (current behavior). + #[serde(default)] + pub bound_claims_json: String, + + /// Claim name to read for group aliases. When set, OpenBao reads + /// this claim as an array of strings and creates one external- + /// group alias per element on each login. + #[serde(default)] + pub groups_claim: String, +} +``` + +Extend `configure_jwt` in the same file to pass `bound_claims=...` and +`groups_claim=...` to the `bao write auth/jwt/role/...` call when those +fields are non-empty. + +**Tests.** + +- Unit: serialize/deserialize round-trip with both new fields populated + and both empty. +- Integration (existing harness if any; otherwise just compile): the + field plumbs through to the `bao write` command. Inspect the command + string in the test if the existing test harness allows. + +**Acceptance.** `cargo check -p harmony --all-features` clean. Existing +callers (which leave both new fields at default) compile and behave +identically. `examples/openbao` still works. 
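The "only when non-empty" plumbing described for `configure_jwt` could look roughly like this. It is a sketch under assumptions: `jwt_role_extra_args` is a hypothetical helper, and it assumes the role is configured through `key=value` arguments as the text describes; the real function may assemble its command differently.

```rust
/// Builds the optional `key=value` arguments for the JWT role write.
/// An empty field produces no argument, so existing callers that leave
/// both fields at their defaults get exactly the old argument list.
///
/// Hypothetical helper for illustration only.
pub fn jwt_role_extra_args(bound_claims_json: &str, groups_claim: &str) -> Vec<String> {
    let mut args = Vec::new();
    if !bound_claims_json.is_empty() {
        args.push(format!("bound_claims={bound_claims_json}"));
    }
    if !groups_claim.is_empty() {
        args.push(format!("groups_claim={groups_claim}"));
    }
    args
}
```

Keeping this as a pure function would make the "inspect the command string" test above possible without a running Bao.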
+ +--- + +### PR-2 — Zitadel Action for the `deployments` claim + +**Crate:** `harmony` (probably a new module under +`src/modules/zitadel/` or `src/modules/sso/`, follow whatever the +existing Zitadel deployment code uses) +**Depends on:** nothing +**Blocks:** PR-3, PR-4 + +**Goal.** A Score that declaratively ensures the pre-access-token- +creation Action exists on the configured Zitadel project, with the +script that copies `user.metadata.deployments` into a top-level +`deployments` claim. + +**Action script** (canonical text, do not modify casually): + +```javascript +function addDeployments(ctx, api) { + const md = ctx.v1.user.getMetadata(); + const entry = md.metadata.find(m => m.key === "deployments"); + if (!entry) return; + try { + const deployments = JSON.parse( + Buffer.from(entry.value, "base64").toString("utf-8") + ); + if (Array.isArray(deployments)) { + api.v1.claims.setClaim("deployments", deployments); + } + } catch (_) { /* malformed metadata → no deployments */ } +} +``` + +**Score shape.** + +```rust +pub struct ZitadelDeploymentsClaimActionScore { + pub project_id: String, + pub action_name: String, // default "fleet-deployments-claim" + pub flow_type: FlowType, // ComplementToken + pub trigger_type: TriggerType, // PreAccessTokenCreation +} +``` + +The interpret calls Zitadel's Management API to upsert the Action and +attach it to the `Complement Token / PreAccessTokenCreation` trigger. +Idempotent. + +**Tests.** + +- Unit: serialization, defaults. +- E2E: standing up a fresh Zitadel and applying the Score yields a + token whose `deployments` claim matches the metadata value. Use the + existing fleet e2e Zitadel as a fixture if available; otherwise + document as a manual verification step in staging. + +**Acceptance.** Manually mint a token in staging after applying the +Score; decode at jwt.io and confirm the `deployments` claim is +present. 
+ +--- + +### PR-3 — `ZitadelDeviceDeploymentsScore` and operator wiring + +**Crate:** `harmony` for the Score; `fleet/harmony-fleet-operator` for +the wiring. +**Depends on:** PR-2 (the Action must exist for the metadata to surface +in tokens). +**Blocks:** PR-6. + +**Goal.** Operator writes membership to Zitadel as the **first** of +its three writes on every new/changed deployment. + +**Score shape.** + +```rust +pub struct ZitadelDeviceDeploymentsScore { + pub project_id: String, + pub device_user_id: String, // Zitadel machine user ID + pub deployments: Vec, // declarative full set +} +``` + +Interpret semantics: **declarative replace**, not append. The Score +reads the current `deployments` metadata value, compares it to the +desired set, and writes only if they differ. This makes removal +trivially expressible: the operator declares the new set, the Score +diffs and writes. + +**Operator changes** (`fleet-aggregator.rs` or `device_reconciler.rs`, +wherever the per-`Deployment` reconciliation lives). On reconciling a +`Deployment` CR: + +1. Compute the desired `device → deployments` map for every device + targeted by the deployment. +2. For each device, compose and run `ZitadelDeviceDeploymentsScore` + with the desired set. +3. Continue to the OpenBao step (PR-4) and only then to the existing + NATS publish. + +Errors at step 2 must surface as a status condition on the CR; do not +proceed to NATS if the Zitadel write fails. + +**Tests.** + +- Unit on the Score: diff logic (no-op when sets match, write when + they differ). +- Integration on the operator: a fake Zitadel client receives the + expected write; the order against the existing NATS publish is + preserved. + +**Acceptance.** Creating a `Deployment` CR in staging causes the +target device's Zitadel metadata to contain the deployment ID. + +--- + +### PR-4 — Operator: per-deployment OpenBao group + policy + +**Crate:** `harmony` (`src/modules/openbao/`) for the Score; operator +for wiring. 
+**Depends on:** PR-1. +**Blocks:** PR-6. + +**Goal.** When a deployment exists, ensure +`identity/group/` and `fleet-deployment-` exist with +the right shape. + +**Score shape.** + +```rust +pub struct OpenbaoDeploymentSecretScopeScore { + pub instance: OpenbaoInstance, + pub kv_mount: String, // "harmony-fleet" + pub deployment_id: String, + pub jwt_auth_accessor: String, // resolved from PR-1 auth role +} +``` + +Interpret produces: + +- Policy `fleet-deployment-`: + + ```hcl + path "harmony-fleet/data//*" { + capabilities = ["read"] + } + path "harmony-fleet/metadata//*" { + capabilities = ["read", "list"] + } + ``` + +- External group `` (`type=external`) with that policy + attached and an alias `name=, canonical_id=, + mount_accessor=`. + +Idempotent: existing groups/policies with matching content noop. + +**Operator wiring.** Same call site as PR-3, executed *after* the +Zitadel write and *before* the NATS publish. + +**Tests.** + +- Unit on policy text generation. +- Integration: against a real Bao (existing `examples/openbao` is the + fixture). Apply the Score twice; second run is a noop. + +**Acceptance.** Manually log in to staging Bao with a device's JWT +(see PR-6 for the agent path; for this PR, use `curl`) and confirm +the resulting Vault token has the expected policies attached. + +--- + +### PR-5 — Agent `secrets/` module + +**Crate:** `fleet/harmony-fleet-agent` (new module: `src/secrets/`). +**Depends on:** PR-1 (needs the role configured to be testable end-to-end; +the code itself does not depend on PR-1). +**Blocks:** PR-6. + +**Goal.** A `SecretsClient` that: + +- Holds a reference to the existing `CredentialSource` (the agent + already mints Zitadel tokens for NATS via + `fleet/harmony-fleet-auth/src/credentials.rs`). +- Exposes `read(deployment_id, secret_name) -> Result>`. +- Exposes `cached_scope() -> HashSet`. +- Exposes `refresh() -> Result<()>` which drops both caches and + re-authenticates. 
+ +**Sketch.** + +```rust +// fleet/harmony-fleet-agent/src/secrets/mod.rs +pub struct SecretsClient { + bao_addr: String, + bao_role: String, + credentials: Arc, // existing Zitadel source + http: reqwest::Client, + state: Mutex>, +} + +struct BaoSession { + vault_token: String, + expires_at: i64, + deployments: HashSet, // computed from token lookup +} + +impl SecretsClient { + pub async fn cached_scope(&self) -> HashSet { … } + + pub async fn refresh(&self) -> Result<()> { + // 1. drop the Bao session + *self.state.lock().await = None; + // 2. force a fresh Zitadel access token + self.credentials.invalidate_cache(); // small new API on CredentialSource + let jwt = self.credentials.bearer_token().await?; + // 3. POST /v1/auth/jwt/login { role, jwt } + let session = login_to_bao(&self.http, &self.bao_addr, &self.bao_role, &jwt).await?; + *self.state.lock().await = Some(session); + Ok(()) + } + + pub async fn read(&self, dep: &DeploymentName, name: &str) -> Result> { + let session = self.ensure_session().await?; + let path = format!("harmony-fleet/data/{dep}/{name}"); + bao_kv_get(&self.http, &self.bao_addr, &session.vault_token, &path).await + } +} +``` + +**Small change to `harmony-fleet-auth`.** Add a public method on +`CredentialSource` to force-invalidate the Zitadel cache: + +```rust +// fleet/harmony-fleet-auth/src/credentials.rs +impl CredentialSource { + pub fn invalidate_cache(&self) { + if let Self::ZitadelJwt { cache, .. } = self { + if let Ok(mut g) = cache.lock() { *g = None; } + } + } +} +``` + +**Tests.** + +- Unit: `cached_scope` correctness across `refresh` cycles using a + fake HTTP client. +- Unit: `refresh` is concurrency-safe (two concurrent calls do not + produce two sessions; one wins, the other observes the new state). +- Integration (e2e): against the existing fleet-staging Bao, run a + short test binary that mints a Zitadel token and reads a secret. 
+ +**Acceptance.** `cargo test -p harmony-fleet-agent` clean; manual e2e +read of a placeholder secret in staging. + +--- + +### PR-6 — Agent `lifecycle/` module + main.rs wiring + +**Crate:** `fleet/harmony-fleet-agent`. +**Depends on:** PR-5. + +**Goal.** The orchestration layer that calls `secrets.refresh()` +before `reconciler.apply()` whenever the desired-state introduces a +deployment whose secrets are outside the cached scope. + +**Module shape.** + +```rust +// fleet/harmony-fleet-agent/src/lifecycle/mod.rs +pub struct DeploymentLifecycle { + secrets: Arc, + reconciler: Arc, +} + +impl DeploymentLifecycle { + pub async fn on_apply(&self, key: &str, value: &[u8]) -> Result<()> { + let dep = deployment_from_key(key) + .context("desired-state key without deployment component")?; + if !self.secrets.cached_scope().await.contains(&dep) { + tracing::info!(%dep, "deployment outside cached secret scope — refreshing"); + self.secrets.refresh().await + .with_context(|| format!("refresh secrets for {dep}"))?; + if !self.secrets.cached_scope().await.contains(&dep) { + anyhow::bail!( + "deployment {dep} still missing from token scope after refresh — \ + operator may not have written Zitadel metadata yet" + ); + } + } + self.reconciler.apply(key, value).await + } + + pub async fn on_remove(&self, key: &str) -> Result<()> { + // No refresh needed for removal; the agent loses access naturally + // when its claim no longer includes the deployment. 
+ self.reconciler.remove(key).await + } +} +``` + +**`main.rs` change.** Replace direct calls to `reconciler.apply` / +`reconciler.remove` in the JetStream KV watcher with the lifecycle +equivalents: + +```rust +// fleet/harmony-fleet-agent/src/main.rs (around line 110-120) +async_nats::jetstream::kv::Operation::Put => { + if let Err(e) = lifecycle.on_apply(&entry.key, &entry.value).await { + tracing::warn!(key = %entry.key, error = %e, "lifecycle apply failed"); + } +} +async_nats::jetstream::kv::Operation::Delete +| async_nats::jetstream::kv::Operation::Purge => { + if let Err(e) = lifecycle.on_remove(&entry.key).await { + tracing::warn!(key = %entry.key, error = %e, "lifecycle remove failed"); + } +} +``` + +**Retry on "still missing after refresh."** A race remains: operator +publishes to NATS before Zitadel propagates the metadata write to the +token issuer (unlikely if the operator runs the steps in order, but +the boundary is asynchronous). Handle it as a bounded retry with +backoff at the call site, not inside the lifecycle module: + +```rust +// in main.rs handler +let res = retry_with_backoff( + || lifecycle.on_apply(&entry.key, &entry.value), + max_attempts = 3, + initial = Duration::from_secs(2), +); +``` + +**Tests.** + +- Unit: `on_apply` with a fake `SecretsClient` and fake `Reconciler`, + exercising (a) cached deployment → no refresh, reconciler runs; (b) + uncached deployment → refresh runs, reconciler runs; (c) refresh + succeeds but scope still missing → reconciler does NOT run, error + returned. +- E2E: fresh device + new deployment in staging works end-to-end. + +**Acceptance.** Manually rolling out a new deployment to a device in +staging shows the agent's logs: "deployment X outside cached scope — +refreshing" once, then a successful reconcile. + +--- + +## Integration milestone — staging dry run + +After all six PRs land: + +- [ ] Apply the new `OpenbaoJwtAuth` config to staging Bao via the + existing `OpenbaoSetupScore`. 
+- [ ] Apply `ZitadelDeploymentsClaimActionScore` to staging Zitadel. +- [ ] Manually provision metadata + policy for one test deployment + + one test device. Confirm the agent can read its secret. +- [ ] Roll out via the operator: create a `Deployment` CR, confirm + the operator does the three writes in order and the agent + reconciles cleanly. +- [ ] Negative test: try to read a deployment's secret from a device + not in that deployment. Confirm Bao denies it. +- [ ] Negative test: a device's JWT minted before adding it to a + deployment cannot read the new deployment's secret until the + next `refresh`. Verify the refresh path triggers automatically + on the NATS desired-state arrival. + +## Decisions deferred (open questions) + +These are not blockers; flag them when you hit them. + +1. **Vault token TTL.** ADR-025 recommends 15 min. Confirm this works + with the agent's longest expected reconcile window and adjust + `OpenbaoJwtAuth::ttl` in the fleet setup if not. +2. **Hard revocation on deployment removal.** ADR-025 leaves this to + "wait for TTL." If a future fleet requires immediate revocation + (e.g. a compromised device), we add an + `OpenbaoTokenRevokeOnRemoval` companion Score. Don't pre-build it. +3. **What happens if Bao is down at agent startup.** The agent should + not crash; it should retry the auth on next reconcile attempt. + `SecretsClient::ensure_session` returning an error is fine — the + lifecycle layer propagates it and the NATS handler logs and + continues. Confirm this behavior in the integration milestone. +4. **Should `harmony-fleet-auth` move?** PR-5 adds an + `invalidate_cache()` method to `CredentialSource`. If we end up + with more Bao-specific code in that crate, consider whether the + secrets client belongs there instead of in + `harmony-fleet-agent`. Defer until at least one other caller (e.g. + the operator reading its own secrets) appears. 
+ +## Required reading before starting + +- `docs/adr/025-fleet-device-secret-access.md` — the design. +- `docs/adr/023-deploy-architecture.md` — Score discipline for the + new deploy-side artifacts. +- `nats/callout/src/zitadel.rs` — the existing JWT validation pattern; + the JWT auth role config mirrors it. +- `fleet/harmony-fleet-auth/src/credentials.rs` — the existing + Zitadel JWT-bearer mint path. PR-5 plugs into the same + `CredentialSource`. +- `harmony/src/modules/openbao/setup.rs` — the JWT auth Score being + extended in PR-1. +- `fleet/harmony-fleet-agent/src/main.rs` and `reconciler.rs` — the + agent shape PR-6 modifies. diff --git a/docs/adr/025-fleet-device-secret-access.md b/docs/adr/025-fleet-device-secret-access.md new file mode 100644 index 00000000..840305b4 --- /dev/null +++ b/docs/adr/025-fleet-device-secret-access.md @@ -0,0 +1,312 @@ +# Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT + +Initial Author: Jean-Gabriel Gill-Couture + +Initial Date: 2026-06-01 + +Last Updated Date: 2026-06-01 + +## Status + +Proposed + +## Context + +Fleet agents on devices need to read per-deployment secrets (image-pull +credentials, application secrets, etc.) from OpenBao. The agent already +holds one durable secret: a Zitadel machine-user JWT keyfile dropped by +`FleetDeviceSetupScore`. That key is the basis for the agent's existing +NATS authentication (`nats/callout` validates the Zitadel-minted access +token; `fleet/harmony-fleet-auth/src/credentials.rs` mints it via the +RFC 7523 JWT-bearer flow). + +Three requirements shape the design: + +1. **No new device-side secret.** The Zitadel machine key is already the + single root of trust on a device; the secret-access path must derive + from the same key, not introduce a second one. + +2. **Per-deployment isolation, enforced cryptographically.** A device + enrolled in deployments A and B reads only `A`'s and `B`'s secrets. + A device that hosts no deployments reads nothing. 
The device cannot + widen its own scope; only the operator can change membership. + +3. **Cross-project safety.** A second Zitadel project (a different + tenant, a different fleet, a malicious org) must not be able to + produce a token that OpenBao accepts. The trust boundary is the + project, not the deployment. + +The kubelet analogy is the architectural north star: the agent is a +small runtime that learns its workload (and the credentials needed to +run it) from a control-plane authority. The agent never decides what it +is allowed to run or read; it presents a signed identity and the +infrastructure decides. + +## Decision + +Three coordinating pieces. + +### 1. OpenBao JWT auth bound to the Zitadel project + +OpenBao's JWT auth method validates incoming tokens against Zitadel's +OIDC discovery URL (JWKS). One auth role per fleet, configured against +**one** Zitadel project: + +``` +bound_issuer = +bound_audiences = +bound_claims = { "urn:zitadel:iam:org:project:roles": "fleet-device" } +user_claim = sub +groups_claim = deployments +``` + +`bound_audiences` is the project boundary. A token minted in any other +Zitadel project has a different `aud` claim and is rejected before any +membership claim is read. This is the same defense +`nats/callout/src/zitadel.rs` already applies via `set_audience`. + +`groups_claim = deployments` instructs OpenBao to read the JWT's +`deployments` array and bind the resulting Vault token to one external +group per element. Each external group carries a per-deployment policy +granting `read` on `harmony-fleet/data//*`. + +### 2. Operator-managed Zitadel metadata as the membership source of truth + +The fleet operator is the only writer of `user.metadata.deployments` +on each device's Zitadel machine user. A Zitadel pre-access-token-creation +**Action** copies that metadata into a top-level `deployments` claim on +the access token. The device never touches its own metadata. 
+ +When a new `Deployment` CR is observed in Kubernetes, the operator +executes three writes in a strict order: + +1. **Zitadel metadata** — append the deployment ID to the device's + `deployments` array (per device targeted by the deployment). +2. **OpenBao external group + policy** — upsert + `identity/group/` (`type=external`, alias matching the + JWT-auth accessor) and policy + `fleet-deployment-` granting + `read` on `harmony-fleet/data//*`. +3. **NATS desired-state** — publish + `desired-state..` with the workload score. + +Reversed, the agent could see the desired-state, attempt a re-auth, +and find the deployment missing from its claims — a "permission denied +for a deployment I was told to run" race that is confusing to debug +and weakens the trust story. Trust state always precedes the workload +signal. + +Removal runs in reverse: NATS delete → (optional) group/policy delete → +metadata removal. Currently-cached Vault tokens retain access until +their short TTL expires; explicit revocation is available via +`bao token revoke` on the device's accessor if hard revocation is +needed. + +### 3. Agent-side lifecycle layer between NATS and the reconciler + +The agent gains a new module — `lifecycle/` — sitting between the NATS +KV watcher in `main.rs` and `Reconciler::apply()`. NATS handlers no +longer call the reconciler directly; they call the lifecycle handler, +which orchestrates the steps a new (or changed) deployment requires: + +``` +on_desired_state_changed(device, desired_deployments): + if not desired_deployments ⊂ secrets.cached_scope(): + secrets.refresh() # drops Bao + Zitadel tokens, re-mints both + reconciler.apply(...) 
# cannot start until refresh returns Ok +``` + +The refresh is a single atomic action against the `secrets/` module: it +invalidates both the cached Bao Vault token and the cached Zitadel +access token, re-mints a fresh Zitadel token (so the new `deployments` +claim is reflected), and re-runs `/auth/jwt/login` to obtain a fresh +Vault token. The reconciler runs only after the refresh succeeds. + +This layer has no other reason to exist *today*, but it is the +canonical home for any future "auxiliary work that must happen before a +deployment can run" — fetching image-pull credentials, registering +with monitoring, requesting a per-deployment NATS credential, etc. +Locking that home now avoids those concerns leaking into either the +NATS handler or the reconciler when they arrive. + +### Secret path layout + +``` +harmony-fleet/data// +``` + +The Zitadel project ID does **not** appear in the path. Its job is +done at the JWT validation boundary (`bound_audiences`), not repeated +in every key. + +## Rationale + +**Why Zitadel project ID lives in `bound_audiences`, not the path.** +The same trust assertion in two places is duplication, not defense in +depth — both reduce to "the JWT signature is valid for this audience." +Concentrating it at the auth role: + +- gives one source of truth ("which project owns this Bao instance"); +- keeps secret paths readable and operator-friendly; +- decouples secret organization from Zitadel project identity (a + project ID rotation reconfigures one Bao role, not every path). + +**Why user metadata over project roles for deployment membership.** +Project roles in Zitadel live in a flat namespace inside a project. +A handful of roles (`fleet-admin`, `fleet-device`) maps cleanly; one +role per deployment would not — role inventories at hundreds of +deployments per fleet become hard to audit and slow to mutate. +User metadata is a per-machine-user JSON store, naturally +multi-valued, and admin-only-writable. 
The Zitadel Action that copies +metadata to a claim is a one-time, fleet-wide piece of configuration. + +**Why `groups_claim` over claim-templated paths.** Vault policy +templating (`{{identity.entity.aliases…metadata.}}`) supports +single-value substitution but not iteration over an array. Multiple +deployments per device require either multiple JWT logins (one per +deployment) or one login that resolves to multiple policies. +`groups_claim` + external groups gives the latter cleanly: one login, +N policies attached automatically. + +**Why the lifecycle layer, separately from NATS handling.** The +trigger today is NATS, but the *operation* — "make this device ready +for a new deployment" — is a domain concept that may have other +triggers (admin command, local config reload, agent restart with stale +cache, future control-plane signal). Tying the orchestration to the +NATS message handler conflates transport with business logic. The +layer also makes the refresh-then-reconcile ordering an explicit +property of the agent's architecture, testable with fakes and visible +in the code. + +## Consequences + +**Pros** + +- One auth root on a device (the existing Zitadel machine key) covers + both NATS and OpenBao access. Rotation, revocation, and inventory + remain centralized. +- The operator owns membership; the agent owns identity. A compromised + device cannot widen its own access. A compromised operator's blast + radius is its own fleet (one Zitadel project, one Bao instance). +- Per-deployment policies are mechanical to generate. Bao policy text + is identical modulo the deployment ID, produced by a small templated + Score. New deployments add one external group + one policy; no + hand-written ACLs. +- The lifecycle layer is a reusable home for future + "before-reconcile" work without further architectural changes. 
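The point that per-deployment policy text is identical modulo the deployment ID can be sketched as a small pure function. This is an illustration only: `deployment_policy_hcl` is a hypothetical name, and the two path rules follow the data/metadata layout given in the handoff plan, assuming a versioned KV mount.

```rust
/// Renders the policy text for one deployment. Only `deployment_id`
/// varies between deployments; everything else is fixed template text.
///
/// Hypothetical helper for illustration; the real Score may render or
/// name things differently.
pub fn deployment_policy_hcl(kv_mount: &str, deployment_id: &str) -> String {
    format!(
        "path \"{kv_mount}/data/{deployment_id}/*\" {{\n  capabilities = [\"read\"]\n}}\n\
         path \"{kv_mount}/metadata/{deployment_id}/*\" {{\n  capabilities = [\"read\", \"list\"]\n}}\n"
    )
}
```

Because the output depends only on its two inputs, re-rendering and comparing against the stored policy is one way an idempotent upsert could decide whether a write is needed.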
+ +**Cons** + +- **Two-token invalidation on membership change.** Both the cached + Zitadel access token and the cached Bao Vault token must be dropped + for new membership to take effect. This is encapsulated in the + `secrets.refresh()` call but is a real round-trip cost (one HTTPS to + Zitadel + one to Bao) on every membership change. Mitigated by the + fact that membership changes are rare relative to secret reads. +- **Removal latency = Vault token TTL.** Removing a device from a + deployment does not immediately revoke its currently-cached Vault + token; access ends at next renewal or TTL expiry. Short TTLs (15 min) + bound the worst case; explicit `bao token revoke -accessor` is + available if needed. +- **Operator gains Zitadel-admin scope.** The operator must hold + credentials that can write user metadata in the Zitadel project. + This is a high-privilege scope and concentrates trust in the + operator. The mitigation is a per-fleet Zitadel project: a + compromised operator can only mutate its own fleet's identities. +- **Zitadel Action required.** Surfacing user metadata as a JWT claim + needs a small Zitadel Action (server-side JavaScript). It is part of + the fleet's Zitadel setup and must be in version control / applied + by the fleet's bootstrap, not configured by hand. (See "Additional + Notes" for the script.) + +## Alternatives considered + +**Project roles for deployment membership.** Rejected: flat namespace +inside a project, no native multi-value semantics, role inventory +explodes at hundreds of deployments per fleet, mutations require +project-admin scope on a coarse-grained API. Kept for the coarse +`fleet-device` / `fleet-admin` distinction the NATS callout already +uses. + +**Project ID embedded in the secret path +(`secrets///...`).** Rejected: the project +isolation is already enforced by `bound_audiences` at the JWT layer. 
+Encoding it in the path is duplication of the same assertion, couples +the secret tree to a Zitadel ID, and complicates project rotations. +Adds no security: a token that passes `bound_audiences` validation can +read the path regardless; one that fails cannot read anything. + +**Claim-templated single policy +(`{{identity.…metadata.deployment_id}}`).** Rejected for the +multi-deployment case: Vault policy templating does not iterate over +arrays, so a single-policy template can only express "one deployment +per device." Acceptable for a single-deployment-per-device world; the +chosen kubelet-like architecture admits N deployments per device, and +collapsing the chosen `groups_claim` design to this would force +multiple JWT logins per refresh. + +**Static per-device Bao token issued at provisioning.** Rejected: +introduces a second long-lived secret on the device, breaks rotation +(re-provisioning required), and provides no native per-deployment +scoping. + +**OpenBao OIDC code flow.** Rejected: that flow is for human users +with a browser. Devices are headless and already hold a JWT-bearer +identity; using OIDC would re-invent the wheel and require a local +browser-equivalent. + +**Lifecycle layer inside the NATS handler.** Rejected: conflates +transport with domain logic and makes the refresh-then-reconcile +ordering implicit. The dedicated module makes the contract testable +and lets future triggers reuse the same code path. 
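The testability argument for a dedicated lifecycle module can be illustrated with a synchronous, std-only sketch of the refresh-then-reconcile decision. Everything here is hypothetical: the `Secrets` trait, `ensure_scope`, and `FakeSecrets` are stand-ins invented for this example, and the real agent code is async and uses the project's own types.

```rust
use std::collections::HashSet;

/// Minimal stand-in for the secrets client: a cached scope plus a refresh.
pub trait Secrets {
    fn cached_scope(&self) -> HashSet<String>;
    fn refresh(&mut self) -> Result<(), String>;
}

/// Decides whether reconciliation may proceed for `deployment`.
/// Ok(false): already in scope, no refresh performed.
/// Ok(true): a refresh was needed and brought the deployment into scope.
/// Err(_): refresh failed, or the deployment is still missing afterwards.
pub fn ensure_scope<S: Secrets>(secrets: &mut S, deployment: &str) -> Result<bool, String> {
    if secrets.cached_scope().contains(deployment) {
        return Ok(false);
    }
    secrets.refresh()?;
    if secrets.cached_scope().contains(deployment) {
        Ok(true)
    } else {
        Err(format!(
            "deployment {deployment} still missing from token scope after refresh"
        ))
    }
}

/// Fake used to exercise the decision without Zitadel or OpenBao.
pub struct FakeSecrets {
    pub cached: HashSet<String>,
    pub after_refresh: HashSet<String>,
    pub refreshes: usize,
}

impl Secrets for FakeSecrets {
    fn cached_scope(&self) -> HashSet<String> {
        self.cached.clone()
    }

    fn refresh(&mut self) -> Result<(), String> {
        // Count the call and swap in the scope a fresh token would carry.
        self.refreshes += 1;
        self.cached = self.after_refresh.clone();
        Ok(())
    }
}
```

With the decision isolated like this, the three cases (cached, refreshed, still missing) can each be asserted against a fake, independently of how the event arrived.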
+ +## Additional Notes + +### Zitadel Action (token customization) + +A single pre-access-token-creation Action per fleet's Zitadel project +copies user metadata `deployments` into a top-level claim: + +```javascript +// Trigger: pre-access-token-creation +function addDeployments(ctx, api) { + const md = ctx.v1.user.getMetadata(); + const entry = md.metadata.find(m => m.key === "deployments"); + if (!entry) return; + try { + const deployments = JSON.parse( + Buffer.from(entry.value, "base64").toString("utf-8") + ); + if (Array.isArray(deployments)) { + api.v1.claims.setClaim("deployments", deployments); + } + } catch (_) { /* malformed metadata is treated as no deployments */ } +} +``` + +The Action lives in Zitadel's "Flows" configuration, attached to the +`Complement Token` flow on the relevant project. A Harmony Score +(`ZitadelTokenCustomizationScore` or similar) is the right home for +applying this declaratively; see the plan document for status. + +### Relationship to ADR-016 and ADR-020-1 + +ADR-016 (agent mesh on NATS JetStream) establishes the agent's +existing Zitadel-keyed identity for NATS. This ADR reuses that +identity unchanged. + +ADR-020-1 establishes the human-developer authentication path to +OpenBao via Zitadel's Device Authorization Grant. This ADR is the +machine-user counterpart: same OpenBao, same Zitadel, different +auth-method binding (humans use device code; devices use +JWT-bearer-derived access tokens against `/auth/jwt/login`). + +### Threat model summary + +| Attacker | Capability | Defense | +|---|---|---| +| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. | +| Compromised device (key theft) | Full agent scope on its own deployments only | `groups_claim` restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. 
| +| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | `bound_audiences` rejects at the JWT auth boundary before any claim is read. | +| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. | +| Compromised Bao | Full access to all stored secrets | Out of scope — Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. | -- 2.39.5 From aecb08a7e960f268333ff9c4a2b0cc00735d3381 Mon Sep 17 00:00:00 2001 From: Jean-Gabriel Gill-Couture Date: Mon, 1 Jun 2026 15:08:01 -0400 Subject: [PATCH 2/3] docs: fix openbao sso implementation plan to reuse harmony_config instead of writing a new implementation of a secret store --- .../device_secret_access_handoff.md | 676 +++++++++++------- docs/adr/025-fleet-device-secret-access.md | 72 +- 2 files changed, 459 insertions(+), 289 deletions(-) diff --git a/ROADMAP/fleet_platform/device_secret_access_handoff.md b/ROADMAP/fleet_platform/device_secret_access_handoff.md index 985bbdb1..26d8a0e2 100644 --- a/ROADMAP/fleet_platform/device_secret_access_handoff.md +++ b/ROADMAP/fleet_platform/device_secret_access_handoff.md @@ -9,17 +9,21 @@ Read ADR-025 first. This document is the work plan, not the design. ## What we are building, in one paragraph -Fleet agents already authenticate to NATS using a Zitadel machine-user -JWT. The same identity, with no new device-side secret, will now also -authenticate them to OpenBao via Vault's JWT auth method. Per-deployment -scope (which secrets a device can read) is carried in the JWT itself as -a `deployments` array claim, populated by Zitadel from operator-managed -user metadata. 
OpenBao's `groups_claim` binds the claim values to -auto-created external groups, each carrying a small per-deployment -policy granting `read` on `harmony-fleet/data//*`. A new -agent-side `lifecycle/` module sits between NATS handling and the -reconciler, refreshing both tokens before any deployment the agent -hasn't seen secrets for is reconciled. +The fleet agent reads per-deployment secrets through the existing +`harmony_config` chain. `harmony_config` already wraps OpenBao via +`harmony_secret::OpenbaoSecretStore`, but `OpenbaoSecretStore` only +knows how to authenticate via env token, cached token, Zitadel OIDC +device flow (humans), or userpass. We add a fifth auth branch: Zitadel +**JWT-bearer** (RFC 7523), keyed off the device's existing machine +keyfile — the same root of trust the NATS callout already uses. +Per-deployment scope is carried by the JWT itself as a `deployments` +array claim populated by the operator via Zitadel user metadata; +OpenBao's `groups_claim` binds those values to auto-created external +groups, each with a small policy granting `read` on +`harmony-fleet/data//*`. A thin agent-side `lifecycle/` +module sits between the NATS KV watcher and the reconciler, calling +`OpenbaoSecretStore::refresh_auth()` whenever the desired state +introduces a deployment outside the cached token scope. ## Architecture at a glance @@ -39,15 +43,18 @@ hasn't seen secrets for is reconciled. 
├── NATS handler (main.rs) ── delivers KV event ──┐ │ │ │ ▼ - │ lifecycle/ ← NEW + │ lifecycle/ ← NEW (agent) │ │ │ if desired ⊄ cached_scope ──┤ │ │ - │ secrets/ │ ← NEW - │ ├─ invalidate Zitadel token - │ ├─ invalidate Bao token - │ ├─ re-mint Zitadel access token - │ └─ POST /v1/auth/jwt/login + │ harmony_config ─┤ + │ │ │ + │ StoreSource (existing wrapper) + │ │ + │ OpenbaoSecretStore ← EXTENDED in harmony_secret + │ └─ new JWT-bearer auth branch + │ └─ refresh_auth() / cached_scope() + │ └─ uses harmony_zitadel_auth for minting │ │ │ ▼ │ reconciler.apply(…) (existing) @@ -57,40 +64,268 @@ hasn't seen secrets for is reconciled. - [ ] Read ADR-025 in full. Surface any objection now, not after PR-1 lands. -- [ ] Confirm in staging: the existing fleet Zitadel project has - a service account usable by the operator with **write user +- [ ] Confirm in staging: the existing fleet Zitadel project has a + service account usable by the operator with **write user metadata** scope. If not, file a sub-task to provision it. -- [ ] Confirm OpenBao staging instance has root credentials accessible - to the operator deploy path (it already does via - `OpenbaoSetupScore`; verify versioned KV mount `harmony-fleet` - exists or plan to enable). +- [ ] Confirm the staging OpenBao KV mount `harmony-fleet` exists, or + plan its creation as part of `OpenbaoSetupScore`. - [ ] Decide deployment ID format. Recommended: the existing `DeploymentName` newtype (`harmony_reconciler_contracts::fleet`). - Validate that its allowed character set is safe for both - OpenBao paths and Vault group names (`[a-zA-Z0-9_-]`). + Validate that its character set is safe for both KV paths and + Vault external-group names (`[a-zA-Z0-9_-]`). ## Work breakdown -Six PRs, sequenced where they must be and parallel where they can be. +Seven PRs. Sequenced where they must be, parallel where they can be. Each is sized to land in a few days; none requires the next to be designed before it can start. 
+``` +PR-1 harmony_zitadel_auth: extract JWT-bearer minter + └─ PR-2 harmony_secret/OpenbaoSecretStore: JWT-bearer auth + refresh + └─ PR-7 agent lifecycle/ + main.rs wiring + +PR-3 harmony OpenbaoJwtAuth Score: bound_claims + groups_claim +PR-4 Zitadel Action Score: deployments claim + └─ PR-5 Operator: ZitadelDeviceDeploymentsScore +PR-3 ─┴─ PR-6 Operator: OpenbaoDeploymentSecretScopeScore +``` + +PRs 1, 3, 4 have no prerequisites and can start in parallel. + --- -### PR-1 — Extend `OpenbaoJwtAuth` with `bound_claims_json` + `groups_claim` +### PR-1 — Extract Zitadel JWT-bearer minter into `harmony_zitadel_auth` + +**Crate:** `harmony_zitadel_auth` +**Depends on:** nothing +**Blocks:** PR-2 + +**Goal.** Move the pure RFC 7523 JWT-bearer flow (assertion building + +token endpoint POST + caching) out of `harmony-fleet-auth` into the +neutral `harmony_zitadel_auth` crate, so OpenBao auth and NATS auth +share one implementation. Two real consumers cross the Rule-of-Three +threshold; no third is needed before extracting. + +**Why not leave it in `harmony-fleet-auth`.** `harmony_secret` cannot +depend on `harmony-fleet-auth` (fleet-specific crate would invert the +dependency graph; `harmony_secret` is general-purpose). The minter is +not fleet-specific — it's Zitadel machine-user identity, which any +Harmony component might need. + +**Shape.** + +```rust +// harmony_zitadel_auth/src/jwt_bearer.rs (new) +pub struct ZitadelJwtBearer { + key: MachineKeyFile, // moved from harmony-fleet-auth + oidc_issuer_url: String, + audience: String, + http: reqwest::Client, + cache: Mutex>, +} + +impl ZitadelJwtBearer { + pub fn new(key: MachineKeyFile, oidc_issuer_url: String, audience: String, + danger_accept_invalid_certs: bool) -> Result; + + /// Returns a cached token if comfortably valid, otherwise mints fresh. + pub async fn bearer_token(&self) -> Result; + + /// Force a re-mint on the next bearer_token() call. 
+ pub fn invalidate_cache(&self); +} +``` + +Move `MachineKeyFile`, `CachedToken`, `build_assertion*`, `build_scope`, +`build_token_url`, and the constants into this crate. Re-export the +types from `harmony-fleet-auth` so the NATS path keeps compiling. + +**`harmony-fleet-auth` refactor.** `CredentialSource::ZitadelJwt` +becomes a thin wrapper: + +```rust +pub enum CredentialSource { + TomlShared { user: String, pass: String }, + ZitadelJwt { minter: Arc }, +} +``` + +`next_credential` for the JWT variant just calls +`minter.bearer_token()` and wraps it as `NatsCredential::BearerToken`. +All caching logic moves into the minter. + +**Tests.** + +- Port the existing `credentials.rs` unit tests (cache freshness, + assertion claims, scope, token URL, key-file parsing) into the new + crate, since the code being tested moved. +- Add one test confirming the NATS callback path still produces the + same `BearerToken` output as before. + +**Acceptance.** `cargo test -p harmony_zitadel_auth -p harmony-fleet-auth` +clean. NATS auth in the fleet e2e harness still works. + +--- + +### PR-2 — `OpenbaoSecretStore` JWT-bearer auth + `refresh_auth()` + `cached_scope()` + +**Crate:** `harmony_secret` (`src/store/openbao.rs`) +**Depends on:** PR-1 +**Blocks:** PR-7 + +**Goal.** Add the fifth rung to `OpenbaoSecretStore`'s auth ladder: +machine identity via Zitadel JWT-bearer → POST to +`/v1/auth/jwt/login`. Make the resulting Vault token refreshable in +place and expose the deployment scope it carries. + +**Auth ladder, after this PR.** Existing order preserved; new rung +slots between cached-token and Zitadel OIDC device flow: + +1. Env token (unchanged) +2. Cached token (unchanged) +3. **Zitadel JWT-bearer (NEW)** — only triggered if a machine keyfile + is configured; if it fails, fall through. +4. Zitadel OIDC device flow (unchanged — humans) +5. 
Userpass (unchanged) + +**New constructor inputs.** Refactor `OpenbaoSecretStore::new` to a +typed options struct (it already has the `too_many_arguments` clippy +TODO at line 55 — knock that out as part of this work). Add: + +```rust +pub struct OpenbaoStoreOptions { + pub base_url: String, + pub kv_mount: String, + pub auth_mount: String, + pub skip_tls: bool, + pub token: Option, + pub username: Option, + pub password: Option, + + // human OIDC (existing) + pub zitadel_sso_url: Option, + pub zitadel_client_id: Option, + + // machine JWT-bearer (NEW) + pub zitadel_jwt_bearer: Option, + + // jwt-auth role on the Bao side (shared between OIDC and JWT-bearer) + pub jwt_role: Option, + pub jwt_auth_mount: Option, +} + +pub struct ZitadelJwtBearerConfig { + pub key_path: Option, + pub key_json: Option, + pub oidc_issuer_url: String, + pub audience: String, // Zitadel project ID +} +``` + +**Interior mutability for refresh.** The current `client: VaultClient` +field is read-only after construction. Wrap it so `refresh_auth()` can +swap in a fresh client without rebuilding the whole store: + +```rust +pub struct OpenbaoSecretStore { + client: ArcSwap, // was: VaultClient + kv_mount: String, + auth_mount: String, + jwt_bearer: Option>, // for refresh + jwt_role: Option, + jwt_auth_mount: Option, + base_url: String, + skip_tls: bool, + scope: ArcSwap>, // deployment IDs from token +} +``` + +`arc-swap` is already a common pattern in the codebase. Read paths +(`get_raw`, `set_raw`) load the current pointer; refresh stores a new +one. Locks aren't needed on the hot path. 
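
The pointer-swap idea behind the `client` field can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the store's actual code: `Client`, `Store`, `current`, and `refresh` are hypothetical names, and `RwLock<Arc<Client>>` stands in for `ArcSwap` so the block compiles with the standard library alone. Unlike `arc-swap`, the stand-in takes a brief lock on every read.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the authenticated Vault client.
struct Client {
    token: String,
}

/// Store whose client pointer can be replaced while readers are active.
/// Assumption: std-only substitute for `ArcSwap<VaultClient>`.
struct Store {
    client: RwLock<Arc<Client>>,
}

impl Store {
    fn new(token: &str) -> Self {
        let client = Arc::new(Client { token: token.to_string() });
        Store { client: RwLock::new(client) }
    }

    /// Read path: clone the current pointer, then release the guard.
    fn current(&self) -> Arc<Client> {
        let guard = self.client.read().expect("client lock poisoned");
        Arc::clone(&*guard)
    }

    /// Refresh path: publish a new client. Readers that already hold
    /// the previous `Arc` keep using it until they drop it.
    fn refresh(&self, new_token: &str) {
        let fresh = Arc::new(Client { token: new_token.to_string() });
        *self.client.write().expect("client lock poisoned") = fresh;
    }
}
```

A request that called `current()` before a refresh finishes against the old client, while any call made afterwards sees the new one. That is the property `refresh_auth()` relies on when it swaps the token without rebuilding the store.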
+ +**`refresh_auth()` implementation.** + +```rust +impl OpenbaoSecretStore { + pub async fn refresh_auth(&self) -> Result<(), SecretStoreError> { + let bearer = self.jwt_bearer.as_ref() + .ok_or_else(|| SecretStoreError::Store( + "refresh_auth requires JWT-bearer auth".into()))?; + bearer.invalidate_cache(); + let jwt = bearer.bearer_token().await?; + let session = jwt_login( + &self.base_url, self.skip_tls, + self.jwt_auth_mount.as_deref().unwrap_or("jwt"), + self.jwt_role.as_deref().ok_or(/* err */)?, + &jwt, + ).await?; + let client = build_vault_client(&self.base_url, self.skip_tls, &session.token)?; + self.client.store(Arc::new(client)); + self.scope.store(Arc::new(session.deployments)); + Ok(()) + } + + pub fn cached_scope(&self) -> HashSet { + (**self.scope.load()).clone() + } +} +``` + +`jwt_login` POSTs to `/v1/auth/{mount}/login` with `{role, jwt}` and +parses the response. To populate `deployments`, follow with a +`/v1/auth/token/lookup-self` call and read the +`identity_policies` / external-group-alias list. (Vault returns +external group memberships in the `external_namespace_policies` / +`identity_policies` field; document which exact field once verified +against staging.) + +**On initial construction with JWT-bearer.** The constructor wires up +`jwt_bearer` and calls `refresh_auth()` once before returning. If that +fails, fall through to the next auth method (preserves the existing +ladder semantics). + +**Distinguishing 403 from 404 in `get_raw`.** Today `StoreSource` +swallows all errors from `SecretStore::get_raw` and falls through to +the next config source (see `harmony_config/src/source/store.rs:38`). +A 403 (auth scope insufficient) should NOT be treated as "not found" +because falling through to the prompt source would prompt a human for +a secret a misconfigured device cannot read. Map 403 to a distinct +`SecretStoreError::AuthScope { … }` variant and have `StoreSource` +propagate it as a hard `ConfigError`. 
The lifecycle layer's +`refresh_auth()` precondition prevents reaching this path in practice, +but the defense-in-depth matters. + +**Tests.** + +- Unit: ladder ordering (JWT-bearer config present → tried before + OIDC; absent → ladder unchanged). +- Unit: `refresh_auth` against a local mock HTTP server (or a hand-rolled + fake) returns a new token and updates `cached_scope`. +- Unit: 403 from `get_raw` propagates as `AuthScope`, not `NotFound`. +- Integration: against staging OpenBao, a machine keyfile yields a + Vault token whose scope matches the device's deployments metadata. + +**Acceptance.** `cargo test -p harmony_secret` clean. `examples/openbao` +still works for the userpass path. A manual e2e run against staging +authenticates via JWT-bearer and reads a placeholder secret. + +--- + +### PR-3 — Extend `OpenbaoJwtAuth` Score with `bound_claims_json` + `groups_claim` **Crate:** `harmony` (`src/modules/openbao/setup.rs`) **Depends on:** nothing -**Blocks:** PR-3, PR-5 +**Blocks:** PR-6 staging dry run -**Goal.** Make the existing JWT auth Score expressive enough to bind +**Goal.** Make the server-side JWT auth role expressive enough to bind to a Zitadel project audience, require the `fleet-device` role, and emit per-deployment external-group aliases. **Changes.** ```rust -// fleet/harmony/src/modules/openbao/setup.rs pub struct OpenbaoJwtAuth { pub oidc_discovery_url: String, pub bound_issuer: String, @@ -101,50 +336,41 @@ pub struct OpenbaoJwtAuth { pub ttl: String, pub max_ttl: String, - // NEW — both optional, both default to empty - /// JSON string passed to `bao write auth/jwt/role/ bound_claims=…`. - /// Empty means no claim binding (current behavior). + // NEW — both default to empty #[serde(default)] - pub bound_claims_json: String, - - /// Claim name to read for group aliases. When set, OpenBao reads - /// this claim as an array of strings and creates one external- - /// group alias per element on each login.
+ pub bound_claims_json: String, // → `bound_claims=…` #[serde(default)] - pub groups_claim: String, + pub groups_claim: String, // → `groups_claim=…` } ``` -Extend `configure_jwt` in the same file to pass `bound_claims=...` and -`groups_claim=...` to the `bao write auth/jwt/role/...` call when those -fields are non-empty. +Extend `configure_jwt` (`setup.rs:435+`) to pass the two new flags to +`bao write auth/jwt/role/...` only when non-empty. **Tests.** -- Unit: serialize/deserialize round-trip with both new fields populated - and both empty. -- Integration (existing harness if any; otherwise just compile): the - field plumbs through to the `bao write` command. Inspect the command - string in the test if the existing test harness allows. +- Unit: serde round-trip with both new fields populated and both + empty. +- Existing integration paths unaffected (default-empty leaves + behavior identical). -**Acceptance.** `cargo check -p harmony --all-features` clean. Existing -callers (which leave both new fields at default) compile and behave -identically. `examples/openbao` still works. +**Acceptance.** `cargo check -p harmony --all-features` clean. +Existing callers compile unchanged. `examples/openbao` still works. --- -### PR-2 — Zitadel Action for the `deployments` claim +### PR-4 — Zitadel Action Score: surface `deployments` metadata as a claim -**Crate:** `harmony` (probably a new module under -`src/modules/zitadel/` or `src/modules/sso/`, follow whatever the -existing Zitadel deployment code uses) +**Crate:** `harmony` (likely new module under `src/modules/zitadel/` or +`src/modules/sso/`, follow whatever module currently houses Zitadel +deploys) **Depends on:** nothing -**Blocks:** PR-3, PR-4 +**Blocks:** PR-5 -**Goal.** A Score that declaratively ensures the post-access-token- -creation Action exists on the configured Zitadel project, with the -script that copies `user.metadata.deployments` into a top-level -`deployments` claim. 
+**Goal.** A declarative Score that ensures the pre-access-token-creation +Action exists on the configured Zitadel project, with the script that +copies `user.metadata.deployments` into a top-level `deployments` +claim. **Action script** (canonical text, do not modify casually): @@ -169,40 +395,37 @@ function addDeployments(ctx, api) { ```rust pub struct ZitadelDeploymentsClaimActionScore { pub project_id: String, - pub action_name: String, // default "fleet-deployments-claim" - pub flow_type: FlowType, // ComplementToken - pub trigger_type: TriggerType, // PreAccessTokenCreation + pub action_name: String, // default "fleet-deployments-claim" + pub flow_type: FlowType, // ComplementToken + pub trigger_type: TriggerType, // PreAccessTokenCreation } ``` -The interpret calls Zitadel's Management API to upsert the Action and -attach it to the `Complement Token / PreAccessTokenCreation` trigger. +Interpret upserts the Action via Zitadel Management API and attaches +it to `Complement Token / PreAccessTokenCreation` on the project. Idempotent. **Tests.** - Unit: serialization, defaults. -- E2E: standing up a fresh Zitadel and applying the Score yields a - token whose `deployments` claim matches the metadata value. Use the - existing fleet e2e Zitadel as a fixture if available; otherwise - document as a manual verification step in staging. +- Manual E2E: stand up a fresh Zitadel, apply the Score, mint a + machine-user token whose owner has metadata + `deployments=["foo","bar"]`, confirm the claim appears at jwt.io. -**Acceptance.** Manually mint a token in staging after applying the -Score; decode at jwt.io and confirm the `deployments` claim is -present. +**Acceptance.** Manually decoded token from staging contains the +expected `deployments` claim. --- -### PR-3 — `ZitadelDeviceDeploymentsScore` and operator wiring +### PR-5 — Operator: `ZitadelDeviceDeploymentsScore` + wiring **Crate:** `harmony` for the Score; `fleet/harmony-fleet-operator` for the wiring.
-**Depends on:** PR-2 (the Action must exist for the metadata to surface -in tokens). -**Blocks:** PR-6. +**Depends on:** PR-4 (the Action must exist for the claim to surface) +**Blocks:** PR-7 staging dry run **Goal.** Operator writes membership to Zitadel as the **first** of -its three writes on every new/changed deployment. +its three writes on every reconciled deployment. **Score shape.** @@ -210,34 +433,32 @@ its three writes on every new/changed deployment. pub struct ZitadelDeviceDeploymentsScore { pub project_id: String, pub device_user_id: String, // Zitadel machine user ID - pub deployments: Vec, // declarative full set + pub deployments: Vec, // declarative full set, NOT a diff } ``` -Interpret semantics: **declarative replace**, not append. The Score -reads the current `deployments` metadata value, compares to the -desired set, and writes only if they differ. This makes removal -trivially expressible: the operator declares the new set, the Score -diffs and writes. +Interpret semantics: **declarative replace**. Read the current +metadata value, compare to the desired set, write only on diff. +Removal is "declare the new set without the removed deployment." -**Operator changes** (`fleet-aggregator.rs` or `device_reconciler.rs`, -wherever the per-`Deployment` reconciliation lives). On reconciling a -`Deployment` CR: +**Operator changes** (in `fleet_aggregator.rs` or `device_reconciler.rs`, +wherever per-`Deployment` reconciliation lives today). On reconciling +a `Deployment` CR: 1. Compute the desired `device → deployments` map for every device targeted by the deployment. -2. For each device, compose and run `ZitadelDeviceDeploymentsScore` - with the desired set. -3. Continue to the OpenBao step (PR-5) and only then to the existing - NATS publish. +2. For each device, compose and run `ZitadelDeviceDeploymentsScore`. +3. Continue to `OpenbaoDeploymentSecretScopeScore` (PR-6) and only + then to the existing NATS publish. 
-Errors at step 2 must surface as a status condition on the CR; do not -proceed to NATS if Zitadel write fails. +If step 2 fails, surface as a status condition on the CR and do NOT +proceed to NATS — never publish a desired-state for which the device +cannot acquire scope. **Tests.** -- Unit on the Score: diff logic (no-op when sets match, write when - they differ). +- Unit on the Score: diff logic (no-op when sets match; full write + when they differ). - Integration on the operator: a fake Zitadel client receives the expected write; the order against the existing NATS publish is preserved. @@ -247,12 +468,12 @@ target device's Zitadel metadata to contain the deployment ID. --- -### PR-4 — Operator: per-deployment OpenBao group + policy +### PR-6 — Operator: per-deployment OpenBao group + policy **Crate:** `harmony` (`src/modules/openbao/`) for the Score; operator for wiring. -**Depends on:** PR-1. -**Blocks:** PR-6. +**Depends on:** PR-3 (role + accessor must exist) +**Blocks:** PR-7 staging dry run **Goal.** When a deployment exists, ensure `identity/group/` and `fleet-deployment-` exist with @@ -265,7 +486,7 @@ pub struct OpenbaoDeploymentSecretScopeScore { pub instance: OpenbaoInstance, pub kv_mount: String, // "harmony-fleet" pub deployment_id: String, - pub jwt_auth_accessor: String, // resolved from PR-1 auth role + pub jwt_auth_accessor: String, // resolved from the PR-3 role } ``` @@ -282,128 +503,44 @@ Interpret produces: } ``` -- External group `` (`type=external`) with that policy - attached and an alias `name=, canonical_id=, - mount_accessor=`. +- External group `` (`type=external`), policy attached, + with an alias `name=, + canonical_id=, mount_accessor=`. -Idempotent: existing groups/policies with matching content noop. +Idempotent. -**Operator wiring.** Same call site as PR-3, executed *after* the -Zitadel write and *before* the NATS publish. +**Operator wiring.** Same call site as PR-5, after the Zitadel write +and before the NATS publish. 
**Tests.** - Unit on policy text generation. -- Integration: against a real Bao (existing `examples/openbao` is the - fixture). Apply the Score twice; second run is a noop. +- Integration against the real Bao in `examples/openbao`; second run + is a no-op. -**Acceptance.** Manually log in to staging Bao with a device's JWT -(see PR-6 for the agent path; for this PR, use `curl`) and confirm -the resulting Vault token has the expected policies attached. +**Acceptance.** Manually `curl`-ing +`/v1/auth/jwt/login` with a real device JWT returns a Vault token +whose `policies` include `fleet-deployment-` and whose +`identity_policies` reflect the external group binding. --- -### PR-5 — Agent `secrets/` module +### PR-7 — Agent `lifecycle/` module + main.rs wiring -**Crate:** `fleet/harmony-fleet-agent` (new module: `src/secrets/`). -**Depends on:** PR-1 (needs the role configured to be testable end-to-end; -the code itself does not depend on PR-1). -**Blocks:** PR-6. +**Crate:** `fleet/harmony-fleet-agent` +**Depends on:** PR-2 (needs `refresh_auth()` + `cached_scope()`) -**Goal.** A `SecretsClient` that: - -- Holds a reference to the existing `CredentialSource` (the agent - already mints Zitadel tokens for NATS via - `fleet/harmony-fleet-auth/src/credentials.rs`). -- Exposes `read(deployment_id, secret_name) -> Result>`. -- Exposes `cached_scope() -> HashSet`. -- Exposes `refresh() -> Result<()>` which drops both caches and - re-authenticates. - -**Sketch.** - -```rust -// fleet/harmony-fleet-agent/src/secrets/mod.rs -pub struct SecretsClient { - bao_addr: String, - bao_role: String, - credentials: Arc, // existing Zitadel source - http: reqwest::Client, - state: Mutex>, -} - -struct BaoSession { - vault_token: String, - expires_at: i64, - deployments: HashSet, // computed from token lookup -} - -impl SecretsClient { - pub async fn cached_scope(&self) -> HashSet { … } - - pub async fn refresh(&self) -> Result<()> { - // 1. 
drop the Bao session - *self.state.lock().await = None; - // 2. force a fresh Zitadel access token - self.credentials.invalidate_cache(); // small new API on CredentialSource - let jwt = self.credentials.bearer_token().await?; - // 3. POST /v1/auth/jwt/login { role, jwt } - let session = login_to_bao(&self.http, &self.bao_addr, &self.bao_role, &jwt).await?; - *self.state.lock().await = Some(session); - Ok(()) - } - - pub async fn read(&self, dep: &DeploymentName, name: &str) -> Result> { - let session = self.ensure_session().await?; - let path = format!("harmony-fleet/data/{dep}/{name}"); - bao_kv_get(&self.http, &self.bao_addr, &session.vault_token, &path).await - } -} -``` - -**Small change to `harmony-fleet-auth`.** Add a public method on -`CredentialSource` to force-invalidate the Zitadel cache: - -```rust -// fleet/harmony-fleet-auth/src/credentials.rs -impl CredentialSource { - pub fn invalidate_cache(&self) { - if let Self::ZitadelJwt { cache, .. } = self { - if let Ok(mut g) = cache.lock() { *g = None; } - } - } -} -``` - -**Tests.** - -- Unit: `cached_scope` correctness across `refresh` cycles using a - fake HTTP client. -- Unit: `refresh` is concurrency-safe (two concurrent calls do not - produce two sessions; one wins, the other observes the new state). -- Integration (e2e): against the existing fleet-staging Bao, run a - short test binary that mints a Zitadel token and reads a secret. - -**Acceptance.** `cargo test -p harmony-fleet-agent` clean; manual e2e -read of a placeholder secret in staging. - ---- - -### PR-6 — Agent `lifecycle/` module + main.rs wiring - -**Crate:** `fleet/harmony-fleet-agent`. -**Depends on:** PR-5. - -**Goal.** The orchestration layer that calls `secrets.refresh()` -before `reconciler.apply()` whenever the desired-state introduces a -deployment whose secrets are outside the cached scope. 
+**Goal.** The orchestration layer that calls +`secret_store.refresh_auth()` before `reconciler.apply()` whenever the +desired-state introduces a deployment whose secrets are outside the +cached scope. **Module shape.** ```rust // fleet/harmony-fleet-agent/src/lifecycle/mod.rs pub struct DeploymentLifecycle { - secrets: Arc, + secret_store: Arc, // shared with ConfigManager reconciler: Arc, } @@ -411,11 +548,12 @@ impl DeploymentLifecycle { pub async fn on_apply(&self, key: &str, value: &[u8]) -> Result<()> { let dep = deployment_from_key(key) .context("desired-state key without deployment component")?; - if !self.secrets.cached_scope().await.contains(&dep) { + let scope = self.secret_store.cached_scope(); + if !scope.contains(dep.as_str()) { tracing::info!(%dep, "deployment outside cached secret scope — refreshing"); - self.secrets.refresh().await + self.secret_store.refresh_auth().await .with_context(|| format!("refresh secrets for {dep}"))?; - if !self.secrets.cached_scope().await.contains(&dep) { + if !self.secret_store.cached_scope().contains(dep.as_str()) { anyhow::bail!( "deployment {dep} still missing from token scope after refresh — \ operator may not have written Zitadel metadata yet" @@ -426,115 +564,105 @@ impl DeploymentLifecycle { } pub async fn on_remove(&self, key: &str) -> Result<()> { - // No refresh needed for removal; the agent loses access naturally - // when its claim no longer includes the deployment. + // No refresh needed for removal; access lapses naturally on next refresh. self.reconciler.remove(key).await } } ``` +The `OpenbaoSecretStore` instance is the same one held by the agent's +`ConfigManager`. Either construct it once and share via `Arc`, or +expose a typed accessor on `ConfigManager` so the lifecycle layer can +reach the underlying store. The Arc-sharing is simpler. 
+ **`main.rs` change.** Replace direct calls to `reconciler.apply` / -`reconciler.remove` in the JetStream KV watcher with the lifecycle -equivalents: +`reconciler.remove` in the JetStream KV watcher (around line 111-119) +with `lifecycle.on_apply` / `lifecycle.on_remove`. + +**Bounded retry at the call site.** A race remains: operator may +publish to NATS milliseconds before Zitadel propagates the metadata +write to the token issuer. Handle as a bounded retry around +`lifecycle.on_apply`, not inside the lifecycle module: ```rust -// fleet/harmony-fleet-agent/src/main.rs (around line 110-120) -async_nats::jetstream::kv::Operation::Put => { - if let Err(e) = lifecycle.on_apply(&entry.key, &entry.value).await { - tracing::warn!(key = %entry.key, error = %e, "lifecycle apply failed"); - } -} -async_nats::jetstream::kv::Operation::Delete -| async_nats::jetstream::kv::Operation::Purge => { - if let Err(e) = lifecycle.on_remove(&entry.key).await { - tracing::warn!(key = %entry.key, error = %e, "lifecycle remove failed"); - } -} -``` - -**Retry on "still missing after refresh."** A race remains: operator -publishes to NATS before Zitadel propagates the metadata write to the -token issuer (unlikely if the operator runs the steps in order, but -the boundary is asynchronous). Handle it as a bounded retry with -backoff at the call site, not inside the lifecycle module: - -```rust -// in main.rs handler let res = retry_with_backoff( || lifecycle.on_apply(&entry.key, &entry.value), - max_attempts = 3, - initial = Duration::from_secs(2), + /* attempts */ 3, + /* initial */ Duration::from_secs(2), ); ``` **Tests.** -- Unit: `on_apply` with a fake `SecretsClient` and fake `Reconciler`, - exercising (a) cached deployment → no refresh, reconciler runs; (b) - uncached deployment → refresh runs, reconciler runs; (c) refresh - succeeds but scope still missing → reconciler does NOT run, error - returned. -- E2E: fresh device + new deployment in staging works end-to-end. 
+- Unit with a fake `OpenbaoSecretStore` trait-object (cached + deployment → no refresh; uncached → refresh runs; refresh succeeds + but scope still missing → reconciler does NOT run, error returned). +- E2E: fresh device + new deployment in staging shows one + "outside cached scope — refreshing" log followed by a clean + reconcile. -**Acceptance.** Manually rolling out a new deployment to a device in -staging shows the agent's logs: "deployment X outside cached scope — -refreshing" once, then a successful reconcile. +**Acceptance.** Manually rolling out a new deployment to a staging +device shows the expected log sequence and the workload starts. --- ## Integration milestone — staging dry run -After all six PRs land: +After PRs 1-7 land: - [ ] Apply the new `OpenbaoJwtAuth` config to staging Bao via the existing `OpenbaoSetupScore`. - [ ] Apply `ZitadelDeploymentsClaimActionScore` to staging Zitadel. -- [ ] Manually provision metadata + policy for one test deployment + - one test device. Confirm the agent can read its secret. +- [ ] Manually provision metadata + group/policy for one test + deployment + one test device. Confirm the agent reads its secret. - [ ] Roll out via the operator: create a `Deployment` CR, confirm - the operator does the three writes in order and the agent + the operator runs the three writes in order and the agent reconciles cleanly. -- [ ] Negative test: try to read a deployment's secret from a device - not in that deployment. Confirm Bao denies it. -- [ ] Negative test: a device's JWT minted before adding it to a - deployment cannot read the new deployment's secret until the - next `refresh`. Verify the refresh path triggers automatically - on the NATS desired-state arrival. +- [ ] Negative: device not in deployment X cannot read X's secret + (expect `AuthScope` propagation, not silent fall-through). 
+- [ ] Negative: a JWT minted before metadata update cannot read the + new deployment's secret until refresh — confirm the lifecycle + layer triggers refresh automatically. ## Decisions deferred (open questions) These are not blockers; flag them when you hit them. 1. **Vault token TTL.** ADR-025 recommends 15 min. Confirm this works - with the agent's longest expected reconcile window and adjust - `OpenbaoJwtAuth::ttl` in the fleet setup if not. + with the longest expected reconcile window and adjust + `OpenbaoJwtAuth::ttl` if not. 2. **Hard revocation on deployment removal.** ADR-025 leaves this to - "wait for TTL." If a future fleet requires immediate revocation - (e.g. a compromised device), we add an + TTL expiry. If a future fleet requires immediate revocation, add an `OpenbaoTokenRevokeOnRemoval` companion Score. Don't pre-build it. -3. **What happens if Bao is down at agent startup.** The agent should - not crash; it should retry the auth on next reconcile attempt. - `SecretsClient::ensure_session` returning an error is fine — the - lifecycle layer propagates it and the NATS handler logs and - continues. Confirm this behavior in the integration milestone. -4. **Should `harmony-fleet-auth` move?** PR-5 adds an - `invalidate_cache()` method to `CredentialSource`. If we end up - with more Bao-specific code in that crate, consider whether the - secrets client belongs there instead of in - `harmony-fleet-agent`. Defer until at least one other caller (e.g. - the operator reading its own secrets) appears. +3. **Bao down at agent startup.** `OpenbaoSecretStore::new` with + JWT-bearer should not panic; it should fall through to other auth + methods or return an error the agent can retry. Confirm behavior + in the integration milestone and document. +4. **`StoreSource` 403 propagation.** PR-2 introduces a new + `SecretStoreError::AuthScope` variant. 
Decide whether + `harmony_config/src/source/store.rs` should treat it as a hard error + (recommended) or continue to fall through (current behavior). + Tests in PR-2 should lock the chosen semantics. +5. **One-keyfile-per-store or pool.** PR-2 assumes one + `ZitadelJwtBearer` per `OpenbaoSecretStore`. If a future use case + needs the agent to authenticate as multiple identities (unlikely), + we'd revisit. Single-identity is the right starting shape. ## Required reading before starting - `docs/adr/025-fleet-device-secret-access.md` — the design. +- `docs/adr/020-1-zitadel-openbao-secure-config-store.md` — the + unified-config rationale; this PR set is the machine-user counterpart. - `docs/adr/023-deploy-architecture.md` — Score discipline for the - new deploy-side artifacts. -- `nats/callout/src/zitadel.rs` — the existing JWT validation pattern; - the JWT auth role config mirrors it. + deploy-side Scores in PRs 3-6. +- `harmony_secret/src/store/openbao.rs` — the store being extended in + PR-2 (auth ladder at lines 70-185). +- `harmony_config/src/source/store.rs` — the wrapper the agent uses; + read it to understand the 403-vs-404 question in PR-2. - `fleet/harmony-fleet-auth/src/credentials.rs` — the existing - Zitadel JWT-bearer mint path. PR-5 plugs into the same - `CredentialSource`. -- `harmony/src/modules/openbao/setup.rs` — the JWT auth Score being - extended in PR-1. + JWT-bearer mint path being extracted in PR-1. +- `nats/callout/src/zitadel.rs` — the existing JWT validation pattern; + the Bao JWT auth role mirrors its `bound_issuer` + `bound_audiences` + shape. - `fleet/harmony-fleet-agent/src/main.rs` and `reconciler.rs` — the - agent shape PR-7 modifies. 
diff --git a/docs/adr/025-fleet-device-secret-access.md b/docs/adr/025-fleet-device-secret-access.md index 840305b4..23640a2e 100644 --- a/docs/adr/025-fleet-device-secret-access.md +++ b/docs/adr/025-fleet-device-secret-access.md @@ -102,29 +102,62 @@ their short TTL expires; explicit revocation is available via `bao token revoke` on the device's accessor if hard revocation is needed. -### 3. Agent-side lifecycle layer between NATS and the reconciler +### 3. `harmony_secret` JWT-bearer auth method, consumed via `harmony_config` -The agent gains a new module — `lifecycle/` — sitting between the NATS -KV watcher in `main.rs` and `Reconciler::apply()`. NATS handlers no -longer call the reconciler directly; they call the lifecycle handler, -which orchestrates the steps a new (or changed) deployment requires: +The agent does **not** grow a new secrets client. Per ADR-020 the +unified entry point for both config and secrets is `harmony_config`, +which already chains `StoreSource` through +`harmony_secret`. The missing piece is auth, not retrieval: +`OpenbaoSecretStore::new` today supports a token from env, a cached +token, the Zitadel OIDC **device** flow (humans), and userpass — but +not the Zitadel **JWT-bearer** flow that the agent needs for headless +machine identity. + +The work is therefore mostly inside `harmony_secret`: + +- Extend `OpenbaoSecretStore` with a new auth branch that takes a + Zitadel machine keyfile + Bao JWT role + audience, mints an access + token via RFC 7523 JWT-bearer (the same flow + `fleet/harmony-fleet-auth/src/credentials.rs` already implements + for NATS), and POSTs to `/v1/auth/jwt/login`. +- Pull the JWT-bearer minting itself into `harmony_zitadel_auth` so + the agent's NATS path and the OpenBao auth path share one + implementation. Two real consumers cross the Rule-of-Three + threshold. 
+- Add interior mutability to `OpenbaoSecretStore` so a re-auth + (re-mint + re-login) can replace the cached Vault token in place + without rebuilding the entire `ConfigManager`. The `SecretStore` + trait stays `&self`-only on the read path; an explicit + `refresh_auth()` is the only mutator. +- Expose `cached_scope() -> HashSet` derived from the + Vault token's identity-group aliases (one call to + `/v1/auth/token/lookup-self` per refresh, cached for the token's + lifetime). + +### 4. Agent-side lifecycle layer between NATS and the reconciler + +A thin module — `fleet/harmony-fleet-agent/src/lifecycle/` — sits +between the NATS KV watcher in `main.rs` and `Reconciler::apply()`, +calling into `harmony_config`'s OpenBao source to refresh the +secret-store auth before reconciling a deployment the agent has not +yet seen: ``` on_desired_state_changed(device, desired_deployments): - if not desired_deployments ⊂ secrets.cached_scope(): - secrets.refresh() # drops Bao + Zitadel tokens, re-mints both - reconciler.apply(...) # cannot start until refresh returns Ok + if not desired_deployments ⊂ secret_store.cached_scope(): + secret_store.refresh_auth() # drops Bao + Zitadel tokens, re-mints both + reconciler.apply(...) # cannot start until refresh returns Ok ``` -The refresh is a single atomic action against the `secrets/` module: it -invalidates both the cached Bao Vault token and the cached Zitadel -access token, re-mints a fresh Zitadel token (so the new `deployments` -claim is reflected), and re-runs `/auth/jwt/login` to obtain a fresh -Vault token. The reconciler runs only after the refresh succeeds. +`refresh_auth()` is atomic at the OpenBao store level: it invalidates +both the cached Bao Vault token and the cached Zitadel access token, +re-mints a fresh Zitadel token (so the new `deployments` claim is +reflected), and re-runs `/v1/auth/jwt/login` to obtain a fresh Vault +token. The reconciler runs only after the refresh succeeds. 
This layer has no other reason to exist *today*, but it is the -canonical home for any future "auxiliary work that must happen before a -deployment can run" — fetching image-pull credentials, registering +canonical home for any future "auxiliary work that must happen before +a deployment can run" — fetching image-pull credentials, registering with monitoring, requesting a per-deployment NATS credential, etc. Locking that home now avoids those concerns leaking into either the NATS handler or the reconciler when they arrive. @@ -178,6 +211,15 @@ layer also makes the refresh-then-reconcile ordering an explicit property of the agent's architecture, testable with fakes and visible in the code. +**Why `harmony_config` / `harmony_secret`, not a fleet-local secrets +client.** ADR-020 is explicit that `harmony_config` is the unified +config+secret entry point and `OpenbaoSecretStore` is the canonical +OpenBao client. Adding a parallel fleet-only client would duplicate +the auth ladder, cache-file layout, and `kv2` plumbing already in +`harmony_secret`. The fleet's needs are an *additional auth branch*, +not a different store. Putting it where every other Harmony consumer +will also benefit is the correct locus. 
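As a rough illustration of the refresh-then-reconcile ordering described in section 4, the gate can be reduced to a pure decision over the cached scope. This is a minimal sketch under stated assumptions, not the agent's code: `Gate`, `missing`, `decide`, and `after_refresh` are hypothetical names, and deployment IDs are assumed to be plain strings.

```rust
use std::collections::HashSet;

/// What the caller should do before handing a desired state to the reconciler.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    /// Every desired deployment is already covered by the cached scope.
    Proceed,
    /// These deployments are outside the cached scope; refresh auth first.
    RefreshFirst(Vec<String>),
}

/// Deployments that are desired but absent from `scope`, sorted for stable output.
pub fn missing(desired: &HashSet<String>, scope: &HashSet<String>) -> Vec<String> {
    let mut out: Vec<String> = desired.difference(scope).cloned().collect();
    out.sort();
    out
}

/// Decide whether a refresh is needed before reconciling.
pub fn decide(desired: &HashSet<String>, scope: &HashSet<String>) -> Gate {
    let m = missing(desired, scope);
    if m.is_empty() {
        Gate::Proceed
    } else {
        Gate::RefreshFirst(m)
    }
}

/// After a refresh, allow the reconcile only if nothing is still missing.
pub fn after_refresh(
    desired: &HashSet<String>,
    refreshed_scope: &HashSet<String>,
) -> Result<(), String> {
    let m = missing(desired, refreshed_scope);
    if m.is_empty() {
        Ok(())
    } else {
        Err(format!("still outside token scope after refresh: {}", m.join(", ")))
    }
}
```

Keeping the decision free of I/O is one way to make the ordering testable with fakes, as the rationale suggests; the actual refresh and reconcile calls would wrap these functions.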
+ ## Consequences **Pros** -- 2.39.5 From 3d01d7482fcb42e8e0cf84ec538edde8673b70dd Mon Sep 17 00:00:00 2001 From: Jean-Gabriel Gill-Couture Date: Mon, 1 Jun 2026 15:15:42 -0400 Subject: [PATCH 3/3] docs: Simplify architecture for openbao sso via harmony config --- .../device_secret_access_handoff.md | 670 ++++++------------ docs/adr/025-fleet-device-secret-access.md | 101 +-- 2 files changed, 257 insertions(+), 514 deletions(-) diff --git a/ROADMAP/fleet_platform/device_secret_access_handoff.md b/ROADMAP/fleet_platform/device_secret_access_handoff.md index 26d8a0e2..53ccc71d 100644 --- a/ROADMAP/fleet_platform/device_secret_access_handoff.md +++ b/ROADMAP/fleet_platform/device_secret_access_handoff.md @@ -1,97 +1,78 @@ # Fleet device secret access — implementation handoff -**Owner:** TBD (assignee) +**Owner:** TBD **Design:** [`docs/adr/025-fleet-device-secret-access.md`](../../docs/adr/025-fleet-device-secret-access.md) **Status:** Ready to start **Written:** 2026-06-01 Read ADR-025 first. This document is the work plan, not the design. -## What we are building, in one paragraph +## Summary The fleet agent reads per-deployment secrets through the existing -`harmony_config` chain. `harmony_config` already wraps OpenBao via -`harmony_secret::OpenbaoSecretStore`, but `OpenbaoSecretStore` only -knows how to authenticate via env token, cached token, Zitadel OIDC -device flow (humans), or userpass. We add a fifth auth branch: Zitadel -**JWT-bearer** (RFC 7523), keyed off the device's existing machine -keyfile — the same root of trust the NATS callout already uses. -Per-deployment scope is carried by the JWT itself as a `deployments` -array claim populated by the operator via Zitadel user metadata; -OpenBao's `groups_claim` binds those values to auto-created external -groups, each with a small policy granting `read` on -`harmony-fleet/data//*`. 
A thin agent-side `lifecycle/` -module sits between the NATS KV watcher and the reconciler, calling -`OpenbaoSecretStore::refresh_auth()` whenever the desired state -introduces a deployment outside the cached token scope. +`harmony_config` chain, which already wraps OpenBao via +`harmony_secret::OpenbaoSecretStore`. `OpenbaoSecretStore`'s auth +ladder is extended with one new rung — Zitadel **JWT-bearer**, keyed +off the device's existing machine keyfile (the same root of trust the +NATS callout uses). Per-deployment scope rides in the JWT itself as a +`deployments` array claim, populated by the operator through Zitadel +user metadata. OpenBao's `groups_claim` binds those values to +auto-created external groups, each with a small policy granting +`read` on `harmony-fleet/data//*`. The agent's NATS +watcher calls `refresh_auth()` on the store before reconciling a +deployment whose secrets are outside the cached scope. ## Architecture at a glance ``` - Fleet operator (one per fleet, has Zitadel admin scope) + Fleet operator (one per fleet, holds Zitadel admin scope) │ ├── on new Deployment CR: - │ 1. Zitadel: append deployment to device.metadata.deployments - │ 2. OpenBao: upsert external group + policy - │ fleet-deployment- → read on - │ harmony-fleet/data//* - │ 3. NATS: publish desired-state.. + │ 1. FleetDeviceDeploymentMembershipScore + │ ├─ Zitadel: append dep to device.metadata.deployments + │ └─ OpenBao: upsert external group + policy + │ fleet-deployment- → read on + │ harmony-fleet/data//* + │ 2. 
existing NATS publish (data plane, unchanged) │ ▼ - Fleet agent (has Zitadel machine key from FleetDeviceSetupScore) + Fleet agent │ - ├── NATS handler (main.rs) ── delivers KV event ──┐ - │ │ - │ ▼ - │ lifecycle/ ← NEW (agent) - │ │ - │ if desired ⊄ cached_scope ──┤ - │ │ - │ harmony_config ─┤ - │ │ │ - │ StoreSource (existing wrapper) - │ │ - │ OpenbaoSecretStore ← EXTENDED in harmony_secret - │ └─ new JWT-bearer auth branch - │ └─ refresh_auth() / cached_scope() - │ └─ uses harmony_zitadel_auth for minting - │ │ - │ ▼ - │ reconciler.apply(…) (existing) + ├── NATS handler (main.rs): + │ if !secret_store.cached_scope().contains(&dep) { + │ secret_store.refresh_auth().await?; + │ } + │ reconciler.apply(…) + │ + └── reads secrets via harmony_config (StoreSource wrapping + OpenbaoSecretStore with new JWT-bearer auth rung) ``` -## Pre-flight (do these once, before any PR) +## Pre-flight -- [ ] Read ADR-025 in full. Surface any objection now, not after PR-1 - lands. -- [ ] Confirm in staging: the existing fleet Zitadel project has a - service account usable by the operator with **write user - metadata** scope. If not, file a sub-task to provision it. +- [ ] Confirm staging Zitadel has a service account for the operator + with **write user metadata** scope. File a sub-task if not. - [ ] Confirm the staging OpenBao KV mount `harmony-fleet` exists, or - plan its creation as part of `OpenbaoSetupScore`. -- [ ] Decide deployment ID format. Recommended: the existing - `DeploymentName` newtype (`harmony_reconciler_contracts::fleet`). - Validate that its character set is safe for both KV paths and - Vault external-group names (`[a-zA-Z0-9_-]`). + add its creation to `OpenbaoSetupScore`. +- [ ] Validate `DeploymentName`'s character set is safe for KV paths + and Bao external-group names (`[a-zA-Z0-9_-]`). ## Work breakdown -Seven PRs. Sequenced where they must be, parallel where they can be. 
-Each is sized to land in a few days; none requires the next to be -designed before it can start. +Six PRs. Dependencies: ``` PR-1 harmony_zitadel_auth: extract JWT-bearer minter - └─ PR-2 harmony_secret/OpenbaoSecretStore: JWT-bearer auth + refresh - └─ PR-7 agent lifecycle/ + main.rs wiring + └─ PR-2 harmony_secret: JWT-bearer auth + refresh_auth + cached_scope + └─ PR-6 agent main.rs: inline refresh-check PR-3 harmony OpenbaoJwtAuth Score: bound_claims + groups_claim PR-4 Zitadel Action Score: deployments claim - └─ PR-5 Operator: ZitadelDeviceDeploymentsScore -PR-3 ─┴─ PR-6 Operator: OpenbaoDeploymentSecretScopeScore + └─ PR-5 Operator: FleetDeviceDeploymentMembershipScore +PR-3 ─┘ ``` -PRs 1, 3, 4 have no prerequisites and can start in parallel. +PRs 1, 3, 4 are independent and can start in parallel. --- @@ -101,24 +82,22 @@ PRs 1, 3, 4 have no prerequisites and can start in parallel. **Depends on:** nothing **Blocks:** PR-2 -**Goal.** Move the pure RFC 7523 JWT-bearer flow (assertion building + -token endpoint POST + caching) out of `harmony-fleet-auth` into the -neutral `harmony_zitadel_auth` crate, so OpenBao auth and NATS auth -share one implementation. Two real consumers cross the Rule-of-Three -threshold; no third is needed before extracting. +Move `MachineKeyFile`, `CachedToken`, `build_assertion*`, `build_scope`, +`build_token_url`, and the mint+cache logic out of +`fleet/harmony-fleet-auth/src/credentials.rs` into a new +`harmony_zitadel_auth/src/jwt_bearer.rs`. Two real consumers (NATS +callout + OpenBao) cross Rule of Three at the second consumer. -**Why not leave it in `harmony-fleet-auth`.** `harmony_secret` cannot -depend on `harmony-fleet-auth` (fleet-specific crate would invert the -dependency graph; `harmony_secret` is general-purpose). The minter is -not fleet-specific — it's Zitadel machine-user identity, which any -Harmony component might need. 
+`harmony_secret` cannot depend on `harmony-fleet-auth` (fleet-specific +crate; would invert the dependency graph). `harmony_zitadel_auth` is +neutral and already houses the human-OIDC counterpart. **Shape.** ```rust -// harmony_zitadel_auth/src/jwt_bearer.rs (new) +// harmony_zitadel_auth/src/jwt_bearer.rs pub struct ZitadelJwtBearer { - key: MachineKeyFile, // moved from harmony-fleet-auth + key: MachineKeyFile, oidc_issuer_url: String, audience: String, http: reqwest::Client, @@ -129,69 +108,36 @@ impl ZitadelJwtBearer { pub fn new(key: MachineKeyFile, oidc_issuer_url: String, audience: String, danger_accept_invalid_certs: bool) -> Result; - /// Returns a cached token if comfortably valid, otherwise mints fresh. + /// Cached if comfortably valid, otherwise mints fresh. pub async fn bearer_token(&self) -> Result; - /// Force a re-mint on the next bearer_token() call. + /// Force a re-mint on next call. pub fn invalidate_cache(&self); } ``` -Move `MachineKeyFile`, `CachedToken`, `build_assertion*`, `build_scope`, -`build_token_url`, and the constants into this crate. Re-export the -types from `harmony-fleet-auth` so the NATS path keeps compiling. - -**`harmony-fleet-auth` refactor.** `CredentialSource::ZitadelJwt` -becomes a thin wrapper: - -```rust -pub enum CredentialSource { - TomlShared { user: String, pass: String }, - ZitadelJwt { minter: Arc }, -} -``` - -`next_credential` for the JWT variant just calls -`minter.bearer_token()` and wraps it as `NatsCredential::BearerToken`. -All caching logic moves into the minter. - -**Tests.** - -- Port the existing `credentials.rs` unit tests (cache freshness, - assertion claims, scope, token URL, key-file parsing) into the new - crate, since the code being tested moved. -- Add one test confirming the NATS callback path still produces the - same `BearerToken` output as before. +`CredentialSource::ZitadelJwt` in `harmony-fleet-auth` becomes a thin +wrapper around `Arc`. 
Port the existing pure-builder +tests from `credentials.rs` into the new crate. **Acceptance.** `cargo test -p harmony_zitadel_auth -p harmony-fleet-auth` -clean. NATS auth in the fleet e2e harness still works. +clean. Fleet e2e NATS auth still works. --- -### PR-2 — `OpenbaoSecretStore` JWT-bearer auth + `refresh_auth()` + `cached_scope()` +### PR-2 — `OpenbaoSecretStore` JWT-bearer rung + `refresh_auth` + `cached_scope` **Crate:** `harmony_secret` (`src/store/openbao.rs`) **Depends on:** PR-1 -**Blocks:** PR-7 +**Blocks:** PR-6 -**Goal.** Add the fifth rung to `OpenbaoSecretStore`'s auth ladder: -machine identity via Zitadel JWT-bearer → POST to -`/v1/auth/jwt/login`. Make the resulting Vault token refreshable in -place and expose the deployment scope it carries. +Add the fifth rung to `OpenbaoSecretStore::new`'s auth ladder. +Position: between cached-token (rung 2) and Zitadel OIDC device flow +(rung 4). Triggered only when a machine keyfile is configured; if it +fails, fall through to the existing ladder. -**Auth ladder, after this PR.** Existing order preserved; new rung -slots between cached-token and Zitadel OIDC device flow: - -1. Env token (unchanged) -2. Cached token (unchanged) -3. **Zitadel JWT-bearer (NEW)** — only triggered if a machine keyfile - is configured; if it fails, fall through. -4. Zitadel OIDC device flow (unchanged — humans) -5. Userpass (unchanged) - -**New constructor inputs.** Refactor `OpenbaoSecretStore::new` to a -typed options struct (it already has the `too_many_arguments` clippy -TODO at line 55 — knock that out as part of this work). 
Add: +**Constructor refactor.** Knock out the `too_many_arguments` clippy +TODO at `openbao.rs:55` while we're adding more args: ```rust pub struct OpenbaoStoreOptions { @@ -200,19 +146,10 @@ pub struct OpenbaoStoreOptions { pub auth_mount: String, pub skip_tls: bool, pub token: Option, - pub username: Option, - pub password: Option, - - // human OIDC (existing) - pub zitadel_sso_url: Option, - pub zitadel_client_id: Option, - - // machine JWT-bearer (NEW) - pub zitadel_jwt_bearer: Option, - - // jwt-auth role on the Bao side (shared between OIDC and JWT-bearer) - pub jwt_role: Option, - pub jwt_auth_mount: Option, + pub username: Option, pub password: Option, + pub zitadel_sso_url: Option, pub zitadel_client_id: Option, + pub zitadel_jwt_bearer: Option, // NEW + pub jwt_role: Option, pub jwt_auth_mount: Option, } pub struct ZitadelJwtBearerConfig { @@ -223,93 +160,64 @@ pub struct ZitadelJwtBearerConfig { } ``` -**Interior mutability for refresh.** The current `client: VaultClient` -field is read-only after construction. Wrap it so `refresh_auth()` can -swap in a fresh client without rebuilding the whole store: +**Refresh capability.** A single `Mutex` over the refresh-affected +state. Refresh is rare and uncontended; `ArcSwap` would be ceremony. ```rust pub struct OpenbaoSecretStore { - client: ArcSwap, // was: VaultClient + inner: Mutex, kv_mount: String, auth_mount: String, - jwt_bearer: Option>, // for refresh + jwt_bearer: Option>, jwt_role: Option, jwt_auth_mount: Option, base_url: String, skip_tls: bool, - scope: ArcSwap>, // deployment IDs from token } -``` -`arc-swap` is already a common pattern in the codebase. Read paths -(`get_raw`, `set_raw`) load the current pointer; refresh stores a new -one. Locks aren't needed on the hot path. 
+struct Inner { + client: VaultClient, + scope: HashSet, // deployment IDs from the Zitadel JWT +} -**`refresh_auth()` implementation.** - -```rust impl OpenbaoSecretStore { pub async fn refresh_auth(&self) -> Result<(), SecretStoreError> { - let bearer = self.jwt_bearer.as_ref() - .ok_or_else(|| SecretStoreError::Store( - "refresh_auth requires JWT-bearer auth".into()))?; + let bearer = self.jwt_bearer.as_ref().ok_or_else(|| /* err */)?; bearer.invalidate_cache(); let jwt = bearer.bearer_token().await?; - let session = jwt_login( - &self.base_url, self.skip_tls, - self.jwt_auth_mount.as_deref().unwrap_or("jwt"), - self.jwt_role.as_deref().ok_or(/* err */)?, - &jwt, - ).await?; + let scope = decode_deployments_claim(&jwt)?; // pure JWT decode + let session = jwt_login(&self.base_url, /* … */, &jwt).await?; let client = build_vault_client(&self.base_url, self.skip_tls, &session.token)?; - self.client.store(Arc::new(client)); - self.scope.store(Arc::new(session.deployments)); + *self.inner.lock().await = Inner { client, scope }; Ok(()) } - pub fn cached_scope(&self) -> HashSet { - (**self.scope.load()).clone() + pub async fn cached_scope(&self) -> HashSet { + self.inner.lock().await.scope.clone() } } ``` -`jwt_login` POSTs to `/v1/auth/{mount}/login` with `{role, jwt}` and -parses the response. To populate `deployments`, follow with a -`/v1/auth/token/lookup-self` call and read the -`identity_policies` / external-group-alias list. (Vault returns -external group memberships in the `external_namespace_policies` / -`identity_policies` field; document which exact field once verified -against staging.) +`decode_deployments_claim` is a small pure helper that base64-decodes +the JWT body and reads the `deployments` array. No Bao round-trip. +`get_raw` / `set_raw` lock `inner`, take a `&VaultClient`, perform the +call, release. The lock is uncontended on the hot path. 
-**On initial construction with JWT-bearer.** The constructor wires up -`jwt_bearer` and calls `refresh_auth()` once before returning. If that -fails, fall through to the next auth method (preserves the existing -ladder semantics). - -**Distinguishing 403 from 404 in `get_raw`.** Today `StoreSource` -swallows all errors from `SecretStore::get_raw` and falls through to -the next config source (see `harmony_config/src/source/store.rs:38`). -A 403 (auth scope insufficient) should NOT be treated as "not found" -because falling through to the prompt source would prompt a human for -a secret a misconfigured device cannot read. Map 403 to a distinct -`SecretStoreError::AuthScope { … }` variant and have `StoreSource` -propagate it as a hard `ConfigError`. The lifecycle layer's -`refresh_auth()` precondition prevents reaching this path in practice, -but the defense-in-depth matters. +**On construction with JWT-bearer.** Wire `jwt_bearer` and call +`refresh_auth()` once before returning. If it fails, fall through to +the next ladder rung. **Tests.** -- Unit: ladder ordering (JWT-bearer config present → tried before - OIDC; absent → ladder unchanged). -- Unit: `refresh_auth` against a `reqwest::Server` (or a hand-rolled - fake) returns a new token and updates `cached_scope`. -- Unit: 403 from `get_raw` propagates as `AuthScope`, not `NotFound`. -- Integration: against staging OpenBao, a machine keyfile yields a - Vault token whose scope matches the device's deployments metadata. +- Ladder ordering: JWT-bearer present → tried before OIDC; absent → + unchanged. +- `refresh_auth` against a fake HTTP server updates `cached_scope`. +- `decode_deployments_claim` on a hand-crafted JWT returns the + expected set. **Acceptance.** `cargo test -p harmony_secret` clean. `examples/openbao` -still works for the userpass path. A manual e2e run against staging -authenticates via JWT-bearer and reads a placeholder secret. +userpass path unaffected. 
Manual staging run authenticates via +JWT-bearer. --- @@ -317,62 +225,40 @@ authenticates via JWT-bearer and reads a placeholder secret. **Crate:** `harmony` (`src/modules/openbao/setup.rs`) **Depends on:** nothing -**Blocks:** PR-6 staging dry run -**Goal.** Make the server-side JWT auth role expressive enough to bind -to a Zitadel project audience, require the `fleet-device` role, and -emit per-deployment external-group aliases. - -**Changes.** +Two new fields, both defaulting to empty so existing callers are +unaffected: ```rust pub struct OpenbaoJwtAuth { - pub oidc_discovery_url: String, - pub bound_issuer: String, - pub role_name: String, - pub bound_audiences: String, - pub user_claim: String, - pub policies: Vec, - pub ttl: String, - pub max_ttl: String, - - // NEW — both default to empty - #[serde(default)] - pub bound_claims_json: String, // → `bound_claims=…` - #[serde(default)] - pub groups_claim: String, // → `groups_claim=…` + // … existing fields … + #[serde(default)] pub bound_claims_json: String, + #[serde(default)] pub groups_claim: String, } ``` -Extend `configure_jwt` (`setup.rs:435+`) to pass the two new flags to -`bao write auth/jwt/role/...` only when non-empty. - -**Tests.** - -- Unit: serde round-trip with both new fields populated and both - empty. -- Existing integration paths unaffected (default-empty leaves - behavior identical). +Extend `configure_jwt` (line 435+) to pass each as a `bao write +auth/jwt/role/...` flag when non-empty. **Acceptance.** `cargo check -p harmony --all-features` clean. -Existing callers compile unchanged. `examples/openbao` still works. +`examples/openbao` still works. 
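The "only when non-empty" rule for the two new role fields could look roughly like the sketch below. It is an assumption-laden illustration: `JwtRoleExtras` and `extra_role_args` are hypothetical names, and how the real `configure_jwt` assembles its `bao write` arguments is not shown in this document.

```rust
/// Hypothetical stand-in for the two new `OpenbaoJwtAuth` fields.
pub struct JwtRoleExtras {
    pub bound_claims_json: String,
    pub groups_claim: String,
}

/// Build the extra `key=value` arguments for the role write.
/// Empty fields produce no argument, so existing callers are unaffected.
pub fn extra_role_args(extras: &JwtRoleExtras) -> Vec<String> {
    let mut args = Vec::new();
    if !extras.bound_claims_json.is_empty() {
        args.push(format!("bound_claims={}", extras.bound_claims_json));
    }
    if !extras.groups_claim.is_empty() {
        args.push(format!("groups_claim={}", extras.groups_claim));
    }
    args
}
```

With both fields left at their defaults the function returns an empty list, which matches the stated goal that default-empty leaves behavior identical.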
--- ### PR-4 — Zitadel Action Score: surface `deployments` metadata as a claim -**Crate:** `harmony` (likely new module under `src/modules/zitadel/` or -`src/modules/sso/`, follow whatever module currently houses Zitadel -deploys) +**Crate:** `harmony` (new module under `src/modules/zitadel/` or +wherever existing Zitadel deploy code lives) **Depends on:** nothing **Blocks:** PR-5 -**Goal.** A declarative Score that ensures the post-access-token-creation -Action exists on the configured Zitadel project, with the script that -copies `user.metadata.deployments` into a top-level `deployments` -claim. +Declarative Score that upserts the pre-access-token-creation Action +on the configured Zitadel project. Cannot be avoided: Zitadel emits +`urn:zitadel:iam:user:metadata` as base64-encoded values in a map, +and Bao's `groups_claim` can't consume that shape — the Action +decodes and re-emits as a flat string array. -**Action script** (canonical text, do not modify casually): +**Action script** (canonical text): ```javascript function addDeployments(ctx, api) { @@ -386,7 +272,7 @@ function addDeployments(ctx, api) { if (Array.isArray(deployments)) { api.v1.claims.setClaim("deployments", deployments); } - } catch (_) { /* malformed metadata → no deployments */ } + } catch (_) { /* malformed → no deployments */ } } ``` @@ -396,273 +282,165 @@ function addDeployments(ctx, api) { pub struct ZitadelDeploymentsClaimActionScore { pub project_id: String, pub action_name: String, // default "fleet-deployments-claim" - pub flow_type: FlowType, // ComplementToken - pub trigger_type: TriggerType, // PreAccessTokenCreation } ``` -Interpret upserts the Action via Zitadel Management API and attaches -it to `Complement Token / PreAccessTokenCreation` on the project. -Idempotent. +Interpret upserts via Zitadel Management API, attaches to +`Complement Token / PreAccessTokenCreation`. Idempotent. -**Tests.** - -- Unit: serialization, defaults. 
-- Manual E2E: stand up a fresh Zitadel, apply the Score, mint a - machine-user token whose owner has metadata - `deployments=["foo","bar"]`, confirm the claim appears at jwt.io. - -**Acceptance.** Manually decoded token from staging contains the -expected `deployments` claim. +**Acceptance.** Manual decode of a staging token shows the +`deployments` claim. --- -### PR-5 — Operator: `ZitadelDeviceDeploymentsScore` + wiring +### PR-5 — `FleetDeviceDeploymentMembershipScore` + operator wiring -**Crate:** `harmony` for the Score; `fleet/harmony-fleet-operator` for -the wiring. -**Depends on:** PR-4 (the Action must exist for the claim to surface) -**Blocks:** PR-7 staging dry run +**Crate:** `harmony` (`src/modules/fleet/` or wherever fleet Scores +live) for the Score; `fleet/harmony-fleet-operator` for wiring. +**Depends on:** PR-3 (Bao JWT role + accessor must exist), PR-4 +(claim must surface) -**Goal.** Operator writes membership to Zitadel as the **first** of -its three writes on every reconciled deployment. - -**Score shape.** +One Score, two writes, declared in execution order. Per ADR-023 the +operator hands off ordering to the Score rather than sequencing the +external API calls by hand. ```rust -pub struct ZitadelDeviceDeploymentsScore { - pub project_id: String, - pub device_user_id: String, // Zitadel machine user ID - pub deployments: Vec, // declarative full set, NOT a diff +pub struct FleetDeviceDeploymentMembershipScore { + pub zitadel_project_id: String, + pub device_user_id: String, // Zitadel machine user ID + pub deployments: Vec, // declarative full set + pub openbao_instance: OpenbaoInstance, + pub kv_mount: String, // "harmony-fleet" + pub jwt_auth_accessor: String, // resolved from PR-3 } ``` -Interpret semantics: **declarative replace**. Read the current -metadata value, compare to the desired set, write only on diff. -Removal is "declare the new set without the removed deployment." 
+Interpret runs, in order: -**Operator changes** (in `fleet_aggregator.rs` or `device_reconciler.rs`, -wherever per-`Deployment` reconciliation lives today). On reconciling -a `Deployment` CR: +1. **Zitadel metadata** — declarative replace on + `user.metadata.deployments`. Read current value, write only on + diff. Removal is "declare the new set without the removed entry." +2. **OpenBao external group + policy, per deployment in the set.** For + each `` in `deployments`: + - Policy `fleet-deployment-` granting `read` on + `harmony-fleet/data//*` and + `read, list` on `harmony-fleet/metadata//*`. + - External group `` (type=external) with that policy + attached, alias matching the JWT-auth accessor. -1. Compute the desired `device → deployments` map for every device - targeted by the deployment. -2. For each device, compose and run `ZitadelDeviceDeploymentsScore`. -3. Continue to `OpenbaoDeploymentSecretScopeScore` (PR-6) and only - then to the existing NATS publish. +Idempotent throughout. -If step 2 fails, surface as a status condition on the CR and do NOT -proceed to NATS — never publish a desired-state for which the device -cannot acquire scope. +**Operator wiring.** In the existing per-`Deployment` reconciler, +compose this Score before the existing NATS desired-state publish. +If the Score errors, surface as a CR status condition and do not +publish — never tell a device to run a deployment it cannot +authenticate for. + +**Note: NATS publish stays in operator code.** It's data-plane (the +"now do this" half of enrollment), already wired, and changing its +home isn't part of this work. **Tests.** -- Unit on the Score: diff logic (no-op when sets match; full write - when they differ). -- Integration on the operator: a fake Zitadel client receives the - expected write; the order against the existing NATS publish is - preserved. +- Unit on the Score: Zitadel diff logic (no-op on match, full write + on diff); policy text generation; group alias shape. 
+- Integration: against the real Bao in `examples/openbao` + a fake + Zitadel; second apply is a no-op. -**Acceptance.** Creating a `Deployment` CR in staging causes the -target device's Zitadel metadata to contain the deployment ID. +**Acceptance.** A `Deployment` CR rolled in staging causes both +writes to land before the NATS publish; manual JWT-bearer login at +Bao returns a token with the expected per-deployment policies +attached via external-group bindings. --- -### PR-6 — Operator: per-deployment OpenBao group + policy - -**Crate:** `harmony` (`src/modules/openbao/`) for the Score; operator -for wiring. -**Depends on:** PR-3 (role + accessor must exist) -**Blocks:** PR-7 staging dry run - -**Goal.** When a deployment exists, ensure -`identity/group/` and `fleet-deployment-` exist with -the right shape. - -**Score shape.** - -```rust -pub struct OpenbaoDeploymentSecretScopeScore { - pub instance: OpenbaoInstance, - pub kv_mount: String, // "harmony-fleet" - pub deployment_id: String, - pub jwt_auth_accessor: String, // resolved from the PR-3 role -} -``` - -Interpret produces: - -- Policy `fleet-deployment-`: - - ```hcl - path "harmony-fleet/data//*" { - capabilities = ["read"] - } - path "harmony-fleet/metadata//*" { - capabilities = ["read", "list"] - } - ``` - -- External group `` (`type=external`), policy attached, - with an alias `name=, - canonical_id=, mount_accessor=`. - -Idempotent. - -**Operator wiring.** Same call site as PR-5, after the Zitadel write -and before the NATS publish. - -**Tests.** - -- Unit on policy text generation. -- Integration against the real Bao in `examples/openbao`; second run - is a no-op. - -**Acceptance.** Manually `curl`-ing -`/v1/auth/jwt/login` with a real device JWT returns a Vault token -whose `policies` include `fleet-deployment-` and whose -`identity_policies` reflect the external group binding. 
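The per-deployment policy text is a natural candidate for a pure, unit-testable function. The sketch below is illustrative only: `is_safe_deployment_id`, `policy_name`, and `policy_hcl` are hypothetical helpers, and it assumes the path placeholder that appears empty in this document is the deployment ID, restricted to the `[a-zA-Z0-9_-]` character set from the pre-flight checklist.

```rust
/// True when `id` is non-empty and uses only `[a-zA-Z0-9_-]`
/// (the character set named in the pre-flight checklist).
pub fn is_safe_deployment_id(id: &str) -> bool {
    !id.is_empty()
        && id
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}

/// Assumed naming scheme: `fleet-deployment-` followed by the deployment ID.
pub fn policy_name(id: &str) -> String {
    format!("fleet-deployment-{id}")
}

/// Render the policy body for one deployment, or `None` if the ID is unsafe
/// to embed in a KV path.
pub fn policy_hcl(kv_mount: &str, id: &str) -> Option<String> {
    if !is_safe_deployment_id(id) {
        return None;
    }
    Some(format!(
        "path \"{kv_mount}/data/{id}/*\" {{\n  capabilities = [\"read\"]\n}}\n\
         path \"{kv_mount}/metadata/{id}/*\" {{\n  capabilities = [\"read\", \"list\"]\n}}\n"
    ))
}
```

Rejecting unsafe IDs at render time is a design choice for the sketch; the real Score may instead rely on `DeploymentName` validation upstream.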
- ---- - -### PR-7 — Agent `lifecycle/` module + main.rs wiring +### PR-6 — Agent: inline refresh-check in `main.rs` **Crate:** `fleet/harmony-fleet-agent` -**Depends on:** PR-2 (needs `refresh_auth()` + `cached_scope()`) +**Depends on:** PR-2 -**Goal.** The orchestration layer that calls -`secret_store.refresh_auth()` before `reconciler.apply()` whenever the -desired-state introduces a deployment whose secrets are outside the -cached scope. - -**Module shape.** +Around the existing JetStream KV watcher in +`fleet/harmony-fleet-agent/src/main.rs:110-120`, gate +`reconciler.apply` on a scope check: ```rust -// fleet/harmony-fleet-agent/src/lifecycle/mod.rs -pub struct DeploymentLifecycle { - secret_store: Arc, // shared with ConfigManager - reconciler: Arc, -} - -impl DeploymentLifecycle { - pub async fn on_apply(&self, key: &str, value: &[u8]) -> Result<()> { - let dep = deployment_from_key(key) - .context("desired-state key without deployment component")?; - let scope = self.secret_store.cached_scope(); - if !scope.contains(dep.as_str()) { - tracing::info!(%dep, "deployment outside cached secret scope — refreshing"); - self.secret_store.refresh_auth().await - .with_context(|| format!("refresh secrets for {dep}"))?; - if !self.secret_store.cached_scope().contains(dep.as_str()) { - anyhow::bail!( - "deployment {dep} still missing from token scope after refresh — \ - operator may not have written Zitadel metadata yet" - ); +async_nats::jetstream::kv::Operation::Put => { + if let Some(dep) = deployment_from_key(&entry.key) { + if !secret_store.cached_scope().await.contains(dep.as_str()) { + tracing::info!(%dep, "deployment outside cached scope — refreshing"); + if let Err(e) = secret_store.refresh_auth().await { + tracing::warn!(key = %entry.key, error = %e, "refresh failed"); + continue; } } - self.reconciler.apply(key, value).await } - - pub async fn on_remove(&self, key: &str) -> Result<()> { - // No refresh needed for removal; access lapses naturally on next 
refresh. - self.reconciler.remove(key).await + if let Err(e) = reconciler.apply(&entry.key, &entry.value).await { + tracing::warn!(key = %entry.key, error = %e, "apply failed"); } } ``` -The `OpenbaoSecretStore` instance is the same one held by the agent's -`ConfigManager`. Either construct it once and share via `Arc`, or -expose a typed accessor on `ConfigManager` so the lifecycle layer can -reach the underlying store. The Arc-sharing is simpler. +`secret_store` is the same `Arc` held by the +agent's `ConfigManager` — share it through whatever construction +path `main.rs` already uses. -**`main.rs` change.** Replace direct calls to `reconciler.apply` / -`reconciler.remove` in the JetStream KV watcher (around line 111-119) -with `lifecycle.on_apply` / `lifecycle.on_remove`. +**No new module.** One consumer, one site, ~10 lines. Per CLAUDE.md +"Rule of Three: introduce an abstraction at the second real +instance." If/when a second pre-reconcile concern (image-pull creds, +monitoring registration) arrives, extract a layer then. -**Bounded retry at the call site.** A race remains: operator may -publish to NATS milliseconds before Zitadel propagates the metadata -write to the token issuer. Handle as a bounded retry around -`lifecycle.on_apply`, not inside the lifecycle module: +**No retry on "still missing after refresh."** The operator's +ordering guarantees Zitadel is consistent before the NATS publish. +If we observe the race in practice, add a single retry then. -```rust -let res = retry_with_backoff( - || lifecycle.on_apply(&entry.key, &entry.value), - /* attempts */ 3, - /* initial */ Duration::from_secs(2), -); -``` +**Tests.** None new — this is wiring between two tested components. +Confirm via the integration milestone below. -**Tests.** - -- Unit with a fake `OpenbaoSecretStore` trait-object (cached - deployment → no refresh; uncached → refresh runs; refresh succeeds - but scope still missing → reconciler does NOT run, error returned). 
-- E2E: fresh device + new deployment in staging shows one - "outside cached scope — refreshing" log followed by a clean - reconcile. - -**Acceptance.** Manually rolling out a new deployment to a staging -device shows the expected log sequence and the workload starts. +**Acceptance.** Rolling a new deployment to a staging device shows +one "outside cached scope — refreshing" log followed by a clean +reconcile. --- ## Integration milestone — staging dry run -After PRs 1-7 land: +After PRs 1-6 land: -- [ ] Apply the new `OpenbaoJwtAuth` config to staging Bao via the - existing `OpenbaoSetupScore`. +- [ ] Apply the new `OpenbaoJwtAuth` config to staging Bao via + `OpenbaoSetupScore`. - [ ] Apply `ZitadelDeploymentsClaimActionScore` to staging Zitadel. -- [ ] Manually provision metadata + group/policy for one test - deployment + one test device. Confirm the agent reads its secret. -- [ ] Roll out via the operator: create a `Deployment` CR, confirm - the operator runs the three writes in order and the agent - reconciles cleanly. -- [ ] Negative: device not in deployment X cannot read X's secret - (expect `AuthScope` propagation, not silent fall-through). -- [ ] Negative: a JWT minted before metadata update cannot read the - new deployment's secret until refresh — confirm the lifecycle - layer triggers refresh automatically. +- [ ] Hand-provision metadata + group/policy for one test deployment + + one test device. Confirm the agent reads its secret. +- [ ] Roll a `Deployment` CR via the operator. Confirm + `FleetDeviceDeploymentMembershipScore` writes Zitadel and Bao + before NATS, and the agent reconciles cleanly. +- [ ] Negative: device not in deployment X gets a hard error reading + X's secret (not a silent fall-through). +- [ ] Negative: a JWT minted before the metadata update cannot read + the new deployment's secret until the agent's inline refresh + runs. 
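The ordering this milestone checks (Zitadel and Bao writes land before the NATS publish, and nothing is published if either write fails) can be sketched as a small pure function. This is an illustrative sketch only: `Step`, `roll_deployment`, and the closure-based `run` hook are hypothetical names, not part of the Harmony codebase, and the real Score and operator wiring may be structured quite differently.

```rust
// Hypothetical sketch: `Step` and `roll_deployment` are illustrative names,
// not existing Harmony types.

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Step {
    ZitadelMetadata,
    BaoGroupAndPolicy,
    NatsPublish,
}

/// Runs the two control-plane writes in order and publishes to NATS only if
/// both succeed. Returns the steps that completed, plus the error that
/// stopped the rollout (if any), which an operator could surface as a CR
/// status condition.
fn roll_deployment<F>(mut run: F) -> (Vec<Step>, Option<String>)
where
    F: FnMut(Step) -> Result<(), String>,
{
    let mut done = Vec::new();
    for step in [Step::ZitadelMetadata, Step::BaoGroupAndPolicy, Step::NatsPublish] {
        if let Err(e) = run(step) {
            // Stop at the first failure: later steps never run, so a failed
            // write can never be followed by a NATS publish.
            return (done, Some(format!("{step:?} failed: {e}")));
        }
        done.push(step);
    }
    (done, None)
}

fn main() {
    // Happy path: all three steps run in order.
    let (done, err) = roll_deployment(|_| Ok(()));
    println!("completed: {done:?}, error: {err:?}");

    // Bao write fails: the Zitadel write has landed, NATS is never reached.
    let (done, err) = roll_deployment(|s| {
        if s == Step::BaoGroupAndPolicy {
            Err("bao unavailable".to_string())
        } else {
            Ok(())
        }
    });
    println!("completed: {done:?}, error: {err:?}");
}
```

Modelling the steps as data keeps the "never publish without scope" property checkable in a unit test with no Zitadel, Bao, or NATS instance involved; the staging checklist above would still be needed to confirm the real writes.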
-## Decisions deferred (open questions) +## Decisions deferred -These are not blockers; flag them when you hit them. - -1. **Vault token TTL.** ADR-025 recommends 15 min. Confirm this works - with the longest expected reconcile window and adjust - `OpenbaoJwtAuth::ttl` if not. -2. **Hard revocation on deployment removal.** ADR-025 leaves this to - TTL expiry. If a future fleet requires immediate revocation, add an - `OpenbaoTokenRevokeOnRemoval` companion Score. Don't pre-build it. +1. **Vault token TTL.** ADR-025 recommends 15 min. Confirm in + staging; adjust `OpenbaoJwtAuth::ttl` if needed. +2. **Hard revocation on deployment removal.** Wait for TTL today. + Add a revoke companion only if a real fleet requires it. 3. **Bao down at agent startup.** `OpenbaoSecretStore::new` with - JWT-bearer should not panic; it should fall through to other auth - methods or return an error the agent can retry. Confirm behavior - in the integration milestone and document. -4. **`StoreSource` 403 propagation.** PR-2 introduces a new - `SecretStoreError::AuthScope` variant. Decide whether - `harmony_config::source::store.rs` should treat it as a hard error - (recommended) or continue to fall through (current behavior). - Tests in PR-2 should lock the chosen semantics. -5. **One-keyfile-per-store or pool.** PR-2 assumes one - `ZitadelJwtBearer` per `OpenbaoSecretStore`. If a future use case - needs the agent to authenticate as multiple identities (unlikely), - we'd revisit. Single-identity is the right starting shape. + JWT-bearer must not panic; either fall through or surface a + retryable error. Confirm and document in the milestone. -## Required reading before starting +## Required reading -- `docs/adr/025-fleet-device-secret-access.md` — the design. -- `docs/adr/020-1-zitadel-openbao-secure-config-store.md` — the - unified-config rationale; this PR set is the machine-user counterpart. 
-- `docs/adr/023-deploy-architecture.md` — Score discipline for the - deploy-side Scores in PRs 3-6. -- `harmony_secret/src/store/openbao.rs` — the store being extended in - PR-2 (auth ladder at line 70-185). -- `harmony_config/src/source/store.rs` — the wrapper the agent uses; - read it to understand the 403-vs-404 question in PR-2. -- `fleet/harmony-fleet-auth/src/credentials.rs` — the existing - JWT-bearer mint path being extracted in PR-1. -- `nats/callout/src/zitadel.rs` — the existing JWT validation pattern; - the Bao JWT auth role mirrors its `bound_issuer` + `bound_audiences` - shape. -- `fleet/harmony-fleet-agent/src/main.rs` and `reconciler.rs` — the - agent shape PR-7 modifies. +- `docs/adr/025-fleet-device-secret-access.md` — design. +- `docs/adr/020-1-zitadel-openbao-secure-config-store.md` — + human-user counterpart of this auth flow. +- `docs/adr/023-deploy-architecture.md` — Score discipline for PRs 3-5. +- `harmony_secret/src/store/openbao.rs` — auth ladder being extended. +- `harmony_config/src/source/store.rs` — wrapper the agent uses. +- `fleet/harmony-fleet-auth/src/credentials.rs` — JWT-bearer mint + path being extracted in PR-1. +- `nats/callout/src/zitadel.rs` — JWT validation shape the Bao role + mirrors (`bound_issuer` + `bound_audiences`). +- `fleet/harmony-fleet-agent/src/main.rs` — site of the PR-6 inline + edit. diff --git a/docs/adr/025-fleet-device-secret-access.md b/docs/adr/025-fleet-device-secret-access.md index 23640a2e..6405dbd1 100644 --- a/docs/adr/025-fleet-device-secret-access.md +++ b/docs/adr/025-fleet-device-secret-access.md @@ -102,65 +102,34 @@ their short TTL expires; explicit revocation is available via `bao token revoke` on the device's accessor if hard revocation is needed. -### 3. `harmony_secret` JWT-bearer auth method, consumed via `harmony_config` +### 3. Client side: JWT-bearer in `harmony_secret`, refresh before reconcile in the agent -The agent does **not** grow a new secrets client. 
Per ADR-020 the -unified entry point for both config and secrets is `harmony_config`, -which already chains `StoreSource` through -`harmony_secret`. The missing piece is auth, not retrieval: -`OpenbaoSecretStore::new` today supports a token from env, a cached -token, the Zitadel OIDC **device** flow (humans), and userpass — but -not the Zitadel **JWT-bearer** flow that the agent needs for headless -machine identity. +The agent does **not** grow a new secrets client. Per ADR-020, +`harmony_config` is the unified config+secret entry point and already +wraps OpenBao via `harmony_secret::OpenbaoSecretStore`. The missing +piece is auth: `OpenbaoSecretStore` supports env token, cached token, +Zitadel OIDC device flow (humans), and userpass — but not Zitadel +**JWT-bearer** for headless machine identity. -The work is therefore mostly inside `harmony_secret`: +Three additions: -- Extend `OpenbaoSecretStore` with a new auth branch that takes a - Zitadel machine keyfile + Bao JWT role + audience, mints an access - token via RFC 7523 JWT-bearer (the same flow - `fleet/harmony-fleet-auth/src/credentials.rs` already implements - for NATS), and POSTs to `/v1/auth/jwt/login`. -- Pull the JWT-bearer minting itself into `harmony_zitadel_auth` so - the agent's NATS path and the OpenBao auth path share one - implementation. Two real consumers cross the Rule-of-Three - threshold. -- Add interior mutability to `OpenbaoSecretStore` so a re-auth - (re-mint + re-login) can replace the cached Vault token in place - without rebuilding the entire `ConfigManager`. The `SecretStore` - trait stays `&self`-only on the read path; an explicit - `refresh_auth()` is the only mutator. -- Expose `cached_scope() -> HashSet` derived from the - Vault token's identity-group aliases (one call to - `/v1/auth/token/lookup-self` per refresh, cached for the token's - lifetime). 
+- A fifth rung on `OpenbaoSecretStore`'s auth ladder takes a Zitadel + machine keyfile + Bao JWT role + audience, mints via RFC 7523, and + POSTs to `/v1/auth/jwt/login`. +- The pure minting moves to `harmony_zitadel_auth` so NATS and + OpenBao auth share one implementation (Rule of Three: NATS callout + + OpenBao auth = two real consumers). +- `OpenbaoSecretStore` gains `refresh_auth()` (re-mint + re-login, + guarded by an internal `Mutex`) and `cached_scope() -> + HashSet` derived from decoding the in-hand Zitadel JWT — + no Bao round-trip needed since the `deployments` claim is already + in the token we just minted. -### 4. Agent-side lifecycle layer between NATS and the reconciler - -A thin module — `fleet/harmony-fleet-agent/src/lifecycle/` — sits -between the NATS KV watcher in `main.rs` and `Reconciler::apply()`, -calling into `harmony_config`'s OpenBao source to refresh the -secret-store auth before reconciling a deployment the agent has not -yet seen: - -``` -on_desired_state_changed(device, desired_deployments): - if not desired_deployments ⊂ secret_store.cached_scope(): - secret_store.refresh_auth() # drops Bao + Zitadel tokens, re-mints both - reconciler.apply(...) # cannot start until refresh returns Ok -``` - -`refresh_auth()` is atomic at the OpenBao store level: it invalidates -both the cached Bao Vault token and the cached Zitadel access token, -re-mints a fresh Zitadel token (so the new `deployments` claim is -reflected), and re-runs `/v1/auth/jwt/login` to obtain a fresh Vault -token. The reconciler runs only after the refresh succeeds. - -This layer has no other reason to exist *today*, but it is the -canonical home for any future "auxiliary work that must happen before -a deployment can run" — fetching image-pull credentials, registering -with monitoring, requesting a per-deployment NATS credential, etc. -Locking that home now avoids those concerns leaking into either the -NATS handler or the reconciler when they arrive. 
+In the agent, the NATS KV watcher consults `cached_scope()` before +each `reconciler.apply()`. If the desired deployment isn't covered, +it calls `refresh_auth()` and proceeds. The check is inline in +`main.rs` — about ten lines around the existing watcher loop. No +new module: one consumer, one site, inlining is the right size. ### Secret path layout @@ -201,24 +170,20 @@ deployment) or one login that resolves to multiple policies. `groups_claim` + external groups gives the latter cleanly: one login, N policies attached automatically. -**Why the lifecycle layer, separately from NATS handling.** The -trigger today is NATS, but the *operation* — "make this device ready -for a new deployment" — is a domain concept that may have other -triggers (admin command, local config reload, agent restart with stale -cache, future control-plane signal). Tying the orchestration to the -NATS message handler conflates transport with business logic. The -layer also makes the refresh-then-reconcile ordering an explicit -property of the agent's architecture, testable with fakes and visible -in the code. - **Why `harmony_config` / `harmony_secret`, not a fleet-local secrets client.** ADR-020 is explicit that `harmony_config` is the unified config+secret entry point and `OpenbaoSecretStore` is the canonical OpenBao client. Adding a parallel fleet-only client would duplicate the auth ladder, cache-file layout, and `kv2` plumbing already in -`harmony_secret`. The fleet's needs are an *additional auth branch*, -not a different store. Putting it where every other Harmony consumer -will also benefit is the correct locus. +`harmony_secret`. The fleet's need is an *additional auth branch*, +not a different store. + +**Why scope is decoded from the Zitadel JWT, not asked of Bao.** The +agent already holds the JWT it's about to present at login; the +`deployments` claim is right there. 
A `/v1/auth/token/lookup-self` +round-trip after login would compute the same set from the other +direction, paying a network call to recover information already in +hand. ## Consequences -- 2.39.5