diff --git a/ROADMAP/fleet_platform/device_secret_access_handoff.md b/ROADMAP/fleet_platform/device_secret_access_handoff.md new file mode 100644 index 00000000..53ccc71d --- /dev/null +++ b/ROADMAP/fleet_platform/device_secret_access_handoff.md @@ -0,0 +1,446 @@ +# Fleet device secret access — implementation handoff + +**Owner:** TBD +**Design:** [`docs/adr/025-fleet-device-secret-access.md`](../../docs/adr/025-fleet-device-secret-access.md) +**Status:** Ready to start +**Written:** 2026-06-01 + +Read ADR-025 first. This document is the work plan, not the design. + +## Summary + +The fleet agent reads per-deployment secrets through the existing +`harmony_config` chain, which already wraps OpenBao via +`harmony_secret::OpenbaoSecretStore`. `OpenbaoSecretStore`'s auth +ladder is extended with one new rung — Zitadel **JWT-bearer**, keyed +off the device's existing machine keyfile (the same root of trust the +NATS callout uses). Per-deployment scope rides in the JWT itself as a +`deployments` array claim, populated by the operator through Zitadel +user metadata. OpenBao's `groups_claim` binds those values to +auto-created external groups, each with a small policy granting +`read` on `harmony-fleet/data//*`. The agent's NATS +watcher calls `refresh_auth()` on the store before reconciling a +deployment whose secrets are outside the cached scope. + +## Architecture at a glance + +``` + Fleet operator (one per fleet, holds Zitadel admin scope) + │ + ├── on new Deployment CR: + │ 1. FleetDeviceDeploymentMembershipScore + │ ├─ Zitadel: append dep to device.metadata.deployments + │ └─ OpenBao: upsert external group + policy + │ fleet-deployment- → read on + │ harmony-fleet/data//* + │ 2. 
existing NATS publish (data plane, unchanged) + │ + ▼ + Fleet agent + │ + ├── NATS handler (main.rs): + │ if !secret_store.cached_scope().contains(&dep) { + │ secret_store.refresh_auth().await?; + │ } + │ reconciler.apply(…) + │ + └── reads secrets via harmony_config (StoreSource wrapping + OpenbaoSecretStore with new JWT-bearer auth rung) +``` + +## Pre-flight + +- [ ] Confirm staging Zitadel has a service account for the operator + with **write user metadata** scope. File a sub-task if not. +- [ ] Confirm the staging OpenBao KV mount `harmony-fleet` exists, or + add its creation to `OpenbaoSetupScore`. +- [ ] Validate `DeploymentName`'s character set is safe for KV paths + and Bao external-group names (`[a-zA-Z0-9_-]`). + +## Work breakdown + +Six PRs. Dependencies: + +``` +PR-1 harmony_zitadel_auth: extract JWT-bearer minter + └─ PR-2 harmony_secret: JWT-bearer auth + refresh_auth + cached_scope + └─ PR-6 agent main.rs: inline refresh-check + +PR-3 harmony OpenbaoJwtAuth Score: bound_claims + groups_claim +PR-4 Zitadel Action Score: deployments claim + └─ PR-5 Operator: FleetDeviceDeploymentMembershipScore +PR-3 ─┘ +``` + +PRs 1, 3, 4 are independent and can start in parallel. + +--- + +### PR-1 — Extract Zitadel JWT-bearer minter into `harmony_zitadel_auth` + +**Crate:** `harmony_zitadel_auth` +**Depends on:** nothing +**Blocks:** PR-2 + +Move `MachineKeyFile`, `CachedToken`, `build_assertion*`, `build_scope`, +`build_token_url`, and the mint+cache logic out of +`fleet/harmony-fleet-auth/src/credentials.rs` into a new +`harmony_zitadel_auth/src/jwt_bearer.rs`. Two real consumers (NATS +callout + OpenBao) cross Rule of Three at the second consumer. + +`harmony_secret` cannot depend on `harmony-fleet-auth` (fleet-specific +crate; would invert the dependency graph). `harmony_zitadel_auth` is +neutral and already houses the human-OIDC counterpart. 
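The mint+cache behaviour being moved can be pictured with a small, std-only sketch. The fields of `CachedToken`, the 60-second margin, and the `token_or_mint` helper are assumptions made for illustration; the real code in `credentials.rs` may differ on all three.

```rust
use std::time::{Duration, Instant};

/// Hypothetical shape of the cached access token; the real `CachedToken`
/// may carry different fields.
#[derive(Clone, Debug)]
pub struct CachedToken {
    pub access_token: String,
    pub expires_at: Instant,
}

/// Assumed safety margin: re-mint when the token expires within this window.
pub const REFRESH_MARGIN: Duration = Duration::from_secs(60);

impl CachedToken {
    /// True when the token stays valid for longer than `REFRESH_MARGIN`.
    pub fn is_comfortably_valid(&self, now: Instant) -> bool {
        match self.expires_at.checked_duration_since(now) {
            Some(remaining) => remaining > REFRESH_MARGIN,
            None => false, // already expired
        }
    }
}

/// Returns the cached token when it is comfortably valid; otherwise calls
/// `mint`, stores the fresh token in the cache, and returns it.
pub fn token_or_mint<F>(cache: &mut Option<CachedToken>, now: Instant, mint: F) -> String
where
    F: FnOnce() -> CachedToken,
{
    if let Some(tok) = cache.as_ref() {
        if tok.is_comfortably_valid(now) {
            return tok.access_token.clone();
        }
    }
    let fresh = mint();
    let token = fresh.access_token.clone();
    *cache = Some(fresh);
    token
}
```

Invalidating the cache, in this sketch, would amount to setting the `Option` back to `None` so that the next call mints unconditionally.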
+ +**Shape.** + +```rust +// harmony_zitadel_auth/src/jwt_bearer.rs +pub struct ZitadelJwtBearer { + key: MachineKeyFile, + oidc_issuer_url: String, + audience: String, + http: reqwest::Client, + cache: Mutex>, +} + +impl ZitadelJwtBearer { + pub fn new(key: MachineKeyFile, oidc_issuer_url: String, audience: String, + danger_accept_invalid_certs: bool) -> Result; + + /// Cached if comfortably valid, otherwise mints fresh. + pub async fn bearer_token(&self) -> Result; + + /// Force a re-mint on next call. + pub fn invalidate_cache(&self); +} +``` + +`CredentialSource::ZitadelJwt` in `harmony-fleet-auth` becomes a thin +wrapper around `Arc`. Port the existing pure-builder +tests from `credentials.rs` into the new crate. + +**Acceptance.** `cargo test -p harmony_zitadel_auth -p harmony-fleet-auth` +clean. Fleet e2e NATS auth still works. + +--- + +### PR-2 — `OpenbaoSecretStore` JWT-bearer rung + `refresh_auth` + `cached_scope` + +**Crate:** `harmony_secret` (`src/store/openbao.rs`) +**Depends on:** PR-1 +**Blocks:** PR-6 + +Add the fifth rung to `OpenbaoSecretStore::new`'s auth ladder. +Position: between cached-token (rung 2) and Zitadel OIDC device flow +(rung 4). Triggered only when a machine keyfile is configured; if it +fails, fall through to the existing ladder. 
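The fall-through behaviour described above can be sketched independently of OpenBao. The `AuthRung` enum, the `ladder` and `first_success` helpers, and the relative order of the OIDC and userpass rungs are assumptions for illustration; only the position of the JWT-bearer rung (after cached-token, before OIDC device flow, and only when a machine keyfile is configured) comes from this plan.

```rust
/// Illustrative names for the auth rungs mentioned in this plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthRung {
    EnvToken,
    CachedToken,
    ZitadelJwtBearer,
    ZitadelOidcDevice,
    Userpass,
}

/// Builds the ladder; the JWT-bearer rung is present only when a machine
/// keyfile is configured.
pub fn ladder(machine_key_configured: bool) -> Vec<AuthRung> {
    let mut rungs = vec![AuthRung::EnvToken, AuthRung::CachedToken];
    if machine_key_configured {
        rungs.push(AuthRung::ZitadelJwtBearer);
    }
    rungs.push(AuthRung::ZitadelOidcDevice);
    rungs.push(AuthRung::Userpass);
    rungs
}

/// Tries each rung in order and returns the first one that yields a token.
/// A failing rung is skipped, which is the "fall through" behaviour.
pub fn first_success<F>(rungs: &[AuthRung], mut attempt: F) -> Option<(AuthRung, String)>
where
    F: FnMut(AuthRung) -> Result<String, String>,
{
    for &rung in rungs {
        if let Ok(token) = attempt(rung) {
            return Some((rung, token));
        }
    }
    None
}
```

With no keyfile configured, `ladder(false)` yields the pre-existing four rungs, which matches the "absent → unchanged" test case listed for this PR.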
+ +**Constructor refactor.** Knock out the `too_many_arguments` clippy +TODO at `openbao.rs:55` while we're adding more args: + +```rust +pub struct OpenbaoStoreOptions { + pub base_url: String, + pub kv_mount: String, + pub auth_mount: String, + pub skip_tls: bool, + pub token: Option, + pub username: Option, pub password: Option, + pub zitadel_sso_url: Option, pub zitadel_client_id: Option, + pub zitadel_jwt_bearer: Option, // NEW + pub jwt_role: Option, pub jwt_auth_mount: Option, +} + +pub struct ZitadelJwtBearerConfig { + pub key_path: Option, + pub key_json: Option, + pub oidc_issuer_url: String, + pub audience: String, // Zitadel project ID +} +``` + +**Refresh capability.** A single `Mutex` over the refresh-affected +state. Refresh is rare and uncontended; `ArcSwap` would be ceremony. + +```rust +pub struct OpenbaoSecretStore { + inner: Mutex, + kv_mount: String, + auth_mount: String, + jwt_bearer: Option>, + jwt_role: Option, + jwt_auth_mount: Option, + base_url: String, + skip_tls: bool, +} + +struct Inner { + client: VaultClient, + scope: HashSet, // deployment IDs from the Zitadel JWT +} + +impl OpenbaoSecretStore { + pub async fn refresh_auth(&self) -> Result<(), SecretStoreError> { + let bearer = self.jwt_bearer.as_ref().ok_or_else(|| /* err */)?; + bearer.invalidate_cache(); + let jwt = bearer.bearer_token().await?; + let scope = decode_deployments_claim(&jwt)?; // pure JWT decode + let session = jwt_login(&self.base_url, /* … */, &jwt).await?; + let client = build_vault_client(&self.base_url, self.skip_tls, &session.token)?; + *self.inner.lock().await = Inner { client, scope }; + Ok(()) + } + + pub async fn cached_scope(&self) -> HashSet { + self.inner.lock().await.scope.clone() + } +} +``` + +`decode_deployments_claim` is a small pure helper that base64-decodes +the JWT body and reads the `deployments` array. No Bao round-trip. +`get_raw` / `set_raw` lock `inner`, take a `&VaultClient`, perform the +call, release. 
The lock is uncontended on the hot path. + +**On construction with JWT-bearer.** Wire `jwt_bearer` and call +`refresh_auth()` once before returning. If it fails, fall through to +the next ladder rung. + +**Tests.** + +- Ladder ordering: JWT-bearer present → tried before OIDC; absent → + unchanged. +- `refresh_auth` against a fake HTTP server updates `cached_scope`. +- `decode_deployments_claim` on a hand-crafted JWT returns the + expected set. + +**Acceptance.** `cargo test -p harmony_secret` clean. `examples/openbao` +userpass path unaffected. Manual staging run authenticates via +JWT-bearer. + +--- + +### PR-3 — Extend `OpenbaoJwtAuth` Score with `bound_claims_json` + `groups_claim` + +**Crate:** `harmony` (`src/modules/openbao/setup.rs`) +**Depends on:** nothing + +Two new fields, both defaulting to empty so existing callers are +unaffected: + +```rust +pub struct OpenbaoJwtAuth { + // … existing fields … + #[serde(default)] pub bound_claims_json: String, + #[serde(default)] pub groups_claim: String, +} +``` + +Extend `configure_jwt` (line 435+) to pass each as a `bao write +auth/jwt/role/...` flag when non-empty. + +**Acceptance.** `cargo check -p harmony --all-features` clean. +`examples/openbao` still works. + +--- + +### PR-4 — Zitadel Action Score: surface `deployments` metadata as a claim + +**Crate:** `harmony` (new module under `src/modules/zitadel/` or +wherever existing Zitadel deploy code lives) +**Depends on:** nothing +**Blocks:** PR-5 + +Declarative Score that upserts the pre-access-token-creation Action +on the configured Zitadel project. Cannot be avoided: Zitadel emits +`urn:zitadel:iam:user:metadata` as base64-encoded values in a map, +and Bao's `groups_claim` can't consume that shape — the Action +decodes and re-emits as a flat string array. 
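The shape change the Action performs (a base64-encoded JSON value in, a flat list of strings out) can be mirrored in Rust, for example to build fixtures for operator-side tests. This is a std-only sketch under stated assumptions: the helper names are hypothetical, the JSON parser handles only arrays of plain strings without escapes, and the metadata value is assumed to use standard-alphabet base64.

```rust
/// Decodes standard-alphabet base64 (padding optional).
/// Returns `None` on any character outside the alphabet.
pub fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None,
        };
        buf = (buf << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the not-yet-emitted bits
        }
    }
    Some(out)
}

/// Parses a JSON array of plain strings such as `["a","b"]`.
/// Deliberately minimal: escapes and non-string elements are rejected.
pub fn parse_string_array(json: &str) -> Option<Vec<String>> {
    let inner = json.trim().strip_prefix('[')?.strip_suffix(']')?.trim();
    if inner.is_empty() {
        return Some(Vec::new());
    }
    let mut items = Vec::new();
    for part in inner.split(',') {
        let s = part.trim().strip_prefix('"')?.strip_suffix('"')?;
        if s.contains('"') || s.contains('\\') {
            return None;
        }
        items.push(s.to_string());
    }
    Some(items)
}

/// Malformed metadata yields no deployments, matching the Action's intent.
pub fn deployments_from_metadata(value_b64: &str) -> Vec<String> {
    base64_decode(value_b64)
        .and_then(|bytes| String::from_utf8(bytes).ok())
        .and_then(|json| parse_string_array(&json))
        .unwrap_or_default()
}
```

Production code would more likely rely on established base64 and JSON crates; the hand-rolled versions here only keep the sketch free of dependencies.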
+ +**Action script** (canonical text): + +```javascript +function addDeployments(ctx, api) { + const md = ctx.v1.user.getMetadata(); + const entry = md.metadata.find(m => m.key === "deployments"); + if (!entry) return; + try { + const deployments = JSON.parse( + Buffer.from(entry.value, "base64").toString("utf-8") + ); + if (Array.isArray(deployments)) { + api.v1.claims.setClaim("deployments", deployments); + } + } catch (_) { /* malformed → no deployments */ } +} +``` + +**Score shape.** + +```rust +pub struct ZitadelDeploymentsClaimActionScore { + pub project_id: String, + pub action_name: String, // default "fleet-deployments-claim" +} +``` + +Interpret upserts via Zitadel Management API, attaches to +`Complement Token / PreAccessTokenCreation`. Idempotent. + +**Acceptance.** Manual decode of a staging token shows the +`deployments` claim. + +--- + +### PR-5 — `FleetDeviceDeploymentMembershipScore` + operator wiring + +**Crate:** `harmony` (`src/modules/fleet/` or wherever fleet Scores +live) for the Score; `fleet/harmony-fleet-operator` for wiring. +**Depends on:** PR-3 (Bao JWT role + accessor must exist), PR-4 +(claim must surface) + +One Score, two writes, declared in execution order. Per ADR-023 the +operator hands off ordering to the Score rather than sequencing the +external API calls by hand. + +```rust +pub struct FleetDeviceDeploymentMembershipScore { + pub zitadel_project_id: String, + pub device_user_id: String, // Zitadel machine user ID + pub deployments: Vec, // declarative full set + pub openbao_instance: OpenbaoInstance, + pub kv_mount: String, // "harmony-fleet" + pub jwt_auth_accessor: String, // resolved from PR-3 +} +``` + +Interpret runs, in order: + +1. **Zitadel metadata** — declarative replace on + `user.metadata.deployments`. Read current value, write only on + diff. Removal is "declare the new set without the removed entry." +2. 
**OpenBao external group + policy, per deployment in the set.** For + each `` in `deployments`: + - Policy `fleet-deployment-` granting `read` on + `harmony-fleet/data//*` and + `read, list` on `harmony-fleet/metadata//*`. + - External group `` (type=external) with that policy + attached, alias matching the JWT-auth accessor. + +Idempotent throughout. + +**Operator wiring.** In the existing per-`Deployment` reconciler, +compose this Score before the existing NATS desired-state publish. +If the Score errors, surface as a CR status condition and do not +publish — never tell a device to run a deployment it cannot +authenticate for. + +**Note: NATS publish stays in operator code.** It's data-plane (the +"now do this" half of enrollment), already wired, and changing its +home isn't part of this work. + +**Tests.** + +- Unit on the Score: Zitadel diff logic (no-op on match, full write + on diff); policy text generation; group alias shape. +- Integration: against the real Bao in `examples/openbao` + a fake + Zitadel; second apply is a no-op. + +**Acceptance.** A `Deployment` CR rolled in staging causes both +writes to land before the NATS publish; manual JWT-bearer login at +Bao returns a token with the expected per-deployment policies +attached via external-group bindings. 
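The policy text generation named in the unit tests could look roughly like the sketch below. It assumes the elided path segment is the deployment name, that the policy is named `fleet-deployment-` followed by that name, and that names are restricted to the `[a-zA-Z0-9_-]` set from the pre-flight checklist. The function names are hypothetical; the emitted text uses the standard `path "…" { capabilities = […] }` policy syntax.

```rust
/// Character set from the pre-flight checklist: `[a-zA-Z0-9_-]`, non-empty.
pub fn is_safe_deployment_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}

/// Assumed naming scheme: `fleet-deployment-` plus the deployment name.
pub fn policy_name(deployment: &str) -> String {
    format!("fleet-deployment-{}", deployment)
}

/// Renders the per-deployment policy: `read` on the data path and
/// `read, list` on the metadata path. Returns `None` for unsafe names so
/// that a name can never escape its own path segment.
pub fn policy_hcl(kv_mount: &str, deployment: &str) -> Option<String> {
    if !is_safe_deployment_name(deployment) {
        return None;
    }
    Some(format!(
        "path \"{m}/data/{d}/*\" {{\n  capabilities = [\"read\"]\n}}\n\n\
         path \"{m}/metadata/{d}/*\" {{\n  capabilities = [\"read\", \"list\"]\n}}\n",
        m = kv_mount,
        d = deployment
    ))
}
```

Because the output depends only on the mount and the deployment name, a second apply with the same inputs renders byte-identical text, which makes the idempotence check a plain string comparison.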
+ +--- + +### PR-6 — Agent: inline refresh-check in `main.rs` + +**Crate:** `fleet/harmony-fleet-agent` +**Depends on:** PR-2 + +Around the existing JetStream KV watcher in +`fleet/harmony-fleet-agent/src/main.rs:110-120`, gate +`reconciler.apply` on a scope check: + +```rust +async_nats::jetstream::kv::Operation::Put => { + if let Some(dep) = deployment_from_key(&entry.key) { + if !secret_store.cached_scope().await.contains(dep.as_str()) { + tracing::info!(%dep, "deployment outside cached scope — refreshing"); + if let Err(e) = secret_store.refresh_auth().await { + tracing::warn!(key = %entry.key, error = %e, "refresh failed"); + continue; + } + } + } + if let Err(e) = reconciler.apply(&entry.key, &entry.value).await { + tracing::warn!(key = %entry.key, error = %e, "apply failed"); + } +} +``` + +`secret_store` is the same `Arc` held by the +agent's `ConfigManager` — share it through whatever construction +path `main.rs` already uses. + +**No new module.** One consumer, one site, ~10 lines. Per CLAUDE.md +"Rule of Three: introduce an abstraction at the second real +instance." If/when a second pre-reconcile concern (image-pull creds, +monitoring registration) arrives, extract a layer then. + +**No retry on "still missing after refresh."** The operator's +ordering guarantees Zitadel is consistent before the NATS publish. +If we observe the race in practice, add a single retry then. + +**Tests.** None new — this is wiring between two tested components. +Confirm via the integration milestone below. + +**Acceptance.** Rolling a new deployment to a staging device shows +one "outside cached scope — refreshing" log followed by a clean +reconcile. + +--- + +## Integration milestone — staging dry run + +After PRs 1-6 land: + +- [ ] Apply the new `OpenbaoJwtAuth` config to staging Bao via + `OpenbaoSetupScore`. +- [ ] Apply `ZitadelDeploymentsClaimActionScore` to staging Zitadel. +- [ ] Hand-provision metadata + group/policy for one test deployment + + one test device. 
Confirm the agent reads its secret. +- [ ] Roll a `Deployment` CR via the operator. Confirm + `FleetDeviceDeploymentMembershipScore` writes Zitadel and Bao + before NATS, and the agent reconciles cleanly. +- [ ] Negative: device not in deployment X gets a hard error reading + X's secret (not a silent fall-through). +- [ ] Negative: a JWT minted before the metadata update cannot read + the new deployment's secret until the agent's inline refresh + runs. + +## Decisions deferred + +1. **Vault token TTL.** ADR-025 recommends 15 min. Confirm in + staging; adjust `OpenbaoJwtAuth::ttl` if needed. +2. **Hard revocation on deployment removal.** Wait for TTL today. + Add a revoke companion only if a real fleet requires it. +3. **Bao down at agent startup.** `OpenbaoSecretStore::new` with + JWT-bearer must not panic; either fall through or surface a + retryable error. Confirm and document in the milestone. + +## Required reading + +- `docs/adr/025-fleet-device-secret-access.md` — design. +- `docs/adr/020-1-zitadel-openbao-secure-config-store.md` — + human-user counterpart of this auth flow. +- `docs/adr/023-deploy-architecture.md` — Score discipline for PRs 3-5. +- `harmony_secret/src/store/openbao.rs` — auth ladder being extended. +- `harmony_config/src/source/store.rs` — wrapper the agent uses. +- `fleet/harmony-fleet-auth/src/credentials.rs` — JWT-bearer mint + path being extracted in PR-1. +- `nats/callout/src/zitadel.rs` — JWT validation shape the Bao role + mirrors (`bound_issuer` + `bound_audiences`). +- `fleet/harmony-fleet-agent/src/main.rs` — site of the PR-6 inline + edit. 
diff --git a/docs/adr/025-fleet-device-secret-access.md b/docs/adr/025-fleet-device-secret-access.md new file mode 100644 index 00000000..6405dbd1 --- /dev/null +++ b/docs/adr/025-fleet-device-secret-access.md @@ -0,0 +1,319 @@ +# Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT + +Initial Author: Jean-Gabriel Gill-Couture + +Initial Date: 2026-06-01 + +Last Updated Date: 2026-06-01 + +## Status + +Proposed + +## Context + +Fleet agents on devices need to read per-deployment secrets (image-pull +credentials, application secrets, etc.) from OpenBao. The agent already +holds one durable secret: a Zitadel machine-user JWT keyfile dropped by +`FleetDeviceSetupScore`. That key is the basis for the agent's existing +NATS authentication (`nats/callout` validates the Zitadel-minted access +token; `fleet/harmony-fleet-auth/src/credentials.rs` mints it via the +RFC 7523 JWT-bearer flow). + +Three requirements shape the design: + +1. **No new device-side secret.** The Zitadel machine key is already the + single root of trust on a device; the secret-access path must derive + from the same key, not introduce a second one. + +2. **Per-deployment isolation, enforced cryptographically.** A device + enrolled in deployments A and B reads only `A`'s and `B`'s secrets. + A device that hosts no deployments reads nothing. The device cannot + widen its own scope — only the operator can change membership. + +3. **Cross-project safety.** A second Zitadel project (a different + tenant, a different fleet, a malicious org) must not be able to + produce a token that OpenBao accepts. The trust boundary is the + project, not the deployment. + +The kubelet analogy is the architectural north star: the agent is a +small runtime that learns its workload (and the credentials needed to +run it) from a control-plane authority. The agent never decides what it +is allowed to run or read; it presents a signed identity and the +infrastructure decides. 
+ +## Decision + +Three coordinating pieces. + +### 1. OpenBao JWT auth bound to the Zitadel project + +OpenBao's JWT auth method validates incoming tokens against Zitadel's +OIDC discovery URL (JWKS). One auth role per fleet, configured against +**one** Zitadel project: + +``` +bound_issuer = +bound_audiences = +bound_claims = { "urn:zitadel:iam:org:project:roles": "fleet-device" } +user_claim = sub +groups_claim = deployments +``` + +`bound_audiences` is the project boundary. A token minted in any other +Zitadel project has a different `aud` claim and is rejected before any +membership claim is read. This is the same defense +`nats/callout/src/zitadel.rs` already applies via `set_audience`. + +`groups_claim = deployments` instructs OpenBao to read the JWT's +`deployments` array and bind the resulting Vault token to one external +group per element. Each external group carries a per-deployment policy +granting `read` on `harmony-fleet/data//*`. + +### 2. Operator-managed Zitadel metadata as the membership source of truth + +The fleet operator is the only writer of `user.metadata.deployments` +on each device's Zitadel machine user. A Zitadel +pre-access-token-creation **Action** copies that metadata into a +top-level `deployments` claim on the access token. The device never +touches its own metadata. + +When a new `Deployment` CR is observed in Kubernetes, the operator +executes three writes in a strict order: + +1. **Zitadel metadata** — append the deployment ID to the device's + `deployments` array (per device targeted by the deployment). +2. **OpenBao external group + policy** — upsert + `identity/group/` (`type=external`, alias matching the + JWT-auth accessor) and policy + `fleet-deployment-` granting + `read` on `harmony-fleet/data//*`. +3. **NATS desired-state** — publish + `desired-state..` with the workload score. 
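The strict ordering of the three writes can be sketched as a runner that stops at the first failure, so the NATS publish is never reached unless both trust writes succeeded. The `Step` enum and `roll_out` function are hypothetical names used only for this illustration.

```rust
/// The three writes, in the order the operator performs them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    ZitadelMetadata,
    OpenBaoGroupPolicy,
    NatsPublish,
}

pub const ROLLOUT_ORDER: [Step; 3] = [
    Step::ZitadelMetadata,
    Step::OpenBaoGroupPolicy,
    Step::NatsPublish,
];

/// Runs each step in order and stops at the first error.
/// Returns the steps that completed and the error, if any.
pub fn roll_out<F>(mut run: F) -> (Vec<Step>, Option<String>)
where
    F: FnMut(Step) -> Result<(), String>,
{
    let mut done = Vec::new();
    for step in ROLLOUT_ORDER.iter().copied() {
        if let Err(e) = run(step) {
            return (done, Some(e));
        }
        done.push(step);
    }
    (done, None)
}
```

If the OpenBao write fails in this sketch, `roll_out` returns with only `ZitadelMetadata` completed and the workload signal is never sent.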
+ +Reversed, the agent could see the desired-state, attempt a re-auth, +and find the deployment missing from its claims — a "permission denied +for a deployment I was told to run" race that is confusing to debug +and weakens the trust story. Trust state always precedes the workload +signal. + +Removal runs in reverse: NATS delete → (optional) group/policy delete → +metadata removal. Currently-cached Vault tokens retain access until +their short TTL expires; explicit revocation is available via +`bao token revoke` on the device's accessor if hard revocation is +needed. + +### 3. Client side: JWT-bearer in `harmony_secret`, refresh before reconcile in the agent + +The agent does **not** grow a new secrets client. Per ADR-020, +`harmony_config` is the unified config+secret entry point and already +wraps OpenBao via `harmony_secret::OpenbaoSecretStore`. The missing +piece is auth: `OpenbaoSecretStore` supports env token, cached token, +Zitadel OIDC device flow (humans), and userpass — but not Zitadel +**JWT-bearer** for headless machine identity. + +Three additions: + +- A fifth rung on `OpenbaoSecretStore`'s auth ladder takes a Zitadel + machine keyfile + Bao JWT role + audience, mints via RFC 7523, and + POSTs to `/v1/auth/jwt/login`. +- The pure minting moves to `harmony_zitadel_auth` so NATS and + OpenBao auth share one implementation (Rule of Three: NATS callout + + OpenBao auth = two real consumers). +- `OpenbaoSecretStore` gains `refresh_auth()` (re-mint + re-login, + guarded by an internal `Mutex`) and `cached_scope() -> + HashSet` derived from decoding the in-hand Zitadel JWT — + no Bao round-trip needed since the `deployments` claim is already + in the token we just minted. + +In the agent, the NATS KV watcher consults `cached_scope()` before +each `reconciler.apply()`. If the desired deployment isn't covered, +it calls `refresh_auth()` and proceeds. The check is inline in +`main.rs` — about ten lines around the existing watcher loop. 
No +new module: one consumer, one site, inlining is the right size. + +### Secret path layout + +``` +harmony-fleet/data// +``` + +The Zitadel project ID does **not** appear in the path. Its job is +done at the JWT validation boundary (`bound_audiences`), not repeated +in every key. + +## Rationale + +**Why Zitadel project ID lives in `bound_audiences`, not the path.** +The same trust assertion in two places is duplication, not defense in +depth — both reduce to "the JWT signature is valid for this audience." +Concentrating it at the auth role: + +- gives one source of truth ("which project owns this Bao instance"); +- keeps secret paths readable and operator-friendly; +- decouples secret organization from Zitadel project identity (a + project ID rotation reconfigures one Bao role, not every path). + +**Why user metadata over project roles for deployment membership.** +Project roles in Zitadel live in a flat namespace inside a project. +A handful of roles (`fleet-admin`, `fleet-device`) maps cleanly; one +role per deployment would not — role inventories at hundreds of +deployments per fleet become hard to audit and slow to mutate. +User metadata is a per-machine-user JSON store, naturally +multi-valued, and admin-only-writable. The Zitadel Action that copies +metadata to a claim is a one-time, fleet-wide piece of configuration. + +**Why `groups_claim` over claim-templated paths.** Vault policy +templating (`{{identity.entity.aliases…metadata.}}`) supports +single-value substitution but not iteration over an array. Multiple +deployments per device require either multiple JWT logins (one per +deployment) or one login that resolves to multiple policies. +`groups_claim` + external groups gives the latter cleanly: one login, +N policies attached automatically. 
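The "one login, N policies" outcome can be modelled in a few lines. This is a simplified model of the intended result, not of how OpenBao evaluates policies internally; the function names, the `fleet-deployment-` naming, and the assumption that the elided path segment is the deployment name are illustrative.

```rust
use std::collections::BTreeSet;

/// Policies a single login resolves to: one per element of the
/// `deployments` claim, via one external group each.
pub fn policies_for_claim(deployments: &[&str]) -> BTreeSet<String> {
    deployments
        .iter()
        .map(|d| format!("fleet-deployment-{}", d))
        .collect()
}

/// Simplified access check: a secret path is readable only when it sits
/// under the data prefix of a deployment named in the claim.
pub fn can_read(deployments: &[&str], kv_mount: &str, path: &str) -> bool {
    deployments
        .iter()
        .any(|d| path.starts_with(&format!("{}/data/{}/", kv_mount, d)))
}
```

A device whose claim is `["a", "b"]` resolves to two policies from one login, while a device with an empty claim resolves to none and can read nothing.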
+ +**Why `harmony_config` / `harmony_secret`, not a fleet-local secrets +client.** ADR-020 is explicit that `harmony_config` is the unified +config+secret entry point and `OpenbaoSecretStore` is the canonical +OpenBao client. Adding a parallel fleet-only client would duplicate +the auth ladder, cache-file layout, and `kv2` plumbing already in +`harmony_secret`. The fleet's need is an *additional auth branch*, +not a different store. + +**Why scope is decoded from the Zitadel JWT, not asked of Bao.** The +agent already holds the JWT it's about to present at login; the +`deployments` claim is right there. A `/v1/auth/token/lookup-self` +round-trip after login would compute the same set from the other +direction, paying a network call to recover information already in +hand. + +## Consequences + +**Pros** + +- One auth root on a device (the existing Zitadel machine key) covers + both NATS and OpenBao access. Rotation, revocation, and inventory + remain centralized. +- The operator owns membership; the agent owns identity. A compromised + device cannot widen its own access. A compromised operator's blast + radius is its own fleet (one Zitadel project, one Bao instance). +- Per-deployment policies are mechanical to generate. Bao policy text + is identical modulo the deployment ID, produced by a small templated + Score. New deployments add one external group + one policy; no + hand-written ACLs. +- The inline refresh check is small enough to extract into a shared + pre-reconcile layer later, if a second "before-reconcile" concern + appears, without further architectural changes. + +**Cons** + +- **Two-token invalidation on membership change.** Both the cached + Zitadel access token and the cached Bao Vault token must be dropped + for new membership to take effect. This is encapsulated in the + `refresh_auth()` call but is a real round-trip cost (one HTTPS to + Zitadel + one to Bao) on every membership change. Mitigated by the + fact that membership changes are rare relative to secret reads. 
+- **Removal latency = Vault token TTL.** Removing a device from a + deployment does not immediately revoke its currently-cached Vault + token; access ends at next renewal or TTL expiry. Short TTLs (15 min) + bound the worst case; explicit `bao token revoke -accessor` is + available if needed. +- **Operator gains Zitadel-admin scope.** The operator must hold + credentials that can write user metadata in the Zitadel project. + This is a high-privilege scope and concentrates trust in the + operator. The mitigation is a per-fleet Zitadel project: a + compromised operator can only mutate its own fleet's identities. +- **Zitadel Action required.** Surfacing user metadata as a JWT claim + needs a small Zitadel Action (server-side JavaScript). It is part of + the fleet's Zitadel setup and must be in version control / applied + by the fleet's bootstrap, not configured by hand. (See "Additional + Notes" for the script.) + +## Alternatives considered + +**Project roles for deployment membership.** Rejected: flat namespace +inside a project, no native multi-value semantics, role inventory +explodes at hundreds of deployments per fleet, mutations require +project-admin scope on a coarse-grained API. Kept for the coarse +`fleet-device` / `fleet-admin` distinction the NATS callout already +uses. + +**Project ID embedded in the secret path +(`secrets///...`).** Rejected: the project +isolation is already enforced by `bound_audiences` at the JWT layer. +Encoding it in the path is duplication of the same assertion, couples +the secret tree to a Zitadel ID, and complicates project rotations. +Adds no security: a token that passes `bound_audiences` validation can +read the path regardless; one that fails cannot read anything. + +**Claim-templated single policy +(`{{identity.…metadata.deployment_id}}`).** Rejected for the +multi-deployment case: Vault policy templating does not iterate over +arrays, so a single-policy template can only express "one deployment +per device." 
Acceptable for a single-deployment-per-device world; the +chosen kubelet-like architecture admits N deployments per device, and +collapsing the chosen `groups_claim` design to this would force +multiple JWT logins per refresh. + +**Static per-device Bao token issued at provisioning.** Rejected: +introduces a second long-lived secret on the device, breaks rotation +(re-provisioning required), and provides no native per-deployment +scoping. + +**OpenBao OIDC code flow.** Rejected: that flow is for human users +with a browser. Devices are headless and already hold a JWT-bearer +identity; using OIDC would re-invent the wheel and require a local +browser-equivalent. + +**Dedicated lifecycle module in front of the reconciler.** Rejected +for now: there is one consumer and one call site, so the refresh check +stays inline in the agent's NATS watcher (about ten lines in +`main.rs`). Extract a layer if a second pre-reconcile concern arrives. + +## Additional Notes + +### Zitadel Action (token customization) + +A single pre-access-token-creation Action per fleet's Zitadel project +copies user metadata `deployments` into a top-level claim: + +```javascript +// Trigger: pre-access-token-creation +function addDeployments(ctx, api) { + const md = ctx.v1.user.getMetadata(); + const entry = md.metadata.find(m => m.key === "deployments"); + if (!entry) return; + try { + const deployments = JSON.parse( + Buffer.from(entry.value, "base64").toString("utf-8") + ); + if (Array.isArray(deployments)) { + api.v1.claims.setClaim("deployments", deployments); + } + } catch (_) { /* malformed metadata is treated as no deployments */ } +} +``` + +The Action lives in Zitadel's "Flows" configuration, attached to the +`Complement Token` flow on the relevant project. A Harmony Score +(`ZitadelTokenCustomizationScore` or similar) is the right home for +applying this declaratively; see plan document for status. 
+ +### Relationship to ADR-016 and ADR-020-1 + +ADR-016 (agent mesh on NATS JetStream) establishes the agent's +existing Zitadel-keyed identity for NATS. This ADR reuses that +identity unchanged. + +ADR-020-1 establishes the human-developer authentication path to +OpenBao via Zitadel's Device Authorization Grant. This ADR is the +machine-user counterpart: same OpenBao, same Zitadel, different +auth-method binding (humans use device code; devices use +JWT-bearer-derived access tokens against `/auth/jwt/login`). + +### Threat model summary + +| Attacker | Capability | Defense | +|---|---|---| +| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. | +| Compromised device (key theft) | Full agent scope on its own deployments only | `groups_claim` restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. | +| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | `bound_audiences` rejects at the JWT auth boundary before any claim is read. | +| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. | +| Compromised Bao | Full access to all stored secrets | Out of scope — Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |