harmony/docs/adr/025-fleet-device-secret-access.md
Jean-Gabriel Gill-Couture 3d01d7482f
docs: Simplify architecture for openbao sso via harmony config
2026-06-01 15:15:42 -04:00

Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-06-01

Last Updated Date: 2026-06-01

Status

Proposed

Context

Fleet agents on devices need to read per-deployment secrets (image-pull credentials, application secrets, etc.) from OpenBao. The agent already holds one durable secret: a Zitadel machine-user JWT keyfile dropped by FleetDeviceSetupScore. That key is the basis for the agent's existing NATS authentication (nats/callout validates the Zitadel-minted access token; fleet/harmony-fleet-auth/src/credentials.rs mints it via the RFC 7523 JWT-bearer flow).

Three requirements shape the design:

  1. No new device-side secret. The Zitadel machine key is already the single root of trust on a device; the secret-access path must derive from the same key, not introduce a second one.

  2. Per-deployment isolation, enforced cryptographically. A device enrolled in deployments A and B reads only A's and B's secrets. A device that hosts no deployments reads nothing. The device cannot widen its own scope — only the operator can change membership.

  3. Cross-project safety. A second Zitadel project (a different tenant, a different fleet, a malicious org) must not be able to produce a token that OpenBao accepts. The trust boundary is the project, not the deployment.

The kubelet analogy is the architectural north star: the agent is a small runtime that learns its workload (and the credentials needed to run it) from a control-plane authority. The agent never decides what it is allowed to run or read; it presents a signed identity and the infrastructure decides.

Decision

Three coordinating pieces.

1. OpenBao JWT auth bound to the Zitadel project

OpenBao's JWT auth method validates incoming tokens against the JWKS advertised by Zitadel's OIDC discovery URL. There is one auth role per fleet, configured against one Zitadel project:

bound_issuer    = <Zitadel issuer URL>
bound_audiences = <Zitadel project ID>
bound_claims    = { "urn:zitadel:iam:org:project:roles": "fleet-device" }
user_claim      = sub
groups_claim    = deployments

bound_audiences is the project boundary. A token minted in any other Zitadel project has a different aud claim and is rejected before any membership claim is read. This is the same defense nats/callout/src/zitadel.rs already applies via set_audience.

groups_claim = deployments instructs OpenBao to read the JWT's deployments array and bind the resulting Vault token to one external group per element. Each external group carries a per-deployment policy granting read on harmony-fleet/data/<deployment-id>/*.
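The evaluation order described above (project boundary first, membership last) can be sketched as follows. This is only an illustrative model, not OpenBao's implementation: the `Claims`, `JwtRole`, and `Rejection` types and the `policies_for_login` function are hypothetical names, the token is assumed to have already passed JWKS signature validation, and the role check is reduced to a boolean for brevity.

```rust
use std::collections::BTreeSet;

/// Claims the sketch cares about, assumed to come from a signature-verified JWT.
pub struct Claims {
    pub iss: String,
    pub aud: Vec<String>,
    /// Simplification: whether the roles claim contains "fleet-device".
    pub has_fleet_device_role: bool,
    pub deployments: Vec<String>,
}

/// Hypothetical mirror of the role fields listed above.
pub struct JwtRole {
    pub bound_issuer: String,
    pub bound_audience: String,
}

#[derive(Debug, PartialEq)]
pub enum Rejection {
    WrongIssuer,
    WrongAudience,
    MissingRole,
}

/// Returns the per-deployment policy names attached to the login, or the
/// first binding that failed. The deployments claim is read only after
/// every binding has passed.
pub fn policies_for_login(
    role: &JwtRole,
    claims: &Claims,
) -> Result<BTreeSet<String>, Rejection> {
    if claims.iss != role.bound_issuer {
        return Err(Rejection::WrongIssuer);
    }
    if !claims.aud.iter().any(|a| a == &role.bound_audience) {
        return Err(Rejection::WrongAudience);
    }
    if !claims.has_fleet_device_role {
        return Err(Rejection::MissingRole);
    }
    // One external group, and therefore one policy, per deployment ID.
    Ok(claims
        .deployments
        .iter()
        .map(|d| format!("fleet-deployment-{d}"))
        .collect())
}
```

A token whose `deployments` array is empty still logs in under this model but receives no policies, which matches the requirement that a device hosting no deployments reads nothing.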

2. Operator-managed Zitadel metadata as the membership source of truth

The fleet operator is the only writer of user.metadata.deployments on each device's Zitadel machine user. A Zitadel pre-access-token-creation Action copies that metadata into a top-level deployments claim on the access token. The device never touches its own metadata.

When a new Deployment CR is observed in Kubernetes, the operator executes three writes in a strict order:

  1. Zitadel metadata — append the deployment ID to the device's deployments array (per device targeted by the deployment).
  2. OpenBao external group + policy — upsert identity/group/<deployment-id> (type=external, alias matching the JWT-auth accessor) and policy fleet-deployment-<deployment-id> granting read on harmony-fleet/data/<deployment-id>/*.
  3. NATS desired-state — publish desired-state.<device-id>.<deployment-id> with the workload score.

If the order were reversed, the agent could see the desired-state, attempt a re-auth, and find the deployment missing from its claims: a "permission denied for a deployment I was told to run" race that is confusing to debug and weakens the trust story. Trust state always precedes the workload signal.

Removal runs in reverse: NATS delete → (optional) group/policy delete → metadata removal. Currently-cached Vault tokens retain access until their short TTL expires; if hard revocation is needed, the device's token can be revoked explicitly with bao token revoke -accessor.
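The strict ordering of the three enrollment writes can be sketched as below. This is a minimal model under stated assumptions: the `TrustBackends` trait, the `enroll` function, and the in-memory `Recorder` are hypothetical names invented for illustration, and real backends would talk to Zitadel, OpenBao, and NATS rather than push strings into a vector.

```rust
/// The three side effects the operator performs, abstracted for the sketch.
pub trait TrustBackends {
    fn append_zitadel_metadata(&mut self, device: &str, deployment: &str) -> Result<(), String>;
    fn upsert_bao_group_and_policy(&mut self, deployment: &str) -> Result<(), String>;
    fn publish_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String>;
}

/// Trust state first, workload signal last. `?` aborts on the first failure,
/// so the desired-state is never published ahead of the trust writes.
pub fn enroll<B: TrustBackends>(b: &mut B, device: &str, deployment: &str) -> Result<(), String> {
    b.append_zitadel_metadata(device, deployment)?;
    b.upsert_bao_group_and_policy(deployment)?;
    b.publish_desired_state(device, deployment)
}

/// In-memory backend that records the order of calls (illustration only).
#[derive(Default)]
pub struct Recorder {
    pub calls: Vec<String>,
    pub fail_bao: bool,
}

impl TrustBackends for Recorder {
    fn append_zitadel_metadata(&mut self, device: &str, deployment: &str) -> Result<(), String> {
        self.calls.push(format!("zitadel:{device}:{deployment}"));
        Ok(())
    }

    fn upsert_bao_group_and_policy(&mut self, deployment: &str) -> Result<(), String> {
        if self.fail_bao {
            return Err("bao unavailable".to_string());
        }
        self.calls.push(format!("bao:{deployment}"));
        Ok(())
    }

    fn publish_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String> {
        self.calls.push(format!("nats:desired-state.{device}.{deployment}"));
        Ok(())
    }
}
```

With `fail_bao` set, `enroll` returns an error after the metadata write and the NATS publish never happens, so the agent is never told to run a deployment it cannot yet read secrets for.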

3. Client side: JWT-bearer in harmony_secret, refresh before reconcile in the agent

The agent does not grow a new secrets client. Per ADR-020, harmony_config is the unified config+secret entry point and already wraps OpenBao via harmony_secret::OpenbaoSecretStore. The missing piece is auth: OpenbaoSecretStore supports env token, cached token, Zitadel OIDC device flow (humans), and userpass — but not Zitadel JWT-bearer for headless machine identity.

Three additions:

  • A fifth rung on OpenbaoSecretStore's auth ladder takes a Zitadel machine keyfile + Bao JWT role + audience, mints via RFC 7523, and POSTs to /v1/auth/jwt/login.
  • The pure minting moves to harmony_zitadel_auth so NATS and OpenBao auth share one implementation (Rule of Three: NATS callout + OpenBao auth = two real consumers).
  • OpenbaoSecretStore gains refresh_auth() (re-mint + re-login, guarded by an internal Mutex) and cached_scope() -> HashSet<String> derived from decoding the in-hand Zitadel JWT — no Bao round-trip needed since the deployments claim is already in the token we just minted.
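The two requests behind the new auth rung can be sketched as plain request bodies. This is an assumption-laden illustration, not the harmony_secret code: `form_encode`, `token_request_body`, and `bao_login_body` are hypothetical helpers, the signed assertion is taken as an input rather than produced here, and the JSON body is built with string formatting on the assumption that the role name and token contain no characters that need escaping.

```rust
/// Percent-encodes everything outside the RFC 3986 unreserved set.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Grant type defined by RFC 7523 for the JWT-bearer flow.
pub const JWT_BEARER_GRANT: &str = "urn:ietf:params:oauth:grant-type:jwt-bearer";

/// Form body for the token request sent to the issuer's token endpoint.
/// `assertion` is the JWT signed with the machine key (signing not shown).
pub fn token_request_body(assertion: &str, scopes: &[&str]) -> String {
    format!(
        "grant_type={}&assertion={}&scope={}",
        form_encode(JWT_BEARER_GRANT),
        form_encode(assertion),
        form_encode(&scopes.join(" "))
    )
}

/// JSON body for POST /v1/auth/jwt/login: the Bao JWT role plus the
/// access token obtained from the token request above.
pub fn bao_login_body(role: &str, access_token: &str) -> String {
    format!(r#"{{"role":"{role}","jwt":"{access_token}"}}"#)
}
```

The scopes that make Zitadel include the project audience and roles in the access token are deliberately left to the caller here, since their exact values are Zitadel configuration rather than part of RFC 7523.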

In the agent, the NATS KV watcher consults cached_scope() before each reconciler.apply(). If the desired deployment isn't covered, it calls refresh_auth() and proceeds. The check is inline in main.rs — about ten lines around the existing watcher loop. No new module: one consumer, one site, inlining is the right size.
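A minimal sketch of that inline check follows. `AuthState` is a hypothetical stand-in for the auth state the ADR places inside OpenbaoSecretStore, with the Zitadel and Bao round-trips replaced by an in-memory `upstream` list. `ensure_in_scope` is also a hypothetical name, and returning `false` when the deployment is still missing after a refresh is an assumption of this sketch, since the text above does not specify that case.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

struct Inner {
    scope: HashSet<String>,
    refreshes: u32,
}

/// Stand-in for the store's auth state (illustration only).
pub struct AuthState {
    inner: Mutex<Inner>,
    /// What the next minted token would carry in `deployments` (test fixture).
    upstream: Vec<String>,
}

impl AuthState {
    pub fn new(upstream: &[&str]) -> Self {
        AuthState {
            inner: Mutex::new(Inner { scope: HashSet::new(), refreshes: 0 }),
            upstream: upstream.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Deployments covered by the token currently in hand.
    pub fn cached_scope(&self) -> HashSet<String> {
        self.inner.lock().unwrap().scope.clone()
    }

    /// Re-mint and re-login, serialized by the mutex.
    pub fn refresh_auth(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.scope = self.upstream.iter().cloned().collect();
        inner.refreshes += 1;
    }

    pub fn refresh_count(&self) -> u32 {
        self.inner.lock().unwrap().refreshes
    }
}

/// Called before each reconcile: refreshes at most once, and only when the
/// desired deployment is not already covered by the cached scope.
pub fn ensure_in_scope(auth: &AuthState, deployment: &str) -> bool {
    if auth.cached_scope().contains(deployment) {
        return true;
    }
    auth.refresh_auth();
    auth.cached_scope().contains(deployment)
}
```

Deployments already in scope cost no network calls under this shape; only a desired-state for an unknown deployment triggers the re-mint and re-login.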

Secret path layout

harmony-fleet/data/<deployment-id>/<secret-name>

The Zitadel project ID does not appear in the path. Its job is done at the JWT validation boundary (bound_audiences), not repeated in every key.
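Because the layout depends only on the deployment ID, both the secret path and the matching read policy can be generated mechanically. The helpers below are a hypothetical sketch (the function names are not from the codebase), using the mount name, path shape, and policy name given in this document and assuming IDs contain no path separators or glob characters.

```rust
/// KV mount name used throughout this document.
pub const MOUNT: &str = "harmony-fleet";

/// Path of one secret: harmony-fleet/data/<deployment-id>/<secret-name>.
pub fn secret_path(deployment_id: &str, secret_name: &str) -> String {
    format!("{MOUNT}/data/{deployment_id}/{secret_name}")
}

/// Name of the per-deployment policy attached through the external group.
pub fn policy_name(deployment_id: &str) -> String {
    format!("fleet-deployment-{deployment_id}")
}

/// Policy text granting read on every secret of one deployment.
pub fn policy_hcl(deployment_id: &str) -> String {
    format!("path \"{MOUNT}/data/{deployment_id}/*\" {{\n  capabilities = [\"read\"]\n}}\n")
}
```

For example, `secret_path("dep-a", "registry")` yields `harmony-fleet/data/dep-a/registry`, and `policy_hcl("dep-a")` yields a policy that covers exactly that subtree.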

Rationale

Why Zitadel project ID lives in bound_audiences, not the path. The same trust assertion in two places is duplication, not defense in depth — both reduce to "the JWT signature is valid for this audience." Concentrating it at the auth role:

  • gives one source of truth ("which project owns this Bao instance");
  • keeps secret paths readable and operator-friendly;
  • decouples secret organization from Zitadel project identity (a project ID rotation reconfigures one Bao role, not every path).

Why user metadata over project roles for deployment membership. Project roles in Zitadel live in a flat namespace inside a project. A handful of roles (fleet-admin, fleet-device) maps cleanly; one role per deployment would not — role inventories at hundreds of deployments per fleet become hard to audit and slow to mutate. User metadata is a per-machine-user JSON store, naturally multi-valued, and admin-only-writable. The Zitadel Action that copies metadata to a claim is a one-time, fleet-wide piece of configuration.

Why groups_claim over claim-templated paths. Vault policy templating ({{identity.entity.aliases…metadata.<key>}}) supports single-value substitution but not iteration over an array. Multiple deployments per device require either multiple JWT logins (one per deployment) or one login that resolves to multiple policies. groups_claim + external groups gives the latter cleanly: one login, N policies attached automatically.

Why harmony_config / harmony_secret, not a fleet-local secrets client. ADR-020 is explicit that harmony_config is the unified config+secret entry point and OpenbaoSecretStore is the canonical OpenBao client. Adding a parallel fleet-only client would duplicate the auth ladder, cache-file layout, and kv2 plumbing already in harmony_secret. The fleet's need is an additional auth branch, not a different store.

Why scope is decoded from the Zitadel JWT, not asked of Bao. The agent already holds the JWT it's about to present at login; the deployments claim is right there. A /v1/auth/token/lookup-self round-trip after login would compute the same set from the other direction, paying a network call to recover information already in hand.
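Decoding the scope locally amounts to base64url-decoding the token's payload segment and reading the deployments array. The sketch below uses only the standard library to stay self-contained. `scope_from_jwt` and `b64url_decode` are hypothetical names, and the array extraction is a deliberately naive string scan that assumes deployment IDs are plain strings without commas, brackets, or escaped quotes. A real implementation would use a proper JSON parser.

```rust
use std::collections::HashSet;

/// Decodes unpadded (or padded) base64url; returns None on invalid characters.
fn b64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits: u8 = 0;
    for c in input.bytes() {
        let v: u8 = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break,
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1u32 << bits) - 1;
        }
    }
    Some(out)
}

/// Reads the `deployments` claim from a JWT without contacting Bao.
/// Returns None for a malformed token and an empty set when the claim is absent.
/// NOTE: naive scan, not a JSON parser (see the assumptions above).
pub fn scope_from_jwt(token: &str) -> Option<HashSet<String>> {
    let payload_b64 = token.split('.').nth(1)?;
    let payload = String::from_utf8(b64url_decode(payload_b64)?).ok()?;
    let key = "\"deployments\"";
    let key_at = match payload.find(key) {
        Some(i) => i,
        None => return Some(HashSet::new()),
    };
    let rest = &payload[key_at + key.len()..];
    let open = rest.find('[')?;
    let close = open + rest[open..].find(']')?;
    Some(
        rest[open + 1..close]
            .split(',')
            .map(|s| s.trim().trim_matches('"').to_string())
            .filter(|s| !s.is_empty())
            .collect(),
    )
}
```

The sketch does not verify the signature. That is consistent with the argument above only because the agent is reading a token it has just obtained from Zitadel for its own use; OpenBao still performs full validation at login.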

Consequences

Pros

  • One auth root on a device (the existing Zitadel machine key) covers both NATS and OpenBao access. Rotation, revocation, and inventory remain centralized.
  • The operator owns membership; the agent owns identity. A compromised device cannot widen its own access. A compromised operator's blast radius is its own fleet (one Zitadel project, one Bao instance).
  • Per-deployment policies are mechanical to generate. Bao policy text is identical modulo the deployment ID, produced by a small templated Score. New deployments add one external group + one policy; no hand-written ACLs.
  • refresh_auth() and cached_scope() on OpenbaoSecretStore are a reusable home for future "before-reconcile" work without further architectural changes.

Cons

  • Two-token invalidation on membership change. Both the cached Zitadel access token and the cached Bao Vault token must be dropped for new membership to take effect. This is encapsulated in the refresh_auth() call but is a real round-trip cost (one HTTPS to Zitadel + one to Bao) on every membership change. Mitigated by the fact that membership changes are rare relative to secret reads.
  • Removal latency = Vault token TTL. Removing a device from a deployment does not immediately revoke its currently-cached Vault token; access ends at next renewal or TTL expiry. Short TTLs (15 min) bound the worst case; explicit bao token revoke -accessor is available if needed.
  • Operator gains Zitadel-admin scope. The operator must hold credentials that can write user metadata in the Zitadel project. This is a high-privilege scope and concentrates trust in the operator. The mitigation is a per-fleet Zitadel project: a compromised operator can only mutate its own fleet's identities.
  • Zitadel Action required. Surfacing user metadata as a JWT claim needs a small Zitadel Action (server-side JavaScript). It is part of the fleet's Zitadel setup and must be in version control / applied by the fleet's bootstrap, not configured by hand. (See "Additional Notes" for the script.)

Alternatives considered

Project roles for deployment membership. Rejected: flat namespace inside a project, no native multi-value semantics, role inventory explodes at hundreds of deployments per fleet, mutations require project-admin scope on a coarse-grained API. Kept for the coarse fleet-device / fleet-admin distinction the NATS callout already uses.

Project ID embedded in the secret path (secrets/<project-id>/<deployment>/...). Rejected: the project isolation is already enforced by bound_audiences at the JWT layer. Encoding it in the path is duplication of the same assertion, couples the secret tree to a Zitadel ID, and complicates project rotations. Adds no security: a token that passes bound_audiences validation can read the path regardless; one that fails cannot read anything.

Claim-templated single policy ({{identity.…metadata.deployment_id}}). Rejected for the multi-deployment case: Vault policy templating does not iterate over arrays, so a single-policy template can only express "one deployment per device." Acceptable for a single-deployment-per-device world; the chosen kubelet-like architecture admits N deployments per device, and collapsing the chosen groups_claim design to this would force multiple JWT logins per refresh.

Static per-device Bao token issued at provisioning. Rejected: introduces a second long-lived secret on the device, breaks rotation (re-provisioning required), and provides no native per-deployment scoping.

OpenBao OIDC code flow. Rejected: that flow is for human users with a browser. Devices are headless and already hold a JWT-bearer identity; using OIDC would re-invent the wheel and require a local browser-equivalent.

Lifecycle layer inside the NATS handler. Rejected: conflates transport with domain logic and makes the refresh-then-reconcile ordering implicit. The dedicated module makes the contract testable and lets future triggers reuse the same code path.

Additional Notes

Zitadel Action (token customization)

A single pre-access-token-creation Action per fleet's Zitadel project copies the deployments user metadata into a top-level claim:

// Trigger: pre-access-token-creation
function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed metadata is treated as no deployments */ }
}

The Action lives in Zitadel's "Flows" configuration, attached to the Complement Token flow on the relevant project. A Harmony Score (ZitadelTokenCustomizationScore or similar) is the right home for applying this declaratively; see plan document for status.

Relationship to ADR-016 and ADR-020-1

ADR-016 (agent mesh on NATS JetStream) establishes the agent's existing Zitadel-keyed identity for NATS. This ADR reuses that identity unchanged.

ADR-020-1 establishes the human-developer authentication path to OpenBao via Zitadel's Device Authorization Grant. This ADR is the machine-user counterpart: same OpenBao, same Zitadel, different auth-method binding (humans use device code; devices use JWT-bearer-derived access tokens against /auth/jwt/login).

Threat model summary

| Attacker | Capability | Defense |
|---|---|---|
| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. |
| Compromised device (key theft) | Full agent scope on its own deployments only | groups_claim restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. |
| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | bound_audiences rejects at the JWT auth boundary before any claim is read. |
| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. |
| Compromised Bao | Full access to all stored secrets | Out of scope: Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |