Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT
Initial Author: Jean-Gabriel Gill-Couture
Initial Date: 2026-06-01
Last Updated Date: 2026-06-01
Status
Proposed
Context
Fleet agents on devices need to read per-deployment secrets (image-pull
credentials, application secrets, etc.) from OpenBao. The agent already
holds one durable secret: a Zitadel machine-user JWT keyfile dropped by
FleetDeviceSetupScore. That key is the basis for the agent's existing
NATS authentication (nats/callout validates the Zitadel-minted access
token; fleet/harmony-fleet-auth/src/credentials.rs mints it via the
RFC 7523 JWT-bearer flow).
Three requirements shape the design:
- No new device-side secret. The Zitadel machine key is already the single root of trust on a device; the secret-access path must derive from the same key, not introduce a second one.
- Per-deployment isolation, enforced cryptographically. A device enrolled in deployments A and B reads only A's and B's secrets. A device that hosts no deployments reads nothing. The device cannot widen its own scope; only the operator can change membership.
- Cross-project safety. A second Zitadel project (a different tenant, a different fleet, a malicious org) must not be able to produce a token that OpenBao accepts. The trust boundary is the project, not the deployment.
The kubelet analogy is the architectural north star: the agent is a small runtime that learns its workload (and the credentials needed to run it) from a control-plane authority. The agent never decides what it is allowed to run or read; it presents a signed identity and the infrastructure decides.
Decision
Three coordinating pieces.
1. OpenBao JWT auth bound to the Zitadel project
OpenBao's JWT auth method validates incoming tokens against Zitadel's OIDC discovery URL (JWKS). One auth role per fleet, configured against one Zitadel project:
bound_issuer = <Zitadel issuer URL>
bound_audiences = <Zitadel project ID>
bound_claims = { "urn:zitadel:iam:org:project:roles": "fleet-device" }
user_claim = sub
groups_claim = deployments
bound_audiences is the project boundary. A token minted in any other
Zitadel project has a different aud claim and is rejected before any
membership claim is read. This is the same defense
nats/callout/src/zitadel.rs already applies via set_audience.
groups_claim = deployments instructs OpenBao to read the JWT's
deployments array and bind the resulting Vault token to one external
group per element. Each external group carries a per-deployment policy
granting read on harmony-fleet/data/<deployment-id>/*.
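The order of checks described above can be modelled in a few lines. This is a minimal sketch of the decision logic only, not OpenBao's implementation: it assumes signature verification against the JWKS has already succeeded, flattens the Zitadel roles claim to a set of role names for illustration, and uses hypothetical type and function names (`JwtRole`, `Claims`, `evaluate_login`).

```rust
use std::collections::HashSet;

/// Subset of the JWT auth role settings listed above (hypothetical struct).
pub struct JwtRole {
    pub bound_issuer: String,
    pub bound_audience: String,
    pub required_role: String,
}

/// Claims relevant to login. `roles` is flattened to role names here;
/// the real Zitadel roles claim has a richer shape.
pub struct Claims {
    pub iss: String,
    pub aud: Vec<String>,
    pub roles: HashSet<String>,
    pub deployments: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum Reject {
    Issuer,
    Audience,
    Role,
}

/// Returns the per-deployment policy names a login would resolve to.
/// Signature verification is assumed to have happened before this call.
pub fn evaluate_login(role: &JwtRole, claims: &Claims) -> Result<Vec<String>, Reject> {
    if claims.iss != role.bound_issuer {
        return Err(Reject::Issuer);
    }
    // Project boundary: checked before the deployments claim is looked at.
    if !claims.aud.iter().any(|a| a == &role.bound_audience) {
        return Err(Reject::Audience);
    }
    if !claims.roles.contains(&role.required_role) {
        return Err(Reject::Role);
    }
    // One external group, and therefore one policy, per deployments element.
    Ok(claims
        .deployments
        .iter()
        .map(|id| format!("fleet-deployment-{id}"))
        .collect())
}
```

A token from another project fails at the audience check, so its `deployments` claim never influences the result.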
2. Operator-managed Zitadel metadata as the membership source of truth
The fleet operator is the only writer of user.metadata.deployments
on each device's Zitadel machine user. A Zitadel pre-access-token-creation
Action copies that metadata into a top-level deployments claim on
the access token. The device never touches its own metadata.
When a new Deployment CR is observed in Kubernetes, the operator
executes three writes in a strict order:
- Zitadel metadata: append the deployment ID to the device's deployments array (per device targeted by the deployment).
- OpenBao external group + policy: upsert identity/group/<deployment-id> (type=external, alias matching the JWT-auth accessor) and policy fleet-deployment-<deployment-id> granting read on harmony-fleet/data/<deployment-id>/*.
- NATS desired-state: publish desired-state.<device-id>.<deployment-id> with the workload score.
Reversed, the agent could see the desired-state, attempt a re-auth, and find the deployment missing from its claims — a "permission denied for a deployment I was told to run" race that is confusing to debug and weakens the trust story. Trust state always precedes the workload signal.
Removal runs in reverse: NATS delete → (optional) group/policy delete →
metadata removal. Currently-cached Vault tokens retain access until
their short TTL expires; explicit revocation is available via
bao token revoke on the device's accessor if hard revocation is
needed.
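The write ordering for enrollment and removal can be sketched as follows. The `FleetBackends` trait, its method names, and the `Recorder` test double are hypothetical; they stand in for the operator's Zitadel, OpenBao, and NATS clients, whose real interfaces this ADR does not specify. The point is only that each step runs after the previous one has succeeded, and that a failure stops the sequence before the workload signal is published.

```rust
/// Hypothetical seam over the three systems the operator writes to.
pub trait FleetBackends {
    fn set_membership(&mut self, device: &str, deployment: &str, member: bool) -> Result<(), String>;
    fn upsert_group_and_policy(&mut self, deployment: &str) -> Result<(), String>;
    fn publish_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String>;
    fn delete_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String>;
}

/// Trust state first (metadata, then group + policy), workload signal last.
pub fn enroll<B: FleetBackends>(b: &mut B, device: &str, deployment: &str) -> Result<(), String> {
    b.set_membership(device, deployment, true)?;
    b.upsert_group_and_policy(deployment)?;
    b.publish_desired_state(device, deployment)
}

/// Reverse order: workload signal first, membership last.
/// The optional group/policy delete is left out of this sketch.
pub fn unenroll<B: FleetBackends>(b: &mut B, device: &str, deployment: &str) -> Result<(), String> {
    b.delete_desired_state(device, deployment)?;
    b.set_membership(device, deployment, false)
}

/// In-memory recorder that shows the call order and can inject one failure.
#[derive(Default)]
pub struct Recorder {
    pub calls: Vec<String>,
    pub fail_on: Option<&'static str>,
}

impl Recorder {
    fn record(&mut self, step: &'static str) -> Result<(), String> {
        if self.fail_on == Some(step) {
            return Err(format!("{step} failed"));
        }
        self.calls.push(step.to_string());
        Ok(())
    }
}

impl FleetBackends for Recorder {
    fn set_membership(&mut self, _device: &str, _deployment: &str, member: bool) -> Result<(), String> {
        self.record(if member { "metadata:add" } else { "metadata:remove" })
    }
    fn upsert_group_and_policy(&mut self, _deployment: &str) -> Result<(), String> {
        self.record("bao:group+policy")
    }
    fn publish_desired_state(&mut self, _device: &str, _deployment: &str) -> Result<(), String> {
        self.record("nats:publish")
    }
    fn delete_desired_state(&mut self, _device: &str, _deployment: &str) -> Result<(), String> {
        self.record("nats:delete")
    }
}
```

With this shape, a failed OpenBao upsert leaves the metadata written but no desired-state published, which is the safe direction for the race described above.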
3. Client side: JWT-bearer in harmony_secret, refresh before reconcile in the agent
The agent does not grow a new secrets client. Per ADR-020,
harmony_config is the unified config+secret entry point and already
wraps OpenBao via harmony_secret::OpenbaoSecretStore. The missing
piece is auth: OpenbaoSecretStore supports env token, cached token,
Zitadel OIDC device flow (humans), and userpass — but not Zitadel
JWT-bearer for headless machine identity.
Three additions:
- A fifth rung on OpenbaoSecretStore's auth ladder takes a Zitadel machine keyfile + Bao JWT role + audience, mints via RFC 7523, and POSTs to /v1/auth/jwt/login.
- The pure minting moves to harmony_zitadel_auth so NATS and OpenBao auth share one implementation (Rule of Three: NATS callout + OpenBao auth = two real consumers).
- OpenbaoSecretStore gains refresh_auth() (re-mint + re-login, guarded by an internal Mutex) and cached_scope() -> HashSet<String>, derived from decoding the in-hand Zitadel JWT. No Bao round-trip is needed since the deployments claim is already in the token we just minted.
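Deriving the scope from the token in hand could look roughly like the sketch below. It assumes the Zitadel access token is a JWT whose payload carries a top-level `deployments` array of plain string IDs. The function names (`scope_from_jwt`, `base64url_decode`, `deployments_claim`) are hypothetical, and the hand-rolled scanner is a stand-in for a real JSON parser such as serde_json, used here only to keep the example dependency-free. No signature check is performed: the agent trusts the token it just minted, and verification remains OpenBao's job at login.

```rust
use std::collections::HashSet;

/// Decode unpadded base64url (RFC 4648 section 5), as used by JWT segments.
fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes() {
        let v = u32::from(match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break,
            _ => return None,
        });
        buf = (buf << 6) | v;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1;
        }
    }
    Some(out)
}

/// Naive extraction of the string elements of `"deployments": [...]`.
/// Assumes IDs contain no commas, brackets, or escaped characters.
fn deployments_claim(payload: &str) -> HashSet<String> {
    const KEY: &str = "\"deployments\"";
    let mut scope = HashSet::new();
    let Some(pos) = payload.find(KEY) else { return scope; };
    let rest = payload[pos + KEY.len()..].trim_start();
    let Some(rest) = rest.strip_prefix(':') else { return scope; };
    let Some(rest) = rest.trim_start().strip_prefix('[') else { return scope; };
    let Some(close) = rest.find(']') else { return scope; };
    for item in rest[..close].split(',') {
        let id = item.trim().trim_matches('"');
        if !id.is_empty() {
            scope.insert(id.to_string());
        }
    }
    scope
}

/// Scope implied by a JWT's payload; empty when the token cannot be decoded.
pub fn scope_from_jwt(jwt: &str) -> HashSet<String> {
    let payload = jwt
        .split('.')
        .nth(1)
        .and_then(base64url_decode)
        .and_then(|bytes| String::from_utf8(bytes).ok());
    match payload {
        Some(p) => deployments_claim(&p),
        None => HashSet::new(),
    }
}
```

Treating an undecodable token as an empty scope means the caller falls through to a refresh instead of failing outright.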
In the agent, the NATS KV watcher consults cached_scope() before
each reconciler.apply(). If the desired deployment isn't covered,
it calls refresh_auth() and proceeds. The check is inline in
main.rs — about ten lines around the existing watcher loop. No
new module: one consumer, one site, inlining is the right size.
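The inline check amounts to the following shape. This is a sketch under assumptions: the `ScopedAuth` trait is a hypothetical seam that mirrors the two methods named above so the logic can be exercised without OpenBao or Zitadel, and `FakeAuth` is a test double, not part of harmony_secret.

```rust
use std::collections::HashSet;

/// Hypothetical seam mirroring the two methods the store is said to gain.
pub trait ScopedAuth {
    fn cached_scope(&self) -> HashSet<String>;
    fn refresh_auth(&mut self) -> Result<(), String>;
}

/// Refresh only when the desired deployment is not covered by the cached
/// token. Returns Ok(true) when a refresh was performed.
pub fn refresh_if_uncovered<A: ScopedAuth>(auth: &mut A, deployment_id: &str) -> Result<bool, String> {
    if auth.cached_scope().contains(deployment_id) {
        return Ok(false);
    }
    auth.refresh_auth()?;
    Ok(true)
}

/// Test double: `upstream` stands for what a freshly minted token would carry.
pub struct FakeAuth {
    pub cached: HashSet<String>,
    pub upstream: HashSet<String>,
    pub refreshes: usize,
}

impl ScopedAuth for FakeAuth {
    fn cached_scope(&self) -> HashSet<String> {
        self.cached.clone()
    }
    fn refresh_auth(&mut self) -> Result<(), String> {
        self.cached = self.upstream.clone();
        self.refreshes += 1;
        Ok(())
    }
}
```

In the watcher loop the call would sit directly before `reconciler.apply()`, so that steady-state reconciles for already-covered deployments pay no extra round-trips.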
Secret path layout
harmony-fleet/data/<deployment-id>/<secret-name>
The Zitadel project ID does not appear in the path. Its job is
done at the JWT validation boundary (bound_audiences), not repeated
in every key.
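Because the layout depends only on the deployment ID, both the secret path and the matching policy can be generated mechanically. The helpers below are a minimal sketch with hypothetical function names; the policy name and path pattern follow the ones given in this ADR, and the policy body uses the standard `path` / `capabilities` policy syntax.

```rust
/// KV path for one secret of one deployment, following the layout above.
pub fn secret_path(deployment_id: &str, secret_name: &str) -> String {
    format!("harmony-fleet/data/{deployment_id}/{secret_name}")
}

/// Name of the per-deployment policy attached to the external group.
pub fn policy_name(deployment_id: &str) -> String {
    format!("fleet-deployment-{deployment_id}")
}

/// Policy body granting read on everything under the deployment's prefix.
pub fn policy_hcl(deployment_id: &str) -> String {
    format!(
        "path \"harmony-fleet/data/{deployment_id}/*\" {{\n  capabilities = [\"read\"]\n}}\n"
    )
}
```

A real implementation would likely also validate that the deployment ID contains no path separators or glob characters before interpolating it; that check is omitted here.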
Rationale
Why Zitadel project ID lives in bound_audiences, not the path.
The same trust assertion in two places is duplication, not defense in
depth — both reduce to "the JWT signature is valid for this audience."
Concentrating it at the auth role:
- gives one source of truth ("which project owns this Bao instance");
- keeps secret paths readable and operator-friendly;
- decouples secret organization from Zitadel project identity (a project ID rotation reconfigures one Bao role, not every path).
Why user metadata over project roles for deployment membership.
Project roles in Zitadel live in a flat namespace inside a project.
A handful of roles (fleet-admin, fleet-device) maps cleanly; one
role per deployment would not — role inventories at hundreds of
deployments per fleet become hard to audit and slow to mutate.
User metadata is a per-machine-user JSON store, naturally
multi-valued, and admin-only-writable. The Zitadel Action that copies
metadata to a claim is a one-time, fleet-wide piece of configuration.
Why groups_claim over claim-templated paths. Vault policy
templating ({{identity.entity.aliases…metadata.<key>}}) supports
single-value substitution but not iteration over an array. Multiple
deployments per device require either multiple JWT logins (one per
deployment) or one login that resolves to multiple policies.
groups_claim + external groups gives the latter cleanly: one login,
N policies attached automatically.
Why harmony_config / harmony_secret, not a fleet-local secrets
client. ADR-020 is explicit that harmony_config is the unified
config+secret entry point and OpenbaoSecretStore is the canonical
OpenBao client. Adding a parallel fleet-only client would duplicate
the auth ladder, cache-file layout, and kv2 plumbing already in
harmony_secret. The fleet's need is an additional auth branch,
not a different store.
Why scope is decoded from the Zitadel JWT, not asked of Bao. The
agent already holds the JWT it's about to present at login; the
deployments claim is right there. A /v1/auth/token/lookup-self
round-trip after login would compute the same set from the other
direction, paying a network call to recover information already in
hand.
Consequences
Pros
- One auth root on a device (the existing Zitadel machine key) covers both NATS and OpenBao access. Rotation, revocation, and inventory remain centralized.
- The operator owns membership; the agent owns identity. A compromised device cannot widen its own access. A compromised operator's blast radius is its own fleet (one Zitadel project, one Bao instance).
- Per-deployment policies are mechanical to generate. Bao policy text is identical modulo the deployment ID, produced by a small templated Score. New deployments add one external group + one policy; no hand-written ACLs.
- The lifecycle layer is a reusable home for future "before-reconcile" work without further architectural changes.
Cons
- Two-token invalidation on membership change. Both the cached Zitadel access token and the cached Bao Vault token must be dropped for new membership to take effect. This is encapsulated in the secrets.refresh() call but is a real round-trip cost (one HTTPS to Zitadel + one to Bao) on every membership change. Mitigated by the fact that membership changes are rare relative to secret reads.
- Removal latency = Vault token TTL. Removing a device from a deployment does not immediately revoke its currently-cached Vault token; access ends at next renewal or TTL expiry. Short TTLs (15 min) bound the worst case; explicit bao token revoke -accessor is available if needed.
- Operator gains Zitadel-admin scope. The operator must hold credentials that can write user metadata in the Zitadel project. This is a high-privilege scope and concentrates trust in the operator. The mitigation is a per-fleet Zitadel project: a compromised operator can only mutate its own fleet's identities.
- Zitadel Action required. Surfacing user metadata as a JWT claim needs a small Zitadel Action (server-side JavaScript). It is part of the fleet's Zitadel setup and must be in version control / applied by the fleet's bootstrap, not configured by hand. (See "Additional Notes" for the script.)
Alternatives considered
Project roles for deployment membership. Rejected: flat namespace
inside a project, no native multi-value semantics, role inventory
explodes at hundreds of deployments per fleet, mutations require
project-admin scope on a coarse-grained API. Kept for the coarse
fleet-device / fleet-admin distinction the NATS callout already
uses.
Project ID embedded in the secret path
(secrets/<project-id>/<deployment>/...). Rejected: the project
isolation is already enforced by bound_audiences at the JWT layer.
Encoding it in the path is duplication of the same assertion, couples
the secret tree to a Zitadel ID, and complicates project rotations.
Adds no security: a token that passes bound_audiences validation can
read the path regardless; one that fails cannot read anything.
Claim-templated single policy
({{identity.…metadata.deployment_id}}). Rejected for the
multi-deployment case: Vault policy templating does not iterate over
arrays, so a single-policy template can only express "one deployment
per device." Acceptable for a single-deployment-per-device world; the
chosen kubelet-like architecture admits N deployments per device, and
collapsing the chosen groups_claim design to this would force
multiple JWT logins per refresh.
Static per-device Bao token issued at provisioning. Rejected: introduces a second long-lived secret on the device, breaks rotation (re-provisioning required), and provides no native per-deployment scoping.
OpenBao OIDC code flow. Rejected: that flow is for human users with a browser. Devices are headless and already hold a JWT-bearer identity; using OIDC would re-invent the wheel and require a local browser-equivalent.
Lifecycle layer inside the NATS handler. Rejected: conflates transport with domain logic and makes the refresh-then-reconcile ordering implicit. The dedicated module makes the contract testable and lets future triggers reuse the same code path.
Additional Notes
Zitadel Action (token customization)
A single pre-access-token-creation Action per fleet's Zitadel project
copies user metadata deployments into a top-level claim:
    // Trigger: pre-access-token-creation
    function addDeployments(ctx, api) {
      const md = ctx.v1.user.getMetadata();
      const entry = md.metadata.find(m => m.key === "deployments");
      if (!entry) return;
      try {
        const deployments = JSON.parse(
          Buffer.from(entry.value, "base64").toString("utf-8")
        );
        if (Array.isArray(deployments)) {
          api.v1.claims.setClaim("deployments", deployments);
        }
      } catch (_) { /* malformed metadata is treated as no deployments */ }
    }
The Action lives in Zitadel's "Flows" configuration, attached to the
Complement Token flow on the relevant project. A Harmony Score
(ZitadelTokenCustomizationScore or similar) is the right home for
applying this declaratively; see plan document for status.
Relationship to ADR-016 and ADR-020-1
ADR-016 (agent mesh on NATS JetStream) establishes the agent's existing Zitadel-keyed identity for NATS. This ADR reuses that identity unchanged.
ADR-020-1 establishes the human-developer authentication path to
OpenBao via Zitadel's Device Authorization Grant. This ADR is the
machine-user counterpart: same OpenBao, same Zitadel, different
auth-method binding (humans use device code; devices use
JWT-bearer-derived access tokens against /auth/jwt/login).
Threat model summary
| Attacker | Capability | Defense |
|---|---|---|
| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. |
| Compromised device (key theft) | Full agent scope on its own deployments only | groups_claim restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. |
| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | bound_audiences rejects at the JWT auth boundary before any claim is read. |
| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. |
| Compromised Bao | Full access to all stored secrets | Out of scope — Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |