# Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-06-01

Last Updated Date: 2026-06-01

## Status

Proposed

## Context

Fleet agents on devices need to read per-deployment secrets (image-pull
credentials, application secrets, etc.) from OpenBao. The agent already
holds one durable secret: a Zitadel machine-user JWT keyfile dropped by
`FleetDeviceSetupScore`. That key is the basis for the agent's existing
NATS authentication (`nats/callout` validates the Zitadel-minted access
token; `fleet/harmony-fleet-auth/src/credentials.rs` mints it via the
RFC 7523 JWT-bearer flow).

Three requirements shape the design:

1. **No new device-side secret.** The Zitadel machine key is already the
   single root of trust on a device; the secret-access path must derive
   from the same key, not introduce a second one.

2. **Per-deployment isolation, enforced cryptographically.** A device
   enrolled in deployments A and B reads only `A`'s and `B`'s secrets.
   A device that hosts no deployments reads nothing. The device cannot
   widen its own scope; only the operator can change membership.

3. **Cross-project safety.** A second Zitadel project (a different
   tenant, a different fleet, a malicious org) must not be able to
   produce a token that OpenBao accepts. The trust boundary is the
   project, not the deployment.

The kubelet analogy is the architectural north star: the agent is a
small runtime that learns its workload (and the credentials needed to
run it) from a control-plane authority. The agent never decides what it
is allowed to run or read; it presents a signed identity and the
infrastructure decides.

## Decision

Three coordinating pieces.

### 1. OpenBao JWT auth bound to the Zitadel project

OpenBao's JWT auth method validates incoming tokens against Zitadel's
OIDC discovery URL (JWKS). One auth role per fleet, configured against
**one** Zitadel project:

```
bound_issuer    = <Zitadel issuer URL>
bound_audiences = <Zitadel project ID>
bound_claims    = { "urn:zitadel:iam:org:project:roles": "fleet-device" }
user_claim      = sub
groups_claim    = deployments
```

`bound_audiences` is the project boundary. A token minted in any other
Zitadel project has a different `aud` claim and is rejected before any
membership claim is read. This is the same defense
`nats/callout/src/zitadel.rs` already applies via `set_audience`.

`groups_claim = deployments` instructs OpenBao to read the JWT's
`deployments` array and bind the resulting Vault token to one external
group per element. Each external group carries a per-deployment policy
granting `read` on `harmony-fleet/data/<deployment-id>/*`.

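
The order of checks described above can be modelled in a few lines of Rust. This is only an illustrative sketch of the validation order, not OpenBao's implementation: signature verification against the JWKS is assumed to have happened already, and `Claims`, `JwtRole`, `Reject`, and `evaluate_login` are hypothetical names introduced here.

```rust
/// Claims taken from a token whose signature is ASSUMED to be verified
/// already (hypothetical, simplified shape).
struct Claims {
    iss: String,
    aud: Vec<String>,
    roles: Vec<String>,
    sub: String,
    deployments: Vec<String>,
}

/// The role settings from the configuration above (hypothetical struct).
struct JwtRole {
    bound_issuer: String,
    bound_audience: String,
    required_role: String,
}

#[derive(Debug, PartialEq)]
enum Reject {
    WrongIssuer,
    WrongAudience,
    MissingRole,
}

/// Returns (user, groups) on success. Issuer, audience, and role are
/// checked before the `deployments` claim is read.
fn evaluate_login(role: &JwtRole, claims: &Claims) -> Result<(String, Vec<String>), Reject> {
    if claims.iss != role.bound_issuer {
        return Err(Reject::WrongIssuer);
    }
    if !claims.aud.iter().any(|a| a == &role.bound_audience) {
        return Err(Reject::WrongAudience);
    }
    if !claims.roles.iter().any(|r| r == &role.required_role) {
        return Err(Reject::MissingRole);
    }
    // Only now is membership read: one group per deployment ID.
    Ok((claims.sub.clone(), claims.deployments.clone()))
}
```

A token from another project fails at the audience check, so its `deployments` claim never influences the outcome.
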
### 2. Operator-managed Zitadel metadata as the membership source of truth

The fleet operator is the only writer of `user.metadata.deployments`
on each device's Zitadel machine user. A Zitadel pre-access-token-creation
**Action** copies that metadata into a top-level `deployments` claim on
the access token. The device never touches its own metadata.

When a new `Deployment` CR is observed in Kubernetes, the operator
executes three writes in a strict order:

1. **Zitadel metadata**: append the deployment ID to the device's
   `deployments` array (per device targeted by the deployment).
2. **OpenBao external group + policy**: upsert
   `identity/group/<deployment-id>` (`type=external`, alias matching the
   JWT-auth accessor) and policy
   `fleet-deployment-<deployment-id>` granting
   `read` on `harmony-fleet/data/<deployment-id>/*`.
3. **NATS desired-state**: publish
   `desired-state.<device-id>.<deployment-id>` with the workload score.

If the order were reversed, the agent could see the desired-state,
attempt a re-auth, and find the deployment missing from its claims: a
"permission denied for a deployment I was told to run" race that is
confusing to debug and weakens the trust story. Trust state always
precedes the workload signal.

Removal runs in reverse: NATS delete → (optional) group/policy delete →
metadata removal. Currently-cached Vault tokens retain access until
their short TTL expires; if hard revocation is needed, the device's
token can be revoked explicitly with `bao token revoke -accessor`.

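
The ordering rule can be captured as data plus one small driver. The sketch below is a minimal illustration under assumed names (`Write`, `ADD_ORDER`, `removal_order`, and `apply_in_order` are hypothetical, not existing operator code); the real writes would call Zitadel, OpenBao, and NATS.

```rust
/// The three kinds of write the operator performs (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Write {
    ZitadelMetadata,
    BaoGroupAndPolicy,
    NatsDesiredState,
}

/// Order for adding a deployment: trust state first, workload signal last.
const ADD_ORDER: [Write; 3] = [
    Write::ZitadelMetadata,
    Write::BaoGroupAndPolicy,
    Write::NatsDesiredState,
];

/// Removal is the exact reverse: the workload signal is withdrawn first.
fn removal_order() -> Vec<Write> {
    ADD_ORDER.iter().rev().copied().collect()
}

/// Applies the steps in order and stops at the first failure. For an
/// addition, a failed trust write means the desired-state is never published.
fn apply_in_order<F>(steps: &[Write], mut apply: F) -> Result<(), String>
where
    F: FnMut(Write) -> Result<(), String>,
{
    for step in steps {
        apply(*step)?;
    }
    Ok(())
}
```

Stopping at the first error is a design choice of this sketch: it keeps the invariant that the agent never sees a desired-state for which the trust writes did not complete.
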
### 3. Client side: JWT-bearer in `harmony_secret`, refresh before reconcile in the agent

The agent does **not** grow a new secrets client. Per ADR-020,
`harmony_config` is the unified config+secret entry point and already
wraps OpenBao via `harmony_secret::OpenbaoSecretStore`. The missing
piece is auth: `OpenbaoSecretStore` supports env token, cached token,
Zitadel OIDC device flow (humans), and userpass, but not Zitadel
**JWT-bearer** for headless machine identity.

Three additions:

- A fifth rung on `OpenbaoSecretStore`'s auth ladder takes a Zitadel
  machine keyfile + Bao JWT role + audience, mints via RFC 7523, and
  POSTs to `/v1/auth/jwt/login`.
- The pure minting moves to `harmony_zitadel_auth` so NATS and
  OpenBao auth share one implementation (Rule of Three: NATS callout
  + OpenBao auth = two real consumers).
- `OpenbaoSecretStore` gains `refresh_auth()` (re-mint + re-login,
  guarded by an internal `Mutex`) and
  `cached_scope() -> HashSet<String>`, derived from decoding the
  in-hand Zitadel JWT. No Bao round-trip is needed, since the
  `deployments` claim is already in the token we just minted.

In the agent, the NATS KV watcher consults `cached_scope()` before
each `reconciler.apply()`. If the desired deployment isn't covered,
it calls `refresh_auth()` and proceeds. The check is inline in
`main.rs`, about ten lines around the existing watcher loop. No
new module: one consumer, one site, inlining is the right size.

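
The check before each reconcile could look roughly like the following. It is a sketch, not the agent's code: the `ScopedAuth` trait, `ensure_scope`, and `FakeAuth` are hypothetical stand-ins, and the real `OpenbaoSecretStore` methods may have different signatures (for example, they may be async).

```rust
use std::collections::HashSet;

/// The two calls the agent needs from the secret store. Mirrors the
/// `cached_scope()` / `refresh_auth()` additions; signatures are assumed.
trait ScopedAuth {
    fn cached_scope(&self) -> HashSet<String>;
    fn refresh_auth(&mut self) -> Result<(), String>;
}

/// Refreshes auth only when the desired deployment is missing from the
/// cached claims. Returns whether a refresh was performed.
fn ensure_scope<A: ScopedAuth>(auth: &mut A, deployment_id: &str) -> Result<bool, String> {
    if auth.cached_scope().contains(deployment_id) {
        return Ok(false);
    }
    auth.refresh_auth()?;
    Ok(true)
}

/// In-memory stand-in for the real store, for illustration only.
struct FakeAuth {
    scope: HashSet<String>,
    after_refresh: HashSet<String>,
    refreshes: u32,
}

impl ScopedAuth for FakeAuth {
    fn cached_scope(&self) -> HashSet<String> {
        self.scope.clone()
    }

    fn refresh_auth(&mut self) -> Result<(), String> {
        // Simulates re-mint + re-login picking up new membership.
        self.refreshes += 1;
        self.scope = self.after_refresh.clone();
        Ok(())
    }
}
```

In the watcher loop, a call such as `ensure_scope(&mut store, deployment_id)?` would sit immediately before `reconciler.apply()`; deployments already in scope cost no network round-trip.
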
### Secret path layout

```
harmony-fleet/data/<deployment-id>/<secret-name>
```

The Zitadel project ID does **not** appear in the path. Its job is
done at the JWT validation boundary (`bound_audiences`), not repeated
in every key.

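
A small helper makes the layout concrete. This is a sketch: `secret_path` and `is_safe_segment` are hypothetical names, and the segment validation rules are an assumption added here for illustration, not something this ADR specifies.

```rust
/// Mount and layout taken from the path shown above.
const MOUNT_DATA_PREFIX: &str = "harmony-fleet/data";

/// ASSUMED validation rule: a segment is non-empty and cannot escape its
/// level or act as a policy glob.
fn is_safe_segment(s: &str) -> bool {
    !s.is_empty() && s != "." && s != ".." && !s.contains('/') && !s.contains('*')
}

/// Builds `harmony-fleet/data/<deployment-id>/<secret-name>`, or returns
/// `None` when either segment fails validation.
fn secret_path(deployment_id: &str, secret_name: &str) -> Option<String> {
    if is_safe_segment(deployment_id) && is_safe_segment(secret_name) {
        Some(format!("{}/{}/{}", MOUNT_DATA_PREFIX, deployment_id, secret_name))
    } else {
        None
    }
}
```

Note that no project ID is passed in: the path depends only on the deployment and the secret name.
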
## Rationale

**Why Zitadel project ID lives in `bound_audiences`, not the path.**
The same trust assertion in two places is duplication, not defense in
depth: both reduce to "the JWT signature is valid for this audience."
Concentrating it at the auth role:

- gives one source of truth ("which project owns this Bao instance");
- keeps secret paths readable and operator-friendly;
- decouples secret organization from Zitadel project identity (a
  project ID rotation reconfigures one Bao role, not every path).

**Why user metadata over project roles for deployment membership.**
Project roles in Zitadel live in a flat namespace inside a project.
A handful of roles (`fleet-admin`, `fleet-device`) maps cleanly; one
role per deployment would not: role inventories at hundreds of
deployments per fleet become hard to audit and slow to mutate.
User metadata is a per-machine-user JSON store, naturally
multi-valued, and admin-only-writable. The Zitadel Action that copies
metadata to a claim is a one-time, fleet-wide piece of configuration.

**Why `groups_claim` over claim-templated paths.** Vault policy
templating (`{{identity.entity.aliases…metadata.<key>}}`) supports
single-value substitution but not iteration over an array. Multiple
deployments per device require either multiple JWT logins (one per
deployment) or one login that resolves to multiple policies.
`groups_claim` + external groups gives the latter cleanly: one login,
N policies attached automatically.

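
The "one login, N policies" mapping is mechanical, as the following sketch shows. The function names are hypothetical; the policy name and path come from the Decision section, and the policy body uses the standard `path` / `capabilities` policy syntax.

```rust
/// Policy name for a deployment, as named in the Decision section.
fn policy_name(deployment_id: &str) -> String {
    format!("fleet-deployment-{}", deployment_id)
}

/// Policy body: read-only access to that deployment's subtree.
fn policy_hcl(deployment_id: &str) -> String {
    format!(
        "path \"harmony-fleet/data/{}/*\" {{\n  capabilities = [\"read\"]\n}}\n",
        deployment_id
    )
}

/// One login, N policies: each element of the `deployments` claim maps to
/// one external group and therefore to one policy.
fn policies_for(deployments: &[&str]) -> Vec<String> {
    deployments.iter().map(|d| policy_name(d)).collect()
}
```

A device whose token carries `["dep-a", "dep-b"]` ends up with the two policies `fleet-deployment-dep-a` and `fleet-deployment-dep-b` from a single login; a device with an empty array gets none.
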
**Why `harmony_config` / `harmony_secret`, not a fleet-local secrets
client.** ADR-020 is explicit that `harmony_config` is the unified
config+secret entry point and `OpenbaoSecretStore` is the canonical
OpenBao client. Adding a parallel fleet-only client would duplicate
the auth ladder, cache-file layout, and `kv2` plumbing already in
`harmony_secret`. The fleet's need is an *additional auth branch*,
not a different store.

**Why scope is decoded from the Zitadel JWT, not asked of Bao.** The
agent already holds the JWT it's about to present at login; the
`deployments` claim is right there. A `/v1/auth/token/lookup-self`
round-trip after login would compute the same set from the other
direction, paying a network call to recover information already in
hand.

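
Reading the claim locally needs only a base64url decode of the token's middle segment. The sketch below uses the standard library alone and is deliberately naive: `b64url_decode` and `scope_from_jwt` are hypothetical names, the claim is extracted by string scanning rather than a JSON parser, and no signature is checked. That is tolerable only because the result is an advisory hint for deciding when to refresh; OpenBao remains the enforcement point.

```rust
use std::collections::HashSet;

/// Decodes unpadded base64url (the JWT segment encoding). Returns `None`
/// on any character outside the URL-safe alphabet.
fn b64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1u32 << bits) - 1;
        }
    }
    Some(out)
}

/// Naive extraction of a top-level `"deployments": ["a", "b"]` array.
/// ASSUMPTION: IDs contain no commas, quotes, brackets, or escapes; real
/// code should use a JSON parser. A missing claim yields an empty scope.
fn scope_from_jwt(token: &str) -> Option<HashSet<String>> {
    let payload_b64 = token.split('.').nth(1)?;
    let payload = String::from_utf8(b64url_decode(payload_b64)?).ok()?;
    let key = "\"deployments\"";
    let start = match payload.find(key) {
        Some(i) => i + key.len(),
        None => return Some(HashSet::new()),
    };
    let rest = &payload[start..];
    let open = rest.find('[')?;
    let close = rest[open..].find(']')? + open;
    let scope: HashSet<String> = rest[open + 1..close]
        .split(',')
        .map(|s| s.trim().trim_matches('"').to_string())
        .filter(|s| !s.is_empty())
        .collect();
    Some(scope)
}
```

For a token whose payload is `{"deployments":["a","b"]}`, `scope_from_jwt` yields the set containing `a` and `b` without any call to Bao.
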
## Consequences

**Pros**

- One auth root on a device (the existing Zitadel machine key) covers
  both NATS and OpenBao access. Rotation, revocation, and inventory
  remain centralized.
- The operator owns membership; the agent owns identity. A compromised
  device cannot widen its own access. A compromised operator's blast
  radius is its own fleet (one Zitadel project, one Bao instance).
- Per-deployment policies are mechanical to generate. Bao policy text
  is identical modulo the deployment ID, produced by a small templated
  Score. New deployments add one external group + one policy; no
  hand-written ACLs.
- The lifecycle layer is a reusable home for future
  "before-reconcile" work without further architectural changes.

**Cons**

- **Two-token invalidation on membership change.** Both the cached
  Zitadel access token and the cached Bao Vault token must be dropped
  for new membership to take effect. This is encapsulated in the
  `refresh_auth()` call but is a real round-trip cost (one HTTPS to
  Zitadel + one to Bao) on every membership change. Mitigated by the
  fact that membership changes are rare relative to secret reads.
- **Removal latency = Vault token TTL.** Removing a device from a
  deployment does not immediately revoke its currently-cached Vault
  token; access ends at next renewal or TTL expiry. Short TTLs (15 min)
  bound the worst case; explicit `bao token revoke -accessor` is
  available if needed.
- **Operator gains Zitadel-admin scope.** The operator must hold
  credentials that can write user metadata in the Zitadel project.
  This is a high-privilege scope and concentrates trust in the
  operator. The mitigation is a per-fleet Zitadel project: a
  compromised operator can only mutate its own fleet's identities.
- **Zitadel Action required.** Surfacing user metadata as a JWT claim
  needs a small Zitadel Action (server-side JavaScript). It is part of
  the fleet's Zitadel setup and must be in version control / applied
  by the fleet's bootstrap, not configured by hand. (See "Additional
  Notes" for the script.)

## Alternatives considered

**Project roles for deployment membership.** Rejected: flat namespace
inside a project, no native multi-value semantics, role inventory
explodes at hundreds of deployments per fleet, mutations require
project-admin scope on a coarse-grained API. Kept for the coarse
`fleet-device` / `fleet-admin` distinction the NATS callout already
uses.

**Project ID embedded in the secret path
(`secrets/<project-id>/<deployment>/...`).** Rejected: the project
isolation is already enforced by `bound_audiences` at the JWT layer.
Encoding it in the path is duplication of the same assertion, couples
the secret tree to a Zitadel ID, and complicates project rotations.
Adds no security: a token that passes `bound_audiences` validation can
read the path regardless; one that fails cannot read anything.

**Claim-templated single policy
(`{{identity.…metadata.deployment_id}}`).** Rejected for the
multi-deployment case: Vault policy templating does not iterate over
arrays, so a single-policy template can only express "one deployment
per device." Acceptable for a single-deployment-per-device world; the
chosen kubelet-like architecture admits N deployments per device, and
collapsing the chosen `groups_claim` design to this would force
multiple JWT logins per refresh.

**Static per-device Bao token issued at provisioning.** Rejected:
introduces a second long-lived secret on the device, breaks rotation
(re-provisioning required), and provides no native per-deployment
scoping.

**OpenBao OIDC code flow.** Rejected: that flow is for human users
with a browser. Devices are headless and already hold a JWT-bearer
identity; using OIDC would re-invent the wheel and require a local
browser-equivalent.

**Lifecycle layer inside the NATS handler.** Rejected: conflates
transport with domain logic and makes the refresh-then-reconcile
ordering implicit. The dedicated module makes the contract testable
and lets future triggers reuse the same code path.

## Additional Notes

### Zitadel Action (token customization)

A single pre-access-token-creation Action per fleet's Zitadel project
copies user metadata `deployments` into a top-level claim:

```javascript
// Trigger: pre-access-token-creation
function addDeployments(ctx, api) {
  // Look up the operator-written `deployments` metadata entry.
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    // The value is treated as base64-encoded JSON: decode, then parse.
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    // Only emit the claim when the metadata really is an array.
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed metadata is treated as no deployments */ }
}
```

The Action lives in Zitadel's "Flows" configuration, attached to the
`Complement Token` flow on the relevant project. A Harmony Score
(`ZitadelTokenCustomizationScore` or similar) is the right home for
applying this declaratively; see plan document for status.

### Relationship to ADR-016 and ADR-020-1

ADR-016 (agent mesh on NATS JetStream) establishes the agent's
existing Zitadel-keyed identity for NATS. This ADR reuses that
identity unchanged.

ADR-020-1 establishes the human-developer authentication path to
OpenBao via Zitadel's Device Authorization Grant. This ADR is the
machine-user counterpart: same OpenBao, same Zitadel, different
auth-method binding (humans use device code; devices use
JWT-bearer-derived access tokens against `/auth/jwt/login`).

### Threat model summary

| Attacker | Capability | Defense |
|---|---|---|
| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. |
| Compromised device (key theft) | Full agent scope on its own deployments only | `groups_claim` restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. |
| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | `bound_audiences` rejects at the JWT auth boundary before any claim is read. |
| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. |
| Compromised Bao | Full access to all stored secrets | Out of scope: Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |