# Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT
Initial Author: Jean-Gabriel Gill-Couture
Initial Date: 2026-06-01
Last Updated Date: 2026-06-01
## Status
Proposed
## Context
Fleet agents on devices need to read per-deployment secrets (image-pull
credentials, application secrets, etc.) from OpenBao. The agent already
holds one durable secret: a Zitadel machine-user JWT keyfile dropped by
`FleetDeviceSetupScore`. That key is the basis for the agent's existing
NATS authentication (`nats/callout` validates the Zitadel-minted access
token; `fleet/harmony-fleet-auth/src/credentials.rs` mints it via the
RFC 7523 JWT-bearer flow).
Three requirements shape the design:
1. **No new device-side secret.** The Zitadel machine key is already the
single root of trust on a device; the secret-access path must derive
from the same key, not introduce a second one.
2. **Per-deployment isolation, enforced cryptographically.** A device
enrolled in deployments A and B reads only `A`'s and `B`'s secrets.
A device that hosts no deployments reads nothing. The device cannot
widen its own scope — only the operator can change membership.
3. **Cross-project safety.** A second Zitadel project (a different
tenant, a different fleet, a malicious org) must not be able to
produce a token that OpenBao accepts. The trust boundary is the
project, not the deployment.
The kubelet analogy is the architectural north star: the agent is a
small runtime that learns its workload (and the credentials needed to
run it) from a control-plane authority. The agent never decides what it
is allowed to run or read; it presents a signed identity and the
infrastructure decides.
## Decision
Three coordinating pieces.
### 1. OpenBao JWT auth bound to the Zitadel project
OpenBao's JWT auth method validates incoming tokens against Zitadel's
OIDC discovery URL (JWKS). One auth role per fleet, configured against
**one** Zitadel project:
```
bound_issuer = <Zitadel issuer URL>
bound_audiences = <Zitadel project ID>
bound_claims = { "urn:zitadel:iam:org:project:roles": "fleet-device" }
user_claim = sub
groups_claim = deployments
```
`bound_audiences` is the project boundary. A token minted in any other
Zitadel project has a different `aud` claim and is rejected before any
membership claim is read. This is the same defense
`nats/callout/src/zitadel.rs` already applies via `set_audience`.
`groups_claim = deployments` instructs OpenBao to read the JWT's
`deployments` array and bind the resulting Vault token to one external
group per element. Each external group carries a per-deployment policy
granting `read` on `harmony-fleet/data/<deployment-id>/*`.
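The order of checks described above can be modeled as a small pure function. This is a sketch under stated assumptions, not OpenBao's implementation: the `Claims`, `Role`, and `Reject` types are hypothetical, signature verification against the JWKS is assumed to have already succeeded, and the Zitadel roles claim is simplified to a flat list of role names.

```rust
/// Hypothetical, already-decoded view of the claims the role inspects.
pub struct Claims {
    pub iss: String,
    pub aud: Vec<String>,
    pub roles: Vec<String>,       // simplified: role names only
    pub deployments: Vec<String>, // the top-level `deployments` claim
}

/// Hypothetical mirror of the role settings listed above.
pub struct Role {
    pub bound_issuer: String,
    pub bound_audience: String,
    pub required_role: String,
}

#[derive(Debug, PartialEq)]
pub enum Reject {
    Issuer,
    Audience,
    Role,
}

/// Returns the external-group names to bind, or the first check that failed.
/// The `deployments` claim is only read once issuer, audience, and role pass.
pub fn evaluate(role: &Role, claims: &Claims) -> Result<Vec<String>, Reject> {
    if claims.iss != role.bound_issuer {
        return Err(Reject::Issuer);
    }
    if !claims.aud.iter().any(|a| a == &role.bound_audience) {
        return Err(Reject::Audience);
    }
    if !claims.roles.iter().any(|r| r == &role.required_role) {
        return Err(Reject::Role);
    }
    Ok(claims.deployments.clone())
}
```

A token from another project fails at `Reject::Audience` no matter what its `deployments` claim contains, which is the property the project boundary relies on.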
### 2. Operator-managed Zitadel metadata as the membership source of truth
The fleet operator is the only writer of `user.metadata.deployments`
on each device's Zitadel machine user. A Zitadel pre-access-token-creation
**Action** copies that metadata into a top-level `deployments` claim on
the access token. The device never touches its own metadata.
When a new `Deployment` CR is observed in Kubernetes, the operator
executes three writes in a strict order:
1. **Zitadel metadata** — append the deployment ID to the device's
`deployments` array (per device targeted by the deployment).
2. **OpenBao external group + policy** — upsert
`identity/group/<deployment-id>` (`type=external`, alias matching the
JWT-auth accessor) and policy
`fleet-deployment-<deployment-id>` granting
`read` on `harmony-fleet/data/<deployment-id>/*`.
3. **NATS desired-state** — publish
`desired-state.<device-id>.<deployment-id>` with the workload score.
If the order were reversed, the agent could see the desired-state,
attempt a re-auth, and find the deployment missing from its claims: a
"permission denied for a deployment I was told to run" race that is
confusing to debug and weakens the trust story. Trust state always
precedes the workload signal.
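The strict ordering can be sketched as a function that performs the three writes in sequence and stops at the first failure, so the NATS signal is never published ahead of the trust state. The `ControlPlane` trait, its method names, and the `Recorder` fake are hypothetical and exist only to make the ordering observable.

```rust
/// Hypothetical seam over the three systems the operator writes to.
pub trait ControlPlane {
    fn append_zitadel_metadata(&mut self, device: &str, deployment: &str) -> Result<(), String>;
    fn upsert_bao_group_and_policy(&mut self, deployment: &str) -> Result<(), String>;
    fn publish_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String>;
}

/// Trust state first (Zitadel, then OpenBao), workload signal last.
/// `?` aborts on the first error, so a failed trust write never publishes.
pub fn enroll<C: ControlPlane>(cp: &mut C, device: &str, deployment: &str) -> Result<(), String> {
    cp.append_zitadel_metadata(device, deployment)?;
    cp.upsert_bao_group_and_policy(deployment)?;
    cp.publish_desired_state(device, deployment)
}

/// Recording fake used to observe call order.
#[derive(Default)]
pub struct Recorder {
    pub calls: Vec<&'static str>,
    pub fail_bao: bool,
}

impl ControlPlane for Recorder {
    fn append_zitadel_metadata(&mut self, _device: &str, _deployment: &str) -> Result<(), String> {
        self.calls.push("zitadel");
        Ok(())
    }
    fn upsert_bao_group_and_policy(&mut self, _deployment: &str) -> Result<(), String> {
        if self.fail_bao {
            return Err("bao unavailable".to_string());
        }
        self.calls.push("bao");
        Ok(())
    }
    fn publish_desired_state(&mut self, _device: &str, _deployment: &str) -> Result<(), String> {
        self.calls.push("nats");
        Ok(())
    }
}
```

With `fail_bao` set, the recorder sees only the Zitadel write: the agent is never told about a deployment it cannot yet authenticate for.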
Removal runs in reverse: NATS delete → (optional) group/policy delete →
metadata removal. Currently-cached Vault tokens retain access until
their short TTL expires; explicit revocation is available via
`bao token revoke` on the device's accessor if hard revocation is
needed.
### 3. Client side: JWT-bearer in `harmony_secret`, refresh before reconcile in the agent
The agent does **not** grow a new secrets client. Per ADR-020,
`harmony_config` is the unified config+secret entry point and already
wraps OpenBao via `harmony_secret::OpenbaoSecretStore`. The missing
piece is auth: `OpenbaoSecretStore` supports env token, cached token,
Zitadel OIDC device flow (humans), and userpass — but not Zitadel
**JWT-bearer** for headless machine identity.
Three additions:
- A fifth rung on `OpenbaoSecretStore`'s auth ladder takes a Zitadel
machine keyfile + Bao JWT role + audience, mints via RFC 7523, and
POSTs to `/v1/auth/jwt/login`.
- The pure minting moves to `harmony_zitadel_auth` so NATS and
OpenBao auth share one implementation (Rule of Three: NATS callout
+ OpenBao auth = two real consumers).
- `OpenbaoSecretStore` gains `refresh_auth()` (re-mint + re-login,
guarded by an internal `Mutex`) and `cached_scope() ->
HashSet<String>` derived from decoding the in-hand Zitadel JWT —
no Bao round-trip needed since the `deployments` claim is already
in the token we just minted.
In the agent, the NATS KV watcher consults `cached_scope()` before
each `reconciler.apply()`. If the desired deployment isn't covered,
it calls `refresh_auth()` and proceeds. The check is inline in
`main.rs` — about ten lines around the existing watcher loop. No
new module: one consumer, one site, inlining is the right size.
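A minimal sketch of that inline check, written against a hypothetical `ScopedAuth` trait so it can be exercised without OpenBao. Only `cached_scope` and `refresh_auth` come from this ADR; the trait, `ensure_scope`, and `FakeAuth` are illustrative names, and the real methods on `OpenbaoSecretStore` may have different signatures.

```rust
use std::collections::HashSet;

/// Hypothetical trait capturing the two methods this ADR adds.
pub trait ScopedAuth {
    fn cached_scope(&self) -> HashSet<String>;
    fn refresh_auth(&mut self) -> Result<(), String>;
}

/// Refreshes auth only when the deployment is not in the cached scope.
/// Returns Ok(true) if a refresh happened; the caller proceeds to reconcile.
pub fn ensure_scope<A: ScopedAuth>(auth: &mut A, deployment_id: &str) -> Result<bool, String> {
    if auth.cached_scope().contains(deployment_id) {
        return Ok(false);
    }
    auth.refresh_auth()?;
    Ok(true)
}

/// Fake store: a refresh swaps in the scope a freshly minted token would carry.
pub struct FakeAuth {
    pub scope: HashSet<String>,
    pub after_refresh: HashSet<String>,
    pub refreshes: usize,
}

impl ScopedAuth for FakeAuth {
    fn cached_scope(&self) -> HashSet<String> {
        self.scope.clone()
    }
    fn refresh_auth(&mut self) -> Result<(), String> {
        self.refreshes += 1;
        self.scope = self.after_refresh.clone();
        Ok(())
    }
}
```

In the watcher loop this would amount to calling `ensure_scope(&mut store, &deployment_id)?` immediately before `reconciler.apply()`.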
### Secret path layout
```
harmony-fleet/data/<deployment-id>/<secret-name>
```
The Zitadel project ID does **not** appear in the path. Its job is
done at the JWT validation boundary (`bound_audiences`), not repeated
in every key.
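As an illustration of the layout, a path builder might look like the following. The function name and the character whitelist are assumptions made for this sketch; the ADR does not specify which characters deployment IDs or secret names may contain.

```rust
/// Builds `harmony-fleet/data/<deployment-id>/<secret-name>`.
/// Assumption: segments are limited to ASCII alphanumerics, '-', '_' and '.',
/// which also rules out '/' and so prevents escaping the deployment subtree.
pub fn secret_path(deployment_id: &str, secret_name: &str) -> Result<String, String> {
    for (label, segment) in [("deployment id", deployment_id), ("secret name", secret_name)] {
        let valid = !segment.is_empty()
            && segment != "."
            && segment != ".."
            && segment
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_' || c == '.');
        if !valid {
            return Err(format!("invalid {label}: {segment:?}"));
        }
    }
    Ok(format!("harmony-fleet/data/{deployment_id}/{secret_name}"))
}
```

Rejecting separators on the client is a convenience only; the per-deployment policy remains the actual enforcement point.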
## Rationale
**Why Zitadel project ID lives in `bound_audiences`, not the path.**
The same trust assertion in two places is duplication, not defense in
depth — both reduce to "the JWT signature is valid for this audience."
Concentrating it at the auth role:
- gives one source of truth ("which project owns this Bao instance");
- keeps secret paths readable and operator-friendly;
- decouples secret organization from Zitadel project identity (a
project ID rotation reconfigures one Bao role, not every path).
**Why user metadata over project roles for deployment membership.**
Project roles in Zitadel live in a flat namespace inside a project.
A handful of roles (`fleet-admin`, `fleet-device`) maps cleanly; one
role per deployment would not — role inventories at hundreds of
deployments per fleet become hard to audit and slow to mutate.
User metadata is a per-machine-user JSON store, naturally
multi-valued, and admin-only-writable. The Zitadel Action that copies
metadata to a claim is a one-time, fleet-wide piece of configuration.
**Why `groups_claim` over claim-templated paths.** Vault policy
templating (`{{identity.entity.aliases…metadata.<key>}}`) supports
single-value substitution but not iteration over an array. Multiple
deployments per device require either multiple JWT logins (one per
deployment) or one login that resolves to multiple policies.
`groups_claim` + external groups gives the latter cleanly: one login,
N policies attached automatically.
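To make "one login, N policies" concrete, the sketch below renders the per-deployment policy and lists the policy names a login would resolve to for a given `deployments` claim. The function names are hypothetical; the policy name and path follow the conventions stated in this ADR, and the group-to-policy resolution is performed by OpenBao itself, not by code like this.

```rust
/// Policy name convention from this ADR: `fleet-deployment-<deployment-id>`.
pub fn policy_name(deployment_id: &str) -> String {
    format!("fleet-deployment-{deployment_id}")
}

/// Policy body granting `read` on the deployment's secret subtree.
/// The text is identical for every deployment apart from the ID.
pub fn policy_hcl(deployment_id: &str) -> String {
    format!(
        "path \"harmony-fleet/data/{deployment_id}/*\" {{\n  capabilities = [\"read\"]\n}}\n"
    )
}

/// One external group per claim element, one policy per group.
pub fn policies_for_login(deployments: &[&str]) -> Vec<String> {
    deployments.iter().map(|d| policy_name(d)).collect()
}
```

A device whose token carries `["dep-a", "dep-b"]` ends up with both policies from a single login, while a device with an empty claim resolves to none.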
**Why `harmony_config` / `harmony_secret`, not a fleet-local secrets
client.** ADR-020 is explicit that `harmony_config` is the unified
config+secret entry point and `OpenbaoSecretStore` is the canonical
OpenBao client. Adding a parallel fleet-only client would duplicate
the auth ladder, cache-file layout, and `kv2` plumbing already in
`harmony_secret`. The fleet's need is an *additional auth branch*,
not a different store.
**Why scope is decoded from the Zitadel JWT, not asked of Bao.** The
agent already holds the JWT it's about to present at login; the
`deployments` claim is right there. A `/v1/auth/token/lookup-self`
round-trip after login would compute the same set from the other
direction, paying a network call to recover information already in
hand.
## Consequences
**Pros**
- One auth root on a device (the existing Zitadel machine key) covers
both NATS and OpenBao access. Rotation, revocation, and inventory
remain centralized.
- The operator owns membership; the agent owns identity. A compromised
device cannot widen its own access. A compromised operator's blast
radius is its own fleet (one Zitadel project, one Bao instance).
- Per-deployment policies are mechanical to generate. Bao policy text
is identical modulo the deployment ID, produced by a small templated
Score. New deployments add one external group + one policy; no
hand-written ACLs.
- The refresh-before-reconcile check in the agent's watcher loop is a
natural home for future "before-reconcile" work without further
architectural changes.
**Cons**
- **Two-token invalidation on membership change.** Both the cached
Zitadel access token and the cached Bao Vault token must be dropped
for new membership to take effect. This is encapsulated in the
`refresh_auth()` call but is a real round-trip cost (one HTTPS to
Zitadel + one to Bao) on every membership change. Mitigated by the
fact that membership changes are rare relative to secret reads.
- **Removal latency = Vault token TTL.** Removing a device from a
deployment does not immediately revoke its currently-cached Vault
token; access ends at next renewal or TTL expiry. Short TTLs (15 min)
bound the worst case; explicit `bao token revoke -accessor` is
available if needed.
- **Operator gains Zitadel-admin scope.** The operator must hold
credentials that can write user metadata in the Zitadel project.
This is a high-privilege scope and concentrates trust in the
operator. The mitigation is a per-fleet Zitadel project: a
compromised operator can only mutate its own fleet's identities.
- **Zitadel Action required.** Surfacing user metadata as a JWT claim
needs a small Zitadel Action (server-side JavaScript). It is part of
the fleet's Zitadel setup and must be in version control / applied
by the fleet's bootstrap, not configured by hand. (See "Additional
Notes" for the script.)
## Alternatives considered
**Project roles for deployment membership.** Rejected: flat namespace
inside a project, no native multi-value semantics, role inventory
explodes at hundreds of deployments per fleet, mutations require
project-admin scope on a coarse-grained API. Kept for the coarse
`fleet-device` / `fleet-admin` distinction the NATS callout already
uses.
**Project ID embedded in the secret path
(`secrets/<project-id>/<deployment>/...`).** Rejected: the project
isolation is already enforced by `bound_audiences` at the JWT layer.
Encoding it in the path is duplication of the same assertion, couples
the secret tree to a Zitadel ID, and complicates project rotations.
Adds no security: a token that passes `bound_audiences` validation can
read the path regardless; one that fails cannot read anything.
**Claim-templated single policy
(`{{identity.…metadata.deployment_id}}`).** Rejected for the
multi-deployment case: Vault policy templating does not iterate over
arrays, so a single-policy template can only express "one deployment
per device." Acceptable for a single-deployment-per-device world; the
chosen kubelet-like architecture admits N deployments per device, and
collapsing the chosen `groups_claim` design to this would force
multiple JWT logins per refresh.
**Static per-device Bao token issued at provisioning.** Rejected:
introduces a second long-lived secret on the device, breaks rotation
(re-provisioning required), and provides no native per-deployment
scoping.
**OpenBao OIDC code flow.** Rejected: that flow is for human users
with a browser. Devices are headless and already hold a JWT-bearer
identity; using OIDC would re-invent the wheel and require a local
browser-equivalent.
**Lifecycle layer inside the NATS handler.** Rejected: conflates
transport with domain logic and makes the refresh-then-reconcile
ordering implicit. The dedicated module makes the contract testable
and lets future triggers reuse the same code path.
## Additional Notes
### Zitadel Action (token customization)
A single pre-access-token-creation Action per fleet's Zitadel project
copies user metadata `deployments` into a top-level claim:
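```javascript
// Trigger: pre-access-token-creation
// Copies the machine user's `deployments` metadata into a top-level claim.
function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  if (!md || !md.metadata) return;
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    // Assumes the value arrives as base64-encoded JSON and that `Buffer`
    // exists in the Action runtime; verify both on the Zitadel version in
    // use, because a failure here is swallowed by the catch below.
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    // Only emit the claim when the metadata decodes to an array.
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed metadata is treated as no deployments */ }
}
```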
The Action lives in Zitadel's "Flows" configuration, attached to the
`Complement Token` flow on the relevant project. A Harmony Score
(`ZitadelTokenCustomizationScore` or similar) is the right home for
applying this declaratively; see plan document for status.
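On the operator side, the metadata value has to be written in the shape the Action decodes: a JSON array of deployment IDs, base64-encoded. The sketch below shows that encoding using only the standard library. `encode_deployments` and the ID character whitelist are assumptions for illustration; a real operator would more likely use a JSON and a base64 crate.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard (padded) base64 encoding of arbitrary bytes.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = (u32::from(chunk[0]) << 16) | (u32::from(b1) << 8) | u32::from(b2);
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Encodes deployment IDs as base64(JSON array of strings).
/// Assumption: IDs contain only ASCII alphanumerics, '-' and '_', so no
/// JSON string escaping is needed; anything else is rejected.
pub fn encode_deployments(ids: &[&str]) -> Result<String, String> {
    let mut json = String::from("[");
    for (i, id) in ids.iter().enumerate() {
        let valid = !id.is_empty()
            && id.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
        if !valid {
            return Err(format!("unsupported deployment id: {id:?}"));
        }
        if i > 0 {
            json.push(',');
        }
        json.push('"');
        json.push_str(id);
        json.push('"');
    }
    json.push(']');
    Ok(base64_encode(json.as_bytes()))
}
```

For example, `encode_deployments(&["a"])` yields `WyJhIl0=`, which is the base64 form of `["a"]`.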
### Relationship to ADR-016 and ADR-020-1
ADR-016 (agent mesh on NATS JetStream) establishes the agent's
existing Zitadel-keyed identity for NATS. This ADR reuses that
identity unchanged.
ADR-020-1 establishes the human-developer authentication path to
OpenBao via Zitadel's Device Authorization Grant. This ADR is the
machine-user counterpart: same OpenBao, same Zitadel, different
auth-method binding (humans use device code; devices use
JWT-bearer-derived access tokens against `/v1/auth/jwt/login`).
### Threat model summary
| Attacker | Capability | Defense |
|---|---|---|
| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. |
| Compromised device (key theft) | Full agent scope on its own deployments only | `groups_claim` restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. |
| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | `bound_audiences` rejects at the JWT auth boundary before any claim is read. |
| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. |
| Compromised Bao | Full access to all stored secrets | Out of scope — Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |