harmony/docs/adr/025-fleet-device-secret-access.md

# Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-06-01

Last Updated Date: 2026-06-01

## Status

Proposed

## Context

Fleet agents on devices need to read per-deployment secrets (image-pull
credentials, application secrets, etc.) from OpenBao. The agent already
holds one durable secret: a Zitadel machine-user JWT keyfile dropped by
`FleetDeviceSetupScore`. That key is the basis for the agent's existing
NATS authentication (`nats/callout` validates the Zitadel-minted access
token; `fleet/harmony-fleet-auth/src/credentials.rs` mints it via the
RFC 7523 JWT-bearer flow).

Three requirements shape the design:

1. **No new device-side secret.** The Zitadel machine key is already the
   single root of trust on a device; the secret-access path must derive
   from the same key, not introduce a second one.

2. **Per-deployment isolation, enforced cryptographically.** A device
   enrolled in deployments A and B reads only `A`'s and `B`'s secrets.
   A device that hosts no deployments reads nothing. The device cannot
   widen its own scope — only the operator can change membership.

3. **Cross-project safety.** A second Zitadel project (a different
   tenant, a different fleet, a malicious org) must not be able to
   produce a token that OpenBao accepts. The trust boundary is the
   project, not the deployment.

The kubelet analogy is the architectural north star: the agent is a
small runtime that learns its workload (and the credentials needed to
run it) from a control-plane authority. The agent never decides what it
is allowed to run or read; it presents a signed identity and the
infrastructure decides.

## Decision

Three coordinating pieces.

### 1. OpenBao JWT auth bound to the Zitadel project

OpenBao's JWT auth method validates incoming tokens against Zitadel's
OIDC discovery URL (JWKS). One auth role per fleet, configured against
**one** Zitadel project:

```
bound_issuer    = <Zitadel issuer URL>
bound_audiences = <Zitadel project ID>
bound_claims    = { "urn:zitadel:iam:org:project:roles": "fleet-device" }
user_claim      = sub
groups_claim    = deployments
```

`bound_audiences` is the project boundary. A token minted in any other
Zitadel project has a different `aud` claim and is rejected before any
membership claim is read. This is the same defense
`nats/callout/src/zitadel.rs` already applies via `set_audience`.

`groups_claim = deployments` instructs OpenBao to read the JWT's
`deployments` array and bind the resulting Vault token to one external
group per element. Each external group carries a per-deployment policy
granting `read` on `harmony-fleet/data/<deployment-id>/*`.
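
The role's checks can be modeled as a short pipeline. The sketch below is a simplified Rust model, not OpenBao's implementation. It assumes signature verification against the JWKS has already succeeded, flattens the Zitadel roles claim into a plain list of role names, and uses hypothetical `Claims`, `Role`, and `evaluate_login` names. It illustrates the audience check running before the `deployments` claim is consulted, and one login resolving to one policy per deployment.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified view of the claims this ADR relies on.
#[derive(Clone)]
struct Claims {
    iss: String,
    aud: Vec<String>,
    roles: Vec<String>,       // assumption: role names flattened out of the Zitadel roles claim
    deployments: Vec<String>, // top-level claim added by the Zitadel Action
}

/// Hypothetical mirror of the auth role's bound_* settings.
struct Role {
    bound_issuer: String,
    bound_audience: String,
    required_role: String,
}

#[derive(Debug, PartialEq)]
enum Reject {
    WrongIssuer,
    WrongAudience,
    MissingRole,
}

/// Returns the policy names a login would resolve to, or the first failed check.
/// The `deployments` claim is only read after issuer, audience, and role pass.
fn evaluate_login(role: &Role, claims: &Claims) -> Result<BTreeSet<String>, Reject> {
    if claims.iss != role.bound_issuer {
        return Err(Reject::WrongIssuer);
    }
    if !claims.aud.iter().any(|a| a == &role.bound_audience) {
        return Err(Reject::WrongAudience);
    }
    if !claims.roles.iter().any(|r| r == &role.required_role) {
        return Err(Reject::MissingRole);
    }
    // One external group, and therefore one policy, per element of the claim.
    Ok(claims
        .deployments
        .iter()
        .map(|d| format!("fleet-deployment-{}", d))
        .collect())
}
```

Under this model, a token carrying `deployments = ["a", "b"]` resolves to `fleet-deployment-a` and `fleet-deployment-b`, a token with an empty claim resolves to no policies, and a token from another project fails with `WrongAudience` whatever its `deployments` claim says.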

### 2. Operator-managed Zitadel metadata as the membership source of truth

The fleet operator is the only writer of `user.metadata.deployments`
on each device's Zitadel machine user. A Zitadel **Action** on the
pre-access-token-creation trigger copies that metadata into a top-level
`deployments` claim on the access token. The device never writes its
own metadata.

When a new `Deployment` CR is observed in Kubernetes, the operator
executes three writes in a strict order:

1. **Zitadel metadata**: append the deployment ID to the `deployments`
   array of every device the deployment targets.
2. **OpenBao external group + policy**: upsert policy
   `fleet-deployment-<deployment-id>` granting `read` on
   `harmony-fleet/data/<deployment-id>/*`, then upsert the external
   group `identity/group/name/<deployment-id>` (`type=external`) that
   carries that policy, with a group alias named `<deployment-id>` on
   the JWT auth mount's accessor.
3. **NATS desired-state**: publish
   `desired-state.<device-id>.<deployment-id>` with the workload score.

Reversed, the agent could see the desired-state, attempt a re-auth,
and find the deployment missing from its claims — a "permission denied
for a deployment I was told to run" race that is confusing to debug
and weakens the trust story. Trust state always precedes the workload
signal.
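
The ordering constraint can be sketched as a single function that stops at the first failure. This is an illustrative Rust sketch, not the operator's actual code: the `Backends` trait, its method names, and the `Recorder` test double are hypothetical, and the real calls would be asynchronous requests to Zitadel, OpenBao, and NATS.

```rust
/// Hypothetical seam over the three systems the operator writes to.
trait Backends {
    fn append_zitadel_metadata(&mut self, device: &str, deployment: &str) -> Result<(), String>;
    fn upsert_bao_group_and_policy(&mut self, deployment: &str) -> Result<(), String>;
    fn publish_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String>;
}

/// Applies the writes in trust-before-workload order. `?` returns on the
/// first error, so desired-state is never published ahead of trust state.
fn provision<B: Backends>(b: &mut B, devices: &[&str], deployment: &str) -> Result<(), String> {
    for &device in devices {
        b.append_zitadel_metadata(device, deployment)?;
    }
    b.upsert_bao_group_and_policy(deployment)?;
    for &device in devices {
        b.publish_desired_state(device, deployment)?;
    }
    Ok(())
}

/// In-memory recorder used to observe the call order.
#[derive(Default)]
struct Recorder {
    calls: Vec<String>,
    fail_bao: bool,
}

impl Backends for Recorder {
    fn append_zitadel_metadata(&mut self, device: &str, deployment: &str) -> Result<(), String> {
        self.calls.push(format!("zitadel:{}:{}", device, deployment));
        Ok(())
    }

    fn upsert_bao_group_and_policy(&mut self, deployment: &str) -> Result<(), String> {
        if self.fail_bao {
            return Err("openbao unavailable".to_string());
        }
        self.calls.push(format!("bao:{}", deployment));
        Ok(())
    }

    fn publish_desired_state(&mut self, device: &str, deployment: &str) -> Result<(), String> {
        self.calls.push(format!("nats:desired-state.{}.{}", device, deployment));
        Ok(())
    }
}
```

If the OpenBao step fails, the recorder never sees a `nats:` call, which is the property the strict order is meant to guarantee. A retry of `provision` is safe as long as each backend write is an idempotent upsert, which this sketch assumes.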

Removal runs in reverse: NATS delete → (optional) group/policy delete →
metadata removal. Vault tokens already cached on the device retain
access until their short TTL expires; if hard revocation is needed,
revoke by accessor with `bao token revoke -accessor <accessor>`.

### 3. Client side: JWT-bearer in `harmony_secret`, refresh before reconcile in the agent

The agent does **not** grow a new secrets client. Per ADR-020,
`harmony_config` is the unified config+secret entry point and already
wraps OpenBao via `harmony_secret::OpenbaoSecretStore`. The missing
piece is auth: `OpenbaoSecretStore` supports env token, cached token,
Zitadel OIDC device flow (humans), and userpass — but not Zitadel
**JWT-bearer** for headless machine identity.

Three additions:

- A fifth rung on `OpenbaoSecretStore`'s auth ladder takes a Zitadel
  machine keyfile + Bao JWT role + audience, mints via RFC 7523, and
  POSTs to `/v1/auth/jwt/login`.
- The pure minting moves to `harmony_zitadel_auth` so NATS and
  OpenBao auth share one implementation (Rule of Three: NATS callout
  + OpenBao auth = two real consumers).
- `OpenbaoSecretStore` gains `refresh_auth()` (re-mint + re-login,
  guarded by an internal `Mutex`) and `cached_scope() ->
  HashSet<String>` derived from decoding the in-hand Zitadel JWT —
  no Bao round-trip needed since the `deployments` claim is already
  in the token we just minted.

In the agent, the NATS KV watcher consults `cached_scope()` before
each `reconciler.apply()`. If the desired deployment isn't covered,
it calls `refresh_auth()` and proceeds. The check is inline in
`main.rs` — about ten lines around the existing watcher loop. No
new module: one consumer, one site, inlining is the right size.
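
A minimal sketch of that check, under stated assumptions: the `ScopedAuth` trait stands in for the two methods this ADR adds to `OpenbaoSecretStore`, and `ensure_scope` and `FakeStore` are hypothetical names. The check is written as a free function only so it can be exercised in isolation; the ADR keeps it inline in the watcher loop.

```rust
use std::collections::HashSet;

/// Hypothetical seam mirroring `cached_scope()` and `refresh_auth()`.
trait ScopedAuth {
    fn cached_scope(&self) -> HashSet<String>;
    fn refresh_auth(&mut self) -> Result<(), String>;
}

/// Refreshes auth only when the desired deployment is missing from the
/// cached scope. Returns whether a refresh happened. As in the ADR, the
/// caller proceeds to `reconciler.apply()` afterwards either way.
fn ensure_scope<A: ScopedAuth>(auth: &mut A, deployment_id: &str) -> Result<bool, String> {
    if auth.cached_scope().contains(deployment_id) {
        return Ok(false);
    }
    auth.refresh_auth()?;
    Ok(true)
}

/// In-memory stand-in: `upstream` plays the role of the membership that a
/// freshly minted Zitadel JWT would carry.
struct FakeStore {
    scope: HashSet<String>,
    upstream: HashSet<String>,
    refreshes: u32,
}

impl ScopedAuth for FakeStore {
    fn cached_scope(&self) -> HashSet<String> {
        self.scope.clone()
    }

    fn refresh_auth(&mut self) -> Result<(), String> {
        self.scope = self.upstream.clone();
        self.refreshes += 1;
        Ok(())
    }
}
```

The first desired-state message for a newly assigned deployment triggers exactly one refresh; later messages for the same deployment hit the cached scope and cost no network calls.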

### Secret path layout

```
harmony-fleet/data/<deployment-id>/<secret-name>
```

The Zitadel project ID does **not** appear in the path. Its job is
done at the JWT validation boundary (`bound_audiences`), not repeated
in every key.
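
Because the layout depends only on the deployment ID, paths and policy text can be generated mechanically. The sketch below is illustrative: it assumes `harmony-fleet` is a KV v2 mount, the helper names are hypothetical, and the allowed ID alphabet is an assumption chosen so that an ID cannot contain `/` or `*` and reach outside its own subtree.

```rust
/// Mount name taken from the path layout above (assumed to be KV v2).
const MOUNT: &str = "harmony-fleet";

/// Assumption for this sketch: IDs are non-empty ASCII alphanumerics, '-' or '_'.
fn is_valid_deployment_id(id: &str) -> bool {
    !id.is_empty()
        && id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Builds `harmony-fleet/data/<deployment-id>/<secret-name>`.
fn secret_path(deployment_id: &str, secret_name: &str) -> Option<String> {
    if !is_valid_deployment_id(deployment_id) || secret_name.is_empty() {
        return None;
    }
    Some(format!("{}/data/{}/{}", MOUNT, deployment_id, secret_name))
}

/// Policy name used in this ADR: `fleet-deployment-<deployment-id>`.
fn policy_name(deployment_id: &str) -> Option<String> {
    if !is_valid_deployment_id(deployment_id) {
        return None;
    }
    Some(format!("fleet-deployment-{}", deployment_id))
}

/// Renders the read-only policy body for one deployment.
fn policy_hcl(deployment_id: &str) -> Option<String> {
    if !is_valid_deployment_id(deployment_id) {
        return None;
    }
    Some(format!(
        "path \"{}/data/{}/*\" {{\n  capabilities = [\"read\"]\n}}\n",
        MOUNT, deployment_id
    ))
}
```

Rejecting IDs such as `../other` or `*` before templating keeps a malformed deployment ID from producing a policy that matches another deployment's subtree.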

## Rationale

**Why Zitadel project ID lives in `bound_audiences`, not the path.**
The same trust assertion in two places is duplication, not defense in
depth — both reduce to "the JWT signature is valid for this audience."
Concentrating it at the auth role:

- gives one source of truth ("which project owns this Bao instance");
- keeps secret paths readable and operator-friendly;
- decouples secret organization from Zitadel project identity (a
  project ID rotation reconfigures one Bao role, not every path).

**Why user metadata over project roles for deployment membership.**
Project roles in Zitadel live in a flat namespace inside a project.
A handful of roles (`fleet-admin`, `fleet-device`) maps cleanly; one
role per deployment would not — role inventories at hundreds of
deployments per fleet become hard to audit and slow to mutate.
User metadata is a per-machine-user JSON store, naturally
multi-valued, and admin-only-writable. The Zitadel Action that copies
metadata to a claim is a one-time, fleet-wide piece of configuration.

**Why `groups_claim` over claim-templated paths.** Vault policy
templating (`{{identity.entity.aliases…metadata.<key>}}`) supports
single-value substitution but not iteration over an array. Multiple
deployments per device require either multiple JWT logins (one per
deployment) or one login that resolves to multiple policies.
`groups_claim` + external groups gives the latter cleanly: one login,
N policies attached automatically.

**Why `harmony_config` / `harmony_secret`, not a fleet-local secrets
client.** ADR-020 is explicit that `harmony_config` is the unified
config+secret entry point and `OpenbaoSecretStore` is the canonical
OpenBao client. Adding a parallel fleet-only client would duplicate
the auth ladder, cache-file layout, and `kv2` plumbing already in
`harmony_secret`. The fleet's need is an *additional auth branch*,
not a different store.

**Why scope is decoded from the Zitadel JWT, not asked of Bao.** The
agent already holds the JWT it's about to present at login; the
`deployments` claim is right there. A `/v1/auth/token/lookup-self`
round-trip after login would compute the same set from the other
direction, paying a network call to recover information already in
hand.

## Consequences

**Pros**

- One auth root on a device (the existing Zitadel machine key) covers
  both NATS and OpenBao access. Rotation, revocation, and inventory
  remain centralized.
- The operator owns membership; the agent owns identity. A compromised
  device cannot widen its own access. A compromised operator's blast
  radius is its own fleet (one Zitadel project, one Bao instance).
- Per-deployment policies are mechanical to generate. Bao policy text
  is identical modulo the deployment ID, produced by a small templated
  Score. New deployments add one external group + one policy; no
  hand-written ACLs.
- The `cached_scope()` check followed by `refresh_auth()` gives future
  "before-reconcile" work a single, well-defined hook without further
  architectural changes.

**Cons**

- **Two-token invalidation on membership change.** Both the cached
  Zitadel access token and the cached OpenBao token must be dropped
  for new membership to take effect. `refresh_auth()` encapsulates
  this, but it costs two round-trips (one HTTPS request to Zitadel,
  one to Bao) on every membership change. The cost is acceptable
  because membership changes are rare relative to secret reads.
- **Removal latency = Vault token TTL.** Removing a device from a
  deployment does not immediately revoke its currently-cached Vault
  token; access ends at next renewal or TTL expiry. Short TTLs (15 min)
  bound the worst case; explicit `bao token revoke -accessor` is
  available if needed.
- **Operator gains Zitadel-admin scope.** The operator must hold
  credentials that can write user metadata in the Zitadel project.
  This is a high-privilege scope and concentrates trust in the
  operator. The mitigation is a per-fleet Zitadel project: a
  compromised operator can only mutate its own fleet's identities.
- **Zitadel Action required.** Surfacing user metadata as a JWT claim
  needs a small Zitadel Action (server-side JavaScript). It is part of
  the fleet's Zitadel setup and must be in version control / applied
  by the fleet's bootstrap, not configured by hand. (See "Additional
  Notes" for the script.)

## Alternatives considered

**Project roles for deployment membership.** Rejected: flat namespace
inside a project, no native multi-value semantics, role inventory
explodes at hundreds of deployments per fleet, mutations require
project-admin scope on a coarse-grained API. Kept for the coarse
`fleet-device` / `fleet-admin` distinction the NATS callout already
uses.

**Project ID embedded in the secret path
(`secrets/<project-id>/<deployment>/...`).** Rejected: the project
isolation is already enforced by `bound_audiences` at the JWT layer.
Encoding it in the path is duplication of the same assertion, couples
the secret tree to a Zitadel ID, and complicates project rotations.
Adds no security: a token that passes `bound_audiences` validation can
read the path regardless; one that fails cannot read anything.

**Claim-templated single policy
(`{{identity.…metadata.deployment_id}}`).** Rejected for the
multi-deployment case: Vault policy templating does not iterate over
arrays, so a single-policy template can only express "one deployment
per device." Acceptable for a single-deployment-per-device world; the
chosen kubelet-like architecture admits N deployments per device, and
collapsing the chosen `groups_claim` design to this would force
multiple JWT logins per refresh.

**Static per-device Bao token issued at provisioning.** Rejected:
introduces a second long-lived secret on the device, breaks rotation
(re-provisioning required), and provides no native per-deployment
scoping.

**OpenBao OIDC code flow.** Rejected: that flow is for human users
with a browser. Devices are headless and already hold a JWT-bearer
identity; using OIDC would re-invent the wheel and require a local
browser-equivalent.

**Dedicated lifecycle module for the refresh-then-reconcile check.**
Rejected for now: there is one consumer and one call site, so a
separate module would add indirection without reuse. The inline check
in `main.rs` keeps the refresh-then-reconcile ordering visible right
where `reconciler.apply()` is called, and it can be extracted later if
a second trigger appears.

## Additional Notes

### Zitadel Action (token customization)

A single pre-access-token-creation Action per fleet's Zitadel project
copies user metadata `deployments` into a top-level claim:

```javascript
// Trigger: pre-access-token-creation
function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed metadata is treated as no deployments */ }
}
```

The Action lives in Zitadel's "Flows" configuration, attached to the
`Complement Token` flow on the relevant project. A Harmony Score
(`ZitadelTokenCustomizationScore` or similar) is the right home for
applying this declaratively; see plan document for status.

### Relationship to ADR-016 and ADR-020-1

ADR-016 (agent mesh on NATS JetStream) establishes the agent's
existing Zitadel-keyed identity for NATS. This ADR reuses that
identity unchanged.

ADR-020-1 establishes the human-developer authentication path to
OpenBao via Zitadel's Device Authorization Grant. This ADR is the
machine-user counterpart: same OpenBao, same Zitadel, different
auth-method binding (humans use device code; devices use
JWT-bearer-derived access tokens against `/auth/jwt/login`).

### Threat model summary

| Attacker | Capability | Defense |
|---|---|---|
| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. |
| Compromised device (key theft) | Full agent scope on its own deployments only | `groups_claim` restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. |
| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | `bound_audiences` rejects the token at the JWT auth boundary, before any membership claim is read. |
| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. |
| Compromised Bao | Full access to all stored secrets | Out of scope — Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |