docs: ADR on how to securely handle fleet device secrets with OpenBao + Zitadel SSO #319
ROADMAP/fleet_platform/device_secret_access_handoff.md
@@ -0,0 +1,446 @@

# Fleet device secret access: implementation handoff

**Owner:** TBD
**Design:** [`docs/adr/025-fleet-device-secret-access.md`](../../docs/adr/025-fleet-device-secret-access.md)
**Status:** Ready to start
**Written:** 2026-06-01

Read ADR-025 first. This document is the work plan, not the design.

## Summary

The fleet agent reads per-deployment secrets through the existing
`harmony_config` chain, which already wraps OpenBao via
`harmony_secret::OpenbaoSecretStore`. `OpenbaoSecretStore`'s auth
ladder is extended with one new rung, Zitadel **JWT-bearer**, keyed
off the device's existing machine keyfile (the same root of trust the
NATS callout uses). Per-deployment scope rides in the JWT itself as a
`deployments` array claim, populated by the operator through Zitadel
user metadata. OpenBao's `groups_claim` binds those values to
operator-managed external groups, each with a small policy granting
`read` on `harmony-fleet/data/<deployment-id>/*`. The agent's NATS
watcher calls `refresh_auth()` on the store before reconciling a
deployment whose secrets are outside the cached scope.

## Architecture at a glance

```
Fleet operator (one per fleet, holds Zitadel admin scope)
  │
  ├── on new Deployment CR:
  │     1. FleetDeviceDeploymentMembershipScore
  │        ├─ Zitadel: append dep to device.metadata.deployments
  │        └─ OpenBao: upsert external group + policy
  │                    fleet-deployment-<dep> → read on
  │                    harmony-fleet/data/<dep>/*
  │     2. existing NATS publish (data plane, unchanged)
  │
  ▼
Fleet agent
  │
  ├── NATS handler (main.rs):
  │     if !secret_store.cached_scope().await.contains(&dep) {
  │         secret_store.refresh_auth().await?;
  │     }
  │     reconciler.apply(…)
  │
  └── reads secrets via harmony_config (StoreSource wrapping
      OpenbaoSecretStore with new JWT-bearer auth rung)
```

## Pre-flight

- [ ] Confirm staging Zitadel has a service account for the operator
      with **write user metadata** scope. File a sub-task if not.
- [ ] Confirm the staging OpenBao KV mount `harmony-fleet` exists, or
      add its creation to `OpenbaoSetupScore`.
- [ ] Validate `DeploymentName`'s character set is safe for KV paths
      and Bao external-group names (`[a-zA-Z0-9_-]`).
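
The character-set check in the last pre-flight item could be as small as the sketch below. The function name `is_safe_deployment_name` is hypothetical, and rejecting the empty string is an added assumption; only the `[a-zA-Z0-9_-]` set comes from this plan.

```rust
/// Hypothetical helper: true when `name` is non-empty and every byte is in
/// `[a-zA-Z0-9_-]`, the set this plan treats as safe for KV paths and
/// Bao external-group names.
pub fn is_safe_deployment_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}
```

Checking bytes rather than `char`s means any non-ASCII input is rejected outright, which also rules out path separators and glob characters that would change the meaning of a policy path.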

## Work breakdown

Six PRs. Dependencies:

```
PR-1  harmony_zitadel_auth: extract JWT-bearer minter
  └─ PR-2  harmony_secret: JWT-bearer auth + refresh_auth + cached_scope
       └─ PR-6  agent main.rs: inline refresh-check

PR-3  harmony OpenbaoJwtAuth Score: bound_claims + groups_claim
PR-4  Zitadel Action Score: deployments claim
  └─ PR-5  Operator: FleetDeviceDeploymentMembershipScore
PR-3 ─┘
```

PRs 1, 3, 4 are independent and can start in parallel.

---

### PR-1: Extract Zitadel JWT-bearer minter into `harmony_zitadel_auth`

**Crate:** `harmony_zitadel_auth`
**Depends on:** nothing
**Blocks:** PR-2

Move `MachineKeyFile`, `CachedToken`, `build_assertion*`, `build_scope`,
`build_token_url`, and the mint+cache logic out of
`fleet/harmony-fleet-auth/src/credentials.rs` into a new
`harmony_zitadel_auth/src/jwt_bearer.rs`. Two real consumers (NATS
callout + OpenBao) meet the Rule of Three threshold, which this project
sets at the second consumer.

`harmony_secret` cannot depend on `harmony-fleet-auth` (fleet-specific
crate; would invert the dependency graph). `harmony_zitadel_auth` is
neutral and already houses the human-OIDC counterpart.

**Shape.**

```rust
// harmony_zitadel_auth/src/jwt_bearer.rs
pub struct ZitadelJwtBearer {
    key: MachineKeyFile,
    oidc_issuer_url: String,
    audience: String,
    http: reqwest::Client,
    cache: Mutex<Option<CachedToken>>,
}

impl ZitadelJwtBearer {
    pub fn new(key: MachineKeyFile, oidc_issuer_url: String, audience: String,
               danger_accept_invalid_certs: bool) -> Result<Self>;

    /// Cached if comfortably valid, otherwise mints fresh.
    pub async fn bearer_token(&self) -> Result<String>;

    /// Force a re-mint on next call.
    pub fn invalidate_cache(&self);
}
```

`CredentialSource::ZitadelJwt` in `harmony-fleet-auth` becomes a thin
wrapper around `Arc<ZitadelJwtBearer>`. Port the existing pure-builder
tests from `credentials.rs` into the new crate.

**Acceptance.** `cargo test -p harmony_zitadel_auth -p harmony-fleet-auth`
clean. Fleet e2e NATS auth still works.
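
As an illustration of the "cached if comfortably valid" rule on `bearer_token`, a cache entry might carry its expiry and be compared against a safety margin. This is a sketch only: `CachedTokenSketch`, its fields, and the margin parameter are hypothetical and do not describe the real `CachedToken` in `credentials.rs`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the cached access token; the real
/// `CachedToken` may store different fields.
pub struct CachedTokenSketch {
    pub access_token: String,
    pub expires_at: Instant,
}

impl CachedTokenSketch {
    /// "Comfortably valid": more than `margin` of lifetime remains at `now`.
    /// An already-expired token yields `None` below and is never valid.
    pub fn is_comfortably_valid(&self, now: Instant, margin: Duration) -> bool {
        match self.expires_at.checked_duration_since(now) {
            Some(remaining) => remaining > margin,
            None => false,
        }
    }
}
```

Passing `now` in explicitly keeps the check pure, so it can sit alongside the other pure-builder tests being ported in this PR.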

---

### PR-2: `OpenbaoSecretStore` JWT-bearer rung + `refresh_auth` + `cached_scope`

**Crate:** `harmony_secret` (`src/store/openbao.rs`)
**Depends on:** PR-1
**Blocks:** PR-6

Add the fifth rung to `OpenbaoSecretStore::new`'s auth ladder.
Position: between cached-token (rung 2) and Zitadel OIDC device flow
(rung 4). Triggered only when a machine keyfile is configured; if it
fails, fall through to the existing ladder.
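
The ordering rule can be captured as a pure function, which is also the shape the "ladder ordering" test could assert against. This is a sketch under assumptions: the `AuthRung` enum and `auth_ladder` function are hypothetical, and the existing order (env token, cached token, OIDC device flow, userpass) is inferred from the list in ADR-025 rather than read from `openbao.rs`.

```rust
/// Hypothetical enumeration of the auth rungs named in ADR-025.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthRung {
    EnvToken,
    CachedToken,
    ZitadelJwtBearer,
    ZitadelOidcDeviceFlow,
    Userpass,
}

/// Returns the rungs in the order they would be attempted.
/// The JWT-bearer rung is only present when a machine keyfile is configured,
/// and it always sits after the cached token and before the OIDC device flow.
pub fn auth_ladder(machine_keyfile_configured: bool) -> Vec<AuthRung> {
    let mut rungs = vec![AuthRung::EnvToken, AuthRung::CachedToken];
    if machine_keyfile_configured {
        rungs.push(AuthRung::ZitadelJwtBearer);
    }
    rungs.push(AuthRung::ZitadelOidcDeviceFlow);
    rungs.push(AuthRung::Userpass);
    rungs
}
```

With no keyfile configured the result is the unchanged four-rung ladder, which is the "absent → unchanged" half of the ordering test.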

**Constructor refactor.** Knock out the `too_many_arguments` clippy
TODO at `openbao.rs:55` while we're adding more args:

```rust
pub struct OpenbaoStoreOptions {
    pub base_url: String,
    pub kv_mount: String,
    pub auth_mount: String,
    pub skip_tls: bool,
    pub token: Option<String>,
    pub username: Option<String>,
    pub password: Option<String>,
    pub zitadel_sso_url: Option<String>,
    pub zitadel_client_id: Option<String>,
    pub zitadel_jwt_bearer: Option<ZitadelJwtBearerConfig>, // NEW
    pub jwt_role: Option<String>,
    pub jwt_auth_mount: Option<String>,
}

pub struct ZitadelJwtBearerConfig {
    pub key_path: Option<PathBuf>,
    pub key_json: Option<String>,
    pub oidc_issuer_url: String,
    pub audience: String, // Zitadel project ID
}
```

**Refresh capability.** A single `Mutex<Inner>` over the refresh-affected
state. Refresh is rare and uncontended; `ArcSwap` would be ceremony.

```rust
pub struct OpenbaoSecretStore {
    inner: Mutex<Inner>,
    kv_mount: String,
    auth_mount: String,
    jwt_bearer: Option<Arc<ZitadelJwtBearer>>,
    jwt_role: Option<String>,
    jwt_auth_mount: Option<String>,
    base_url: String,
    skip_tls: bool,
}

struct Inner {
    client: VaultClient,
    scope: HashSet<String>, // deployment IDs from the Zitadel JWT
}

impl OpenbaoSecretStore {
    pub async fn refresh_auth(&self) -> Result<(), SecretStoreError> {
        let bearer = self.jwt_bearer.as_ref().ok_or_else(|| /* err */)?;
        bearer.invalidate_cache();
        let jwt = bearer.bearer_token().await?;
        let scope = decode_deployments_claim(&jwt)?; // pure JWT decode
        let session = jwt_login(&self.base_url, /* … */, &jwt).await?;
        let client = build_vault_client(&self.base_url, self.skip_tls, &session.token)?;
        *self.inner.lock().await = Inner { client, scope };
        Ok(())
    }

    pub async fn cached_scope(&self) -> HashSet<String> {
        self.inner.lock().await.scope.clone()
    }
}
```

`decode_deployments_claim` is a small pure helper that base64-decodes
the JWT body and reads the `deployments` array. No Bao round-trip.
`get_raw` / `set_raw` lock `inner`, take a `&VaultClient`, perform the
call, release. The lock is uncontended on the hot path.
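
A std-only sketch of what such a helper might look like follows. It is illustrative rather than the intended implementation: the name `decode_deployments_claim_sketch` is hypothetical, it returns `Option` instead of `SecretStoreError`, and it scans the payload text for a `"deployments": [...]` array of plain strings instead of using a JSON parser (the real helper would more likely use crates such as `base64` and `serde_json`). It does not verify the signature; OpenBao does that at login.

```rust
use std::collections::HashSet;

/// Decodes base64url (RFC 4648 section 5), padded or not, without external crates.
fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut buf: u32 = 0;
    let mut bits: u8 = 0;
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break, // tolerate trailing padding
            _ => return None,
        };
        buf = (buf << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// Reads the `deployments` claim from the JWT payload (second dot-separated part).
/// Returns `None` for a malformed token and an empty set when the claim is absent.
/// Assumes deployment IDs contain no quotes, commas, or brackets.
pub fn decode_deployments_claim_sketch(jwt: &str) -> Option<HashSet<String>> {
    let payload_b64 = jwt.split('.').nth(1)?;
    let payload = String::from_utf8(base64url_decode(payload_b64)?).ok()?;
    let key = "\"deployments\"";
    let Some(key_at) = payload.find(key) else {
        return Some(HashSet::new());
    };
    let rest = payload[key_at + key.len()..].trim_start();
    let rest = rest.strip_prefix(':')?.trim_start();
    let rest = rest.strip_prefix('[')?;
    let body = &rest[..rest.find(']')?];
    Some(
        body.split(',')
            .map(|item| item.trim().trim_matches('"'))
            .filter(|item| !item.is_empty())
            .map(str::to_owned)
            .collect(),
    )
}
```

The text scan is only safe because deployment IDs are restricted to `[a-zA-Z0-9_-]`; a nested object that happened to contain a `deployments` key would confuse it, which is one reason to prefer a real JSON parser in the actual helper.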

**On construction with JWT-bearer.** Wire `jwt_bearer` and call
`refresh_auth()` once before returning. If it fails, fall through to
the next ladder rung.

**Tests.**

- Ladder ordering: JWT-bearer present → tried before OIDC; absent →
  unchanged.
- `refresh_auth` against a fake HTTP server updates `cached_scope`.
- `decode_deployments_claim` on a hand-crafted JWT returns the
  expected set.

**Acceptance.** `cargo test -p harmony_secret` clean. `examples/openbao`
userpass path unaffected. Manual staging run authenticates via
JWT-bearer.

---

### PR-3: Extend `OpenbaoJwtAuth` Score with `bound_claims_json` + `groups_claim`

**Crate:** `harmony` (`src/modules/openbao/setup.rs`)
**Depends on:** nothing

Two new fields, both defaulting to empty so existing callers are
unaffected:

```rust
pub struct OpenbaoJwtAuth {
    // … existing fields …
    #[serde(default)]
    pub bound_claims_json: String,
    #[serde(default)]
    pub groups_claim: String,
}
```

Extend `configure_jwt` (line 435+) to pass each as a
`bao write auth/jwt/role/...` parameter when non-empty.

**Acceptance.** `cargo check -p harmony --all-features` clean.
`examples/openbao` still works.

---

### PR-4: Zitadel Action Score: surface `deployments` metadata as a claim

**Crate:** `harmony` (new module under `src/modules/zitadel/` or
wherever existing Zitadel deploy code lives)
**Depends on:** nothing
**Blocks:** PR-5

Declarative Score that upserts the pre-access-token-creation Action
on the configured Zitadel project. Cannot be avoided: Zitadel emits
`urn:zitadel:iam:user:metadata` as base64-encoded values in a map,
and Bao's `groups_claim` can't consume that shape, so the Action
decodes the metadata and re-emits it as a flat string array.

**Action script** (canonical text):

```javascript
function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed → no deployments */ }
}
```

**Score shape.**

```rust
pub struct ZitadelDeploymentsClaimActionScore {
    pub project_id: String,
    pub action_name: String, // default "fleet-deployments-claim"
}
```

Interpret upserts the Action via the Zitadel Management API and
attaches it to `Complement Token / PreAccessTokenCreation`. Idempotent.

**Acceptance.** Manual decode of a staging token shows the
`deployments` claim.

---

### PR-5: `FleetDeviceDeploymentMembershipScore` + operator wiring

**Crate:** `harmony` (`src/modules/fleet/` or wherever fleet Scores
live) for the Score; `fleet/harmony-fleet-operator` for wiring.
**Depends on:** PR-3 (Bao JWT role + accessor must exist), PR-4
(claim must surface)

One Score, two writes, declared in execution order. Per ADR-023 the
operator hands off ordering to the Score rather than sequencing the
external API calls by hand.

```rust
pub struct FleetDeviceDeploymentMembershipScore {
    pub zitadel_project_id: String,
    pub device_user_id: String,    // Zitadel machine user ID
    pub deployments: Vec<String>,  // declarative full set
    pub openbao_instance: OpenbaoInstance,
    pub kv_mount: String,          // "harmony-fleet"
    pub jwt_auth_accessor: String, // resolved from PR-3
}
```

Interpret runs, in order:

1. **Zitadel metadata**: declarative replace on
   `user.metadata.deployments`. Read current value, write only on
   diff. Removal is "declare the new set without the removed entry."
2. **OpenBao external group + policy, per deployment in the set.** For
   each `<dep>` in `deployments`:
   - Policy `fleet-deployment-<dep>` granting `read` on
     `harmony-fleet/data/<dep>/*` and
     `read, list` on `harmony-fleet/metadata/<dep>/*`.
   - External group `<dep>` (type=external) with that policy
     attached, alias matching the JWT-auth accessor.

Idempotent throughout.
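
The per-deployment policy text is mechanical, so its generation could be a pair of small pure functions like the ones below. The function names are hypothetical; the policy name and the two paths with their capabilities are the ones listed in step 2.

```rust
/// Name of the per-deployment policy, as described in step 2.
pub fn deployment_policy_name(dep: &str) -> String {
    format!("fleet-deployment-{dep}")
}

/// Renders one policy `path` stanza with the given capabilities.
fn path_stanza(path: &str, capabilities: &[&str]) -> String {
    let caps = capabilities
        .iter()
        .map(|c| format!("\"{c}\""))
        .collect::<Vec<_>>()
        .join(", ");
    format!("path \"{path}\" {{\n  capabilities = [{caps}]\n}}\n")
}

/// Policy body: `read` on the KV v2 data path, `read` + `list` on metadata.
/// `dep` is assumed to be pre-validated against `[a-zA-Z0-9_-]`.
pub fn deployment_policy_hcl(kv_mount: &str, dep: &str) -> String {
    [
        path_stanza(&format!("{kv_mount}/data/{dep}/*"), &["read"]),
        path_stanza(&format!("{kv_mount}/metadata/{dep}/*"), &["read", "list"]),
    ]
    .join("\n")
}
```

Because the output depends only on `kv_mount` and `dep`, a second apply produces byte-identical text, which makes the "second apply is a no-op" check straightforward to implement as a string comparison.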

**Operator wiring.** In the existing per-`Deployment` reconciler,
compose this Score before the existing NATS desired-state publish.
If the Score errors, surface as a CR status condition and do not
publish: never tell a device to run a deployment it cannot
authenticate for.

**Note: NATS publish stays in operator code.** It's data-plane (the
"now do this" half of enrollment), already wired, and changing its
home isn't part of this work.

**Tests.**

- Unit on the Score: Zitadel diff logic (no-op on match, full write
  on diff); policy text generation; group alias shape.
- Integration: against the real Bao in `examples/openbao` + a fake
  Zitadel; second apply is a no-op.
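
The Zitadel diff logic named in the unit tests ("no-op on match, full write on diff") might reduce to a set comparison like this sketch. `plan_metadata_write` is a hypothetical name, and treating the metadata as an unordered set with sorted output is an assumption, not something the plan specifies.

```rust
use std::collections::BTreeSet;

/// Returns `None` when the stored set already matches the declared set
/// (no write needed), otherwise the full declared set to write, sorted.
pub fn plan_metadata_write(current: &[String], desired: &[String]) -> Option<Vec<String>> {
    let cur: BTreeSet<&str> = current.iter().map(String::as_str).collect();
    let want: BTreeSet<&str> = desired.iter().map(String::as_str).collect();
    if cur == want {
        None
    } else {
        Some(want.into_iter().map(str::to_owned).collect())
    }
}
```

Writing the full declared set rather than a patch matches the "declarative replace" wording: removal falls out of the same path as addition.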

**Acceptance.** A `Deployment` CR rolled in staging causes both
writes to land before the NATS publish; manual JWT-bearer login at
Bao returns a token with the expected per-deployment policies
attached via external-group bindings.

---

### PR-6: Agent: inline refresh-check in `main.rs`

**Crate:** `fleet/harmony-fleet-agent`
**Depends on:** PR-2

Around the existing JetStream KV watcher in
`fleet/harmony-fleet-agent/src/main.rs:110-120`, gate
`reconciler.apply` on a scope check:

```rust
async_nats::jetstream::kv::Operation::Put => {
    if let Some(dep) = deployment_from_key(&entry.key) {
        if !secret_store.cached_scope().await.contains(dep.as_str()) {
            tracing::info!(%dep, "deployment outside cached scope — refreshing");
            if let Err(e) = secret_store.refresh_auth().await {
                tracing::warn!(key = %entry.key, error = %e, "refresh failed");
                continue;
            }
        }
    }
    if let Err(e) = reconciler.apply(&entry.key, &entry.value).await {
        tracing::warn!(key = %entry.key, error = %e, "apply failed");
    }
}
```

`secret_store` is the same `Arc<OpenbaoSecretStore>` held by the
agent's `ConfigManager`; share it through whatever construction
path `main.rs` already uses.
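
The snippet above relies on a `deployment_from_key` helper that this plan does not define. One possible shape is sketched below, assuming the KV key mirrors the `desired-state.<device-id>.<deployment-id>` subject named in ADR-025; the actual key layout in the agent may differ, so treat both the layout and the return type as assumptions.

```rust
/// Hypothetical key parser: extracts `<deployment-id>` from
/// `desired-state.<device-id>.<deployment-id>`.
/// Returns `None` for any key that does not have exactly that shape.
pub fn deployment_from_key(key: &str) -> Option<String> {
    let mut parts = key.split('.');
    let prefix = parts.next()?;
    let device = parts.next()?;
    let dep = parts.next()?;
    let has_extra_segment = parts.next().is_some();
    if prefix != "desired-state" || device.is_empty() || dep.is_empty() || has_extra_segment {
        return None;
    }
    Some(dep.to_owned())
}
```

Returning `None` for unrecognised keys keeps the gate conservative: the scope check is skipped and `reconciler.apply` handles the entry exactly as it does today.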

**No new module.** One consumer, one site, ~10 lines. Per CLAUDE.md
"Rule of Three: introduce an abstraction at the second real
instance." If/when a second pre-reconcile concern (image-pull creds,
monitoring registration) arrives, extract a layer then.

**No retry on "still missing after refresh."** The operator's
ordering guarantees Zitadel is consistent before the NATS publish.
If we observe the race in practice, add a single retry then.

**Tests.** None new; this is wiring between two tested components.
Confirm via the integration milestone below.

**Acceptance.** Rolling a new deployment to a staging device shows
one "outside cached scope — refreshing" log followed by a clean
reconcile.

---

## Integration milestone: staging dry run

After PRs 1-6 land:

- [ ] Apply the new `OpenbaoJwtAuth` config to staging Bao via
      `OpenbaoSetupScore`.
- [ ] Apply `ZitadelDeploymentsClaimActionScore` to staging Zitadel.
- [ ] Hand-provision metadata + group/policy for one test deployment
      + one test device. Confirm the agent reads its secret.
- [ ] Roll a `Deployment` CR via the operator. Confirm
      `FleetDeviceDeploymentMembershipScore` writes Zitadel and Bao
      before NATS, and the agent reconciles cleanly.
- [ ] Negative: device not in deployment X gets a hard error reading
      X's secret (not a silent fall-through).
- [ ] Negative: a JWT minted before the metadata update cannot read
      the new deployment's secret until the agent's inline refresh
      runs.

## Decisions deferred

1. **Vault token TTL.** ADR-025 recommends 15 min. Confirm in
   staging; adjust `OpenbaoJwtAuth::ttl` if needed.
2. **Hard revocation on deployment removal.** Wait for TTL today.
   Add a revoke companion only if a real fleet requires it.
3. **Bao down at agent startup.** `OpenbaoSecretStore::new` with
   JWT-bearer must not panic; either fall through or surface a
   retryable error. Confirm and document in the milestone.

## Required reading

- `docs/adr/025-fleet-device-secret-access.md`: design.
- `docs/adr/020-1-zitadel-openbao-secure-config-store.md`:
  human-user counterpart of this auth flow.
- `docs/adr/023-deploy-architecture.md`: Score discipline for PRs 3-5.
- `harmony_secret/src/store/openbao.rs`: auth ladder being extended.
- `harmony_config/src/source/store.rs`: wrapper the agent uses.
- `fleet/harmony-fleet-auth/src/credentials.rs`: JWT-bearer mint
  path being extracted in PR-1.
- `nats/callout/src/zitadel.rs`: JWT validation shape the Bao role
  mirrors (`bound_issuer` + `bound_audiences`).
- `fleet/harmony-fleet-agent/src/main.rs`: site of the PR-6 inline
  edit.
docs/adr/025-fleet-device-secret-access.md
@@ -0,0 +1,319 @@

# Architecture Decision Record: Fleet Device Secret Access via Zitadel JWT

Initial Author: Jean-Gabriel Gill-Couture

Initial Date: 2026-06-01

Last Updated Date: 2026-06-01

## Status

Proposed

## Context

Fleet agents on devices need to read per-deployment secrets (image-pull
credentials, application secrets, etc.) from OpenBao. The agent already
holds one durable secret: a Zitadel machine-user JWT keyfile dropped by
`FleetDeviceSetupScore`. That key is the basis for the agent's existing
NATS authentication (`nats/callout` validates the Zitadel-minted access
token; `fleet/harmony-fleet-auth/src/credentials.rs` mints it via the
RFC 7523 JWT-bearer flow).

Three requirements shape the design:

1. **No new device-side secret.** The Zitadel machine key is already the
   single root of trust on a device; the secret-access path must derive
   from the same key, not introduce a second one.

2. **Per-deployment isolation, enforced cryptographically.** A device
   enrolled in deployments A and B reads only `A`'s and `B`'s secrets.
   A device that hosts no deployments reads nothing. The device cannot
   widen its own scope; only the operator can change membership.

3. **Cross-project safety.** A second Zitadel project (a different
   tenant, a different fleet, a malicious org) must not be able to
   produce a token that OpenBao accepts. The trust boundary is the
   project, not the deployment.

The kubelet analogy is the architectural north star: the agent is a
small runtime that learns its workload (and the credentials needed to
run it) from a control-plane authority. The agent never decides what it
is allowed to run or read; it presents a signed identity and the
infrastructure decides.

## Decision

Three coordinating pieces.

### 1. OpenBao JWT auth bound to the Zitadel project

OpenBao's JWT auth method validates incoming tokens against Zitadel's
OIDC discovery URL (JWKS). One auth role per fleet, configured against
**one** Zitadel project:

```
bound_issuer    = <Zitadel issuer URL>
bound_audiences = <Zitadel project ID>
bound_claims    = { "urn:zitadel:iam:org:project:roles": "fleet-device" }
user_claim      = sub
groups_claim    = deployments
```

`bound_audiences` is the project boundary. A token minted in any other
Zitadel project has a different `aud` claim and is rejected before any
membership claim is read. This is the same defense
`nats/callout/src/zitadel.rs` already applies via `set_audience`.

`groups_claim = deployments` instructs OpenBao to read the JWT's
`deployments` array and bind the resulting Vault token to one external
group per element. Each external group carries a per-deployment policy
granting `read` on `harmony-fleet/data/<deployment-id>/*`.
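
The intended outcome of this role configuration can be modelled as a pure function: issuer and audience are checked first, then the role claim, and only then is the `deployments` array mapped to policies. This is a toy model for reasoning about the design, not OpenBao's implementation; the types and function are hypothetical, signature verification is omitted, and the Zitadel roles claim is flattened to a list of role names for simplicity.

```rust
/// Hypothetical, simplified view of the JWT auth role settings above.
#[derive(Debug, Clone)]
pub struct JwtRoleSketch {
    pub bound_issuer: String,
    pub bound_audience: String,
    pub required_role: String,
}

/// Hypothetical, already signature-verified claims.
#[derive(Debug, Clone)]
pub struct ClaimsSketch {
    pub iss: String,
    pub aud: Vec<String>,
    pub roles: Vec<String>,
    pub deployments: Vec<String>,
}

/// Returns the policy names a successful login would carry,
/// or the first binding that rejected the token.
pub fn policies_for_login(
    role: &JwtRoleSketch,
    claims: &ClaimsSketch,
) -> Result<Vec<String>, &'static str> {
    if claims.iss != role.bound_issuer {
        return Err("issuer mismatch");
    }
    // Project boundary: checked before any membership claim is considered.
    if !claims.aud.iter().any(|a| a == &role.bound_audience) {
        return Err("audience mismatch");
    }
    if !claims.roles.iter().any(|r| r == &role.required_role) {
        return Err("missing role");
    }
    Ok(claims
        .deployments
        .iter()
        .map(|dep| format!("fleet-deployment-{dep}"))
        .collect())
}
```

In this model a token from another project fails on `aud` no matter what its `deployments` claim says, which is the cross-project property the Context section asks for.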

### 2. Operator-managed Zitadel metadata as the membership source of truth

The fleet operator is the only writer of `user.metadata.deployments`
on each device's Zitadel machine user. A Zitadel pre-access-token-creation
**Action** copies that metadata into a top-level `deployments` claim on
the access token. The device never touches its own metadata.

When a new `Deployment` CR is observed in Kubernetes, the operator
executes three writes in a strict order:

1. **Zitadel metadata**: append the deployment ID to the device's
   `deployments` array (per device targeted by the deployment).
2. **OpenBao external group + policy**: upsert
   `identity/group/<deployment-id>` (`type=external`, alias matching the
   JWT-auth accessor) and policy
   `fleet-deployment-<deployment-id>` granting
   `read` on `harmony-fleet/data/<deployment-id>/*`.
3. **NATS desired-state**: publish
   `desired-state.<device-id>.<deployment-id>` with the workload score.

Reversed, the agent could see the desired-state, attempt a re-auth,
and find the deployment missing from its claims: a "permission denied
for a deployment I was told to run" race that is confusing to debug
and weakens the trust story. Trust state always precedes the workload
signal.
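
The ordering contract can be sketched as a function that performs the three writes in sequence and stops at the first failure, so the NATS publish never happens without the trust state in place. The trait, its method names, and the recording fake are all hypothetical; the real operator composes a Score rather than calling backends directly.

```rust
/// Hypothetical abstraction over the three external systems.
pub trait FleetBackends {
    fn write_zitadel_metadata(&mut self, device: &str, dep: &str) -> Result<(), String>;
    fn upsert_bao_group_and_policy(&mut self, dep: &str) -> Result<(), String>;
    fn publish_desired_state(&mut self, device: &str, dep: &str) -> Result<(), String>;
}

/// Trust state first (Zitadel, then Bao), workload signal last.
/// `?` aborts on the first error, so a failed trust write blocks the publish.
pub fn enroll<B: FleetBackends>(backends: &mut B, device: &str, dep: &str) -> Result<(), String> {
    backends.write_zitadel_metadata(device, dep)?;
    backends.upsert_bao_group_and_policy(dep)?;
    backends.publish_desired_state(device, dep)
}

/// Test double that records which writes were attempted, in order.
#[derive(Default)]
pub struct RecordingBackends {
    pub calls: Vec<&'static str>,
    pub fail_bao: bool,
}

impl FleetBackends for RecordingBackends {
    fn write_zitadel_metadata(&mut self, _device: &str, _dep: &str) -> Result<(), String> {
        self.calls.push("zitadel");
        Ok(())
    }
    fn upsert_bao_group_and_policy(&mut self, _dep: &str) -> Result<(), String> {
        self.calls.push("bao");
        if self.fail_bao {
            return Err("bao unavailable".to_string());
        }
        Ok(())
    }
    fn publish_desired_state(&mut self, _device: &str, _dep: &str) -> Result<(), String> {
        self.calls.push("nats");
        Ok(())
    }
}
```

With `fail_bao` set, the recorded calls stop at `"bao"`: the device is never told to run a deployment it cannot authenticate for.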

Removal runs in reverse: NATS delete → (optional) group/policy delete →
metadata removal. Currently-cached Vault tokens retain access until
their short TTL expires; explicit revocation is available via
`bao token revoke` on the device's accessor if hard revocation is
needed.

### 3. Client side: JWT-bearer in `harmony_secret`, refresh before reconcile in the agent

The agent does **not** grow a new secrets client. Per ADR-020,
`harmony_config` is the unified config+secret entry point and already
wraps OpenBao via `harmony_secret::OpenbaoSecretStore`. The missing
piece is auth: `OpenbaoSecretStore` supports env token, cached token,
Zitadel OIDC device flow (humans), and userpass, but not Zitadel
**JWT-bearer** for headless machine identity.

Three additions:

- A fifth rung on `OpenbaoSecretStore`'s auth ladder takes a Zitadel
  machine keyfile + Bao JWT role + audience, mints via RFC 7523, and
  POSTs to `/v1/auth/jwt/login`.
- The pure minting moves to `harmony_zitadel_auth` so NATS and
  OpenBao auth share one implementation (Rule of Three: NATS callout
  + OpenBao auth = two real consumers).
- `OpenbaoSecretStore` gains `refresh_auth()` (re-mint + re-login,
  guarded by an internal `Mutex`) and
  `cached_scope() -> HashSet<String>` derived from decoding the
  in-hand Zitadel JWT; no Bao round-trip is needed since the
  `deployments` claim is already in the token we just minted.

In the agent, the NATS KV watcher consults `cached_scope()` before
each `reconciler.apply()`. If the desired deployment isn't covered,
it calls `refresh_auth()` and proceeds. The check is inline in
`main.rs`, about ten lines around the existing watcher loop. No
new module: one consumer, one site, inlining is the right size.

### Secret path layout

```
harmony-fleet/data/<deployment-id>/<secret-name>
```

The Zitadel project ID does **not** appear in the path. Its job is
done at the JWT validation boundary (`bound_audiences`), not repeated
in every key.

## Rationale

**Why Zitadel project ID lives in `bound_audiences`, not the path.**
The same trust assertion in two places is duplication, not defense in
depth; both reduce to "the JWT signature is valid for this audience."
Concentrating it at the auth role:

- gives one source of truth ("which project owns this Bao instance");
- keeps secret paths readable and operator-friendly;
- decouples secret organization from Zitadel project identity (a
  project ID rotation reconfigures one Bao role, not every path).

**Why user metadata over project roles for deployment membership.**
Project roles in Zitadel live in a flat namespace inside a project.
A handful of roles (`fleet-admin`, `fleet-device`) maps cleanly; one
role per deployment would not: role inventories at hundreds of
deployments per fleet become hard to audit and slow to mutate.
User metadata is a per-machine-user JSON store, naturally
multi-valued, and admin-only-writable. The Zitadel Action that copies
metadata to a claim is a one-time, fleet-wide piece of configuration.

**Why `groups_claim` over claim-templated paths.** Vault policy
templating (`{{identity.entity.aliases…metadata.<key>}}`) supports
single-value substitution but not iteration over an array. Multiple
deployments per device require either multiple JWT logins (one per
deployment) or one login that resolves to multiple policies.
`groups_claim` + external groups gives the latter cleanly: one login,
N policies attached automatically.

**Why `harmony_config` / `harmony_secret`, not a fleet-local secrets
client.** ADR-020 is explicit that `harmony_config` is the unified
config+secret entry point and `OpenbaoSecretStore` is the canonical
OpenBao client. Adding a parallel fleet-only client would duplicate
the auth ladder, cache-file layout, and `kv2` plumbing already in
`harmony_secret`. The fleet's need is an *additional auth branch*,
not a different store.

**Why scope is decoded from the Zitadel JWT, not asked of Bao.** The
agent already holds the JWT it's about to present at login; the
`deployments` claim is right there. A `/v1/auth/token/lookup-self`
round-trip after login would compute the same set from the other
direction, paying a network call to recover information already in
hand.

## Consequences

**Pros**

- One auth root on a device (the existing Zitadel machine key) covers
  both NATS and OpenBao access. Rotation, revocation, and inventory
  remain centralized.
- The operator owns membership; the agent owns identity. A compromised
  device cannot widen its own access. A compromised operator's blast
  radius is its own fleet (one Zitadel project, one Bao instance).
- Per-deployment policies are mechanical to generate. Bao policy text
  is identical modulo the deployment ID, produced by a small templated
  Score. New deployments add one external group + one policy; no
  hand-written ACLs.
- The inline pre-reconcile check leaves a clear extraction point: if a
  second "before-reconcile" concern arrives, it can be lifted into a
  dedicated layer without further architectural changes.

**Cons**

- **Two-token invalidation on membership change.** Both the cached
  Zitadel access token and the cached Bao Vault token must be dropped
  for new membership to take effect. This is encapsulated in the
  `refresh_auth()` call but is a real round-trip cost (one HTTPS
  request to Zitadel + one to Bao) on every membership change.
  Mitigated by the fact that membership changes are rare relative to
  secret reads.
- **Removal latency = Vault token TTL.** Removing a device from a
  deployment does not immediately revoke its currently-cached Vault
  token; access ends at next renewal or TTL expiry. Short TTLs (15 min)
  bound the worst case; explicit `bao token revoke -accessor` is
  available if needed.
- **Operator gains Zitadel-admin scope.** The operator must hold
  credentials that can write user metadata in the Zitadel project.
  This is a high-privilege scope and concentrates trust in the
  operator. The mitigation is a per-fleet Zitadel project: a
  compromised operator can only mutate its own fleet's identities.
- **Zitadel Action required.** Surfacing user metadata as a JWT claim
  needs a small Zitadel Action (server-side JavaScript). It is part of
  the fleet's Zitadel setup and must be in version control / applied
  by the fleet's bootstrap, not configured by hand. (See "Additional
  Notes" for the script.)

## Alternatives considered

**Project roles for deployment membership.** Rejected: flat namespace
inside a project, no native multi-value semantics, role inventory
explodes at hundreds of deployments per fleet, mutations require
project-admin scope on a coarse-grained API. Kept for the coarse
`fleet-device` / `fleet-admin` distinction the NATS callout already
uses.

**Project ID embedded in the secret path
(`secrets/<project-id>/<deployment>/...`).** Rejected: the project
isolation is already enforced by `bound_audiences` at the JWT layer.
Encoding it in the path is duplication of the same assertion, couples
the secret tree to a Zitadel ID, and complicates project rotations.
Adds no security: a token that passes `bound_audiences` validation can
read the path regardless; one that fails cannot read anything.

**Claim-templated single policy
(`{{identity.…metadata.deployment_id}}`).** Rejected for the
multi-deployment case: Vault policy templating does not iterate over
arrays, so a single-policy template can only express "one deployment
per device." Acceptable for a single-deployment-per-device world; the
chosen kubelet-like architecture admits N deployments per device, and
collapsing the chosen `groups_claim` design to this would force
multiple JWT logins per refresh.

**Static per-device Bao token issued at provisioning.** Rejected:
introduces a second long-lived secret on the device, breaks rotation
(re-provisioning required), and provides no native per-deployment
scoping.

**OpenBao OIDC code flow.** Rejected: that flow is for human users
with a browser. Devices are headless and already hold a JWT-bearer
identity; using OIDC would re-invent the wheel and require a local
browser-equivalent.

**Dedicated lifecycle module between the NATS handler and the
reconciler.** Rejected for now: with one consumer and one call site,
an inline check in `main.rs` is the right size. The
refresh-then-reconcile ordering stays explicit in the watcher loop,
and a dedicated layer can be extracted when a second pre-reconcile
concern needs the same code path.

## Additional Notes

### Zitadel Action (token customization)

A single pre-access-token-creation Action per fleet's Zitadel project
copies user metadata `deployments` into a top-level claim:

```javascript
// Trigger: pre-access-token-creation
function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed metadata is treated as no deployments */ }
}
```

The Action lives in Zitadel's "Flows" configuration, attached to the
`Complement Token` flow on the relevant project. A Harmony Score
(`ZitadelDeploymentsClaimActionScore` in the handoff plan) is the right
home for applying this declaratively; see the plan document for status.

### Relationship to ADR-016 and ADR-020-1

ADR-016 (agent mesh on NATS JetStream) establishes the agent's
existing Zitadel-keyed identity for NATS. This ADR reuses that
identity unchanged.

ADR-020-1 establishes the human-developer authentication path to
OpenBao via Zitadel's Device Authorization Grant. This ADR is the
machine-user counterpart: same OpenBao, same Zitadel, different
auth-method binding (humans use device code; devices use
JWT-bearer-derived access tokens against `/auth/jwt/login`).

### Threat model summary

| Attacker | Capability | Defense |
|---|---|---|
| External (no Zitadel identity) | None | No valid JWT signature; rejected at JWKS validation. |
| Compromised device (key theft) | Full agent scope on its own deployments only | `groups_claim` restricts scope to the device's metadata; Zitadel admin can rotate the machine key and trigger immediate re-issuance. |
| Different Zitadel project (different tenant or malicious org) | Can mint valid Zitadel tokens for its own project | `bound_audiences` rejects at the JWT auth boundary before any claim is read. |
| Compromised operator | Can mutate Zitadel metadata + Bao policies for its fleet | One operator per fleet; operator credentials themselves stored in Bao under a separate auth path; compromise is contained to the operator's project. |
| Compromised Bao | Full access to all stored secrets | Out of scope: Bao is the root of secret trust by definition. ADR-006 covers Bao operational hardening. |