Fleet device secret access — implementation handoff
Owner: TBD
Design: docs/adr/025-fleet-device-secret-access.md
Status: Ready to start
Written: 2026-06-01
Read ADR-025 first. This document is the work plan, not the design.
Summary
The fleet agent reads per-deployment secrets through the existing
harmony_config chain, which already wraps OpenBao via
harmony_secret::OpenbaoSecretStore. OpenbaoSecretStore's auth
ladder is extended with one new rung — Zitadel JWT-bearer, keyed
off the device's existing machine keyfile (the same root of trust the
NATS callout uses). Per-deployment scope rides in the JWT itself as a
deployments array claim, populated by the operator through Zitadel
user metadata. OpenBao's groups_claim binds those values to
auto-created external groups, each with a small policy granting
read on harmony-fleet/data/<deployment-id>/*. The agent's NATS
watcher calls refresh_auth() on the store before reconciling a
deployment whose secrets are outside the cached scope.
Architecture at a glance
Fleet operator (one per fleet, holds Zitadel admin scope)
   │
   ├── on new Deployment CR:
   │     1. FleetDeviceDeploymentMembershipScore
   │        ├─ Zitadel: append dep to device.metadata.deployments
   │        └─ OpenBao: upsert external group + policy
   │                    fleet-deployment-<dep> → read on
   │                    harmony-fleet/data/<dep>/*
   │     2. existing NATS publish (data plane, unchanged)
   │
   ▼
Fleet agent
   │
   ├── NATS handler (main.rs):
   │     if !secret_store.cached_scope().await.contains(&dep) {
   │         secret_store.refresh_auth().await?;
   │     }
   │     reconciler.apply(…)
   │
   └── reads secrets via harmony_config (StoreSource wrapping
       OpenbaoSecretStore with new JWT-bearer auth rung)
Pre-flight
- Confirm staging Zitadel has a service account for the operator with write user metadata scope. File a sub-task if not.
- Confirm the staging OpenBao KV mount harmony-fleet exists, or add its creation to OpenbaoSetupScore.
- Validate DeploymentName's character set is safe for KV paths and Bao external-group names ([a-zA-Z0-9_-]).
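The character-set check in the last item could be sketched as below. This is a minimal sketch, not the existing DeploymentName validation: the function name is hypothetical, and rejecting the empty string is an added assumption (the pre-flight item names only the allowed characters).

```rust
/// Hypothetical helper: returns true when `name` uses only the characters
/// the pre-flight check names ([a-zA-Z0-9_-]).
/// Assumption: an empty name is rejected, since it would collapse the KV
/// path `harmony-fleet/data/<deployment-id>/*` into the mount root.
pub fn is_safe_deployment_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}
```

A name such as `edge-cache_01` passes, while `edge/cache` fails because `/` would add a path segment inside the KV mount.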
Work breakdown
Six PRs. Dependencies:
PR-1  harmony_zitadel_auth: extract JWT-bearer minter
  └─ PR-2  harmony_secret: JWT-bearer auth + refresh_auth + cached_scope
       └─ PR-6  agent main.rs: inline refresh-check
PR-3  harmony OpenbaoJwtAuth Score: bound_claims + groups_claim
PR-4  Zitadel Action Score: deployments claim
  └─ PR-5  Operator: FleetDeviceDeploymentMembershipScore
PR-3 ─┘
PRs 1, 3, 4 are independent and can start in parallel.
PR-1 — Extract Zitadel JWT-bearer minter into harmony_zitadel_auth
Crate: harmony_zitadel_auth
Depends on: nothing
Blocks: PR-2
Move MachineKeyFile, CachedToken, build_assertion*, build_scope,
build_token_url, and the mint+cache logic out of
fleet/harmony-fleet-auth/src/credentials.rs into a new
harmony_zitadel_auth/src/jwt_bearer.rs. With OpenBao joining the NATS
callout there are now two real consumers, which meets the repo's Rule
of Three threshold (abstract at the second real instance).
harmony_secret cannot depend on harmony-fleet-auth (fleet-specific
crate; would invert the dependency graph). harmony_zitadel_auth is
neutral and already houses the human-OIDC counterpart.
Shape.
// harmony_zitadel_auth/src/jwt_bearer.rs
pub struct ZitadelJwtBearer {
    key: MachineKeyFile,
    oidc_issuer_url: String,
    audience: String,
    http: reqwest::Client,
    cache: Mutex<Option<CachedToken>>,
}

impl ZitadelJwtBearer {
    pub fn new(
        key: MachineKeyFile,
        oidc_issuer_url: String,
        audience: String,
        danger_accept_invalid_certs: bool,
    ) -> Result<Self>;

    /// Cached if comfortably valid, otherwise mints fresh.
    pub async fn bearer_token(&self) -> Result<String>;

    /// Force a re-mint on next call.
    pub fn invalidate_cache(&self);
}
CredentialSource::ZitadelJwt in harmony-fleet-auth becomes a thin
wrapper around Arc<ZitadelJwtBearer>. Port the existing pure-builder
tests from credentials.rs into the new crate.
Acceptance. cargo test -p harmony_zitadel_auth -p harmony-fleet-auth
clean. Fleet e2e NATS auth still works.
PR-2 — OpenbaoSecretStore JWT-bearer rung + refresh_auth + cached_scope
Crate: harmony_secret (src/store/openbao.rs)
Depends on: PR-1
Blocks: PR-6
Add a fifth rung to OpenbaoSecretStore::new's auth ladder.
Position: between cached-token (rung 2) and Zitadel OIDC device flow
(rung 4), so the new rung is tried third. It is triggered only when a
machine keyfile is configured; if it fails, fall through to the
existing ladder.
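The ordering and fall-through rules can be illustrated with a small model of the ladder. This is a sketch under assumptions: the `Rung` enum, `ladder`, and `authenticate` are hypothetical names, and placing an explicit token first and userpass last is inferred from the option fields, not confirmed by the existing openbao.rs.

```rust
/// Hypothetical model of the auth ladder. Only the relative position of
/// ZitadelJwtBearer (after CachedToken, before ZitadelOidcDevice) comes
/// from this plan; the first and last rungs are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rung {
    ExplicitToken,
    CachedToken,
    ZitadelJwtBearer,
    ZitadelOidcDevice,
    Userpass,
}

/// Builds the ordered ladder. The JWT-bearer rung is present only when a
/// machine keyfile is configured, so existing callers see no change.
pub fn ladder(has_machine_keyfile: bool) -> Vec<Rung> {
    let mut rungs = vec![Rung::ExplicitToken, Rung::CachedToken];
    if has_machine_keyfile {
        rungs.push(Rung::ZitadelJwtBearer);
    }
    rungs.push(Rung::ZitadelOidcDevice);
    rungs.push(Rung::Userpass);
    rungs
}

/// Tries each rung in order and returns the first that succeeds.
/// A failing rung is skipped, which is the fall-through behaviour.
pub fn authenticate<F>(rungs: &[Rung], mut attempt: F) -> Option<Rung>
where
    F: FnMut(Rung) -> bool,
{
    rungs.iter().copied().find(|r| attempt(*r))
}
```

The two ladder-ordering tests listed for this PR map onto `ladder(true)` and `ladder(false)`; the fall-through case is `authenticate` with a closure that rejects the JWT-bearer rung.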
Constructor refactor. Knock out the too_many_arguments clippy
TODO at openbao.rs:55 while we're adding more args:
pub struct OpenbaoStoreOptions {
    pub base_url: String,
    pub kv_mount: String,
    pub auth_mount: String,
    pub skip_tls: bool,
    pub token: Option<String>,
    pub username: Option<String>,
    pub password: Option<String>,
    pub zitadel_sso_url: Option<String>,
    pub zitadel_client_id: Option<String>,
    pub zitadel_jwt_bearer: Option<ZitadelJwtBearerConfig>, // NEW
    pub jwt_role: Option<String>,
    pub jwt_auth_mount: Option<String>,
}

pub struct ZitadelJwtBearerConfig {
    pub key_path: Option<PathBuf>,
    pub key_json: Option<String>,
    pub oidc_issuer_url: String,
    pub audience: String, // Zitadel project ID
}
Refresh capability. A single Mutex<Inner> over the refresh-affected
state. Refresh is rare and uncontended; ArcSwap would be ceremony.
pub struct OpenbaoSecretStore {
    inner: Mutex<Inner>,
    kv_mount: String,
    auth_mount: String,
    jwt_bearer: Option<Arc<ZitadelJwtBearer>>,
    jwt_role: Option<String>,
    jwt_auth_mount: Option<String>,
    base_url: String,
    skip_tls: bool,
}

struct Inner {
    client: VaultClient,
    scope: HashSet<String>, // deployment IDs from the Zitadel JWT
}

impl OpenbaoSecretStore {
    pub async fn refresh_auth(&self) -> Result<(), SecretStoreError> {
        let bearer = self.jwt_bearer.as_ref().ok_or_else(|| /* err */)?;
        bearer.invalidate_cache();
        let jwt = bearer.bearer_token().await?;
        let scope = decode_deployments_claim(&jwt)?; // pure JWT decode
        let session = jwt_login(&self.base_url, /* … */, &jwt).await?;
        let client = build_vault_client(&self.base_url, self.skip_tls, &session.token)?;
        // Swap client and scope together so readers never see a mismatch.
        *self.inner.lock().await = Inner { client, scope };
        Ok(())
    }

    pub async fn cached_scope(&self) -> HashSet<String> {
        self.inner.lock().await.scope.clone()
    }
}
decode_deployments_claim is a small pure helper that base64-decodes
the JWT body and reads the deployments array. No Bao round-trip.
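A stdlib-only sketch of that helper follows, mainly to show that the decode needs no network call. It is not the planned implementation: the real helper would more plausibly use the base64 and serde_json crates and return SecretStoreError. Here the JSON handling is a naive scan that only copes with a flat array of unescaped strings, a missing claim is treated as an empty scope (an assumption), and `craft_unsigned_jwt` is a hypothetical test helper for hand-crafted tokens.

```rust
use std::collections::HashSet;

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/// Decodes base64url (the JWT segment encoding), tolerating missing padding.
fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let (mut buf, mut bits) = (0u32, 0u32);
    for c in input.trim_end_matches('=').bytes() {
        let v = ALPHABET.iter().position(|a| *a == c)? as u32;
        buf = (buf << 6) | v;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1u32 << bits) - 1;
        }
    }
    Some(out)
}

/// Encodes bytes as unpadded base64url. Used only to hand-craft test tokens.
fn base64url_encode(bytes: &[u8]) -> String {
    let mut out = String::new();
    for chunk in bytes.chunks(3) {
        let n = chunk.iter().fold(0u32, |acc, b| (acc << 8) | u32::from(*b))
            << (8 * (3 - chunk.len()));
        for i in 0..=chunk.len() {
            out.push(ALPHABET[((n >> (18 - 6 * i)) & 63) as usize] as char);
        }
    }
    out
}

/// Hypothetical test helper: header.payload. with an empty signature.
pub fn craft_unsigned_jwt(payload_json: &str) -> String {
    format!(
        "{}.{}.",
        base64url_encode(br#"{"alg":"none"}"#),
        base64url_encode(payload_json.as_bytes())
    )
}

/// Reads the `deployments` array from the JWT payload without verifying the
/// signature. Returns None when the token or the array is malformed, and an
/// empty set when the claim is absent (assumption).
pub fn decode_deployments_claim(jwt: &str) -> Option<HashSet<String>> {
    let payload_b64 = jwt.split('.').nth(1)?;
    let payload = String::from_utf8(base64url_decode(payload_b64)?).ok()?;
    let key = "\"deployments\"";
    let Some(at) = payload.find(key) else {
        return Some(HashSet::new());
    };
    let rest = payload[at + key.len()..].trim_start();
    let rest = rest.strip_prefix(':')?.trim_start().strip_prefix('[')?;
    let body = &rest[..rest.find(']')?];
    let mut scope = HashSet::new();
    for item in body.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        // Naive: assumes names contain no escapes, commas, or brackets.
        scope.insert(item.strip_prefix('"')?.strip_suffix('"')?.to_string());
    }
    Some(scope)
}
```

Skipping signature verification is acceptable only because the token was just minted by the agent itself and Bao verifies it again at login; the decoded set is a cache hint, not an authorization decision.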
get_raw / set_raw lock inner, take a &VaultClient, perform the
call, release. The lock is uncontended on the hot path.
On construction with JWT-bearer. Wire jwt_bearer and call
refresh_auth() once before returning. If it fails, fall through to
the next ladder rung.
Tests.
- Ladder ordering: JWT-bearer present → tried before OIDC; absent → unchanged.
- refresh_auth against a fake HTTP server updates cached_scope.
- decode_deployments_claim on a hand-crafted JWT returns the expected set.
Acceptance. cargo test -p harmony_secret clean. examples/openbao
userpass path unaffected. Manual staging run authenticates via
JWT-bearer.
PR-3 — Extend OpenbaoJwtAuth Score with bound_claims_json + groups_claim
Crate: harmony (src/modules/openbao/setup.rs)
Depends on: nothing
Blocks: PR-5
Two new fields, both defaulting to empty so existing callers are unaffected:
pub struct OpenbaoJwtAuth {
    // … existing fields …
    #[serde(default)]
    pub bound_claims_json: String,
    #[serde(default)]
    pub groups_claim: String,
}
Extend configure_jwt (line 435+) to pass each as a bao write auth/jwt/role/... flag when non-empty.
Acceptance. cargo check -p harmony --all-features clean.
examples/openbao still works.
PR-4 — Zitadel Action Score: surface deployments metadata as a claim
Crate: harmony (new module under src/modules/zitadel/ or
wherever existing Zitadel deploy code lives)
Depends on: nothing
Blocks: PR-5
Declarative Score that upserts the pre-access-token-creation Action
on the configured Zitadel project. The Action cannot be avoided:
Zitadel emits urn:zitadel:iam:user:metadata as a map of
base64-encoded values, and Bao's groups_claim can't consume that
shape. The Action decodes the deployments entry and re-emits it as a
flat string array.
Action script (canonical text):
function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed → no deployments */ }
}
Score shape.
pub struct ZitadelDeploymentsClaimActionScore {
    pub project_id: String,
    pub action_name: String, // default "fleet-deployments-claim"
}
Interpret upserts the Action via the Zitadel Management API and
attaches it to the Complement Token flow's PreAccessTokenCreation
trigger. Idempotent.
Acceptance. Manual decode of a staging token shows the
deployments claim.
PR-5 — FleetDeviceDeploymentMembershipScore + operator wiring
Crate: harmony (src/modules/fleet/ or wherever fleet Scores
live) for the Score; fleet/harmony-fleet-operator for wiring.
Depends on: PR-3 (Bao JWT role + accessor must exist), PR-4
(claim must surface)
One Score, two writes, declared in execution order. Per ADR-023 the operator hands off ordering to the Score rather than sequencing the external API calls by hand.
pub struct FleetDeviceDeploymentMembershipScore {
    pub zitadel_project_id: String,
    pub device_user_id: String,     // Zitadel machine user ID
    pub deployments: Vec<String>,   // declarative full set
    pub openbao_instance: OpenbaoInstance,
    pub kv_mount: String,           // "harmony-fleet"
    pub jwt_auth_accessor: String,  // resolved from PR-3
}
Interpret runs, in order:
- Zitadel metadata: declarative replace on user.metadata.deployments. Read current value, write only on diff. Removal is "declare the new set without the removed entry."
- OpenBao external group + policy, per deployment in the set. For each <dep> in deployments:
  - Policy fleet-deployment-<dep> granting read on harmony-fleet/data/<dep>/* and read, list on harmony-fleet/metadata/<dep>/*.
  - External group <dep> (type=external) with that policy attached and a group alias bound to the JWT-auth accessor.
Idempotent throughout.
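The per-deployment policy text described above could be generated roughly as follows. This is a sketch: the function names are hypothetical, and the HCL layout (one `path` stanza per KV v2 prefix) is the standard policy syntax, not text taken from existing Score code.

```rust
/// Policy name convention from this plan: fleet-deployment-<dep>.
pub fn policy_name(dep: &str) -> String {
    format!("fleet-deployment-{dep}")
}

/// Renders the policy body: read on the KV v2 data prefix, read + list on
/// the metadata prefix. `dep` is assumed to have passed the pre-flight
/// character-set check, so it is interpolated without escaping.
pub fn policy_hcl(kv_mount: &str, dep: &str) -> String {
    format!(
        "path \"{kv_mount}/data/{dep}/*\" {{\n  capabilities = [\"read\"]\n}}\n\
         path \"{kv_mount}/metadata/{dep}/*\" {{\n  capabilities = [\"read\", \"list\"]\n}}\n"
    )
}
```

Because the output depends only on the mount and the deployment name, re-rendering and re-writing the same policy is naturally idempotent, which is what the unit test on policy text generation can assert.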
Operator wiring. In the existing per-Deployment reconciler,
compose this Score before the existing NATS desired-state publish.
If the Score errors, surface as a CR status condition and do not
publish — never tell a device to run a deployment it cannot
authenticate for.
Note: NATS publish stays in operator code. It's data-plane (the "now do this" half of enrollment), already wired, and changing its home isn't part of this work.
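The gating rule (membership Score first, publish only on success) can be sketched as below. The `Condition` enum and `reconcile` function are hypothetical; the real reconciler would run the Score and the NATS publish asynchronously and write a proper CR status condition.

```rust
/// Hypothetical stand-in for the CR status condition the operator sets.
#[derive(Debug, PartialEq, Eq)]
pub enum Condition {
    Ready,
    MembershipFailed(String),
}

/// Runs the membership step, then publishes only if it succeeded.
/// On error the publish closure is never called, so a device is never told
/// to run a deployment it cannot authenticate for.
pub fn reconcile<M, P>(apply_membership: M, publish: P) -> Condition
where
    M: FnOnce() -> Result<(), String>,
    P: FnOnce(),
{
    match apply_membership() {
        Ok(()) => {
            publish();
            Condition::Ready
        }
        Err(reason) => Condition::MembershipFailed(reason),
    }
}
```

Passing both steps as closures keeps the ordering in one place and makes the "no publish on failure" rule testable without a NATS server.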
Tests.
- Unit on the Score: Zitadel diff logic (no-op on match, full write on diff); policy text generation; group alias shape.
- Integration: against the real Bao in examples/openbao + a fake Zitadel; second apply is a no-op.
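The Zitadel diff logic named in the unit test could look like the sketch below. The names are hypothetical, and comparing as sets (ignoring order and duplicates) is an assumption; the plan only says "no-op on match, full write on diff".

```rust
use std::collections::BTreeSet;

/// Hypothetical result of planning the metadata write.
#[derive(Debug, PartialEq, Eq)]
pub enum MetadataWrite {
    /// Current value already matches the declared set.
    NoOp,
    /// Replace the whole value with this sorted, de-duplicated list.
    Replace(Vec<String>),
}

/// Compares the current and desired deployment lists as sets and plans a
/// full replace only when they differ. Removal needs no special case: the
/// removed entry is simply absent from `desired`.
pub fn plan_metadata_write(current: &[String], desired: &[String]) -> MetadataWrite {
    let cur: BTreeSet<&str> = current.iter().map(String::as_str).collect();
    let want: BTreeSet<&str> = desired.iter().map(String::as_str).collect();
    if cur == want {
        MetadataWrite::NoOp
    } else {
        MetadataWrite::Replace(want.into_iter().map(str::to_owned).collect())
    }
}
```

Emitting a sorted list keeps the written value stable across applies, which supports the "second apply is a no-op" integration check.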
Acceptance. A Deployment CR rolled in staging causes both
writes to land before the NATS publish; manual JWT-bearer login at
Bao returns a token with the expected per-deployment policies
attached via external-group bindings.
PR-6 — Agent: inline refresh-check in main.rs
Crate: fleet/harmony-fleet-agent
Depends on: PR-2
Around the existing JetStream KV watcher in
fleet/harmony-fleet-agent/src/main.rs:110-120, gate
reconciler.apply on a scope check:
async_nats::jetstream::kv::Operation::Put => {
    if let Some(dep) = deployment_from_key(&entry.key) {
        if !secret_store.cached_scope().await.contains(dep.as_str()) {
            tracing::info!(%dep, "deployment outside cached scope — refreshing");
            if let Err(e) = secret_store.refresh_auth().await {
                tracing::warn!(key = %entry.key, error = %e, "refresh failed");
                continue;
            }
        }
    }
    if let Err(e) = reconciler.apply(&entry.key, &entry.value).await {
        tracing::warn!(key = %entry.key, error = %e, "apply failed");
    }
}
secret_store is the same Arc<OpenbaoSecretStore> held by the
agent's ConfigManager — share it through whatever construction
path main.rs already uses.
No new module. One consumer, one site, ~10 lines. Per CLAUDE.md "Rule of Three: introduce an abstraction at the second real instance." If/when a second pre-reconcile concern (image-pull creds, monitoring registration) arrives, extract a layer then.
No retry on "still missing after refresh." The operator's ordering guarantees Zitadel is consistent before the NATS publish. If we observe the race in practice, add a single retry then.
Tests. None new — this is wiring between two tested components. Confirm via the integration milestone below.
Acceptance. Rolling a new deployment to a staging device shows one "outside cached scope — refreshing" log followed by a clean reconcile.
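The expected behaviour of the gate (one refresh on first sight of a deployment, none afterwards, skip on refresh failure) can be modelled synchronously as below. `FakeStore`, `Outcome`, and `handle_put` are hypothetical stand-ins, not the agent's types; the real code is async and calls OpenbaoSecretStore.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for OpenbaoSecretStore. `upstream` models the
/// deployments a freshly minted JWT would carry.
pub struct FakeStore {
    pub scope: HashSet<String>,
    pub upstream: HashSet<String>,
    pub refreshes: usize,
    pub fail_refresh: bool,
}

impl FakeStore {
    pub fn new(upstream: &[&str]) -> Self {
        FakeStore {
            scope: HashSet::new(),
            upstream: upstream.iter().map(|d| d.to_string()).collect(),
            refreshes: 0,
            fail_refresh: false,
        }
    }

    /// Counts the attempt, then replaces the cached scope on success.
    pub fn refresh_auth(&mut self) -> Result<(), String> {
        self.refreshes += 1;
        if self.fail_refresh {
            return Err("refresh failed".to_string());
        }
        self.scope = self.upstream.clone();
        Ok(())
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Applied,
    SkippedRefreshFailed,
}

/// Mirrors the Put arm: refresh only when the deployment is outside the
/// cached scope, skip the apply if the refresh fails, and do not retry
/// when the deployment is still missing after a successful refresh.
pub fn handle_put(store: &mut FakeStore, dep: &str) -> Outcome {
    if !store.scope.contains(dep) && store.refresh_auth().is_err() {
        return Outcome::SkippedRefreshFailed;
    }
    Outcome::Applied
}
```

In this model a second Put for the same deployment leaves `refreshes` at 1, which corresponds to the single "refreshing" log line the acceptance check looks for.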
Integration milestone — staging dry run
After PRs 1-6 land:
- Apply the new OpenbaoJwtAuth config to staging Bao via OpenbaoSetupScore.
- Apply ZitadelDeploymentsClaimActionScore to staging Zitadel.
- Hand-provision metadata + group/policy for one test deployment + one test device. Confirm the agent reads its secret.
- Roll a Deployment CR via the operator. Confirm FleetDeviceDeploymentMembershipScore writes Zitadel and Bao before NATS, and the agent reconciles cleanly.
- Negative: device not in deployment X gets a hard error reading X's secret (not a silent fall-through).
- Negative: a JWT minted before the metadata update cannot read the new deployment's secret until the agent's inline refresh runs.
Decisions deferred
- Vault token TTL. ADR-025 recommends 15 min. Confirm in staging; adjust OpenbaoJwtAuth::ttl if needed.
- Hard revocation on deployment removal. Wait for TTL today. Add a revoke companion only if a real fleet requires it.
- Bao down at agent startup. OpenbaoSecretStore::new with JWT-bearer must not panic; either fall through or surface a retryable error. Confirm and document in the milestone.
Required reading
- docs/adr/025-fleet-device-secret-access.md: design.
- docs/adr/020-1-zitadel-openbao-secure-config-store.md: human-user counterpart of this auth flow.
- docs/adr/023-deploy-architecture.md: Score discipline for PRs 3-5.
- harmony_secret/src/store/openbao.rs: auth ladder being extended.
- harmony_config/src/source/store.rs: wrapper the agent uses.
- fleet/harmony-fleet-auth/src/credentials.rs: JWT-bearer mint path being extracted in PR-1.
- nats/callout/src/zitadel.rs: JWT validation shape the Bao role mirrors (bound_issuer + bound_audiences).
- fleet/harmony-fleet-agent/src/main.rs: site of the PR-6 inline edit.