harmony/ROADMAP/fleet_platform/device_secret_access_handoff.md
Jean-Gabriel Gill-Couture 3d01d7482f
docs: Simplify architecture for openbao sso via harmony config
2026-06-01 15:15:42 -04:00

Fleet device secret access — implementation handoff

Owner: TBD
Design: docs/adr/025-fleet-device-secret-access.md
Status: Ready to start
Written: 2026-06-01

Read ADR-025 first. This document is the work plan, not the design.

Summary

The fleet agent reads per-deployment secrets through the existing harmony_config chain, which already wraps OpenBao via harmony_secret::OpenbaoSecretStore. OpenbaoSecretStore's auth ladder is extended with one new rung — Zitadel JWT-bearer, keyed off the device's existing machine keyfile (the same root of trust the NATS callout uses). Per-deployment scope rides in the JWT itself as a deployments array claim, populated by the operator through Zitadel user metadata. OpenBao's groups_claim binds those values to auto-created external groups, each with a small policy granting read on harmony-fleet/data/<deployment-id>/*. The agent's NATS watcher calls refresh_auth() on the store before reconciling a deployment whose secrets are outside the cached scope.

Architecture at a glance

        Fleet operator (one per fleet, holds Zitadel admin scope)
              │
              ├── on new Deployment CR:
              │     1. FleetDeviceDeploymentMembershipScore
              │          ├─ Zitadel: append dep to device.metadata.deployments
              │          └─ OpenBao: upsert external group + policy
              │                       fleet-deployment-<dep> → read on
              │                       harmony-fleet/data/<dep>/*
              │     2. existing NATS publish (data plane, unchanged)
              │
              ▼
        Fleet agent
              │
              ├── NATS handler (main.rs):
              │     if !secret_store.cached_scope().await.contains(&dep) {
              │         secret_store.refresh_auth().await?;
              │     }
              │     reconciler.apply(…)
              │
              └── reads secrets via harmony_config (StoreSource wrapping
                  OpenbaoSecretStore with new JWT-bearer auth rung)

Pre-flight

  • Confirm staging Zitadel has a service account for the operator with write user metadata scope. File a sub-task if not.
  • Confirm the staging OpenBao KV mount harmony-fleet exists, or add its creation to OpenbaoSetupScore.
  • Validate DeploymentName's character set is safe for KV paths and Bao external-group names ([a-zA-Z0-9_-]).
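The character-set check in the last bullet can be sketched as a small pure function. The name is_safe_deployment_name is hypothetical, and rejecting the empty string is an added assumption (an empty path segment would collapse the KV path); the real check belongs wherever DeploymentName is constructed.

```rust
/// Returns true when `name` is non-empty and uses only `[a-zA-Z0-9_-]`.
/// Hypothetical helper: the function name and the non-empty rule are
/// assumptions, not part of the existing DeploymentName type.
pub fn is_safe_deployment_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}
```

Working on bytes keeps the check ASCII-only, so any multi-byte UTF-8 character is rejected along with `/`, `*`, and whitespace.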

Work breakdown

Six PRs. Dependencies:

PR-1  harmony_zitadel_auth: extract JWT-bearer minter
  └─ PR-2  harmony_secret: JWT-bearer auth + refresh_auth + cached_scope
       └─ PR-6  agent main.rs: inline refresh-check

PR-3  harmony OpenbaoJwtAuth Score: bound_claims + groups_claim
PR-4  Zitadel Action Score: deployments claim
  └─ PR-5  Operator: FleetDeviceDeploymentMembershipScore
PR-3 ─┘

PRs 1, 3, 4 are independent and can start in parallel.


PR-1 — Extract Zitadel JWT-bearer minter into harmony_zitadel_auth

Crate: harmony_zitadel_auth
Depends on: nothing
Blocks: PR-2

Move MachineKeyFile, CachedToken, build_assertion*, build_scope, build_token_url, and the mint+cache logic out of fleet/harmony-fleet-auth/src/credentials.rs into a new harmony_zitadel_auth/src/jwt_bearer.rs. With OpenBao joining the NATS callout there are now two real consumers, which meets this repo's Rule of Three threshold (extract at the second real instance).

harmony_secret cannot depend on harmony-fleet-auth (fleet-specific crate; would invert the dependency graph). harmony_zitadel_auth is neutral and already houses the human-OIDC counterpart.

Shape.

// harmony_zitadel_auth/src/jwt_bearer.rs
pub struct ZitadelJwtBearer {
    key: MachineKeyFile,
    oidc_issuer_url: String,
    audience: String,
    http: reqwest::Client,
    cache: Mutex<Option<CachedToken>>,
}

impl ZitadelJwtBearer {
    pub fn new(key: MachineKeyFile, oidc_issuer_url: String, audience: String,
               danger_accept_invalid_certs: bool) -> Result<Self>;

    /// Cached if comfortably valid, otherwise mints fresh.
    pub async fn bearer_token(&self) -> Result<String>;

    /// Force a re-mint on next call.
    pub fn invalidate_cache(&self);
}

CredentialSource::ZitadelJwt in harmony-fleet-auth becomes a thin wrapper around Arc<ZitadelJwtBearer>. Port the existing pure-builder tests from credentials.rs into the new crate.
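For readers unfamiliar with the existing cache, the "cached if comfortably valid" rule in bearer_token can be sketched as below. This is an illustration only: the CachedToken fields, the margin parameter, and reuse_if_valid are assumptions and may not match what credentials.rs does today.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the `CachedToken` moved in PR-1.
pub struct CachedToken {
    pub token: String,
    pub expires_at: Instant,
}

impl CachedToken {
    /// True when more than `margin` of lifetime remains at `now`.
    pub fn is_comfortably_valid(&self, now: Instant, margin: Duration) -> bool {
        match self.expires_at.checked_duration_since(now) {
            Some(remaining) => remaining > margin,
            None => false, // already expired
        }
    }
}

/// Returns the cached token when it is comfortably valid, otherwise `None`
/// so the caller knows to mint a fresh one.
pub fn reuse_if_valid(
    cache: &Option<CachedToken>,
    now: Instant,
    margin: Duration,
) -> Option<&str> {
    cache
        .as_ref()
        .filter(|c| c.is_comfortably_valid(now, margin))
        .map(|c| c.token.as_str())
}
```

Passing `now` in explicitly keeps the rule pure, which suits the pure-builder test style being ported from credentials.rs.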

Acceptance. cargo test -p harmony_zitadel_auth -p harmony-fleet-auth clean. Fleet e2e NATS auth still works.


PR-2 — OpenbaoSecretStore JWT-bearer rung + refresh_auth + cached_scope

Crate: harmony_secret (src/store/openbao.rs)
Depends on: PR-1
Blocks: PR-6

Add the fifth rung to OpenbaoSecretStore::new's auth ladder. Position: between cached-token (rung 2) and Zitadel OIDC device flow (rung 4). Triggered only when a machine keyfile is configured; if it fails, fall through to the existing ladder.

Constructor refactor. Knock out the too_many_arguments clippy TODO at openbao.rs:55 while we're adding more args:

pub struct OpenbaoStoreOptions {
    pub base_url: String,
    pub kv_mount: String,
    pub auth_mount: String,
    pub skip_tls: bool,
    pub token: Option<String>,
    pub username: Option<String>,
    pub password: Option<String>,
    pub zitadel_sso_url: Option<String>,
    pub zitadel_client_id: Option<String>,
    pub zitadel_jwt_bearer: Option<ZitadelJwtBearerConfig>,  // NEW
    pub jwt_role: Option<String>,
    pub jwt_auth_mount: Option<String>,
}

pub struct ZitadelJwtBearerConfig {
    pub key_path: Option<PathBuf>,
    pub key_json: Option<String>,
    pub oidc_issuer_url: String,
    pub audience: String,                // Zitadel project ID
}

Refresh capability. A single Mutex<Inner> over the refresh-affected state. Refresh is rare and uncontended; ArcSwap would be ceremony.

pub struct OpenbaoSecretStore {
    inner: Mutex<Inner>,
    kv_mount: String,
    auth_mount: String,
    jwt_bearer: Option<Arc<ZitadelJwtBearer>>,
    jwt_role: Option<String>,
    jwt_auth_mount: Option<String>,
    base_url: String,
    skip_tls: bool,
}

struct Inner {
    client: VaultClient,
    scope: HashSet<String>,    // deployment IDs from the Zitadel JWT
}

impl OpenbaoSecretStore {
    pub async fn refresh_auth(&self) -> Result<(), SecretStoreError> {
        let bearer = self.jwt_bearer.as_ref().ok_or_else(|| /* err */)?;
        bearer.invalidate_cache();
        let jwt = bearer.bearer_token().await?;
        let scope = decode_deployments_claim(&jwt)?;       // pure JWT decode
        let session = jwt_login(&self.base_url, /* … */, &jwt).await?;
        let client = build_vault_client(&self.base_url, self.skip_tls, &session.token)?;
        *self.inner.lock().await = Inner { client, scope };
        Ok(())
    }

    pub async fn cached_scope(&self) -> HashSet<String> {
        self.inner.lock().await.scope.clone()
    }
}

decode_deployments_claim is a small pure helper that base64url-decodes the JWT payload (the middle segment) and reads the deployments array. No Bao round-trip.

get_raw / set_raw lock inner, take a &VaultClient, perform the call, release. The lock is uncontended on the hot path.
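A std-only sketch of what decode_deployments_claim could look like follows. It makes several assumptions: it returns Option rather than Result<_, SecretStoreError>, it treats an absent claim as an empty scope, and it uses a deliberately naive scanner in place of a JSON parser. That scanner takes the first "deployments" key it finds and is only adequate because deployment names are restricted to [a-zA-Z0-9_-]. A real implementation would more likely use a base64 crate and serde_json. No signature check is done here.

```rust
use std::collections::HashSet;

/// Decode unpadded base64url (RFC 4648 section 5), as used by JWT segments.
fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let (mut buf, mut bits) = (0u32, 0u32);
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            _ => return None, // includes '=': JWT segments are unpadded
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1u32 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

/// Naive extraction of a flat string array under the "deployments" key.
/// Assumption: values contain no escapes, commas, or brackets.
fn deployments_from_payload(json: &str) -> Option<HashSet<String>> {
    let key = "\"deployments\"";
    let Some(pos) = json.find(key) else {
        return Some(HashSet::new()); // claim absent: empty scope
    };
    let rest = json[pos + key.len()..].trim_start().strip_prefix(':')?.trim_start();
    let body = rest.strip_prefix('[')?;
    let body = &body[..body.find(']')?];
    body.split(',')
        .map(str::trim)
        .filter(|item| !item.is_empty())
        .map(|item| Some(item.strip_prefix('"')?.strip_suffix('"')?.to_string()))
        .collect()
}

/// Pure decode of the JWT payload segment; no network, no signature check.
pub fn decode_deployments_claim(jwt: &str) -> Option<HashSet<String>> {
    let payload = jwt.split('.').nth(1)?;
    let json = String::from_utf8(base64url_decode(payload)?).ok()?;
    deployments_from_payload(&json)
}
```

For example, the token `e30.eyJkZXBsb3ltZW50cyI6WyJhIiwiYiJdfQ.sig` carries the payload `{"deployments":["a","b"]}` and decodes to the set {"a", "b"}.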

On construction with JWT-bearer. Wire jwt_bearer and call refresh_auth() once before returning. If it fails, fall through to the next ladder rung.
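The fall-through behaviour can be modelled without any network code. The sketch below includes only the three rungs this plan names; the real ladder has more, and the Rung enum, ladder, and authenticate are hypothetical names chosen for illustration.

```rust
/// Rungs named in this plan. The real ladder has others, omitted here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rung {
    CachedToken,
    ZitadelJwtBearer,
    ZitadelOidcDeviceFlow,
}

/// Order in which rungs are attempted. The JWT-bearer rung is present only
/// when a machine keyfile is configured, so other callers see no change.
pub fn ladder(has_machine_keyfile: bool) -> Vec<Rung> {
    let mut rungs = vec![Rung::CachedToken];
    if has_machine_keyfile {
        rungs.push(Rung::ZitadelJwtBearer);
    }
    rungs.push(Rung::ZitadelOidcDeviceFlow);
    rungs
}

/// Try each rung in order: the first success wins, failures fall through.
pub fn authenticate<F>(rungs: &[Rung], mut attempt: F) -> Option<(Rung, String)>
where
    F: FnMut(Rung) -> Result<String, String>,
{
    for &rung in rungs {
        if let Ok(token) = attempt(rung) {
            return Some((rung, token));
        }
    }
    None
}
```

Separating the ordering (ladder) from the attempt loop (authenticate) lets the ordering be unit-tested without a fake HTTP server.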

Tests.

  • Ladder ordering: JWT-bearer present → tried before OIDC; absent → unchanged.
  • refresh_auth against a fake HTTP server updates cached_scope.
  • decode_deployments_claim on a hand-crafted JWT returns the expected set.

Acceptance. cargo test -p harmony_secret clean. examples/openbao userpass path unaffected. Manual staging run authenticates via JWT-bearer.


PR-3 — Extend OpenbaoJwtAuth Score with bound_claims_json + groups_claim

Crate: harmony (src/modules/openbao/setup.rs)
Depends on: nothing

Two new fields, both defaulting to empty so existing callers are unaffected:

pub struct OpenbaoJwtAuth {
    // … existing fields …
    #[serde(default)] pub bound_claims_json: String,
    #[serde(default)] pub groups_claim: String,
}

Extend configure_jwt (line 435+) to pass each field as a key=value argument to bao write auth/jwt/role/... when non-empty.

Acceptance. cargo check -p harmony --all-features clean. examples/openbao still works.


PR-4 — Zitadel Action Score: surface deployments metadata as a claim

Crate: harmony (new module under src/modules/zitadel/ or wherever existing Zitadel deploy code lives)
Depends on: nothing
Blocks: PR-5

Declarative Score that upserts the pre-access-token-creation Action on the configured Zitadel project. This step cannot be avoided: Zitadel emits urn:zitadel:iam:user:metadata as a map of base64-encoded values, and Bao's groups_claim can't consume that shape, so the Action decodes the value and re-emits it as a flat string array.

Action script (canonical text):

function addDeployments(ctx, api) {
  const md = ctx.v1.user.getMetadata();
  const entry = md.metadata.find(m => m.key === "deployments");
  if (!entry) return;
  try {
    const deployments = JSON.parse(
      Buffer.from(entry.value, "base64").toString("utf-8")
    );
    if (Array.isArray(deployments)) {
      api.v1.claims.setClaim("deployments", deployments);
    }
  } catch (_) { /* malformed → no deployments */ }
}

Score shape.

pub struct ZitadelDeploymentsClaimActionScore {
    pub project_id: String,
    pub action_name: String,        // default "fleet-deployments-claim"
}

Interpret upserts the Action via the Zitadel Management API and attaches it to the Complement Token flow on the PreAccessTokenCreation trigger. Idempotent.

Acceptance. Manual decode of a staging token shows the deployments claim.


PR-5 — FleetDeviceDeploymentMembershipScore + operator wiring

Crate: harmony (src/modules/fleet/ or wherever fleet Scores live) for the Score; fleet/harmony-fleet-operator for wiring.
Depends on: PR-3 (Bao JWT role + accessor must exist), PR-4 (claim must surface)

One Score, two writes, declared in execution order. Per ADR-023 the operator hands off ordering to the Score rather than sequencing the external API calls by hand.

pub struct FleetDeviceDeploymentMembershipScore {
    pub zitadel_project_id: String,
    pub device_user_id: String,            // Zitadel machine user ID
    pub deployments: Vec<String>,          // declarative full set
    pub openbao_instance: OpenbaoInstance,
    pub kv_mount: String,                  // "harmony-fleet"
    pub jwt_auth_accessor: String,         // resolved from PR-3
}

Interpret runs, in order:

  1. Zitadel metadata — declarative replace on user.metadata.deployments. Read current value, write only on diff. Removal is "declare the new set without the removed entry."
  2. OpenBao external group + policy, per deployment in the set. For each <dep> in deployments:
    • Policy fleet-deployment-<dep> granting read on harmony-fleet/data/<dep>/* and read, list on harmony-fleet/metadata/<dep>/*.
    • External group <dep> (type=external) with that policy attached, plus a group alias named <dep> on the JWT-auth mount accessor, so Bao can match each deployments claim value to its group.

Idempotent throughout.
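The policy text for step 2 can be generated by a pure function, which is also what the unit test on policy text generation would exercise. This is a sketch: policy_name and policy_hcl are hypothetical helper names, and the data/ and metadata/ path split assumes the mount is KV v2, as the paths in this plan imply.

```rust
/// Name of the per-deployment policy, as given in this plan.
pub fn policy_name(dep: &str) -> String {
    format!("fleet-deployment-{dep}")
}

/// Render the policy body for one deployment on a KV v2 mount:
/// read on the data path, read + list on the metadata path.
pub fn policy_hcl(kv_mount: &str, dep: &str) -> String {
    format!(
        "path \"{kv_mount}/data/{dep}/*\" {{\n  capabilities = [\"read\"]\n}}\n\n\
         path \"{kv_mount}/metadata/{dep}/*\" {{\n  capabilities = [\"read\", \"list\"]\n}}\n"
    )
}
```

Because the output depends only on kv_mount and dep, a second apply produces byte-identical text, which makes "write only on diff" easy to check.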

Operator wiring. In the existing per-Deployment reconciler, compose this Score before the existing NATS desired-state publish. If the Score errors, surface as a CR status condition and do not publish — never tell a device to run a deployment it cannot authenticate for.

Note: NATS publish stays in operator code. It's data-plane (the "now do this" half of enrollment), already wired, and changing its home isn't part of this work.
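The "do not publish on Score failure" rule can be sketched as a small gate. Everything here is hypothetical scaffolding: Outcome, reconcile_deployment, and the closure parameters stand in for the operator's real reconciler, Score execution, and NATS client.

```rust
/// Result of one reconcile pass (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Published,
    /// Carries the message to surface as a CR status condition.
    Blocked(String),
}

/// Run the membership Score first; publish desired state only if it succeeded.
pub fn reconcile_deployment<S, P>(apply_score: S, publish: P) -> Outcome
where
    S: FnOnce() -> Result<(), String>,
    P: FnOnce(),
{
    match apply_score() {
        Ok(()) => {
            publish();
            Outcome::Published
        }
        // Never tell a device to run a deployment it cannot authenticate for.
        Err(e) => Outcome::Blocked(e),
    }
}
```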

Tests.

  • Unit on the Score: Zitadel diff logic (no-op on match, full write on diff); policy text generation; group alias shape.
  • Integration: against the real Bao in examples/openbao + a fake Zitadel; second apply is a no-op.
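The Zitadel diff logic named in the unit-test bullet might reduce to a set comparison like the one below. It assumes that ordering and duplicates in the stored metadata are not significant; MetadataWrite and plan_metadata_write are hypothetical names.

```rust
use std::collections::BTreeSet;

/// What the Score should do to `user.metadata.deployments`.
#[derive(Debug, PartialEq, Eq)]
pub enum MetadataWrite {
    NoOp,
    /// Full replacement value, sorted and de-duplicated.
    Replace(Vec<String>),
}

/// Compare current and desired sets: no-op on match, full write on diff.
pub fn plan_metadata_write(current: &[String], desired: &[String]) -> MetadataWrite {
    let cur: BTreeSet<&str> = current.iter().map(String::as_str).collect();
    let want: BTreeSet<&str> = desired.iter().map(String::as_str).collect();
    if cur == want {
        MetadataWrite::NoOp
    } else {
        MetadataWrite::Replace(want.into_iter().map(str::to_owned).collect())
    }
}
```

Emitting the replacement in sorted order keeps the written value stable across applies, so a second apply compares equal and stays a no-op.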

Acceptance. A Deployment CR rolled in staging causes both writes to land before the NATS publish; manual JWT-bearer login at Bao returns a token with the expected per-deployment policies attached via external-group bindings.


PR-6 — Agent: inline refresh-check in main.rs

Crate: fleet/harmony-fleet-agent
Depends on: PR-2

Around the existing JetStream KV watcher in fleet/harmony-fleet-agent/src/main.rs:110-120, gate reconciler.apply on a scope check:

async_nats::jetstream::kv::Operation::Put => {
    if let Some(dep) = deployment_from_key(&entry.key) {
        if !secret_store.cached_scope().await.contains(dep.as_str()) {
            tracing::info!(%dep, "deployment outside cached scope — refreshing");
            if let Err(e) = secret_store.refresh_auth().await {
                tracing::warn!(key = %entry.key, error = %e, "refresh failed");
                continue;
            }
        }
    }
    if let Err(e) = reconciler.apply(&entry.key, &entry.value).await {
        tracing::warn!(key = %entry.key, error = %e, "apply failed");
    }
}

secret_store is the same Arc<OpenbaoSecretStore> held by the agent's ConfigManager — share it through whatever construction path main.rs already uses.
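The expected behaviour of the gate (one refresh on first sight of a new deployment, none afterwards) can be checked with a synchronous model. FakeStore and ensure_scope are hypothetical stand-ins; the real store is async and refresh_auth returns a Result.

```rust
use std::collections::HashSet;

/// Hypothetical synchronous stand-in for `OpenbaoSecretStore`.
pub struct FakeStore {
    pub scope: HashSet<String>,
    /// Scope the next refresh will yield (what Zitadel would now return).
    pub after_refresh: HashSet<String>,
    pub refreshes: u32,
}

impl FakeStore {
    pub fn refresh_auth(&mut self) {
        self.scope = self.after_refresh.clone();
        self.refreshes += 1;
    }
}

/// Refresh only when `dep` is outside the cached scope.
/// Returns whether a refresh ran.
pub fn ensure_scope(store: &mut FakeStore, dep: &str) -> bool {
    if store.scope.contains(dep) {
        return false;
    }
    store.refresh_auth();
    true
}
```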

No new module. One consumer, one site, ~10 lines. Per CLAUDE.md "Rule of Three: introduce an abstraction at the second real instance." If/when a second pre-reconcile concern (image-pull creds, monitoring registration) arrives, extract a layer then.

No retry on "still missing after refresh." The operator's ordering guarantees Zitadel is consistent before the NATS publish. If we observe the race in practice, add a single retry then.

Tests. None new — this is wiring between two tested components. Confirm via the integration milestone below.

Acceptance. Rolling a new deployment to a staging device shows one "outside cached scope — refreshing" log followed by a clean reconcile.


Integration milestone — staging dry run

After PRs 1-6 land:

  • Apply the new OpenbaoJwtAuth config to staging Bao via OpenbaoSetupScore.
  • Apply ZitadelDeploymentsClaimActionScore to staging Zitadel.
  • Hand-provision metadata + group/policy for one test deployment + one test device. Confirm the agent reads its secret.
  • Roll a Deployment CR via the operator. Confirm FleetDeviceDeploymentMembershipScore writes Zitadel and Bao before NATS, and the agent reconciles cleanly.
  • Negative: device not in deployment X gets a hard error reading X's secret (not a silent fall-through).
  • Negative: a JWT minted before the metadata update cannot read the new deployment's secret until the agent's inline refresh runs.

Decisions deferred

  1. Vault token TTL. ADR-025 recommends 15 min. Confirm in staging; adjust OpenbaoJwtAuth::ttl if needed.
  2. Hard revocation on deployment removal. Wait for TTL today. Add a revoke companion only if a real fleet requires it.
  3. Bao down at agent startup. OpenbaoSecretStore::new with JWT-bearer must not panic; either fall through or surface a retryable error. Confirm and document in the milestone.

Required reading

  • docs/adr/025-fleet-device-secret-access.md — design.
  • docs/adr/020-1-zitadel-openbao-secure-config-store.md — human-user counterpart of this auth flow.
  • docs/adr/023-deploy-architecture.md — Score discipline for PRs 3-5.
  • harmony_secret/src/store/openbao.rs — auth ladder being extended.
  • harmony_config/src/source/store.rs — wrapper the agent uses.
  • fleet/harmony-fleet-auth/src/credentials.rs — JWT-bearer mint path being extracted in PR-1.
  • nats/callout/src/zitadel.rs — JWT validation shape the Bao role mirrors (bound_issuer + bound_audiences).
  • fleet/harmony-fleet-agent/src/main.rs — site of the PR-6 inline edit.