harmony/docs/guides/fleet-zitadel-faq.md
Jean-Gabriel Gill-Couture 612d934ad4 docs(fleet): manual JWT-bearer mint + NATS write recipe
Working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.

Cross-linked from fleet-zitadel-faq.md.
2026-05-05 01:43:36 -04:00


Fleet × Zitadel FAQ

Technical reference for the Zitadel setup behind the fleet auth callout. Describes what exists, why it's that way, and where each piece lives in the code.

Code anchors:

  • examples/fleet_e2e_demo/src/lib.rs: bring-up flow
  • harmony/src/modules/zitadel/setup.rs: ZitadelSetupScore
  • harmony/src/modules/zitadel/mod.rs: Helm install
  • nats/callout/src/handler.rs: auth callout
  • fleet/harmony-fleet-agent/src/credentials.rs: JWT-bearer mint

What is an "application" in Zitadel?

An OIDC client config: clientId, allowed grant types, redirect URIs (browser apps only), PKCE settings (browser apps only).

Apps are not containers for users or roles — those live one level up at the org. An app is the entry point a service uses to delegate auth to Zitadel.

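The nats app is API type: JWT-bearer / client-credentials only, no browser flow. Headless agents never see a login page. The auth callout validates each token's aud claim against OIDC_AUDIENCE (env on the callout Deployment). Note that the manual mint recipe (fleet-manual-token-mint.md) found the expected audience to be the project id, not the API app's clientId, so read OIDC_AUDIENCE off the live callout rather than assuming the clientId.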
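
When the callout rejects a token, one quick check is to decode the token's payload and compare its aud claim with the OIDC_AUDIENCE value configured on the callout. The sketch below assumes the access token is a JWT (as the claims discussed in this FAQ imply). It decodes without verifying the signature, so it is for inspection only, and the helper names and the demo token are hypothetical.

```python
import base64
import json


def decode_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (inspection only)."""
    payload_b64 = token.split(".")[1]
    # base64url segments omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def audience_matches(claims: dict, expected: str) -> bool:
    """True if `expected` appears in aud, which may be a string or a list."""
    aud = claims.get("aud", [])
    if isinstance(aud, str):
        aud = [aud]
    return expected in aud


def _b64url(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical unsigned token, built only to exercise the helpers above.
demo_token = ".".join(
    [_b64url({"alg": "none"}), _b64url({"aud": ["example-audience"]}), ""]
)
demo_claims = decode_claims(demo_token)
```

In practice the `expected` argument would be whatever OIDC_AUDIENCE is set to on the running callout Deployment.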

Why are users and roles at org level instead of per-project?

Roles are defined inside a project but are essentially labels — strings + display names with no inherent permissions. Each app enforces them in code (the callout maps device → a permission template).

Users live at org level so one identity can hold roles across multiple projects in the same org and SSO between them. Role grants are the join: "user X has roles [A, B] on project Y."

The only privilege ladder Zitadel enforces directly is at the instance/org level (IAM-Owner, Org-Owner). Project roles say nothing about Zitadel admin rights.

What is each service account for?

| User | Created by | Purpose |
| --- | --- | --- |
| iam-admin | Helm FirstInstance.Org.Machine | IAM-Owner. Its PAT (iam-admin-pat k8s Secret) drives the management API from ZitadelSetupScore. |
| login-client | Helm FirstInstance.Org.LoginClient | Internal: Zitadel's login UI pod uses it to call back into Zitadel. Don't touch. |
| fleet-ops | fleet_e2e_demo admin setup | fleet-admin role grant, JSON key, used by tests and admin tooling. |
| device-vm-device-NN | fleet_e2e_demo::provision_device | One per VM. JSON key copied to /etc/fleet-agent/zitadel-key.json. device role grant. |
| ops-station, sensor-a, sensor-b, intruder | fleet_auth_callout (separate example) | Leftovers from previous runs. Postgres survives cluster recreates. Harmless, deletable. |

The device- prefix on per-device usernames is intentional: Zitadel emits the username verbatim in the access token's client_id claim. The callout strips device- to recover the bare device id used for NATS subject interpolation (DEVICE_ID_PREFIX_STRIP=device- env var on the callout; nats/callout/src/zitadel.rs::extract_device_id).
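
The prefix handling can be illustrated with a short sketch. This is not the Rust code in nats/callout/src/zitadel.rs, only a Python approximation of the behavior described above. Returning None when the prefix is missing, and the "{device_id}" placeholder syntax with its example subject, are assumptions made for illustration.

```python
from typing import Optional

# Mirrors the value of the callout's DEVICE_ID_PREFIX_STRIP env var.
DEVICE_ID_PREFIX_STRIP = "device-"


def extract_device_id(
    client_id: str, prefix: str = DEVICE_ID_PREFIX_STRIP
) -> Optional[str]:
    """Return the bare device id, or None when the prefix is absent.

    Treating a non-matching client_id as None is an assumption of this sketch.
    """
    if not client_id.startswith(prefix) or len(client_id) == len(prefix):
        return None
    # Slice off only the leading prefix; later "device-" substrings are kept.
    return client_id[len(prefix):]


def interpolate_subject(template: str, device_id: str) -> str:
    """Fill a hypothetical "{device_id}" placeholder in a subject template."""
    return template.replace("{device_id}", device_id)


device_id = extract_device_id("device-vm-device-07")
subject = interpolate_subject("fleet.{device_id}.status", device_id)
```

Slicing off the leading prefix rather than replacing every occurrence matters here, because a username such as device-vm-device-07 contains "device-" twice and only the first one should be removed.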

How does the agent authenticate? Are JWTs / refresh tokens cached?

On disk the agent keeps only the JSON machine key (RSA private key) at /etc/fleet-agent/zitadel-key.json.

It does NOT store:

  • access tokens (in memory only)
  • refresh tokens (the JWT-bearer flow has none — RFC 7523 is stateless by design)

On every NATS (re)connect, credentials.rs::zitadel_mint:

  1. Builds a JWT assertion with exp = now + 60s, signs it with the RSA key
  2. POSTs it to <zitadel>/oauth/v2/token with grant type urn:ietf:params:oauth:grant-type:jwt-bearer
  3. Receives an access token (~12h validity), caches it in memory
  4. Re-mints when within 5 min of expiry (TOKEN_REFRESH_LEEWAY_SECS)
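
Steps 1 and 2 can be sketched in Python as follows. This is an illustration, not the agent's Rust implementation. The key-file field names (keyId, userId), the choice of claims, and the default scope are assumptions based on the commonly documented shape of Zitadel service-account keys, and RSA signing is left out so that the sketch needs only the standard library.

```python
import urllib.parse
from typing import Tuple

ASSERTION_LIFETIME_SECS = 60
JWT_BEARER_GRANT = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def build_assertion(key: dict, issuer: str, now: int) -> Tuple[dict, dict]:
    """Return the (header, claims) pair that would be signed with the RSA key.

    `key` stands for the parsed JSON machine key; its field names are assumed.
    """
    header = {"alg": "RS256", "kid": key["keyId"]}
    claims = {
        "iss": key["userId"],
        "sub": key["userId"],
        "aud": issuer,  # the Zitadel base URL
        "iat": now,
        "exp": now + ASSERTION_LIFETIME_SECS,
    }
    return header, claims


def build_token_request(
    issuer: str, signed_assertion: str, scope: str = "openid"
) -> Tuple[str, bytes]:
    """Return the token endpoint URL and the form-encoded POST body."""
    url = issuer.rstrip("/") + "/oauth/v2/token"
    body = urllib.parse.urlencode(
        {
            "grant_type": JWT_BEARER_GRANT,
            "assertion": signed_assertion,
            "scope": scope,  # placeholder; the scopes actually required may differ
        }
    ).encode()
    return url, body


# Hypothetical inputs, used only to show the shapes involved.
header, claims = build_assertion(
    {"keyId": "k1", "userId": "u1"}, "https://zitadel.example", now=1_000
)
url, body = build_token_request("https://zitadel.example/", "header.payload.sig")
```

The signing step itself would use an RS256-capable library. With PyJWT, for example, that would look roughly like `jwt.encode(claims, private_key_pem, algorithm="RS256", headers={"kid": key_id})`, where the PEM string comes from the key file.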

What happens to an offline agent?

| Time offline | Behavior |
| --- | --- |
| 0 to ~12 h | Cached access token still valid. Reconnects work transparently. |
| > ~12 h | Token expired. Agent enters reconnect loop until network returns, then mints fresh on first successful reach. |

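The RSA key does not expire in practice: it stays valid until it is rotated or deleted server-side, so an agent can always mint a fresh token once it can reach Zitadel again.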
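
The offline behavior follows from a simple expiry check made on each (re)connect. The sketch below is a hypothetical Python rendering of that decision, not the agent's code. Treating the leeway boundary as inclusive is an assumption.

```python
from typing import Optional

TOKEN_REFRESH_LEEWAY_SECS = 5 * 60
ACCESS_TOKEN_TTL_SECS = 12 * 60 * 60  # the ~12 h default described above


def needs_mint(
    expires_at: Optional[int],
    now: int,
    leeway: int = TOKEN_REFRESH_LEEWAY_SECS,
) -> bool:
    """Decide whether a new access token must be minted before connecting.

    `expires_at` is the cached token's expiry as a Unix timestamp,
    or None when no token is cached.
    """
    if expires_at is None:
        return True
    return now >= expires_at - leeway


# A token minted at t=0 with the default lifetime.
expires_at = ACCESS_TOKEN_TTL_SECS
after_one_hour = needs_mint(expires_at, now=3_600)
inside_leeway = needs_mint(expires_at, now=expires_at - 299)
after_thirteen_hours = needs_mint(expires_at, now=13 * 3_600)
```

An agent that reconnects after one hour reuses the cached token, while one that comes back after thirteen hours has to reach Zitadel and mint before NATS will accept it.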

Where are the lifetimes set?

  • Access token TTL — Zitadel UI: Org → Settings → OIDC Settings → "Access Token Lifetime" (default 12 h).
  • Assertion TTL — hardcoded 60 s in credentials.rs::ASSERTION_LIFETIME_SECS. Zitadel rejects assertions where exp - iat > 60 s; this is server-enforced, not a knob.
  • Machine key TTL — set when the key is created in harmony/src/modules/zitadel/setup.rs::create_machine_key.

Why is a JSON machine key more secure than a PAT?

Both are "if stolen, full impersonation" — the same blast radius. The difference is in leak surface:

  • PAT: a 60-char bearer string sent on every authenticated request. Every log line, every env dump, every misrouted request is a leak opportunity.
  • JSON key: an RSA private key. Only ever signs short-lived (60 s) assertions sent to one endpoint (<zitadel>/oauth/v2/token). The bearer token NATS sees is the access token — short-lived (12 h max), scoped, distinct from the long-term secret. A full network capture of the agent ↔ NATS traffic yields only access tokens that expire within 12 h.

Plus: Zitadel allows multiple keys per machine user, so rotation is zero-downtime (mint new → push to device → delete old). PATs rotate one-at-a-time and are disruptive.

What this does not defend against: a fully compromised device where the attacker reads the keyfile. That requires hardware (TPM / secure element) and is out of scope.

The machine keys expire in year 9999. Isn't that effectively forever?

Yes. The expiry is currently set in ZitadelSetupScore::create_machine_key to a known-bad default, chosen for demo convenience (re-running tests shouldn't produce expired keys mid-run). Tracked as a known issue.

Why is the IAM-Owner PAT stored as a plain k8s Secret?

K8s Secrets are base64-encoded, not encrypted. They are stored unencrypted in etcd unless encryption at rest is explicitly enabled for the cluster (for example with a KMS provider). Anyone with get secrets in the zitadel namespace effectively has Zitadel admin.
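
To make the "encoded, not encrypted" point concrete, the following sketch decodes a value shaped like an entry in a Secret's data map. The PAT value is a made-up placeholder.

```python
import base64

# Hypothetical value, shaped like a `data` entry from `kubectl get secret -o json`.
encoded = base64.b64encode(b"example-pat-value").decode()

# Decoding needs no key at all, only read access to the Secret object.
decoded = base64.b64decode(encoded).decode()
```

Nothing secret is required to reverse the encoding, which is why read access to the Secret is equivalent to holding the PAT.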

The PAT exists because ZitadelSetupScore calls Zitadel's management API (create project, role, machine user, mint key), which requires IAM-Owner privileges. A PAT is the simplest credential that survives across applies.

This is a known production-hardening gap. Harmony has the harmony_secret crate (ADR-020) with OpenBao and local-encrypted-file backends; the Score is currently wired against a k8s Secret only.

Why does the ConfigMap show a human admin password that doesn't work?

ZitadelScore regenerates a random admin password on every apply and writes it to the rendered ConfigMap. Helm's FirstInstance block only seeds Postgres on the first install against an empty DB, so re-applies render a new ConfigMap password but leave the original Postgres hash untouched. The displayed password is stale on every apply after the first.
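
The mismatch can be reproduced with a toy model. The class below is purely illustrative: its name and fields are hypothetical, and it only mimics the "seed on first install, regenerate on every apply" behavior described above.

```python
import secrets


class ToyZitadelInstall:
    """Toy model: ConfigMap password vs. the password seeded into the DB."""

    def __init__(self) -> None:
        self.configmap_password = None  # what the rendered ConfigMap shows
        self.db_password = None         # stands in for the hash in Postgres

    def apply(self) -> None:
        # Every apply renders a fresh random password into the ConfigMap.
        self.configmap_password = secrets.token_urlsafe(16)
        # The DB is seeded only on the first install against an empty DB.
        if self.db_password is None:
            self.db_password = self.configmap_password

    def displayed_password_works(self) -> bool:
        return self.configmap_password == self.db_password


install = ToyZitadelInstall()
install.apply()
works_after_first_apply = install.displayed_password_works()
install.apply()
works_after_second_apply = install.displayed_password_works()
```

After the first apply the displayed password matches the seeded one, and from the second apply onward it no longer does.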

To recover access: use the iam-admin-pat to call Zitadel's management API and reset the human admin's password directly. Tracked as a known bug.

Quick reference — tokens on the wire

| Token | Lives where | Lifetime | Signed by | Purpose |
| --- | --- | --- | --- | --- |
| Assertion | Agent memory, in-flight | 60 s | Agent (RSA key) | "I'm machine user X, give me an access token" |
| Access token | Agent memory + on-the-wire to NATS | ~12 h | Zitadel | "Zitadel says I'm device X with role device" |
| NATS user JWT | NATS server connection state | callout-defined (~30 s) | Auth callout (NKey) | "I have these permissions on these subjects" |

The agent only holds the RSA key on disk and the access token in memory. The NATS user JWT is server-internal — agents don't see it.

Code map

| Topic | File |
| --- | --- |
| Helm install, masterkey, admin password | harmony/src/modules/zitadel/mod.rs |
| Project/role/machine user provisioning | harmony/src/modules/zitadel/setup.rs |
| Per-device machine user + key handoff | examples/fleet_e2e_demo/src/lib.rs::provision_device |
| JWT-bearer mint | fleet/harmony-fleet-agent/src/credentials.rs::zitadel_mint |
| Auth callout decision tree | nats/callout/src/handler.rs::decide |
| Per-device permission template | nats/callout/src/permissions.rs::device_default |
| End-to-end rehearsal runbook | examples/fleet_e2e_demo/RUNBOOK.md |
| Manual JWT-bearer mint + NATS write recipe | fleet-manual-token-mint.md |