Fleet × Zitadel FAQ
Technical reference for the Zitadel setup behind the fleet auth callout. Describes what exists, why it's that way, and where each piece lives in the code.
Code anchors:
- examples/fleet_e2e_demo/src/lib.rs: bring-up flow
- harmony/src/modules/zitadel/setup.rs: ZitadelSetupScore
- harmony/src/modules/zitadel/mod.rs: Helm install
- nats/callout/src/handler.rs: auth callout
- fleet/harmony-fleet-agent/src/credentials.rs: JWT-bearer mint
What is an "application" in Zitadel?
An OIDC client config: clientId, allowed grant types, redirect
URIs (browser apps only), PKCE settings (browser apps only).
Apps are not containers for users or roles — those live one level up at the org. An app is the entry point a service uses to delegate auth to Zitadel.
The nats app is API type: JWT-bearer / client-credentials
only, no browser flow. Headless agents never see a login page.
The app's clientId is what tokens carry as aud and what the
auth callout validates against (OIDC_AUDIENCE env on the callout
Deployment).
Why are users and roles at org level instead of per-project?
Roles are defined inside a project but are essentially labels —
strings + display names with no inherent permissions. Each app
enforces them in code (the callout maps device → a
permission template).
Users live at org level so one identity can hold roles across multiple projects in the same org and SSO between them. Role grants are the join: "user X has roles [A, B] on project Y."
The only privilege ladder Zitadel enforces directly is at the instance/org level (IAM-Owner, Org-Owner). Project roles say nothing about Zitadel admin rights.
What is each service account for?
| User | Created by | Purpose |
|---|---|---|
| iam-admin | Helm FirstInstance.Org.Machine | IAM-Owner. Its PAT (iam-admin-pat k8s Secret) drives the management API from ZitadelSetupScore. |
| login-client | Helm FirstInstance.Org.LoginClient | Internal: Zitadel's login UI pod uses it to call back into Zitadel. Don't touch. |
| fleet-ops | fleet_e2e_demo admin setup | fleet-admin role grant, JSON key, used by tests and admin tooling. |
| device-vm-device-NN | fleet_e2e_demo::provision_device | One per VM. JSON key copied to /etc/fleet-agent/zitadel-key.json. device role grant. |
| ops-station, sensor-a, sensor-b, intruder | fleet_auth_callout (separate example) | Leftovers from previous runs. Postgres survives cluster recreates. Harmless, deletable. |
The device- prefix on per-device usernames is intentional:
Zitadel emits the username verbatim in the access token's
client_id claim. The callout strips device- to recover the
bare device id used for NATS subject interpolation
(DEVICE_ID_PREFIX_STRIP=device- env var on the callout;
nats/callout/src/zitadel.rs::extract_device_id).
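The prefix handling described above can be sketched as follows. This is a Python illustration, not the Rust code in nats/callout/src/zitadel.rs. The pass-through behavior for usernames without the prefix and the subject template `fleet.{device_id}.status` are assumptions made for the example; the real permission templates live in the callout.

```python
DEVICE_ID_PREFIX = "device-"  # mirrors the DEVICE_ID_PREFIX_STRIP env var


def extract_device_id(client_id: str, prefix: str = DEVICE_ID_PREFIX) -> str:
    """Return the bare device id from a token's client_id claim.

    Assumption: a client_id without the prefix is returned unchanged.
    """
    if prefix and client_id.startswith(prefix):
        return client_id[len(prefix):]
    return client_id


def interpolate_subject(template: str, device_id: str) -> str:
    """Fill a hypothetical '{device_id}' placeholder in a NATS subject template."""
    return template.replace("{device_id}", device_id)


device_id = extract_device_id("device-vm-device-01")
subject = interpolate_subject("fleet.{device_id}.status", device_id)
```

Here `device_id` is `vm-device-01`, and a non-device username such as `fleet-ops` passes through untouched.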
How does the agent authenticate? Are JWTs / refresh tokens cached?
On disk the agent keeps only the JSON machine key (RSA
private key) at /etc/fleet-agent/zitadel-key.json.
It does NOT store:
- access tokens (in memory only)
- refresh tokens (the JWT-bearer flow has none — RFC 7523 is stateless by design)
On every NATS (re)connect, credentials.rs::zitadel_mint:
- Builds a JWT assertion with exp = now + 60s and signs it with the RSA key
- POSTs it to <zitadel>/oauth/v2/token with grant type urn:ietf:params:oauth:grant-type:jwt-bearer
- Receives an access token (~12h validity) and caches it in memory
- Re-mints when within 5min of expiry (TOKEN_REFRESH_LEEWAY_SECS)
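For readers reproducing the mint by hand, here is a rough Python sketch of the first two steps: the assertion claims and the form body sent to the token endpoint. The claim layout (iss and sub set to the machine user's id, aud set to the Zitadel issuer URL) follows Zitadel's documented JWT-profile format and was not read from credentials.rs. The user id, issuer URL, and the `openid` scope are placeholders. Signing is left out because it needs a third-party library such as PyJWT with RS256 support.

```python
from urllib.parse import urlencode

ASSERTION_LIFETIME_SECS = 60
JWT_BEARER_GRANT = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def build_assertion_claims(user_id: str, issuer_url: str, now: int) -> dict:
    """Claims for the short-lived assertion (signed with the RSA key afterwards)."""
    return {
        "iss": user_id,      # assumption: userId field from the JSON key file
        "sub": user_id,
        "aud": issuer_url,   # assumption: the Zitadel issuer URL
        "iat": now,
        "exp": now + ASSERTION_LIFETIME_SECS,
    }


def build_token_request_body(signed_assertion: str, scope: str = "openid") -> str:
    """Form-encoded body for POST <zitadel>/oauth/v2/token."""
    return urlencode({
        "grant_type": JWT_BEARER_GRANT,
        "assertion": signed_assertion,
        "scope": scope,      # placeholder; the real scopes depend on the setup
    })


claims = build_assertion_claims("1234567890", "https://zitadel.example.com", now=1_700_000_000)
body = build_token_request_body("signed-assertion-placeholder")
```

The body is sent with Content-Type application/x-www-form-urlencoded. The response carries the access token and its expires_in value, which is what the in-memory cache keys its re-mint decision on.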
What happens to an offline agent?
| Time offline | Behavior |
|---|---|
| 0 – ~12 h | Cached access token still valid. Reconnects work transparently. |
| > ~12 h | Token expired. Agent stays in its reconnect loop until the network returns, then mints a fresh token as soon as it can reach Zitadel again. |
The RSA key effectively never expires; it stays valid until it is rotated server-side.
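The offline behavior can be modeled as a small decision function. This is a simplified Python sketch, not the agent's code: the action names and the `zitadel_reachable` flag are invented for illustration, and reusing a still-valid token inside the leeway window when Zitadel is unreachable is an assumption.

```python
from dataclasses import dataclass

ACCESS_TOKEN_TTL_SECS = 12 * 60 * 60   # default access token lifetime per this FAQ
TOKEN_REFRESH_LEEWAY_SECS = 5 * 60     # re-mint window before expiry


@dataclass
class CachedToken:
    value: str
    expires_at: int  # unix seconds

    def is_valid(self, now: int) -> bool:
        return now < self.expires_at

    def needs_remint(self, now: int) -> bool:
        return now >= self.expires_at - TOKEN_REFRESH_LEEWAY_SECS


def reconnect_action(token: CachedToken, now: int, zitadel_reachable: bool) -> str:
    """Decide what to do on a NATS (re)connect attempt (hypothetical action names)."""
    if not token.needs_remint(now):
        return "reuse-cached-token"
    if zitadel_reachable:
        return "mint-fresh-token"
    if token.is_valid(now):
        return "reuse-cached-token"   # assumption: inside leeway but not yet expired
    return "retry-later"


token = CachedToken("example", expires_at=ACCESS_TOKEN_TTL_SECS)  # minted at t = 0
after_one_hour = reconnect_action(token, now=3600, zitadel_reachable=False)
after_13h_offline = reconnect_action(token, now=13 * 3600, zitadel_reachable=False)
after_13h_online = reconnect_action(token, now=13 * 3600, zitadel_reachable=True)
```

The three calls correspond to the table rows: one hour offline reuses the cached token, thirteen hours offline keeps retrying, and the first attempt with Zitadel reachable mints a fresh token.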
Where are the lifetimes set?
- Access token TTL: set in the Zitadel UI under Org → Settings → OIDC Settings → "Access Token Lifetime" (default 12 h).
- Assertion TTL: hardcoded 60 s in credentials.rs::ASSERTION_LIFETIME_SECS. Zitadel rejects assertions where exp - iat > 60 s; this is server-enforced, not a knob.
- Machine key TTL: set when the key is created in harmony/src/modules/zitadel/setup.rs::create_machine_key.
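To check which lifetime a given token actually carries, the payload can be decoded by hand. The Python sketch below uses only the standard library and assumes the token is in JWT format (three dot-separated base64url segments). It does not verify the signature, so it is for inspection only. The sample token is fabricated for the example.

```python
import base64
import json


def decode_claims_unverified(token: str) -> dict:
    """Decode a JWT's payload segment WITHOUT checking the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def lifetime_secs(claims: dict) -> int:
    """Total lifetime encoded in the token: exp - iat."""
    return claims["exp"] - claims["iat"]


def _b64url(obj: dict) -> str:
    """Helper to build a fake, unsigned token for the example."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


sample = ".".join([_b64url({"alg": "none"}), _b64url({"iat": 1000, "exp": 1060}), ""])
sample_lifetime = lifetime_secs(decode_claims_unverified(sample))
```

For an assertion built by the agent the result should be 60. For an access token it should match the configured "Access Token Lifetime".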
Why is a JSON machine key more secure than a PAT?
Both are "if stolen, full impersonation" — the same blast radius. The difference is in leak surface:
- PAT: a 60-char bearer string sent on every authenticated request. Every log line, every env dump, every misrouted request is a leak opportunity.
- JSON key: an RSA private key. Only ever signs short-lived (60 s) assertions sent to one endpoint (<zitadel>/oauth/v2/token). The bearer token NATS sees is the access token: short-lived (12 h max), scoped, distinct from the long-term secret. A full network capture of the agent ↔ NATS traffic yields only access tokens that expire within 12 h.
Plus: Zitadel allows multiple keys per machine user, so rotation is zero-downtime (mint new → push to device → delete old). PATs rotate one-at-a-time and are disruptive.
What this does not defend against: a fully compromised device where the attacker reads the keyfile. That requires hardware (TPM / secure element) and is out of scope.
The machine keys expire in year 9999. Isn't that effectively forever?
Yes. Currently set in ZitadelSetupScore::create_machine_key as
a known-bad default chosen for demo convenience (re-running tests
shouldn't produce expired keys mid-run). Tracked as a known issue.
Why is the IAM-Owner PAT stored as a plain k8s Secret?
K8s Secrets are base64-encoded, not encrypted at rest unless
etcd encryption-at-rest is explicitly enabled with a KMS provider.
Anyone with get secrets in the zitadel namespace effectively
has Zitadel admin.
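To make the point concrete: recovering the value from a Secret's data field is a single base64 decode. The Python sketch below uses a made-up key name (`pat`) and a made-up value; the real key name inside iam-admin-pat is not stated here.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode every value in a Secret's 'data' map (base64 text to plain text)."""
    return {key: base64.b64decode(value).decode() for key, value in data.items()}


# What the API would return for a Secret holding one value (fabricated example).
secret_data = {"pat": base64.b64encode(b"example-pat-value").decode()}
recovered = decode_secret_data(secret_data)
```

No key material is involved at any point, which is why read access to the Secret is equivalent to holding the PAT.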
The PAT exists because ZitadelSetupScore calls Zitadel's
management API (create project, role, machine user, mint key),
which requires IAM-Owner privileges. A PAT is the simplest
credential that survives across applies.
This is a known production-hardening gap. Harmony has the
harmony_secret crate (ADR-020) with OpenBao and local-encrypted-file
backends; the Score is currently wired against a k8s Secret only.
What lifetime is set for the human admin password — why does the ConfigMap show one that doesn't work?
ZitadelScore regenerates a random admin password on every apply
and writes it to the rendered ConfigMap. Helm's FirstInstance
block only seeds Postgres on the first install against an
empty DB, so re-applies render a new ConfigMap password but leave
the original Postgres hash untouched. The displayed password is
stale on every apply after the first.
To recover access: use the iam-admin-pat to call Zitadel's
management API and reset the human admin's password directly.
Tracked as a known bug.
Quick reference — tokens on the wire
| Token | Lives where | Lifetime | Signed by | Purpose |
|---|---|---|---|---|
| Assertion | Agent memory, in-flight | 60 s | Agent (RSA key) | "I'm machine user X — give me an access token" |
| Access token | Agent memory + on-the-wire to NATS | ~12 h | Zitadel | "Zitadel says I'm device X with role device" |
| NATS user JWT | NATS server connection state | callout-defined (~30 s) | Auth callout (NKey) | "I have these permissions on these subjects" |
The agent only holds the RSA key on disk and the access token in memory. The NATS user JWT is server-internal — agents don't see it.
Code map
| Topic | File |
|---|---|
| Helm install, masterkey, admin password | harmony/src/modules/zitadel/mod.rs |
| Project/role/machine user provisioning | harmony/src/modules/zitadel/setup.rs |
| Per-device machine user + key handoff | examples/fleet_e2e_demo/src/lib.rs::provision_device |
| JWT-bearer mint | fleet/harmony-fleet-agent/src/credentials.rs::zitadel_mint |
| Auth callout decision tree | nats/callout/src/handler.rs::decide |
| Per-device permission template | nats/callout/src/permissions.rs::device_default |
| End-to-end rehearsal runbook | examples/fleet_e2e_demo/RUNBOOK.md |
| Manual JWT-bearer mint + NATS write recipe | fleet-manual-token-mint.md |