harmony/docs/guides/fleet-zitadel-faq.md
Jean-Gabriel Gill-Couture 612d934ad4 docs(fleet): manual JWT-bearer mint + NATS write recipe
Working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.

Cross-linked from fleet-zitadel-faq.md.
2026-05-05 01:43:36 -04:00

# Fleet × Zitadel FAQ
Technical reference for the Zitadel setup behind the fleet
auth callout. Describes what exists, why it's that way, and where
each piece lives in the code.
Code anchors:
- `examples/fleet_e2e_demo/src/lib.rs`: bring-up flow
- `harmony/src/modules/zitadel/setup.rs`: `ZitadelSetupScore`
- `harmony/src/modules/zitadel/mod.rs`: Helm install
- `nats/callout/src/handler.rs`: auth callout
- `fleet/harmony-fleet-agent/src/credentials.rs`: JWT-bearer mint
---
## What is an "application" in Zitadel?
An OIDC client config: `clientId`, allowed grant types, redirect
URIs (browser apps only), PKCE settings (browser apps only).
Apps are not containers for users or roles — those live one
level up at the org. An app is the entry point a service uses to
delegate auth to Zitadel.
The `nats` app is **API type**: JWT-bearer / client-credentials
only, no browser flow. Headless agents never see a login page.
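The `aud` claim the auth callout validates (`OIDC_AUDIENCE` env on the
callout Deployment) is the Zitadel project id, not the app's
`clientId`. When in doubt, read `OIDC_AUDIENCE` off the live callout
Deployment rather than trusting a cached value.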
## Why are users and roles at org level instead of per-project?
Roles are defined inside a project but are essentially labels —
strings + display names with no inherent permissions. Each app
enforces them in code (the callout maps `device` → a
permission template).
Users live at org level so one identity can hold roles across
multiple projects in the same org and SSO between them. Role
grants are the join: "user X has roles \[A, B\] on project Y."
The only privilege ladder Zitadel enforces directly is at the
instance/org level (IAM-Owner, Org-Owner). Project roles say
nothing about Zitadel admin rights.
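
Because roles are only labels, the enforcement step amounts to a lookup in application code. The following is a minimal sketch of that idea, not the callout's actual logic. It assumes the decoded access token exposes project roles under a `urn:zitadel:iam:org:project:roles` claim keyed by role name; the `admin_default` template name and the sample claim values are hypothetical.

```python
from typing import Optional

# Assumption: claim name under which project roles appear in the token.
ROLES_CLAIM = "urn:zitadel:iam:org:project:roles"


def roles_from_claims(claims: dict) -> set:
    """Return the role keys granted on the project, or an empty set."""
    return set(claims.get(ROLES_CLAIM) or {})


def template_for(claims: dict, templates: dict) -> Optional[str]:
    """Pick the template for the first known role (sorted); None means deny."""
    for role in sorted(roles_from_claims(claims)):
        if role in templates:
            return templates[role]
    return None


# Illustrative values only; org id and domain are placeholders.
claims = {ROLES_CLAIM: {"device": {"org-id-placeholder": "example.org"}}}
templates = {"device": "device_default", "fleet-admin": "admin_default"}
print(template_for(claims, templates))
```

A token with no recognized role yields `None`, which a caller would treat as a denial.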
## What is each service account for?
| User | Created by | Purpose |
| --- | --- | --- |
| `iam-admin` | Helm `FirstInstance.Org.Machine` | IAM-Owner. Its PAT (`iam-admin-pat` k8s Secret) drives the management API from `ZitadelSetupScore`. |
| `login-client` | Helm `FirstInstance.Org.LoginClient` | Internal — Zitadel's login UI pod uses it to call back into Zitadel. Don't touch. |
| `fleet-ops` | `fleet_e2e_demo` admin setup | `fleet-admin` role grant, JSON key, used by tests and admin tooling. |
| `device-vm-device-NN` | `fleet_e2e_demo::provision_device` | One per VM. JSON key copied to `/etc/fleet-agent/zitadel-key.json`. `device` role grant. |
| `ops-station`, `sensor-a`, `sensor-b`, `intruder` | `fleet_auth_callout` (separate example) | Leftovers from previous runs. Postgres survives cluster recreates. Harmless, deletable. |

The `device-` prefix on per-device usernames is intentional:
Zitadel emits the username verbatim in the access token's
`client_id` claim. The callout strips `device-` to recover the
bare device id used for NATS subject interpolation
(`DEVICE_ID_PREFIX_STRIP=device-` env var on the callout;
`nats/callout/src/zitadel.rs::extract_device_id`).
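
The prefix handling described above can be sketched as follows. This is an illustration in Python rather than the Rust in `extract_device_id`; the pass-through behavior for usernames without the prefix and the `fleet.{device_id}.>` subject pattern are assumptions made for the example.

```python
# Mirrors the DEVICE_ID_PREFIX_STRIP env var named above.
DEVICE_ID_PREFIX_STRIP = "device-"


def extract_device_id(client_id: str, prefix: str = DEVICE_ID_PREFIX_STRIP) -> str:
    """Strip the prefix from the front only; other names pass through (assumption)."""
    if prefix and client_id.startswith(prefix):
        return client_id[len(prefix):]
    return client_id


def interpolate(subject_template: str, device_id: str) -> str:
    """Substitute the bare device id into a subject template (hypothetical syntax)."""
    return subject_template.replace("{device_id}", device_id)


device_id = extract_device_id("device-vm-device-01")
print(interpolate("fleet.{device_id}.>", device_id))
```

Only the leading `device-` is removed, so an id such as `vm-device-01` keeps its inner `device-` intact.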
## How does the agent authenticate? Are JWTs / refresh tokens cached?
On disk the agent keeps **only the JSON machine key** (RSA
private key) at `/etc/fleet-agent/zitadel-key.json`.
It does NOT store:
- access tokens (in memory only)
- refresh tokens (the JWT-bearer flow has none — RFC 7523 is
stateless by design)
On every NATS (re)connect, `credentials.rs::zitadel_mint`:
1. Builds a JWT assertion with `exp = now + 60s`, signs it with
the RSA key
2. POSTs it to `<zitadel>/oauth/v2/token` with grant type
`urn:ietf:params:oauth:grant-type:jwt-bearer`
3. Receives an access token (~12 h validity) and caches it in memory
4. Re-mints when within 5 min of expiry
(`TOKEN_REFRESH_LEEWAY_SECS`)
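
Steps 1 and 2 can be sketched with the standard library alone. This is a hedged illustration, not the agent's code: the claim layout (`iss` and `sub` set to the machine user id, `aud` set to the Zitadel issuer URL) and the `openid` scope are assumptions. Signing is left out; it would be an RS256 signature with the key's id as the `kid` header, for example via PyJWT's `jwt.encode(claims, private_key_pem, algorithm="RS256", headers={"kid": key_id})`.

```python
from urllib.parse import urlencode

ASSERTION_LIFETIME_SECS = 60
JWT_BEARER_GRANT = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def assertion_claims(user_id: str, issuer_url: str, now: int) -> dict:
    """Claims for the short-lived assertion (claim layout is an assumption)."""
    return {
        "iss": user_id,
        "sub": user_id,
        "aud": issuer_url,
        "iat": now,
        "exp": now + ASSERTION_LIFETIME_SECS,
    }


def token_request_body(signed_assertion: str, scopes: list) -> str:
    """Form-encoded body for the POST to <zitadel>/oauth/v2/token."""
    return urlencode({
        "grant_type": JWT_BEARER_GRANT,
        "scope": " ".join(scopes),
        "assertion": signed_assertion,
    })


# Placeholder ids and URL; "signed.jwt.here" stands in for the signed assertion.
claims = assertion_claims("user-id-placeholder", "https://zitadel.example", 1_700_000_000)
body = token_request_body("signed.jwt.here", ["openid"])
print(claims["exp"] - claims["iat"], body)
```

The body is sent with `Content-Type: application/x-www-form-urlencoded`.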
## What happens to an offline agent?
| Time offline | Behavior |
| --- | --- |
| 0 to ~12 h | Cached access token still valid. Reconnects work transparently. |
| > ~12 h | Token expired. Agent enters reconnect loop until network returns, then mints fresh on first successful reach. |
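
The RSA machine key stays valid until it is rotated or deleted server-side; its expiry is currently set to year 9999, so in practice it does not expire on its own.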
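
The table reduces to a three-way check on the cached token's expiry. The sketch below is illustrative: the function name and the returned labels are hypothetical, and only `TOKEN_REFRESH_LEEWAY_SECS` comes from the agent code.

```python
TOKEN_REFRESH_LEEWAY_SECS = 5 * 60


def token_state(expires_at: float, now: float,
                leeway: float = TOKEN_REFRESH_LEEWAY_SECS) -> str:
    """Classify a cached access token as 'valid', 'refresh' or 'expired'."""
    if now >= expires_at:
        return "expired"   # must mint before the next connect can succeed
    if expires_at - now <= leeway:
        return "refresh"   # still usable, but re-mint now
    return "valid"


twelve_hours = 12 * 60 * 60
print(token_state(expires_at=twelve_hours, now=60 * 60))
```

An agent that was offline for longer than the token lifetime lands in the `expired` branch and can only recover once Zitadel is reachable again.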
## Where are the lifetimes set?
- **Access token TTL** — Zitadel UI: Org → Settings → OIDC
Settings → "Access Token Lifetime" (default 12 h).
- **Assertion TTL** — hardcoded 60 s in
`credentials.rs::ASSERTION_LIFETIME_SECS`. Zitadel rejects
assertions where `exp - iat > 60 s`; this is server-enforced,
not a knob.
- **Machine key TTL** — set when the key is created in
`harmony/src/modules/zitadel/setup.rs::create_machine_key`.
## Why is a JSON machine key more secure than a PAT?
Both are "if stolen, full impersonation" — the same blast radius.
The difference is in leak surface:
- **PAT**: a 60-char bearer string sent on every authenticated
request. Every log line, every env dump, every misrouted
request is a leak opportunity.
- **JSON key**: an RSA private key. Only ever signs short-lived
(60 s) assertions sent to one endpoint
(`<zitadel>/oauth/v2/token`). The bearer token NATS sees is
the access token — short-lived (12 h max), scoped, distinct
from the long-term secret. A full network capture of the
agent ↔ NATS traffic yields only access tokens that expire
within 12 h.
Plus: Zitadel allows multiple keys per machine user, so rotation
is zero-downtime (mint new → push to device → delete old). PATs
rotate one-at-a-time and are disruptive.
What this does not defend against: a fully compromised device
where the attacker reads the keyfile. That requires hardware
(TPM / secure element) and is out of scope.
## The machine keys expire in year 9999. Isn't that effectively forever?
Yes. Currently set in `ZitadelSetupScore::create_machine_key` as
a known-bad default chosen for demo convenience (re-running tests
shouldn't produce expired keys mid-run). Tracked as a known issue.
## Why is the IAM-Owner PAT stored as a plain k8s Secret?
K8s Secrets are base64-encoded, **not** encrypted at rest unless
etcd encryption-at-rest is explicitly enabled with a KMS provider.
Anyone with `get secrets` in the `zitadel` namespace effectively
has Zitadel admin.
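
To make the point concrete: a Secret value as returned by the API is only base64 text, so recovering the plaintext needs no key. The sketch below uses a made-up value, not a real PAT.

```python
import base64


def decode_secret_value(b64_value: str) -> str:
    """Reverse the base64 encoding used for Secret data; no key is involved."""
    return base64.b64decode(b64_value).decode("utf-8")


# Stand-in for the `data` field of a Secret; the value is invented.
encoded = base64.b64encode(b"not-a-real-pat").decode("ascii")
print(decode_secret_value(encoded))
```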
The PAT exists because `ZitadelSetupScore` calls Zitadel's
management API (create project, role, machine user, mint key),
which requires IAM-Owner privileges. A PAT is the simplest
credential that survives across applies.
This is a known production-hardening gap. Harmony has the
`harmony_secret` crate (ADR-020) with OpenBao and local-encrypted-file
backends; the Score is currently wired against a k8s Secret only.
## Why does the ConfigMap show a human admin password that doesn't work?
`ZitadelScore` regenerates a random admin password on every apply
and writes it to the rendered ConfigMap. Helm's `FirstInstance`
block only seeds Postgres on the **first** install against an
empty DB, so re-applies render a new ConfigMap password but leave
the original Postgres hash untouched. The displayed password is
stale on every apply after the first.
To recover access: use the `iam-admin-pat` to call Zitadel's
management API and reset the human admin's password directly.
Tracked as a known bug.
## Quick reference — tokens on the wire
| Token | Lives where | Lifetime | Signed by | Purpose |
| --- | --- | --- | --- | --- |
| **Assertion** | Agent memory, in-flight | 60 s | Agent (RSA key) | "I'm machine user X — give me an access token" |
| **Access token** | Agent memory + on-the-wire to NATS | ~12 h | Zitadel | "Zitadel says I'm device X with role `device`" |
| **NATS user JWT** | NATS server connection state | callout-defined (~30 s) | Auth callout (NKey) | "I have these permissions on these subjects" |

The agent only holds the RSA key on disk and the access token
in memory. The NATS user JWT is server-internal — agents don't
see it.
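
When debugging, it helps to look at the claims a token actually carries (`aud`, `client_id`, `exp`). The sketch below decodes the payload of a JWT without verifying its signature, so it is for inspection only and never for authorization. It assumes the access token is a JWT rather than an opaque string; the sample token is fabricated for the example.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs omit."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def unverified_claims(token: str) -> dict:
    """Return the payload claims WITHOUT checking the signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def _b64url(obj: dict) -> str:
    """Helper used only to build the fabricated sample token below."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


sample = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"aud": ["project-id-placeholder"],
             "client_id": "device-vm-device-01",
             "exp": 1_700_000_000}),
    "sig",
])
print(unverified_claims(sample))
```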
## Code map
| Topic | File |
| --- | --- |
| Helm install, masterkey, admin password | `harmony/src/modules/zitadel/mod.rs` |
| Project/role/machine user provisioning | `harmony/src/modules/zitadel/setup.rs` |
| Per-device machine user + key handoff | `examples/fleet_e2e_demo/src/lib.rs::provision_device` |
| JWT-bearer mint | `fleet/harmony-fleet-agent/src/credentials.rs::zitadel_mint` |
| Auth callout decision tree | `nats/callout/src/handler.rs::decide` |
| Per-device permission template | `nats/callout/src/permissions.rs::device_default` |
| End-to-end rehearsal runbook | `examples/fleet_e2e_demo/RUNBOOK.md` |
| Manual JWT-bearer mint + NATS write recipe | [`fleet-manual-token-mint.md`](./fleet-manual-token-mint.md) |