# Fleet × Zitadel FAQ

Technical reference for the Zitadel setup behind the fleet
auth callout. Describes what exists, why it's that way, and where
each piece lives in the code.

Code anchors:
- `examples/fleet_e2e_demo/src/lib.rs`: bring-up flow
- `harmony/src/modules/zitadel/setup.rs`: `ZitadelSetupScore`
- `harmony/src/modules/zitadel/mod.rs`: Helm install
- `nats/callout/src/handler.rs`: auth callout
- `fleet/harmony-fleet-agent/src/credentials.rs`: JWT-bearer mint

---

## What is an "application" in Zitadel?

An OIDC client config: `clientId`, allowed grant types, redirect
URIs (browser apps only), PKCE settings (browser apps only).

Apps are not containers for users or roles; those live one
level up at the org. An app is the entry point a service uses to
delegate auth to Zitadel.

The `nats` app is **API type**: JWT-bearer / client-credentials
only, no browser flow. Headless agents never see a login page.
The app's `clientId` is what tokens carry as `aud` and what the
auth callout validates against (`OIDC_AUDIENCE` env on the callout
Deployment).

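
The audience check described above can be sketched as follows. This is a minimal illustration, not the callout's actual code: the `Audience` type and `audience_matches` function are hypothetical names, and the only assumptions are that `aud` may arrive as either a single string or an array of strings (as RFC 7519 allows) and that the expected value comes from `OIDC_AUDIENCE`.

```rust
/// A JWT `aud` claim may be one string or an array of strings (RFC 7519).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Audience {
    Single(String),
    Many(Vec<String>),
}

/// Returns true when `expected` (for example the value of `OIDC_AUDIENCE`)
/// appears in the token's `aud` claim.
pub fn audience_matches(aud: &Audience, expected: &str) -> bool {
    match aud {
        Audience::Single(a) => a == expected,
        Audience::Many(list) => list.iter().any(|a| a == expected),
    }
}
```

A token whose `aud` does not contain the expected value would be rejected before any role or permission logic runs.
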
## Why are users and roles at org level instead of per-project?

Roles are defined inside a project but are essentially labels:
strings + display names with no inherent permissions. Each app
enforces them in code (the callout maps `device` → a
permission template).

Users live at org level so one identity can hold roles across
multiple projects in the same org and SSO between them. Role
grants are the join: "user X has roles \[A, B\] on project Y."

The only privilege ladder Zitadel enforces directly is at the
instance/org level (IAM-Owner, Org-Owner). Project roles say
nothing about Zitadel admin rights.

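
To make "roles are labels, the app enforces them" concrete, here is a rough sketch of how a callout might turn the `device` role into a permission template. The `Permissions` struct, the `permissions_for` function, and the `fleet.<id>...` subjects are all hypothetical; the real template lives in `nats/callout/src/permissions.rs::device_default` and may look quite different.

```rust
/// Hypothetical permission set handed back to NATS for one connection.
#[derive(Debug, Clone, PartialEq)]
pub struct Permissions {
    pub publish: Vec<String>,
    pub subscribe: Vec<String>,
}

/// Maps role labels to permissions. The role string carries no meaning of its
/// own; this function is where it acquires one. Subjects are illustrative.
pub fn permissions_for(roles: &[&str], device_id: &str) -> Option<Permissions> {
    if roles.contains(&"device") {
        Some(Permissions {
            publish: vec![format!("fleet.{}.>", device_id)],
            subscribe: vec![format!("fleet.{}.cmd.>", device_id)],
        })
    } else {
        // Unknown or missing roles get no permissions at all in this sketch.
        None
    }
}
```

The design point is that Zitadel only asserts "this identity holds role `device`"; which subjects that unlocks is decided entirely on the callout side.
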
## What is each service account for?

| User | Created by | Purpose |
| --- | --- | --- |
| `iam-admin` | Helm `FirstInstance.Org.Machine` | IAM-Owner. Its PAT (`iam-admin-pat` k8s Secret) drives the management API from `ZitadelSetupScore`. |
| `login-client` | Helm `FirstInstance.Org.LoginClient` | Internal: Zitadel's login UI pod uses it to call back into Zitadel. Don't touch. |
| `fleet-ops` | `fleet_e2e_demo` admin setup | `fleet-admin` role grant, JSON key, used by tests and admin tooling. |
| `device-vm-device-NN` | `fleet_e2e_demo::provision_device` | One per VM. JSON key copied to `/etc/fleet-agent/zitadel-key.json`. `device` role grant. |
| `ops-station`, `sensor-a`, `sensor-b`, `intruder` | `fleet_auth_callout` (separate example) | Leftovers from previous runs. Postgres survives cluster recreates. Harmless, deletable. |

The `device-` prefix on per-device usernames is intentional:
Zitadel emits the username verbatim in the access token's
`client_id` claim. The callout strips `device-` to recover the
bare device id used for NATS subject interpolation
(`DEVICE_ID_PREFIX_STRIP=device-` env var on the callout;
`nats/callout/src/zitadel.rs::extract_device_id`).

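
The prefix stripping can be illustrated with a few lines of Rust. This is a sketch under assumptions, not a copy of `extract_device_id`: the function name `strip_device_prefix` is made up here, and returning `None` for a username without the prefix (or with nothing after it) is a guess at reasonable behavior rather than the callout's confirmed handling.

```rust
/// Recovers the bare device id from a `client_id` claim by removing the
/// configured prefix (for example the value of `DEVICE_ID_PREFIX_STRIP`).
/// Returns `None` when the prefix is absent or nothing remains after it.
pub fn strip_device_prefix<'a>(client_id: &'a str, prefix: &str) -> Option<&'a str> {
    client_id
        .strip_prefix(prefix)
        .filter(|id| !id.is_empty())
}
```

With the prefix `device-`, the username `device-vm-device-01` yields `vm-device-01`, while `fleet-ops` yields `None` and so never reaches subject interpolation in this sketch.
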
## How does the agent authenticate? Are JWTs / refresh tokens cached?

On disk the agent keeps **only the JSON machine key** (RSA
private key) at `/etc/fleet-agent/zitadel-key.json`.

It does NOT store:
- access tokens (in memory only)
- refresh tokens (the JWT-bearer flow has none; RFC 7523 is
  stateless by design)

On every NATS (re)connect, `credentials.rs::zitadel_mint`:

1. Builds a JWT assertion with `exp = now + 60s`, signs it with
   the RSA key
2. POSTs it to `<zitadel>/oauth/v2/token` with grant type
   `urn:ietf:params:oauth:grant-type:jwt-bearer`
3. Receives an access token (~12h validity), caches it in memory
4. Re-mints when within 5 min of expiry
   (`TOKEN_REFRESH_LEEWAY_SECS`)

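
Steps 1 and 4 above can be sketched without any crypto or HTTP. The constants reuse the names and values mentioned in this FAQ; everything else is hypothetical. In particular, the claim layout (`iss` and `sub` set to the machine user's id, `aud` set to the Zitadel issuer URL) reflects my understanding of Zitadel's JWT-profile assertion format and should be checked against `credentials.rs`. Signing with the RSA key and the token POST are deliberately left out.

```rust
pub const ASSERTION_LIFETIME_SECS: u64 = 60;
pub const TOKEN_REFRESH_LEEWAY_SECS: u64 = 5 * 60;

/// Claims of the short-lived assertion the agent signs with its RSA key.
/// Field layout is an assumption based on Zitadel's JWT-profile format.
#[derive(Debug, Clone, PartialEq)]
pub struct AssertionClaims {
    pub iss: String,
    pub sub: String,
    pub aud: String,
    pub iat: u64,
    pub exp: u64,
}

/// Step 1: build the claims for an assertion valid for 60 seconds from `now`
/// (Unix seconds). `user_id` and `issuer_url` are placeholders.
pub fn build_assertion_claims(user_id: &str, issuer_url: &str, now: u64) -> AssertionClaims {
    AssertionClaims {
        iss: user_id.to_string(),
        sub: user_id.to_string(),
        aud: issuer_url.to_string(),
        iat: now,
        exp: now + ASSERTION_LIFETIME_SECS,
    }
}

/// Step 4: true once the cached access token is within the refresh leeway
/// of its expiry, or already past it.
pub fn needs_refresh(expires_at: u64, now: u64) -> bool {
    now.saturating_add(TOKEN_REFRESH_LEEWAY_SECS) >= expires_at
}
```

Taking `now` as a parameter rather than reading the clock inside the functions keeps both pieces easy to test against fixed timestamps.
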
## What happens to an offline agent?

| Time offline | Behavior |
| --- | --- |
| 0 – ~12 h | Cached access token still valid. Reconnects work transparently. |
| > ~12 h | Token expired. Agent enters a reconnect loop until the network returns, then mints a fresh token on the first successful connection. |

The RSA key stays valid until it is rotated server-side.

## Where are the lifetimes set?

- **Access token TTL**: Zitadel UI: Org → Settings → OIDC
  Settings → "Access Token Lifetime" (default 12 h).
- **Assertion TTL**: hardcoded 60 s in
  `credentials.rs::ASSERTION_LIFETIME_SECS`. Zitadel rejects
  assertions where `exp - iat > 60 s`; this is server-enforced,
  not a knob.
- **Machine key TTL**: set when the key is created in
  `harmony/src/modules/zitadel/setup.rs::create_machine_key`.

## Why is a JSON machine key more secure than a PAT?

Both are "if stolen, full impersonation": the same blast radius.
The difference is in leak surface:

- **PAT**: a 60-char bearer string sent on every authenticated
  request. Every log line, every env dump, every misrouted
  request is a leak opportunity.
- **JSON key**: an RSA private key. Only ever signs short-lived
  (60 s) assertions sent to one endpoint
  (`<zitadel>/oauth/v2/token`). The bearer token NATS sees is
  the access token: short-lived (12 h max), scoped, distinct
  from the long-term secret. A full network capture of the
  agent ↔ NATS traffic yields only access tokens that expire
  within 12 h.

Plus: Zitadel allows multiple keys per machine user, so rotation
is zero-downtime (mint new → push to device → delete old). PATs
rotate one-at-a-time and are disruptive.

What this does not defend against: a fully compromised device
where the attacker reads the keyfile. That requires hardware
(TPM / secure element) and is out of scope.

## The machine keys expire in year 9999. Isn't that effectively forever?

Yes. Currently set in `ZitadelSetupScore::create_machine_key` as
a known-bad default chosen for demo convenience (re-running tests
shouldn't produce expired keys mid-run). Tracked as a known issue.

## Why is the IAM-Owner PAT stored as a plain k8s Secret?

K8s Secrets are base64-encoded, **not** encrypted at rest unless
etcd encryption-at-rest is explicitly enabled with a KMS provider.
Anyone with `get secrets` in the `zitadel` namespace effectively
has Zitadel admin.

The PAT exists because `ZitadelSetupScore` calls Zitadel's
management API (create project, role, machine user, mint key),
which requires IAM-Owner privileges. A PAT is the simplest
credential that survives across applies.

This is a known production-hardening gap. Harmony has the
`harmony_secret` crate (ADR-020) with OpenBao and local-encrypted-file
backends; the Score is currently wired against a k8s Secret only.

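
To show why "base64-encoded" is not a security property, here is a small std-only decoder sketch. It is illustrative only: `base64_decode` is a hypothetical helper written for this FAQ (a real project would use a maintained crate), and the sample value in the usage note is made up, not a real PAT.

```rust
/// Decodes standard-alphabet base64, ignoring `=` padding.
/// Returns `None` if the input contains a character outside the alphabet.
pub fn base64_decode(input: &str) -> Option<Vec<u8>> {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = Vec::new();
    let mut buffer: u32 = 0; // holds bits not yet emitted as a full byte
    let mut bits: u32 = 0; // number of valid bits currently in `buffer`
    for byte in input.bytes().filter(|b| *b != b'=') {
        let value = ALPHABET.iter().position(|c| *c == byte)? as u32;
        buffer = (buffer << 6) | value;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}
```

Feeding it the made-up Secret value `cGF0LXNlY3JldA==` recovers the bytes of `pat-secret` with no key involved, which is exactly what anyone with `get secrets` can do.
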
## Why does the ConfigMap show a human admin password that doesn't work?

`ZitadelScore` regenerates a random admin password on every apply
and writes it to the rendered ConfigMap. Helm's `FirstInstance`
block only seeds Postgres on the **first** install against an
empty DB, so re-applies render a new ConfigMap password but leave
the original Postgres hash untouched. The displayed password is
stale on every apply after the first.

To recover access: use the `iam-admin-pat` to call Zitadel's
management API and reset the human admin's password directly.
Tracked as a known bug.

## Quick reference: tokens on the wire

| Token | Lives where | Lifetime | Signed by | Purpose |
| --- | --- | --- | --- | --- |
| **Assertion** | Agent memory, in-flight | 60 s | Agent (RSA key) | "I'm machine user X, give me an access token" |
| **Access token** | Agent memory + on-the-wire to NATS | ~12 h | Zitadel | "Zitadel says I'm device X with role `device`" |
| **NATS user JWT** | NATS server connection state | callout-defined (~30 s) | Auth callout (NKey) | "I have these permissions on these subjects" |

The agent only holds the RSA key on disk and the access token
in memory. The NATS user JWT is server-internal; agents don't
see it.

## Code map

| Topic | File |
| --- | --- |
| Helm install, masterkey, admin password | `harmony/src/modules/zitadel/mod.rs` |
| Project/role/machine user provisioning | `harmony/src/modules/zitadel/setup.rs` |
| Per-device machine user + key handoff | `examples/fleet_e2e_demo/src/lib.rs::provision_device` |
| JWT-bearer mint | `fleet/harmony-fleet-agent/src/credentials.rs::zitadel_mint` |
| Auth callout decision tree | `nats/callout/src/handler.rs::decide` |
| Per-device permission template | `nats/callout/src/permissions.rs::device_default` |
| End-to-end rehearsal runbook | `examples/fleet_e2e_demo/RUNBOOK.md` |
| Manual JWT-bearer mint + NATS write recipe | [`fleet-manual-token-mint.md`](./fleet-manual-token-mint.md) |