harmony/docs/guides/fleet-zitadel-faq.md

# Fleet × Zitadel FAQ

Technical reference for the Zitadel setup behind the fleet
auth callout. Describes what exists, why it's that way, and where
each piece lives in the code.

Code anchors:
- `examples/fleet_e2e_demo/src/lib.rs` — bring-up flow
- `harmony/src/modules/zitadel/setup.rs` — `ZitadelSetupScore`
- `harmony/src/modules/zitadel/mod.rs` — Helm install
- `nats/callout/src/handler.rs` — auth callout
- `fleet/harmony-fleet-agent/src/credentials.rs` — JWT-bearer mint

---

## What is an "application" in Zitadel?

An OIDC client config: `clientId`, allowed grant types, redirect
URIs (browser apps only), PKCE settings (browser apps only).

Apps are not containers for users or roles — those live one
level up at the org. An app is the entry point a service uses to
delegate auth to Zitadel.

The `nats` app is **API type**: JWT-bearer / client-credentials
only, no browser flow. Headless agents never see a login page.
The app's `clientId` is what tokens carry as `aud` and what the
auth callout validates against (`OIDC_AUDIENCE` env on the callout
Deployment).
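
As a rough illustration of the audience check described above, the sketch below tests whether an expected client id appears in a token's `aud` claim. The `Audience` type and `audience_matches` function are hypothetical names for this example and are not taken from `nats/callout`; a real callout would also verify the signature, issuer, and expiry before trusting any claim.

```rust
/// The `aud` claim of a JWT may be a single string or an array of strings
/// (RFC 7519, section 4.1.3).
#[derive(Debug, Clone, PartialEq)]
pub enum Audience {
    Single(String),
    Many(Vec<String>),
}

/// Returns true when `expected` appears in the token's `aud` claim.
/// In this sketch `expected` stands in for the value of `OIDC_AUDIENCE`.
pub fn audience_matches(aud: &Audience, expected: &str) -> bool {
    match aud {
        Audience::Single(a) => a == expected,
        Audience::Many(list) => list.iter().any(|a| a == expected),
    }
}
```

A token whose `aud` does not contain the `nats` app's `clientId` would be rejected under this check, even if it was validly issued by the same Zitadel instance for another app.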

## Why are users and roles at org level instead of per-project?

Roles are defined inside a project but are essentially labels —
strings + display names with no inherent permissions. Each app
enforces them in code (the callout maps `device` → a
permission template).

Users live at org level so one identity can hold roles across
multiple projects in the same org and SSO between them. Role
grants are the join: "user X has roles \[A, B\] on project Y."
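
One way to picture the grant-as-join model is a flat list of records keyed by user and project. This is only a sketch of the data shape: `RoleGrant` and `has_role` are hypothetical names, and the project name used in the usage note is made up for illustration.

```rust
/// One role grant: a user holds a set of role keys on one project.
/// Roles are plain strings; they carry no permissions by themselves.
#[derive(Debug, Clone)]
pub struct RoleGrant {
    pub user: String,
    pub project: String,
    pub roles: Vec<String>,
}

/// Returns true if any grant gives `user` the role `role` on `project`.
pub fn has_role(grants: &[RoleGrant], user: &str, project: &str, role: &str) -> bool {
    grants.iter().any(|g| {
        g.user == user && g.project == project && g.roles.iter().any(|r| r == role)
    })
}
```

With a grant such as `fleet-ops` holding `fleet-admin` on a project called `fleet`, `has_role(&grants, "fleet-ops", "fleet", "fleet-admin")` is true, while the same user has no roles on any other project unless a separate grant says so.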

The only privilege ladder Zitadel enforces directly is at the
instance/org level (IAM-Owner, Org-Owner). Project roles say
nothing about Zitadel admin rights.

## What is each service account for?

| User | Created by | Purpose |
| --- | --- | --- |
| `iam-admin` | Helm `FirstInstance.Org.Machine` | IAM-Owner. Its PAT (`iam-admin-pat` k8s Secret) drives the management API from `ZitadelSetupScore`. |
| `login-client` | Helm `FirstInstance.Org.LoginClient` | Internal — Zitadel's login UI pod uses it to call back into Zitadel. Don't touch. |
| `fleet-ops` | `fleet_e2e_demo` admin setup | `fleet-admin` role grant, JSON key, used by tests and admin tooling. |
| `device-vm-device-NN` | `fleet_e2e_demo::provision_device` | One per VM. JSON key copied to `/etc/fleet-agent/zitadel-key.json`. `device` role grant. |
| `ops-station`, `sensor-a`, `sensor-b`, `intruder` | `fleet_auth_callout` (separate example) | Leftovers from previous runs. Postgres survives cluster recreates. Harmless, deletable. |

The `device-` prefix on per-device usernames is intentional:
Zitadel emits the username verbatim in the access token's
`client_id` claim. The callout strips `device-` to recover the
bare device id used for NATS subject interpolation
(`DEVICE_ID_PREFIX_STRIP=device-` env var on the callout;
`nats/callout/src/zitadel.rs::extract_device_id`).
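
The prefix strip and the subject interpolation can be sketched as follows. This is not the body of `extract_device_id`; the function names, the `{device_id}` placeholder, and the choice to return `None` for usernames without the prefix are assumptions made for the example.

```rust
/// Recover the bare device id from the token's `client_id` claim.
/// Returns None when the claim lacks the prefix or nothing follows it
/// (assumed behavior for this sketch).
pub fn device_id_from_client_id<'a>(client_id: &'a str, prefix: &str) -> Option<&'a str> {
    client_id.strip_prefix(prefix).filter(|id| !id.is_empty())
}

/// Substitute the device id into a NATS subject template.
/// The `{device_id}` placeholder syntax is hypothetical.
pub fn interpolate_subject(template: &str, device_id: &str) -> String {
    template.replace("{device_id}", device_id)
}
```

Under these assumptions, `device-vm-device-01` yields `vm-device-01`, and a made-up template such as `fleet.{device_id}.status` becomes `fleet.vm-device-01.status`. A username like `fleet-ops` yields `None`, so it would never be treated as a device.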

## How does the agent authenticate? Are JWTs / refresh tokens cached?

On disk the agent keeps **only the JSON machine key** (RSA
private key) at `/etc/fleet-agent/zitadel-key.json`.

It does NOT store:
- access tokens (in memory only)
- refresh tokens (the JWT-bearer flow has none — RFC 7523 is
  stateless by design)

On every NATS (re)connect, `credentials.rs::zitadel_mint`:

1. Builds a JWT assertion with `exp = now + 60s`, signs it with
   the RSA key
2. POSTs it to `<zitadel>/oauth/v2/token` with grant type
   `urn:ietf:params:oauth:grant-type:jwt-bearer`
3. Receives an access token (~12h validity), caches it in memory
4. Re-mints when the cached token is within 5 min of expiry
   (`TOKEN_REFRESH_LEEWAY_SECS`)
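
Steps 3 and 4 amount to a small in-memory cache with a refresh leeway. The sketch below assumes the cached token is reused on reconnect while it is outside the leeway window. `CachedToken` and `token_for_connect` are hypothetical names, the `mint` closure stands in for steps 1 and 2 (signing and the HTTP POST, which need a JWT and an HTTP library), and only the 5 min value of `TOKEN_REFRESH_LEEWAY_SECS` comes from the text above.

```rust
/// Seconds before expiry at which the agent re-mints (5 min).
pub const TOKEN_REFRESH_LEEWAY_SECS: u64 = 300;

/// Access token held in memory only, with its expiry as Unix seconds.
pub struct CachedToken {
    pub access_token: String,
    pub expires_at_unix: u64,
}

impl CachedToken {
    /// True when the token is expired or within the leeway window.
    pub fn needs_refresh(&self, now_unix: u64) -> bool {
        now_unix.saturating_add(TOKEN_REFRESH_LEEWAY_SECS) >= self.expires_at_unix
    }
}

/// Return a usable token for a (re)connect, calling `mint` only when the
/// cache is empty or the cached token is too close to expiry.
pub fn token_for_connect<F>(
    cache: &mut Option<CachedToken>,
    now_unix: u64,
    mint: F,
) -> &CachedToken
where
    F: FnOnce() -> CachedToken,
{
    let stale = cache.as_ref().map_or(true, |t| t.needs_refresh(now_unix));
    if stale {
        *cache = Some(mint());
    }
    cache.as_ref().expect("cache was filled above")
}
```

In a real agent the mint step can fail, so the closure would more plausibly return a `Result` and the caller would retry; that is left out here to keep the cache logic visible.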

## What happens to an offline agent?

| Time offline | Behavior |
| --- | --- |
| 0 – ~12 h | Cached access token still valid. Reconnects work transparently. |
| > ~12 h | Token expired. The agent stays in its reconnect loop until the network returns, then mints a fresh token as soon as it can reach Zitadel again. |

The RSA key does not expire in practice (see the year-9999 question
below); it stays valid until it is rotated server-side.

## Where are the lifetimes set?

- **Access token TTL** — Zitadel UI: Org → Settings → OIDC
  Settings → "Access Token Lifetime" (default 12 h).
- **Assertion TTL** — hardcoded 60 s in
  `credentials.rs::ASSERTION_LIFETIME_SECS`. Zitadel rejects
  assertions where `exp - iat > 60 s`; this is server-enforced,
  not a knob.
- **Machine key TTL** — set when the key is created in
  `harmony/src/modules/zitadel/setup.rs::create_machine_key`.
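
The 60 s rule on assertions can be expressed as a simple window check on the `iat` and `exp` claims. The sketch below mirrors the constraint as stated above; it is not Zitadel's implementation, and `check_assertion_window` and `AssertionError` are hypothetical names.

```rust
/// Maximum allowed `exp - iat` for an assertion, in seconds.
pub const MAX_ASSERTION_LIFETIME_SECS: u64 = 60;

#[derive(Debug, PartialEq)]
pub enum AssertionError {
    /// `exp` lies before `iat`.
    ExpBeforeIat,
    /// `exp - iat` exceeds the allowed lifetime.
    LifetimeTooLong { secs: u64 },
}

/// Accept an assertion only if `exp - iat` is at most 60 s.
pub fn check_assertion_window(iat: u64, exp: u64) -> Result<(), AssertionError> {
    let lifetime = exp.checked_sub(iat).ok_or(AssertionError::ExpBeforeIat)?;
    if lifetime > MAX_ASSERTION_LIFETIME_SECS {
        return Err(AssertionError::LifetimeTooLong { secs: lifetime });
    }
    Ok(())
}
```

An agent that sets `exp = iat + 60` passes this check, while raising the client-side constant to anything larger would make every mint fail.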

## Why is a JSON machine key more secure than a PAT?

Both are "if stolen, full impersonation" — the same blast radius.
The difference is in leak surface:

- **PAT**: a 60-char bearer string sent on every authenticated
  request. Every log line, every env dump, every misrouted
  request is a leak opportunity.
- **JSON key**: an RSA private key. Only ever signs short-lived
  (60 s) assertions sent to one endpoint
  (`<zitadel>/oauth/v2/token`). The bearer token NATS sees is
  the access token — short-lived (12 h max), scoped, distinct
  from the long-term secret. A full network capture of the
  agent ↔ NATS traffic yields only access tokens that expire
  within 12 h.

Plus: Zitadel allows multiple keys per machine user, so rotation
is zero-downtime (mint new → push to device → delete old). PATs
rotate one-at-a-time and are disruptive.
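
The zero-downtime claim rests on the order of the three rotation steps. The toy model below tracks which key ids the server accepts and which key the device holds, and records whether the device could authenticate after each step. `MachineUserKeys` and its methods are hypothetical and make no calls to Zitadel.

```rust
use std::collections::BTreeSet;

/// Toy state for one machine user during key rotation.
pub struct MachineUserKeys {
    /// Key ids the server currently accepts for this machine user.
    pub server_keys: BTreeSet<String>,
    /// Key id of the keyfile currently on the device.
    pub device_key: String,
}

impl MachineUserKeys {
    /// Start with a single key known to both sides.
    pub fn new(initial_key: &str) -> Self {
        let mut server_keys = BTreeSet::new();
        server_keys.insert(initial_key.to_string());
        Self { server_keys, device_key: initial_key.to_string() }
    }

    pub fn device_can_authenticate(&self) -> bool {
        self.server_keys.contains(&self.device_key)
    }

    /// Rotate to `new_key` and report whether the device could
    /// authenticate after each of the three steps.
    pub fn rotate(&mut self, new_key: &str) -> [bool; 3] {
        let old_key = self.device_key.clone();
        self.server_keys.insert(new_key.to_string()); // 1. mint new key
        let after_mint = self.device_can_authenticate();
        self.device_key = new_key.to_string(); // 2. push to device
        let after_push = self.device_can_authenticate();
        self.server_keys.remove(&old_key); // 3. delete old key
        let after_delete = self.device_can_authenticate();
        [after_mint, after_push, after_delete]
    }
}
```

Because the old key is deleted only after the device holds the new one, the device key is in the accepted set at every step. Swapping steps 2 and 3 would open a window in which the device holds a key the server no longer accepts.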

What this does not defend against: a fully compromised device
where the attacker reads the keyfile. Defending against that
requires hardware-backed key storage (TPM / secure element) and is
out of scope.

## The machine keys expire in year 9999. Isn't that effectively forever?

Yes. The expiry is currently set in
`ZitadelSetupScore::create_machine_key` and is a known-bad default,
chosen for demo convenience (re-running tests shouldn't produce keys
that expire mid-run). Tracked as a known issue.

## Why is the IAM-Owner PAT stored as a plain k8s Secret?

K8s Secrets are base64-encoded, **not** encrypted. They are stored
in plaintext in etcd unless encryption at rest is explicitly
configured on the API server (for example with a KMS provider).
Anyone with `get secrets` in the `zitadel` namespace effectively
has Zitadel admin.
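
To make the "base64 is not encryption" point concrete, the sketch below decodes a base64 string using only the standard library and no key material. The function name is hypothetical and the sample value in the usage note is made up; it is not a real PAT.

```rust
/// Decode standard-alphabet base64 (RFC 4648), stopping at `=` padding.
/// Returns None on any character outside the alphabet.
pub fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buffer: u32 = 0; // accumulates 6-bit groups
    let mut bits: u32 = 0; // number of valid bits in `buffer`
    for byte in input.bytes() {
        let value = match byte {
            b'A'..=b'Z' => byte - b'A',
            b'a'..=b'z' => byte - b'a' + 26,
            b'0'..=b'9' => byte - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None,
        };
        buffer = (buffer << 6) | u32::from(value);
        bits += 6;
        if bits >= 8 {
            // A full byte is available: emit it and keep the remainder.
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1 << bits) - 1;
        }
    }
    Some(out)
}
```

For example, `base64_decode("aGVsbG8=")` yields the bytes of `hello`. Anyone who can read the Secret's `data` field can do the same with the stored PAT, which is why read access to the Secret is equivalent to holding the credential.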

The PAT exists because `ZitadelSetupScore` calls Zitadel's
management API (create project, role, machine user, mint key),
which requires IAM-Owner privileges. A PAT is the simplest
credential that survives across applies.

This is a known production-hardening gap. Harmony has the
`harmony_secret` crate (ADR-020) with OpenBao and local-encrypted-file
backends; the Score is currently wired against a k8s Secret only.

## Why does the ConfigMap show a human admin password that doesn't work?

`ZitadelScore` regenerates a random admin password on every apply
and writes it to the rendered ConfigMap. Helm's `FirstInstance`
block only seeds Postgres on the **first** install against an
empty DB, so re-applies render a new ConfigMap password but leave
the original Postgres hash untouched. The displayed password is
stale on every apply after the first.
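
The mismatch can be reproduced with a toy model of the two places the password ends up. `Install` and its fields are hypothetical; the real database stores a hash, which is simplified to the plain string here.

```rust
/// Toy model of where the admin password lives after each apply.
#[derive(Debug, Default)]
pub struct Install {
    /// Password shown in the rendered ConfigMap.
    pub configmap_password: Option<String>,
    /// Password the database actually accepts (a hash in reality).
    pub db_password: Option<String>,
}

impl Install {
    /// One apply: the ConfigMap is always re-rendered, but the database
    /// is seeded only while it is still empty.
    pub fn apply(&mut self, generated_password: &str) {
        self.configmap_password = Some(generated_password.to_string());
        if self.db_password.is_none() {
            self.db_password = Some(generated_password.to_string());
        }
    }

    /// True when the password shown in the ConfigMap would log in.
    pub fn displayed_password_works(&self) -> bool {
        self.configmap_password.is_some() && self.configmap_password == self.db_password
    }
}
```

After the first `apply` both values match. After the second, the ConfigMap shows the new password while the database still accepts only the first one, which is the stale state described above.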

To recover access: use the `iam-admin-pat` to call Zitadel's
management API and reset the human admin's password directly.
Tracked as a known bug.

## Quick reference — tokens on the wire

| Token | Lives where | Lifetime | Signed by | Purpose |
| --- | --- | --- | --- | --- |
| **Assertion** | Agent memory, in-flight | 60 s | Agent (RSA key) | "I'm machine user X — give me an access token" |
| **Access token** | Agent memory + on-the-wire to NATS | ~12 h | Zitadel | "Zitadel says I'm device X with role `device`" |
| **NATS user JWT** | NATS server connection state | callout-defined (~30 s) | Auth callout (NKey) | "I have these permissions on these subjects" |

The agent only holds the RSA key on disk and the access token
in memory. The NATS user JWT is server-internal — agents don't
see it.

## Code map

| Topic | File |
| --- | --- |
| Helm install, masterkey, admin password | `harmony/src/modules/zitadel/mod.rs` |
| Project/role/machine user provisioning | `harmony/src/modules/zitadel/setup.rs` |
| Per-device machine user + key handoff | `examples/fleet_e2e_demo/src/lib.rs::provision_device` |
| JWT-bearer mint | `fleet/harmony-fleet-agent/src/credentials.rs::zitadel_mint` |
| Auth callout decision tree | `nats/callout/src/handler.rs::decide` |
| Per-device permission template | `nats/callout/src/permissions.rs::device_default` |
| End-to-end rehearsal runbook | `examples/fleet_e2e_demo/RUNBOOK.md` |
| Manual JWT-bearer mint + NATS write recipe | [`fleet-manual-token-mint.md`](./fleet-manual-token-mint.md) |