harmony/docs/guides/fleet-manual-token-mint.md
Jean-Gabriel Gill-Couture 612d934ad4 docs(fleet): manual JWT-bearer mint + NATS write recipe
Working PyJWT script + nats CLI commands for talking to a
callout-protected NATS by hand. Distills what we learned debugging
the auth chain: which scope claims matter, why the audience is the
project id (not the API app's clientId), how to read OIDC_AUDIENCE
off the live callout instead of trusting the cache, and the failure
modes — including the PyJWT vs jwt package collision that costs
30 minutes the first time you hit it.

Cross-linked from fleet-zitadel-faq.md.
2026-05-05 01:43:36 -04:00

# Manual Zitadel token mint + NATS write
Operator-side recipe for talking to a callout-protected NATS by
hand: sign a JWT-bearer assertion with a Zitadel machine user's
private key, exchange it for an access token, drive `nats` CLI
commands with the token. Useful for debugging the auth chain,
poking the desired-state KV without the operator running, and
validating that a deployed callout is actually accepting what
you think it should.

Read [fleet-zitadel-faq.md](./fleet-zitadel-faq.md) first for the
underlying mechanism (RFC 7523 JWT-bearer flow, why we sign
locally, what each claim means).
## Inputs you need
Five strings:

| Input | Where to find it |
| --- | --- |
| `OIDC_ISSUER_URL` (the Zitadel base URL) | callout Deployment env: `kubectl exec -n fleet-system deploy/fleet-callout -- printenv OIDC_ISSUER_URL` |
| `project_id` (becomes the access token's `aud`) | callout Deployment env: `OIDC_AUDIENCE` |
| Machine user's `userId` | the JSON keyfile's `userId` field |
| Machine user's `keyId` | the JSON keyfile's `keyId` field |
| Private RSA key (PEM) | the JSON keyfile's `key` field |

Get the `fleet-ops` (admin role) JSON keyfile from the cache:
```bash
jq -r '.machine_keys["fleet-ops"]' \
  ~/.local/share/harmony/zitadel/client-config.json \
  > /tmp/fleet-ops.json
jq -r '.userId' /tmp/fleet-ops.json # → user_id
jq -r '.keyId' /tmp/fleet-ops.json # → key_id
jq -r '.key' /tmp/fleet-ops.json > /tmp/fleet-ops.pem
```
The cache may drift from the deployed Zitadel state if Zitadel has
been re-seeded; **always pull `OIDC_AUDIENCE` from the running
callout**, not from the cache. The cache fix landed in commit
`f4d6fb94` but older entries can still trip you up.
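The same three fields can also be read in Python instead of with `jq`, which is convenient when the mint script should consume the keyfile directly. This is a minimal sketch, assuming the keyfile is the flat JSON object described in the table above; `load_machine_key` is a hypothetical helper name, not something harmony ships.

```python
import json
from pathlib import Path


def load_machine_key(keyfile_path):
    """Return (user_id, key_id, pem) from a Zitadel machine-user JSON keyfile.

    Assumes the keyfile has top-level `userId`, `keyId` and `key` fields,
    as listed in the inputs table.
    """
    doc = json.loads(Path(keyfile_path).read_text())
    missing = [name for name in ("userId", "keyId", "key") if not doc.get(name)]
    if missing:
        raise ValueError(f"keyfile is missing fields: {', '.join(missing)}")
    return doc["userId"], doc["keyId"], doc["key"]
```

With this in place, `USER_ID, KEY_ID, key = load_machine_key("/tmp/fleet-ops.json")` could stand in for the hand-copied values in the mint script below.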
## Mint script (PyJWT)
```python
# pip install "PyJWT[crypto]" requests   ← MUST be PyJWT, not the `jwt` package.
# The [crypto] extra pulls in `cryptography`, which PyJWT needs for RS256.
# The two share `import jwt`; `jwt` (the package) refuses raw PEM
# strings and demands an AbstractJWKBase wrapper. PyJWT takes PEM
# directly. If you ever see `TypeError: key must be an instance of
# a class implements jwt.AbstractJWKBase`, you have the wrong one.
import jwt, time, requests
# These come from the running callout + Zitadel. Don't reuse stale
# values from a checked-in note; verify against the live cluster.
OIDC_ISSUER_URL = "http://sso.fleet.local:8080"
PROJECT_ID = "371158654839160853" # = OIDC_AUDIENCE on callout
USER_ID = "..." # from machine keyfile
KEY_ID = "..." # from machine keyfile
key = open("/tmp/fleet-ops.pem").read()
now = int(time.time())
assertion = jwt.encode(
    {
        "iss": USER_ID,
        "sub": USER_ID,
        "aud": OIDC_ISSUER_URL,  # for Zitadel itself, NOT the project_id
        "exp": now + 60,         # Zitadel rejects exp - iat > 60s
        "iat": now,
    },
    key,
    algorithm="RS256",
    headers={"kid": KEY_ID},  # PyJWT spelling: `headers=`, not `optional_headers=`
)
r = requests.post(
    f"{OIDC_ISSUER_URL}/oauth/v2/token",
    data={
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "assertion": assertion,
        # Three scopes:
        #   openid: base OIDC
        #   urn:zitadel:iam:org:projects:roles (PLURAL).
        #     Without this, Zitadel omits the role claim and the
        #     callout rejects with "no authorized role in token".
        #   urn:zitadel:iam:org:project:id:<id>:aud (singular).
        #     Tells Zitadel to put <id> into the access token's
        #     `aud` claim, which the callout's audience check
        #     compares against OIDC_AUDIENCE.
        "scope": (
            "openid "
            "urn:zitadel:iam:org:projects:roles "
            f"urn:zitadel:iam:org:project:id:{PROJECT_ID}:aud"
        ),
    },
)
r.raise_for_status()
token = r.json()["access_token"]
# Sanity check — decode without verifying signature so you can see
# what Zitadel actually emitted. If anything below is wrong, the
# callout will reject your token.
print(jwt.decode(token, options={"verify_signature": False}))
print(token)
```
Expected decoded claims (the parts the callout will check):

| Claim | What it should be | Why |
| --- | --- | --- |
| `iss` | `OIDC_ISSUER_URL` (byte-equal) | Callout: `validation.set_issuer(&[&self.issuer_url])` |
| `aud` | `["<PROJECT_ID>"]` | Callout: `validation.set_audience(&[&self.audience])`; the array form is Zitadel's default |
| `exp` | ~now + 12h | Zitadel default access token TTL |
| `client_id` | the machine user's username (`fleet-ops`, `device-vm-device-00`, …) | Callout uses this as `device_id_claim` (with optional `DEVICE_ID_PREFIX_STRIP` applied) |
| `urn:zitadel:iam:org:project:<PROJECT_ID>:roles` | object with role names as keys (e.g. `{"fleet-admin": {"<orgId>": "<orgName>"}}`) | Callout uses this as `roles_claim` and admits the role if `fleet-admin` or `device` is present |

If any of these is wrong, fix the script before bothering with NATS.
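The table above can be turned into a quick local pre-flight check on the decoded claims. The sketch below only approximates the callout's checks in Python; it is not the callout's code, and `check_claims` and its default role list are illustrative names taken from the table. Treat a clean result as a hint, not a guarantee that the callout will accept the token.

```python
def check_claims(claims, issuer_url, project_id,
                 accepted_roles=("fleet-admin", "device")):
    """Return a list of problems found in decoded claims; empty means OK.

    Mirrors the table: byte-equal issuer, project id present in `aud`,
    and at least one accepted role in the project roles claim.
    """
    problems = []

    # Issuer must match exactly, including any trailing slash.
    if claims.get("iss") != issuer_url:
        problems.append(f"iss {claims.get('iss')!r} != {issuer_url!r}")

    # `aud` may be a single string or a list of strings.
    aud = claims.get("aud")
    aud_list = [aud] if isinstance(aud, str) else list(aud or [])
    if project_id not in aud_list:
        problems.append(f"aud {aud_list!r} does not contain {project_id!r}")

    # Role names are the keys of the project roles object.
    roles = claims.get(f"urn:zitadel:iam:org:project:{project_id}:roles") or {}
    if not any(role in roles for role in accepted_roles):
        problems.append(f"no accepted role in {sorted(roles)!r}")

    return problems
```

In the mint script this could be called on the output of `jwt.decode(token, options={"verify_signature": False})`, printing each returned problem before the token is handed to `nats`.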
## Drive NATS with the token
`nats --token=<bearer>` puts the value into the CONNECT frame's
`auth_token`, which is what the callout expects.
```bash
NATS_SERVER=192.168.122.1:30422 # libvirt host's port mapping
TOKEN=$(python3 mint.py | tail -1) # last line is the raw token
# Read everything (admin role allows >):
nats --server "$NATS_SERVER" --token "$TOKEN" kv ls device-info
nats --server "$NATS_SERVER" --token "$TOKEN" kv get device-info info.vm-device-00
# Write a desired state — agent's KV watcher fires within 1s,
# reconciler creates the podman container.
nats --server "$NATS_SERVER" --token "$TOKEN" \
  kv put desired-state vm-device-00.hello-web '{
    "name": "hello-web",
    "type": "PodmanV0",
    "data": {
      "services": [{
        "name": "testnginx",
        "image": "docker.io/nginx:latest",
        "ports": ["8080:80"]
      }]
    }
  }'
```
The exact JSON shape comes from
`harmony-reconciler-contracts/src/fleet.rs` — read that crate when
in doubt about field names, NOT this doc; this doc is a worked
example and may drift.
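To make the `--token` behaviour described at the top of this section concrete, here is a rough sketch of the CONNECT frame a NATS client sends, with the bearer token in `auth_token`. It follows the NATS client protocol's `CONNECT {json}\r\n` shape, but the option set is trimmed for illustration and the `lang` and `version` values are placeholders; the real `nats` CLI sends more fields. `connect_frame` is a hypothetical helper.

```python
import json


def connect_frame(token):
    """Build an illustrative NATS CONNECT frame carrying a bearer token."""
    options = {
        "verbose": False,
        "pedantic": False,
        "auth_token": token,      # what `nats --token` fills in
        "lang": "python",         # placeholder client metadata
        "version": "0.0.0",       # placeholder client metadata
        "protocol": 1,
    }
    return b"CONNECT " + json.dumps(options).encode("utf-8") + b"\r\n"
```

The callout never sees a username or password in this flow; the whole access token travels as one opaque string in that single field.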
## Common failures and what they mean

| Symptom | Likely cause |
| `TypeError: key must be an instance of … AbstractJWKBase` | Wrong PyPI package. `pip uninstall jwt && pip install PyJWT`. |
| HTTP 400 from `/oauth/v2/token`: `"invalid_grant_type"` | Body was not sent form-encoded (`application/x-www-form-urlencoded`; with `requests`, use `data=`, not `json=`), or the `grant_type` value is mistyped. The full URN is `urn:ietf:params:oauth:grant-type:jwt-bearer`. |
| HTTP 400: `"jwt: token is expired"` | Your assertion's `exp` is in the past, usually because of wall-clock skew between your laptop and the cluster. Sync NTP. |
| Token mints but no `urn:zitadel:…:roles` claim | Missing the **plural** `urn:zitadel:iam:org:projects:roles` in scope. |
| Token mints but `aud` is the issuer URL instead of the project id | Forgot the `urn:zitadel:iam:org:project:id:<id>:aud` scope. |
| NATS CLI: `nats: Authorization Violation` | Token is good but callout rejected it — check `kubectl logs -n fleet-system -l app=fleet-callout` for the actual reason. The most common ones are "InvalidAudience" (your `aud` ≠ deployed `OIDC_AUDIENCE`) and "no authorized role in token". |
| Callout log: `JWT validation failed: InvalidIssuer` | Trailing slash drift. `OIDC_ISSUER_URL=http://sso.fleet.local:8080/` ≠ `http://sso.fleet.local:8080`. Match exactly. |

When the callout rejects, **its log is the source of truth**, not
your decoded claims. The validation error includes which check
failed; work backwards from there.
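When the PyJWT vs `jwt` collision gets in the way of even inspecting a token, the payload can still be read with the standard library alone. A minimal sketch, assuming the access token is a compact JWT (three dot-separated base64url segments); `peek_claims` is a hypothetical helper and performs no signature verification at all.

```python
import base64
import json


def peek_claims(token):
    """Decode a JWT payload WITHOUT verifying it. For debugging only."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT: expected three dot-separated segments")
    payload = parts[1]
    # base64url in JWTs is unpadded; restore padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

Comparing `peek_claims(token)["aud"]` and `["iss"]` with the values named in the callout log is usually enough to see which side has drifted.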
## Rotating the deployed `OIDC_AUDIENCE`
If Zitadel was re-seeded and `OIDC_AUDIENCE` on the callout now
points at a non-existent project:
```bash
# 1. Confirm the live project id
oc -n zitadel exec -ti deploy/zitadel -- /bin/sh -c \
  'curl -s -H "Authorization: Bearer $PAT" \
     $ZITADEL_URL/management/v1/projects/_search \
   | jq ".result[] | select(.name == \"fleet\") | .id"'
# 2. Re-run the bring-up — the live-query fix in f4d6fb94 will
# refresh OIDC_AUDIENCE on the next NatsAuthCalloutScore apply.
```
The shape of `mint.py` doesn't change between regular operation
and post-recovery: you just set `PROJECT_ID` to the callout's
refreshed `OIDC_AUDIENCE` value.