
Jean-Gabriel Gill-Couture 6cbecee6e1 feat(fleet-device-enroll): require + validate --device-id (RFC1123)

The auto-generated `Id::default()` shape (`fb5310_Qm2kPoQ`) contains
underscores and uppercase, so once the agent published its
DeviceInfo and the operator tried to upsert a Device CR using
`device_id` as `metadata.name`, kube rejected it:

  ApiError: Device.fleet.nationtech.io "fb5310_Qm2kPoQ" is invalid:
  metadata.name: Invalid value ... must consist of lower case
  alphanumeric characters, '-' ...

Failing at operator-reconcile time is bad UX: the Zitadel machine
user is already provisioned, the agent is already running, and the
auth callout's per-device permissions are already templated to a
device_id the kube layer will never accept. Re-enrolling requires
manually deleting state in three places.

Makes `--device-id` **required** and validates it against RFC1123
DNS subdomain rules upfront, before any Zitadel call:

* non-empty, ≤253 chars total
* dot-separated labels, each 1-63 chars, lowercase a-z + 0-9 + `-`
* labels must start AND end with an alphanumeric

Stricter than just "kube name valid" because the same id flows into
NATS subjects (auth callout's permission templates) — `_`/uppercase
silently passes NATS auth but breaks the kube path. Rejecting at
the CLI is the only failure point that catches both layers in one
place.

8 unit tests cover the accept set + every reject path
(underscore — the regression that triggered this — uppercase,
leading/trailing dash, empty, consecutive dots, label too long,
total too long).

CLI banner + README updated. The `Id::default()` fallback path is
removed entirely; no backward compat with the old auto-generated
shape (the user explicitly opted out — anything that ran before now
needs re-enrollment with an explicit id).

2026-05-06 13:43:11 -04:00


Example: Fleet Device Enroll

Enrolls a device into the fleet by minting its Zitadel machine user + JSON key inline (browser SSO or pre-acquired admin token), then runs FleetDeviceSetupScore against the device to install podman, drop the keyfile + agent config, and bring up the agent under systemd.

Two operator workflows land on the same code path:

* Dev-on-device: developer runs the score on a Pi with keyboard + display attached. Browser opens locally, dev signs in with their personal SSO account, the score provisions credentials for that one device.
* Production-via-SSH: operator runs the score from a workstation, targets each device over SSH. Browser opens once on the workstation. (Per-batch token caching is on the roadmap; v0 re-prompts per device but the browser session cookie keeps the click cheap.)

How to use

Prerequisites

* A running staging install (Zitadel + NATS + auth callout + operator); see examples/fleet_staging_install/.
* The Zitadel project ID for fleet (from the staging install output).
* A cross-compiled fleet-agent binary for the target arch.
* For VM rehearsal: libvirt + qemu-system-aarch64 + xorriso installed locally. Run `cargo run -p example_fleet_vm_setup -- --bootstrap-only --arch aarch64` once to prime the asset cache and SSH keys.
* Your Zitadel SSO account must hold a role permitting machine-user, role-grant, and machine-key creation (typically IAM_OWNER or ORG_OWNER).

Build flavors

The crate has two flavors selected by Cargo features:

| Flavor | Command | What it includes |
| --- | --- | --- |
| Workstation (default) | `cargo build --release -p example_fleet_device_enroll` | Everything: `--launch-pi-vm`, `--vm-rehearsal`, full enrollment. Pulls in libvirt via the `vm-rehearsal` feature. |
| Device-side (cross-compile) | `cargo build --release --target aarch64-unknown-linux-musl -p example_fleet_device_enroll --no-default-features` | Enrollment-only: no VM-rehearsal flags, no libvirt. Builds for arm64. Use the musl target, not gnu (see below). |

Why musl, not gnu

Building with --target aarch64-unknown-linux-gnu links against the host's glibc. On a current Arch / Fedora workstation that's glibc 2.41+; on the device it might be glibc 2.36 (Debian 12) or 2.41 (Debian 13). When the workstation's glibc is newer than the device's, the binary fails to start with:

./fleet_device_enroll: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found

aarch64-unknown-linux-musl produces a fully static binary linked against musl libc, which is bundled in. It runs on any aarch64 Linux regardless of the host's libc generation — Debian 12, 13, Pi OS, Alpine, all the same. That's what we want for a device-side binary that gets shipped onto whatever userland the production line happens to flash.

One-time musl setup

rustup target add aarch64-unknown-linux-musl
# Arch:   sudo pacman -S aarch64-linux-musl   (AUR) or use mold-aarch64
# Fedora: sudo dnf install gcc-aarch64-linux-gnu  (we use musl-cross via rustup)

You may need to point Cargo at the right linker. In ~/.cargo/config.toml:

[target.aarch64-unknown-linux-musl]
linker = "aarch64-linux-musl-gcc"

Or use cross (cargo install cross) which handles the toolchain automatically:

cross build --release --target aarch64-unknown-linux-musl \
  -p example_fleet_device_enroll --no-default-features

Copying to the device

scp target/aarch64-unknown-linux-musl/release/fleet_device_enroll pi@<host>:

Then SSH to the device and run it as documented in the Dev-on-device section below.

Quickstart — Pi-equivalent VM rehearsal

Boot a Pi-equivalent VM (Debian bookworm arm64 generic-cloud — same Debian base Pi OS is built on; Pi OS itself is locked to Pi hardware and won't boot in generic KVM) with one command:

cargo run -p example_fleet_device_enroll -- --launch-pi-vm

The command boots the VM and exits, printing the SSH connection details and a suggested next command. From there, enroll the running VM:

./target/debug/fleet_device_enroll \
  --target ssh://fleet-admin@<VM_IP> \
  --device-id pi-rehearsal-01 \
  --issuer-url https://sso-staging.cb1.nationtech.io \
  --audience <PROJECT_ID> \
  --nats-url wss://nats-fleet-staging.cb1.nationtech.io \
  --admin-oidc-client-id <CLIENT_ID> \
  --agent-binary target/aarch64-unknown-linux-gnu/release/fleet-agent

--device-id is required and validated against RFC1123 DNS subdomain rules: dot-separated labels of lowercase alphanumerics and -, each label starting and ending with an alphanumeric, at most 63 chars per label and 253 chars total. The same id is reused for the agent's TOML, the Zitadel machine username (device-<id>), and the Kubernetes Device CR, so anything kube wouldn't accept as a metadata.name is rejected upfront here instead of three layers down at operator-reconcile time.
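
The rules above are small enough to check with the standard library alone. The following is a minimal sketch, not the crate's actual code: the function name `validate_device_id` and the error strings are assumptions made for illustration.

```rust
/// Hypothetical helper: checks `id` against RFC 1123 DNS subdomain rules.
/// The name and error messages are illustrative, not the crate's real API.
pub fn validate_device_id(id: &str) -> Result<(), String> {
    if id.is_empty() {
        return Err("device id must not be empty".to_string());
    }
    // Length is measured in bytes; valid ids are ASCII-only anyway.
    if id.len() > 253 {
        return Err(format!("device id is {} bytes, max is 253", id.len()));
    }
    let alnum = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit();
    for label in id.split('.') {
        // An empty label covers leading, trailing, and consecutive dots.
        if label.is_empty() || label.len() > 63 {
            return Err(format!("label {label:?} must be 1-63 chars"));
        }
        if !label.bytes().all(|b| alnum(b) || b == b'-') {
            return Err(format!("label {label:?} may only contain a-z, 0-9 and '-'"));
        }
        let bytes = label.as_bytes();
        if !alnum(bytes[0]) || !alnum(bytes[bytes.len() - 1]) {
            return Err(format!("label {label:?} must start and end with an alphanumeric"));
        }
    }
    Ok(())
}
```

With this sketch, `validate_device_id("pi-rehearsal-01")` returns `Ok(())`, while the old auto-generated shape `fb5310_Qm2kPoQ` is rejected because of the underscore and the uppercase letters.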

The browser opens to Zitadel's device-code login. Sign in with your SSO account; the score mints the per-device user, drops the keyfile, and brings up the agent.

Dev-on-device

Run the binary on the Pi itself and omit --target entirely. The score then uses ansible's local connection and runs everything on the same machine, with no SSH and no keypair:

fleet_device_enroll \
  --issuer-url https://sso.example.com \
  --audience <PROJECT_ID> \
  --nats-url wss://nats.example.com \
  --admin-oidc-client-id <CLIENT_ID> \
  --agent-binary /usr/local/bin/fleet-agent \
  --device-id pi-001 \
  --labels group=lab,arch=aarch64

The browser opens on the Pi's local display. The dev signs in once; the score handles the rest. sudo prompts for the operator's password if passwordless sudo isn't configured (which is fine; it is Debian's default).

The score auto-installs python3-venv on first run if it is missing (Debian splits it out of base python3): it detects the venv failure, runs sudo apt-get install -y python3-venv, and retries the venv create.
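
The `--target` convention used in this README (flag absent means run locally, `ssh://user@host` means run over SSH) can be modeled roughly as below. This is only a sketch: the `Target` enum and `parse_target` function are hypothetical names, and the real CLI may accept more forms (ports, for instance) than this handles.

```rust
/// Hypothetical model of the `--target` flag; not the crate's actual type.
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    /// No `--target` given: run everything on the local machine.
    Local,
    /// `ssh://user@host`: run against a remote device.
    Ssh { user: String, host: String },
}

/// Parses the optional `--target` value. Assumes the only remote form is
/// `ssh://user@host`, which is the shape shown in the examples here.
pub fn parse_target(arg: Option<&str>) -> Result<Target, String> {
    let Some(raw) = arg else {
        return Ok(Target::Local);
    };
    let rest = raw
        .strip_prefix("ssh://")
        .ok_or_else(|| format!("unsupported target {raw:?}, expected ssh://user@host"))?;
    match rest.split_once('@') {
        Some((user, host)) if !user.is_empty() && !host.is_empty() => Ok(Target::Ssh {
            user: user.to_string(),
            host: host.to_string(),
        }),
        _ => Err(format!("target {raw:?} must look like ssh://user@host")),
    }
}
```

Modeling the absent flag as an explicit `Local` variant keeps both workflows on one code path, which matches how the two operator workflows are described above.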

Production-via-SSH

Operator runs from a workstation, targeting devices on the LAN:

fleet_device_enroll \
  --target ssh://pi@10.0.0.42 \
  --issuer-url https://sso.example.com \
  --audience <PROJECT_ID> \
  --nats-url wss://nats.example.com \
  --agent-binary ./build/fleet-agent-aarch64 \
  --device-id batch7-042 \
  --labels group=batch7,site=warehouse-east

Each invocation re-prompts the browser. Token caching across runs is tracked in ROADMAP/fleet_platform/device_enrollment_token_caching.md.

Non-interactive (CI / scripted)

Skip the browser by passing a Bearer token:

HARMONY_ZITADEL_ADMIN_TOKEN=<pat-or-access-token> \
fleet_device_enroll \
  --target ssh://pi@10.0.0.42 \
  --issuer-url https://sso.example.com \
  --audience <PROJECT_ID> \
  --nats-url wss://nats.example.com \
  --agent-binary ./build/fleet-agent-aarch64 \
  --device-id <DEVICE_ID>

What the score does on the device

For each invocation the score:

1. Calls Zitadel /management/v1/* with the admin token to find-or-create the device's machine user, grant it the device role on the fleet project, and mint a JSON key (idempotent on user + grant; always mints a new key because Zitadel doesn't return existing material).
2. SSHes to the target, ensures podman + systemd-container packages, creates the fleet-agent user with linger, activates the user-scoped podman socket.
3. Uploads the agent binary to /usr/local/bin/fleet-agent.
4. Drops the JSON keyfile at /etc/fleet-agent/zitadel-key.json (mode 0640, owned by fleet-agent).
5. Renders /etc/fleet-agent/config.toml with the agent's NATS URLs, labels, and [credentials] block pointing at the keyfile.
6. Installs and starts fleet-agent.service. Restarts only if config / binary / unit changed.

The agent then mints NATS JWTs from the keyfile via the auth callout's JWT-bearer flow and registers itself in the device-info KV.
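
The "restarts only if config / binary / unit changed" behavior in the last step is a common idempotency pattern: write each managed file only when its content differs, and restart only if at least one write happened. The sketch below illustrates the idea with the standard library. It is not the score's implementation (which, per this README, goes through ansible), and the function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes `contents` to `path` only when the file is missing or differs.
/// Returns `true` when the file was (re)written, `false` when left untouched.
pub fn write_if_changed(path: &Path, contents: &[u8]) -> io::Result<bool> {
    match fs::read(path) {
        // Identical content already on disk: nothing to do.
        Ok(existing) if existing.as_slice() == contents => return Ok(false),
        // File exists but differs: fall through and overwrite.
        Ok(_) => {}
        // File does not exist yet: fall through and create it.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        // Any other I/O error (permissions, etc.) is reported to the caller.
        Err(e) => return Err(e),
    }
    fs::write(path, contents)?;
    Ok(true)
}

/// Hypothetical aggregation: a restart is needed if any artifact changed.
pub fn needs_restart(changes: &[bool]) -> bool {
    changes.iter().any(|&changed| changed)
}
```

A caller would collect the result of `write_if_changed` for the config, the binary, and the unit file, then restart the service only when `needs_restart` returns `true`. Re-running enrollment against an unchanged device then leaves the agent running.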

Verification

After enrollment, the device's heartbeat should appear within seconds:

nats kv get fleet-device-info <device-id>

Or watch via the operator's dashboard / CRs:

kubectl get fleetdev   # devices CRD

SSO `client_id` — where to get it

--admin-oidc-client-id is the numeric Zitadel-assigned client_id, not the human-readable app name. When fleet_staging_install provisions the harmony-cli device-code app, Zitadel generates a numeric client_id like 371639797157987125@fleet. The staging install prints this value in its final summary block — copy it from there.

If you ever need to look it up after the fact, it's in the staging-install operator's local cache:

jq -r '.apps."harmony-cli"' ~/.local/share/harmony/zitadel/client-config.json

That cache is on the operator's workstation (the host that ran fleet_staging_install). The device itself doesn't have it — the operator must pass --admin-oidc-client-id <numeric> explicitly when running enrollment from the device, or set HARMONY_ZITADEL_ADMIN_TOKEN to skip SSO entirely.
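
Since the most common mistake is passing the app name instead of the numeric id, a cheap shape check can flag it before any request reaches Zitadel. The sketch below is a heuristic under stated assumptions: it treats a value as plausible when it is all digits, optionally followed by `@<suffix>` as in the `371639797157987125@fleet` example above. The function name is hypothetical, and Zitadel itself remains the authority on whether a client_id is valid.

```rust
/// Heuristic only: returns true when `s` looks like `<digits>` or
/// `<digits>@<suffix>`. Hypothetical helper, not part of the real CLI.
pub fn looks_like_numeric_client_id(s: &str) -> bool {
    // Everything before the first '@' (or the whole string if there is none).
    let numeric = s.split('@').next().unwrap_or("");
    !numeric.is_empty() && numeric.bytes().all(|b| b.is_ascii_digit())
}
```

Under this heuristic, `371639797157987125@fleet` passes and `harmony-cli` does not, so a CLI could print a hint pointing at the staging-install summary instead of surfacing the raw `invalid_client` error.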

Common failure modes

* `invalid_client: no active client not found`: --admin-oidc-client-id is wrong. Most likely you passed the app name (harmony-cli) instead of the numeric client_id. See above.
* `Project '<name>' not visible to the current Zitadel token`: your SSO token's primary org differs from where the project lives. Most common when the staging install created the project as the system iam-admin user (system org) and you're signing in with a personal Zitadel account (your own org). Pass --admin-org-id <id> (find it in Zitadel UI → Organization → Resource ID). Alternatively, the score now logs projects visible in current org context: … right before the error; that list shows what your token CAN see, which usually pinpoints the org mismatch.
* 403 on management API: operator SSO account doesn't hold a role permitting management calls. Grant IAM_OWNER (or equivalent scoped permission) in Zitadel admin UI.
* `CaUsedAsEndEntity` from rustls: talking to a dev cluster with a self-signed cert. Pass --danger-accept-invalid-certs.
* Browser doesn't open over SSH: webbrowser can't find a GUI. The score still prints the URL; copy it into a browser on your workstation.

CLI flags

Run fleet_device_enroll --help for the full surface.
