The auto-generated `Id::default()` shape (`fb5310_Qm2kPoQ`) contains underscores and uppercase, so once the agent published its DeviceInfo and the operator tried to upsert a Device CR using `device_id` as `metadata.name`, kube rejected it:

ApiError: Device.fleet.nationtech.io "fb5310_Qm2kPoQ" is invalid: metadata.name: Invalid value ... must consist of lower case alphanumeric characters, '-' ...

Failing at operator-reconcile time is bad UX: the Zitadel machine user is already provisioned, the agent is already running, and the auth callout's per-device permissions are already templated to a device_id the kube layer will never accept. Re-enrolling requires manually deleting state in three places.

Makes `--device-id` **required** and validates it against RFC1123 DNS subdomain rules upfront, before any Zitadel call:

* non-empty, ≤253 chars total
* dot-separated labels, each 1-63 chars, lowercase a-z + 0-9 + `-`
* labels must start AND end with an alphanumeric

Stricter than just "kube name valid" because the same id flows into NATS subjects (auth callout's permission templates): `_`/uppercase silently passes NATS auth but breaks the kube path. Rejecting at the CLI is the only failure point that catches both layers in one place.

8 unit tests cover the accept set + every reject path (underscore, the regression that triggered this; uppercase; leading/trailing dash; empty; consecutive dots; label too long; total too long). CLI banner + README updated.

The `Id::default()` fallback path is removed entirely; no backward compat with the old auto-generated shape (the user explicitly opted out; anything that ran before now needs re-enrollment with an explicit id).
Example: Fleet Device Enroll
Enrolls a device into the fleet by minting its Zitadel machine user + JSON key inline (browser SSO or pre-acquired admin token), then runs FleetDeviceSetupScore against the device to install podman, drop the keyfile + agent config, and bring up the agent under systemd.
Two operator workflows land on the same code path:
- Dev-on-device — developer runs the score on a Pi with keyboard + display attached. Browser opens locally, dev signs in with their personal SSO account, the score provisions credentials for that one device.
- Production-via-SSH — operator runs the score from a workstation, targets each device over SSH. Browser opens once on the workstation. (Per-batch token caching is on the roadmap; v0 re-prompts per device but the browser session cookie keeps the click cheap.)
How to use
Prerequisites
- A running staging install (Zitadel + NATS + auth callout + operator); see `examples/fleet_staging_install/`.
- The Zitadel project ID for `fleet` (from the staging install output).
- A cross-compiled `fleet-agent` binary for the target arch.
- For VM rehearsal: libvirt + qemu-system-aarch64 + xorriso installed locally. Run `cargo run -p example_fleet_vm_setup -- --bootstrap-only --arch aarch64` once to prime the asset cache and SSH keys.
- Your Zitadel SSO account must hold a role permitting machine-user, role-grant, and machine-key creation (typically `IAM_OWNER` or `ORG_OWNER`).
Build flavors
The crate has two flavors selected by Cargo features:
| Flavor | Command | What it includes |
|---|---|---|
| Workstation (default) | `cargo build --release -p example_fleet_device_enroll` | Everything: `--launch-pi-vm`, `--vm-rehearsal`, full enrollment. Pulls in libvirt via the `vm-rehearsal` feature. |
| Device-side (cross-compile) | `cargo build --release --target aarch64-unknown-linux-musl -p example_fleet_device_enroll --no-default-features` | Enrollment-only: no VM-rehearsal flags, no libvirt. Builds for arm64. Use the musl target, not gnu (see below). |
Why musl, not gnu
Building with --target aarch64-unknown-linux-gnu links against the host's glibc. On a current Arch / Fedora workstation that's glibc 2.41+; on the device it might be glibc 2.36 (Debian 12) or 2.41 (Debian 13). When the workstation's glibc is newer than the device's, the binary fails to start with:
./fleet_device_enroll: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found
aarch64-unknown-linux-musl produces a fully static binary linked against musl libc, which is bundled in. It runs on any aarch64 Linux regardless of the host's libc generation — Debian 12, 13, Pi OS, Alpine, all the same. That's what we want for a device-side binary that gets shipped onto whatever userland the production line happens to flash.
One-time musl setup
rustup target add aarch64-unknown-linux-musl
# Arch: sudo pacman -S aarch64-linux-musl (AUR) or use mold-aarch64
# Fedora: sudo dnf install gcc-aarch64-linux-gnu (we use musl-cross via rustup)
You may need to point Cargo at the right linker. In ~/.cargo/config.toml:
[target.aarch64-unknown-linux-musl]
linker = "aarch64-linux-musl-gcc"
Or use cross (cargo install cross) which handles the toolchain automatically:
cross build --release --target aarch64-unknown-linux-musl \
-p example_fleet_device_enroll --no-default-features
Copying to the device
scp target/aarch64-unknown-linux-musl/release/fleet_device_enroll pi@<host>:
Then SSH to the device and run it as documented in the Dev-on-device section below.
Quickstart — Pi-equivalent VM rehearsal
Boot a Pi-equivalent VM (Debian bookworm arm64 generic-cloud — same Debian base Pi OS is built on; Pi OS itself is locked to Pi hardware and won't boot in generic KVM) with one command:
cargo run -p example_fleet_device_enroll -- --launch-pi-vm
The command boots the VM and exits, printing the SSH connection details and a suggested next command. From there, enroll the running VM:
./target/debug/fleet_device_enroll \
--target ssh://fleet-admin@<VM_IP> \
--device-id pi-rehearsal-01 \
--issuer-url https://sso-staging.cb1.nationtech.io \
--audience <PROJECT_ID> \
--nats-url wss://nats-fleet-staging.cb1.nationtech.io \
--admin-oidc-client-id <CLIENT_ID> \
--agent-binary target/aarch64-unknown-linux-gnu/release/fleet-agent
--device-id is required and validated against RFC1123 subdomain rules (lowercase alphanumeric + -, must start and end with an alphanumeric, ≤253 chars total / ≤63 chars per label). Same id is reused for the agent's TOML, the Zitadel machine username (device-<id>), and the Kubernetes Device CR — so anything kube wouldn't accept as a metadata.name is rejected upfront here instead of three layers down at operator-reconcile time.
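The rules above can be sketched as a small validator. This is a minimal illustration of the RFC1123 subdomain check, not the crate's actual code: the function name `validate_device_id` and the `DeviceIdError` type are hypothetical.

```rust
/// Hypothetical error type for this sketch; the real crate's error type is
/// not shown in this README.
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceIdError {
    Empty,
    TooLong,
    BadLabelLength,
    InvalidChar(char),
    LabelEdgeNotAlphanumeric,
}

/// Check `id` against the RFC1123 DNS subdomain rules listed above:
/// non-empty, at most 253 chars, dot-separated labels of 1-63 chars,
/// only lowercase a-z, 0-9 and '-', and no label starting or ending with '-'.
pub fn validate_device_id(id: &str) -> Result<(), DeviceIdError> {
    if id.is_empty() {
        return Err(DeviceIdError::Empty);
    }
    if id.len() > 253 {
        return Err(DeviceIdError::TooLong);
    }
    for label in id.split('.') {
        // An empty label means a leading, trailing, or doubled dot.
        if label.is_empty() || label.len() > 63 {
            return Err(DeviceIdError::BadLabelLength);
        }
        let bad = label
            .chars()
            .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'));
        if let Some(c) = bad {
            return Err(DeviceIdError::InvalidChar(c));
        }
        // Every char is ASCII at this point, so byte indexing is safe.
        let bytes = label.as_bytes();
        if bytes[0] == b'-' || bytes[bytes.len() - 1] == b'-' {
            return Err(DeviceIdError::LabelEdgeNotAlphanumeric);
        }
    }
    Ok(())
}
```

With this sketch, `pi-rehearsal-01` is accepted, while the old auto-generated shape `fb5310_Qm2kPoQ` is rejected at the underscore.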
The browser opens to Zitadel's device-code login. Sign in with your SSO account; the score mints the per-device user, drops the keyfile, and brings up the agent.
Dev-on-device
Run the binary on the Pi itself and omit --target entirely. The score uses ansible's local connection and runs everything on the same machine, with no SSH and no keypair:
fleet_device_enroll \
--issuer-url https://sso.example.com \
--audience <PROJECT_ID> \
--nats-url wss://nats.example.com \
--admin-oidc-client-id <CLIENT_ID> \
--agent-binary /usr/local/bin/fleet-agent \
--device-id pi-001 \
--labels group=lab,arch=aarch64
The browser opens on the Pi's local display. The dev signs in once; the score handles the rest. Sudo prompts for the operator's password if passwordless sudo isn't configured (which is fine; that is Debian's default).
The score auto-installs python3-venv on first run if it is missing (Debian splits it out of base python3): it detects the failure, runs sudo apt-get install -y python3-venv, and retries the venv creation.
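The detect, fix, retry pattern described above can be sketched generically. This is an illustration only: `retry_after_fix` is a hypothetical helper, and the real score's control flow may differ.

```rust
/// Run `attempt`; if it fails, run `fix` once and try `attempt` again.
/// Returns the error from `fix` if the fix itself fails, otherwise the
/// result of the second attempt. (Hypothetical helper for illustration.)
pub fn retry_after_fix<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    fix: impl FnOnce() -> Result<(), E>,
) -> Result<T, E> {
    match attempt() {
        Ok(value) => Ok(value),
        Err(_first_failure) => {
            // In the README's scenario the fix would be
            // `sudo apt-get install -y python3-venv`.
            fix()?;
            attempt()
        }
    }
}
```

In the scenario above, `attempt` would wrap the venv creation and `fix` would wrap the package install; both could be built on `std::process::Command`.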
Production-via-SSH
Operator runs from a workstation, targeting devices on the LAN:
fleet_device_enroll \
--target ssh://pi@10.0.0.42 \
--issuer-url https://sso.example.com \
--audience <PROJECT_ID> \
--nats-url wss://nats.example.com \
--agent-binary ./build/fleet-agent-aarch64 \
--device-id batch7-042 \
--labels group=batch7,site=warehouse-east
Each invocation re-prompts the browser. Token caching across runs is tracked in ROADMAP/fleet_platform/device_enrollment_token_caching.md.
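The two workflows differ only in whether --target is present. A minimal sketch of that decision follows; the `Target` enum and `parse_target` function are hypothetical names, and the assumption that a target must have the form ssh://user@host is taken from the examples in this README rather than from the CLI's real parser.

```rust
/// Where the score runs. Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    /// No --target given: run on the local machine.
    Local,
    /// --target ssh://user@host
    Ssh { user: String, host: String },
}

/// Interpret the optional --target value.
/// Assumption: only the `ssh://user@host` form is accepted.
pub fn parse_target(arg: Option<&str>) -> Result<Target, String> {
    let Some(raw) = arg else {
        return Ok(Target::Local);
    };
    let rest = raw
        .strip_prefix("ssh://")
        .ok_or_else(|| format!("unsupported target scheme: {raw}"))?;
    let (user, host) = rest
        .split_once('@')
        .ok_or_else(|| format!("expected ssh://user@host, got: {raw}"))?;
    if user.is_empty() || host.is_empty() {
        return Err(format!("expected ssh://user@host, got: {raw}"));
    }
    Ok(Target::Ssh {
        user: user.to_string(),
        host: host.to_string(),
    })
}
```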
Non-interactive (CI / scripted)
Skip the browser by passing a Bearer token:
HARMONY_ZITADEL_ADMIN_TOKEN=<pat-or-access-token> \
fleet_device_enroll \
--target ssh://pi@10.0.0.42 \
--issuer-url https://sso.example.com \
--audience <PROJECT_ID> \
--nats-url wss://nats.example.com \
--agent-binary ./build/fleet-agent-aarch64 \
--device-id <device-id>
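The choice between the token path and the browser path can be sketched as below. The `AdminAuth` enum and `choose_admin_auth` function are hypothetical, and the precedence (a non-empty token wins over the client id) is an assumption based on the statement that the token skips SSO entirely.

```rust
/// How the admin credential is obtained. Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum AdminAuth {
    /// Pre-acquired token from HARMONY_ZITADEL_ADMIN_TOKEN.
    BearerToken(String),
    /// Interactive browser SSO using --admin-oidc-client-id.
    BrowserSso { client_id: String },
}

/// Pick the auth mode from the env token and the CLI client id.
/// Assumption: a non-empty token takes precedence and skips the browser.
pub fn choose_admin_auth(
    env_token: Option<String>,
    client_id: Option<String>,
) -> Result<AdminAuth, String> {
    match env_token.filter(|t| !t.trim().is_empty()) {
        Some(token) => Ok(AdminAuth::BearerToken(token)),
        None => client_id
            .map(|client_id| AdminAuth::BrowserSso { client_id })
            .ok_or_else(|| {
                "set HARMONY_ZITADEL_ADMIN_TOKEN or pass --admin-oidc-client-id".to_string()
            }),
    }
}
```

A caller would typically pass `std::env::var("HARMONY_ZITADEL_ADMIN_TOKEN").ok()` as the first argument.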
What the score does on the device
For each invocation the score:
- Calls Zitadel `/management/v1/*` with the admin token to find-or-create the device's machine user, grant it the `device` role on the fleet project, and mint a JSON key (idempotent on user + grant; always mints a new key because Zitadel doesn't return existing material).
- SSHes to the target, ensures `podman` + `systemd-container` packages, creates the `fleet-agent` user with linger, activates the user-scoped podman socket.
- Uploads the agent binary to `/usr/local/bin/fleet-agent`.
- Drops the JSON keyfile at `/etc/fleet-agent/zitadel-key.json` (mode 0640, owned by `fleet-agent`).
- Renders `/etc/fleet-agent/config.toml` with the agent's NATS URLs, labels, and `[credentials]` block pointing at the keyfile.
- Installs and starts `fleet-agent.service`. Restarts only if config / binary / unit changed.
The agent then mints NATS JWTs from the keyfile via the auth callout's JWT-bearer flow and registers itself in the device-info KV.
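The "restarts only if config / binary / unit changed" rule from the steps above can be sketched as a write-if-changed helper. This is an illustration that works on local files; the real score does this on the target device, and the names `write_if_changed` and `needs_restart` are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write `desired` to `path` only when the current content differs.
/// Returns `true` when the file was created or rewritten.
pub fn write_if_changed(path: &Path, desired: &[u8]) -> io::Result<bool> {
    match fs::read(path) {
        // Content already matches: leave the file untouched.
        Ok(current) if current == desired => Ok(false),
        Ok(_) => {
            fs::write(path, desired)?;
            Ok(true)
        }
        // A missing file counts as a change.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            fs::write(path, desired)?;
            Ok(true)
        }
        Err(e) => Err(e),
    }
}

/// A restart is needed only when at least one managed file changed.
pub fn needs_restart(changed: &[bool]) -> bool {
    changed.iter().any(|c| *c)
}
```

Collecting one `changed` flag each for the config, the binary, and the unit file, then passing them to `needs_restart`, keeps a repeated enrollment from bouncing a healthy agent.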
Verification
After enrollment, the device's heartbeat should appear within seconds:
nats kv get fleet-device-info <device-id>
Or watch via the operator's dashboard / CRs:
kubectl get fleetdev # devices CRD
SSO client_id — where to get it
--admin-oidc-client-id is the numeric Zitadel-assigned client_id, not the human-readable app name. When fleet_staging_install provisions the harmony-cli device-code app, Zitadel generates a numeric client_id like 371639797157987125@fleet. The staging install prints this value in its final summary block — copy it from there.
If you ever need to look it up after the fact, it's in the staging-install operator's local cache:
jq -r '.apps."harmony-cli"' ~/.local/share/harmony/zitadel/client-config.json
That cache is on the operator's workstation (the host that ran fleet_staging_install). The device itself doesn't have it — the operator must pass --admin-oidc-client-id <numeric> explicitly when running enrollment from the device, or set HARMONY_ZITADEL_ADMIN_TOKEN to skip SSO entirely.
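A cheap pre-flight check can catch the app-name-instead-of-client_id mistake before any network call. The sketch below is a heuristic only: it assumes client_ids begin with a run of digits, optionally followed by an `@suffix`, as in the example above. The function name is hypothetical, and Zitadel does not document this as a guaranteed format here.

```rust
/// Heuristic: does `value` look like a Zitadel-assigned client_id
/// (digits, optionally followed by `@<suffix>`) rather than an app name?
/// Assumption: the shape is inferred from the README's single example.
pub fn looks_like_zitadel_client_id(value: &str) -> bool {
    // Take everything before the first '@' (or the whole string).
    let numeric_part = value.split('@').next().unwrap_or("");
    !numeric_part.is_empty() && numeric_part.chars().all(|c| c.is_ascii_digit())
}
```

A CLI could use this to print a hint pointing at the staging-install summary when the value looks like an app name such as `harmony-cli`.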
Common failure modes
- `invalid_client: no active client not found`: `--admin-oidc-client-id` is wrong. Most likely you passed the app name (`harmony-cli`) instead of the numeric client_id. See above.
- `Project '<name>' not visible to the current Zitadel token`: your SSO token's primary org differs from where the project lives. Most common when the staging install created the project as the system iam-admin user (system org) and you're signing in with a personal Zitadel account (your own org). Pass `--admin-org-id <id>` (find it in Zitadel UI → Organization → Resource ID). Alternatively, the score now logs `projects visible in current org context: …` right before the error; that list shows what your token CAN see, which usually pinpoints the org mismatch.
- 403 on management API: operator SSO account doesn't hold a role permitting management calls. Grant `IAM_OWNER` (or equivalent scoped permission) in Zitadel admin UI.
- `CaUsedAsEndEntity` from rustls: talking to a dev cluster with a self-signed cert. Pass `--danger-accept-invalid-certs`.
- Browser doesn't open over SSH: `webbrowser` can't find a GUI. The score still prints the URL; copy it into a browser on your workstation.
CLI flags
Run fleet_device_enroll --help for the full surface.