Jean-Gabriel Gill-Couture 6cbecee6e1 feat(fleet-device-enroll): require + validate --device-id (RFC1123)
The auto-generated `Id::default()` shape (`fb5310_Qm2kPoQ`) contains
underscores and uppercase, so once the agent published its
DeviceInfo and the operator tried to upsert a Device CR using
`device_id` as `metadata.name`, kube rejected it:

  ApiError: Device.fleet.nationtech.io "fb5310_Qm2kPoQ" is invalid:
  metadata.name: Invalid value ... must consist of lower case
  alphanumeric characters, '-' ...

Failing at operator-reconcile time is bad UX: the Zitadel machine
user is already provisioned, the agent is already running, and the
auth callout's per-device permissions are already templated to a
device_id the kube layer will never accept. Re-enrolling requires
manually deleting state in three places.

Makes `--device-id` **required** and validates it against RFC1123
DNS subdomain rules upfront, before any Zitadel call:

* non-empty, ≤253 chars total
* dot-separated labels, each 1-63 chars, lowercase a-z + 0-9 + `-`
* labels must start AND end with an alphanumeric

Stricter than just "kube name valid" because the same id flows into
NATS subjects (auth callout's permission templates) — `_`/uppercase
silently passes NATS auth but breaks the kube path. Rejecting at
the CLI is the only failure point that catches both layers in one
place.

8 unit tests cover the accept set + every reject path
(underscore — the regression that triggered this — uppercase,
leading/trailing dash, empty, consecutive dots, label too long,
total too long).

CLI banner + README updated. The `Id::default()` fallback path is
removed entirely; no backward compat with the old auto-generated
shape (the user explicitly opted out — anything that ran before now
needs re-enrollment with an explicit id).
2026-05-06 13:43:11 -04:00


# Example: Fleet Device Enroll
Enrolls a device into the fleet by minting its Zitadel machine user + JSON key inline (browser SSO or pre-acquired admin token), then runs `FleetDeviceSetupScore` against the device to install podman, drop the keyfile + agent config, and bring up the agent under systemd.
Two operator workflows land on the same code path:
- **Dev-on-device** — developer runs the score on a Pi with keyboard + display attached. Browser opens locally, dev signs in with their personal SSO account, the score provisions credentials for that one device.
- **Production-via-SSH** — operator runs the score from a workstation, targets each device over SSH. Browser opens once on the workstation. (Per-batch token caching is on the roadmap; v0 re-prompts per device but the browser session cookie keeps the click cheap.)
## How to use
### Prerequisites
- A running staging install (Zitadel + NATS + auth callout + operator) — see `examples/fleet_staging_install/`.
- The Zitadel project ID for `fleet` (from the staging install output).
- A cross-compiled `fleet-agent` binary for the target arch.
- For VM rehearsal: libvirt + qemu-system-aarch64 + xorriso installed locally. Run `cargo run -p example_fleet_vm_setup -- --bootstrap-only --arch aarch64` once to prime the asset cache and SSH keys.
- Your Zitadel SSO account must hold a role permitting machine-user, role-grant, and machine-key creation (typically `IAM_OWNER` or `ORG_OWNER`).
### Build flavors
The crate has two flavors selected by Cargo features:
| Flavor | Command | What it includes |
|---|---|---|
| **Workstation** (default) | `cargo build --release -p example_fleet_device_enroll` | Everything: `--launch-pi-vm`, `--vm-rehearsal`, full enrollment. Pulls in libvirt via the `vm-rehearsal` feature. |
| **Device-side** (cross-compile) | `cargo build --release --target aarch64-unknown-linux-musl -p example_fleet_device_enroll --no-default-features` | Enrollment-only — no VM-rehearsal flags, no libvirt. Builds for arm64. **Use the musl target, not gnu** (see below). |
#### Why musl, not gnu
Building with `--target aarch64-unknown-linux-gnu` links against the host's glibc. On a current Arch / Fedora workstation that's glibc 2.41+; on the device it might be glibc 2.36 (Debian 12) or 2.41 (Debian 13). When the workstation's glibc is newer than the device's, the binary fails to start with:
```
./fleet_device_enroll: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found
```
`aarch64-unknown-linux-musl` produces a **fully static binary** with musl libc bundled in. It runs on any aarch64 Linux regardless of the glibc version installed on the device: Debian 12, Debian 13, Pi OS, and Alpine all behave the same. That's what we want for a device-side binary that gets shipped onto whatever userland the production line happens to flash.
#### One-time musl setup
```bash
rustup target add aarch64-unknown-linux-musl
# Arch: sudo pacman -S aarch64-linux-musl (AUR) or use mold-aarch64
# Fedora: sudo dnf install gcc-aarch64-linux-gnu (we use musl-cross via rustup)
```
You may need to point Cargo at the right linker. In `~/.cargo/config.toml`:
```toml
[target.aarch64-unknown-linux-musl]
linker = "aarch64-linux-musl-gcc"
```
Or use `cross` (`cargo install cross`) which handles the toolchain automatically:
```bash
cross build --release --target aarch64-unknown-linux-musl \
-p example_fleet_device_enroll --no-default-features
```
#### Copying to the device
```bash
scp target/aarch64-unknown-linux-musl/release/fleet_device_enroll pi@<host>:
```
Then SSH to the device and run it as documented in [Dev-on-device](#dev-on-device) below.
### Quickstart — Pi-equivalent VM rehearsal
Boot a Pi-equivalent VM (Debian bookworm arm64 generic-cloud — same Debian base Pi OS is built on; Pi OS itself is locked to Pi hardware and won't boot in generic KVM) with one command:
```bash
cargo run -p example_fleet_device_enroll -- --launch-pi-vm
```
The command boots the VM and exits, printing the SSH connection details and a suggested next command. From there, enroll the running VM:
```bash
./target/debug/fleet_device_enroll \
--target ssh://fleet-admin@<VM_IP> \
--device-id pi-rehearsal-01 \
--issuer-url https://sso-staging.cb1.nationtech.io \
--audience <PROJECT_ID> \
--nats-url wss://nats-fleet-staging.cb1.nationtech.io \
--admin-oidc-client-id <CLIENT_ID> \
--agent-binary target/aarch64-unknown-linux-gnu/release/fleet-agent
```
`--device-id` is required and validated against RFC1123 subdomain rules: dot-separated labels of lowercase alphanumerics and `-`, each label starting and ending with an alphanumeric, ≤63 chars per label and ≤253 chars total. The same id is reused for the agent's TOML, the Zitadel machine username (`device-<id>`), and the Kubernetes Device CR, so anything kube wouldn't accept as a `metadata.name` is rejected upfront here instead of three layers down at operator-reconcile time.
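The validation rules can be sketched as a small Rust check. This is a minimal illustration, not the crate's actual validator: the `validate_device_id` function and the `DeviceIdError` enum are hypothetical names introduced here, and the real implementation may order its checks or report errors differently.

```rust
/// Hypothetical error type for this sketch; the real crate's errors may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceIdError {
    Empty,
    TooLong(usize),
    EmptyLabel,
    LabelTooLong(String),
    InvalidChar(char),
    BadLabelEdge(String),
}

/// Checks `id` against RFC1123 DNS subdomain rules as summarized in this README.
pub fn validate_device_id(id: &str) -> Result<(), DeviceIdError> {
    if id.is_empty() {
        return Err(DeviceIdError::Empty);
    }
    // Total length limit (bytes; any non-ASCII input is rejected further down).
    if id.len() > 253 {
        return Err(DeviceIdError::TooLong(id.len()));
    }
    for label in id.split('.') {
        // Leading, trailing, or consecutive dots produce an empty label.
        if label.is_empty() {
            return Err(DeviceIdError::EmptyLabel);
        }
        if label.len() > 63 {
            return Err(DeviceIdError::LabelTooLong(label.to_string()));
        }
        // Only lowercase a-z, 0-9 and '-' are allowed inside a label.
        if let Some(c) = label
            .chars()
            .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'))
        {
            return Err(DeviceIdError::InvalidChar(c));
        }
        // Labels must start and end with an alphanumeric, so no edge dashes.
        if label.starts_with('-') || label.ends_with('-') {
            return Err(DeviceIdError::BadLabelEdge(label.to_string()));
        }
    }
    Ok(())
}
```

Under this sketch, `validate_device_id("pi-rehearsal-01")` returns `Ok(())`, while the old auto-generated shape `fb5310_Qm2kPoQ` is rejected at the underscore.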
The browser opens to Zitadel's device-code login. Sign in with your SSO account; the score mints the per-device user, drops the keyfile, and brings up the agent.
### Dev-on-device
Run the binary on the Pi itself and omit `--target` entirely. The score uses ansible's local connection and runs everything on the same machine, with no SSH and no keypair:
```bash
fleet_device_enroll \
--issuer-url https://sso.example.com \
--audience <PROJECT_ID> \
--nats-url wss://nats.example.com \
--admin-oidc-client-id <CLIENT_ID> \
--agent-binary /usr/local/bin/fleet-agent \
--device-id pi-001 \
--labels group=lab,arch=aarch64
```
The browser opens on the Pi's local display. The dev signs in once; the score handles the rest. `sudo` prompts for the operator's password if passwordless sudo isn't configured, which is fine (password-protected sudo is Debian's default).
If `python3-venv` is missing on first run (Debian splits it out of base python3), the score detects the venv failure, runs `sudo apt-get install -y python3-venv`, and retries the venv creation.
### Production-via-SSH
Operator runs from a workstation, targeting devices on the LAN:
```bash
fleet_device_enroll \
--target ssh://pi@10.0.0.42 \
--issuer-url https://sso.example.com \
--audience <PROJECT_ID> \
--nats-url wss://nats.example.com \
--agent-binary ./build/fleet-agent-aarch64 \
--device-id batch7-042 \
--labels group=batch7,site=warehouse-east
```
Each invocation opens the browser prompt again. Token caching across runs is tracked in `ROADMAP/fleet_platform/device_enrollment_token_caching.md`.
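The `--labels` value in the commands above is a comma-separated list of `key=value` pairs. As a rough sketch of how such a string could be parsed in Rust: the `parse_labels` function is hypothetical, and this README does not state how the real CLI treats whitespace, duplicate keys, or empty values, so those choices are assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: parses "k1=v1,k2=v2" into a sorted map.
/// Assumptions: surrounding whitespace is trimmed, empty segments are
/// skipped, empty values are allowed, and a repeated key keeps the last value.
pub fn parse_labels(raw: &str) -> Result<BTreeMap<String, String>, String> {
    let mut labels = BTreeMap::new();
    for pair in raw.split(',').filter(|p| !p.trim().is_empty()) {
        // Split on the first '=' only, so values may themselves contain '='.
        let (key, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("label `{pair}` is missing `=`"))?;
        let (key, value) = (key.trim(), value.trim());
        if key.is_empty() {
            return Err(format!("label `{pair}` has an empty key"));
        }
        labels.insert(key.to_string(), value.to_string());
    }
    Ok(labels)
}
```

With these assumptions, `group=batch7,site=warehouse-east` yields two entries, and a segment without `=` (for example `group`) is reported as an error rather than silently dropped.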
### Non-interactive (CI / scripted)
Skip the browser by passing a Bearer token:
```bash
HARMONY_ZITADEL_ADMIN_TOKEN=<pat-or-access-token> \
fleet_device_enroll \
--target ssh://pi@10.0.0.42 \
--issuer-url https://sso.example.com \
--audience <PROJECT_ID> \
--nats-url wss://nats.example.com \
--agent-binary ./build/fleet-agent-aarch64 \
--device-id <DEVICE_ID>
```
## What the score does on the device
For each invocation the score:
1. Calls Zitadel `/management/v1/*` with the admin token to find-or-create the device's machine user, grant it the `device` role on the fleet project, and mint a JSON key (idempotent on user + grant; always mints a new key because Zitadel doesn't return existing material).
2. SSHes to the target, ensures `podman` + `systemd-container` packages, creates the `fleet-agent` user with linger, activates the user-scoped podman socket.
3. Uploads the agent binary to `/usr/local/bin/fleet-agent`.
4. Drops the JSON keyfile at `/etc/fleet-agent/zitadel-key.json` (mode 0640, owned by `fleet-agent`).
5. Renders `/etc/fleet-agent/config.toml` with the agent's NATS URLs, labels, and `[credentials]` block pointing at the keyfile.
6. Installs and starts `fleet-agent.service`. Restarts only if config / binary / unit changed.
The agent then mints NATS JWTs from the keyfile via the auth callout's JWT-bearer flow and registers itself in the `device-info` KV.
## Verification
After enrollment, the device's heartbeat should appear within seconds:
```bash
nats kv get fleet-device-info <device-id>
```
Or watch via the operator's dashboard / CRs:
```bash
kubectl get fleetdev # devices CRD
```
## SSO `client_id` — where to get it
`--admin-oidc-client-id` is the **numeric Zitadel-assigned client_id**, not the human-readable app name. When `fleet_staging_install` provisions the `harmony-cli` device-code app, Zitadel generates a numeric client_id like `371639797157987125@fleet`. The staging install prints this value in its final summary block — copy it from there.
If you ever need to look it up after the fact, it's in the staging-install operator's local cache:
```bash
jq -r '.apps."harmony-cli"' ~/.local/share/harmony/zitadel/client-config.json
```
That cache is on the **operator's workstation** (the host that ran `fleet_staging_install`). The device itself doesn't have it — the operator must pass `--admin-oidc-client-id <numeric>` explicitly when running enrollment from the device, or set `HARMONY_ZITADEL_ADMIN_TOKEN` to skip SSO entirely.
## Common failure modes
- **`invalid_client: no active client not found`** — `--admin-oidc-client-id` is wrong. Most likely you passed the app name (`harmony-cli`) instead of the numeric client_id. See above.
- **`Project '<name>' not visible to the current Zitadel token`** — your SSO token's primary org differs from where the project lives. Most common when the staging install created the project as the system iam-admin user (system org) and you're signing in with a personal Zitadel account (your own org). Pass `--admin-org-id <id>` (find it in Zitadel UI → Organization → Resource ID). Alternatively, the score now logs `projects visible in current org context: …` right before the error — that list shows what your token CAN see, which usually pinpoints the org mismatch.
- **403 on management API** — operator SSO account doesn't hold a role permitting management calls. Grant `IAM_OWNER` (or equivalent scoped permission) in Zitadel admin UI.
- **`CaUsedAsEndEntity` from rustls** — talking to a dev cluster with a self-signed cert. Pass `--danger-accept-invalid-certs`.
- **Browser doesn't open over SSH** — `webbrowser` can't find a GUI. The score still prints the URL; copy it into a browser on your workstation.
## CLI flags
Run `fleet_device_enroll --help` for the full surface.