Fleet staging install on OKD
End-to-end runbook for deploying the fleet stack (Zitadel + NATS +
auth callout + operator) on an OKD cluster, with a remote agent
connecting through the public WSS endpoint. It targets the staging
shape: single-instance NATS, public Zitadel + NATS WS Routes with
edge-TLS via cert-manager, and env-only Secret config (no volume
mounts), so the default restricted-v2 SCC is enough.
Time budget: ~30 min on a warm cluster, ~60 min cold.
0. Prereqs
- oc CLI logged in with cluster-admin (or at least cluster-scoped privileges on the namespaces below: namespace create, CRD apply, ClusterRole create).
- podman on your laptop, authenticated to the destination registry (default hub.nationtech.io/harmony; run podman login if needed).
- helm on PATH (used by Harmony's helm chart Scores).
- The staging cluster has:
  - cert-manager installed and a ClusterIssuer ready for the cluster's base domain (default name: letsencrypt-prod; override with --cluster-issuer if yours differs).
  - CNPG (cloudnative-pg) operator installed (Zitadel relies on it for its Postgres cluster).
- DNS: the chosen --base-domain resolves to the OKD ingress router. For cb1.nationtech.io, that means *.cb1.nationtech.io or at least sso-staging.cb1.nationtech.io and nats-fleet-staging.cb1.nationtech.io must point at the OKD router VIP. If you're using the cluster's apps domain (apps.cb1.nationtech.io), set --base-domain accordingly.
- Access to write a [credentials] TOML on whichever machine will run the agent (your laptop is fine for the demo).
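The CLI prerequisites above can be checked before starting. This is a minimal sketch: the helper name missing_tools is hypothetical (not part of Harmony), and it only confirms the binaries are on PATH, not that you are logged in to the cluster or the registry.

```shell
# missing_tools: print each named command that is NOT on PATH, one per line.
# Hypothetical helper for this runbook, not part of Harmony.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (prints nothing when all three are installed):
#   missing_tools oc podman helm
```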
1. Build and push images
The staging install pulls operator + auth-callout images from your registry. The helper script builds both, tags them, and pushes:
cd /path/to/harmony
./fleet/scripts/build_and_push_images.sh
Defaults: REGISTRY=hub.nationtech.io/harmony, IMAGE_TAG=dev,
PUSH=1. Override with environment variables. Skip the push (e.g.
to inspect the images locally first) with PUSH=0.
Output ends with the exact --operator-image / --callout-image
flags to paste into step 4.
Verify:
podman images | grep harmony # both refs present locally
podman pull hub.nationtech.io/harmony/harmony-fleet-operator:dev # registry confirmed
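If you override REGISTRY or IMAGE_TAG, the image flags for step 4 change with them. The sketch below recomputes those flags, assuming the script tags images as REGISTRY/name:IMAGE_TAG (which matches the defaults used in step 4). The image_ref helper is hypothetical; the script's own output remains the authoritative source.

```shell
# image_ref: compose <registry>/<name>:<tag>.
# Hypothetical helper; assumes the build script uses this naming scheme.
image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Same defaults as the build script, overridable from the environment.
REGISTRY=${REGISTRY:-hub.nationtech.io/harmony}
IMAGE_TAG=${IMAGE_TAG:-dev}

OPERATOR_IMAGE=$(image_ref "$REGISTRY" harmony-fleet-operator "$IMAGE_TAG")
CALLOUT_IMAGE=$(image_ref "$REGISTRY" harmony-nats-callout "$IMAGE_TAG")
echo "--operator-image $OPERATOR_IMAGE --callout-image $CALLOUT_IMAGE"
```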
2. Create namespaces
oc new-project zitadel-staging
oc new-project fleet-staging
If hub.nationtech.io requires authentication, add the imagePullSecret
to both namespaces (each pod that pulls from the registry needs it):
# adjust to whatever you have for hub.nationtech.io
oc -n fleet-staging secrets link default <hub-pull-secret> --for=pull
oc -n zitadel-staging secrets link default <hub-pull-secret> --for=pull
(For Zitadel + Postgres the chart pulls from public registries, so
the secret is only strictly required in fleet-staging for the
operator + callout images. Linking both is safest.)
3. Set KUBECONFIG and verify cluster context
export KUBECONFIG=$ADMIN_KUBECONFIG
oc whoami
oc config current-context
oc get clusterversion # confirm OKD reachable + healthy
The install runs with this KUBECONFIG. Double-check before
running step 4 — Harmony's K8sAnywhereTopology::from_env() honors
this and there's no second confirmation prompt.
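Because there is no confirmation prompt, a small guard in your own shell can stand in for one. This is a sketch only: require_context and EXPECTED_CONTEXT are hypothetical names, and the context string to compare against is whatever oc config current-context prints for your staging cluster.

```shell
# require_context: succeed only when the current context equals the expected one.
# Hypothetical guard for this runbook; Harmony itself does not prompt.
require_context() {
  expected=$1
  actual=$2
  if [ "$actual" != "$expected" ]; then
    echo "refusing to continue: context is '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Usage, with EXPECTED_CONTEXT set to your staging context name:
#   require_context "$EXPECTED_CONTEXT" "$(oc config current-context)" || exit 1
```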
4. Run fleet_staging_install
cargo run --release -p example_fleet_staging_install -- \
--base-domain cb1.nationtech.io \
--operator-image hub.nationtech.io/harmony/harmony-fleet-operator:dev \
--callout-image hub.nationtech.io/harmony/harmony-nats-callout:dev
Optional flags (defaults shown):
--cluster-issuer letsencrypt-prod
--fleet-namespace fleet-staging
--zitadel-namespace zitadel-staging
--nats-account FLEET
--zitadel-version v4.12.1
--project-name fleet
--admin-role fleet-admin
--device-role device
--operator-username fleet-operator
--admin-username fleet-ops
Step-by-step the binary does:
- Zitadel helm install: Postgres (CNPG) + Zitadel chart into --zitadel-namespace. Edge-TLS Route at sso-staging.<base> with cert-manager-driven certificate.
- Zitadel setup: project, two roles (fleet-admin, device), API app nats, and two machine users (fleet-ops for manual admin work, fleet-operator for the operator pod). Both get JSON keys cached at ~/.local/share/harmony/zitadel/client-config.json.
- NATS install: single-instance JetStream, auth_callout block referencing the issuer NKey pubkey, WebSocket listener on 8080. Edge-TLS Route at nats-fleet-staging.<base>.
- Auth callout deployment: env-only Secret config (no mounts), wired to the same issuer + Zitadel project audience.
- Operator deployment: single Secret holding the credentials TOML (with the operator's JSON keyfile inlined). One env var, FLEET_OPERATOR_CREDENTIALS_TOML, no volumes.
The binary prints the URLs + project_id at the end. Save that block — you'll need the project_id for the agent config.
Expected output tail:
=== fleet-staging install complete ===
Zitadel: https://sso-staging.cb1.nationtech.io/
NATS WS public: wss://nats-fleet-staging.cb1.nationtech.io/
NATS in-cluster: nats://fleet-nats.fleet-staging.svc.cluster.local:4222
Operator: oc -n fleet-staging get deploy/harmony-fleet-operator
Auth callout: oc -n fleet-staging get deploy/fleet-callout
Project id: 371xxxxxxxxxxxxxxx
Admin user: fleet-ops (machine key in ~/.local/share/harmony/zitadel/client-config.json)
Operator user: fleet-operator (machine key embedded in operator's Secret)
5. Verify each layer
5.1 Zitadel reachable, certificate provisioned
# pod up
oc -n zitadel-staging get pods
# expect: zitadel-* Running, zitadel-pg-1/2 Running
# Route + certificate (cert-manager creates the secret)
oc -n zitadel-staging get route
oc -n zitadel-staging get certificate
# OIDC discovery from the public URL
curl -s https://sso-staging.cb1.nationtech.io/.well-known/openid-configuration | jq .issuer
# expect: "https://sso-staging.cb1.nationtech.io"
If curl fails with TLS errors, the cert-manager certificate isn't
ready yet. Watch its status:
oc -n zitadel-staging describe certificate
oc -n cert-manager logs deploy/cert-manager --tail=50
A Ready condition of True, with secretName: zitadel-tls populated,
means the Route can serve HTTPS.
5.2 NATS pod up, callout connected
oc -n fleet-staging get pods
# expect:
# fleet-nats-0 2/2 Running (NATS + reloader sidecar)
# fleet-callout-... 1/1 Running
oc -n fleet-staging logs deploy/fleet-callout --tail=30 | grep -E "starting|JWKS|listening"
# expect:
# starting harmony NATS auth callout
# JWKS refreshed count=2
# auth callout service listening subject="$SYS.REQ.USER.AUTH"
If the callout pod is in CrashLoopBackOff:
oc -n fleet-staging logs deploy/fleet-callout --previous --tail=30
Most common: OIDC issuer URL mismatch. The callout's
OIDC_ISSUER_URL env must byte-equal what Zitadel emits as iss in
its discovery doc. Check both:
oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_ISSUER_URL
# vs.
curl -s https://sso-staging.cb1.nationtech.io/.well-known/openid-configuration | jq .issuer
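Comparing the two values by eye is error-prone, and a trailing slash is easy to miss. The sketch below does the exact comparison and calls out the trailing-slash case. The function name issuer_matches is hypothetical, and the usage line assumes jq is installed (it already appears in the checks above).

```shell
# issuer_matches: exact string comparison, as the callout requires.
# On mismatch, report whether the only difference is a trailing slash.
# Hypothetical helper for this runbook.
issuer_matches() {
  configured=$1
  discovered=$2
  if [ "$configured" = "$discovered" ]; then
    return 0
  fi
  if [ "${configured%/}" = "${discovered%/}" ]; then
    echo "mismatch: differs only by a trailing slash" >&2
  else
    echo "mismatch: '$configured' vs '$discovered'" >&2
  fi
  return 1
}

# Usage:
#   issuer_matches \
#     "$(oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_ISSUER_URL)" \
#     "$(curl -s https://sso-staging.cb1.nationtech.io/.well-known/openid-configuration | jq -r .issuer)"
```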
5.3 Operator authenticated and running
oc -n fleet-staging get pods -l app.kubernetes.io/name=harmony-fleet-operator
oc -n fleet-staging logs deploy/harmony-fleet-operator --tail=30
Look for, in order:
minted fresh Zitadel access token audience=<project_id>
connected successfully server=4222
NATS connected
KV bucket ready bucket=desired-state
starting Deployment controller
device-reconciler: watching device-info KV
aggregator: startup complete
If you see Permissions Violation errors, the callout's
OIDC_AUDIENCE (the project_id captured at deploy time) doesn't match
the project_id in Zitadel today. Re-run step 4: the Zitadel setup
re-queries Zitadel for the live project_id, so the re-run picks up the current value.
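The "in order" check on the operator log can be done mechanically. This is a minimal sketch with a hypothetical helper, check_log_order, which reads a log on stdin and verifies that each marker string appears after the previous one. The markers in the usage line are substrings of the log lines listed above; they are matched literally and should not contain backslashes.

```shell
# check_log_order: read a log on stdin and verify that every marker given as
# an argument appears, each one on a later line than the previous marker.
# Hypothetical helper for this runbook.
check_log_order() {
  log=$(cat)
  last=0
  for marker in "$@"; do
    # Line number of the first line after $last that contains the marker.
    n=$(printf '%s\n' "$log" |
      awk -v m="$marker" -v start="$last" 'NR > start && index($0, m) { print NR; exit }')
    if [ -z "$n" ]; then
      echo "missing or out of order: $marker" >&2
      return 1
    fi
    last=$n
  done
}

# Usage:
#   oc -n fleet-staging logs deploy/harmony-fleet-operator --tail=200 |
#     check_log_order "minted fresh Zitadel access token" "NATS connected" \
#       "KV bucket ready" "starting Deployment controller"
```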
5.4 NATS WSS reachable from outside the cluster
curl -sSI https://nats-fleet-staging.cb1.nationtech.io/ | head -5
Expect a 4xx (NATS doesn't speak HTTP, but the TLS handshake should succeed and you'll get back a WebSocket-upgrade-related response). A connection refused or TLS handshake error means the Route or cert-manager is unhappy.
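To turn that probe into a single verdict, curl can report just the status code. The sketch below classifies it; wss_probe_verdict is a hypothetical helper, and it relies on curl printing 000 for %{http_code} when no HTTP response was received.

```shell
# wss_probe_verdict: interpret the HTTP status code returned by the probe.
# Hypothetical helper for this runbook.
wss_probe_verdict() {
  case "$1" in
    4??) echo "ok: TLS and Route work, NATS rejected plain HTTP" ;;
    000) echo "unreachable: no HTTP response (DNS, connection, or TLS failure)" ;;
    *)   echo "unexpected: HTTP $1" ;;
  esac
}

# Usage:
#   wss_probe_verdict "$(curl -s -o /dev/null -w '%{http_code}' https://nats-fleet-staging.cb1.nationtech.io/)"
```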
5.5 CRDs registered
oc get crd | grep fleet.nationtech.io
# expect:
# deployments.fleet.nationtech.io
# devices.fleet.nationtech.io
6. Connect a remote agent
The fleet agent runs on the device (laptop, Pi, anywhere with outbound HTTPS). It needs:
- Its own Zitadel machine user with the device role grant.
- The JSON keyfile from that user.
- A [credentials] TOML pointing at the public Zitadel + the WSS NATS URL.
6.1 Mint a per-device machine user
Use oc port-forward or a helper to call Zitadel's API. Easier
path: drop a quick Score that adds one machine user. For now,
do it from the Zitadel UI:
- Browse to https://sso-staging.cb1.nationtech.io/ui/console/, log in as the human admin (password from Zitadel ConfigMap on first install; see docs/guides/fleet-zitadel-faq.md).
- Pick the Default org → fleet project → Roles → confirm device exists.
- Org → Users → Service Users → New: name device-laptop-01, userName device-laptop-01. Save.
- The user's "Personal Information" tab → Authorizations or "Authorization" → "+New": grant the fleet project's device role to this user.
- The user's "Keys" tab → "+New", type JSON, expiration future date. Download the keyfile JSON (Zitadel only shows the private half once). Save as ~/.local/share/harmony/fleet/agents/device-laptop-01.json.
6.2 Build the agent locally
cargo build --release -p harmony-fleet-agent
ls -la target/release/harmony-fleet-agent
6.3 Render the agent's config TOML
PROJECT_ID=$(oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_AUDIENCE)
cat > /tmp/fleet-agent-config.toml <<EOF
[agent]
device_id = "device-laptop-01"
[nats]
urls = ["wss://nats-fleet-staging.cb1.nationtech.io"]
[credentials]
type = "zitadel-jwt"
key_path = "/etc/fleet-agent/zitadel-key.json"
oidc_issuer_url = "https://sso-staging.cb1.nationtech.io"
audience = "$PROJECT_ID"
[labels]
env = "staging"
location = "laptop"
arch = "$(uname -m)"
EOF
The agent's username convention is device-<device_id>, matching
the callout's DEVICE_ID_PREFIX_STRIP=device-. The Zitadel machine
user must literally be device-laptop-01 for the JWT-bearer flow
to extract the right device id.
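The heredoc in 6.3 is unquoted, so $PROJECT_ID is expanded at render time. If the oc exec call failed, the file silently ends up with audience = "". A quick check on the rendered file catches that before the agent is launched. This is a sketch, and config_has_audience is a hypothetical helper name.

```shell
# config_has_audience: succeed only if the rendered TOML contains a
# non-empty, double-quoted audience value.
# Hypothetical helper for this runbook.
config_has_audience() {
  file=$1
  # Match: audience = "<one or more non-quote characters>"
  grep -Eq '^[[:space:]]*audience[[:space:]]*=[[:space:]]*"[^"]+"' "$file"
}

# Usage:
#   config_has_audience /tmp/fleet-agent-config.toml ||
#     echo "audience is empty: PROJECT_ID was not set when rendering" >&2
```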
6.4 Run the agent
sudo mkdir -p /etc/fleet-agent
sudo cp ~/.local/share/harmony/fleet/agents/device-laptop-01.json \
/etc/fleet-agent/zitadel-key.json
sudo chown $(id -u):$(id -g) /etc/fleet-agent/zitadel-key.json
sudo chmod 0400 /etc/fleet-agent/zitadel-key.json
FLEET_AGENT_CONFIG=/tmp/fleet-agent-config.toml \
RUST_LOG=info \
./target/release/harmony-fleet-agent
Watch the log:
fleet-agent-v0 starting device_id=device-laptop-01
podman socket ready
inventory loaded hostname=...
connecting to NATS ["wss://nats-fleet-staging.cb1.nationtech.io"]
minted fresh Zitadel access token audience=<project_id>
connected successfully server=...
NATS connected
fleet publisher ready
watching KV keys filter=device-laptop-01.>
If you hit Permissions Violation errors after the connected successfully line:
- check oc -n fleet-staging logs deploy/fleet-callout --tail=20; it shows why the JWT was rejected (audience, role claim, device_id format).
6.5 Verify the operator created a Device CR
oc get devices
# expect:
# NAME AGE
# device-laptop-01 Xs
oc describe device device-laptop-01
# labels block reflects what the agent sent in [labels]
7. Drive a deployment end to end
cat > /tmp/hello-web.yaml <<'EOF'
apiVersion: fleet.nationtech.io/v1alpha1
kind: Deployment
metadata:
name: hello-web
spec:
score:
type: PodmanV0
data:
services:
- name: testnginx
image: docker.io/nginx:latest
ports:
- "8080:80"
targetSelector:
matchLabels:
env: staging
rollout:
strategy: Immediate
EOF
oc apply -f /tmp/hello-web.yaml
# Status reflect-back from the agent (takes ~5-15s)
oc get deployment.fleet.nationtech.io hello-web -o yaml | yq '.status'
# expect:
# aggregate:
# matchedDeviceCount: 1
# succeeded: 1
# failed: 0
# pending: 0
# On the device:
podman ps
# expect: testnginx running, port 8080→80
curl -sS http://localhost:8080 | head -3
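Since the status takes a few seconds to reflect back, polling is more reliable than a single read. The sketch below is a generic retry loop; retry_until and status_succeeded are hypothetical names, and the jsonpath in the usage comment assumes the status layout shown above (status.aggregate.succeeded).

```shell
# retry_until: run a command up to <attempts> times, sleeping <delay> seconds
# between attempts; succeed as soon as the command succeeds.
# Hypothetical helper for this runbook.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   status_succeeded() {
#     [ "$(oc get deployment.fleet.nationtech.io hello-web \
#          -o jsonpath='{.status.aggregate.succeeded}')" = "1" ]
#   }
#   retry_until 10 3 status_succeeded
```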
8. Common failure modes
| Symptom | Cause / fix |
|---|---|
| cert-manager Certificate stuck False for 5+ min | DNS for the host doesn't resolve to the OKD router yet. dig sso-staging.<base> +short should match the cluster's ingress IP. Or your letsencrypt-prod ClusterIssuer is using HTTP01 and the route isn't reachable from Let's Encrypt. |
| Operator pod Error: constructing CredentialSource | The credentials TOML in the Secret is malformed. Run oc -n fleet-staging get secret harmony-fleet-operator-secrets -o jsonpath='{.data.credentials\.toml}' \| base64 -d and inspect; the key_json field must be a valid JSON keyfile string (multi-line triple-quoted in TOML is fine). |
| Operator pod Permissions Violation after NATS connected | Issuer pubkey or project_id mismatch between callout and NATS chart values, or Zitadel was reset and the operator's machine key no longer authenticates. Re-run cargo run -p example_fleet_staging_install. |
| Agent: Zitadel token endpoint returned 400: invalid_grant_type | TOML scope assembly bug or wrong audience. Confirm audience matches oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_AUDIENCE. |
| Agent: connects, then Permissions Violation for Publish to "$KV.device-info..." | The device's machine user has no device role grant. Add via Zitadel UI → user → Authorizations. |
| Deployment.fleet.nationtech.io CR applied but matchedDeviceCount: 0 | targetSelector.matchLabels doesn't match any Device CR's metadata.labels. Check with oc get devices --show-labels. |
| Container redeploys every 30s on the device | Known FIXME: the agent's matches_spec returns false for any spec with env or volumes. For the demo, use trivial specs (the hello-web above is fine). Tracked in harmony/src/modules/podman/topology.rs. |
9. Tear down
The Helm releases own the bulk of the resources, so the cleanest recovery from a broken state is:
helm -n zitadel-staging uninstall zitadel
helm -n fleet-staging uninstall fleet-nats
oc -n fleet-staging delete deploy/harmony-fleet-operator deploy/fleet-callout
oc -n fleet-staging delete secret harmony-fleet-operator-secrets fleet-callout-secrets
oc -n zitadel-staging delete clusters.postgresql.cnpg.io zitadel-pg --ignore-not-found
oc delete project zitadel-staging fleet-staging
# CRDs persist (helm.sh/resource-policy: keep). Delete by hand if you
# really want a clean slate:
oc delete crd deployments.fleet.nationtech.io devices.fleet.nationtech.io
The host-side ~/.local/share/harmony/zitadel/client-config.json
caches machine keys + project IDs from this install. Wipe it before
re-installing against a freshly reset Zitadel:
rm -f ~/.local/share/harmony/zitadel/client-config.json
(The cache-vs-live drift bug is fixed — ZitadelSetupScore now
re-queries Zitadel for IDs on every apply — but stale machine-key
material from a deleted Zitadel project will fail at JWT-bearer
mint until you delete + re-create.)
10. Cross-reference
- fleet-zitadel-faq.md: concepts behind Zitadel projects, roles, machine users, audit-trail decisions.
- fleet-manual-token-mint.md: worked recipe for minting an admin token by hand and using it with nats kv commands.
- examples/fleet_staging_install/src/main.rs: the install code itself; the comments narrate every step.
- harmony/src/modules/fleet/server.rs::FleetServerScore: composable form of the same install for callers that don't need the intermediate read of ZitadelClientConfig.