harmony/docs/guides/fleet-staging-install.md
Jean-Gabriel Gill-Couture f727e4dbea docs(fleet): step-by-step OKD staging install runbook
Walks through: build+push images, namespace creation, KUBECONFIG
sanity, fleet_staging_install run, layer-by-layer verification
(Zitadel cert + URL, NATS pod + callout subscribe, operator auth +
controller, public WSS reachable, CRDs registered), per-device
machine user creation in Zitadel UI, agent config TOML render +
launch, end-to-end Deployment CRD walk, common failure modes with
diagnostic commands, teardown.

Cross-linked from the existing FAQ + manual-mint-recipe guides.
2026-05-05 12:01:16 -04:00


Fleet staging install on OKD

End-to-end runbook for deploying the fleet stack (Zitadel + NATS + auth callout + operator) on an OKD cluster, with a remote agent connecting through the public WSS endpoint. Targets the staging shape — single-instance NATS, public Zitadel + NATS WS Routes with edge-TLS via cert-manager, env-only Secret config (no volume mounts) so default restricted-v2 SCC is enough.

Time budget: ~30 min on a warm cluster, ~60 min cold.

0. Prereqs

  • oc CLI logged in with cluster-admin (or at least cluster-scoped privileges on the namespaces below — namespace create, CRD apply, ClusterRole create).
  • podman on your laptop, authenticated to the destination registry (default hub.nationtech.io/harmony; run podman login if needed).
  • helm on PATH (used by Harmony's helm chart Scores).
  • The staging cluster has:
    • cert-manager installed and a ClusterIssuer ready for the cluster's base domain (default name: letsencrypt-prod — override with --cluster-issuer if yours differs).
    • CNPG (cloudnative-pg) operator installed (Zitadel relies on it for its Postgres cluster).
    • DNS: the chosen --base-domain resolves to the OKD ingress router. For cb1.nationtech.io, that means *.cb1.nationtech.io or at least sso-staging.cb1.nationtech.io and nats-fleet-staging.cb1.nationtech.io must point at the OKD router VIP. If you're using the cluster's apps domain (apps.cb1.nationtech.io), set --base-domain accordingly.
  • Access to write a [credentials] TOML on whichever machine will run the agent (your laptop is fine for the demo).
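The DNS prerequisite above can be checked before starting. The sketch below only derives the two public hostnames from a base domain; fleet_hosts is a hypothetical helper name, and the lookup loop in the comment assumes dig is installed.

```shell
# Print the two public hostnames the install exposes for a given base domain.
# fleet_hosts is a hypothetical helper, not part of Harmony.
fleet_hosts() {
  local base="$1"
  printf '%s\n' "sso-staging.${base}" "nats-fleet-staging.${base}"
}

# Usage (needs dig and network access); compare each answer with the router VIP:
#   for h in $(fleet_hosts cb1.nationtech.io); do
#     printf '%s -> %s\n' "$h" "$(dig +short "$h" | tail -n1)"
#   done
```

Both names should resolve to the same address as the OKD router VIP before step 4 is run.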

1. Build and push images

The staging install pulls operator + auth-callout images from your registry. The helper script builds both, tags them, and pushes:

cd /path/to/harmony
./fleet/scripts/build_and_push_images.sh

Defaults: REGISTRY=hub.nationtech.io/harmony, IMAGE_TAG=dev, PUSH=1. Override with environment variables. Skip the push (e.g. to inspect the images locally first) with PUSH=0.

Output ends with the exact --operator-image / --callout-image flags to paste into step 4.
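As a rough illustration of how the defaults map to the flags used in step 4: the helper script's internals are not shown in this guide, so install_image_flags below is a hypothetical function that only mirrors the documented defaults and the image names that appear in step 4.

```shell
# Hypothetical helper: derive the step-4 image flags from a registry and tag.
# Arguments default to the documented REGISTRY and IMAGE_TAG values.
install_image_flags() {
  local registry="${1:-hub.nationtech.io/harmony}" tag="${2:-dev}"
  printf '%s\n' \
    "--operator-image ${registry}/harmony-fleet-operator:${tag}" \
    "--callout-image ${registry}/harmony-nats-callout:${tag}"
}

# Usage: install_image_flags "${REGISTRY:-hub.nationtech.io/harmony}" "${IMAGE_TAG:-dev}"
```

If the script's printed flags differ from this output, trust the script.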

Verify:

podman images | grep harmony   # both refs present locally
podman pull hub.nationtech.io/harmony/harmony-fleet-operator:dev   # registry confirmed

2. Create namespaces

oc new-project zitadel-staging
oc new-project fleet-staging

If hub.nationtech.io requires authentication, add the imagePullSecret to both namespaces (each pod that pulls from the registry needs it):

# adjust to whatever you have for hub.nationtech.io
oc -n fleet-staging   secrets link default <hub-pull-secret> --for=pull
oc -n zitadel-staging secrets link default <hub-pull-secret> --for=pull

(For Zitadel + Postgres the chart pulls from public registries, so the secret is only strictly required in fleet-staging for the operator + callout images. Linking both is safest.)

3. Set KUBECONFIG and verify cluster context

export KUBECONFIG=$ADMIN_KUBECONFIG
oc whoami
oc config current-context
oc get clusterversion        # confirm OKD reachable + healthy

The install runs with this KUBECONFIG. Double-check before running step 4 — Harmony's K8sAnywhereTopology::from_env() honors this and there's no second confirmation prompt.
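Because there is no confirmation prompt, a small guard in front of the install command can help. This is a minimal sketch: require_context is a hypothetical helper, and the expected context name is whatever your staging kubeconfig uses.

```shell
# Hypothetical guard: fail unless the active context is the expected one.
require_context() {
  local expected="$1" actual="$2"
  if [ "$actual" != "$expected" ]; then
    printf 'refusing to continue: context is "%s", expected "%s"\n' \
      "$actual" "$expected" >&2
    return 1
  fi
}

# Usage (replace <staging-context> with your own context name):
#   require_context "<staging-context>" "$(oc config current-context)" \
#     && cargo run --release -p example_fleet_staging_install -- ...
```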

4. Run fleet_staging_install

cargo run --release -p example_fleet_staging_install -- \
  --base-domain cb1.nationtech.io \
  --operator-image hub.nationtech.io/harmony/harmony-fleet-operator:dev \
  --callout-image  hub.nationtech.io/harmony/harmony-nats-callout:dev

Optional flags (defaults shown):

  --cluster-issuer    letsencrypt-prod
  --fleet-namespace   fleet-staging
  --zitadel-namespace zitadel-staging
  --nats-account      FLEET
  --zitadel-version   v4.12.1
  --project-name      fleet
  --admin-role        fleet-admin
  --device-role       device
  --operator-username fleet-operator
  --admin-username    fleet-ops

Step by step, the binary does the following:

  1. Zitadel helm install — Postgres (CNPG) + Zitadel chart into --zitadel-namespace. Edge-TLS Route at sso-staging.<base> with cert-manager-driven certificate.
  2. Zitadel setup — project, two roles (fleet-admin, device), API app nats, and two machine users (fleet-ops for manual admin work, fleet-operator for the operator pod). Both get JSON keys cached at ~/.local/share/harmony/zitadel/client-config.json.
  3. NATS install — single-instance JetStream, auth_callout block referencing the issuer NKey pubkey, WebSocket listener on 8080. Edge-TLS Route at nats-fleet-staging.<base>.
  4. Auth callout deployment — env-only Secret config (no mounts), wired to the same issuer + Zitadel project audience.
  5. Operator deployment — single Secret holding the credentials TOML (with the operator's JSON keyfile inlined). One env var, FLEET_OPERATOR_CREDENTIALS_TOML, no volumes.

The binary prints the URLs + project_id at the end. Save that block — you'll need the project_id for the agent config.

Expected output tail:

=== fleet-staging install complete ===
Zitadel:           https://sso-staging.cb1.nationtech.io/
NATS WS public:    wss://nats-fleet-staging.cb1.nationtech.io/
NATS in-cluster:   nats://fleet-nats.fleet-staging.svc.cluster.local:4222
Operator:          oc -n fleet-staging get deploy/harmony-fleet-operator
Auth callout:      oc -n fleet-staging get deploy/fleet-callout
Project id:        371xxxxxxxxxxxxxxx
Admin user:        fleet-ops (machine key in ~/.local/share/harmony/zitadel/client-config.json)
Operator user:     fleet-operator (machine key embedded in operator's Secret)

5. Verify each layer

5.1 Zitadel reachable, certificate provisioned

# pod up
oc -n zitadel-staging get pods
# expect: zitadel-* Running, zitadel-pg-1/2 Running

# Route + certificate (cert-manager creates the secret)
oc -n zitadel-staging get route
oc -n zitadel-staging get certificate

# OIDC discovery from the public URL
curl -s https://sso-staging.cb1.nationtech.io/.well-known/openid-configuration | jq .issuer
# expect: "https://sso-staging.cb1.nationtech.io"

If curl fails with TLS errors, the cert-manager certificate isn't ready yet. Watch its status:

oc -n zitadel-staging describe certificate
oc -n cert-manager logs deploy/cert-manager --tail=50

A Ready condition of True together with a populated secretName: zitadel-tls means the Route can serve HTTPS.
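Instead of re-running describe by hand, oc wait can block until the condition flips. The wrapper below is a sketch; wait_for_certificates is a hypothetical name, and the 300s default timeout is an arbitrary choice.

```shell
# Block until every Certificate in the namespace reports Ready=True,
# or fail once the timeout expires. Needs a logged-in oc.
wait_for_certificates() {
  local ns="${1:-zitadel-staging}" timeout="${2:-300s}"
  oc -n "$ns" wait certificate --all --for=condition=Ready --timeout="$timeout"
}

# Usage: wait_for_certificates zitadel-staging 600s
```

A non-zero exit means at least one Certificate was still not Ready when the timeout hit, which is the cue to look at the cert-manager logs.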

5.2 NATS pod up, callout connected

oc -n fleet-staging get pods
# expect:
#   fleet-nats-0          2/2 Running   (NATS + reloader sidecar)
#   fleet-callout-...     1/1 Running

oc -n fleet-staging logs deploy/fleet-callout --tail=30 | grep -E "starting|JWKS|listening"
# expect:
#   starting harmony NATS auth callout
#   JWKS refreshed count=2
#   auth callout service listening subject="$SYS.REQ.USER.AUTH"

If the callout pod is in CrashLoopBackOff:

oc -n fleet-staging logs deploy/fleet-callout --previous --tail=30

Most common: OIDC issuer URL mismatch. The callout's OIDC_ISSUER_URL env must byte-equal what Zitadel emits as iss in its discovery doc. Check both:

oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_ISSUER_URL
# vs.
curl -s https://sso-staging.cb1.nationtech.io/.well-known/openid-configuration | jq .issuer
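Since the comparison has to be exact, eyeballing the two values can miss a trailing slash. A minimal sketch of a strict check follows; issuer_matches is a hypothetical helper, and the usage line switches to jq -r so the discovered value is compared without JSON quotes.

```shell
# Hypothetical helper: succeed only when both issuer strings are identical.
# Prints a hint to stderr when they differ by nothing but a trailing slash.
issuer_matches() {
  local configured="$1" discovered="$2"
  if [ "$configured" = "$discovered" ]; then
    return 0
  fi
  if [ "${configured%/}" = "${discovered%/}" ]; then
    echo "mismatch: differs only by a trailing slash" >&2
  else
    echo "mismatch: '$configured' vs '$discovered'" >&2
  fi
  return 1
}

# Usage:
#   issuer_matches \
#     "$(oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_ISSUER_URL)" \
#     "$(curl -s https://sso-staging.cb1.nationtech.io/.well-known/openid-configuration | jq -r .issuer)"
```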

5.3 Operator authenticated and running

oc -n fleet-staging get pods -l app.kubernetes.io/name=harmony-fleet-operator
oc -n fleet-staging logs deploy/harmony-fleet-operator --tail=30

Look for, in order:

minted fresh Zitadel access token audience=<project_id>
connected successfully server=4222
NATS connected
KV bucket ready bucket=desired-state
starting Deployment controller
device-reconciler: watching device-info KV
aggregator: startup complete

If you see Permissions Violation errors, the callout's OIDC_AUDIENCE (the project_id captured at deploy time) no longer matches the project_id currently in Zitadel. Re-run step 4: the Zitadel setup re-queries the live IDs on every apply, which refreshes the value.

5.4 NATS WSS reachable from outside the cluster

curl -sSI https://nats-fleet-staging.cb1.nationtech.io/ | head -5

Expect a 4xx. The endpoint only serves WebSocket upgrades, so a plain HTTP request is rejected, but the TLS handshake should succeed and the response should be WebSocket-upgrade-related. A connection refused or TLS handshake error means the Route or cert-manager is unhappy.

5.5 CRDs registered

oc get crd | grep fleet.nationtech.io
# expect:
#   deployments.fleet.nationtech.io
#   devices.fleet.nationtech.io

6. Connect a remote agent

The fleet agent runs on the device (laptop, Pi, anywhere with outbound HTTPS). It needs:

  • Its own Zitadel machine user with the device role grant.
  • The JSON keyfile from that user.
  • A [credentials] TOML pointing at the public Zitadel + the WSS NATS URL.

6.1 Mint a per-device machine user

Zitadel's API can create the user (through oc port-forward or a helper), and a small Score that adds one machine user would be the easier long-term path. For now, do it from the Zitadel UI:

  1. Browse to https://sso-staging.cb1.nationtech.io/ui/console/, log in as the human admin (password from Zitadel ConfigMap on first install — see docs/guides/fleet-zitadel-faq.md).
  2. Pick the Default org → fleet project → Roles → confirm device exists.
  3. Org → Users → Service Users → New: name device-laptop-01, userName device-laptop-01. Save.
  4. From the user's "Personal Information" tab, open Authorizations (or "Authorization") → "+New" and grant the fleet project's device role to this user.
  5. On the user's "Keys" tab, click "+New", choose type JSON, and set the expiration to a future date. Download the keyfile JSON right away, because Zitadel shows the private half only once. Save it as ~/.local/share/harmony/fleet/agents/device-laptop-01.json.

6.2 Build the agent locally

cargo build --release -p harmony-fleet-agent
ls -la target/release/harmony-fleet-agent

6.3 Render the agent's config TOML

PROJECT_ID=$(oc -n fleet-staging exec deploy/fleet-callout -- printenv OIDC_AUDIENCE)
cat > /tmp/fleet-agent-config.toml <<EOF
[agent]
device_id = "device-laptop-01"

[nats]
urls = ["wss://nats-fleet-staging.cb1.nationtech.io"]

[credentials]
type = "zitadel-jwt"
key_path = "/etc/fleet-agent/zitadel-key.json"
oidc_issuer_url = "https://sso-staging.cb1.nationtech.io"
audience = "$PROJECT_ID"

[labels]
env = "staging"
location = "laptop"
arch = "$(uname -m)"
EOF

The agent's username convention is device-<device_id>, matching the callout's DEVICE_ID_PREFIX_STRIP=device-. The Zitadel machine user must literally be device-laptop-01 for the JWT-bearer flow to extract the right device id.
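One thing to watch in the render step: the heredoc is unquoted, so $PROJECT_ID is expanded when the file is written. If the oc exec lookup fails, the file ends up with an empty audience and no error. A minimal sketch of a sanity check follows; check_audience is a hypothetical helper that assumes the audience line keeps the exact key = "value" form used in the config above.

```shell
# Extract the audience value from a rendered agent config.
config_audience() {
  sed -n 's/^audience = "\(.*\)"$/\1/p' "$1"
}

# Hypothetical check: print the audience, or fail if it is empty or missing.
check_audience() {
  local value
  value="$(config_audience "$1")"
  if [ -z "$value" ]; then
    echo "audience is empty in $1; PROJECT_ID was probably not set" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Usage: check_audience /tmp/fleet-agent-config.toml
```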

6.4 Run the agent

sudo mkdir -p /etc/fleet-agent
sudo cp ~/.local/share/harmony/fleet/agents/device-laptop-01.json \
       /etc/fleet-agent/zitadel-key.json
sudo chown $(id -u):$(id -g) /etc/fleet-agent/zitadel-key.json
sudo chmod 0400 /etc/fleet-agent/zitadel-key.json

FLEET_AGENT_CONFIG=/tmp/fleet-agent-config.toml \
  RUST_LOG=info \
  ./target/release/harmony-fleet-agent

Watch the log:

fleet-agent-v0 starting device_id=device-laptop-01
podman socket ready
inventory loaded hostname=...
connecting to NATS ["wss://nats-fleet-staging.cb1.nationtech.io"]
minted fresh Zitadel access token audience=<project_id>
connected successfully server=...
NATS connected
fleet publisher ready
watching KV keys filter=device-laptop-01.>

If you hit Permissions Violation errors after connecting:

  • check oc -n fleet-staging logs deploy/fleet-callout --tail=20 — it'll show why the JWT was rejected (audience, role claim, device_id format).

6.5 Verify the operator created a Device CR

oc get devices
# expect:
#   NAME                AGE
#   device-laptop-01    Xs
oc describe device device-laptop-01
# labels block reflects what the agent sent in [labels]

7. Drive a deployment end to end

cat > /tmp/hello-web.yaml <<'EOF'
apiVersion: fleet.nationtech.io/v1alpha1
kind: Deployment
metadata:
  name: hello-web
spec:
  score:
    type: PodmanV0
    data:
      services:
        - name: testnginx
          image: docker.io/nginx:latest
          ports:
            - "8080:80"
  targetSelector:
    matchLabels:
      env: staging
  rollout:
    strategy: Immediate
EOF

oc apply -f /tmp/hello-web.yaml

# Status reflect-back from the agent (takes ~5-15s)
oc get deployment.fleet.nationtech.io hello-web -o yaml | yq '.status'
# expect:
#   aggregate:
#     matchedDeviceCount: 1
#     succeeded: 1
#     failed: 0
#     pending: 0

# On the device:
podman ps
# expect: testnginx running, port 8080→80
curl -sS http://localhost:8080 | head -3
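For scripted checks, the four aggregate counters can be folded into a single state. The interpretation below is an assumption based only on the field names in the expected status (it is not taken from the operator's code), and rollout_state is a hypothetical helper.

```shell
# Hypothetical helper: classify a rollout from the status.aggregate counters.
# Arguments: matchedDeviceCount succeeded failed pending (missing values count as 0).
rollout_state() {
  local matched="${1:-0}" succeeded="${2:-0}" failed="${3:-0}" pending="${4:-0}"
  if [ "$matched" -eq 0 ]; then
    echo "no-match"
  elif [ "$failed" -gt 0 ]; then
    echo "failed"
  elif [ "$pending" -gt 0 ] || [ "$succeeded" -lt "$matched" ]; then
    echo "in-progress"
  else
    echo "done"
  fi
}

# Usage: read each counter with jsonpath, for example
#   oc get deployment.fleet.nationtech.io hello-web \
#     -o jsonpath='{.status.aggregate.matchedDeviceCount}'
```

With the expected status from this step (1 matched, 1 succeeded, 0 failed, 0 pending) the function prints done.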

8. Common failure modes

Each entry lists the symptom first, then the cause / fix.

  • Symptom: cert-manager Certificate stuck False for 5+ min
    Cause / fix: DNS for the host doesn't resolve to the OKD router yet. dig sso-staging.<base> +short should match the cluster's ingress IP. Or your letsencrypt-prod ClusterIssuer is using HTTP01 and the route isn't reachable from Let's Encrypt.
  • Symptom: Operator pod Error: constructing CredentialSource
    Cause / fix: The credentials TOML in the Secret is malformed. oc -n fleet-staging get secret harmony-fleet-operator-secrets -o jsonpath='{.data.credentials\.toml}' | base64 -d and inspect; the key_json field must be a valid JSON keyfile string (multi-line triple-quoted in TOML is fine).
  • Symptom: Operator pod Permissions Violation after NATS connected
    Cause / fix: Issuer pubkey or project_id mismatch between callout and NATS chart values, or Zitadel was reset and the operator's machine key no longer authenticates. Re-run cargo run -p example_fleet_staging_install.
  • Symptom: Agent: Zitadel token endpoint returned 400: invalid_grant_type
    Cause / fix: TOML scope assembly bug or wrong audience. Confirm audience matches oc exec deploy/fleet-callout -- printenv OIDC_AUDIENCE.
  • Symptom: Agent: connects, then Permissions Violation for Publish to "$KV.device-info..."
    Cause / fix: The device's machine user has no device role grant. Add via Zitadel UI → user → Authorizations.
  • Symptom: Deployment.fleet.nationtech.io CR applied but matchedDeviceCount: 0
    Cause / fix: targetSelector.matchLabels doesn't match any Device CR's metadata.labels. oc get devices --show-labels.
  • Symptom: Container redeploys every 30s on the device
    Cause / fix: Known FIXME. The agent's matches_spec returns false for any spec with env or volumes. For the demo, use trivial specs (the hello-web above is fine). Tracked in harmony/src/modules/podman/topology.rs.
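For the matchedDeviceCount: 0 case, it can help to test a selector against the labels printed by oc get devices --show-labels. The sketch below applies the usual Kubernetes matchLabels rule (every selector pair must be present on the object); whether the operator follows exactly this rule is an assumption, and labels_match is a hypothetical helper.

```shell
# Hypothetical helper: succeed when every key=value pair in the selector
# appears in the comma-separated label list (the --show-labels format).
labels_match() {
  local selector="$1" labels="$2" pair
  local IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;      # this pair is present, keep checking
      *) return 1 ;;       # a required pair is missing
    esac
  done
  return 0
}

# Usage:
#   labels_match "env=staging" "arch=x86_64,env=staging,location=laptop" \
#     && echo "device would be selected"
```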

9. Tear down

The Helm releases own the bulk of the resources, so the cleanest recovery from a broken state is:

helm -n zitadel-staging uninstall zitadel
helm -n fleet-staging   uninstall fleet-nats
oc -n fleet-staging delete deploy/harmony-fleet-operator deploy/fleet-callout
oc -n fleet-staging delete secret harmony-fleet-operator-secrets fleet-callout-secrets
oc -n zitadel-staging delete clusters.postgresql.cnpg.io zitadel-pg --ignore-not-found
oc delete project zitadel-staging fleet-staging

# CRDs persist (helm.sh/resource-policy: keep). Delete by hand if you
# really want a clean slate:
oc delete crd deployments.fleet.nationtech.io devices.fleet.nationtech.io

The host-side ~/.local/share/harmony/zitadel/client-config.json caches machine keys + project IDs from this install. Wipe it before re-installing against a freshly reset Zitadel:

rm -f ~/.local/share/harmony/zitadel/client-config.json

(The cache-vs-live drift bug is fixed — ZitadelSetupScore now re-queries Zitadel for IDs on every apply — but stale machine-key material from a deleted Zitadel project will fail at JWT-bearer mint until you delete + re-create.)

10. Cross-reference

  • fleet-zitadel-faq.md — concepts behind Zitadel projects, roles, machine users, audit-trail decisions.
  • fleet-manual-token-mint.md — worked recipe for minting an admin token by hand and using it with nats kv commands.
  • examples/fleet_staging_install/src/main.rs — the install code itself; the comments narrate every step.
  • harmony/src/modules/fleet/server.rs::FleetServerScore — composable form of the same install for callers that don't need the intermediate read of ZitadelClientConfig.