The image shipped empty CSS: build.rs shells out to the tailwindcss v4
CLI and silently falls back to an empty bundle when it's absent — which
it was in the rust:slim builder, so /static/tailwind.css served nothing.
- Dockerfile: install the pinned Tailwind v4 standalone CLI (curl) in the
builder and set TAILWIND_REQUIRED=1.
- build.rs: when TAILWIND_REQUIRED is set (container/prod), a missing or
failing CLI is now a hard build error instead of empty CSS; dev builds
keep the soft fallback for the `serve-web --css-from` workflow. The env
var is a rerun trigger, so the first required build regenerates the
bundle rather than reusing a cache-mounted empty one.
Verified: with the CLI on PATH the embedded bundle is ~26 KB; with
TAILWIND_REQUIRED=1 and no CLI the build fails as intended.
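
A minimal sketch of the hard-fail vs. soft-fallback decision described above, not the actual build.rs. The function names, the `CssOutcome` type, and the CLI arguments are illustrative assumptions; only `TAILWIND_REQUIRED` and the rerun-trigger behavior come from this change.

```rust
use std::process::Command;

/// Result of attempting to produce the CSS bundle (hypothetical type).
#[derive(Debug, PartialEq)]
enum CssOutcome {
    /// The CLI ran and wrote a real bundle.
    Built,
    /// The CLI was missing or failed; dev builds embed an empty bundle.
    EmptyFallback,
}

/// Assumption: any value counts as "set", including an empty one.
fn tailwind_required() -> bool {
    std::env::var_os("TAILWIND_REQUIRED").is_some()
}

/// Invoke the CLI. The `-i`/`-o`/`--minify` arguments are an assumption
/// about how the real build script calls it.
fn run_tailwind(input: &str, output: &str) -> Result<(), String> {
    let status = Command::new("tailwindcss")
        .args(["-i", input, "-o", output, "--minify"])
        .status()
        .map_err(|e| format!("could not spawn tailwindcss: {e}"))?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("tailwindcss exited with {status}"))
    }
}

/// Pure decision: a CLI error is fatal only when the bundle is required.
fn resolve_css(required: bool, cli_result: Result<(), String>) -> Result<CssOutcome, String> {
    match (cli_result, required) {
        (Ok(()), _) => Ok(CssOutcome::Built),
        (Err(e), true) => Err(format!("TAILWIND_REQUIRED is set but {e}")),
        (Err(_), false) => Ok(CssOutcome::EmptyFallback),
    }
}

/// What a build script's main could do with the pieces above.
fn build_css(input: &str, output: &str) -> Result<CssOutcome, String> {
    // Re-run the build script when the variable changes, so a cached
    // empty bundle is not reused by the first required build.
    println!("cargo:rerun-if-env-changed=TAILWIND_REQUIRED");
    resolve_css(tailwind_required(), run_tailwind(input, output))
}
```

Keeping `resolve_css` free of I/O makes the two modes testable without the CLI installed.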
The doc was ~150 lines, with the host derivation repeated three times
and reference tables ahead of the happy path. Rewrite it as a 4-step
staging quickstart (the main content), with the counterintuitive bits
demoted to short "when login fails" and "config reference" sections.
Now ~55 lines.
Step-by-step for wiring the operator dashboard's browser SSO: the Zitadel
app settings the code requires (Web app, PKCE/no-secret, redirect +
post-logout URIs), each config value mapped to its source
(ZitadelAuthConfig + cookie key), how to provide them (staging via
FleetDeployConfig/Secrets with hosts derived from base_domain; local via
HARMONY_CONFIG_* env), the derived endpoints, and the common-failure
gotchas (iss/aud/redirect mismatch, no client secret, localhost dev mode,
≥64-byte cookie key). Grounded in harmony_zitadel_auth's login/jwks code.
Registered in SUMMARY and cross-linked from web-auth-security.
The Device CRD gained a status subresource (liveness reconciler), but
RBAC treats `devices/status` as a resource distinct from `devices`, so
the operator's patch_status 403'd. Add a ClusterRole rule granting
get/update/patch on `devices/status`, mirroring `deployments/status`.
A test locks both status subresources in the role.
The json!-based renderer set `ingressClassName` to the literal string
`"default"` (quotes included) when no class was given — an invalid
IngressClass reference, so the Ingress was never claimed/routed. The
fleet operator passes None, so it hit exactly that.
Rebuild the Ingress from typed k8s_openapi structs. `None` now omits
`ingressClassName` so the cluster's default IngressClass claims the
resource (per docs/guides/kubernetes-ingress.md); `Some(x)` passes it
through unchanged. cert-manager annotations + the tls block are typed
too, dropping the serde_json::Value patching and from_value().unwrap().
Tests cover omit-when-none, pass-through-when-set, and backend/path.
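
The omit-when-none behavior can be illustrated with a dependency-free sketch. The real renderer uses the typed k8s_openapi structs; `IngressSpecSketch` and `render_yaml` below are hypothetical stand-ins that only model the one field.

```rust
/// Stand-in for the relevant part of an Ingress spec (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct IngressSpecSketch {
    /// `None` means "let the cluster's default IngressClass claim it".
    ingress_class_name: Option<String>,
}

/// Build the spec: `Some(x)` passes through unchanged, `None` stays `None`
/// instead of being replaced by a placeholder string.
fn ingress_spec(class: Option<&str>) -> IngressSpecSketch {
    IngressSpecSketch {
        ingress_class_name: class.map(str::to_owned),
    }
}

/// Render only the field of interest, the way a serializer that skips
/// `None` fields would.
fn render_yaml(spec: &IngressSpecSketch) -> String {
    let mut out = String::from("spec:\n");
    if let Some(class) = &spec.ingress_class_name {
        out.push_str(&format!("  ingressClassName: {class}\n"));
    }
    out
}
```

With an `Option` field there is no value that can be mistaken for a class name, which is the failure the string-patching version had.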
The auth code (Reda's, proven locally) read 7 FLEET_AUTH_* env vars at
the pod. Replace them with two typed Config values (ZitadelAuthConfig
and OperatorCookieKey), loaded the Harmony way.
- harmony_zitadel_auth: ZitadelAuthConfig is now a `Config` (Serialize/
Deserialize/JsonSchema). Add OperatorCookieKey (secret Config) with a
base64→Key decode. Drop config_from_env/cookie_key_from_env + the
FLEET_AUTH_* consts.
- operator: serve_dashboard loads ZitadelAuthConfig + OperatorCookieKey
via ConfigClient::for_namespace (EnvSource → OpenBao). No env soup.
- deploy: resolves the values (hosts derived from base_domain, client_id
+ audiences from FleetDeployConfig, cookie key from FleetDeploySecrets)
and bakes them into the operator Secret as HARMONY_CONFIG_<KEY> JSON.
The published chart wires the env→Secret refs at publish time
(optional, pod-light); the deploy fills the Secret at deploy time —
same pattern as the NATS credentials. A test locks the baked env names
to the structs' Config keys.
- fleet_staging_install seeds a generated cookie key; dev.sh exports the
two HARMONY_CONFIG_* JSON values instead of 7 vars.
Dashboard serves once the Zitadel app allows the staging redirect URIs
(fleet-stg.<base>/auth/callback) — the one remaining non-code step.
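
For reference, a sketch of how the staging redirect URI follows from base_domain. The helper names and the `https` scheme are assumptions for illustration; only the `fleet-stg.<base>/auth/callback` shape comes from this change.

```rust
/// Hypothetical helper: the staging dashboard host for a base domain.
/// Stray leading or trailing dots are trimmed so they do not double up.
fn staging_dashboard_host(base_domain: &str) -> String {
    format!("fleet-stg.{}", base_domain.trim_matches('.'))
}

/// Hypothetical helper: the redirect URI the Zitadel app must allow.
/// Assumption: the dashboard is served over https.
fn staging_redirect_uri(base_domain: &str) -> String {
    format!("https://{}/auth/callback", staging_dashboard_host(base_domain))
}
```

For a base domain of `example.com` this yields `https://fleet-stg.example.com/auth/callback`, which is the value to register in the Zitadel app.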
`--from-tag` only exists because CI passes $GITHUB_REF_NAME; making every
caller wrap a version in a fake `harmony-fleet-operator-v…` string just
to strip it back off was bad UX. Add `--version` (the bare image+chart
version), keep `--from-tag` optional for the CI path — symmetric with
harmony-fleet-deploy's `--operator-chart-version`/`--from-tag`. The dev
script now passes `--version` directly.
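
A sketch of the resolution between the two flags, under stated assumptions: `resolve_version` is a hypothetical name, and treating the flags as mutually exclusive is a guess rather than confirmed CLI behavior. The tag prefix is the one quoted above.

```rust
/// Prefix CI tags carry in front of the bare version.
const TAG_PREFIX: &str = "harmony-fleet-operator-v";

/// Resolve the bare image+chart version from `--version` or `--from-tag`.
/// Assumption: exactly one of the two must be given.
fn resolve_version(version: Option<&str>, from_tag: Option<&str>) -> Result<String, String> {
    match (version, from_tag) {
        (Some(_), Some(_)) => Err("pass either --version or --from-tag, not both".to_string()),
        // The common path: the caller already has the bare version.
        (Some(v), None) => Ok(v.to_string()),
        // The CI path: strip the tag prefix off $GITHUB_REF_NAME.
        (None, Some(tag)) => tag
            .strip_prefix(TAG_PREFIX)
            .map(str::to_string)
            .ok_or_else(|| format!("tag `{}` does not start with `{}`", tag, TAG_PREFIX)),
        (None, None) => Err("one of --version or --from-tag is required".to_string()),
    }
}
```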
Adds fleet/scripts/dev-deploy-operator.sh: a unique semver-dev version
drives harmony-fleet-publish (docker-build the web-frontend image,
generate + push the chart) then harmony-fleet-deploy (helm upgrade
--install onto staging with the dashboard ingress + Service + cert).
Skips the git-tag → CI → release ceremony for fast iteration; a unique
version per run sidesteps mutable-:dev image/chart cache traps.
Dockerfile: BuildKit cache mounts for the cargo registry + target/, so
the iterate loop recompiles only changed crates instead of the whole
workspace. Build-time only — image is identical, cold CI just rebuilds.
chart test: the empty-Secret removal (662ef395) left
chart_includes_credentials_secret_and_env_var asserting a file that no
longer exists. Reframed to assert the hydrated chart omits the Secret
while the Deployment still references the out-of-band one.
Reworks the real-data path so the operator is the single write side and
the dashboard is a thin read projection over Kubernetes CRs, removing the
duplicated derivation the on-demand version carried.
Operator (write side):
- Device gains a status subresource (DeviceStatus: lastHeartbeat +
Reachability). A new device_status reconciler watches device-heartbeat
KV and reflects liveness onto Device.status on a tick — the home the
CRD doc designated for "device conditions from heartbeat staleness".
The staleness threshold now lives in exactly one place.
Dashboard (read side):
- RealFleetService reads only kube: Device CRs (labels/inventory/status)
+ Deployment CRs (.status.aggregate). No NATS, no KV scanning, no
staleness re-derivation. get_deployment_devices and the per-device
deployment column filter Device CRs with the canonical selector_matches
over real labels — the lossy label→tag→label round-trip is gone.
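
To make the label filtering concrete, a sketch of equality-based selector matching over real labels. This is not the canonical selector_matches (its signature is not shown here); `selector_matches_sketch` and `labels` are hypothetical, and only matchLabels-style equality is modeled.

```rust
use std::collections::BTreeMap;

/// Convenience constructor for a label map (hypothetical helper).
fn labels(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// A device matches when every selector pair is present in its labels
/// with the same value. An empty selector matches every device, which
/// follows the Kubernetes convention for empty label selectors.
fn selector_matches_sketch(
    selector: &BTreeMap<String, String>,
    device_labels: &BTreeMap<String, String>,
) -> bool {
    selector
        .iter()
        .all(|(key, value)| device_labels.get(key) == Some(value))
}
```

Matching directly on the label map is what removes the need for the label→tag→label conversion.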
Deployment:
- The operator now serves the dashboard in-process beside the reconcile
loop (best-effort; CR-only, so the web side needs no NATS creds and its
failure never tears down the controller). The image builds with
--features web-frontend so the pod actually serves the UI — it didn't
before. serve-web stays for offline (--mock) UI dev.
Device-level Failing/Pending move to the deployment view (accurate
aggregate counts); per-device status is liveness + blacklist. Unit tests
cover liveness reflection, status mapping, version parsing, alerts.
RealFleetService implements FleetService against the same sources the
reconcile loop owns, read-only:
- Device/Deployment CRs (kube) for the registry, desired intent, and
the aggregator-maintained .status.aggregate (target/healthy/failing/
pending counts, deployment status).
- device-heartbeat KV → last ping + device status (Stale after 90s).
- device-state KV → per-device phase → Failing/Pending, primary deployment.
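
The heartbeat rule can be sketched as follows. The 90s threshold is from this change; the `DeviceLiveness` names, the handling of a missing heartbeat, and the strict `>` boundary are assumptions.

```rust
use std::time::{Duration, SystemTime};

/// A device is considered stale once its last ping is older than this.
const STALE_AFTER: Duration = Duration::from_secs(90);

/// Hypothetical liveness states derived from the heartbeat KV.
#[derive(Debug, PartialEq)]
enum DeviceLiveness {
    Online,
    Stale,
    /// No heartbeat has ever been recorded for the device.
    Unknown,
}

/// Derive liveness from the last heartbeat timestamp and the current time.
fn liveness(last_heartbeat: Option<SystemTime>, now: SystemTime) -> DeviceLiveness {
    match last_heartbeat {
        None => DeviceLiveness::Unknown,
        Some(seen) => match now.duration_since(seen) {
            Ok(age) if age > STALE_AFTER => DeviceLiveness::Stale,
            // Fresh heartbeat, or one slightly in the future due to
            // clock skew: both are treated as online.
            _ => DeviceLiveness::Online,
        },
    }
}
```

Passing `now` in explicitly keeps the function deterministic under test.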
Status, dashboard counts, and alerts (one critical per failing
deployment, one warning per stale device; acks held in-memory) are all
derived from live state. Deployment version is the first service's image
tag. blacklist_device patches a label on the Device CR; run_command
stays a seam (needs agent-side transport).
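
A sketch of the version derivation, assuming plain image references. `image_tag` and `deployment_version` are hypothetical names, and the handling of digests and registry ports is an assumption about edge cases this change does not spell out.

```rust
/// Extract the tag from an image reference, if it has one.
fn image_tag(image: &str) -> Option<&str> {
    // Drop an `@sha256:...` digest suffix if present.
    let without_digest = image.split('@').next().unwrap_or(image);
    // Only inspect the last path segment, so a registry port such as
    // `localhost:5000/app` is not mistaken for a tag.
    let last_segment = without_digest.rsplit('/').next().unwrap_or(without_digest);
    last_segment
        .split_once(':')
        .map(|(_, tag)| tag)
        .filter(|tag| !tag.is_empty())
}

/// The deployment version is the tag of the first service's image.
fn deployment_version<'a>(service_images: &[&'a str]) -> Option<&'a str> {
    service_images.first().copied().and_then(image_tag)
}
```

An untagged or empty service list yields `None`, leaving it to the caller to decide what the dashboard shows.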
serve_web now connects NATS + kube and builds RealFleetService when not
--mock (the bail is gone); --mock still uses the seeded MockFleetService
for offline UI work. Reads are on-demand per request, which is fine at
staging scale; a cache can follow.
Unit tests cover status derivation, primary-deployment selection,
version parsing, and alert derivation.