NationTech/harmony

feat(fleet-operator): real dashboard data from kube CRs + NATS KV #322

Merged

johnride merged 11 commits from feat/fleet-operator-real-data into master

2026-06-02 03:48:30 +00:00

Author	SHA1	Message	Date
Jean-Gabriel Gill-Couture	9be4f63636	fix(fleet-operator): build + embed Tailwind CSS in the container image [checks: Run Check Script / check (pull_request) successful in 2m17s] The image shipped empty CSS: build.rs shells out to the tailwindcss v4 CLI and silently falls back to an empty bundle when the CLI is absent, which it was in the rust:slim builder, so /static/tailwind.css served nothing. - Dockerfile: install the pinned Tailwind v4 standalone CLI (curl) in the builder and set TAILWIND_REQUIRED=1. - build.rs: when TAILWIND_REQUIRED is set (container/prod), a missing or failing CLI is now a hard build error instead of empty CSS; dev builds keep the soft fallback for the `serve-web --css-from` workflow. The env var is a rerun trigger, so the first required build regenerates the bundle rather than reusing a cache-mounted empty one. Verified: with the CLI on PATH the embedded bundle is ~26 KB; with TAILWIND_REQUIRED=1 and no CLI the build fails as intended.	2026-06-01 23:26:38 -04:00
Jean-Gabriel Gill-Couture	1c0e9df682	docs: trim operator SSO guide to a quickstart-first page [checks: Run Check Script / check (pull_request) successful in 2m19s] The guide was ~150 lines, with the host derivation repeated three times and reference tables ahead of the happy path. Rewrite it as a 4-step staging quickstart (the main content), with the counterintuitive bits demoted to short "when login fails" + "config reference" sections. Now ~55 lines.	2026-06-01 23:08:48 -04:00
Jean-Gabriel Gill-Couture	75aac243c7	docs: operator dashboard SSO (Zitadel) setup guide [checks: Run Check Script / check (pull_request) successful in 2m18s] Step-by-step for wiring the operator dashboard's browser SSO: the Zitadel app settings the code requires (Web app, PKCE/no-secret, redirect + post-logout URIs), each config value mapped to its source (ZitadelAuthConfig + cookie key), how to provide them (staging via FleetDeployConfig/Secrets with hosts derived from base_domain; local via HARMONY_CONFIG_* env), the derived endpoints, and the common-failure gotchas (iss/aud/redirect mismatch, no client secret, localhost dev mode, ≥64-byte cookie key). Grounded in harmony_zitadel_auth's login/jwks code. Registered in SUMMARY and cross-linked from web-auth-security.	2026-06-01 23:06:45 -04:00
Jean-Gabriel Gill-Couture	aab5e82119	feat: slight improvement of temp version date format for better human readability [checks: Run Check Script / check (pull_request) successful in 2m17s]	2026-06-01 23:03:55 -04:00
Jean-Gabriel Gill-Couture	2c4fe5c8d6	fix(fleet): grant the operator RBAC on devices/status [checks: Run Check Script / check (pull_request) successful in 2m24s] The Device CRD gained a status subresource (liveness reconciler), but RBAC treats `devices/status` as a resource distinct from `devices`, so the operator's patch_status 403'd. Add a ClusterRole rule granting get/update/patch on `devices/status`, mirroring `deployments/status`. A test locks both status subresources in the role.	2026-06-01 22:31:20 -04:00
Jean-Gabriel Gill-Couture	174d0b4304	fix(k8s): build Ingress from typed structs; omit ingressClassName when unset [checks: Run Check Script / check (pull_request) successful in 2m19s] The json!-based renderer set `ingressClassName` to the literal string `"default"` (quotes included) when no class was given. That is an invalid IngressClass reference, so the Ingress was never claimed/routed. The fleet operator passes None, so it hit exactly that. Rebuild the Ingress from typed k8s_openapi structs. `None` now omits `ingressClassName` so the cluster's default IngressClass claims the resource (per docs/guides/kubernetes-ingress.md); `Some(x)` passes it through unchanged. cert-manager annotations + the tls block are typed too, dropping the serde_json::Value patching and from_value().unwrap(). Tests cover omit-when-none, pass-through-when-set, and backend/path.	2026-06-01 22:20:42 -04:00
Jean-Gabriel Gill-Couture	78bb5d77d8	feat(fleet-operator): dashboard SSO config via ConfigClient, not env soup [checks: Run Check Script / check (pull_request) successful in 2m19s] The auth code (Reda's, proven locally) read 7 FLEET_AUTH_* env vars at the pod. Replace that with one typed Config value each, loaded the Harmony way. - harmony_zitadel_auth: ZitadelAuthConfig is now a `Config` (Serialize/Deserialize/JsonSchema). Add OperatorCookieKey (secret Config) with a base64→Key decode. Drop config_from_env/cookie_key_from_env + the FLEET_AUTH_* consts. - operator: serve_dashboard loads ZitadelAuthConfig + OperatorCookieKey via ConfigClient::for_namespace (EnvSource → OpenBao). No env soup. - deploy: resolves the values (hosts derived from base_domain, client_id + audiences from FleetDeployConfig, cookie key from FleetDeploySecrets) and bakes them into the operator Secret as HARMONY_CONFIG_<KEY> JSON. The published chart wires the env→Secret refs at publish time (optional, pod-light); the deploy fills the Secret at deploy time, the same pattern as the NATS credentials. A test locks the baked env names to the structs' Config keys. - fleet_staging_install seeds a generated cookie key; dev.sh exports the two HARMONY_CONFIG_* JSON values instead of 7 vars. The dashboard serves once the Zitadel app allows the staging redirect URIs (fleet-stg.<base>/auth/callback), which is the one remaining non-code step.	2026-06-01 22:07:58 -04:00
Jean-Gabriel Gill-Couture	08243e218b	feat(fleet): harmony-fleet-publish takes a bare --version [checks: Run Check Script / check (pull_request) successful in 2m14s] `--from-tag` only exists because CI passes $GITHUB_REF_NAME; making every caller wrap a version in a fake `harmony-fleet-operator-v…` string just to strip it back off was bad UX. Add `--version` (the bare image+chart version) and keep `--from-tag` as an optional flag for the CI path, symmetric with harmony-fleet-deploy's `--operator-chart-version`/`--from-tag`. The dev script now passes `--version` directly.	2026-06-01 21:18:13 -04:00
Jean-Gabriel Gill-Couture	17c56d2997	feat(fleet): one-command dev build+deploy loop for the operator [checks: Run Check Script / check (pull_request) successful in 2m21s] Adds fleet/scripts/dev-deploy-operator.sh: a unique semver-dev version drives harmony-fleet-publish (docker-build the web-frontend image, generate + push the chart) then harmony-fleet-deploy (helm upgrade --install onto staging with the dashboard ingress + Service + cert). Skips the git-tag → CI → release ceremony for fast iteration; a unique version per run sidesteps mutable-:dev image/chart cache traps. Dockerfile: BuildKit cache mounts for the cargo registry + target/, so the iterate loop recompiles only changed crates instead of the whole workspace. This is build-time only: the image is identical, and cold CI just rebuilds. chart test: the empty-Secret removal (`662ef395`) left chart_includes_credentials_secret_and_env_var asserting a file that no longer exists. Reframed to assert the hydrated chart omits the Secret while the Deployment still references the out-of-band one.	2026-06-01 21:09:28 -04:00
Jean-Gabriel Gill-Couture	2ed0eccb45	refactor(fleet-operator): CQRS dashboard (operator owns liveness, UI reads CRs) [checks: Run Check Script / check (pull_request) failing after 2m6s] Reworks the real-data path so the operator is the single write side and the dashboard is a thin read projection over Kubernetes CRs, removing the duplicated derivation the on-demand version carried. Operator (write side): - Device gains a status subresource (DeviceStatus: lastHeartbeat + Reachability). A new device_status reconciler watches device-heartbeat KV and reflects liveness onto Device.status on a tick, which is the home the CRD doc designated for "device conditions from heartbeat staleness". The staleness threshold now lives in exactly one place. Dashboard (read side): - RealFleetService reads only kube: Device CRs (labels/inventory/status) + Deployment CRs (.status.aggregate). No NATS, no KV scanning, no staleness re-derivation. get_deployment_devices and the per-device deployment column filter Device CRs with the canonical selector_matches over real labels; the lossy label→tag→label round-trip is gone. Deployment: - The operator now serves the dashboard in-process beside the reconcile loop (best-effort; CR-only, so the web side needs no NATS creds and its failure never tears down the controller). The image builds with --features web-frontend so the pod actually serves the UI, which it did not before. serve-web stays for offline (--mock) UI dev. Device-level Failing/Pending move to the deployment view (accurate aggregate counts); per-device status is liveness + blacklist. Unit tests cover liveness reflection, status mapping, version parsing, alerts.	2026-06-01 19:47:54 -04:00
Jean-Gabriel Gill-Couture	9b94fc12a9	feat(fleet-operator): real dashboard data from kube CRs + NATS KV [checks: Run Check Script / check (pull_request) failing after 2m7s] RealFleetService implements FleetService against the same sources the reconcile loop owns, read-only: - Device/Deployment CRs (kube) for the registry, desired intent, and the aggregator-maintained .status.aggregate (target/healthy/failing/pending counts, deployment status). - device-heartbeat KV → last ping + device status (Stale after 90s). - device-state KV → per-device phase → Failing/Pending, primary deployment. Status, dashboard counts, and alerts (one critical per failing deployment, one warning per stale device; acks held in-memory) are all derived from live state. Deployment version is the first service's image tag. blacklist_device patches a label on the Device CR; run_command stays a seam (needs agent-side transport). serve_web now connects NATS + kube and builds RealFleetService when not --mock (the bail is gone); --mock still uses the seeded MockFleetService for offline UI work. Reads are on-demand per request, which is fine at staging scale; a cache can follow. Unit tests cover status derivation, primary-deployment selection, version parsing, and alert derivation.	2026-06-01 17:42:14 -04:00
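
Two derivations named in the commits above, heartbeat staleness ("Stale after 90s") and "deployment version is the first service's image tag", can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the operator's actual code: the names `Liveness`, `liveness`, and `image_tag` are hypothetical, and the handling of a missing heartbeat, clock skew, registry ports, and digests is a guess at reasonable behavior rather than something the commit messages specify.

```rust
use std::time::{Duration, SystemTime};

/// Threshold taken from the "Stale after 90s" note in the commit message.
const STALE_AFTER: Duration = Duration::from_secs(90);

/// Hypothetical liveness states; the real DeviceStatus type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Liveness {
    Online,
    Stale,
    Unknown,
}

/// Derive liveness from the last heartbeat timestamp.
/// Assumption: a device that never reported is Unknown rather than Stale.
fn liveness(last_heartbeat: Option<SystemTime>, now: SystemTime) -> Liveness {
    match last_heartbeat {
        None => Liveness::Unknown,
        Some(seen) => match now.duration_since(seen) {
            Ok(age) if age > STALE_AFTER => Liveness::Stale,
            // Fresh heartbeat, or one stamped slightly in the future (clock skew).
            _ => Liveness::Online,
        },
    }
}

/// Extract the tag from an image reference such as `registry:5000/app:1.2.3`.
/// Only a ':' in the last path segment counts, so a registry port is not
/// mistaken for a tag; a trailing `@sha256:...` digest is ignored.
fn image_tag(image: &str) -> Option<&str> {
    let without_digest = image.split('@').next().unwrap_or(image);
    let segment_start = without_digest.rfind('/').map_or(0, |i| i + 1);
    let segment = &without_digest[segment_start..];
    segment
        .rfind(':')
        .map(|i| &segment[i + 1..])
        .filter(|tag| !tag.is_empty())
}

fn main() {
    let now = SystemTime::now();
    let two_minutes_ago = now - Duration::from_secs(120);
    println!("{:?}", liveness(Some(two_minutes_ago), now));
    println!("{:?}", image_tag("registry.example:5000/fleet/web:1.2.3"));
}
```

Keeping the threshold in a single constant mirrors the later refactor's point that the staleness threshold should live in exactly one place.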