feat(fleet-operator): real dashboard data from kube CRs + NATS KV #322

Merged
johnride merged 11 commits from feat/fleet-operator-real-data into master 2026-06-02 03:48:30 +00:00

RealFleetService implements FleetService against the same sources the
reconcile loop owns, read-only:

  • Device/Deployment CRs (kube) for the registry, desired intent, and
    the aggregator-maintained .status.aggregate (target/healthy/failing/
    pending counts, deployment status).
  • device-heartbeat KV → last ping + device status (Stale after 90s).
  • device-state KV → per-device phase → Failing/Pending, primary deployment.
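The heartbeat rule above can be sketched as a pure function. This is a minimal illustration, not the code in real.rs: the `DeviceStatus` variants and the `derive_status` name are hypothetical, and it assumes "Stale after 90s" means a heartbeat age strictly greater than 90 seconds.

```rust
use std::time::{Duration, SystemTime};

/// Threshold quoted in the description: no heartbeat for more than 90s means Stale.
const STALE_AFTER: Duration = Duration::from_secs(90);

/// Hypothetical status enum; the real variant names are not shown in this PR.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum DeviceStatus {
    Online,
    Stale,
    Unknown,
}

/// Derive a device status from the last heartbeat timestamp read from the KV bucket.
fn derive_status(last_heartbeat: Option<SystemTime>, now: SystemTime) -> DeviceStatus {
    match last_heartbeat {
        // No heartbeat entry at all: nothing can be said about the device.
        None => DeviceStatus::Unknown,
        Some(seen) => match now.duration_since(seen) {
            Ok(age) if age > STALE_AFTER => DeviceStatus::Stale,
            // Fresh heartbeat, or one slightly in the future (clock skew).
            _ => DeviceStatus::Online,
        },
    }
}
```

Passing `now` in as a parameter keeps the function deterministic, which is what makes the status derivation straightforward to unit test.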

Status, dashboard counts, and alerts (one critical per failing
deployment, one warning per stale device; acks held in-memory) are all
derived from live state. Deployment version is the first service's image
tag. blacklist_device patches a label on the Device CR; run_command
stays a seam (needs agent-side transport).
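As an illustration of "version is the first service's image tag", here is one way the parsing could look. The function names are hypothetical and the handling of registry ports and digests is an assumption about what the real parser needs to tolerate.

```rust
/// Return the tag part of an image reference, if it has one.
fn image_tag(image: &str) -> Option<&str> {
    // Drop a digest suffix such as `@sha256:...` before looking for a tag.
    let without_digest = image.split('@').next().unwrap_or(image);
    // Only look at the last path segment, so a registry port
    // (`host:5000/app`) is not mistaken for a tag.
    let last_segment = without_digest.rsplit('/').next().unwrap_or(without_digest);
    last_segment
        .rsplit_once(':')
        .map(|(_, tag)| tag)
        .filter(|tag| !tag.is_empty())
}

/// Deployment version = tag of the first service's image (None if absent).
fn deployment_version(service_images: &[&str]) -> Option<String> {
    service_images
        .first()
        .and_then(|image| image_tag(image))
        .map(str::to_owned)
}
```

Returning `Option` rather than an empty string leaves it to the caller to decide how an untagged image is displayed.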

serve_web now connects NATS + kube and builds RealFleetService when not
--mock (the bail is gone); --mock still uses the seeded MockFleetService
for offline UI work. Reads are on-demand per request — fine at staging
scale, a cache can follow.

Unit tests cover status derivation, primary-deployment selection,
version parsing, and alert derivation.
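The alert rule described above (one critical per failing deployment, one warning per stale device) could be expressed roughly as follows. All type and field names here are hypothetical, and treating "failing" as a failing count above zero is an assumption; acknowledgement handling is left out.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Severity {
    Critical,
    Warning,
}

#[derive(Debug, PartialEq, Eq)]
struct Alert {
    severity: Severity,
    subject: String,
}

/// Hypothetical read-side views of a Deployment CR aggregate and a device.
struct DeploymentView<'a> {
    name: &'a str,
    failing: u32,
}

struct DeviceView<'a> {
    id: &'a str,
    stale: bool,
}

/// Derive alerts purely from live state: criticals first, then warnings.
fn derive_alerts(deployments: &[DeploymentView], devices: &[DeviceView]) -> Vec<Alert> {
    let critical = deployments
        .iter()
        .filter(|d| d.failing > 0) // assumption: any failing device marks the deployment as failing
        .map(|d| Alert { severity: Severity::Critical, subject: d.name.to_string() });
    let warning = devices
        .iter()
        .filter(|d| d.stale)
        .map(|d| Alert { severity: Severity::Warning, subject: d.id.to_string() });
    critical.chain(warning).collect()
}
```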

johnride added 1 commit 2026-06-01 23:18:47 +00:00
feat(fleet-operator): real dashboard data from kube CRs + NATS KV
Some checks failed
Run Check Script / check (pull_request) Failing after 2m7s
9b94fc12a9
RealFleetService implements FleetService against the same sources the
reconcile loop owns, read-only:
- Device/Deployment CRs (kube) for the registry, desired intent, and
  the aggregator-maintained .status.aggregate (target/healthy/failing/
  pending counts, deployment status).
- device-heartbeat KV → last ping + device status (Stale after 90s).
- device-state KV → per-device phase → Failing/Pending, primary deployment.

Status, dashboard counts, and alerts (one critical per failing
deployment, one warning per stale device; acks held in-memory) are all
derived from live state. Deployment version is the first service's image
tag. blacklist_device patches a label on the Device CR; run_command
stays a seam (needs agent-side transport).

serve_web now connects NATS + kube and builds RealFleetService when not
--mock (the bail is gone); --mock still uses the seeded MockFleetService
for offline UI work. Reads are on-demand per request — fine at staging
scale, a cache can follow.

Unit tests cover status derivation, primary-deployment selection,
version parsing, and alert derivation.
johnride reviewed 2026-06-01 23:30:15 +00:00
johnride left a comment

I feel this is a bit under-engineered: real.rs is 400 lines long, with tens of utility functions, a significant number of which feel like duplicates of logic that is core to the operator.

Overall this code makes sense, but it does not feel like high-quality craftsmanship. I think the changes could be even smaller and better integrated with the rest of the operator while remaining well architected.

johnride added 1 commit 2026-06-01 23:48:01 +00:00
refactor(fleet-operator): CQRS dashboard — operator owns liveness, UI reads CRs
Some checks failed
Run Check Script / check (pull_request) Failing after 2m6s
2ed0eccb45
Reworks the real-data path so the operator is the single write side and
the dashboard is a thin read projection over Kubernetes CRs, removing the
duplicated derivation the on-demand version carried.

Operator (write side):
- Device gains a status subresource (DeviceStatus: lastHeartbeat +
  Reachability). A new device_status reconciler watches device-heartbeat
  KV and reflects liveness onto Device.status on a tick — the home the
  CRD doc designated for "device conditions from heartbeat staleness".
  The staleness threshold now lives in exactly one place.

Dashboard (read side):
- RealFleetService reads only kube: Device CRs (labels/inventory/status)
  + Deployment CRs (.status.aggregate). No NATS, no KV scanning, no
  staleness re-derivation. get_deployment_devices and the per-device
  deployment column filter Device CRs with the canonical selector_matches
  over real labels — the lossy label→tag→label round-trip is gone.
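To make the read-side filtering concrete, here is a rough sketch of equality-based label matching over device labels. It is a stand-in only: the signature of the operator's `selector_matches` is not shown in this PR, so `matches_labels`, `matching_devices`, and `labels_from` are hypothetical names, and only exact key/value matching is modelled.

```rust
use std::collections::BTreeMap;

type Labels = BTreeMap<String, String>;

/// True when every key/value pair in the selector is present on the device.
/// An empty selector matches every device.
fn matches_labels(selector: &Labels, labels: &Labels) -> bool {
    selector.iter().all(|(key, value)| labels.get(key) == Some(value))
}

/// Names of the devices whose real labels satisfy the deployment's selector.
fn matching_devices<'a>(selector: &Labels, devices: &'a [(String, Labels)]) -> Vec<&'a str> {
    devices
        .iter()
        .filter(|(_, labels)| matches_labels(selector, labels))
        .map(|(name, _)| name.as_str())
        .collect()
}

/// Small helper for building label maps in examples and tests.
fn labels_from(pairs: &[(&str, &str)]) -> Labels {
    pairs
        .iter()
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

Because the match runs directly on the labels stored on the Device CR, there is no intermediate tag representation that could lose information.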

Deployment:
- The operator now serves the dashboard in-process beside the reconcile
  loop (best-effort; CR-only, so the web side needs no NATS creds and its
  failure never tears down the controller). The image builds with
  --features web-frontend so the pod actually serves the UI — it didn't
  before. serve-web stays for offline (--mock) UI dev.

Device-level Failing/Pending move to the deployment view (accurate
aggregate counts); per-device status is liveness + blacklist. Unit tests
cover liveness reflection, status mapping, version parsing, alerts.
johnride added 1 commit 2026-06-02 01:09:33 +00:00
feat(fleet): one-command dev build+deploy loop for the operator
All checks were successful
Run Check Script / check (pull_request) Successful in 2m21s
17c56d2997
Adds fleet/scripts/dev-deploy-operator.sh: a unique semver-dev version
drives harmony-fleet-publish (docker-build the web-frontend image,
generate + push the chart) then harmony-fleet-deploy (helm upgrade
--install onto staging with the dashboard ingress + Service + cert).
Skips the git-tag → CI → release ceremony for fast iteration; a unique
version per run sidesteps mutable-:dev image/chart cache traps.

Dockerfile: BuildKit cache mounts for the cargo registry + target/, so
the iterate loop recompiles only changed crates instead of the whole
workspace. Build-time only — image is identical, cold CI just rebuilds.

chart test: the empty-Secret removal (662ef395) left
chart_includes_credentials_secret_and_env_var asserting a file that no
longer exists. Reframed to assert the hydrated chart omits the Secret
while the Deployment still references the out-of-band one.
johnride added 1 commit 2026-06-02 01:18:16 +00:00
feat(fleet): harmony-fleet-publish takes a bare --version
All checks were successful
Run Check Script / check (pull_request) Successful in 2m14s
08243e218b
`--from-tag` only exists because CI passes $GITHUB_REF_NAME; making every
caller wrap a version in a fake `harmony-fleet-operator-v…` string just
to strip it back off was bad UX. Add `--version` (the bare image+chart
version), keep `--from-tag` optional for the CI path — symmetric with
harmony-fleet-deploy's `--operator-chart-version`/`--from-tag`. The dev
script now passes `--version` directly.
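A minimal sketch of how the two flags could resolve to one version string. `resolve_version` is a hypothetical name, and rejecting the case where both flags are given is an assumption, not something this commit states.

```rust
/// Prefix CI tags carry in front of the bare version.
const TAG_PREFIX: &str = "harmony-fleet-operator-v";

/// Resolve the image+chart version from `--version` or `--from-tag`.
fn resolve_version(version: Option<&str>, from_tag: Option<&str>) -> Result<String, String> {
    match (version, from_tag) {
        // Dev path: the bare version is used as-is.
        (Some(v), None) => Ok(v.to_string()),
        // CI path: strip the tag prefix to recover the bare version.
        (None, Some(tag)) => tag
            .strip_prefix(TAG_PREFIX)
            .map(str::to_string)
            .ok_or_else(|| format!("tag `{}` does not start with `{}`", tag, TAG_PREFIX)),
        // Assumption: the flags are mutually exclusive.
        (Some(_), Some(_)) => Err("pass either --version or --from-tag, not both".to_string()),
        (None, None) => Err("one of --version or --from-tag is required".to_string()),
    }
}
```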
johnride added 1 commit 2026-06-02 02:08:02 +00:00
feat(fleet-operator): dashboard SSO config via ConfigClient, not env soup
All checks were successful
Run Check Script / check (pull_request) Successful in 2m19s
78bb5d77d8
The auth code (Reda's, proven locally) read 7 FLEET_AUTH_* env vars at
the pod. Replace that with one typed Config value each, loaded the
Harmony way.

- harmony_zitadel_auth: ZitadelAuthConfig is now a `Config` (Serialize/
  Deserialize/JsonSchema). Add OperatorCookieKey (secret Config) with a
  base64→Key decode. Drop config_from_env/cookie_key_from_env + the
  FLEET_AUTH_* consts.
- operator: serve_dashboard loads ZitadelAuthConfig + OperatorCookieKey
  via ConfigClient::for_namespace (EnvSource → OpenBao). No env soup.
- deploy: resolves the values (hosts derived from base_domain, client_id
  + audiences from FleetDeployConfig, cookie key from FleetDeploySecrets)
  and bakes them into the operator Secret as HARMONY_CONFIG_<KEY> JSON.
  The published chart wires the env→Secret refs at publish time
  (optional, pod-light); the deploy fills the Secret at deploy time —
  same pattern as the NATS credentials. A test locks the baked env names
  to the structs' Config keys.
- fleet_staging_install seeds a generated cookie key; dev.sh exports the
  two HARMONY_CONFIG_* JSON values instead of 7 vars.

Dashboard serves once the Zitadel app allows the staging redirect URIs
(fleet-stg.<base>/auth/callback) — the one remaining non-code step.
johnride added 1 commit 2026-06-02 02:20:46 +00:00
fix(k8s): build Ingress from typed structs; omit ingressClassName when unset
All checks were successful
Run Check Script / check (pull_request) Successful in 2m19s
174d0b4304
The json!-based renderer set `ingressClassName` to the literal string
`"default"` (quotes included) when no class was given — an invalid
IngressClass reference, so the Ingress was never claimed/routed. The
fleet operator passes None, so it hit exactly that.

Rebuild the Ingress from typed k8s_openapi structs. `None` now omits
`ingressClassName` so the cluster's default IngressClass claims the
resource (per docs/guides/kubernetes-ingress.md); `Some(x)` passes it
through unchanged. cert-manager annotations + the tls block are typed
too, dropping the serde_json::Value patching and from_value().unwrap().

Tests cover omit-when-none, pass-through-when-set, and backend/path.
johnride added 1 commit 2026-06-02 02:31:24 +00:00
fix(fleet): grant the operator RBAC on devices/status
All checks were successful
Run Check Script / check (pull_request) Successful in 2m24s
2c4fe5c8d6
The Device CRD gained a status subresource (liveness reconciler), but
RBAC treats `devices/status` as a resource distinct from `devices`, so
the operator's patch_status 403'd. Add a ClusterRole rule granting
get/update/patch on `devices/status`, mirroring `deployments/status`.
A test locks both status subresources in the role.
johnride added 1 commit 2026-06-02 03:04:00 +00:00
feat: slight improvement of temp version date format for better human readability
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
aab5e82119
johnride added 1 commit 2026-06-02 03:06:49 +00:00
docs: operator dashboard SSO (Zitadel) setup guide
All checks were successful
Run Check Script / check (pull_request) Successful in 2m18s
75aac243c7
Step-by-step for wiring the operator dashboard's browser SSO: the Zitadel
app settings the code requires (Web app, PKCE/no-secret, redirect +
post-logout URIs), each config value mapped to its source
(ZitadelAuthConfig + cookie key), how to provide them (staging via
FleetDeployConfig/Secrets with hosts derived from base_domain; local via
HARMONY_CONFIG_* env), the derived endpoints, and the common-failure
gotchas (iss/aud/redirect mismatch, no client secret, localhost dev mode,
≥64-byte cookie key). Grounded in harmony_zitadel_auth's login/jwks code.

Registered in SUMMARY and cross-linked from web-auth-security.
johnride added 1 commit 2026-06-02 03:08:51 +00:00
docs: trim operator SSO guide to a quickstart-first page
All checks were successful
Run Check Script / check (pull_request) Successful in 2m19s
1c0e9df682
Was ~150 lines with the host-derivation repeated three times and
reference tables ahead of the happy path. Rewrite as a 4-step staging
quickstart (the main content), with the counterintuitive bits demoted to
short "when login fails" + "config reference" sections. ~55 lines.
johnride added 1 commit 2026-06-02 03:26:41 +00:00
fix(fleet-operator): build + embed Tailwind CSS in the container image
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
9be4f63636
The image shipped empty CSS: build.rs shells out to the tailwindcss v4
CLI and silently falls back to an empty bundle when it's absent — which
it was in the rust:slim builder, so /static/tailwind.css served nothing.

- Dockerfile: install the pinned Tailwind v4 standalone CLI (curl) in the
  builder and set TAILWIND_REQUIRED=1.
- build.rs: when TAILWIND_REQUIRED is set (container/prod), a missing or
  failing CLI is now a hard build error instead of empty CSS; dev builds
  keep the soft fallback for the `serve-web --css-from` workflow. The env
  is a rerun trigger, so the first required build regenerates rather than
  reusing a cache-mounted empty bundle.

Verified: with the CLI on PATH the embedded bundle is ~26 KB; with
TAILWIND_REQUIRED=1 and no CLI the build fails as intended.
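The required-versus-soft-fallback decision can be sketched as below. This is an illustration of the behaviour described, not the actual build.rs: the function names are hypothetical, and the CLI invocation is abstracted into a `Result` so the decision logic stands alone. The `cargo:rerun-if-env-changed` directive is the standard Cargo build-script instruction.

```rust
use std::env;

/// True when the build must fail rather than ship empty CSS.
fn tailwind_required() -> bool {
    env::var_os("TAILWIND_REQUIRED").is_some()
}

/// Decide what CSS to embed, given the outcome of running the Tailwind CLI.
fn resolve_css(required: bool, cli_output: Result<String, String>) -> Result<String, String> {
    match cli_output {
        Ok(css) => Ok(css),
        // Container/prod builds: a missing or failing CLI is a hard error.
        Err(reason) if required => Err(format!(
            "TAILWIND_REQUIRED is set but the Tailwind CLI failed: {}",
            reason
        )),
        // Dev builds: keep the soft fallback to an empty bundle.
        Err(_) => Ok(String::new()),
    }
}

/// Re-run the build script when the variable changes, so a cached empty
/// bundle is not reused by the first required build.
fn emit_rerun_triggers() {
    println!("cargo:rerun-if-env-changed=TAILWIND_REQUIRED");
}
```

In a build script, `main` would call `emit_rerun_triggers()`, run the CLI, and pass the result with `tailwind_required()` into `resolve_css`, panicking on `Err` to fail the build.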
johnride merged commit b70c001ba9 into master 2026-06-02 03:48:30 +00:00
johnride deleted branch feat/fleet-operator-real-data 2026-06-02 03:48:31 +00:00
Reference: NationTech/harmony#322