feat(fleet-operator): real dashboard data from kube CRs + NATS KV #322

johnride · 2026-06-01T23:18:46Z

RealFleetService implements FleetService against the same sources the
reconcile loop owns, read-only:

- Device/Deployment CRs (kube) for the registry, desired intent, and
  the aggregator-maintained .status.aggregate (target/healthy/failing/
  pending counts, deployment status).
- device-heartbeat KV → last ping + device status (Stale after 90s).
- device-state KV → per-device phase → Failing/Pending, primary deployment.

Status, dashboard counts, and alerts (one critical per failing
deployment, one warning per stale device; acks held in-memory) are all
derived from live state. Deployment version is the first service's image
tag. blacklist_device patches a label on the Device CR; run_command
stays a seam (needs agent-side transport).
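
For illustration, the alert rule above (one critical per failing deployment, one warning per stale device, acks kept in memory) could be sketched roughly as follows. The `Alert` and `Severity` types, the `deployment/` and `device/` id prefixes, and the function name are assumptions made for this sketch, not the names used in real.rs.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Severity {
    Critical,
    Warning,
}

/// Hypothetical alert shape; the real dashboard type may carry more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Alert {
    pub id: String,
    pub severity: Severity,
    pub acked: bool,
}

/// Derives alerts from live state: failing deployments first (critical),
/// then stale devices (warning). `acked_ids` stands in for the in-memory
/// ack set, so acks are lost on restart.
pub fn derive_alerts(
    failing_deployments: &[&str],
    stale_devices: &[&str],
    acked_ids: &HashSet<String>,
) -> Vec<Alert> {
    let critical = failing_deployments
        .iter()
        .map(|name| (format!("deployment/{name}"), Severity::Critical));
    let warning = stale_devices
        .iter()
        .map(|name| (format!("device/{name}"), Severity::Warning));

    critical
        .chain(warning)
        .map(|(id, severity)| {
            let acked = acked_ids.contains(&id);
            Alert { id, severity, acked }
        })
        .collect()
}
```

Because the alerts are recomputed from state on every call, an alert disappears on its own once the deployment recovers or the device pings again; only the ack set is stored.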

serve_web now connects NATS + kube and builds RealFleetService when not
--mock (the bail is gone); --mock still uses the seeded MockFleetService
for offline UI work. Reads are on-demand per request — fine at staging
scale, a cache can follow.

Unit tests cover status derivation, primary-deployment selection,
version parsing, and alert derivation.
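
As a rough illustration of the version rule above (deployment version = first service's image tag), a parser could look like the sketch below. The function names are hypothetical, and returning `None` for an untagged image is an assumption; the PR does not say how that case is handled.

```rust
/// Returns the tag of an image reference, or `None` when it has no tag.
pub fn image_tag(image: &str) -> Option<&str> {
    // Ignore a trailing digest such as "@sha256:...".
    let without_digest = image.split('@').next().unwrap_or(image);
    // Only look for ':' in the last path segment, so a registry port
    // ("registry:5000/app") is not mistaken for a tag.
    let last_segment = without_digest
        .rsplit('/')
        .next()
        .unwrap_or(without_digest);
    last_segment
        .split_once(':')
        .map(|(_, tag)| tag)
        .filter(|tag| !tag.is_empty())
}

/// Deployment version taken from the first service's image, if any.
pub fn deployment_version<'a>(service_images: &[&'a str]) -> Option<&'a str> {
    service_images.first().copied().and_then(image_tag)
}
```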

johnride added 1 commit 2026-06-01 23:18:47 +00:00

feat(fleet-operator): real dashboard data from kube CRs + NATS KV

Run Check Script / check (pull_request) Failing after 2m7s

9b94fc12a9

RealFleetService implements FleetService against the same sources the
reconcile loop owns, read-only:
- Device/Deployment CRs (kube) for the registry, desired intent, and
  the aggregator-maintained .status.aggregate (target/healthy/failing/
  pending counts, deployment status).
- device-heartbeat KV → last ping + device status (Stale after 90s).
- device-state KV → per-device phase → Failing/Pending, primary deployment.

Status, dashboard counts, and alerts (one critical per failing
deployment, one warning per stale device; acks held in-memory) are all
derived from live state. Deployment version is the first service's image
tag. blacklist_device patches a label on the Device CR; run_command
stays a seam (needs agent-side transport).

serve_web now connects NATS + kube and builds RealFleetService when not
--mock (the bail is gone); --mock still uses the seeded MockFleetService
for offline UI work. Reads are on-demand per request — fine at staging
scale, a cache can follow.

Unit tests cover status derivation, primary-deployment selection,
version parsing, and alert derivation.

johnride reviewed 2026-06-01 23:30:15 +00:00

johnride left a comment

I feel like this is a bit under-engineered: real.rs is 400 lines long, with tens of utility functions, a significant number of which feel like duplicates of logic that is core to the operator.

Overall this code makes sense but does not feel like high-quality craftsmanship. I think the changes could be even smaller and better integrated with the rest of the operator while remaining well architected.

johnride added 1 commit 2026-06-01 23:48:01 +00:00

refactor(fleet-operator): CQRS dashboard — operator owns liveness, UI reads CRs

Run Check Script / check (pull_request) Failing after 2m6s

2ed0eccb45

Reworks the real-data path so the operator is the single write side and
the dashboard is a thin read projection over Kubernetes CRs, removing the
duplicated derivation the on-demand version carried.

Operator (write side):
- Device gains a status subresource (DeviceStatus: lastHeartbeat +
  Reachability). A new device_status reconciler watches device-heartbeat
  KV and reflects liveness onto Device.status on a tick — the home the
  CRD doc designated for "device conditions from heartbeat staleness".
  The staleness threshold now lives in exactly one place.
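
A minimal sketch of what "reflect liveness onto Device.status" could mean, with the threshold held in a single constant. Only `DeviceStatus`, `lastHeartbeat` and `Reachability` are named in this commit; the variant names, the `Unknown` case, the strict `>` comparison, and the reuse of the 90s value from the first commit are assumptions.

```rust
use std::time::{Duration, SystemTime};

/// Single home for the staleness threshold (assumed to still be 90s).
pub const STALE_AFTER: Duration = Duration::from_secs(90);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reachability {
    Reachable,
    Stale,
    Unknown,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceStatus {
    pub last_heartbeat: Option<SystemTime>,
    pub reachability: Reachability,
}

/// Computes the status a reconciler tick would write for one device.
pub fn reflect_liveness(last_heartbeat: Option<SystemTime>, now: SystemTime) -> DeviceStatus {
    let reachability = match last_heartbeat {
        // No heartbeat entry seen yet for this device.
        None => Reachability::Unknown,
        Some(seen) => match now.duration_since(seen) {
            Ok(age) if age > STALE_AFTER => Reachability::Stale,
            // Fresh heartbeat, or one stamped slightly in the future (clock skew).
            _ => Reachability::Reachable,
        },
    };
    DeviceStatus { last_heartbeat, reachability }
}
```

Passing `now` in as a parameter keeps the function pure, which is what makes a unit test for liveness reflection straightforward.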

Dashboard (read side):
- RealFleetService reads only kube: Device CRs (labels/inventory/status)
  + Deployment CRs (.status.aggregate). No NATS, no KV scanning, no
  staleness re-derivation. get_deployment_devices and the per-device
  deployment column filter Device CRs with the canonical selector_matches
  over real labels — the lossy label→tag→label round-trip is gone.
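
To make the read-side filtering concrete, here is a simplified sketch of matching Device labels against a deployment selector. It assumes equality-only (matchLabels-style) selectors and a `(name, labels)` device representation; the operator's actual `selector_matches` signature and the Device type are not shown in this PR, so everything below except the function name is hypothetical.

```rust
use std::collections::BTreeMap;

pub type Labels = BTreeMap<String, String>;

/// Small helper to build a label map from string pairs.
pub fn labels(pairs: &[(&str, &str)]) -> Labels {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// True when every key/value pair in the selector is present on the device.
/// In this sketch an empty selector matches every device.
pub fn selector_matches(selector: &Labels, device_labels: &Labels) -> bool {
    selector
        .iter()
        .all(|(key, value)| device_labels.get(key) == Some(value))
}

/// Names of the devices a deployment targets, filtered on real labels.
pub fn matching_devices<'a>(
    selector: &Labels,
    devices: &'a [(String, Labels)],
) -> Vec<&'a str> {
    devices
        .iter()
        .filter(|(_, device_labels)| selector_matches(selector, device_labels))
        .map(|(name, _)| name.as_str())
        .collect()
}
```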

Deployment:
- The operator now serves the dashboard in-process beside the reconcile
  loop (best-effort; CR-only, so the web side needs no NATS creds and its
  failure never tears down the controller). The image builds with
  --features web-frontend so the pod actually serves the UI — it didn't
  before. serve-web stays for offline (--mock) UI dev.

Device-level Failing/Pending move to the deployment view (accurate
aggregate counts); per-device status is liveness + blacklist. Unit tests
cover liveness reflection, status mapping, version parsing, alerts.

johnride added 1 commit 2026-06-02 01:09:33 +00:00

feat(fleet): one-command dev build+deploy loop for the operator

Run Check Script / check (pull_request) Successful in 2m21s

17c56d2997

Adds fleet/scripts/dev-deploy-operator.sh: a unique semver-dev version
drives harmony-fleet-publish (docker-build the web-frontend image,
generate + push the chart) then harmony-fleet-deploy (helm upgrade
--install onto staging with the dashboard ingress + Service + cert).
Skips the git-tag → CI → release ceremony for fast iteration; a unique
version per run sidesteps mutable-:dev image/chart cache traps.

Dockerfile: BuildKit cache mounts for the cargo registry + target/, so
the iterate loop recompiles only changed crates instead of the whole
workspace. Build-time only — image is identical, cold CI just rebuilds.

chart test: the empty-Secret removal (662ef395) left
chart_includes_credentials_secret_and_env_var asserting a file that no
longer exists. Reframed to assert the hydrated chart omits the Secret
while the Deployment still references the out-of-band one.

johnride added 1 commit 2026-06-02 01:18:16 +00:00

feat(fleet): harmony-fleet-publish takes a bare --version

Run Check Script / check (pull_request) Successful in 2m14s

08243e218b

`--from-tag` only exists because CI passes $GITHUB_REF_NAME; making every
caller wrap a version in a fake `harmony-fleet-operator-v…` string just
to strip it back off was bad UX. Add `--version` (the bare image+chart
version), keep `--from-tag` optional for the CI path — symmetric with
harmony-fleet-deploy's `--operator-chart-version`/`--from-tag`. The dev
script now passes `--version` directly.
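
The flag resolution described above might be sketched like this. The `harmony-fleet-operator-v` prefix comes from the commit message; the function name, the error strings, and treating the two flags as mutually exclusive are assumptions of this sketch.

```rust
const TAG_PREFIX: &str = "harmony-fleet-operator-v";

/// Resolves the bare image+chart version from either `--version` (used
/// as-is) or `--from-tag` (CI path: strip the release-tag prefix).
pub fn resolve_version(
    version: Option<&str>,
    from_tag: Option<&str>,
) -> Result<String, String> {
    match (version, from_tag) {
        (Some(v), None) => Ok(v.to_string()),
        (None, Some(tag)) => tag
            .strip_prefix(TAG_PREFIX)
            .filter(|v| !v.is_empty())
            .map(str::to_string)
            .ok_or_else(|| {
                format!("tag `{}` does not start with `{}`", tag, TAG_PREFIX)
            }),
        (Some(_), Some(_)) => {
            Err("pass either --version or --from-tag, not both".to_string())
        }
        (None, None) => {
            Err("one of --version or --from-tag is required".to_string())
        }
    }
}
```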

johnride added 1 commit 2026-06-02 02:08:02 +00:00

feat(fleet-operator): dashboard SSO config via ConfigClient, not env soup

Run Check Script / check (pull_request) Successful in 2m19s

78bb5d77d8

The auth code (Reda's, proven locally) read 7 FLEET_AUTH_* env vars at
the pod. Replace that with one typed Config value each, loaded the
Harmony way.

- harmony_zitadel_auth: ZitadelAuthConfig is now a `Config` (Serialize/
  Deserialize/JsonSchema). Add OperatorCookieKey (secret Config) with a
  base64→Key decode. Drop config_from_env/cookie_key_from_env + the
  FLEET_AUTH_* consts.
- operator: serve_dashboard loads ZitadelAuthConfig + OperatorCookieKey
  via ConfigClient::for_namespace (EnvSource → OpenBao). No env soup.
- deploy: resolves the values (hosts derived from base_domain, client_id
  + audiences from FleetDeployConfig, cookie key from FleetDeploySecrets)
  and bakes them into the operator Secret as HARMONY_CONFIG_<KEY> JSON.
  The published chart wires the env→Secret refs at publish time
  (optional, pod-light); the deploy fills the Secret at deploy time —
  same pattern as the NATS credentials. A test locks the baked env names
  to the structs' Config keys.
- fleet_staging_install seeds a generated cookie key; dev.sh exports the
  two HARMONY_CONFIG_* JSON values instead of 7 vars.

Dashboard serves once the Zitadel app allows the staging redirect URIs
(fleet-stg.<base>/auth/callback) — the one remaining non-code step.
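
A small sketch of the derivations mentioned in this commit: the dashboard host and callback URI built from `base_domain`, and the `HARMONY_CONFIG_<KEY>` env name. The function names are hypothetical, the `https` scheme is an assumption, and the key is passed through verbatim because the real Config key spelling is not shown here.

```rust
/// Staging dashboard host derived from the base domain.
pub fn dashboard_host(base_domain: &str) -> String {
    format!("fleet-stg.{base_domain}")
}

/// Redirect URI that must be allowed on the Zitadel app
/// (scheme assumed to be https).
pub fn redirect_uri(base_domain: &str) -> String {
    format!("https://{}/auth/callback", dashboard_host(base_domain))
}

/// Env var name under which a Config value is baked into the Secret.
/// `key` is whatever the struct's Config key is; "Example" below is a
/// placeholder, not a real key.
pub fn config_env_var(key: &str) -> String {
    format!("HARMONY_CONFIG_{key}")
}
```

Deriving both names from one function each is what lets a test lock the baked env names to the structs' Config keys, as the commit describes.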

johnride added 1 commit 2026-06-02 02:20:46 +00:00

fix(k8s): build Ingress from typed structs; omit ingressClassName when unset

Run Check Script / check (pull_request) Successful in 2m19s

174d0b4304

The json!-based renderer set `ingressClassName` to the literal string
`"default"` (quotes included) when no class was given — an invalid
IngressClass reference, so the Ingress was never claimed/routed. The
fleet operator passes None, so it hit exactly that.

Rebuild the Ingress from typed k8s_openapi structs. `None` now omits
`ingressClassName` so the cluster's default IngressClass claims the
resource (per docs/guides/kubernetes-ingress.md); `Some(x)` passes it
through unchanged. cert-manager annotations + the tls block are typed
too, dropping the serde_json::Value patching and from_value().unwrap().

Tests cover omit-when-none, pass-through-when-set, and backend/path.
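
The omit-when-unset behaviour can be illustrated with a dependency-free stand-in. The actual change uses the typed k8s_openapi structs; the mirror struct and the render function below are hypothetical and only model the one field in question.

```rust
/// Stand-in for the typed Ingress spec, reduced to the class field.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct IngressSpec {
    pub ingress_class_name: Option<String>,
}

/// `None` stays `None`; `Some(x)` is passed through unchanged.
pub fn ingress_spec(class: Option<&str>) -> IngressSpec {
    IngressSpec {
        ingress_class_name: class.map(str::to_string),
    }
}

/// Renders the spec as key/value pairs, skipping unset fields so that
/// no `ingressClassName` key is emitted when no class was given.
pub fn render_spec_fields(spec: &IngressSpec) -> Vec<(String, String)> {
    let mut fields = Vec::new();
    if let Some(class) = &spec.ingress_class_name {
        fields.push(("ingressClassName".to_string(), class.clone()));
    }
    fields
}
```

With an `Option` in the type there is no place for a placeholder string to sneak in: the field is either a real class name or absent from the output.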

johnride added 1 commit 2026-06-02 02:31:24 +00:00

fix(fleet): grant the operator RBAC on devices/status

Run Check Script / check (pull_request) Successful in 2m24s

2c4fe5c8d6

The Device CRD gained a status subresource (liveness reconciler), but
RBAC treats `devices/status` as a resource distinct from `devices`, so
the operator's patch_status 403'd. Add a ClusterRole rule granting
get/update/patch on `devices/status`, mirroring `deployments/status`.
A test locks both status subresources in the role.
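
Why the 403 happened can be shown with a toy model of RBAC rule matching, where a subresource is just a different resource string. This sketch ignores apiGroups and wildcards, and the `PolicyRule` struct here is a simplified stand-in, not the Kubernetes type; only the `devices/status` resource and the get/update/patch verbs come from the commit.

```rust
/// Simplified RBAC rule: resources x verbs, no apiGroups or wildcards.
#[derive(Debug, Clone)]
pub struct PolicyRule {
    pub resources: Vec<&'static str>,
    pub verbs: Vec<&'static str>,
}

/// The rule this commit adds for the Device status subresource.
pub fn device_status_rule() -> PolicyRule {
    PolicyRule {
        resources: vec!["devices/status"],
        verbs: vec!["get", "update", "patch"],
    }
}

/// A request is allowed only if some rule lists both the exact resource
/// string and the verb. "devices" does not cover "devices/status".
pub fn allows(rules: &[PolicyRule], resource: &str, verb: &str) -> bool {
    rules
        .iter()
        .any(|rule| rule.resources.contains(&resource) && rule.verbs.contains(&verb))
}
```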

johnride added 1 commit 2026-06-02 03:04:00 +00:00

feat: slight improvement of temp version date format for better human readability

Run Check Script / check (pull_request) Successful in 2m17s

aab5e82119

johnride added 1 commit 2026-06-02 03:06:49 +00:00

docs: operator dashboard SSO (Zitadel) setup guide

Run Check Script / check (pull_request) Successful in 2m18s

75aac243c7

Step-by-step for wiring the operator dashboard's browser SSO: the Zitadel
app settings the code requires (Web app, PKCE/no-secret, redirect +
post-logout URIs), each config value mapped to its source
(ZitadelAuthConfig + cookie key), how to provide them (staging via
FleetDeployConfig/Secrets with hosts derived from base_domain; local via
HARMONY_CONFIG_* env), the derived endpoints, and the common-failure
gotchas (iss/aud/redirect mismatch, no client secret, localhost dev mode,
≥64-byte cookie key). Grounded in harmony_zitadel_auth's login/jwks code.

Registered in SUMMARY and cross-linked from web-auth-security.

johnride added 1 commit 2026-06-02 03:08:51 +00:00

docs: trim operator SSO guide to a quickstart-first page

Run Check Script / check (pull_request) Successful in 2m19s

1c0e9df682

Was ~150 lines with the host-derivation repeated three times and
reference tables ahead of the happy path. Rewrite as a 4-step staging
quickstart (the main content), with the counterintuitive bits demoted to
short "when login fails" + "config reference" sections. ~55 lines.

johnride added 1 commit 2026-06-02 03:26:41 +00:00

fix(fleet-operator): build + embed Tailwind CSS in the container image

Run Check Script / check (pull_request) Successful in 2m17s

9be4f63636

The image shipped empty CSS: build.rs shells out to the tailwindcss v4
CLI and silently falls back to an empty bundle when it's absent — which
it was in the rust:slim builder, so /static/tailwind.css served nothing.

- Dockerfile: install the pinned Tailwind v4 standalone CLI (curl) in the
  builder and set TAILWIND_REQUIRED=1.
- build.rs: when TAILWIND_REQUIRED is set (container/prod), a missing or
  failing CLI is now a hard build error instead of empty CSS; dev builds
  keep the soft fallback for the `serve-web --css-from` workflow. The env
  is a rerun trigger, so the first required build regenerates rather than
  reusing a cache-mounted empty bundle.

Verified: with the CLI on PATH the embedded bundle is ~26 KB; with
TAILWIND_REQUIRED=1 and no CLI the build fails as intended.
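
The hard-error versus soft-fallback decision in build.rs could be sketched as below. The function names and the `Result<String, String>` shape for the CLI outcome are assumptions; `TAILWIND_REQUIRED` and the `cargo:rerun-if-env-changed` directive are the pieces taken from the commit and from Cargo itself.

```rust
use std::env;

/// Reads the flag and registers it as a rerun trigger, so toggling it
/// forces the build script to run again instead of reusing cached output.
pub fn tailwind_required() -> bool {
    println!("cargo:rerun-if-env-changed=TAILWIND_REQUIRED");
    env::var_os("TAILWIND_REQUIRED").is_some()
}

/// Decides what to embed given the outcome of running the Tailwind CLI.
/// `cli_output` is Ok(css) on success, Err(reason) when the CLI is
/// missing or failed.
pub fn resolve_css(
    required: bool,
    cli_output: Result<String, String>,
) -> Result<String, String> {
    match cli_output {
        Ok(css) => Ok(css),
        // Container/prod builds: fail loudly rather than ship empty CSS.
        Err(reason) if required => {
            Err(format!("tailwindcss is required but unavailable: {reason}"))
        }
        // Dev builds: keep the soft fallback to an empty bundle.
        Err(_) => Ok(String::new()),
    }
}
```

In a real build script the `Err` from `resolve_css` would be turned into a panic or non-zero exit so that Cargo aborts the build.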

johnride merged commit b70c001ba9 into master

2026-06-02 03:48:30 +00:00

johnride deleted branch feat/fleet-operator-real-data

2026-06-02 03:48:31 +00:00

johnride referenced this issue from a commit

2026-06-02 03:48:32 +00:00

Merge pull request 'feat(fleet-operator): real dashboard data from kube CRs + NATS KV' (#322) from feat/fleet-operator-real-data into master

Reference: NationTech/harmony#322