harmony/ROADMAP/fleet_platform/v0_3_plan.md
Jean-Gabriel Gill-Couture f7299ebe2b
refactor(fleet-deploy): rename HARMONY_SECRET_NAMESPACE to HARMONY_CONFIG_NAMESPACE
The env var name was a misnomer — ConfigClient resolves both config and
secrets, not just secrets. The struct field was already config_namespace.
Legacy SecretManager keeps the old var; this forces migration to
ConfigClient for new code.
2026-05-31 09:13:39 -04:00


Fleet Platform v0.3 — Staging to production-ready

Written 2026-05-31. Picks up after OpenBao, Zitadel, NATS, the auth callout, and the operator are deployed and functional on staging (the deployed versions are 2-3 weeks old).

Current state

  • OpenBao running at secrets-stg.cb1.nationtech.io
  • Zitadel running at sso-stg.cb1.nationtech.io
  • NATS + auth callout deployed in fleet-staging namespace
  • Operator deployed (older version, 2-3 weeks old)
  • Config-driven OpenBao installer (examples/openbao)
  • harmony-fleet-deploy binary reads FleetDeployConfig + FleetDeploySecrets from OpenBao

Immediate next steps

1. Provision operator credentials in OpenBao

  • Fetch existing creds from the running cluster:
    oc -n fleet-staging get secret harmony-fleet-operator-secrets -o jsonpath='{.data.credentials\.toml}' | base64 -d
    
  • Seed into OpenBao at secret/data/fleet-staging/FleetDeploySecrets (KV v2 API path; the bao kv commands take the same path without the data/ segment):
    export VAULT_ADDR=https://secrets-stg.cb1.nationtech.io
    export VAULT_TOKEN=<root token>
    oc -n fleet-staging get secret harmony-fleet-operator-secrets -o jsonpath='{.data.credentials\.toml}' | base64 -d \
      | jq -Rs '{value: ({operator_credentials_toml: .} | tojson)}' \
      | bao kv put secret/fleet-staging/FleetDeploySecrets -
    
  • Verify the secret is readable: bao kv get secret/fleet-staging/FleetDeploySecrets
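
The step above names the secret by its API path (secret/data/...) while the bao kv commands use the shorter form (secret/...). A minimal sketch of that mapping, assuming a KV v2 engine mounted on a single-segment mount such as secret; kv_cli_path is a hypothetical helper, not part of bao or harmony:

```shell
# Convert a KV v2 API path (mount/data/...) to the form the kv CLI expects (mount/...).
# Assumption: the mount name is a single path segment, e.g. "secret".
# Paths without a data/ segment right after the mount are returned unchanged.
kv_cli_path() {
  api_path=$1
  mount=${api_path%%/*}   # first segment, e.g. "secret"
  rest=${api_path#*/}     # everything after the mount
  case "$rest" in
    data/*) printf '%s/%s\n' "$mount" "${rest#data/}" ;;
    *)      printf '%s\n' "$api_path" ;;
  esac
}
```

Usage: bao kv get "$(kv_cli_path secret/data/fleet-staging/FleetDeploySecrets)"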

2. Private repo deploy script

  • Create .envrc with minimal env:
    export OPENBAO_URL=https://secrets-stg.cb1.nationtech.io
    export HARMONY_CONFIG_NAMESPACE=fleet-staging
    # export OPENBAO_TOKEN=<root token for now; SSO later>
    
  • Write the deploy invocation (a shell script, or a bare harmony-fleet-deploy call):
    harmony-fleet-deploy --from-tag harmony-fleet-operator-vX.Y.Z --yes
    
  • Commit .envrc + script to private repo (shared with teammates)
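
One possible shape for the deploy script, as a sketch only. The function names are hypothetical, and the tag check assumes X.Y.Z is always a numeric version; only the harmony-fleet-deploy flags come from the plan above.

```shell
# Hypothetical deploy.sh helpers; names are illustrative, not part of harmony-fleet-deploy.

# Fail early if the variables from .envrc are not set.
check_env() {
  missing=0
  for name in OPENBAO_URL HARMONY_CONFIG_NAMESPACE; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumption: release tags look like harmony-fleet-operator-v1.2.3 (numeric parts).
valid_tag() {
  printf '%s\n' "$1" | grep -Eq '^harmony-fleet-operator-v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Validate inputs, then run the deploy exactly as written in the plan.
deploy() {
  check_env || return 1
  valid_tag "$1" || { echo "unexpected tag format: $1" >&2; return 1; }
  harmony-fleet-deploy --from-tag "$1" --yes
}
```

Calling deploy with a tag argument runs the two checks first, so a missing .envrc or a mistyped tag fails before anything touches the cluster.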

3. Execute operator upgrade

  • Run the deploy script from the private repo
  • Verify operator pod starts and connects to NATS
  • Verify operator reconciles existing CRs (check logs)
  • Confirm no regression in existing fleet functionality
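
The verification steps above tend to need a few polls before the pod is ready. A small retry helper is one way to script that; retry is a hypothetical name, and the oc invocation in the usage comment uses a placeholder deployment name.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (placeholder deployment name, fill in the real one):
#   retry 30 10 oc -n fleet-staging rollout status deployment/<operator-deployment> --timeout=10s
```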

4. Operator UI ingress (trivial)

  • Expose operator UI with TLS ingress on fleet-stg.<base_domain>
  • Verify the UI loads and serves the SPA
  • Confirm no auth gate yet (SSO is next)

5. SSO login flow

  • Wire operator UI to Zitadel SSO at sso-stg.<base_domain>
  • Test login/logout flow end-to-end
  • Verify session persistence across page reloads
  • Confirm RBAC: only authorized Zitadel users can access the UI
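
Steps 4 and 5 both use hostnames of the form <service>-stg.<base_domain>. A sketch for deriving them and the OIDC discovery URL to probe before wiring the UI; the helper names are hypothetical, and base_domain=cb1.nationtech.io is inferred from the hostnames under "Current state". The /.well-known/openid-configuration path is the standard OIDC discovery location.

```shell
# Build a staging hostname: <service>-stg.<base_domain>.
stg_host() {
  printf '%s-stg.%s\n' "$1" "$2"
}

# URL of the OIDC discovery document for an issuer host.
oidc_discovery_url() {
  printf 'https://%s/.well-known/openid-configuration\n' "$1"
}

# Example probes (not executed here; base_domain is an assumption):
#   base_domain=cb1.nationtech.io
#   curl -fsSI "https://$(stg_host fleet "$base_domain")/"
#   curl -fsS "$(oidc_discovery_url "$(stg_host sso "$base_domain")")"
```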

6. Real data in UI

  • Replace mock device list with live device-info KV data
  • Replace mock deployment list with live Deployment CR data
  • Wire per-device drilldown to real DeviceInfo + last-heartbeat + agent version
  • NATS tail panel: SSE stream of device-info and device-state updates (plain text)
  • Verify data refreshes without manual reload
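
For the NATS tail panel, the plain-text SSE wire format is simple enough to show in a few lines: each message is a "data: ..." field followed by a blank line. This only illustrates the framing, assuming one update per input line; the operator's actual implementation is not described in this plan, and to_sse is a hypothetical name.

```shell
# Frame each input line as one Server-Sent Events message:
# a "data: <line>" field terminated by a blank line.
to_sse() {
  # The "|| [ -n ... ]" clause also handles a final line with no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    printf 'data: %s\n\n' "$line"
  done
}
```

Any line-oriented source can be piped through it, for example: printf 'update one\nupdate two\n' | to_sse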

Configuration model

Environment (minimal, committed in private repo)

OPENBAO_URL=https://secrets-stg.cb1.nationtech.io
HARMONY_CONFIG_NAMESPACE=fleet-staging
# SSO auth or root token (SSO is the goal)

OpenBao (read via ConfigClient)

  • FleetDeployConfig (k8s namespaces, NATS URL, chart coords) at secret/data/fleet-staging/FleetDeployConfig
  • FleetDeploySecrets (operator creds) at secret/data/fleet-staging/FleetDeploySecrets
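
Both paths above follow the same pattern: mount, data, namespace, struct name. A sketch of that pattern driven by HARMONY_CONFIG_NAMESPACE, assuming the secret mount is fixed; config_path is a hypothetical helper, and how ConfigClient resolves paths internally is not shown in this plan.

```shell
# Build the KV v2 API path for a config struct in the current namespace.
# Assumption: everything lives under the "secret" mount.
config_path() {
  ns=${HARMONY_CONFIG_NAMESPACE:?HARMONY_CONFIG_NAMESPACE must be set}
  printf 'secret/data/%s/%s\n' "$ns" "$1"
}
```

With HARMONY_CONFIG_NAMESPACE=fleet-staging, config_path FleetDeployConfig prints secret/data/fleet-staging/FleetDeployConfig.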

Missing features (post-UI)

Auth & credentials

  • Per-device OpenBao policies (templated policies, one role per device type)
  • Device identity claim in JWT (Zitadel client_id with device- prefix)
  • OpenBao JWT auth role granularity (extend OpenbaoJwtAuth to list of roles)
  • Move k8s namespaces + chart coords into ConfigClient config struct (env = only identifier + auth)

Operator capabilities

  • Agent upgrade path (ADR-022 exists; implementation pending)
  • Device enrollment flow (operator-facing runbook)
  • Revoke device / rotate key operations
  • Fleet-wide rollout strategies (canary, %-based) on top of agent-upgrade primitive
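
For the %-based rollout item, the core arithmetic is a rounded-up share of the fleet. A sketch under stated assumptions: wave_size is a hypothetical name, percent is an integer from 1 to 100, and a non-empty fleet always yields at least one device so a canary wave is never empty.

```shell
# wave_size TOTAL PERCENT
# Number of devices in a percentage-based wave, rounded up.
wave_size() {
  total=$1
  percent=$2
  if [ "$total" -le 0 ]; then
    echo 0
    return 0
  fi
  # Integer ceiling of total * percent / 100.
  size=$(( (total * percent + 99) / 100 ))
  if [ "$size" -lt 1 ]; then
    size=1
  fi
  echo "$size"
}
```

For example, wave_size 250 10 prints 25, and wave_size 7 10 prints 1.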

Observability

  • Operator logs every CR it acquires (verify output reads well)
  • NATS debugging one-liners in hand-off menu
  • Journald log streaming (currently only .status.aggregate.lastError)
  • Metrics dashboard (deferred until >100 devices)

Quality & hardening

  • Agent config-driven labels ([labels] in agent toml → DeviceInfo)
  • matchExpressions in selectors (currently matchLabels only)
  • Device.status.conditions populated from heartbeat staleness
  • Operator graceful degradation on bad device_id (log + skip, don't restart-loop)
  • Persist nats_auth_pass and issuer NKey via harmony_secret (both are regenerated on every run today, which is a footgun)
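
The "log + skip" behaviour for bad device_id values can be illustrated with a filter that never aborts on bad input. The validity rule here is purely hypothetical (non-empty; letters, digits, '-' and '_' only), as are the function names; the operator's real rule is not specified in this plan.

```shell
# Hypothetical rule: non-empty, and only letters, digits, '-' and '_'.
valid_device_id() {
  case "$1" in
    ''|*[!A-Za-z0-9_-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Read device ids from stdin; print valid ones, log and skip the rest.
# A bad id produces a log line on stderr and processing continues.
filter_device_ids() {
  while IFS= read -r id; do
    if valid_device_id "$id"; then
      printf '%s\n' "$id"
    else
      echo "skipping invalid device_id: '$id'" >&2
    fi
  done
}
```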

Refactors (deferred, non-blocking)

  • Decompose FleetServerScore into independent, ConfigClient-glued Scores
  • Move harmony/modules/fleet/fleet/harmony-fleet/ (ADR-021 pending)
  • Delete examples/fleet_staging_deploy (superseded by fleet_staging_install)
  • Drop K8sAnywhereTopology for ad-hoc Score execution; introduce K8sBareTopology

Principles (carried forward)

  • No yaml in framework code paths
  • Scores describe desired state; topologies expose capabilities
  • Cross-boundary wire types in harmony-reconciler-contracts
  • Never ship untested code
  • Prove claims about upstream before blaming upstream
  • Design the brick before moving the brick