harmony/ROADMAP/fleet_platform/v0_3_plan.md

# Fleet Platform v0.3 — Staging to production-ready

Written 2026-05-31. Picks up after OpenBao, Zitadel, NATS, the auth callout, and the operator are deployed and functional on staging; the deployed versions are 2-3 weeks old.

## Current state

- [x] OpenBao running at `secrets-stg.cb1.nationtech.io`
- [x] Zitadel running at `sso-stg.cb1.nationtech.io`
- [x] NATS + auth callout deployed in `fleet-staging` namespace
- [x] Operator deployed (older version, 2-3 weeks old)
- [x] Config-driven OpenBao installer (`examples/openbao`)
- [x] `harmony-fleet-deploy` binary reads `FleetDeployConfig` + `FleetDeploySecrets` from OpenBao

## Immediate next steps

### 1. Provision operator credentials in OpenBao

- [ ] Fetch existing creds from the running cluster:
  ```bash
  oc -n fleet-staging get secret harmony-fleet-operator-secrets -o jsonpath='{.data.credentials\.toml}' | base64 -d
  ```
- [ ] Seed into OpenBao at `secret/data/fleet-staging/FleetDeploySecrets` (the KV v2 API path; the `bao kv` CLI takes the same path without the `data/` segment):
  ```bash
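  export VAULT_ADDR=https://secrets-stg.cb1.nationtech.io
  # Placeholder; quoted so the shell does not treat `<` and `>` as redirections.
  export VAULT_TOKEN='<root token>'
  # Wrap the decoded TOML as {"value": "<JSON-encoded object>"} and write it from stdin (`-`).
  oc -n fleet-staging get secret harmony-fleet-operator-secrets -o jsonpath='{.data.credentials\.toml}' | base64 -d \
    | jq -Rs '{value: ({operator_credentials_toml: .} | tojson)}' \
    | bao kv put secret/fleet-staging/FleetDeploySecrets -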
  ```
- [ ] Verify the secret is readable: `bao kv get secret/fleet-staging/FleetDeploySecrets`
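
One caveat with the seeding pipeline in this step: if `oc` fails or the key is absent, the later stages still run and can write an empty credential. A minimal sketch of a guard, assuming the same secret name and path as above; `require_nonempty` and `seed_operator_credentials` are hypothetical helper names, not part of any existing script:

```bash
# Refuse to continue when the fetched credentials are empty
# (for example, when `oc` failed or the secret key does not exist).
require_nonempty() {
  if [ -z "$1" ]; then
    echo "refusing to seed: fetched credentials are empty" >&2
    return 1
  fi
}

# Fetch first, check, and only then write to OpenBao.
seed_operator_credentials() {
  creds=$(oc -n fleet-staging get secret harmony-fleet-operator-secrets \
    -o jsonpath='{.data.credentials\.toml}' | base64 -d)
  require_nonempty "$creds" || return 1
  # Command substitution strips trailing newlines; add one back for the TOML.
  printf '%s\n' "$creds" \
    | jq -Rs '{value: ({operator_credentials_toml: .} | tojson)}' \
    | bao kv put secret/fleet-staging/FleetDeploySecrets -
}
```

Calling `seed_operator_credentials` with `VAULT_ADDR` and `VAULT_TOKEN` exported would then replace the one-shot pipeline.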

### 2. Private repo deploy script

- [ ] Create `.envrc` with minimal env:
  ```bash
  export OPENBAO_URL=https://secrets-stg.cb1.nationtech.io
  export HARMONY_CONFIG_NAMESPACE=fleet-staging
  # export OPENBAO_TOKEN=<root token for now; SSO later>
  ```
- [ ] Write the deploy invocation (a shell script, or a bare `harmony-fleet-deploy` call):
  ```bash
  harmony-fleet-deploy --from-tag harmony-fleet-operator-vX.Y.Z --yes
  ```
- [ ] Commit `.envrc` + script to private repo (shared with teammates)
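
If the invocation ends up as a script, it could look roughly like the sketch below. It assumes the two variables from `.envrc` are the only required environment; `require_env` and the file name `deploy.sh` are hypothetical, and authentication (token or SSO) is left to the caller:

```bash
# Return non-zero if any named variable is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required env var: $name" >&2
      return 1
    fi
  done
}

# Deploy only when a tag is passed, e.g.
#   ./deploy.sh harmony-fleet-operator-vX.Y.Z
if [ "$#" -gt 0 ]; then
  require_env OPENBAO_URL HARMONY_CONFIG_NAMESPACE || exit 1
  harmony-fleet-deploy --from-tag "$1" --yes
fi
```

Failing before the deploy call keeps a missing `.envrc` from turning into a confusing error further down.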

### 3. Execute operator upgrade

- [ ] Run the deploy script from the private repo
- [ ] Verify operator pod starts and connects to NATS
- [ ] Verify operator reconciles existing CRs (check logs)
- [ ] Confirm no regression in existing fleet functionality
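
The pod and log checks above can be scripted with a small polling helper. This is a sketch only: `retry` is a hypothetical helper, and the deployment name in the usage comments is a guess that needs to be replaced with the real one:

```bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example usage (deployment name is an assumption):
#   retry 5 10 oc -n fleet-staging rollout status deployment/harmony-fleet-operator --timeout=60s
#   oc -n fleet-staging logs deployment/harmony-fleet-operator --since=10m
```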

### 4. Operator UI ingress (trivial)

- [ ] Expose operator UI with TLS ingress on `fleet-stg.<base_domain>`
- [ ] Verify the UI loads and serves the SPA
- [ ] Confirm no auth gate yet (SSO is next)

### 5. SSO login flow

- [ ] Wire operator UI to Zitadel SSO at `sso-stg.<base_domain>`
- [ ] Test login/logout flow end-to-end
- [ ] Verify session persistence across page reloads
- [ ] Confirm RBAC: only authorized Zitadel users can access the UI
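
Steps 4 and 5 expect opposite results from an unauthenticated request to the UI: open before SSO, gated after. A rough probe, assuming the gate is enforced server-side (a purely client-side redirect in the SPA would still return 200); `auth_gate_state` and `probe_ui` are hypothetical names:

```bash
# Classify the HTTP status of an unauthenticated request to the UI root.
auth_gate_state() {
  case "$1" in
    200) echo "open" ;;
    301|302|303|307|308|401|403) echo "gated" ;;
    *) echo "unexpected" ;;
  esac
}

# Print only the status code for a URL; redirects are not followed.
probe_ui() {
  curl -sS -o /dev/null -w '%{http_code}' "$1"
}

# Example usage (pass the real UI URL, using https so that an
# http-to-https redirect is not mistaken for an auth gate):
#   auth_gate_state "$(probe_ui "$UI_URL")"
```

After step 4 the expected answer is `open`; after step 5 it should be `gated`.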

### 6. Real data in UI

- [ ] Replace mock device list with live `device-info` KV data
- [ ] Replace mock deployment list with live `Deployment` CR data
- [ ] Wire per-device drilldown to real `DeviceInfo` + last-heartbeat + agent version
- [ ] NATS tail panel: SSE stream of `device-info` and `device-state` updates (plain text)
- [ ] Verify data refreshes without manual reload
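
For checking the tail panel's SSE stream from a terminal, a small filter that keeps only the `data:` payloads may help. This is a sketch; `sse_data` is a hypothetical helper and the endpoint in the usage comment is a placeholder, since the stream's URL is not defined in this plan:

```bash
# Print the payload of each SSE `data:` line. Other lines (blank
# separators, `event:` fields, `:` comments) are dropped.
sse_data() {
  sed -n 's/^data: \{0,1\}//p'
}

# Example usage (endpoint path is an assumption):
#   curl -sS -N -H 'Accept: text/event-stream' "$UI_URL/<tail endpoint>" | sse_data
```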

## Configuration model

### Environment (minimal, committed in private repo)

```bash
OPENBAO_URL=https://secrets-stg.cb1.nationtech.io
HARMONY_CONFIG_NAMESPACE=fleet-staging
# SSO auth or root token (SSO is the goal)
```

### OpenBao (read via ConfigClient)

- `FleetDeployConfig` (k8s namespaces, NATS URL, chart coords) at `secret/data/fleet-staging/FleetDeployConfig`
- `FleetDeploySecrets` (operator creds) at `secret/data/fleet-staging/FleetDeploySecrets`
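
The two paths above follow one pattern, which the sketch below spells out. It assumes the `secret` mount is KV v2 (hence the `data/` segment in the API path) and that `HARMONY_CONFIG_NAMESPACE` maps directly to the path segment; all three function names are hypothetical:

```bash
# KV v2 API path for a config struct in a namespace.
config_api_path() {
  printf 'secret/data/%s/%s\n' "$1" "$2"
}

# Path as the `bao kv` CLI expects it (no `data/` segment).
config_cli_path() {
  printf 'secret/%s/%s\n' "$1" "$2"
}

# Report whether both structs are readable with the current credentials.
check_config() {
  ns=${HARMONY_CONFIG_NAMESPACE:?set HARMONY_CONFIG_NAMESPACE first}
  for name in FleetDeployConfig FleetDeploySecrets; do
    if bao kv get "$(config_cli_path "$ns" "$name")" >/dev/null 2>&1; then
      echo "ok       $(config_api_path "$ns" "$name")"
    else
      echo "missing  $(config_api_path "$ns" "$name")"
    fi
  done
}
```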

## Missing features (post-UI)

### Auth & credentials

- [ ] Per-device OpenBao policies (templated policies, one role per device type)
- [ ] Device identity claim in JWT (Zitadel `client_id` with `device-` prefix)
- [ ] OpenBao JWT auth role granularity (extend `OpenbaoJwtAuth` to list of roles)
- [x] Move k8s namespaces + chart coords into `ConfigClient` config struct (env = only identifier + auth)

### Operator capabilities

- [ ] Agent upgrade path (ADR-022 exists; implementation pending)
- [ ] Device enrollment flow (operator-facing runbook)
- [ ] Revoke device / rotate key operations
- [ ] Fleet-wide rollout strategies (canary, percentage-based) on top of the agent-upgrade primitive

### Observability

- [ ] Operator logs every CR it acquires (verify output reads well)
- [ ] NATS debugging one-liners in hand-off menu
- [ ] Journald log streaming (currently only `.status.aggregate.lastError` is surfaced)
- [ ] Metrics dashboard (deferred until >100 devices)

### Quality & hardening

- [ ] Agent config-driven labels (`[labels]` in agent TOML → `DeviceInfo`)
- [ ] `matchExpressions` in selectors (currently `matchLabels` only)
- [ ] `Device.status.conditions` populated from heartbeat staleness
- [ ] Operator graceful degradation on bad `device_id` (log + skip, don't restart-loop)
- [ ] Persist `nats_auth_pass` and issuer NKey via `harmony_secret` (both are regenerated on every run today, which is a footgun)
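
The heartbeat-staleness item in this list comes down to a timestamp comparison. The sketch below shows only that comparison, in shell for illustration; the real logic would live in the operator, and `heartbeat_state` and the threshold are hypothetical:

```bash
# Usage: heartbeat_state LAST_HEARTBEAT_EPOCH NOW_EPOCH MAX_AGE_SECONDS
# Prints "fresh" when the heartbeat is at most MAX_AGE_SECONDS old,
# otherwise "stale".
heartbeat_state() {
  age=$(( $2 - $1 ))
  if [ "$age" -le "$3" ]; then
    echo "fresh"
  else
    echo "stale"
  fi
}

# Example usage (the 90-second threshold is an assumption):
#   heartbeat_state "$last_heartbeat" "$(date +%s)" 90
```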

### Refactors (deferred, non-blocking)

- [ ] Decompose `FleetServerScore` into independent, ConfigClient-glued Scores
- [ ] Move `harmony/modules/fleet/` → `fleet/harmony-fleet/` (ADR-021 pending)
- [ ] Delete `examples/fleet_staging_deploy` (superseded by `fleet_staging_install`)
- [ ] Drop `K8sAnywhereTopology` for ad-hoc Score execution; introduce `K8sBareTopology`

## Principles (carried forward)

- No YAML in framework code paths
- Scores describe desired state; topologies expose capabilities
- Cross-boundary wire types in `harmony-reconciler-contracts`
- Never ship untested code
- Prove claims about upstream before blaming upstream
- Design the brick before moving the brick