# Harmony Fleet

IoT / decentralized-edge orchestration for harmony. A fleet stack is:

| Component | Crate | Role |
|---|---|---|
| **Operator** | [`harmony-fleet-operator`](harmony-fleet-operator/) | Watches `Deployment` CRs, writes desired state into NATS JetStream KV, aggregates device state back into CR status. Runtime binary; no `harmony` dep. |
| **Agent** | [`harmony-fleet-agent`](harmony-fleet-agent/) | One per device. Watches the desired-state KV, drives the local runtime (podman today), publishes heartbeats + per-deployment state, answers `device-commands.*` request/reply. |
| **Auth** | [`harmony-fleet-auth`](harmony-fleet-auth/) | Shared NATS credential plumbing: `TomlShared` (dev) and `ZitadelJwt` (prod with auth-callout). |
| **Deploy** | [`harmony-fleet-deploy`](harmony-fleet-deploy/) | The canonical deploy crate. Imports `harmony` and exposes one `*Score` per component (`FleetOperatorScore`, `FleetAgentScore`, `FleetNatsScore`, `FleetServerScore`). Both the production CLI and the e2e harness compose these; see [ADR-023](../docs/adr/023-deploy-architecture.md). |
| **E2E harness** | [`harmony-fleet-e2e`](harmony-fleet-e2e/) | Brings the stack up in a fresh k3d namespace and runs integration tests against it. |

The on-the-wire types both ends agree on (KV bucket names, key formats, command-protocol payloads) live in [`../harmony-reconciler-contracts`](../harmony-reconciler-contracts/).

## Architecture in one line

`FleetOperatorScore`, `FleetAgentScore`, etc. are real Rust types with capability-bound `Topology` parameters. Production deploys, the e2e harness, and any future control-plane tool all compose the **same** Scores; the only thing that changes is the `Topology` instance. **No handrolled YAML or imperative manifest factories anywhere.** Read [ADR-023](../docs/adr/023-deploy-architecture.md) before adding deploy logic.
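
To make "capability-bound `Topology` parameters" concrete, here is a minimal sketch of the pattern. Every name in it (`K8sApply`, `HelmInstall`, `Score`, `OperatorScore`, `NatsScore`, `RecordingTopology`, `deploy_all`) is hypothetical and chosen for illustration; the real traits and Scores live in `harmony` and `harmony-fleet-deploy` and will differ in shape.

```rust
/// Hypothetical capability: a topology that can apply Kubernetes objects.
trait K8sApply {
    fn apply(&mut self, kind: &str, name: &str);
}

/// Hypothetical capability: a topology that can install Helm releases.
trait HelmInstall {
    fn install(&mut self, release: &str);
}

/// Hypothetical stand-in for a Score, generic over the topology it targets.
trait Score<T> {
    fn name(&self) -> &'static str;
    fn deploy(&self, topology: &mut T);
}

struct OperatorScore {
    namespace: String,
}

struct NatsScore {
    release: String,
}

// The Score states the capability it needs as a trait bound on T.
impl<T: K8sApply> Score<T> for OperatorScore {
    fn name(&self) -> &'static str {
        "OperatorScore"
    }
    fn deploy(&self, topology: &mut T) {
        topology.apply("Deployment", &format!("{}/operator", self.namespace));
    }
}

impl<T: HelmInstall> Score<T> for NatsScore {
    fn name(&self) -> &'static str {
        "NatsScore"
    }
    fn deploy(&self, topology: &mut T) {
        topology.install(&self.release);
    }
}

/// A fake topology that only records what it was asked to do.
#[derive(Default)]
struct RecordingTopology {
    log: Vec<String>,
}

impl K8sApply for RecordingTopology {
    fn apply(&mut self, kind: &str, name: &str) {
        self.log.push(format!("apply {} {}", kind, name));
    }
}

impl HelmInstall for RecordingTopology {
    fn install(&mut self, release: &str) {
        self.log.push(format!("helm install {}", release));
    }
}

/// Deploys the same Scores against whichever topology the caller passes in.
fn deploy_all<T>(scores: &[&dyn Score<T>], topology: &mut T) -> Vec<&'static str> {
    scores
        .iter()
        .map(|score| {
            score.deploy(topology);
            score.name()
        })
        .collect()
}
```

In this sketch a production caller and a test harness would build the same list of Scores and differ only in the `T` they hand to `deploy_all`; a topology that lacks a required capability fails to compile rather than failing at deploy time.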

---

## Quickstart: run the e2e ping test

The fastest path to a green fleet stack on your laptop. Requires `podman`, `kubectl`, and `helm` on `$PATH`; everything else (`k3d`, the NATS chart, all images) is fetched or built on demand.

```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test ping -- --nocapture
```

What it does, in order:

1. Ensures a `fleet-e2e` k3d cluster exists (creates one if not). NodePort `30423` on the host forwards to NATS inside the cluster.
2. Builds `harmony-fleet-agent` in release mode, packages it into `localhost/harmony-fleet-agent:e2e`, and sideloads the image into the k3d cluster's containerd store.
3. Mints a per-bring-up namespace `e2e-<uuid8>` and prunes any leftover `e2e-*` namespaces from prior runs. NodePort `30423` is cluster-scoped, so a stuck `Terminating` namespace would block the new bring-up; the prune waits up to 90 s for full cleanup before proceeding.
4. Deploys NATS via `FleetNatsScore` (helm chart, JetStream on, static admin/device users, NodePort Service).
5. Waits for NATS to be reachable from the host on `nats://localhost:30423` (user `admin`, password `e2e-admin`).
6. Deploys one `FleetAgentScore { target: Pod }`, which runs with `runtime_enabled = false` so it skips podman and only runs the command-server + heartbeat loop.
7. Waits for the agent Deployment to be Ready.
8. The test publishes `device-commands.<device_id>.ping` via `FleetCommandsClient::ping` and asserts the agent replies with `{ device_id, agent_version, uptime_s }`.
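
Step 3's "wait up to 90 s for full cleanup" is a poll-with-deadline loop. The sketch below shows one way such a loop can be written; it is not the harness's actual `prune_stale_namespaces` code. The function name `wait_until_gone` is hypothetical, and the `is_gone` closure stands in for whatever Kubernetes API check reports that a namespace no longer exists.

```rust
use std::time::{Duration, Instant};

/// Calls `is_gone` repeatedly until it returns true or `timeout` elapses.
/// Returns Ok(elapsed) once the resource is gone, Err(elapsed) on timeout.
/// `is_gone` is a stand-in for a real "does the namespace still exist?" check.
fn wait_until_gone<F>(mut is_gone: F, timeout: Duration, interval: Duration) -> Result<Duration, Duration>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        if is_gone() {
            return Ok(start.elapsed());
        }
        let elapsed = start.elapsed();
        if elapsed >= timeout {
            return Err(elapsed);
        }
        // Never sleep past the deadline, so the last check happens at the deadline.
        let remaining = timeout.saturating_sub(elapsed);
        std::thread::sleep(interval.min(remaining));
    }
}
```

With `timeout` set to `Duration::from_secs(90)` this matches the bound described in step 3; blocking until the namespace is really gone is what frees the cluster-scoped NodePort for the next bring-up.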

Cold first run: ~80 s (release build of the agent dominates). Warm: ~25 s.

### Useful env knobs

| Var | Effect |
|---|---|
| `HARMONY_FLEET_E2E=1` | Required. Without it the test is skipped, which keeps `cargo test --workspace` cheap on machines without k3d. |
| `FLEET_E2E_KEEP=1` | Skip namespace teardown on Drop. Lets you `kubectl -n e2e-<…> logs deploy/…` after a failure. The next run prunes it. |
| `RUST_LOG=info` | Or `debug` for the per-message `command dispatch` traces inside `harmony-fleet-agent::command_server`. |
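
The first two knobs boil down to an opt-in gate and a Drop guard that can be told to skip teardown. The sketch below illustrates that shape under stated assumptions: `flag_set` and `NamespaceGuard` are hypothetical names, treating only the exact value `1` as "set" is an assumption, and the counter stands in for the real namespace deletion.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Returns true when an env value is exactly "1".
/// Assumption: the real harness may accept other truthy values.
fn flag_set(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Hypothetical guard owning one e2e namespace for the duration of a run.
struct NamespaceGuard {
    name: String,
    keep: bool,
    teardowns: Arc<AtomicUsize>,
}

impl NamespaceGuard {
    fn new(name: &str, keep: bool, teardowns: Arc<AtomicUsize>) -> Self {
        NamespaceGuard {
            name: name.to_string(),
            keep,
            teardowns,
        }
    }
}

impl Drop for NamespaceGuard {
    fn drop(&mut self) {
        if self.keep {
            // FLEET_E2E_KEEP=1 behaviour: leave the namespace for inspection.
            eprintln!("keeping namespace {}", self.name);
            return;
        }
        // Stand-in for the real deletion call: just count the teardown.
        eprintln!("deleting namespace {}", self.name);
        self.teardowns.fetch_add(1, Ordering::SeqCst);
    }
}
```

A test entry point would read `std::env::var("HARMONY_FLEET_E2E").ok()`, pass its `as_deref()` to `flag_set`, and return early when the gate is closed.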

### Connecting to NATS while the stack is up

```bash
# Host-side, via the NodePort
nats://localhost:30423   # user=admin pass=e2e-admin (full access)
nats://localhost:30423   # user=device pass=e2e-device (device permissions)
```

```bash
# In-cluster, from any Pod in the same namespace
nats://fleet-nats.e2e-<uuid8>.svc.cluster.local:4222
```

Setting `FLEET_E2E_KEEP=1` and reading the harness's stdout line `[e2e] NATS: nats://127.0.0.1:30423 …` is the path most tests will take: leave the harness running and point a NATS client at that URL.
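
Before pointing a client at that URL it can help to confirm the NodePort is accepting connections at all. The following is a small sketch using only the standard library; `nats_reachable` is a hypothetical helper, and a successful TCP connect only shows the port is open, not that NATS authentication will succeed.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to the host:port of a `nats://` URL succeeds.
/// This checks reachability only; it does not speak the NATS protocol.
fn nats_reachable(url: &str, timeout: Duration) -> bool {
    let host_port = match url.strip_prefix("nats://") {
        Some(rest) => rest,
        None => return false, // not a nats:// URL
    };
    match host_port.to_socket_addrs() {
        // A host name may resolve to several addresses; any one connecting is enough.
        Ok(mut addrs) => addrs.any(|addr| TcpStream::connect_timeout(&addr, timeout).is_ok()),
        Err(_) => false,
    }
}
```

For the e2e stack the call would be `nats_reachable("nats://127.0.0.1:30423", Duration::from_secs(1))`.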

### Inspecting the agent

```bash
# Find your namespace
kubectl get ns -l harmony.io/managed-by=fleet-e2e

# Tail the agent
kubectl -n e2e-<uuid8> logs deploy/fleet-agent-<device-id> -f

# Tail NATS (StatefulSet, not Deployment)
kubectl -n e2e-<uuid8> logs sts/fleet-nats -c nats -f

# Send a ping by hand (requires the `nats` CLI:
# https://github.com/nats-io/natscli/releases)
nats --server nats://localhost:30423 --user admin --password e2e-admin \
    request "device-commands.vm-device-00-<uuid8>.ping" ""
```

You should see something like `{"device_id":"vm-device-00-<uuid8>","agent_version":"0.1.0","uptime_s":12}`.
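
The subject used for the ping follows the `device-commands.<device_id>.ping` pattern, so the device ID has to be a single NATS subject token. As an illustration (the helper `ping_subject` is hypothetical and not part of `FleetCommandsClient`), a client could build and validate the subject like this:

```rust
/// Builds the ping subject for a device, rejecting IDs that would not be a
/// single NATS subject token (empty, containing '.', wildcards, or whitespace).
fn ping_subject(device_id: &str) -> Result<String, String> {
    let invalid = device_id.is_empty()
        || device_id
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace());
    if invalid {
        return Err(format!("invalid device id: {:?}", device_id));
    }
    Ok(format!("device-commands.{}.ping", device_id))
}
```

Rejecting `.` matters because NATS splits subjects into tokens on dots; an ID containing one would silently address a different subject hierarchy.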

### Cleaning up

The shared `OnceCell` in `harmony-fleet-e2e` lives for the test binary's lifetime, and Rust does not run destructors for statics at process exit, so the teardown-on-Drop never fires and namespaces survive a `cargo test` exit. The next `cargo test` invocation prunes them. To force a manual cleanup:

```bash
kubectl delete ns -l harmony.io/managed-by=fleet-e2e
# wipe the whole cluster:
k3d cluster delete fleet-e2e
```

---

## Production deploys

`harmony-fleet-deploy` is the binary that puts the fleet stack on a real cluster (OKD, vanilla k8s, anywhere `K8sAnywhereTopology` can reach). It composes `FleetNatsScore` + `FleetOperatorScore` + `FleetAgentScore` against the topology you point it at.

```bash
# Default: K8sAnywhereTopology against whatever KUBECONFIG points at
cargo run -p harmony-fleet-deploy -- \
    --namespace fleet-system \
    --operator-image hub.nationtech.io/harmony/harmony-fleet-operator:dev \
    --agent-image hub.nationtech.io/harmony/harmony-fleet-agent:dev \
    --agent-device-id fleet-agent-01

# Pick a single component with the harmony_cli filter
cargo run -p harmony-fleet-deploy -- \
    --namespace fleet-system \
    -- --filter FleetOperatorScore --all
```

`harmony-fleet-deploy` reads its full config from CLI flags and env vars (`FLEET_NAMESPACE`, `FLEET_OPERATOR_IMAGE`, …). The minimal CLI surface is deliberate: per ADR-023 the long-term answer is a plugin-discovery layer over `harmony-*` binaries; until that lands, deploy crates stay small and use the existing `harmony_cli`.
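
When a setting can come from both a flag and an env var, something has to decide which one wins. The sketch below shows one common resolution order (flag, then env var, then default). That precedence is an assumption made for illustration, not something the source states about `harmony-fleet-deploy`, and `resolve` and `non_empty` are hypothetical helpers.

```rust
/// Treats an empty string the same as "not provided".
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.filter(|s| !s.is_empty())
}

/// Resolves one setting. Assumed precedence: CLI flag, then env var, then default.
fn resolve(flag: Option<&str>, env: Option<&str>, default: &str) -> String {
    non_empty(flag)
        .or(non_empty(env))
        .unwrap_or(default)
        .to_string()
}
```

For the namespace this would be called as `resolve(cli_namespace, std::env::var("FLEET_NAMESPACE").ok().as_deref(), "fleet-system")`, where `cli_namespace` is whatever the argument parser produced.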

### Connecting to the operator

The operator runs as a single-replica Deployment in `--namespace` (default `fleet-system`).

```bash
# Tail logs
kubectl -n fleet-system logs deploy/harmony-fleet-operator -f

# Port-forward the embedded web dashboard (web-frontend feature)
kubectl -n fleet-system port-forward deploy/harmony-fleet-operator 18080:18080

# Or run the dashboard standalone with seeded fake data (no NATS, no cluster)
cargo run -p harmony-fleet-operator --features web-frontend -- serve-web --mock
# browse http://127.0.0.1:18080
```

---

## Existing manual rehearsal: `examples/fleet_e2e_demo`

`examples/fleet_e2e_demo` brings up a *fuller* stack than the e2e harness (real Zitadel, the auth-callout, libvirt VM agents over SSH) at the cost of a 5-min cold start. It's the manual rehearsal flow, not what you want during the dev loop. See the example's [`RUNBOOK.md`](../examples/fleet_e2e_demo/RUNBOOK.md).

The harness and the rehearsal will converge: the [follow-up PR](#whats-next) lifts `FleetCalloutScore` + a mock-OIDC fixture into `harmony-fleet-deploy`, at which point the harness can run the full production auth path in ~30 s instead of 5 min, and `fleet_e2e_demo` thins down to a caller over the same Scores.

---

## What's next

This branch lands the deploy-architecture cleanup (ADR-023), the per-component Scores, and the ping path. Slated immediately after:

1. **Zitadel + auth callout in `harmony-fleet-deploy`.** New `FleetCalloutScore` (preset over `NatsAuthCalloutScore`) plus an in-cluster mock-OIDC fixture so the e2e harness can exercise the real auth-callout code path without paying Zitadel's 5-min cold-start cost. The harness's `AuthMode::Callout` variant is already on the public API for this.
2. **Operator pod in the e2e harness.** `FleetOperatorScore` is already in the deploy crate; wiring it into the harness gives integration tests against the actual `Deployment` / `Device` reconcile loops.
3. **`Verb::Logs` and `Verb::Exec`.** The next two verbs on the `device-commands.*` protocol. Same harness, same TDD shape as `ping`.
4. **CRD types out of `harmony` core.** `harmony::modules::fleet::operator::crd` is the last fleet-deploy thing still living in `harmony`. The `ReconcileScore` payload coupling is the only blocker.
5. **Smoke-test contract.** ADR-023 principle 4: every Score blocks on a smoke test before `deploy` returns success. Today the e2e suite plays that role; the trait/companion shape lands once it's been validated in practice.
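
To illustrate how item 3's new verbs could slot into the existing `device-commands.<device_id>.<verb>` subject layout, here is a sketch of agent-side subject parsing. The `Verb` enum here is a local stand-in, `parse_subject` is hypothetical, and the `logs` and `exec` subject tokens are assumptions; only `ping` appears in the current protocol.

```rust
/// Local stand-in for the protocol's verb type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verb {
    Ping,
    Logs, // assumed token: "logs"
    Exec, // assumed token: "exec"
}

/// Splits `device-commands.<device_id>.<verb>` into its device ID and verb.
/// Returns None for any subject that does not match that exact three-part shape.
fn parse_subject(subject: &str) -> Option<(&str, Verb)> {
    let rest = subject.strip_prefix("device-commands.")?;
    let (device_id, verb) = rest.rsplit_once('.')?;
    if device_id.is_empty() || device_id.contains('.') {
        return None;
    }
    let verb = match verb {
        "ping" => Verb::Ping,
        "logs" => Verb::Logs,
        "exec" => Verb::Exec,
        _ => return None,
    };
    Some((device_id, verb))
}
```

Keeping the verb as the last token means an agent can subscribe once to `device-commands.<its id>.*` and dispatch on the parsed `Verb`, so adding a verb does not change the subscription.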

See [`PLAN_requests_over_nats.md`](PLAN_requests_over_nats.md) for the full TDD-style plan this branch implements.