feat/fleet-e2e-harness-and-ping #286

Closed
johnride wants to merge 18 commits from feat/fleet-e2e-harness-and-ping into feat/iot-walking-skeleton
Owner
No description provided.
johnride added 5 commits 2026-05-19 14:37:03 +00:00
feat: maud + htmx + tailwindcss frontend for fleet operator, initial commit, still much work to do
Some checks failed
Run Check Script / check (pull_request) Failing after 59s
ee95a5d1a3
First slice of the device-commands.* protocol from
fleet/requests_over_nats.md. Lands `Verb::Ping` plus the harness that
proves it works against a real in-cluster agent.

Wire types (`harmony-reconciler-contracts::commands`):
- `Verb::Ping`, `CommandRequest`, `PingReply`, `ErrorReply`/`ErrorKind`
- `device_command_subject` / `device_command_subscription` helpers
- `X-Harmony-*` header constants
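
As an illustration of the subject helpers listed above, here is a minimal, dependency-free sketch. Only the helper names and the `device-commands.<id>.>` subscription shape come from this commit message; the function signatures, the `as_token` method, and the lowercase `ping` token are assumptions, not the crate's actual API.

```rust
/// Verbs an operator can send to a device. Only `Ping` exists in this slice.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verb {
    Ping,
}

impl Verb {
    /// Subject token for the verb (hypothetical helper; the lowercase
    /// spelling is an assumption).
    pub fn as_token(self) -> &'static str {
        match self {
            Verb::Ping => "ping",
        }
    }
}

/// Subject a single request is published on. The layout
/// `device-commands.<id>.<verb>` is assumed from the subscription shape.
pub fn device_command_subject(device_id: &str, verb: Verb) -> String {
    format!("device-commands.{device_id}.{}", verb.as_token())
}

/// Wildcard subject the agent subscribes on; `>` matches every
/// remaining token, so one subscription covers all verbs.
pub fn device_command_subscription(device_id: &str) -> String {
    format!("device-commands.{device_id}.>")
}
```

Under these assumptions, `device_command_subject("dev-1", Verb::Ping)` yields a subject that the `device_command_subscription("dev-1")` wildcard matches.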

Agent:
- `command_server.rs` subscribes on `device-commands.<id>.>` and
  dispatches verbs; ping handler replies with `PingReply`
- New `[agent].runtime_enabled` config flag (default true). When
  false, podman init + reconciler loop are skipped so the agent can
  run as a Pod on containerd-only k3d nodes; command server +
  heartbeat still run
- `Dockerfile`: canonical multi-stage build for production registries
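
The effect of `runtime_enabled` can be sketched as a pure function from config to the set of components that start. The `AgentConfig` struct, the `Component` enum, and `components_to_start` are hypothetical names for illustration; only the flag name, its default of `true`, and which parts are skipped come from the description above. The order of the returned list is not meant to imply a startup order.

```rust
/// Hypothetical mirror of the `[agent]` config table.
#[derive(Debug, Clone)]
pub struct AgentConfig {
    pub runtime_enabled: bool,
}

impl Default for AgentConfig {
    /// The commit message states the flag defaults to true.
    fn default() -> Self {
        Self { runtime_enabled: true }
    }
}

/// Agent parts named in the commit message (enum itself is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Component {
    CommandServer,
    Heartbeat,
    PodmanInit,
    ReconcilerLoop,
}

/// Command server and heartbeat always run; podman init and the
/// reconciler loop only run when the runtime is enabled.
pub fn components_to_start(cfg: &AgentConfig) -> Vec<Component> {
    let mut out = vec![Component::CommandServer, Component::Heartbeat];
    if cfg.runtime_enabled {
        out.push(Component::PodmanInit);
        out.push(Component::ReconcilerLoop);
    }
    out
}
```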

Operator:
- `commands::FleetCommandsClient` with typed `CommandError`
  (`DeviceOffline` via `no_responders`, `Timeout`, `BadReply`, `Nats`)
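
A rough sketch of what a typed `CommandError` buys the caller. The four variant names come from the list above; the payloads, the messages, and the `is_retryable` helper are assumptions for illustration. `Display` is hand-written here only to keep the sketch dependency-free.

```rust
use std::fmt;
use std::time::Duration;

/// Variant names are from the commit message; payloads are assumed.
#[derive(Debug)]
pub enum CommandError {
    /// NATS reported `no_responders`: nothing is subscribed for this device.
    DeviceOffline { device_id: String },
    /// No reply arrived within the given duration.
    Timeout(Duration),
    /// A reply arrived but could not be decoded.
    BadReply(String),
    /// Any other transport-level failure.
    Nats(String),
}

impl fmt::Display for CommandError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CommandError::DeviceOffline { device_id } => {
                write!(f, "device {device_id} is offline (no responders)")
            }
            CommandError::Timeout(d) => write!(f, "no reply within {d:?}"),
            CommandError::BadReply(msg) => write!(f, "malformed reply: {msg}"),
            CommandError::Nats(msg) => write!(f, "NATS error: {msg}"),
        }
    }
}

impl std::error::Error for CommandError {}

/// Hypothetical policy helper: callers can branch on the variant
/// instead of parsing an error string.
pub fn is_retryable(err: &CommandError) -> bool {
    matches!(err, CommandError::Timeout(_) | CommandError::Nats(_))
}
```

With typed variants, a dashboard could show "offline" for `DeviceOffline` and retry only on `Timeout` or `Nats`, which an opaque error string would not allow without matching on text.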

E2E harness (`harmony-fleet-e2e`):
- Library crate + integration test. `Stack::bring_up` provisions a
  fresh `e2e-<uuid8>` namespace in a shared `fleet-e2e` k3d cluster,
  deploys NATS (UserPass auth, JetStream on) + the agent Pod, returns
  a connected admin NATS client, and tears the namespace down on Drop
- v1 ships `AuthMode::UserPass` only; the `Callout` variant is
  reserved on the public API for the follow-up PR that adds the mock
  OIDC fixture + NatsAuthCalloutScore deployment
- Operator pod deployment is also follow-up — for ping the test
  process drives `FleetCommandsClient` directly against the cluster's
  NATS NodePort
- `HARMONY_FLEET_E2E=1` gates the integration test so default
  `cargo test --workspace` runs don't depend on k3d/podman
- Image build + sideload mirrors the `fleet_auth_callout` pattern:
  host `cargo build --release` → single-stage Dockerfile → `podman
  build` → `k3d image import`. ~12s warm bring-up, ~80s cold
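
The `HARMONY_FLEET_E2E=1` gate could be as small as the sketch below. The function names are hypothetical, and treating only the exact value `1` as "enabled" is an assumption; the real harness may accept other values.

```rust
/// Pure check, split out so it can be tested without touching the
/// process environment. Only the exact value "1" enables the suite
/// (an assumption based on the `HARMONY_FLEET_E2E=1` spelling).
pub fn e2e_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads the gate from the environment. An unset or non-UTF-8
/// variable counts as disabled.
pub fn e2e_enabled_from_env() -> bool {
    e2e_enabled(std::env::var("HARMONY_FLEET_E2E").ok().as_deref())
}
```

An integration test would call `e2e_enabled_from_env()` first and return early when it is false, so a default `cargo test --workspace` passes on machines without k3d or podman.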
The previous e2e harness handrolled k8s manifests in `stack.rs`,
bypassing the Score-Topology-Interpret machinery harmony exists to
provide. This commit:

1. **ADR-023** codifies the rules: deploy with Scores (not
   manifests), e2e uses the same Scores as production, one Score
   per component, deploy blocks on smoke-test success, deploy logic
   lives in `*-deploy` crates, topologies are compile-time,
   thiserror over anyhow. CLAUDE.md mirrors the principles.

2. **New `fleet/harmony-fleet-deploy` crate** is the canonical home
   for fleet-component Scores:
   - `FleetOperatorScore` + helm-chart generator + `install_crds`
     moved out of `harmony::modules::fleet::operator` (they should
     never have lived in `harmony` core). `FleetServerScore`
     (composite of NATS + operator + Zitadel + callout) moved too.
   - New `FleetNatsScore` (preset over `NatsHelmChartScore` with
     fleet's required values; v1 supports `UserPass` auth, callout
     mode reserved on the public API for PR 1.5).
   - New `FleetAgentScore` with `FleetAgentTarget::Pod`; `Vm`
     target is a future variant that absorbs `FleetDeviceSetupScore`.
   - `harmony-fleet-deploy` binary built on the existing
     `harmony_cli` crate — no new CLI scaffolding.

3. **Operator runtime binary trimmed**: `Install` and `Chart`
   subcommands removed; both jobs now belong to
   `harmony-fleet-deploy`. The runtime binary becomes leaner.

4. **E2E harness rewritten** as a thin Score composer:
   `harmony-fleet-e2e/src/stack.rs` deploys the stack via
   `FleetNatsScore` + `FleetAgentScore`. The inline NATS manifest
   factory and the bespoke agent Pod renderer are gone.
   - Bring-up runs once per test binary via `shared_stack` +
     `tokio::sync::OnceCell` (matches the `fleet_e2e_demo` pattern).
   - Stale `e2e-*` namespaces from prior runs get pruned at
     startup so the leaks the OnceCell creates don't compound.

5. **`thiserror` for the agent's `CommandServer`** — replaces the
   anyhow-based surface with typed `CommandError` /
   `CommandServerError`.

6. **Memory** captures eight load-bearing principles (saved to
   `~/.claude/projects/.../memory/`) so future sessions don't drift
   back into manifest-handrolling.
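
The once-per-binary bring-up and stale-namespace pruning from item 4 can be sketched as follows. This uses the synchronous `std::sync::OnceLock` as a stand-in for `tokio::sync::OnceCell` so the example needs no dependencies; the `Stack` fields, the fixed namespace name, and `stale_namespaces` are hypothetical. A value stored in a `static` is never dropped, which is presumably the leak the commit message refers to.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for the harness's stack handle (fields are assumed).
#[derive(Debug)]
pub struct Stack {
    pub namespace: String,
}

static BRING_UPS: AtomicUsize = AtomicUsize::new(0);
static SHARED: OnceLock<Stack> = OnceLock::new();

/// Placeholder for the expensive bring-up; the real one deploys Scores.
fn bring_up() -> Stack {
    BRING_UPS.fetch_add(1, Ordering::SeqCst);
    Stack { namespace: "e2e-00000000".to_string() }
}

/// Every test calls this; only the first call pays for bring-up.
pub fn shared_stack() -> &'static Stack {
    SHARED.get_or_init(bring_up)
}

/// How many times bring-up actually ran in this process.
pub fn bring_up_count() -> usize {
    BRING_UPS.load(Ordering::SeqCst)
}

/// Namespaces left over from prior runs: every `e2e-*` name except
/// the one the current run owns.
pub fn stale_namespaces<'a>(all: &[&'a str], current: &str) -> Vec<&'a str> {
    all.iter()
        .copied()
        .filter(|ns| ns.starts_with("e2e-") && *ns != current)
        .collect()
}
```

Because the static's `Drop` never runs, the namespace outlives the test binary; pruning at the next startup is what keeps those leftovers from accumulating.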

Verified: `cargo test -p harmony-fleet-e2e --test ping` green
end-to-end against k3d in 25s warm.
docs(fleet): top-level README; harden e2e namespace prune to wait for NodePort release
Some checks failed
Run Check Script / check (pull_request) Failing after 41s
1b21176215
- Add `fleet/README.md`: overview of the crates, ADR-023 pointer,
  quickstart for the e2e ping test, env knobs (`HARMONY_FLEET_E2E`,
  `FLEET_E2E_KEEP`, `RUST_LOG`), how to connect to NATS from the host
  and in-cluster, how to inspect the agent, the `harmony-fleet-deploy`
  production CLI, the operator dashboard, and the roadmap (Zitadel +
  callout next).
- `prune_stale_namespaces` now polls until each pruned namespace is
  fully gone (up to 90 s). NATS NodePort 30423 is cluster-scoped, so
  a still-`Terminating` namespace from the prior run was blocking the
  new bring-up with "provided port is already allocated".
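
The wait described in the second bullet amounts to a poll loop with a deadline. The sketch below is a synchronous, dependency-free version; `wait_until_gone` and its closure parameter are hypothetical, and the real `prune_stale_namespaces` presumably queries the cluster asynchronously instead.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `still_exists` until it returns false or `timeout` elapses.
/// Returns true if the resource disappeared in time, false on timeout.
pub fn wait_until_gone<F>(mut still_exists: F, timeout: Duration, interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if !still_exists() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

For the case in this commit, the closure would check whether the pruned namespace is still listed (including in `Terminating` state) and the timeout would be 90 seconds; bring-up only proceeds once the function returns true, so the NodePort has been released.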

Verified: e2e ping test green back-to-back after the fix, with a
prior namespace left behind.
johnride changed target branch from feat/iot-walking-skeleton to feat/fleet-operator-web-frontend-maud 2026-05-19 20:27:17 +00:00
johnride reviewed 2026-05-19 21:16:00 +00:00
johnride left a comment
Author
Owner

This is a big set of features. It deserves some work to improve architecture quality, but the ADR-023 and CLAUDE.md additions should help a lot.
@@ -0,0 +53,4 @@
/// Concrete inputs for the `Pod` target.
#[derive(Debug, Clone, Serialize)]
pub struct PodTarget {
Author
Owner

This should be a type that actually lives in the agent crate. Everything is Rust, so we can use the real types from the agent itself to wire its deployment as safely as possible.
@@ -0,0 +228,4 @@
}
s
};
let toml = format!(
Author
Owner

Can't we serialize a type to TOML instead of building a fragile string here?
@@ -0,0 +151,4 @@
// `service.merge.spec.ports` list is required — see that example's
// comment for the upstream chart's `service.ports.<name>.merge`
// quirk that forces this shape.
format!(
Author
Owner

We already do some of this dangerous YAML crafting in the existing NATS Scores in harmony. Can't we leverage proper types to set up the Score instead of this dangerous inner YAML here?
johnride changed target branch from feat/fleet-operator-web-frontend-maud to master 2026-05-20 10:35:10 +00:00
reda added 1 commit 2026-05-20 12:23:59 +00:00
import ChecksumAlgo
Some checks failed
Run Check Script / check (pull_request) Failing after 3m18s
fe801ec6ad
reda added 1 commit 2026-05-20 12:33:04 +00:00
lower debug level for tests
Some checks failed
Run Check Script / check (pull_request) Failing after 1m54s
9596c6e50c
reda added 1 commit 2026-05-20 12:44:25 +00:00
create kubeconfig before creating client
Some checks failed
Run Check Script / check (pull_request) Failing after 1m53s
13cbd6aa89
reda added 1 commit 2026-05-20 12:52:47 +00:00
explicitly create kube config dir
Some checks failed
Run Check Script / check (pull_request) Failing after 34s
d19980505b
reda added 1 commit 2026-05-20 12:54:46 +00:00
format line
Some checks failed
Run Check Script / check (pull_request) Failing after 1m52s
30dfd36bc0
reda added 1 commit 2026-05-20 13:01:23 +00:00
logs for debugging
Some checks failed
Run Check Script / check (pull_request) Failing after 35s
54a264ff59
reda added 1 commit 2026-05-20 13:02:51 +00:00
format code
Some checks failed
Run Check Script / check (pull_request) Failing after 1m52s
00a1a6e6c6
reda added 1 commit 2026-05-20 13:17:10 +00:00
validate k3d is installed
Some checks failed
Run Check Script / check (pull_request) Failing after 1m50s
6bbffbd4dd
reda added 1 commit 2026-05-20 13:20:19 +00:00
install k3d and check other tools
Some checks failed
Run Check Script / check (pull_request) Has been cancelled
598454cdf8
reda added 1 commit 2026-05-20 13:21:35 +00:00
install tools
Some checks failed
Run Check Script / check (pull_request) Failing after 2m0s
32c484d0ad
reda added 2 commits 2026-05-20 13:33:56 +00:00
debug k3d paths
Some checks failed
Run Check Script / check (pull_request) Failing after 1m51s
b33253dfb2
reda added 1 commit 2026-05-20 13:37:55 +00:00
more debugging
Some checks failed
Run Check Script / check (pull_request) Failing after 1m59s
f981fa3d9b
reda added 1 commit 2026-05-20 13:49:25 +00:00
reorder instruction
Some checks failed
Run Check Script / check (pull_request) Failing after 2m7s
29b90d4ec1
reda added 1 commit 2026-05-20 13:59:13 +00:00
log PATH
Some checks failed
Run Check Script / check (pull_request) Failing after 1m54s
77579728ef
reda added 1 commit 2026-05-20 14:07:02 +00:00
add kubectl installation step
Some checks failed
Run Check Script / check (pull_request) Failing after 2m10s
63895a76fb
reda added 1 commit 2026-05-20 14:14:01 +00:00
add step for connectivity check
Some checks failed
Run Check Script / check (pull_request) Failing after 1m54s
7b772552d1
reda added 1 commit 2026-05-20 14:19:48 +00:00
make container share host network
Some checks failed
Run Check Script / check (pull_request) Failing after 1m59s
caf8882964
johnride changed target branch from master to feat/iot-walking-skeleton 2026-05-20 14:52:44 +00:00
johnride closed this pull request 2026-05-22 22:14:18 +00:00
Some checks failed
Run Check Script / check (pull_request) Failing after 1m59s

Pull request closed

No Reviewers
2 Participants

Reference: NationTech/harmony#286