Jean-Gabriel Gill-Couture 76ecf6da42
feat(fleet): agent self-upgrade + auto-rollback protocol, ADR-022 (Ch4)
Full ADR-022 protocol end to end. The state-machine brain and the operator's
commit decision are exhaustively unit-tested; OS side-effects sit behind a seam
so they're faked in tests and real on-device.

Contracts (harmony-reconciler-contracts):
- agent-upgrade marker + status KV buckets, AgentUpgradePhase, agent_version on
  the heartbeat, Verb::UpgradeStop on the command protocol.

Shared (new crate harmony_downloadable_asset):
- download + SHA-256 verify, lifted from k3d's pub(crate) copy; k3d now depends
  on it (DRY — second consumer is the agent). Tested with httptest.

Agent (harmony-fleet-agent):
- `drive`: Staging -> Verifying -> CutoverReady -> wait-for-operator-stop, with
  heartbeat-timeout revert. 6 unit tests incl. every failure/rollback path.
- UpgradeExecutor seam + real SystemdUpgradeExecutor: download+verify,
  `--self-test`, atomic symlink swap, systemd-run transient unit, revert. The
  executor self-heals the on-disk layout so first-upgrade rollback is safe even
  before M1 (preserves the running binary at its versioned path).
- `--self-test` flag; Verb::UpgradeStop handling gated by an armed
  UpgradeStopSignal so only the cutover-waiting old agent acts (both agents are
  subscribed). The agent never self-stops.

Operator (harmony-fleet-operator):
- upgrade_coordinator: sends the stop ONLY after independently observing the new
  version's heartbeat (single source of truth); reflects currentVersion + the
  upgrade phase onto the Device CR. 2 unit tests on the commit decision.
- FleetCommandsClient::upgrade_stop; Device.status.{currentVersion, upgrade}.

Deviations + flagged follow-ups (M1 clean install, libvirt vX->vX+1 e2e) in
ROADMAP/fleet_platform/ch4-agent-upgrade-status.md. Marker/status ride NATS KV
(survive operator restart, per Ch2).
2026-06-05 15:26:38 -04:00

harmony-fleet-operator

IoT operator: reconciles Deployment custom resources into desired state in NATS KV and aggregates device and deployment state back into CR status.

Web frontend (optional)

A small server-side dashboard is built into the operator behind the web-frontend cargo feature. Stack: axum + maud (HTML-in-Rust) + vendored HTMX + Tailwind CSS. No WASM, no cargo-leptos, no JS build toolchain — cargo build --features web-frontend is the whole build.

Why this stack

Every interaction is an HTTP request that returns an HTML fragment, and HTMX swaps it into the DOM. There is no client-side state. The presentation layer is intentionally thin:

async fn devices_handler(State(s): State<AppState>) -> Result<Markup, AppError> {
    let devices = s.fleet.list_devices().await?;
    Ok(page("Devices", s.live_reload, devices_view::page(&devices)))
}

Each handler is extract state → call domain service → render Maud markup. All real work — listing devices, blacklisting, etc. — lives in service::FleetService, a trait the dashboard, tests, and a future CLI all share. Presentation never reaches past that trait.
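The extract → call → render shape can be sketched without axum or Maud. This is a minimal, synchronous, std-only illustration, not the operator's code: `DeviceSummary`, `StubFleet`, and `devices_page` are hypothetical names, the real trait is async, and the real handlers return Maud `Markup` rather than a `String`.

```rust
// Sketch only: a synchronous stand-in for the async axum + Maud handler.
// All type and function names below are hypothetical.

#[derive(Debug, Clone, PartialEq)]
pub struct DeviceSummary {
    pub id: String,
    pub online: bool,
}

/// The domain boundary: presentation code only ever sees this trait.
pub trait FleetService {
    fn list_devices(&self) -> Result<Vec<DeviceSummary>, String>;
}

/// Fixed-data implementation, standing in for a mock or a real backend.
pub struct StubFleet(pub Vec<DeviceSummary>);

impl FleetService for StubFleet {
    fn list_devices(&self) -> Result<Vec<DeviceSummary>, String> {
        Ok(self.0.clone())
    }
}

/// Handler shape: take the service, call it, render markup.
/// No HTML escaping is done here; Maud escapes interpolated values by default.
pub fn devices_page(fleet: &dyn FleetService) -> Result<String, String> {
    let devices = fleet.list_devices()?;
    let mut html = String::from("<ul>");
    for d in &devices {
        let state = if d.online { "online" } else { "offline" };
        html.push_str(&format!("<li>{}: {}</li>", d.id, state));
    }
    html.push_str("</ul>");
    Ok(html)
}
```

Because `devices_page` depends only on the trait, a test or a CLI can call it with any `FleetService` implementation, which is the property the paragraph above relies on.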

Why Maud instead of Leptos? The dashboard is pure SSR + HTMX and uses none of Leptos's reactivity, so Leptos's runtime and macro footprint was dead weight. Maud is a compile-time HTML macro that produces a Markup value: smaller dep tree, faster compiles, same Rust-flavored ergonomics.

Why HTMX + xterm.js for interactivity? A real terminal needs xterm.js in the browser regardless; once that JS exists, HTMX (~14 KB) is a rounding error and lets every other interaction stay declarative in markup (hx-post, hx-target, hx-swap).
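To make "declarative in markup" concrete, here is a rough sketch of a row fragment carrying the three HTMX attributes named above. It builds the HTML with `format!` instead of Maud so it stays std-only. The function name `device_row`, the `/devices/{id}/blacklist` route, and the `device-{id}` element id are assumptions for illustration, not the operator's actual routes or markup.

```rust
/// Minimal HTML escaping for text and attribute values.
fn escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Render one table row. The button POSTs to a (hypothetical) endpoint and
/// HTMX replaces the whole row with the HTML fragment the server returns.
pub fn device_row(id: &str, status: &str) -> String {
    format!(
        "<tr id=\"device-{id}\"><td>{id}</td><td>{status}</td>\
         <td><button hx-post=\"/devices/{id}/blacklist\" \
         hx-target=\"#device-{id}\" hx-swap=\"outerHTML\">Blacklist</button></td></tr>",
        id = escape(id),
        status = escape(status),
    )
}
```

With `hx-swap="outerHTML"`, the server's response would be another `<tr>` for the same device, so the client needs no state of its own to show the new status.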

Why everything bundled? The operator already ships as a single container. Tailwind CSS, HTMX, and the HTMX SSE extension are all embedded via include_bytes!, so air-gapped clusters get the dashboard with nothing extra to mount. The only build-time external is the standalone tailwindcss v4 CLI. If the CLI is missing, the build degrades gracefully (a warning plus empty embedded CSS), and the dev workflow uses --css-from instead anyway.
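The graceful degradation could look roughly like the helper below, as it might be called from a build script. This is a sketch under assumptions: the operator's actual build.rs is not shown here, and `build_css` is a hypothetical name. The `-i`, `-o`, and `--minify` flags are the ones used in this README's own commands.

```rust
use std::fs;
use std::path::Path;
use std::process::Command;

/// Try to generate CSS with the given CLI. On any failure, warn and write an
/// empty file so a later `include_bytes!` of `output` still compiles.
/// Returns Ok(true) only if the CLI ran and exited successfully.
pub fn build_css(cli: &str, input: &Path, output: &Path) -> std::io::Result<bool> {
    if let Some(dir) = output.parent() {
        fs::create_dir_all(dir)?;
    }
    let ran_ok = Command::new(cli)
        .arg("-i")
        .arg(input)
        .arg("-o")
        .arg(output)
        .arg("--minify")
        .status()
        .map(|s| s.success())
        .unwrap_or(false); // a missing binary surfaces as Err from status()
    if !ran_ok {
        // Cargo reports `cargo:warning=` lines printed by build scripts.
        println!("cargo:warning={cli} unavailable or failed; embedding empty CSS");
        fs::write(output, b"")?;
    }
    Ok(ran_ok)
}
```

Writing an empty file rather than failing keeps `cargo build` working on machines without the CLI, at the cost of an unstyled dashboard unless `--css-from` is passed at runtime.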

Running it locally (mock data, no NATS, no kube)

# One-time: install the standalone Tailwind v4 CLI (single static binary).
mkdir -p ~/.local/bin   # make sure the target directory exists (and is on PATH)
curl -fL -o ~/.local/bin/tailwindcss \
  https://github.com/tailwindlabs/tailwindcss/releases/latest/download/tailwindcss-linux-x64
chmod +x ~/.local/bin/tailwindcss

Two terminals for the dev loop:

# Terminal 1 — Tailwind sidecar, regenerates CSS on every class change.
tailwindcss \
  -i fleet/harmony-fleet-operator/style/input.css \
  -o fleet/harmony-fleet-operator/style/dist/tailwind.css \
  --watch

# Terminal 2 — the operator, serving the dashboard against fake data and
# reading CSS from Tailwind's output. `--live-reload` reloads the browser
# tab whenever you restart the server.
cargo run -p harmony-fleet-operator --features web-frontend -- serve-web \
  --mock \
  --css-from fleet/harmony-fleet-operator/style/dist/tailwind.css \
  --live-reload

Open http://localhost:18080.

--mock uses MockFleetService, an in-memory seeded dataset (10 fake devices in mixed states, 4 deployments). You can click "Blacklist" on a row and the row will swap in place to reflect the new status — this exercises the same FleetService API the real impl will satisfy. No NATS, no Kubernetes cluster needed.
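A seeded in-memory service with a blacklist operation might be structured as follows. This is a hedged, synchronous sketch: the trait methods, the `Device` and `DeviceStatus` types, and the seeding rule are all assumptions chosen for illustration, and only the names `FleetService` and `MockFleetService` come from the text above.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
pub enum DeviceStatus {
    Online,
    Offline,
    Blacklisted,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Device {
    pub id: String,
    pub status: DeviceStatus,
}

/// Hypothetical shape of the shared domain trait (the real one is async).
pub trait FleetService {
    fn list_devices(&self) -> Vec<Device>;
    /// Returns the updated device so the caller can re-render just that row.
    fn blacklist(&self, id: &str) -> Option<Device>;
}

pub struct MockFleetService {
    devices: Mutex<Vec<Device>>,
}

impl MockFleetService {
    /// Ten fake devices in mixed states: every third one starts offline.
    pub fn seeded() -> Self {
        let devices = (1..=10)
            .map(|n| Device {
                id: format!("device-{n:02}"),
                status: if n % 3 == 0 { DeviceStatus::Offline } else { DeviceStatus::Online },
            })
            .collect();
        Self { devices: Mutex::new(devices) }
    }
}

impl FleetService for MockFleetService {
    fn list_devices(&self) -> Vec<Device> {
        self.devices.lock().unwrap().clone()
    }

    fn blacklist(&self, id: &str) -> Option<Device> {
        let mut devices = self.devices.lock().unwrap();
        let device = devices.iter_mut().find(|d| d.id == id)?;
        device.status = DeviceStatus::Blacklisted;
        Some(device.clone())
    }
}
```

Returning the updated device from `blacklist` is one way to support the in-place row swap: the handler can render that single device and send the fragment back.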

Iteration cost

| Change | Reload step |
| --- | --- |
| Tailwind class in a Maud template | edit → save → refresh tab (Tailwind sidecar already rebuilt CSS; no Rust compile) |
| Maud template structure / handler logic | edit → cargo run restarts → --live-reload auto-refreshes |
| FleetService types | edit → cargo run restarts → tab auto-refreshes |

The Rust recompile is the floor on iteration time for everything except styling; Tailwind changes never trigger one.

Production builds

# Once, before cargo build: produce the embedded CSS.
tailwindcss \
  -i fleet/harmony-fleet-operator/style/input.css \
  -o fleet/harmony-fleet-operator/style/dist/tailwind.css \
  --minify

cargo build -p harmony-fleet-operator --features web-frontend --release

The release binary serves the embedded CSS unless you pass --css-from at runtime. (build.rs will also run tailwindcss if it's on PATH; the manual step above is just a guarantee that the embedded copy is correct.)
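The embedded-versus-override choice can be sketched as a single function. This is an illustration under assumptions, not the contents of assets.rs: `load_css` is a hypothetical name, and `EMBEDDED_CSS` is a byte-string stand-in for what would be an `include_bytes!` of the generated stylesheet in the real binary.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Stand-in for an `include_bytes!` of the generated Tailwind output.
// A literal is used here so the sketch compiles without that file.
const EMBEDDED_CSS: &[u8] = b"/* embedded */";

/// Pick the stylesheet bytes to serve: the file named by `--css-from` if one
/// was given, otherwise the copy compiled into the binary.
pub fn load_css(css_from: Option<&Path>) -> io::Result<Vec<u8>> {
    match css_from {
        // Reading at call time means a watcher's rewritten file is picked up
        // without restarting the server.
        Some(path) => fs::read(path),
        None => Ok(EMBEDDED_CSS.to_vec()),
    }
}
```

Propagating the read error, instead of silently falling back to the embedded copy, is a deliberate choice in this sketch: a mistyped `--css-from` path then fails visibly.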

Layout

fleet/harmony-fleet-operator/
├── src/
│   ├── service/             ← domain abstraction (FleetService trait + Mock)
│   │   ├── mod.rs           ← trait + summary types
│   │   └── mock.rs          ← in-memory seeded data
│   └── frontend/            ← presentation layer (cfg web-frontend)
│       ├── server.rs        ← axum router + handlers
│       ├── layout.rs        ← page shell (Maud)
│       ├── assets.rs        ← embedded Tailwind/HTMX bytes
│       └── views/
│           ├── dashboard.rs
│           ├── devices.rs   ← also exposes `row()` for HTMX swaps
│           └── deployments.rs
├── style/
│   └── input.css            ← Tailwind v4 entry point
└── vendor/
    ├── htmx.min.js          ← HTMX v2.0.9
    └── htmx-ext-sse.js      ← SSE extension (used by future log-tail views)

What's deferred

  • Real FleetService impl (wraps the kube client + NATS KV the reconcilers already use). serve-web without --mock currently errors out.
  • Zitadel SSO + admin-role check. v1 assumes an oauth2-proxy fronts the dashboard at the cluster edge.
  • Live log tail (SSE-based, HTMX sse-swap) — the wiring is in place.
  • Interactive shell (xterm.js + axum WS + portable-pty) — separate design.