feat: scaffold IoT walking skeleton — podman module, operator, and agent #264

Merged

johnride merged 210 commits from feat/iot-walking-skeleton into master

2026-05-22 22:16:18 +00:00

Author	SHA1	Message	Date
Jean-Gabriel Gill-Couture	46ceb6b493	chore(fleet): drop pre_merge_checklist.md — served its purpose Branch is ready to merge; the checklist was working scaffolding for that. Remaining deferred items (CI image libvirt-dev, smoke-test contract, bash → Rust smoke migration, ignored-test CI runner, ADR-024) live in the merge commit body and should be tracked as real issues from there.	2026-05-22 18:11:27 -04:00
Jean-Gabriel Gill-Couture	a4b3d18bd6	refactor(fleet): drop deploy-crate dev creds, HARMONY_* env vars, lean docs Caller must pass `UserPassCredentials` to `FleetNatsScore::user_pass` — no more `e2e-admin`/`e2e-device` defaults shipped in the library. The deploy binary reads `HARMONY_FLEET_*` env vars (default namespace `harmony-fleet-system`) and fails fast when NATS creds aren't set. Also: `style/dist/` gitignored, `manual_mint/mint.py` moved next to `nats/callout/` with README + secrets gitignore (the real RSA key that was sitting untracked has been removed), `architecture_review.md` moved to `docs/adr/drafts/024-`, three low-value ROADMAP docs deleted. Updates pre-merge checklist (§1.6, §1.8, §3.1, §5).	2026-05-22 17:54:48 -04:00
Jean-Gabriel Gill-Couture	b37c76d0a5	chore: fix fmt	2026-05-22 17:04:49 -04:00
Reda Tarzalt	f273d07657	add test for operator and update README	2026-05-22 14:57:45 -04:00
Reda Tarzalt	1bedbd0f62	resolve conflicts	2026-05-22 14:07:52 -04:00
Reda Tarzalt	5c39d048ab	add tests for finalizer and kv bucket	2026-05-22 14:05:07 -04:00
Jean-Gabriel Gill-Couture	433a66dac2	feat: fleet e2e x86 support	2026-05-22 12:39:43 -04:00
Jean-Gabriel Gill-Couture	ba685baddb	doc: fleet e2e x86 arch support	2026-05-20 22:47:52 -04:00
Jean-Gabriel Gill-Couture	8e6e1fa1bc	feat: fleet e2e x86 vm support	2026-05-20 21:49:51 -04:00
Jean-Gabriel Gill-Couture	cdebeb8a9f	feat: Fleet E2E tests harness improving a lot, firing up a VM and testing agent behavior	2026-05-20 21:42:47 -04:00
Jean-Gabriel Gill-Couture	20b14c4648	doc: fleet/ARCHITECTURE.html overview of the whole fleet plumbing, components, flows, code cheatsheet	2026-05-20 15:45:51 -04:00
Jean-Gabriel Gill-Couture	34807511b4	feat: refactor fleet agent config into a strongly typed struct, remove brittle string processing	2026-05-20 13:41:40 -04:00
Jean-Gabriel Gill-Couture	4e80101f26	fix: interactive test now has injected mock data	2026-05-20 12:21:34 -04:00
Jean-Gabriel Gill-Couture	7ab15415c7	chore: Fix clippy and ci lints, cleanup docs a bit, rewrite adr 023 with better language, etc.	2026-05-20 12:03:19 -04:00
Jean-Gabriel Gill-Couture	fdd8383caa	Merge commit '1b211762' into feat/iot-walking-skeleton	2026-05-20 11:00:27 -04:00
Jean-Gabriel Gill-Couture	b72ac7c99d	chore: Move claude.md to agents.md and symlink back	2026-05-20 10:19:41 -04:00
johnride	7620077b9b	Merge pull request 'feat/fleet-operator-web-frontend-maud' (#283) from feat/fleet-operator-web-frontend-maud into feat/iot-walking-skeleton Reviewed-on: #283	2026-05-20 10:35:06 +00:00
Reda Tarzalt	96e7d43b2f	add auth to frontend through lib (#284) Adds OIDC login support to the harmony-fleet-operator web dashboard using Zitadel SSO. PKCE was the recommended option since the app does not need to hold a client secret. The server generates a random verifier and sends only its hash (the code challenge) to Zitadel; at token exchange Zitadel recomputes the hash from the verifier and compares the two values. PKCE auth flow 1. User visits a protected dashboard route, like /devices. 2. If no valid harmony_fleet_session cookie exists, the app redirects to /login. 3. /login creates: - random state - random pkce_code_verifier - derived code_challenge = base64url(sha256(pkce_code_verifier)) 4. The app stores state and pkce_code_verifier in a temporary HTTP-only login-attempt cookie. 5. The browser is redirected to Zitadel’s authorize endpoint with: - client_id - redirect_uri - scope - state - code_challenge - code_challenge_method=S256 6. After SSO login, Zitadel redirects back to /auth/callback?code=...&state=.... 7. The callback handler: - parses the raw query into a strict success/failure enum - reads the temporary login-attempt cookie - validates returned state - exchanges code + pkce_code_verifier for tokens - validates the returned ID token using OIDC discovery/JWKS - creates a local harmony_fleet_session cookie - redirects to / 8. Protected routes validate the local dashboard session cookie on each request. 9. /logout clears the dashboard session cookie and redirects to /login. 
--- Auth middleware responses depending on request type: - normal browser request: redirect to /login - SSE request: 401 authentication required - HTMX request: 401 with HX-Redirect: /login (HTMX redirect is more idiomatic than through Axum for this) Reviewed-on: #284 Reviewed-by: johnride <jg@nationtech.io> Co-authored-by: Reda Tarzalt <tarzaltreda@gmail.com> Co-committed-by: Reda Tarzalt <tarzaltreda@gmail.com>	2026-05-19 20:37:08 +00:00
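The three unauthenticated responses listed in the commit above can be sketched as a small decision table. This is a minimal illustration, not the dashboard's code: the type and function names are hypothetical, and the detection rules (htmx's `HX-Request: true` request header, an `Accept: text/event-stream` header for SSE, and the order in which they are checked) are assumptions about how the request kinds might be told apart.

```rust
/// Kind of request hitting a protected route (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum RequestKind {
    Browser,
    Sse,
    Htmx,
}

/// What the middleware answers when no valid session cookie is present.
#[derive(Debug, PartialEq, Eq)]
pub enum Unauthenticated {
    /// Plain HTTP redirect to the login page.
    Redirect { location: &'static str },
    /// 401, optionally carrying an `HX-Redirect` header for htmx clients.
    Unauthorized { hx_redirect: Option<&'static str> },
}

/// Classify a request from two header values.
/// Assumption: htmx is checked first, then SSE, else a normal browser request.
pub fn classify(accept: Option<&str>, hx_request: Option<&str>) -> RequestKind {
    if hx_request == Some("true") {
        RequestKind::Htmx
    } else if accept.map_or(false, |a| a.contains("text/event-stream")) {
        RequestKind::Sse
    } else {
        RequestKind::Browser
    }
}

/// Map each request kind to the response described in the commit message.
pub fn unauthenticated_response(kind: RequestKind) -> Unauthenticated {
    match kind {
        RequestKind::Browser => Unauthenticated::Redirect { location: "/login" },
        RequestKind::Sse => Unauthenticated::Unauthorized { hx_redirect: None },
        RequestKind::Htmx => Unauthenticated::Unauthorized { hx_redirect: Some("/login") },
    }
}
```

Keeping the classification separate from the response mapping means the mapping stays a pure function that can be unit-tested without an HTTP stack.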
Jean-Gabriel Gill-Couture	1b21176215	docs(fleet): top-level README; harden e2e namespace prune to wait for NodePort release - Add `fleet/README.md`: overview of the crates, ADR-023 pointer, quickstart for the e2e ping test, env knobs (`HARMONY_FLEET_E2E`, `FLEET_E2E_KEEP`, `RUST_LOG`), how to connect to NATS from the host and in-cluster, how to inspect the agent, the `harmony-fleet-deploy` production CLI, the operator dashboard, and the roadmap (Zitadel + callout next). - `prune_stale_namespaces` now polls until each pruned namespace is fully gone (up to 90 s). NATS NodePort 30423 is cluster-scoped, so a still-`Terminating` namespace from the prior run was blocking the new bring-up with "provided port is already allocated". Verified: e2e ping test green back-to-back after the fix, with a prior namespace left behind.	2026-05-18 23:12:01 -04:00
Jean-Gabriel Gill-Couture	020ebcb1f9	refactor(fleet): deploy-architecture cleanup per ADR-023 — Scores everywhere, deploy crate, principles in CLAUDE.md The previous e2e harness handrolled k8s manifests in `stack.rs`, bypassing the Score-Topology-Interpret machinery harmony exists to provide. This commit: 1. ADR-023 codifies the rules: deploy with Scores (not manifests), e2e uses the same Scores as production, one Score per component, deploy blocks on smoke-test success, deploy logic lives in `-deploy` crates, topologies are compile-time, thiserror over anyhow. CLAUDE.md mirrors the principles. 2. New `fleet/harmony-fleet-deploy` crate is the canonical home for fleet-component Scores: - `FleetOperatorScore` + helm-chart generator + `install_crds` moved out of `harmony::modules::fleet::operator` (they should never have lived in `harmony` core). `FleetServerScore` (composite of NATS + operator + Zitadel + callout) moved too. - New `FleetNatsScore` (preset over `NatsHelmChartScore` with fleet's required values; v1 supports `UserPass` auth, callout mode reserved on the public API for PR 1.5). - New `FleetAgentScore` with `FleetAgentTarget::Pod`; `Vm` target is a future variant that absorbs `FleetDeviceSetupScore`. - `harmony-fleet-deploy` binary built on the existing `harmony_cli` crate — no new CLI scaffolding. 3. Operator runtime binary trimmed: `Install` and `Chart` subcommands removed; both jobs now belong to `harmony-fleet-deploy`. The runtime binary becomes leaner. 4. E2E harness rewritten as a thin Score composer: `harmony-fleet-e2e/src/stack.rs` deploys the stack via `FleetNatsScore` + `FleetAgentScore`. The inline NATS manifest factory and the bespoke agent Pod renderer are gone. - Bring-up runs once per test binary via `shared_stack` + `tokio::sync::OnceCell` (matches the `fleet_e2e_demo` pattern). - Stale `e2e-` namespaces from prior runs get pruned at startup so the leaks the OnceCell creates don't compound. 5. 
`thiserror` for the agent's `CommandServer` — replaces the anyhow-based surface with typed `CommandError` / `CommandServerError`. 6. Memory captures eight load-bearing principles (saved to `~/.claude/projects/.../memory/`) so future sessions don't drift back into manifest-handrolling. Verified: `cargo test -p harmony-fleet-e2e --test ping` green end-to-end against k3d in 25s warm.	2026-05-18 22:54:50 -04:00
Jean-Gabriel Gill-Couture	d013246a68	feat(fleet): request/reply commands over NATS — wire types, agent server, operator client, e2e harness First slice of the device-commands.* protocol from fleet/requests_over_nats.md. Lands `Verb::Ping` plus the harness that proves it works against a real in-cluster agent. Wire types (`harmony-reconciler-contracts::commands`): - `Verb::Ping`, `CommandRequest`, `PingReply`, `ErrorReply`/`ErrorKind` - `device_command_subject` / `device_command_subscription` helpers - `X-Harmony-*` header constants Agent: - `command_server.rs` subscribes on `device-commands.<id>.>` and dispatches verbs; ping handler replies with `PingReply` - New `[agent].runtime_enabled` config flag (default true). When false, podman init + reconciler loop are skipped so the agent can run as a Pod on containerd-only k3d nodes; command server + heartbeat still run - `Dockerfile`: canonical multi-stage build for production registries Operator: - `commands::FleetCommandsClient` with typed `CommandError` (`DeviceOffline` via `no_responders`, `Timeout`, `BadReply`, `Nats`) E2E harness (`harmony-fleet-e2e`): - Library crate + integration test. `Stack::bring_up` provisions a fresh `e2e-<uuid8>` namespace in a shared `fleet-e2e` k3d cluster, deploys NATS (UserPass auth, JetStream on) + the agent Pod, returns a connected admin NATS client, and tears the namespace down on Drop - v1 ships `AuthMode::UserPass` only; the `Callout` variant is reserved on the public API for the follow-up PR that adds the mock OIDC fixture + NatsAuthCalloutScore deployment - Operator pod deployment is also follow-up — for ping the test process drives `FleetCommandsClient` directly against the cluster's NATS NodePort - `HARMONY_FLEET_E2E=1` gates the integration test so default `cargo test --workspace` runs don't depend on k3d/podman - Image build + sideload mirrors the `fleet_auth_callout` pattern: host `cargo build --release` → single-stage Dockerfile → `podman build` → `k3d image import`. 
~12s warm bring-up, ~80s cold	2026-05-18 09:47:36 -04:00
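The subject layout used by the command protocol above can be illustrated with two string helpers and a simplified matcher. This is a sketch under assumptions: `subscription_for`, `subject_for`, and `subscription_matches` are hypothetical stand-ins (the real `device_command_subject` / `device_command_subscription` signatures are not shown in the commit), the `ping` verb token is a guess, and the matcher only models a trailing NATS `>` wildcard.

```rust
/// Subscription an agent would listen on: everything under its own device id.
/// The `device-commands.<id>.>` shape comes from the commit message.
pub fn subscription_for(device_id: &str) -> String {
    format!("device-commands.{}.>", device_id)
}

/// Subject a client would publish a request to.
/// Assumption: the verb is the token right after the device id.
pub fn subject_for(device_id: &str, verb: &str) -> String {
    format!("device-commands.{}.{}", device_id, verb)
}

/// Simplified matcher: supports only a trailing `>` wildcard, which in NATS
/// matches one or more remaining tokens. `*` wildcards are not handled here.
pub fn subscription_matches(subscription: &str, subject: &str) -> bool {
    match subscription.strip_suffix(".>") {
        Some(prefix) => subject
            .strip_prefix(prefix)
            .and_then(|rest| rest.strip_prefix('.'))
            .map_or(false, |rest| !rest.is_empty()),
        None => subscription == subject,
    }
}
```

With this shape, a request addressed to one device is never delivered to another device's agent, because the device id token has to match exactly.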
Jean-Gabriel Gill-Couture	ee95a5d1a3	feat: maud + htmx + tailwindcss frontend for fleet operator, initial commit, still much work to do	2026-05-11 22:43:56 -04:00
Jean-Gabriel Gill-Couture	b163656859	chore: cargo fmt	2026-05-11 16:48:52 -04:00
Jean-Gabriel Gill-Couture	616c05d5a4	docs: fleet architecture review — inventory, principles, alternatives Working document for the architectural redesign of the fleet platform before v0.1 ships to production. Captures four sections of research: §1 — Current state inventory. Markdown-bullet map of every public type, score, trait, and module across `harmony/modules/fleet/`, `harmony-reconciler-contracts`, and `fleet/harmony-fleet-/`. Sorted by domain meaning (identity, desired state, observed state, setup, plumbing) rather than location, so the cross-cutting concerns become visible. Includes a text "diagram" of the dependency graph showing the two problematic edges: runtime crates importing CRD types from the framework crate (`harmony-fleet-operator` ← `harmony::modules::fleet::operator::crd` verified at `controller.rs:37`, `device_reconciler.rs:21`, `main.rs:9`) and the agent importing podman wire types from the framework crate (`harmony-fleet-agent` ← `harmony::modules::podman` verified at `main.rs:21-22`, `reconciler.rs:11`). §2 — Theory review. Pulls principles from JG's Pour l'amour des compilateurs talk (2026-04-30), its references (Crichton, Feldman, Maguire, Goedecke, Fowler), and harmony's own load-bearing ADRs (002 hexagonal, 003 infrastructure abstractions, 015 higher-order topologies, 016 agent + global mesh, 018 template hydration). Synthesizes eight design principles for the redesign — including Goedecke's guardrail that "type-driven" ≠ "type-everything" so we don't over-fit the cardinality argument. §3 — Ten concrete shape problems (P1–P10), framed as cardinality mismatches, leaky boundaries, and "is this resolved yet" branches rather than bugs. P1 is the placement issue JG flagged in code review; P2 is `FleetDeviceAuth`'s mixed resolved/unresolved states; P10 is the credential-shape staircase across operator workstation / operator pod / agent. 
§4 — Five design alternatives, each scored against P1–P10: A. Move + thin façade (conservative cleanup). B. Resolved-only at boundaries + capability traits (principled incremental). C. Dataflow reframe (events in, state out). D. Fleet as kube control plane, period (deliberately weird). E. Algebra of fleets (deliberately mathematical). A is too little, C/D/E are right-shape but wrong-timing for the 3-day window. B is the working recommendation, with explicit awareness that D is the v2.0 destination and the capability traits in B are the seam that lets us migrate without breaking callers. §5 sketches a concrete shape for B: new `harmony-fleet/` domain crate with no framework dependency, `harmony-fleet-adapters-*` crates for NATS/Zitadel/kube, the existing operator/agent/auth crates wire adapters together, the framework's `harmony::modules::fleet` collapses to a re-export module that goes away by v0.2. §6 — Five open questions for JG's review before locking the choice. §7 — explicit "spike one slice, then commit or back out" process so we don't lock the wrong shape. Not an ADR yet. The ADR happens after JG agrees on which alternative is the working hypothesis and the spike confirms the shape feels better in code than on paper.	2026-05-07 05:20:25 -04:00
Jean-Gabriel Gill-Couture	c926ff3c4b	chore: warning sweep — manual cleanup of remaining 105 → 0 Picks up where the auto-fix pass left off. Workspace warning count goes from 105 to 0 across `cargo build --workspace --all-targets`. Three categories of fixes: 1. Mechanical fixes the auto-pass couldn't handle (unused imports inside braced multi-name `use` statements, unused variables that needed an underscore prefix without breaking other references): batched via a small Python script, then 6 manual edits where the warning location and the actual identifier were on different lines. 2. Dead-code that's intentionally kept around for future wiring or debug visibility — `#[allow(dead_code)]` at the right scope: - 19 individual items (struct fields, methods, free functions, type aliases, enum variants), e.g. `default_namespace` / `default_cluster_issuer` in zitadel/mod.rs (used via serde defaults, opaque to rustc), `score` fields on the OKD bootstrap interpret types, `crd_exists` methods on the prometheus alerting scores, the `harmony_inventory_agent::local_presence::{DiscoveryEvent, discover_agents}` re-exports. - 5 module-level allows for files where most items are aspirational scaffolding (harmony_agent's replica workflow, opnsense-config dnsmasq, three opnsense-api examples). 3. Special cases that needed real fixes, not allows: - `opnsense-config-xml/src/data/haproxy.rs`: deprecated `rand::thread_rng` / `Rng::gen` updated to `rng()` / `random`. - `harmony_secret/src/lib.rs`: the `secrete2etest` integration test gate is now declared in Cargo.toml's `[lints.rust] unexpected_cfgs.check-cfg`; the gated test module is structured so its dead `TestSecret`/`TestUserMeta` types come along for the cfg ride and don't show up as unconditional dead code. - `harmony/src/modules/nats/score_nats_k8s.rs:241`: `K8sIngressScore { name: todo!(), ... }`'s unreachable expression annotated. 
- `harmony/src/domain/topology/k8s_anywhere/k8s_anywhere.rs:982`: wrap the dead-after-`return Ok(Noop)` branch in `#[allow(unreachable_code)] { ... }`. Behavior unchanged. - `examples/try_rust_webapp/Cargo.toml`: `autobins = false` so `src/main.rs` isn't auto-registered as both bin AND example. All 16 lib-test suites pass: 437 tests, 0 failed, 13 ignored. Ready for `-Dwarnings` in CI as a follow-up — the gate makes sense once we're sure no contributor's local builds slip warnings back in.	2026-05-06 23:09:12 -04:00
Jean-Gabriel Gill-Couture	50f62b6437	chore: warning sweep — auto-fix pass + scoped allows for generated code Workspace warning count: 408 → 105. Three buckets cleared: * Auto-fixable (`cargo fix` + `cargo clippy --fix`): unused imports removed, unused variables prefixed with `_`, deprecated method calls updated. Applied across harmony, harmony-k8s, harmony-agent, harmony_inventory_agent, the fleet/ workspace, and ~15 examples. * Generated code (opnsense-api/src/generated/): 269 snake_case warnings + ~10 unreachable-pattern warnings come from CamelCase-preserving bindings to OPNsense's HAProxy/Caddy XML schemas. Scoped a single `#[allow(non_snake_case, unreachable_patterns)]` at `pub mod generated;` rather than fighting the codegen — renaming would break serde round-trips and the codegen would regenerate them anyway. * opnsense-codegen parser's defensive `let...else` guards on `XmlNode` (currently single-variant): file-level `#![allow(irrefutable_let_patterns)]` with a comment explaining why we keep the `else` arms (they re-arm if the IR grows a second variant). `harmony_inventory_agent::local_presence::{DiscoveryEvent, discover_agents}` re-exports were stripped twice by the auto-fix passes (consumers live in another crate, so the local crate looks "unused" to lint). Anchored with explicit `pub use` + an `#[allow(unused_imports)]` annotation noting why. All 151 harmony lib tests still pass. Remaining ~105 warnings are mostly real dead code in non-fleet modules + a handful of unused-imports/variables clippy couldn't auto-resolve; cleared in the next pass.	2026-05-06 22:51:44 -04:00
Jean-Gabriel Gill-Couture	064fa1da0d	docs: v0.2 roadmap + ADR-022 fleet agent upgrade procedure Two design documents framing the next push. `ROADMAP/fleet_platform/v0_2_plan.md` — three-day production push. Replaces the open-ended chapter structure of v0_1_plan.md for the period between the walking-skeleton merge and v0.1.0 in production. Focus is locking the fleet module's public API surface so the inevitable physical refactor (out of `harmony/modules/fleet/`, into `fleet/harmony-fleet/`) is mechanical when we get to it. Anchored in the principle from JG's Pour l'amour des compilateurs talk: design the brick before moving the brick. `docs/adr/022-fleet-agent-upgrade.md` — agent upgrade procedure. K8s rolling-update shape applied to one host: drain in-flight work, stage versioned binary alongside old, smoke-test, atomic symlink swap, both agents alive briefly, operator verifies new agent's heartbeat then sends explicit stop signal to old, old exits cleanly. No version is ever erased — N-history on disk is the rollback target. Operator-driven cutover (not self-stopping) so the most-trusted side decides the handoff. Implementation deferred to post-v0.1 backlog; spec exists so anyone can build it without reinventing the design. ADR README index updated.	2026-05-06 22:51:14 -04:00
Jean-Gabriel Gill-Couture	6cbecee6e1	feat(fleet-device-enroll): require + validate `--device-id` (RFC1123) The auto-generated `Id::default()` shape (`fb5310_Qm2kPoQ`) contains underscores and uppercase, so once the agent published its DeviceInfo and the operator tried to upsert a Device CR using `device_id` as `metadata.name`, kube rejected it: ApiError: Device.fleet.nationtech.io "fb5310_Qm2kPoQ" is invalid: metadata.name: Invalid value ... must consist of lower case alphanumeric characters, '-' ... Failing at operator-reconcile time is bad UX: the Zitadel machine user is already provisioned, the agent is already running, and the auth callout's per-device permissions are already templated to a device_id the kube layer will never accept. Re-enrolling requires manually deleting state in three places. Makes `--device-id` required and validates it against RFC1123 DNS subdomain rules upfront, before any Zitadel call: * non-empty, ≤253 chars total * dot-separated labels, each 1-63 chars, lowercase a-z + 0-9 + `-` * labels must start AND end with an alphanumeric Stricter than just "kube name valid" because the same id flows into NATS subjects (auth callout's permission templates) — `_`/uppercase silently passes NATS auth but breaks the kube path. Rejecting at the CLI is the only failure point that catches both layers in one place. 8 unit tests cover the accept set + every reject path (underscore — the regression that triggered this — uppercase, leading/trailing dash, empty, consecutive dots, label too long, total too long). CLI banner + README updated. The `Id::default()` fallback path is removed entirely; no backward compat with the old auto-generated shape (the user explicitly opted out — anything that ran before now needs re-enrollment with an explicit id).	2026-05-06 13:43:11 -04:00
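The RFC1123 rules listed in the commit above translate fairly directly into a validator. The following is a minimal sketch of those rules, not the code that shipped: `DeviceIdError` and `validate_device_id` are hypothetical names, and the error variants are chosen only to make each reject path visible.

```rust
/// Reasons a device id can be rejected (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceIdError {
    Empty,
    TooLong,
    EmptyLabel,
    LabelTooLong,
    InvalidChar(char),
    LabelEdgeNotAlphanumeric,
}

/// Validate an id against the RFC1123 DNS subdomain rules from the commit:
/// non-empty, at most 253 chars, dot-separated labels of 1-63 chars,
/// only lowercase a-z, 0-9 and '-', each label starting and ending alphanumeric.
pub fn validate_device_id(id: &str) -> Result<(), DeviceIdError> {
    if id.is_empty() {
        return Err(DeviceIdError::Empty);
    }
    if id.len() > 253 {
        return Err(DeviceIdError::TooLong);
    }
    for label in id.split('.') {
        if label.is_empty() {
            // Covers leading, trailing, and consecutive dots.
            return Err(DeviceIdError::EmptyLabel);
        }
        if label.len() > 63 {
            return Err(DeviceIdError::LabelTooLong);
        }
        let bad = label
            .chars()
            .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'));
        if let Some(c) = bad {
            return Err(DeviceIdError::InvalidChar(c));
        }
        // All chars are ASCII at this point, so byte indexing is safe.
        let bytes = label.as_bytes();
        if bytes[0] == b'-' || bytes[bytes.len() - 1] == b'-' {
            return Err(DeviceIdError::LabelEdgeNotAlphanumeric);
        }
    }
    Ok(())
}
```

Under these rules the auto-generated id from the commit message, `fb5310_Qm2kPoQ`, is rejected at its first offending character (the underscore), while an id such as `pi-01` passes.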
Jean-Gabriel Gill-Couture	af06177502	fix(deps): enable async-nats `websockets` feature for wss:// support The fleet agent connects to NATS via the OKD edge-TLS Route at `wss://nats-fleet-stg.cb1.nationtech.io`. Without the `websockets` feature on async-nats, the connector parses the URL but doesn't know how to do the HTTP Upgrade — it opens a raw TCP socket to port 443 and sits waiting for NATS's plaintext `INFO` frame, which never comes (the OKD router speaks TLS+HTTPS, not raw NATS). 30s later: ERROR async_nats::connector: expected INFO, got nothing Error: Nats connection FAILED : IO error: expected INFO, got nothing …and systemd restart-loops forever. `websockets` isn't in async-nats 0.45's default feature set; the crate's own Cargo.toml lists it as `websockets = ["dep:tokio-websockets"]`. Enabling it on the workspace dep makes the connector route `wss://` URLs through tokio-websockets which does the TLS+upgrade dance correctly. Curl already proved the server-side path works (`101 Switching Protocols` + NATS `INFO`); the missing piece was always client support. The operator wasn't affected because it talks to NATS in-cluster on `nats://fleet-nats.fleet-staging.svc.cluster.local:4222` (plain TCP). Only external clients going through the public wss:// Route hit this.	2026-05-06 13:20:48 -04:00
Jean-Gabriel Gill-Couture	8ff0f0dd65	fix(nats): enable JetStream on the auth-callout account NATS server-level `jetstream: { ... }` config doesn't extend to explicit accounts — each one has to opt in individually with `jetstream: enabled` (or a per-account quota object). The rendered values block declared `FLEET` and `SYS` accounts but never enabled JetStream on `FLEET`, so the operator's first call to create its desired-state KV bucket died immediately with: JetStream error: JetStream not enabled for account (code 503, error code 10039) Adds `jetstream: enabled` to the callout account block in `render_values_yaml`. SYS deliberately stays without it — system account doesn't host streams. Reference: https://docs.nats.io/nats-concepts/jetstream/account_jetstream Adds `auth_callout_account_has_jetstream_enabled` regression test that: * asserts `jetstream: enabled` appears under the callout account block in the rendered YAML; * defense-in-depth: asserts `jetstream:` does NOT appear under SYS, so a future regex slip can't silently flip system-account JetStream on.	2026-05-06 12:48:12 -04:00
Jean-Gabriel Gill-Couture	e6ee5070b0	fix(fleet-operator): use TOML literal strings for inline JSON keyfile The operator's `credentials.toml` embeds Zitadel's JSON machine-key content under `key_json`. Both `fleet_staging_install` and the docstring example used basic triple-quoted strings (`"""..."""`), which interpret backslash escapes — every `\n` in the embedded RSA private key gets expanded to a literal 0x0A before the value lands in the operator's env var. The operator's `harmony-fleet-auth` deserializer then runs `serde_json::from_str` on a "JSON" string that contains raw control chars inside string literals and rejects it with "control character found while parsing a string at line 2 column 0". The fix is a one-character delta: switch to TOML literal multi-line strings (triple single-quote). Literal strings preserve backslash sequences as-is, so `\n` reaches the JSON parser as the two chars `\` + `n`, gets interpreted as a string escape, and the multi-line PEM decodes correctly. Updates `fleet_staging_install`'s `format!()` template to render `key_json = '''<json>'''` and rewrites the docstring example on `OperatorCredentials::credentials_toml` to spell out which string form is required, with the failure mode that comes from picking the wrong one.	2026-05-06 12:15:26 -04:00
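The difference between the two TOML string forms in the commit above can be modelled without a TOML parser. This sketch is a deliberate simplification: `decode_basic_newlines` imitates only the `\n` escape of TOML basic strings (real TOML handles more escapes), and `render_key_json` is a hypothetical stand-in for the `format!()` template mentioned in the commit.

```rust
/// Render the `key_json` entry as a TOML multi-line *literal* string.
/// A literal string cannot contain `'''`, so that input is rejected.
pub fn render_key_json(json: &str) -> Result<String, String> {
    if json.contains("'''") {
        return Err("JSON contains ''' and cannot be embedded in a TOML literal string".to_string());
    }
    Ok(format!("key_json = '''{}'''", json))
}

/// Simplified model of what a TOML *basic* string does to the two-character
/// sequence backslash + n: it becomes a real newline (0x0A).
pub fn decode_basic_newlines(raw: &str) -> String {
    raw.replace("\\n", "\n")
}

/// JSON forbids raw control characters inside string values; for a compact
/// single-line document, any control character signals that problem.
pub fn has_raw_control_char(s: &str) -> bool {
    s.chars().any(|c| c.is_control())
}
```

For a compact JSON value such as `{"key":"LINE1\nLINE2"}`, the literal form hands the backslash and the `n` to the JSON parser untouched, whereas the basic-string model yields a raw newline inside the JSON string, which is the kind of input `serde_json` rejected in the commit.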
Jean-Gabriel Gill-Couture	71312d27ba	fix(fleet-operator): apply credentials Secret before helm install The operator chart's Deployment references `harmony-fleet-operator-secrets` via `envFrom`/`secretKeyRef` for the `FLEET_OPERATOR_CREDENTIALS_TOML` env var, but the Secret is intentionally NOT bundled in the on-disk helm chart (credentials are operator-environment-specific — see comment in `chart::build_chart`). The chart docs say "applies the Secret directly via `operator_secret()` (used as a `K8sResourceScore`)", but `FleetOperatorInterpret::execute` never actually did that. Result: the operator pod stalls forever in `CreateContainerConfigError` with `secret "harmony-fleet-operator-secrets" not found`. Fix: when `score.credentials` is set, build the Secret via `operator_secret(&chart_options)` and apply it via `K8sResourceScore` before the helm install fires. This way kube has the Secret in place by the time the chart's Deployment lands and the pod starts cleanly. Mirrors the pattern `NatsAuthCalloutScore` already uses for its own callout Secret. Trait bound widens from `T: Topology + HelmCommand` to `T: Topology + HelmCommand + K8sclient` to support the `K8sResourceScore::interpret` call. The only existing caller (`fleet_staging_install`) drives this through `K8sAnywhereTopology` which already implements all three. When `credentials` is `None` (no-auth dev mode) we skip the Secret apply entirely — the chart's Deployment doesn't reference it in that case either.	2026-05-06 11:53:10 -04:00
Jean-Gabriel Gill-Couture	9baae65171	chore(fleet-agent): default tracing filter to `info` `EnvFilter::from_default_env()` returns the empty filter when `RUST_LOG` isn't set, which silences every log line. The systemd unit installed by `FleetDeviceSetupScore` does pass `RUST_LOG=info`, but a hand-launched binary, an overridden unit, or any other invocation path produced a silent agent — including the dev-on-device run the user just hit. Switches to `try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"))` so: * RUST_LOG unset → info-level by default (what the operator wants the moment they look for logs). * RUST_LOG set → respected as before (`RUST_LOG=debug` for troubleshooting, `RUST_LOG=warn` if it's too chatty, etc.). The systemd unit's existing `Environment=RUST_LOG=info` line is left in place — explicit + harmless, and lets a customer toggle the unit's verbosity without rebuilding the binary.	2026-05-06 11:17:21 -04:00
Jean-Gabriel Gill-Couture	0891798073	fix(linux): wait for user manager bus after `loginctl enable-linger` `loginctl enable-linger` returns to the caller before logind has actually finished bringing up `user@<uid>.service`. The next step in `FleetDeviceSetupScore` (Step 4/7 — activating user-scoped podman.socket) calls `systemctl --user` against the just-lingered user, which fails with: Failed to connect to user scope bus via local transport: No such file or directory …because `/run/user/<uid>/bus` doesn't exist yet. The user manager is on its way up but the score has already moved on. Reproducible on a fresh dev-on-device run. Adds a `wait_for_user_bus` helper that polls `/run/user/<uid>/bus` for up to 5s after `enable-linger`. We've never seen the wait take more than a fraction of a second in practice; 5s is a generous ceiling that gives a clear error pointing at the right diagnostic commands (`journalctl -u user@<uid>.service`, `loginctl user-status`) if logind is genuinely stuck.	2026-05-06 10:29:16 -04:00
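The polling approach described in the commit above can be sketched with the standard library alone. This is an illustration of the technique rather than the actual `wait_for_user_bus` helper: `wait_for_path` and `user_bus_path` are hypothetical names, and the polling interval is left to the caller because the commit only states the 5s ceiling.

```rust
use std::path::{Path, PathBuf};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Path of the per-user bus socket that `systemctl --user` needs.
pub fn user_bus_path(uid: u32) -> PathBuf {
    Path::new("/run/user").join(uid.to_string()).join("bus")
}

/// Poll until `path` exists or `timeout` elapses.
/// Returns how long the wait took, or an error pointing at diagnostics.
pub fn wait_for_path(path: &Path, timeout: Duration, interval: Duration) -> Result<Duration, String> {
    let start = Instant::now();
    loop {
        if path.exists() {
            return Ok(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return Err(format!(
                "{} did not appear within {:?}; check `journalctl -u user@<uid>.service` and `loginctl user-status`",
                path.display(),
                timeout
            ));
        }
        sleep(interval);
    }
}
```

A caller following the commit's description would invoke something like `wait_for_path(&user_bus_path(uid), Duration::from_secs(5), Duration::from_millis(100))` right after `loginctl enable-linger`; the 100 ms interval is an arbitrary choice for the sketch.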
Jean-Gabriel Gill-Couture	32bad8f746	feat(linux): local connection mode + auto-install python3-venv on Debian Two ergonomic fixes for the dev-on-device workflow. (1) Ansible local connection. `LinuxHostTopology` always went through SSH, so running `fleet_device_enroll` with `--target ssh://you@127.0.0.1` required the operator to set up sshd loopback access on their own Pi — clunky for a dev who's sitting in front of the device. Adds `LinuxLocalhostTopology` that drives the same `LinuxHostConfiguration` trait surface using ansible's `-c local` connection (no SSH at all) plus direct `sh -c` subprocess calls for the loginctl / systemctl --user paths. The configurator now takes a unified `AnsibleConnection<'a>` enum (`Ssh { host, creds }` | `Local { sudo_password }`) instead of a `(host, creds)` pair. Internal `host_exec`/`host_sudo_exec` helpers branch by transport and return the same `SshCommandOutput` shape either way, so the public methods (ping, ensure_package, ensure_file, etc.) are transport-agnostic. `fleet_device_enroll` switches `--target` to optional: omitted → local, present → SSH. No magic `localhost` string, no special-case for 127.0.0.1. README + the flag's help text describe both modes. (2) Auto-install `python3-venv` on Debian. First-run venv creation fails on stock Debian/Ubuntu with `ensurepip is not available` because Debian splits venv into the `python3-venv` apt package. `ensure_ansible_venv` now detects that failure, checks for `/etc/debian_version`, runs `sudo apt-get update && sudo apt-get install -y python3-venv`, and retries. Idempotent on re-runs (apt is a noop when already installed). On non-Debian or genuinely broken environments, the operator gets a clear error pointing at the right install command per distro family. Sudo prompts for a password if not configured passwordless — that's fine, the operator expects it.	2026-05-06 10:19:13 -04:00
Jean-Gabriel Gill-Couture	e1d74bae45	fix(zitadel): support cross-org admin via x-zitadel-orgid + better diagnostics Real symptom from a staging run: Error: FleetDeviceSetupScore: Project 'fleet' not found in Zitadel — run ZitadelSetupScore first to create it …even though the project clearly existed and was visible in the Zitadel UI. Cause: `/management/v1/` scopes by the caller's org. The SSO operator's primary org is whatever org their personal account lives in; the project was created by the system iam-admin user, in the system org. With no `x-zitadel-orgid` override, the search runs in the operator's org and returns empty. Project effectively "invisible" to that token. Three changes: `ZitadelSetupScore` gains `admin_org_id: Option<String>`. When set, every management API call sends `x-zitadel-orgid: <id>`. Plumbed through `request()` next to the existing conditional `Host:` header. Default `None`, serde-default for backward compat. * `FleetDeviceAuth::ZitadelEnroll` gains a matching `admin_org_id` field, threaded through `resolve_zitadel_enroll` into the synthetic `ZitadelSetupScore` connection it builds for `mint_device_credentials`. CLI surface: `--admin-org-id` on `fleet_device_enroll`, with help text explaining the symptom and where to find the value (Zitadel UI → Organization → Resource ID). * `find_project` now uses a `nameQuery` filter rather than scanning the full default-paginated list, so it doesn't depend on the project being on page 1. When the filter returns empty it falls back to an unfiltered enumeration and logs the project names that ARE visible to the token — that list is usually enough for the operator to spot an org-context mismatch in seconds. The not-found error in `mint_device_credentials` was rewritten to spell out the three real causes (org context, role, no project) instead of the misleading "run ZitadelSetupScore first". All 7 existing `ZitadelSetupScore` initializer sites updated with `admin_org_id: None`. 
README's troubleshooting section gets the new failure-mode entry.	2026-05-05 23:13:47 -04:00
Jean-Gabriel Gill-Couture	a1c9e33955	fix(fleet): provision device-code OIDC app + require numeric client_id The SSO login from `fleet_device_enroll` was hitting Zitadel with the app name (`harmony-cli`) as the OAuth client_id, getting back: 400 Bad Request: invalid_client: no active client not found Two real problems behind that error: * `fleet_staging_install` never created the device-code OIDC app in the first place. Its `applications: vec![]` was empty — the only Zitadel resources provisioned were the API app, the project roles, and the machine users. The `harmony-cli` device-code app that the enrollment example assumed was provisioned simply did not exist. Adds it via `ZitadelApplication { app_type: DeviceCode }` so a fresh staging install yields a real OIDC app. * `--admin-oidc-client-id` defaulted to the literal string `"harmony-cli"`, which is the app's display name, not the client_id. Zitadel issues numeric client_ids of the form `<number>@<project>` when the app is created — that's what OAuth endpoints want. Defaulting to the name was misleading: it produces no warning, just a confusing 400 from Zitadel about a "client not found" that the operator can't easily map back to "wrong field passed to the flag". Removes the default; the flag is now required when SSO is in use (skipped only with `--admin-token`). Help text and README spell out the distinction explicitly. The staging install now reads the resolved client_id from `ZitadelClientConfig::client_id(...)` and prints it in the success banner, alongside a copy-paste-ready `fleet_device_enroll` invocation. README also documents the post-install lookup path (`jq -r '.apps."harmony-cli"' ~/.local/share/harmony/zitadel/client-config.json`) and adds the `invalid_client` error to the troubleshooting list.	2026-05-05 23:01:57 -04:00
Jean-Gabriel Gill-Couture	21fc76d770	chore(fleet): bump rehearsal VM to Debian trixie + recommend musl target Two related issues from a real run. (1) Image was Debian 12 bookworm — released June 2023, glibc 2.36, two releases old by mid-2026. Bumping to Debian 13 trixie (current stable since Aug 2025, glibc 2.41) keeps the rehearsal kernel + userland roughly aligned with what's likely sitting on a fresh Pi imaged today. URL pattern is unchanged (`cloud.debian.org/.../latest/`), still no sha pin (latest/ rotates per point release; swap to a dated subdir if cryptographic provenance matters). The `cdrom` is still attached as virtio-blk read-only — that fix is independent and still required (Debian's cloud-arm64 kernel ships without ahci.ko). Renames in `harmony::modules::fleet`: ensure_debian_bookworm_arm64_cloud_image → ensure_debian_trixie_arm64_cloud_image DEBIAN_BOOKWORM_CLOUDIMG_ARM64_{URL,FILENAME} → DEBIAN_TRIXIE_CLOUDIMG_ARM64_{URL,FILENAME} (2) The device-side `--target aarch64-unknown-linux-gnu` cross-compile produced a binary that linked against the workstation's glibc (2.41 on a current Arch host). Running it on the rehearsal VM (Debian 12 / 13) blew up immediately: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.39' not found This is fundamental to the gnu target — the binary depends dynamically on whatever glibc the host happens to have. The fix isn't a workaround on the harmony side; it's switching the device build to `aarch64-unknown-linux-musl`, which produces a fully-static binary that runs on any aarch64 Linux regardless of the device's libc generation. README updated with the musl recipe (rustup target, cargo config linker, optional `cross` shortcut) and the rationale for why musl beats gnu for device-side cross-compiles. Workstation build is unchanged.	2026-05-05 22:50:45 -04:00
Jean-Gabriel Gill-Couture	7ad2dc9bd5	feat(fleet-device-enroll): feature-gate VM-rehearsal so the binary cross-compiles for arm64 `harmony`'s `kvm` feature pulls in `libvirt`, which doesn't link on aarch64-unknown-linux-gnu (no aarch64 `libvirt-dev` package on most distros). The device-side workflow needs a binary that runs ON the Pi and only does enrollment — no VM-rehearsal — but the example was unconditionally enabling `kvm`, so the cross-compile failed at link time with `undefined reference to virStoragePoolFree` etc. Fixes by gating the rehearsal bits behind a new `vm-rehearsal` Cargo feature (default-on for workstation builds, opt-out via `--no-default-features` for device builds): * `Cargo.toml`: harmony dep is now `default-features = false, features = ["podman"]` (podman is needed unconditionally — the operator CRD types depend on it). New `vm-rehearsal` feature enables `harmony/kvm` on demand. * `main.rs`: every libvirt-touching import, CLI flag (`--launch-pi-vm`, `--vm-rehearsal`, `--vm-`), CLI branch, and helper function (`boot__vm`, `RehearsalImage`) is now `#[cfg(feature = "vm-rehearsal")]`. With the feature off, none of it is referenced and nothing tries to link libvirt. * README: documents both build flavors with copy-paste commands. Workstation build (unchanged): cargo build --release -p example_fleet_device_enroll Device-side build (the new path): cargo build --release --target aarch64-unknown-linux-gnu \ -p example_fleet_device_enroll --no-default-features	2026-05-05 22:42:49 -04:00
Jean-Gabriel Gill-Couture	bc7c8808bb	fix(kvm): attach cloud-init seed as virtio-blk read-only disk, not SATA cdrom Symptom: `--launch-pi-vm` boots a Debian bookworm arm64 VM, SSH comes up, but the configured `fleet-admin` user doesn't exist and key auth fails. The seed ISO is well-formed (CIDATA volume label, valid user-data, valid meta-data), but cloud-init never finds it. Root cause: Debian's `linux-image-cloud-arm64` kernel — and other slimmed cloud-image kernels — ship WITHOUT `ahci.ko`, because real clouds don't expose SATA. The SATA cdrom we attach is invisible to the guest: * `dmesg` has zero ata/ahci/scsi/sr0 lines (confirmed by inspecting the post-boot overlay's journald). * `blkid -tLABEL=CIDATA` returns nothing. * cloud-init's NoCloud datasource gives up, falls through to `DataSourceNone`, applies no user-data, the user the score wanted to create never gets created. Final cloud-init log line: `Cloud-init v. 22.4.2 finished at … Datasource DataSourceNone` `cc_final_message.py[WARNING]: Used fallback datasource` Fix: attach the seed as `device='disk'` `bus='virtio'` with `<readonly/>`. virtio-blk is the universal cloud-image baseline — every cloud kernel includes the driver — and cloud-init's NoCloud datasource finds the seed via the volume label regardless of device type. The `cdrom`/`CdromConfig` naming on the public API is kept (callers mentally model the seed as removable media), but the wire shape is now virtio-blk on every arch. Device name moves from `hdb` to `vdb` accordingly. Tests: `domain_xml_cdrom_device_uses_virtio_blk_readonly` pins the new shape and explicitly asserts that the SATA / IDE-cdrom shape does NOT come back — that's the regression this test exists to prevent.	2026-05-05 22:28:30 -04:00
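The wire shape this fix describes (`device='disk'`, `bus='virtio'`, `<readonly/>`, target `vdb`) can be illustrated with a small renderer. This is a sketch under assumptions: `seed_disk_xml` is a hypothetical name, the `raw` driver type is assumed for an ISO seed, and the real domain-XML builder in `harmony` is not shown in the commit.

```rust
/// Hypothetical renderer for the cloud-init seed `<disk>` element.
/// Assumes `iso_path` contains no characters that need XML escaping.
pub fn seed_disk_xml(iso_path: &str) -> String {
    format!(
        "<disk type='file' device='disk'>\n\
         \x20 <driver name='qemu' type='raw'/>\n\
         \x20 <source file='{iso_path}'/>\n\
         \x20 <target dev='vdb' bus='virtio'/>\n\
         \x20 <readonly/>\n\
         </disk>"
    )
}
```

A regression test in the spirit of `domain_xml_cdrom_device_uses_virtio_blk_readonly` would assert both that the virtio shape is present and that no `device='cdrom'` or `bus='sata'` attribute comes back.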
Jean-Gabriel Gill-Couture	cf350d890a	feat(fleet): device enrollment via Zitadel SSO + Pi-equivalent rehearsal VM `FleetDeviceSetupScore` gains `FleetDeviceAuth::ZitadelEnroll` — resolves the device's Zitadel machine user + JSON key inline, then falls through to the existing keyfile-drop flow exactly as if a pre-resolved `ZitadelJwt` had been passed. Two operator workflows fall out of this: * Dev-on-device — developer runs the score on a Pi with display attached, browser opens locally to Zitadel SSO, dev signs in with their personal account (must hold IAM_OWNER or equivalent), score mints credentials for that one device and brings up the agent. * Production-via-SSH — operator runs from a workstation, targets each device over SSH. Browser opens once on the workstation; the resulting access token is in-memory only for v0 (per-batch token caching tracked in ROADMAP/fleet_platform/device_enrollment_token_caching.md). Implementation: * `harmony/src/modules/zitadel/admin_auth.rs` — RFC 8628 device-code flow against Zitadel. Tries `webbrowser::open`, falls back to printing the URL (SSH sessions just see the URL). Minimum scope set is `openid urn:zitadel:iam:org:project:id:zitadel:aud` — enough to call `/management/v1/`, nothing more. `harmony/src/modules/zitadel/setup.rs` — `mint_device_credentials` helper that reuses the existing find-or-create methods (project, machine user, user grant) plus `create_machine_key`. Idempotent on user + grant; always mints a new key because Zitadel does not return existing key material. * `harmony/src/modules/fleet/setup_score.rs` — new `ZitadelEnroll` variant + `AdminAuth::{Sso, Token}`. Resolution runs at the top of execute(); the rest of the score sees a single shape. render_toml's match collapses both Zitadel variants into one arm (they share the issuer/audience/danger fields). * `harmony/src/modules/fleet/assets.rs` — Debian bookworm arm64 generic-cloud image fetcher. 
This is the same Debian base Raspberry Pi OS is built on; Pi OS itself is locked to Pi hardware (Broadcom firmware) and won't boot in generic KVM. No sha pin (Debian's `latest/` URL rotates per point release); swap to a dated subdir if you need cryptographic provenance. * `examples/fleet_device_enroll/` — single CLI covering both workflows + a `--launch-pi-vm` switch that boots a Pi-equivalent VM with one command and prints the SSH details + suggested follow-up enrollment command. README walks the three flows. Tests: `render_toml_zitadel_enroll_renders_same_as_zitadel_jwt` locks the byte-equivalence between the unresolved (Enroll) and resolved (Jwt) variants — the invariant `execute()` relies on so TOML rendering is independent of when admin auth resolves. Adds `webbrowser` as a regular dependency on `harmony` (small, no feature gate).	2026-05-05 22:08:59 -04:00
Jean-Gabriel Gill-Couture	9e9289ac72	feat(zitadel): URL params, readiness wait, persisted admin password, shared TLS Ingress A grab-bag of fixes the OKD staging install surfaced. Each landed as a diagnosable failure during real deploys: * URL parametrization. ZitadelSetupScore was hardcoded to `http://127.0.0.1:{port}` with a `Host:` header — fine for k3d port-forward, broken everywhere else. Adds `scheme: ZitadelScheme` (Http/Https), `port: Option<u16>` (None → scheme default), and `endpoint: Option<String>` for the rare port-forward case. The `Host:` header is now only injected when `endpoint` is set. * HTTP readiness gate. Helm reports SUCCESS when pods are Ready but on OKD the Route + cert-manager Certificate reconcile asynchronously — the first management call after install was dying with `CaUsedAsEndEntity` (rustls rejecting OKD's bootstrap CA cert served while cert-manager was still issuing). Score now polls `/debug/ready` with retry; treats connect / TLS errors as transient. * Admin password persistence. ZitadelScore was generating a fresh random password on every run, then printing it in the success banner — but Zitadel's chart only honors FirstInstance.* on the first install, so the printed password didn't match what was live in the DB. Now persisted via harmony_secret (LocalFile by default). * Login banner shows full SSO loginName. Default Zitadel org name is ZITADEL → org primary domain is `zitadel.<ExternalDomain>` → admin preferredLoginName is `admin@zitadel.<host>`. Print the full string so the operator pastes the right value. * Shared TLS Secret across Zitadel + login Ingresses. Two cert-manager-annotated Ingresses on the same host create two Certificates → two ACME Orders → competing HTTP01 challenges; the loser's Secret never lands and on OKD the second Ingress's Route is silently never admitted because the controller inlines TLS material into the Route at creation time. 
Login Ingress now references `zitadel-tls` (same as main) and drops its cert-manager.io annotation. Documented in docs/guides/kubernetes-ingress.md as the canonical pattern with the diagnostic signature so this doesn't get rediscovered. * fleet_staging_deploy namespaces. The OLDER staging deploy example hardcoded `fleet-system` / `zitadel`; renamed to `fleet-staging` / `zitadel-staging` to match `fleet_staging_install`'s convention. Five example call sites updated for the new ZitadelSetupScore shape; fleet_e2e_demo / fleet_auth_callout / harmony_sso pass the k3d port-forward as `endpoint: Some("http://127.0.0.1:8080")`, the staging examples take the defaults (direct https on 443). Tests: 8 new unit tests in setup.rs lock the URL builder, Host-header conditional, scheme serde, and minimal-fields deserialization. One new test in setup_score covers render_toml.	2026-05-05 22:08:30 -04:00
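The URL parametrization from this commit (scheme, optional port, optional `endpoint` override, conditional `Host:` header) can be sketched as follows. `request_target` and `as_str` are hypothetical names, and omitting the port from the URL when it is `None` is an assumption about how "scheme default" is realized; the actual builder in `setup.rs` may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ZitadelScheme {
    Http,
    Https,
}

impl ZitadelScheme {
    pub fn as_str(self) -> &'static str {
        match self {
            ZitadelScheme::Http => "http",
            ZitadelScheme::Https => "https",
        }
    }
}

/// Hypothetical helper returning `(base_url, host_header)`.
/// The `Host:` header is only produced when `endpoint` overrides the URL,
/// which is the port-forward case described in the commit.
pub fn request_target(
    scheme: ZitadelScheme,
    host: &str,
    port: Option<u16>,
    endpoint: Option<&str>,
) -> (String, Option<String>) {
    if let Some(ep) = endpoint {
        return (ep.trim_end_matches('/').to_string(), Some(host.to_string()));
    }
    let base = match port {
        Some(p) => format!("{}://{}:{}", scheme.as_str(), host, p),
        // No explicit port: rely on the scheme's default (80 or 443).
        None => format!("{}://{}", scheme.as_str(), host),
    };
    (base, None)
}
```

With this shape, the k3d examples would pass `Some("http://127.0.0.1:8080")` as the endpoint, while the staging examples pass `None` and connect directly over https.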
Jean-Gabriel Gill-Couture	8b7b8953cb	fix(crd): use struct EnvVar instead of tuple for k8s schema validity `PodmanService.env: Vec<(String, String)>` made schemars emit `items: [{type: string}, {type: string}]` (OpenAPI tuple validation), which k8s apiextensions rejects with "Forbidden: items must be a schema object and not an array" — install of the operator's `deployments.fleet.nationtech.io` CRD blew up at the Helm step. Introduces `EnvVar { name, value }` in `domain::topology` (with `From<(String,String)>` for ergonomics) and switches both `PodmanService.env` and `ContainerSpec.env` to `Vec<EnvVar>`. schemars now produces `items: { type: object, properties: { name, value } }` which validates cleanly. Adds `env_schema_is_object_not_tuple_for_crd_validation` to lock the schema shape — if anyone reverts to a tuple the test fails before the operator install does.	2026-05-05 22:07:11 -04:00
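A minimal sketch of the `EnvVar` shape this commit introduces, using only the standard library. The serde and schemars derives that the real type needs for CRD generation are left out here, and `env_from_pairs` is a hypothetical convenience, not a function named in the commit.

```rust
/// Named fields make a schema generator emit an object schema for each item,
/// whereas a `(String, String)` tuple is emitted as tuple (array-form) `items`.
/// The real type would also derive the serde and schemars traits.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EnvVar {
    pub name: String,
    pub value: String,
}

/// Ergonomic conversion from the old tuple form, as mentioned in the commit.
impl From<(String, String)> for EnvVar {
    fn from((name, value): (String, String)) -> Self {
        EnvVar { name, value }
    }
}

/// Hypothetical helper for migrating an existing `Vec<(String, String)>`.
pub fn env_from_pairs(pairs: Vec<(String, String)>) -> Vec<EnvVar> {
    pairs.into_iter().map(EnvVar::from).collect()
}
```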
Jean-Gabriel Gill-Couture	93189f2776	fix(zitadel-setup): namespace where iam-admin-pat lives is configurable ZitadelSetupScore was hardcoded to look for the `iam-admin-pat` secret in `zitadel`. After ZitadelScore gained a configurable namespace (so it can deploy into `zitadel-staging`), the setup score continued reading from the wrong place and failed: Secret 'iam-admin-pat' not found in namespace 'zitadel' — ensure ZitadelScore Helm values configure FirstInstance.Org.Machine.Pat Adds `pub namespace: String` to ZitadelSetupScore (default "zitadel" via serde for backward compatibility). The 5 example call sites get explicit `namespace:` fields — fleet_staging_install threads `cli.zitadel_namespace` through, the rest hardcode the legacy value to keep their behavior unchanged. The `read_admin_pat` helper now uses `self.score.namespace` instead of the const, and the error message points at the mismatch between ZitadelScore.namespace and ZitadelSetupScore.namespace as the most likely cause.	2026-05-05 12:58:43 -04:00
Jean-Gabriel Gill-Couture	8242bb8429	fix(zitadel): emit ExternalPort in OpenShift values so issuer URL is correct The chart's OpenShift-flavored values previously omitted `ExternalPort` from the configmapConfig. Zitadel falls back to its internal listen port (8080), which then leaks into every externally-emitted URL — most visibly the management console URL and the OIDC issuer claim: Management Console URL: https://sso-staging.cb1.nationtech.io:8080/ui/console iss in tokens: https://sso-staging.cb1.nationtech.io:8080 But clients reach Zitadel through the OKD edge-TLS Route on 443. The mismatch surfaces as JWT-bearer 500s (`Errors.Internal`) and broken OIDC discovery for any client that compares the issuer to the URL it actually used. Fix: resolve `ExternalPort` defensively. When the caller passes `external_port: Some(p)`, honor it. When `None`, default to 443 for `external_secure: true` and 80 otherwise — matching the public port the OKD Route serves on. The K3s/local branch already supported `external_port` overrides via a separate code path (k3d port mappings); behavior unchanged there.	2026-05-05 12:54:34 -04:00
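The defaulting rule for `ExternalPort` described above is small enough to state as code. The function name `resolve_external_port` is hypothetical; the rule itself (honor an explicit port, else 443 when secure, else 80) is taken from the commit message.

```rust
/// Hypothetical helper: pick the port Zitadel should advertise externally.
pub fn resolve_external_port(external_port: Option<u16>, external_secure: bool) -> u16 {
    match external_port {
        // An explicit caller-provided port always wins.
        Some(p) => p,
        // Otherwise match the public port the Route serves on.
        None if external_secure => 443,
        None => 80,
    }
}
```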
Jean-Gabriel Gill-Couture	7630ee2de2	fix(zitadel): inject namespace SCC uid-range start as runAsUser The chart's defaults pin runAsUser=1000 / fsGroup=1000 in the chart-wide podSecurityContext + securityContext blocks. On OpenShift, restricted-v2 SCC rejects pods that pin a UID outside the namespace's allocated `openshift.io/sa.scc.uid-range` range (typically `1000700000/10000`). Previous attempts: - `runAsUser: null` in our overrides → schema rejects (`type: integer`) - omit our overrides → chart defaults apply → SCC rejects 1000 Right answer: read the namespace's `openshift.io/sa.scc.uid-range` annotation at install time, parse the start UID, inject it as `runAsUser` + `fsGroup` into every securityContext block we emit. Schema is happy (integer), SCC is happy (UID is in range). Wired into the OpenshiftFamily branch of the values renderer: chart-wide pod + container securityContext, initJob, setupJob, and login (per-component override that the chart's helpers prefer over chart-wide). K3s / vanilla K8s gets `1000` literal — chart default, no SCC to worry about. Bonus: namespace must pre-exist before this Score runs (caller's job; the staging install doc already covers this).	2026-05-05 12:41:19 -04:00
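Parsing the start UID out of the `openshift.io/sa.scc.uid-range` annotation could look like the sketch below. It assumes the `<start>/<size>` format shown in the commit (`1000700000/10000`); `parse_uid_range_start` is a hypothetical name, and fetching the annotation from the namespace is out of scope here.

```rust
/// Hypothetical helper: extract the first UID of the namespace's allocated range.
/// Returns `None` when the value is not in `<start>/<size>` form.
pub fn parse_uid_range_start(annotation: &str) -> Option<i64> {
    let (start, size) = annotation.trim().split_once('/')?;
    // Validate the size too, so a malformed annotation is rejected as a whole.
    size.parse::<i64>().ok()?;
    start.parse::<i64>().ok()
}
```

The returned value is what would be injected as both `runAsUser` and `fsGroup`, so the chart schema sees an integer and the SCC sees a UID inside the allocated range.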
Jean-Gabriel Gill-Couture	37e3c3847f	fix(docker,build): tighten .dockerignore + multi-stage callout image The build context for `podman build` was the workspace root — fine for cargo's path-deps, but `COPY . .` shipped 147 GB to the build daemon (target/, .claude/worktrees, .git, demos, network test data, manual_mint scratch). Tightens the .dockerignore to exclude the heavy items, dropping the context to ~180 MB. The callout Dockerfile was also single-stage with a host pre-built binary (`COPY target/release/harmony-nats-callout`), which conflicts with the new strict .dockerignore (target/ is now excluded). Rewrote to mirror the operator's multi-stage cargo-in-Docker shape — same builder + runtime images, same USER 65532 convention. Build script consequences: * No more host-side `cargo build --release -p harmony-nats-callout` step. Both images now build self-contained from the workspace context. * Two podman build invocations (operator + callout), then push. The k3d e2e harness (`fleet_auth_callout::build_and_load_callout_image`) was relying on the old single-stage Dockerfile via tempdir staging; it now writes its own minimal single-stage Dockerfile inline so the fast local-iteration path is unaffected by the production-shape change in `nats/callout/Dockerfile`. Also includes `topology.ensure_ready()` in fleet_staging_install (needed for cert-manager bootstrap on first apply). Verified: `podman build` for the callout completes successfully; operator build is the same shape and was mid-compile in testing.	2026-05-05 12:34:19 -04:00
Jean-Gabriel Gill-Couture	8a76afd622	fix(zitadel): omit runAsUser/fsGroup overrides instead of nulling them The Zitadel helm chart's JSON schema validates each securityContext block against integer types for runAsUser/fsGroup. Setting either to `null` in values.yaml triggers: Error: values don't meet the specifications of the schema(s): zitadel: - at '/login/podSecurityContext/runAsUser': got null, want integer The intent of the original `null`s was "let OpenShift's restricted-v2 SCC assign UID/GID" — but the chart's schema doesn't recognize that as valid YAML. The right way to leave the fields unset is to omit them from the values block entirely; with no key, the chart's default (also null/unset) applies and the SCC takes over at admit time. Strips 14 occurrences of `runAsUser: null` / `fsGroup: null` across the main pod, init job, setup job, and login pod security contexts. runAsNonRoot/seccompProfile/capabilities-drop stay — those are fields the chart accepts.	2026-05-05 12:33:34 -04:00
Jean-Gabriel Gill-Couture	f727e4dbea	docs(fleet): step-by-step OKD staging install runbook Walks through: build+push images, namespace creation, KUBECONFIG sanity, fleet_staging_install run, layer-by-layer verification (Zitadel cert + URL, NATS pod + callout subscribe, operator auth + controller, public WSS reachable, CRDs registered), per-device machine user creation in Zitadel UI, agent config TOML render + launch, end-to-end Deployment CRD walk, common failure modes with diagnostic commands, teardown. Cross-linked from the existing FAQ + manual-mint-recipe guides.	2026-05-05 12:01:16 -04:00
Jean-Gabriel Gill-Couture	28bd0fb17d	chore(fleet): build_and_push_images.sh helper One-shot script to build + push the operator and auth-callout container images. Pre-builds the callout binary on the host (its Dockerfile expects target/release/harmony-nats-callout to exist — matches the local-k3d iteration convention). Operator image is self-contained multi-stage. Defaults: REGISTRY=hub.nationtech.io/harmony, IMAGE_TAG=dev, PUSH=1. Override via env. Built refs are echoed at the end as the exact flags to paste into fleet_staging_install.	2026-05-05 11:52:29 -04:00
Jean-Gabriel Gill-Couture	4728079aea	feat(zitadel,fleet): configurable namespace + cluster_issuer + fleet_staging_install ZitadelScore gains two fields, both with defaults that preserve the previous hardcoded behavior: pub namespace: String // default "zitadel" pub cluster_issuer: String // default "letsencrypt-prod" The hardcoded `NAMESPACE` const becomes `pub const DEFAULT_NAMESPACE` and the YAML's `cert-manager.io/cluster-issuer` annotation now substitutes `{cluster_issuer}` from the field. Existing struct-literal ZitadelScore call sites (5 examples) updated to fall through to `..Default::default()` so older callers compile unchanged. New example: `examples/fleet_staging_install`. One-shot install of the fleet stack on OKD-shaped clusters, composing in order: 1. ZitadelScore (helm) into `--zitadel-namespace` 2. ZitadelSetupScore (project + roles + fleet-ops + fleet-operator machine users) 3. NatsK8sScore: single-instance + auth_callout + WS Route 4. NatsAuthCalloutScore: env-var-only Secret config 5. FleetOperatorScore: credentials TOML inlining the operator's JSON keyfile via key_json (no volume mounts) Public hostnames derive from one CLI flag: `--base-domain`. The demo uses `cb1.nationtech.io` → sso-staging.cb1.nationtech.io and nats-fleet-staging.cb1.nationtech.io. cert-manager `--cluster-issuer` defaults to `letsencrypt-prod`. Image refs (`--operator-image`, `--callout-image`) are required (private registry, no sensible default). Generates the issuer NKey + auth pass at install time; the callout's Secret consumes them via env-from-secret-key. One TOML file end-to- end: the operator pod's only mounted Secret is the credentials TOML, single-key, no volumes. Idempotency note: re-running ZitadelSetupScore with the same project name short-circuits via the cached client-config. Re-runs of NATS / operator / callout are idempotent at the Helm/K8sResourceScore level.	2026-05-05 11:51:32 -04:00
Jean-Gabriel Gill-Couture	f519eda259	feat(fleet): FleetServerScore takes NatsK8sScore + adds identity_setup + auth_callout `FleetServerScore` now composes: * `nats: NatsK8sScore` — replaces NatsBasicScore. Same Score that knows about OKD Routes, the auth_callout block in NATS Helm values, and the WS edge-TLS wiring. The NatsBasicScore-using `fleet_server_install` example registers the simple inner Scores directly (no FleetServerScore wrapper) — keeps the basic k3d-style install working without forcing it through the K8s-flavor Score. * `identity_setup: Option<ZitadelSetupScore>` — runs after the Zitadel helm install. Provisions project + roles + machine users via Zitadel's management API. The keys it produces are what the operator authenticates with. * `auth_callout: Option<NatsAuthCalloutScore>` — deploys the callout pod. Pair with `nats.auth_callout = Some(...)` so the rendered NATS values delegate to the same issuer pubkey. Execute order: identity (helm) → identity_setup (API) → nats (with auth_callout block in values) → auth_callout (pod) → operator The operator goes last so it doesn't burn reconnect attempts while the rest comes up; its `connect_with_retry` covers any small remaining race. Trait bounds widen to include `Nats + TlsRouter` (for NatsK8sScore's Route + capability path). Post-install summary lines added: NATS WS public URL when set, and a kubectl pointer to the callout deployment.	2026-05-05 11:44:46 -04:00
Jean-Gabriel Gill-Couture	0bb32401a8	feat(operator,auth): env-only credentials with inline key_json Operator-side: drops the Secret-as-volume mount entirely. The operator pod consumes the entire `[credentials]` TOML block — including the Zitadel JSON keyfile — through one `valueFrom.secretKeyRef` env var (`FLEET_OPERATOR_CREDENTIALS_TOML`). No volume, no mount, no fsGroup, no `0o444` workaround. OKD restricted-v2 SCC compatible. `OperatorCredentials` collapses to a single field: pub credentials_toml: String // JSON keyfile inlined under key_json `SECRET_KEY_ZITADEL_KEYFILE` and `KEYFILE_VOLUME_NAME` constants removed — no longer used. harmony-fleet-auth: `CredentialsSection::ZitadelJwt` gains `key_json: Option<String>`. The factory prefers `key_json` when non-empty, falls back to `key_path` otherwise. Agent (file-based, `key_path` populated) keeps working unchanged. Operator (env-only, `key_json` populated) skips the file read entirely. Tests cover both shapes plus the default-key_path path. Internal refactor: `load_machine_key` now delegates to `parse_machine_key(&str)`, shared with the inline path. fleet_e2e_demo bring-up rewires the credentials TOML it renders to embed the JSON keyfile via `key_json = """..."""` instead of `key_path = "..."`. The `OPERATOR_KEY_MOUNT_PATH` constant is gone along with the now-unused mount logic. 7 callout tests + 19 fleet-auth tests still green.	2026-05-05 11:42:26 -04:00
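The "prefer `key_json` when non-empty, fall back to `key_path`" rule can be sketched as a small selection function. `KeySource` and `select_key_source` are hypothetical names; the real factory in harmony-fleet-auth goes on to parse the machine key, which is not reproduced here.

```rust
/// Hypothetical result type: where the Zitadel machine key should be read from.
#[derive(Debug, PartialEq)]
pub enum KeySource<'a> {
    /// JSON keyfile content inlined in the credentials TOML (operator, env-only).
    Inline(&'a str),
    /// Path to a keyfile on disk (agent, file-based).
    File(&'a str),
}

/// Prefer the inline key when it is present and non-empty; otherwise use the path.
pub fn select_key_source<'a>(key_json: Option<&'a str>, key_path: &'a str) -> KeySource<'a> {
    match key_json {
        Some(json) if !json.is_empty() => KeySource::Inline(json),
        _ => KeySource::File(key_path),
    }
}
```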
Jean-Gabriel Gill-Couture	754cbd2371	feat(callout): env-only Secrets, drop pinned UID for OKD restricted-v2 Replaces the volume-mounted Secret (`/etc/callout/{issuer-nkey-seed, nats-auth-pass}`) with `valueFrom.secretKeyRef` env vars (`ISSUER_NKEY_SEED`, `NATS_AUTH_PASS`). The callout binary's `read_secret` helper already supports both `<NAME>_FILE` and `<NAME>` — it just falls through to env when the `_FILE` variant is absent. Also drops the pod-level `securityContext` block that pinned `runAsUser: 65532, runAsGroup: 65532, fsGroup: 65532`. OKD's restricted-v2 SCC rejects pods that pin UID/GID outside the namespace's allocated range; the SCC will assign appropriate values from that range when the fields are unset. Container-level hardening (runAsNonRoot, no-privilege-escalation, RO root fs, capabilities drop ALL) stays intact. Tests rewritten to assert the new shape: env vars come from Secret key refs, no volumes, no pinned UID/GID/fsGroup. 7 callout tests green.	2026-05-05 11:38:01 -04:00
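The `<NAME>_FILE` then `<NAME>` fall-through that the callout's `read_secret` helper is described as supporting can be sketched like this. The signature is an assumption: `read_secret_with` is a hypothetical variant that takes the environment lookup as a closure so it can be exercised without mutating process state, and trimming trailing whitespace from file content is also assumed.

```rust
use std::{fs, io};

/// Hypothetical sketch: resolve a secret from `<NAME>_FILE` (a path) if set,
/// otherwise from `<NAME>` directly. `lookup` abstracts the environment.
pub fn read_secret_with<F>(name: &str, lookup: F) -> io::Result<String>
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(path) = lookup(&format!("{name}_FILE")) {
        let raw = fs::read_to_string(path)?;
        // Mounted secret files often end with a newline; drop it (assumed behavior).
        return Ok(raw.trim_end().to_string());
    }
    lookup(name).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::NotFound,
            format!("neither {name}_FILE nor {name} is set"),
        )
    })
}
```

In a binary the closure would typically be `|k| std::env::var(k).ok()`, which gives the env-only path used with `valueFrom.secretKeyRef`.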
Jean-Gabriel Gill-Couture	e6d72407b2	feat(nats,okd): NatsK8sScore single-instance + auth_callout + WS Route Extends NatsK8sScore additively (every new field optional, defaults preserve supercluster shape): pub gateway: Option<GatewayConfig> // None = single-instance pub auth_callout: Option<AuthCalloutCfg> // delegate auth to callout pub websocket: Option<WebSocketRouteCfg> // public WS Route + edge TLS Render-side: * `gateway = None` → cluster.enabled=false, replicas=1, gateway block disabled, no `tlsCA`, no service.ports.gateway * `auth_callout = Some` → emits authorization.auth_callout block (using harmony's existing render_auth_callout_block convention) + accounts.<account>.users for the bypass user the callout connects as + accounts.SYS + system_account: SYS. Drops the legacy testUser + default_permissions — the callout is the sole authority. * `websocket = Some` → enables config.websocket.enabled with no_tls (the Route owns TLS termination). Routes: * `gateway` Route stays gated to gateway.is_some(). passthrough on 7222, host = cluster.dns_name. Preserves supercluster behavior. * `websocket` Route is new. Edge-TLS termination on port 8080 (chart's WS listener), Redirect insecure-edge policy, host from WebSocketRouteCfg. cert-manager.io/cluster-issuer annotation drives the Route certificate. OKDRouteScore gains an `annotations: BTreeMap<String, String>` field (default empty) + `with_annotation()` builder so callers can attach the cert-manager annotation without reaching for K8sResourceScore manually. Side-effect: `harmony` lib's default features now include `podman`. The CRD types in `modules::fleet::operator::crd` embed `ReconcileScore` from `modules::podman` unconditionally — without the feature on by default, harmony's lib-only builds fail. Existing explicit `features = ["podman"]` callers are unaffected. 
K8sAnywhereTopology's `Nats::deploy` impl populates the new fields with `gateway = Some(default)` so the capability path keeps the supercluster behavior it had before this commit.	2026-05-05 11:36:12 -04:00
Jean-Gabriel Gill-Couture	69b74d572e	Merge branch 'feat/deploy_fleet_server_side' into feat/iot-walking-skeleton	2026-05-05 10:42:00 -04:00
Jean-Gabriel Gill-Couture	22eed9b533	Merge branch 'feat/iot-walking-skeleton' into feat/deploy_fleet_server_side	2026-05-05 10:32:51 -04:00
johnride	92e86fb832	Merge pull request 'refactor(fleet-operator): replace ScorePayload with ReconcileScore in Deployment CRD [NationTech/Team#186]' (#278) from fix/refactorScorePayload into feat/iot-walking-skeleton Reviewed-on: #278	2026-05-05 14:04:43 +00:00
Sylvain Tremblay	95ccc974f9	refactor(fleet-operator): replace ScorePayload with ReconcileScore in Deployment CRD Some checks failed Run Check Script / check (pull_request) Failing after 2m45s Details Removes the hand-typed ScorePayload struct and its custom schemars schema function. DeploymentSpec.score is now typed as the strongly typed ReconcileScore enum already used by the agent, eliminating duplication and ensuring the CRD schema is derived automatically. - Add JsonSchema derive to PodmanService, PodmanV0Score, ReconcileScore - Enable podman feature on harmony dependency in operator - Re-export ReconcileScore/PodmanV0Score/PodmanService from crd module - Update harmony_apply_deployment and fleet_load_test examples - Remove TODO comment from harmony_apply_deployment Wire format is unchanged (externally tagged {type, data}), so the operator -> NATS KV -> agent path remains fully backward compatible.	2026-05-05 09:59:40 -04:00
johnride	023cd742cd	Merge pull request 'feat/nats-auth-callout-e2e' (#279) from feat/nats-auth-callout-e2e into feat/iot-walking-skeleton Reviewed-on: #279	2026-05-05 13:46:14 +00:00
Sylvain Tremblay	e5caaba1e4	feat: add deploy-apache.sh example script	2026-05-05 08:56:05 -04:00
Sylvain Tremblay	286dc2b055	chore(fleet): auto-source env.sh from run_server_install.sh run_server_install.sh now unconditionally sources examples/fleet_server_install/env.sh after computing REPO_ROOT, so the example's env knobs (KUBECONFIG, RUST_LOG, NO_ZITADEL, ZITADEL_HOST, …) are picked up without the user having to source manually before invoking the script. The script's `${VAR:-default}` block only fills in values env.sh leaves unset. env.sh keeps a (commented-out) KUBECONFIG hint and the new optional Zitadel knobs documented post-source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 08:53:32 -04:00
Jean-Gabriel Gill-Couture	29896bfeab	fix(zitadel,operator): user-grant search endpoint + operator keyfile mode Two bugs uncovered while running the full e2e walk end to end: 1. find_user_grant POSTed to /management/v1/users/<id>/grants/_search which Zitadel rejects with 405 Method Not Allowed (the original author's note in the comment hinted at this). The cache previously masked it: first apply created the grant + cached the id; second apply hit the cache and skipped the broken search. The live-query refactor (`f4d6fb94`) removed the cache short-circuit, surfacing the bug as "Create user grant failed: User grant already exists" on every re-apply. Fix: switch to the collection endpoint /management/v1/users/grants/_search with a userIdQuery filter, matching the Zitadel API that's actually wired up. Now returns the existing grant on re-apply and the create_user_grant fallback is correctly skipped. 2. Operator keyfile mounted as 0o400 owned by root. The operator pod runs as non-root (image USER directive — no fixed runAsUser because we want SCC compatibility). Result: operator boots, tries to load the JSON keyfile from the Secret volume, hits EACCES, fails the credential factory, retries forever. Fix: mode 0o444. World-read inside the pod is fine — single container, no other consumers, the Secret namespace is locked down, and the file never escapes pod-fs. The proper fsGroup-based alternative requires pinning a UID/GID, which conflicts with our SCC-friendly choice of leaving runAsUser unset. Also fixes a stale `git rm` from commit `4194baac` (harmony-fleet-auth extraction) — the agent's local credentials.rs was deleted from disk but never staged. 
Verified end to end: * STACK READY in 2 min on warm cluster * Operator pod: "minted fresh Zitadel access token", "NATS connected", "starting Deployment controller", "watching device-info KV" * 2 Device CRs auto-created with full label set * `kubectl apply -f` of a Deployment CR with targetSelector.matchLabels: { group: group-a } produced: - status.aggregate { matched=1, succeeded=1, failed=0 } - HTTP 200 from nginx on vm-device-00:8080 - connection refused from vm-device-01:8080 (correctly excluded)	2026-05-05 06:55:24 -04:00
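The keyfile-mode bug in this commit follows from the classic POSIX permission check. The sketch below is illustrative only: `can_read` and its parameters are hypothetical names, not code from this branch, and it deliberately ignores root overrides, capabilities, and ACLs.

```rust
/// Classic POSIX read check: exactly one permission class applies to a
/// caller (owner, then group, then other). Hypothetical helper for
/// illustration; ignores root overrides, capabilities, and ACLs.
pub fn can_read(mode: u32, is_owner: bool, in_group: bool) -> bool {
    let read_bit = if is_owner {
        0o400 // owner read
    } else if in_group {
        0o040 // group read
    } else {
        0o004 // other read
    };
    mode & read_bit != 0
}
```

A non-root process that is neither the owner nor in the file's group falls into the "other" class, so `can_read(0o400, false, false)` is false (the EACCES case) while `can_read(0o444, false, false)` is true.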
Jean-Gabriel Gill-Couture	34cfa0423b	docs(podman): FIXME diagnosis for the reconcile-loop bug The agent's periodic reconcile destroys-and-recreates any service whose ContainerSpec has env or volumes, every 30s tick. Root cause: matches_spec returns false unconditionally for those fields because podman's list endpoint doesn't surface them; the original author chose to declare "any spec with state is drifted" as a fail-safe. That fail-safe weaponizes the polling reconciler into a loop. Tags the offending line with a multi-paragraph FIXME explaining the symptom, the root cause, the proposed fix (containers.inspect + structural compare + an integration test), and the demo-time workaround (keep demo specs trivial — the hello-web nginx demo already is). Adds the same gap to ROADMAP/fleet_platform/v0_demo_e2e.md's known-risks section so it's visible at planning time. Out of scope for tonight; in scope for delivery alongside the upcoming health-check support on ContainerSpec.	2026-05-05 01:59:51 -04:00
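A minimal sketch of the structural compare proposed in the FIXME, assuming the desired spec is compared against data from an inspect call rather than the list endpoint. The types below are trimmed-down stand-ins with hypothetical fields, not the real `ContainerSpec`, and the choice to treat env as a subset check is an assumption of this sketch.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down desired spec (only fields the commit mentions).
#[derive(Debug, Clone, Default)]
pub struct ContainerSpec {
    pub image: String,
    pub env: BTreeMap<String, String>,
    /// (host_path, container_path) pairs.
    pub volumes: Vec<(String, String)>,
}

/// Hypothetical view of what a per-container inspect call would return.
#[derive(Debug, Clone, Default)]
pub struct InspectedContainer {
    pub image: String,
    pub env: BTreeMap<String, String>,
    pub volumes: Vec<(String, String)>,
}

/// Returns true when the running container already satisfies the spec,
/// so the reconciler can leave it alone instead of recreating it.
pub fn matches_spec(desired: &ContainerSpec, actual: &InspectedContainer) -> bool {
    if desired.image != actual.image {
        return false;
    }
    // Images usually inject their own variables (PATH and friends), so this
    // sketch only requires every desired pair to be present in the actual env.
    let env_ok = desired
        .env
        .iter()
        .all(|(key, value)| actual.env.get(key) == Some(value));

    // Volume order is not meaningful; compare as sorted lists.
    let mut want = desired.volumes.clone();
    let mut have = actual.volumes.clone();
    want.sort();
    have.sort();

    env_ok && want == have
}
```

With a comparison like this, a spec that carries env or volumes only counts as drifted when the inspected state actually differs, which removes the unconditional `false` that drives the 30s recreate loop.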
Jean-Gabriel Gill-Couture	8a609c5342	feat(operator): NATS auth via shared harmony-fleet-auth + e2e wiring The operator was opening a bare async_nats::connect with no auth, which would fail closed against a callout-protected NATS. Wires it through the same JWT-bearer flow the agent uses, sharing the recently-extracted harmony-fleet-auth crate. Operator side ------------- * main.rs: read FLEET_OPERATOR_CREDENTIALS_TOML (TOML snippet, same shape as the agent's [credentials] block — single CredentialsSection struct, just a different byte source). Empty string bypasses (callout-less dev only, with a loud warning). * chart.rs: ChartOptions gains an optional OperatorCredentials field. When set, build_chart's Deployment mounts a Secret as both envFrom (TOML payload → FLEET_OPERATOR_CREDENTIALS_TOML) and a volume mount for the JSON keyfile at the configured key_path (defaults to /etc/fleet-operator/zitadel-key.json). On-disk helm chart still emits credentials: None — those are environment- specific and out of scope for a redistributable chart. * Public manifest builders (build_service_account, build_cluster_role, build_cluster_role_binding, build_operator_deployment, operator_secret) so the e2e bring-up can apply each resource via K8sResourceScore without re-implementing the manifests. * mod chart now lives in lib.rs so external consumers (the e2e bring-up) can reach into it. E2e bring-up ------------ * Bring-up gains a separate `fleet-operator` machine user with the fleet-admin role grant — distinct from the manual-admin `fleet-ops` user so audit logs can tell automated operator actions apart from human ones. * New steps 8/10 (build + sideload operator image) and 9/10 (apply CRDs + RBAC + Secret + Deployment + wait for Ready). Devices step becomes 10/10. * Reuses harmony_fleet_operator's manifest builders + operator_secret via K8sResourceScore — no duplicated YAML, no shell-out. 
Tests ----- * All existing tests pass (harmony-fleet-auth: 18, harmony-fleet-agent: 7, harmony-fleet-operator: 2). E2e walking-skeleton is exercised by the next phase's clean rerun.	2026-05-05 01:58:14 -04:00
Jean-Gabriel Gill-Couture	84a25dbb07	test(fleet-auth): cover assertion claims, scope, token URL, cache, keyfile Bumps coverage on harmony-fleet-auth from 5 to 18 unit tests. The new tests lock the corners we burned cycles on while debugging the live system: * cache freshness boundary (within-leeway, outside-leeway, no-cache, non-zitadel variant) * assertion claim shape (iss/sub/aud/exp/iat) and the 60-second lifetime constant Zitadel enforces server-side * scope string content (plural-projects-roles + singular-project-id URN + openid base) * token URL strips trailing slashes (the //oauth/v2/token 404 waiting to bite the next operator) * MachineKeyFile JSON parsing under Zitadel's wire shape Refactor: build_assertion now delegates to build_assertion_claims + build_assertion_header (pure, no signing). Lets the claim/header shape be unit-tested without an RSA private-key fixture; the sign-and-decode end-to-end is still covered by the e2e harness. No new deps. wiremock not needed — every meaningful assertion is on pure logic.	2026-05-05 01:50:28 -04:00
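Two of the corners listed above are small enough to sketch as pure functions. The signatures here are assumptions for illustration (the real `build_token_url` and `CachedToken` in harmony-fleet-auth may differ), and the freshness rule shown is one plausible reading of "within leeway".

```rust
/// Joins the issuer and the token path without producing `//oauth/v2/token`
/// when the issuer is configured with a trailing slash.
/// Hypothetical signature; only the behavior is taken from the commit message.
pub fn build_token_url(issuer: &str) -> String {
    format!("{}/oauth/v2/token", issuer.trim_end_matches('/'))
}

/// Hypothetical cached-token shape: expiry as Unix seconds.
pub struct CachedToken {
    pub expires_at: u64,
}

/// A cached token is reused only if it stays valid for longer than the
/// leeway; a missing token is never fresh.
pub fn is_fresh(token: Option<&CachedToken>, now: u64, leeway_secs: u64) -> bool {
    match token {
        Some(t) => now.saturating_add(leeway_secs) < t.expires_at,
        None => false,
    }
}
```

Keeping both helpers free of I/O is what lets them be unit-tested without a mock HTTP server, in line with the "no new deps" note in the commit.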
Jean-Gabriel Gill-Couture	4194baacad	refactor(fleet): extract NATS credential plumbing into harmony-fleet-auth The agent's `credentials.rs` + `CredentialsSection` enum graduate into a workspace crate (`fleet/harmony-fleet-auth/`) so the operator can consume the same code path. Single struct, single factory, single auth-callback wiring. The only thing that varies between consumers is where the `[credentials]` TOML bytes come from — the agent reads them from a config file on disk, the operator (next commit) will read them from an env var. Public surface of the new crate: CredentialsSection — the deserializable CredentialSource / NatsCredential — the runtime objects MachineKeyFile / CachedToken — helper types credential_source_from_config — factory connect_options_with_credentials — async-nats wiring Agent consumes via `pub use harmony_fleet_auth::CredentialsSection` in its own `config.rs` so existing call sites keep working. Existing 5 tests in the new crate + 7 in the agent all green. This commit is structurally a move; behavior unchanged. Operator wiring, additional unit tests, and the JWT-mint refactor (split build_assertion / build_scope / build_token_url for testability) follow in the next commits.	2026-05-05 01:48:42 -04:00
Jean-Gabriel Gill-Couture	612d934ad4	docs(fleet): manual JWT-bearer mint + NATS write recipe Working PyJWT script + nats CLI commands for talking to a callout-protected NATS by hand. Distills what we learned debugging the auth chain: which scope claims matter, why the audience is the project id (not the API app's clientId), how to read OIDC_AUDIENCE off the live callout instead of trusting the cache, and the failure modes — including the PyJWT vs jwt package collision that costs 30 minutes the first time you hit it. Cross-linked from fleet-zitadel-faq.md.	2026-05-05 01:43:36 -04:00
Jean-Gabriel Gill-Couture	f4d6fb9431	fix(zitadel): always live-query Zitadel for IDs instead of trusting cache ZitadelClientConfig was used as both a key store (machine keys — which Zitadel cannot return after creation, so caching is required) AND a lookup cache (project_id, machine_user_ids, user_grants). The latter introduced a silent drift class: - ZitadelSetupScore writes the cache incrementally as it creates each resource. - If Zitadel is reset between runs (Postgres recreated, IDs reissued), the cache still holds the old IDs. - ensure_project / ensure_app / ensure_machine_user / user_grant short-circuited on cache hit and never consulted Zitadel — so downstream Scores got the stale ID. - The legacy `project_id` field was further `is_none`-guarded so it preserved the very first id ever seen, surviving any number of Zitadel resets. Net effect in the wild: the deployed callout's `OIDC_AUDIENCE` silently pointed at a project that no longer existed, while agents kept working only because their TOML config carried the matching stale id. A manual mint script reading `project_id` from the cache would produce tokens that pass signature validation but fail the audience check — exactly the symptom that surfaced this bug. Fix: drop the cache-hit short-circuit in every ensure_* path and always live-query. The cache now only holds machine key material (its only legitimate role) and a record of last-known IDs that get refreshed on every apply. Cost: ~1 extra HTTP per project / app / user / grant per Score apply — these are not hot paths. Also: stop is_none-guarding `config.project_id` so the legacy field tracks live state for older single-project consumers.	2026-05-05 01:11:18 -04:00
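The shape of the fix can be sketched with the Zitadel calls abstracted as closures. Everything named here (`IdCache`, `ensure_project`, the closure parameters) is a hypothetical stand-in rather than the real `ZitadelClientConfig` API; the point is only the control flow: always look up live, create on miss, then refresh the cache without any `is_none` guard.

```rust
/// Hypothetical cache of last-known IDs (the real struct holds more).
pub struct IdCache {
    pub project_id: Option<String>,
}

/// Always consults the live system first. The cache is written after the
/// lookup, never read as a short-circuit, so a reset upstream cannot leave
/// a stale ID behind.
pub fn ensure_project<F, C>(
    cache: &mut IdCache,
    name: &str,
    find_live: F,
    create: C,
) -> Result<String, String>
where
    F: FnOnce(&str) -> Result<Option<String>, String>,
    C: FnOnce(&str) -> Result<String, String>,
{
    let id = match find_live(name)? {
        Some(existing) => existing,
        None => create(name)?,
    };
    // Unconditional refresh: the cache tracks live state on every apply.
    cache.project_id = Some(id.clone());
    Ok(id)
}
```

If the cache holds `old-id` and the live lookup returns `new-id`, the function returns `new-id` and overwrites the cache, which is the behavior the stale-`OIDC_AUDIENCE` symptom called for.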
Sylvain Tremblay	2ab7c102b1	feat(fleet/scripts): default Zitadel on, add NO_ZITADEL opt-out Flip the polarity of the Zitadel knobs in run_server_install.sh: the Score is now installed on every run, and `NO_ZITADEL=1` is the explicit skip. Defaults: ZITADEL_HOST=zitadel.localhost (HTTP ingress auto-selected by the example crate's `.localhost` rule). ZITADEL_VERSION stays optional (empty = inherit the example's clap default). Updates env.sh to document the new polarity (NO_ZITADEL as the opt-out, ZITADEL_HOST/VERSION as overrides on top of the defaults). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:26:39 -04:00
Sylvain Tremblay	4f1d95b99f	feat(fleet): wire optional ZitadelScore into FleetServerScore FleetServerScore gains `pub identity: Option<ZitadelScore>` and a conditional `.interpret()` call after the operator install. Trait bounds widen from `Topology + HelmCommand` to `Topology + HelmCommand + K8sclient + PostgreSQL` to satisfy the ZitadelScore impl — both inner Scores need the wider topology even when identity is None (Rust trait bounds are static). Example crate consequences: - Switched topology from K8sBareTopology to K8sAnywhereTopology (provides PostgreSQL via CNPG). `ensure_ready` now installs cert-manager as a side effect — Zitadel's prod ingress needs it anyway, and it's harmless on k3d. - New CLI flags: --zitadel-host (Option<String>; omitted = no Zitadel), --zitadel-version, --zitadel-insecure. Dev-friendly defaults: hosts ending in .localhost / .test default to external_secure=false. - Outcome details now include the Zitadel URL when identity is set. Auxiliary: - Added env.sh next to the example, mirroring okd_add_node's pattern (KUBECONFIG / RUST_LOG / sqlite secret store paths, with optional ZITADEL_HOST documented). - run_server_install.sh now reads ZITADEL_HOST / ZITADEL_VERSION env and passes them through. Trailing banner conditionally prints the Zitadel `helm uninstall` command alongside the operator one. Out of scope: load-test.sh drives the same example crate and may need a topology audit after this change. Flagged for follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:14:04 -04:00
Sylvain Tremblay	3aae580850	feat(fleet/scripts): --score-only flag for run_server_install.sh Skips cluster create + operator image build + k3d sideload when set — just refreshes the kubeconfig and runs the Score against the already- bootstrapped cluster. Shaves the slow rebuild + sideload off the dev loop when iterating on Score-side code with the operator binary unchanged. Errors out cleanly if --score-only is passed but the cluster is missing (instead of letting cargo trip on a missing kube context). Unknown flags also fail-fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:42:59 -04:00
Sylvain Tremblay	978f3050f7	feat(fleet): use harmony_cli for example output + nicer Score outcomes Switch example_fleet_server_install from a manual `create_interpret(). execute()` + `println!` to `harmony_cli::run`, which wires up the framework's standard logger + reporter — emoji-tagged per-Score progress lines and an end-of-run summary listing each Score's `Outcome.details`. Mirrors the okd_add_node example's pattern. For events to fire on the inner Scores, FleetServerScore now calls `Score::interpret` (not `create_interpret().execute`) on NatsBasicScore + FleetOperatorScore. Same change inside FleetOperatorScore for its inner HelmChartScore. Outcome.details populated: - FleetOperatorScore: image, namespace, release_name, NATS URL. - FleetServerScore: in-cluster NATS URL, kubectl pointer to the operator deployment, kubectl tip for verifying CRDs. Progress logs added inside FleetOperatorScore between the chart- render and helm-install phases (`info!`). FleetOperatorScore fields are now `pub` so callers can read them post-construction (FleetServerScore needs `operator.namespace` for its summary). Builder methods unchanged; both styles coexist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:31:23 -04:00
Sylvain Tremblay	46ac4a572a	feat(fleet): local-run scripts for example_fleet_server_install + glibc fix Two scripts for running the new install Score against a local cluster: - examples/fleet_server_install/run.sh — generic, cwd-independent passthrough around `cargo run -p example_fleet_server_install`. - fleet/scripts/run_server_install.sh — opinionated k3d test harness: creates `fleet-server-test` cluster if absent (with NATS port 4222 mapped through klipper-lb), builds the operator image via build_docker.sh, sideloads it, runs the Score, and leaves the cluster up. Prints teardown + redeploy commands at the end. Header documents the helm-idempotency limitation: a rebuilt image won't redeploy on a second run unless `helm uninstall` is invoked first (HelmChartScore short-circuits on chart_version match). Proper fix is deferred — content-hash chart_version or a force_upgrade flag. Dockerfile glibc pin: builder pinned to `rust:1.94-slim-bookworm`. Unsuffixed `rust:slim` follows Debian's latest stable (trixie = glibc 2.40), so binaries built there fail to start on the `debian:bookworm-slim` runtime (glibc 2.36) with "GLIBC_2.39 not found". Surfaced when running the new scripts end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:02:23 -04:00
Jean-Gabriel Gill-Couture	3069f5b9ae	Merge remote-tracking branch 'origin' into feat/nats-auth-callout-e2e	2026-05-04 15:38:52 -04:00
Sylvain Tremblay	3a3e4a2312	feat(fleet): FleetOperatorScore + FleetServerScore for k8s server-side install Collapses the load-test harness's chart-gen + helm-install dance into first-class Harmony Scores. Customer-facing path: let score = FleetServerScore::new(nats, operator); score.create_interpret().execute(&Inventory::empty(), &topology).await?; FleetOperatorScore renders the operator chart (CRDs + RBAC + ServiceAccount + Deployment) into a tempdir and delegates to HelmChartScore. FleetServerScore composes it with NatsBasicScore via fail-fast `?` chaining; Zitadel + Argo hang off the same chain when their Scores land. Structural change: CRD type definitions and chart-builder moved from fleet/harmony-fleet-operator/src/{crd,chart}.rs into harmony/src/modules/fleet/operator/. Harmony can't depend on the operator crate (cycle), so the score-side code lives in harmony and the operator binary imports the types right back via `harmony::modules::fleet::operator::*`. Considered keeping CRDs in the operator crate with the score either there or in a sibling crate, but putting customer-facing scores in harmony/src/modules/fleet/ matches the existing convention (FleetDeviceSetupScore, ProvisionVmScore) and keeps the CRDs reachable from future harmony scores (e.g. an inventory aggregator reading Device CRs) without dragging in the operator binary. The operator's `chart` subcommand stays as a developer convenience (routes through harmony::modules::fleet::operator::build_chart) so `cargo run -p harmony-fleet-operator -- chart` still produces an identical chart on disk for inspection. Existing examples (fleet_load_test, harmony_apply_deployment) updated to import CRD types from harmony directly. load-test.sh phase 3c collapses to a single `cargo run -p example_fleet_server_install` invocation; phase 2b's NATS install still runs separately so the host-side NATS reachability probe sits where it always did. 
Idempotency: re-running short-circuits via HelmChartScore::find_installed_release on both inner installs. Verified: cargo fmt --check, cargo clippy, cargo test all pass; the 4 fleet operator unit tests (2 migrated from operator crate, 2 new on FleetOperatorScore defaults/builders) pass under `cargo test -p harmony`; operator chart subcommand produces an identical chart structure post-refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 15:06:28 -04:00
Sylvain Tremblay	e5d2332421	chore: cargo fmt drift in fleet/setup_score + linux/ansible_configurator Pure rustfmt wrapping on long lines that pre-dated this branch — surfaced when running `cargo fmt --check` as part of unrelated work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 15:06:02 -04:00
Sylvain Tremblay	abd1b29717	ci(fleet): add Gitea workflow to build + push operator image Mirrors .gitea/workflows/harmony_composer.yaml: on push to master (or manual dispatch), build the multi-stage Dockerfile and push hub.nationtech.io/harmony/harmony-fleet-operator:latest. No buildx caching yet — TODO comment in the workflow tracks it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 13:52:17 -04:00
Sylvain Tremblay	b17ed1f6a0	feat(fleet): multi-stage Dockerfile + Harbor push script for operator The operator Dockerfile previously copied a host-built binary into archlinux:base — archlinux was a glibc-ABI workaround for that host-build path. Convert to a two-stage build (rust:1.94-slim → debian:bookworm-slim) so cargo runs inside the image. load-test.sh loses its host cargo build + staging-context trick and now points podman at the workspace root with -f. Add build_docker.sh as the local Harbor entry point (DOCKER_TAG, PUSH overrides). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 13:48:21 -04:00
stremblay	ebd199b22e	Merge pull request 'feat/prepare-rpi' (#280) from feat/prepare-rpi into feat/iot-walking-skeleton Reviewed-on: #280	2026-05-04 17:28:43 +00:00
Jean-Gabriel Gill-Couture	c6284c09bc	feat(fleet-agent): emit state pulse on direct device-state.<id> subject The agent's data plane was JetStream-KV-only, so live observers that don't want to consume the JS stream had no signal to subscribe to. The walking-skeleton e2e admin test was failing as a result — admin subscribes to `device-state.>` (the per-device direct subject) and saw nothing in 30s. This commit adds a small core-NATS publish on `device-state.<id>` alongside the existing KV writes: - `FleetPublisher::publish_state_pulse()` emits a tiny `{device_id, kind: "heartbeat", at}` payload on `device-state.<device_id>`, called from the heartbeat loop so observers see traffic on the same 30s cadence as the KV heartbeat write — but on a non-JetStream subject anyone can sub to. - `write_deployment_state()` now fans out the same payload it puts in the KV bucket on the direct subject, so live admin tooling picks up reconcile transitions immediately without watching the KV stream. Also threads `device_id_prefix_strip = "device-"` through the fleet_e2e_demo bring-up. The bring-up has its own NatsAuthCalloutScore construction (parallel to fleet_auth_callout's `bring_up_stack`), and was missing the prefix-strip line, so the deployed callout was interpolating permissions against `device-vm-device-00` instead of the bare device id the agent uses. Locks the regression with a unit test (`device_id_prefix_strip_lands_as_env_value`) on the deployment manifest builder. Verified end-to-end in the VM rehearsal: test both_devices_heartbeat_within_60s ... ok test admin_jwt_reads_any_device_subject ... ok	2026-05-04 09:36:26 -04:00
Jean-Gabriel Gill-Couture	54308fd7a4	chore: formatting	2026-05-04 09:03:35 -04:00
Jean-Gabriel Gill-Couture	d4fd4859ec	fix(callout): align device permissions with KV key formats and machine-user prefix Two bugs surfaced when the agent went live against NATS JetStream KV in the VM-based e2e rehearsal: 1. The default `device` role only allowed flat `device-state.<id>` / `device-commands.<id>` subjects. The agent's actual data plane is JetStream KV, which puts every operation on `$KV.<bucket>.<key>` subjects with control-plane traffic on `$JS.API.>` and `$JS.ACK.>`. With the old role config, the very first KV publish died with `Permissions Violation for Publish to "$JS.API.INFO"`. The role now allows `$JS.API.>` + `$JS.ACK.>` plus the four per-device data subjects derived from harmony_reconciler_contracts::kv (info.<id>, state.<id>.<dep>, heartbeat.<id>, desired-state.<id>.<dep>). The legacy direct `device-state.<id>` / `device-commands.<id>` subjects are kept so non-JetStream callers of NatsAuthCalloutScore still work. A new unit test (`device_role_covers_reconciler_contract_kv_subjects`) imports the contract crate as a dev-dep and asserts each contract- produced subject is matched, plus that cross-device subjects are not matched. This locks the role config to the contract surface so future renames break the test before they break prod. 2. Zitadel's `client_id` claim for a machine user equals the userName verbatim. Both `fleet_rpi_setup` and `fleet_e2e_demo` create the user as `device-{device_id}`, so the JWT carries `device-vm-device-00` while the agent's KV keys use the bare `vm-device-00`. The callout was interpolating the prefixed string into permissions, producing rules that never matched what the agent actually publishes. Adds `device_id_prefix_strip` (env: `DEVICE_ID_PREFIX_STRIP`, defaults empty so existing deployments are unaffected). When set, the validator strips the prefix from the extracted claim before permission interpolation. 
The fleet_auth_callout example wires it to `device-` so the e2e harness stays end-to-end correct without reaching into either naming convention. Verified end-to-end: both VM agents now publish DeviceInfo / heartbeat through JetStream KV with no permission errors and zero service restarts since the rollout.	2026-05-03 17:49:48 -04:00
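Both bugs in this commit come down to string interpolation meeting NATS subject matching, which the following sketch makes concrete. The function names are hypothetical, the matcher assumes well-formed patterns (`>` only as the last token), and it is not the callout's actual validator.

```rust
/// Strips the configured prefix from the claim, if present. An empty prefix
/// leaves the claim untouched, mirroring the "defaults empty" behavior.
pub fn strip_device_prefix<'a>(claim: &'a str, prefix: &str) -> &'a str {
    if prefix.is_empty() {
        return claim;
    }
    claim.strip_prefix(prefix).unwrap_or(claim)
}

/// NATS-style subject match: tokens are dot-separated, `*` matches exactly
/// one token, `>` matches one or more trailing tokens.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pattern_tokens = pattern.split('.');
    let mut subject_tokens = subject.split('.');
    loop {
        match (pattern_tokens.next(), subject_tokens.next()) {
            (Some(">"), Some(_)) => return true,
            (Some("*"), Some(_)) => continue,
            (Some(p), Some(s)) if p == s => continue,
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

Interpolating the raw claim `device-vm-device-00` yields the rule `device-state.device-vm-device-00`, which never matches the agent's `device-state.vm-device-00`; after `strip_device_prefix(claim, "device-")` the rule and the published subject line up, and `$JS.API.>` covers `$JS.API.INFO`.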
Jean-Gabriel Gill-Couture	050d4697d2	chore: cargo fmt setup_score.rs	2026-05-03 17:49:22 -04:00
Jean-Gabriel Gill-Couture	7dd5f1504f	chore: cargo fmt sweep across modified files No behavior changes; only re-flowing existing expressions.	2026-05-03 17:49:15 -04:00
Jean-Gabriel Gill-Couture	6607fe7494	fix(e2e-demo): point agent_binary default at the real cargo target name The cargo bin target is `harmony-fleet-agent`, not `fleet-agent` — the latter never existed under target/release. Smoke-a4 happened to work because callers passed --agent-binary explicitly; the harness defaults didn't.	2026-05-03 17:49:09 -04:00
Jean-Gabriel Gill-Couture	a4b9e7ac9f	fix(fleet-agent): request projects:roles scope so role claim is emitted Zitadel only includes the project-roles block in an access token when the JWT-bearer request asks for it via the `urn:zitadel:iam:org:projects:roles` scope (PLURAL "projects"). Without it the agent's token has a valid signature/audience but no roles, so the NATS auth callout rejects with "no authorized role in token" even though the machine user has a "device" grant. Discovered while running the VM-based e2e rehearsal: agents could mint a token, connect to NATS, then immediately fail authorization. The plural-projects vs. singular-project distinction is a Zitadel convention; both scopes are required, and the comment now spells out what each one does.	2026-05-03 17:49:04 -04:00
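A sketch of the scope string this fix describes. The plural roles scope is quoted from the commit; the singular project-audience scope is written here in the `urn:zitadel:iam:org:project:id:<id>:aud` form from Zitadel's reserved scopes, which is an assumption about what the agent sends. `build_scope` mirrors a name used elsewhere in this branch, but this signature is hypothetical.

```rust
/// Builds the space-separated scope for the JWT-bearer token request.
/// - `openid`: base OIDC scope.
/// - `urn:zitadel:iam:org:projects:roles` (PLURAL): asks Zitadel to emit
///   the project-roles block in the access token.
/// - `urn:zitadel:iam:org:project:id:<id>:aud` (SINGULAR): adds the project
///   to the token audience. Assumed form, see the lead-in.
pub fn build_scope(project_id: &str) -> String {
    format!(
        "openid urn:zitadel:iam:org:projects:roles urn:zitadel:iam:org:project:id:{project_id}:aud"
    )
}
```

Dropping the plural scope reproduces the symptom in the commit: the token still validates on signature and audience, but carries no roles for the callout to authorize.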
Jean-Gabriel Gill-Couture	49f9834eb2	feat(e2e-demo): apply FleetDeviceSetupScore over SSH per VM Wires the previously-built FleetDeviceSetupScore through to a LinuxHostTopology against each pre-provisioned VM. Mirrors the fleet_rpi_setup pattern but synthesizes inline so the harness drives N VMs in sequence without re-deriving the CLI plumbing. Each VM gets: - An /etc/hosts entry mapping `sso.fleet.local` → libvirt host IP via the new HostsEntry support, so the in-VM agent's HTTP client to Zitadel can resolve the issuer. - The per-device Zitadel machine key dropped at /etc/fleet-agent/zitadel-key.json. - Agent TOML with `type = "zitadel-jwt"` pointing at the keyfile. - Agent service started under systemd. SSH user assumed `fleet-admin` (matches what fleet_vm_setup + smoke-a4 cloud-init create). Private key from the harmony fleet keypair (ensure_fleet_ssh_keypair). After this commit, `cargo run -p example-fleet-e2e-demo` is the single command that turns a fresh k3d + 2 booted VMs into a fully-converged stack: Zitadel + NATS callout + 2 agents speaking JWT-bearer to NATS. Tomorrow's morning: prove it actually does that on a clean machine.	2026-05-03 17:08:52 -04:00
Jean-Gabriel Gill-Couture	1d453dd9aa	feat(e2e-demo): VM-based rehearsal harness + /etc/hosts injection Adds `examples/fleet_e2e_demo/` — composes fleet_auth_callout's existing pieces (Zitadel + auth callout deploy) with per-device machine-user provisioning (one ZitadelSetupScore call per VM) and FleetDeviceSetupScore using FleetDeviceAuth::ZitadelJwt. The harness expects pre-provisioned libvirt VMs (one per device) reachable via `FLEET_E2E_VM_<i>_IP` env vars; full VM provisioning via ProvisionVmScore is a follow-up — keeping the harness observable in pieces during the cold-start debugging tomorrow. Constituent helpers in `fleet_auth_callout::lib.rs` flipped from private to `pub` (deploy_zitadel, wait_for_zitadel_ready, ensure_issuer_seed, build_and_load_callout_image, etc.) so the new harness composes them rather than re-implementing. `bring_up_full_stack`: 1. Ensure k3d cluster (re-uses fleet_auth_callout's create_k3d). 2. Deploy Zitadel + Postgres. 3. CoreDNS rewrite + wait for Zitadel HTTP + wait for the chart-provisioned `iam-admin-pat` secret. (Last step is new and load-bearing — without it ZitadelSetupScore races the chart's setup job and fails on first cold-run.) 4. ZitadelSetupScore for project + API app + roles + admin machine-user (admin gets fleet-admin role grant). 5. Issuer NKey from a persisted secret + NATS deploy with auth_callout block + callout pod. 6. For each device i: per-device ZitadelSetupScore (machine-user with `device` role grant), pull the JSON keyfile from cache, render the agent's TOML with the keyfile path. (FleetDeviceSetupScore invocation is wired structurally; the SSH-and-apply step is gated behind the VM provisioning follow-up.) `HostsEntry` + `merge_hosts_file` added to FleetDeviceSetupScore so VMs on a libvirt NAT can resolve `sso.fleet.local` to the host gateway. Managed-block markers in /etc/hosts make the merge idempotent across re-runs and removable when entries are dropped from the score. 
Four new unit tests cover the merge invariants (insert, replace, strip, byte-stable). Tests skeleton in `tests/e2e_walking_skeleton.rs`: - `both_devices_heartbeat_within_60s` — implemented; reads from device-info KV via admin token. - `admin_jwt_reads_any_device_subject` — implemented; subscribes to `device-state.>` as admin. - `cross_device_isolation_enforced_in_vm` — `#[ignore]` pending per-device-key plumbing through E2eHandles. - `agent_recovers_from_nats_pod_restart` — `#[ignore]` pending the NATS-pod-restart driver. The two `#[ignore]`d tests cover the load-bearing reconnect and isolation invariants. Wiring them is the morning-of-rehearsal priority since those are the customer-facing claims. Out of scope of this commit (called out in the roadmap doc): - ProvisionVmScore integration (today operator runs fleet_vm_setup out-of-band). - Operator install via Helm (smoke-a4 runs operator host-side; this harness inherits that pattern). - Full SSH-based agent install via FleetDeviceSetupScore — Score built, invocation gated.	2026-05-03 17:07:40 -04:00
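The managed-block merge can be sketched as follows. The marker strings, the struct fields, and the exact signature of `merge_hosts_file` are assumptions for illustration; only the invariants (insert, replace, strip, byte-stable) come from the commit message.

```rust
/// Hypothetical marker lines delimiting the block this tool owns.
const BEGIN: &str = "# BEGIN harmony-managed";
const END: &str = "# END harmony-managed";

/// Hypothetical entry shape.
pub struct HostsEntry {
    pub ip: String,
    pub hostname: String,
}

/// Drops any previous managed block, keeps every other line as-is, and
/// appends a fresh block when there are entries to write.
pub fn merge_hosts_file(existing: &str, entries: &[HostsEntry]) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_block = false;
    for line in existing.lines() {
        match line.trim() {
            BEGIN => in_block = true,
            END => in_block = false,
            _ if !in_block => out.push(line.to_string()),
            _ => {} // stale managed line: discard
        }
    }
    if !entries.is_empty() {
        out.push(BEGIN.to_string());
        for entry in entries {
            out.push(format!("{} {}", entry.ip, entry.hostname));
        }
        out.push(END.to_string());
    }
    let mut merged = out.join("\n");
    if !merged.is_empty() {
        merged.push('\n');
    }
    merged
}
```

Because the old block is always removed before the new one is written, merging the same entries twice yields identical bytes, and merging with an empty entry list removes the block entirely.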
Jean-Gabriel Gill-Couture	fdcc7040dd	docs(fleet): chapter 6 — VM-based customer demo rehearsal plan Adds ROADMAP/fleet_platform/v0_demo_e2e.md and threads it from v0_1_plan.md. The VM rehearsal extends smoke-a4 (already-green k3d + libvirt VM + agent + apply CR + reconcile loop) with Zitadel + auth callout + agent JWT auth. Two devices + one admin, real cargo tests sharing a OnceCell-bringup. Plan calls out: - The 7 tests, including the load-bearing `agent_recovers_from_nats_pod_restart` (asserts the auto-reconnect + auth-callback re-mint path under realistic disturbance). - Five known risks / debugging traps to expect on first cold-start (iam-admin-pat secret timing, /etc/hosts injection, k3d port collisions, etc.). - Success criteria for the rehearsal day: cold cargo run greens in <20 min, all 7 tests green on a clean machine, the NATS-restart test reliably greens 5 runs in a row. - Anything below the success criteria → reframe the customer call to "architecture walkthrough + local k3d demo + pilot in 1-2 weeks." Avoids burning the relationship to keep a deadline. Once VM rehearsal is green the residual OKD deltas are configuration (Route annotations, image registry, real DNS, cert) — no new code.	2026-05-03 16:59:43 -04:00
Jean-Gabriel Gill-Couture	e3e6d33dc8	fix(fleet_vm_setup): adopt FleetDeviceAuth::TomlShared shape The VM smoke harness still uses shared NATS creds for v0 (no Zitadel JWT path through libvirt — the customer-facing Pi flow has it via fleet_rpi_setup --bootstrap-token). Rewriting the FleetDeviceSetupConfig literal against the new `auth: FleetDeviceAuth` field.	2026-05-03 15:44:18 -04:00
Jean-Gabriel Gill-Couture	4053ac52de	docs(fleet): demo runbook (operator + developer flow, single page) Hands-on walkthrough for the 48-hour customer demo: - Operator: build/push the callout image → fleet-staging-deploy → capture project_id + cli_client_id from the printed panel. - Developer: fleet-sso-login proves Zitadel SSO works end-to-end. - Pi onboarding: extract iam-admin-pat from the staging cluster, cross-compile the agent for aarch64, run fleet-rpi-setup once per device with --bootstrap-token. Each Pi's agent connects to NATS over WSS using the JWT-bearer token minted from its per-device keyfile. - Deploy a container to a labeled subset via example_harmony_apply_deployment with --env / --volume / --restart flags (env + bind mounts + restart policy that work_item #1 added). - Observe the cross-device security model holding via the auth callout's logs. Also captures what's deliberately NOT in the demo (compose auto-translation, UI, Tailscale backdoor, device-join-request flow, OpenBao, K8s OIDC) so the customer call has clean expectation- setting. The runbook is the closing piece of the 48h-demo work plan; sequenced after the eight feat / refactor commits that built the underlying functionality.	2026-05-03 15:43:10 -04:00
Jean-Gabriel Gill-Couture	5396ef8bf2	feat(example): fleet-sso-login — Zitadel device-code CLI login Adds `examples/fleet_sso_login/` — the developer-side CLI that proves the SSO works end-to-end against a deployed staging instance. RFC 8628 device-code flow: - POSTs `/oauth/v2/device_authorization` with the harmony-cli client_id. - Prints `verification_uri_complete` so the developer opens one URL in the browser; Zitadel handles the auth (username/password, MFA, whatever the customer has wired into Zitadel's auth chain). - Polls `/oauth/v2/token` honouring the standard `authorization_pending` / `slow_down` polling protocol. - On success: decodes the access token's claims, prints `Welcome <name> <email>`, persists the session (issuer + client_id + access_token + claims) at $DATA_DIR/harmony/sso-session.json with mode 0600. For the demo this proves the SSO chain end-to-end. The actual `harmony fleet apply` operation (which would consume the persisted token through a fleet-platform API gateway) is post-demo — clusters typically don't accept Zitadel JWTs as kube-apiserver bearer tokens without an OIDC integration the customer would have to opt into. `fleet_staging_deploy` now also provisions a `harmony-cli` Device Code OIDC application alongside the existing API app, captures its client_id from the ZitadelClientConfig cache, and prints both the client_id and the exact `cargo run -p example-fleet-sso-login ...` invocation in the operator's "next steps" panel.	2026-05-03 15:41:54 -04:00
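The polling protocol referenced above can be reduced to a small pure decision function. This is a sketch of the RFC 8628 rules (keep the interval on `authorization_pending`, add 5 seconds on `slow_down`, stop on anything else); `PollAction` and `next_poll` are hypothetical names, not the example crate's API.

```rust
use std::time::Duration;

/// What the client should do after an error response from the token endpoint.
#[derive(Debug, PartialEq, Eq)]
pub enum PollAction {
    /// Sleep for this long, then poll again.
    Wait(Duration),
    /// Terminal error (for example `access_denied` or `expired_token`).
    Fail(String),
}

/// Maps the `error` field of a token response to the next step.
pub fn next_poll(error: &str, interval: Duration) -> PollAction {
    match error {
        // User has not finished in the browser yet: same cadence.
        "authorization_pending" => PollAction::Wait(interval),
        // Server asks for backoff: RFC 8628 says add 5 seconds.
        "slow_down" => PollAction::Wait(interval + Duration::from_secs(5)),
        other => PollAction::Fail(other.to_string()),
    }
}
```

The caller would carry the returned duration forward as the new interval, so repeated `slow_down` responses keep widening the gap between polls.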
Jean-Gabriel Gill-Couture	8d8e700786	feat(example): fleet-staging-deploy — operator-side OKD bringup Adds `examples/fleet_staging_deploy/` — the operator-side, run-once-per-customer harness that brings up the fleet platform's central services on a real OKD/K8s cluster. Complements the existing `fleet_auth_callout` (k3d local-dev harness, kept unchanged) and `fleet_rpi_setup` (per-device onboarding). `FleetDomainConfig` is the single source of truth for hostnames: base_domain = "customer1.nationtech.io" → zitadel.<base> (Zitadel HTTPS via OKD HAProxy edge-TLS) → nats.<base> (NATS WSS through the same ingress). Nothing is hardcoded; the operator supplies one --base-domain flag and the deploy is fully parameterized. Re-running is idempotent (rides the helm-upgrade-by-default + ZitadelSetupScore search-then-create + persisted issuer-NKey-secret idempotency layers). NATS values render under config.merge.{auth_callout, accounts, system_account}, with WSS via `websocket: { enabled, port: 8443, ingress: { className: openshift-default, ... } }` and the OKD-flavored HAProxy edge-TLS annotations: route.openshift.io/termination: edge haproxy.router.openshift.io/timeout: "1h" (Switch to `reencrypt` when the customer wants pod-to-edge TLS; gateway-api migration is on their roadmap, separate from the demo.) bring_up_staging(): - Deploys ZitadelScore (external_secure: true, no external_port → 443). - Waits for HTTPS .well-known. - Provisions the project + API app + roles via ZitadelSetupScore hitting Zitadel through the public ingress (port 443, TLS verified). No machine users provisioned — fleet_rpi_setup mints them on demand per device, so the staging deploy stays device-count-agnostic. - Persists / reads the issuer NKey seed in the `callout-issuer-seed` K8s secret (so re-runs don't invalidate user JWTs already in flight on customer Pis). - Deploys NATS via NatsHelmChartScore with the WSS values. 
- Deploys NatsAuthCalloutScore (oidc_audience = project_id; external_secure path means no danger_accept_invalid_certs). main.rs ends by printing the exact `cargo run -p example-fleet-rpi-setup ...` invocation the operator runs against a Pi, with the project_id and zitadel/nats URLs filled in. Three unit tests cover the domain config + NATS values rendering (WSS + edge-TLS annotations + auth_callout under merge).	2026-05-03 15:38:56 -04:00
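The hostname derivation described for `FleetDomainConfig` in 8d8e700786 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: only `base_domain` and the `zitadel.<base>` / `nats.<base>` hostnames come from the message; the method names, the dot trimming, and the `https://` / `wss://` URL shapes are assumptions.

```rust
/// Hypothetical stand-in for `FleetDomainConfig`. Only the `base_domain`
/// input and the two derived hostnames are taken from the commit message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetDomainConfig {
    base_domain: String,
}

impl FleetDomainConfig {
    pub fn new(base_domain: &str) -> Self {
        // Assumption: tolerate stray leading/trailing dots in the flag value.
        Self {
            base_domain: base_domain.trim_matches('.').to_string(),
        }
    }

    /// zitadel.<base>
    pub fn zitadel_host(&self) -> String {
        format!("zitadel.{}", self.base_domain)
    }

    /// nats.<base>
    pub fn nats_host(&self) -> String {
        format!("nats.{}", self.base_domain)
    }

    /// Assumed URL shape: HTTPS on the default port (443) through the ingress.
    pub fn zitadel_url(&self) -> String {
        format!("https://{}", self.zitadel_host())
    }

    /// Assumed URL shape: WebSocket-over-TLS through the same ingress.
    pub fn nats_wss_url(&self) -> String {
        format!("wss://{}", self.nats_host())
    }
}
```

Deriving every hostname from one value is what lets a single `--base-domain` flag parameterize the whole deploy.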
Jean-Gabriel Gill-Couture	ab98cbabf9	feat(fleet): per-device Zitadel bootstrap in fleet_rpi_setup The Pi onboarding flow can now mint a per-device Zitadel machine user on the operator's machine and ship the resulting JWT key to the Pi — the agent then authenticates to NATS via JWT-bearer instead of shared nats_user/nats_pass. `FleetDeviceSetupConfig.auth: FleetDeviceAuth` replaces the previous flat `nats_user` / `nats_pass` fields. Two variants: - TomlShared { nats_user, nats_pass } — legacy / dev fallback. - ZitadelJwt { machine_key_json, oidc_issuer_url, audience, ... } — per-device JWT-bearer. The Score: * Drops `machine_key_json` to /etc/fleet-agent/zitadel-key.json (mode 0640, owner fleet-agent — matches the agent's secret-mount conventions). * Renders [credentials] type = "zitadel-jwt" pointing at that keyfile + the issuer + audience the agent's CredentialSource needs. A change to either the keyfile content or the TOML triggers an agent restart, same as binary / unit drift. `fleet_rpi_setup --bootstrap-token <PAT>` activates the Zitadel path. The bootstrap PAT is held in the CLI's memory only; it never lands on the Pi. New flags: --zitadel-issuer-url, --zitadel-project-id, --zitadel-device-role (default `device`), --danger-accept-invalid-certs. `zitadel_bootstrap` is a slim ManagementAPI client that, idempotently per device: 1. Find-or-create machine user `device-${device_id}`. 2. Find-or-skip a project role grant (defaults to `device`). 3. Always mint a fresh JSON key and return its content. (Zitadel doesn't expose the private half of an existing key, so reusing isn't possible — stale keys remain valid until expiry, which is fine because each setup run overwrites the on-device keyfile.) Three new render_toml tests cover the zitadel-jwt path; eleven existing agent tests still pass. Out of scope, tracked: device-join-request + admin-approve flow that would replace bootstrap-PAT entirely (closer to the OKD node-approval pattern). 
Long-lived admin PAT is acceptable for the demo per product call.	2026-05-03 15:22:13 -04:00
Jean-Gabriel Gill-Couture	b4d3d7d02c	fix(linux): SshCredentials default_ubuntu_aws missing sudo_password The merge of feat/prepare-rpi added a `sudo_password: Option<String>` field to SshCredentials but the `default_ubuntu_aws` constructor on the destination branch was authored before that field existed. Add the missing field as `None` (matches the prepare-rpi semantics: passwordless sudo expected unless explicitly configured).	2026-05-03 15:17:03 -04:00
Jean-Gabriel Gill-Couture	c785f13abd	merge: feat/prepare-rpi (Pi onboarding harness + linux-host capabilities)	2026-05-03 15:15:22 -04:00
Jean-Gabriel Gill-Couture	74ee7fc9f2	feat(agent): Zitadel JWT credential source + auto-reconnect The fleet agent's NATS connection is the load-bearing piece of the "never lose connectivity to a device" guarantee. This commit makes that hold even when Zitadel access tokens expire across NATS pod restarts and network partitions. New `[credentials]` config variants (externally-tagged): type = "toml-shared" { nats_user, nats_pass } # v0/dev type = "zitadel-jwt" { key_path, oidc_issuer_url, audience, ... } A `CredentialSource` enum dispatches per variant: - TomlShared returns the same user/pass each call. - ZitadelJwt mints an access token from Zitadel via the JWT-bearer flow (RFC 7523). The keyfile at `key_path` is the only durable secret on the device; the bearer token is short-lived and refreshed in-memory when the cached value is within 5 minutes of expiry. Two concurrent refreshes are race-safe — the second writer's mint is wasted but produces a correct token. The agent's `connect_nats` is rewritten on top of async-nats's `with_auth_callback`, which is invoked on every (re)connect attempt: - async-nats reconnects automatically on disconnect (default behaviour of ConnectOptions) — we don't need a watchdog. - Each reconnect attempt invokes the callback, which calls `next_credential()`. If the cached token is expired, a fresh one is minted before the reconnect proceeds. So a Pi that loses NATS while its token has just expired will pick up a brand-new token on the next reconnect attempt with no operator intervention. - An `event_callback` surfaces Connected / Disconnected / SlowConsumer / ServerError events into tracing — operators can see exactly when reconnects happen, which is non-negotiable for an out-of-warranty device fleet. A subtle constraint drove the trait shape: async-nats's `with_auth_callback` requires the returned future to be `Send + Sync`, which `#[async_trait]`'s erased `Pin<Box<dyn Future + Send>>` does not satisfy. 
The credential source is therefore an enum (concrete dispatch) rather than `dyn CredentialSource`. Two variants is small enough that enum dispatch beats trait-object plumbing. Out of scope, tracked for follow-up: a separate daemon for SSH access to the Pi via Tailscale/Headscale ("secure backdoor"), and the device-join-request + admin-approve flow that would replace the current admin-PAT bootstrap pattern.	2026-05-03 15:15:01 -04:00
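The enum-dispatch credential source and the 5-minute refresh window from 74ee7fc9f2 could look something like the sketch below. It is synchronous and std-only for illustration: the real agent mints tokens from Zitadel asynchronously inside async-nats's auth callback, whereas here the mint step is an injected closure. `CachedToken`, `Credential`, and the `next_credential` signature are hypothetical shapes; only the variant names and the 5-minute margin come from the message.

```rust
use std::time::{Duration, SystemTime};

/// Refresh when the cached token is within 5 minutes of expiry (from the commit).
const REFRESH_MARGIN: Duration = Duration::from_secs(5 * 60);

/// Hypothetical cached bearer token.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedToken {
    pub value: String,
    pub expires_at: SystemTime,
}

impl CachedToken {
    fn needs_refresh(&self, now: SystemTime) -> bool {
        now + REFRESH_MARGIN >= self.expires_at
    }
}

/// Hypothetical credential handed to the NATS connection.
#[derive(Debug, Clone, PartialEq)]
pub enum Credential {
    UserPass { user: String, pass: String },
    Bearer(String),
}

/// Concrete enum dispatch instead of `dyn CredentialSource`.
pub enum CredentialSource {
    TomlShared { nats_user: String, nats_pass: String },
    ZitadelJwt { cached: Option<CachedToken> },
}

impl CredentialSource {
    /// Called on every (re)connect attempt. `mint` stands in for the
    /// JWT-bearer exchange and only runs when the cache is empty or stale.
    pub fn next_credential(
        &mut self,
        now: SystemTime,
        mint: impl FnOnce(SystemTime) -> CachedToken,
    ) -> Credential {
        match self {
            CredentialSource::TomlShared { nats_user, nats_pass } => Credential::UserPass {
                user: nats_user.clone(),
                pass: nats_pass.clone(),
            },
            CredentialSource::ZitadelJwt { cached } => {
                let stale = cached.as_ref().map_or(true, |t| t.needs_refresh(now));
                if stale {
                    *cached = Some(mint(now));
                }
                let token = cached.as_ref().expect("token was just populated");
                Credential::Bearer(token.value.clone())
            }
        }
    }
}
```

Because the callback runs on each reconnect, a device whose token expired while it was offline gets a fresh one on the next attempt without operator intervention.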
Jean-Gabriel Gill-Couture	a0a5faa3d0	chore: remove accidentally-committed scratch + agent worktrees The previous commit swept in `.claude/worktrees/*` (ephemeral agent worktree submodules) and a few scratch files that landed at the repo root during prior sessions. None of them are project artifacts. Removing them from the index and adding to .gitignore so future `git add -A` doesn't re-include them. Files on disk are unchanged.	2026-05-03 15:08:19 -04:00
Jean-Gabriel Gill-Couture	6d55892736	feat(podman): env vars + bind-mount volumes + restart policy The IoT walking-skeleton's PodmanV0Score and the underlying ContainerSpec capability were name+image+ports only. Real customer workloads (the demo target's docker-compose for example) need at minimum: - Environment variables for runtime config + secrets injected at deploy time. - Bind-mount volumes so the container can persist data across recreates (sqlite db files, config dirs). - Restart policy so the container survives device reboot or crash. PodmanService and ContainerSpec gain `env: Vec<(String, String)>`, `volumes: Vec<VolumeMount>`, and `restart_policy: RestartPolicy`. All three default to empty / `unless-stopped` via #[serde(default)] so any Deployment CR written before this change still deserializes — that includes the existing smoke harnesses and any field-side state. VolumeMount is bind-only in v0 (host_path -> container_path, optional read_only). Named/anonymous volumes can be added behind the same field later by inspecting host_path's shape; the customer's compose file is expected to use bind mounts only. RestartPolicy mirrors podman/docker convention — `no`, `unless-stopped` (default, matching docker-compose), `on-failure`, `always`. Serialized kebab-case so docker-compose translation is mechanical. PodmanTopology::ensure_service_running now passes env / mounts / restart policy to the podman API. matches_spec conservatively forces recreate whenever the spec carries non-empty env / volumes or a non-default restart policy: the podman list endpoint doesn't surface those fields, so a structural compare isn't possible from ListContainer alone. Recreating an unchanged container is cheap (~hundreds of ms); the alternative (silent stale-config window) isn't acceptable for fleet-managed devices. example_harmony_apply_deployment grows --env, --volume, and --restart flags so an operator can drive the new shape from the CLI when authoring a Deployment CR. 
Tests: - legacy CR JSON without the new fields deserializes (wire-compat). - env ordering survives roundtrip (drift-detection invariant). - restart policy serializes kebab-case (compose-translation contract). - podman_v0_score_roundtrip exercises env + volumes + restart.	2026-05-03 15:08:01 -04:00
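The restart-policy naming and the conservative recreate rule from 6d55892736 can be illustrated with a std-only sketch. The commit uses serde for the kebab-case encoding; the hand-written `as_kebab` below only stands in for that. The struct layout, `as_kebab`, and `forces_recreate` are hypothetical names; the four policy strings, the `unless-stopped` default, and the recreate condition are taken from the message.

```rust
/// Restart policies named in the commit, with `unless-stopped` as the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RestartPolicy {
    No,
    #[default]
    UnlessStopped,
    OnFailure,
    Always,
}

impl RestartPolicy {
    /// Kebab-case wire name (hand-written here; serde does this in the commit).
    pub fn as_kebab(self) -> &'static str {
        match self {
            RestartPolicy::No => "no",
            RestartPolicy::UnlessStopped => "unless-stopped",
            RestartPolicy::OnFailure => "on-failure",
            RestartPolicy::Always => "always",
        }
    }
}

/// Bind-only mount: host_path -> container_path, optionally read-only.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct VolumeMount {
    pub host_path: String,
    pub container_path: String,
    pub read_only: bool,
}

/// Hypothetical, trimmed-down container spec.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct ContainerSpec {
    pub name: String,
    pub image: String,
    pub env: Vec<(String, String)>,
    pub volumes: Vec<VolumeMount>,
    pub restart_policy: RestartPolicy,
}

/// The list endpoint does not surface env, volumes, or restart policy, so any
/// spec that sets one of them is treated as "cannot prove it matches".
pub fn forces_recreate(spec: &ContainerSpec) -> bool {
    !spec.env.is_empty()
        || !spec.volumes.is_empty()
        || spec.restart_policy != RestartPolicy::default()
}
```

A spec that leaves all three fields at their defaults can still be compared structurally, so older Deployment CRs keep their previous no-op behaviour.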
Jean-Gabriel Gill-Couture	6c45fb22ba	feat(nats-callout): production callout + harmony module + e2e demo harmony-nats-callout becomes a deployable service, not just a library: - New [[bin]] target with env+secret-file driven config and SIGINT/SIGTERM-aware shutdown. - Dockerfile (single-stage archlinux:base, non-root, matches harmony-fleet-operator convention). - Refactored handler into a pure `decide()` function so the entire authorization decision tree is unit-testable without async-nats. - New `roles` module with role resolution + a `validate_device_id` security gate that rejects NATS subject metacharacters in device_id (.>* whitespace) — closes a real escalation path through the `{device_id}` placeholder in the per-device permissions block. - Configurable role claim path + admin/device role names; admin wins when both are present (privilege-escalation invariant). 57 unit tests cover every reachable branch of the security decision tree; 4 e2e tests in nats/integration-test-callout exercise real NATS in podman with: device pubsub on own subjects, cross-device subject isolation, admin-can-read-anything, and JWT-without-role rejection. harmony/src/modules/nats_auth_callout/: - New `NatsAuthCalloutScore` deploys the callout as a K8s Deployment + Secret. fsGroup + 0o440 secret mode so the non-root container can read its mounted seed/password without leaving them in env vars. - `render_auth_callout_block` helper produces the YAML for NATS Helm `config.merge.authorization.auth_callout` so both halves stay in sync. examples/fleet_auth_callout/: - `bring_up_stack()` orchestrates k3d -> Zitadel + Postgres -> CoreDNS rewrite -> project + roles + machine users with JWT keys -> NATS Helm with auth_callout block -> callout image build + sideload -> NatsAuthCalloutScore deploy. Idempotent across re-runs (issuer NKey persisted in a K8s secret so user JWTs survive restarts). - `mint_access_token()` RFC 7523 JWT-bearer client. 
Uses Host header with port so Zitadel emits a matching issuer. - main.rs prints URLs/creds/keyIds and waits for Ctrl-C. - Three #[tokio::test] functions sharing one cluster via OnceCell: admin_can_read_any_device_subject, device_can_only_access_own_subjects, unknown_role_is_rejected. All green on real k3d.	2026-05-03 15:01:44 -04:00
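The `validate_device_id` gate described in 6c45fb22ba rejects NATS subject metacharacters before the id is substituted into a permissions template. A minimal sketch follows, assuming a `Result`-returning signature, an extra empty-id check, and a hypothetical `render_subject` helper; only the rejected character set (`.`, `>`, `*`, whitespace) and the `{device_id}` placeholder come from the message.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceIdError {
    Empty,
    ForbiddenChar(char),
}

/// Reject ids containing NATS subject metacharacters: `.` (token separator),
/// `*` and `>` (wildcards), or any whitespace. The empty check is an assumption.
pub fn validate_device_id(id: &str) -> Result<&str, DeviceIdError> {
    if id.is_empty() {
        return Err(DeviceIdError::Empty);
    }
    match id
        .chars()
        .find(|&c| matches!(c, '.' | '>' | '*') || c.is_whitespace())
    {
        Some(bad) => Err(DeviceIdError::ForbiddenChar(bad)),
        None => Ok(id),
    }
}

/// Substitute a validated id into a subject template (hypothetical helper).
pub fn render_subject(template: &str, device_id: &str) -> Result<String, DeviceIdError> {
    let id = validate_device_id(device_id)?;
    Ok(template.replace("{device_id}", id))
}
```

Without the gate, a device id such as `*` or `a.>` would widen the rendered subject beyond the device's own namespace, which is the escalation path the commit closes.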
Jean-Gabriel Gill-Couture	b8bc2217fd	feat(zitadel): ExternalPort + machine-user/role/key/grant provisioning ZitadelScore: - Auto-provisions an `iam-admin-pat` Kubernetes secret via the chart's FirstInstance.Org.Machine.Pat block. ZitadelSetupScore depended on this secret existing; without the chart values, the prior code path was non-functional. - New `external_port: Option<u32>` field. Controls Zitadel's emitted issuer URL when the host port mapping isn't 80/443 (k3d typically maps 8080:80). Without it, JWT-bearer audience validation 500s with `Errors.Internal` because the assertion's `aud` doesn't match the chart-default issuer at port 80. ZitadelSetupScore is extended for the JWT-bearer flow needed by the NATS auth callout: - API apps (resource servers — required for project-id audience scope) - Project roles (`POST .../projects/{id}/roles`, idempotent) - Machine users with KEY_TYPE_JSON keys (provisioned + cached device-side; Zitadel does not expose the key material on subsequent reads, so the local cache is the source of truth) - User grants (project + role keys) Cache (ZitadelClientConfig) gains projects, machine_user_ids, machine_keys, and user_grants — keyed for idempotency across re-runs. Backwards compatible with existing harmony_sso example: the new fields have `#[serde(default)]` and prior callers just need empty vecs. Refresh upgrade-by-default in helm chart (separate commit) lets ExternalPort changes propagate to existing releases on re-run.	2026-05-03 15:01:22 -04:00
Jean-Gabriel Gill-Couture	36974bda32	refactor(helm): upgrade-by-default for unpinned releases Helm releases without a pinned `chart_version` previously short-circuited to a NOOP when already installed, which silently dropped any `values_yaml` / `values_overrides` changes the caller had made. Now we fall through to `helm upgrade --install` whenever: - the release isn't installed (unchanged), or - it's installed and either unpinned or pinned-and-matching. Helm itself becomes the source of truth for "did anything actually change" — no-op upgrades are cheap and changed values get applied automatically without the caller having to opt in via a flag. `install_only=true` keeps the prior skip-if-installed shortcut so bootstrap operators (cert-manager, prometheus-operator, CRDs) that should not be touched on re-runs continue to behave the same. Pinned-version safety net is unchanged: a different version installed than what the score requests is an error, never a silent change.	2026-05-03 15:01:07 -04:00
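The decision table described in 36974bda32 can be written as a small pure function. This is a sketch under assumptions: `HelmPlan` and `plan_release` are hypothetical names, and checking the pinned-version mismatch before the `install_only` shortcut is a guess, since the message does not state which takes precedence.

```rust
/// Hypothetical outcome of the "what should helm do" decision.
#[derive(Debug, PartialEq, Eq)]
pub enum HelmPlan {
    /// Run `helm upgrade --install` and let Helm decide whether anything changed.
    UpgradeInstall,
    /// `install_only` release that already exists: leave it untouched.
    SkipAlreadyInstalled,
    /// Installed version differs from the pinned one: an error, never a silent change.
    VersionMismatch { installed: String, requested: String },
}

pub fn plan_release(
    installed_version: Option<&str>,
    pinned_version: Option<&str>,
    install_only: bool,
) -> HelmPlan {
    let installed = match installed_version {
        // Not installed yet: always install.
        None => return HelmPlan::UpgradeInstall,
        Some(v) => v,
    };
    // Assumption: the pinned-version safety net is evaluated first.
    if let Some(requested) = pinned_version {
        if requested != installed {
            return HelmPlan::VersionMismatch {
                installed: installed.to_string(),
                requested: requested.to_string(),
            };
        }
    }
    if install_only {
        return HelmPlan::SkipAlreadyInstalled;
    }
    // Installed and either unpinned or pinned-and-matching.
    HelmPlan::UpgradeInstall
}
```

Keeping the decision separate from the `helm` invocation makes each branch testable without a cluster.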
Sylvain Tremblay	b86f8f11f9	feat: add little script to call the fleet_rpi_setup example	2026-05-01 14:51:03 -04:00
Sylvain Tremblay	34e2f832ec	docs(linux): clarify sudo_password scope, TODO for SSH password auth The new sudo_password field is strictly for privilege escalation on the remote host (sudo -S, ansible become) — not for SSH login. SSH auth is still key-only. Adds a TODO on SshCredentials pointing at where SSH password support would land if/when we want it, and a matching note on the SudoPassword Secret type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:09:38 -04:00
Sylvain Tremblay	be9073e461	feat(examples/fleet_rpi_setup): auto-fetch sudo password via SecretManager Probe `sudo -n true` over SSH before constructing the topology. If the probe succeeds (passwordless sudo, the typical rpi-imager default), proceed silently. If it fails, fetch the password through SecretManager::get_or_prompt::<SudoPassword>() — first run prompts the operator, subsequent runs reuse the cached value (same flow SshKeyPair etc. use). Adds harmony_secret dep, env.sh with the standard HARMONY_SECRET_NAMESPACE / HARMONY_SECRET_STORE / HARMONY_DATABASE_URL / RUST_LOG variables, and a doc snippet at the top of main.rs pointing at it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:57:00 -04:00
Sylvain Tremblay	c2b8403f14	feat(linux): support optional sudo_password on SshCredentials Lets callers populate creds.sudo_password when the bootstrap admin doesn't have passwordless sudo. None = current behavior unchanged. Wire-level injection: - ansible runs: when Some, write to a tempfile::NamedTempFile and pass ANSIBLE_BECOME_PASSWORD_FILE=<path> via Command::env. Path in env, never value in argv. File deletes on drop. - direct ssh_exec sudo paths (ensure_linger, ensure_user_unit_active, fetch_file): new sudo_exec helper that uses `sudo -S` with the password piped via the new ssh_exec stdin parameter, otherwise plain sudo. ensure_user_unit_active's && chain folded into one sudo+sh -c call since `sudo -S` only reads stdin once. ssh_executor.rs: ssh_exec gains an optional stdin: Option<&str>; on Some, writes via channel.data() then channel.eof() so the remote reader doesn't hang. Existing 4 call sites pass None. fleet_vm_setup updated to set sudo_password: None (behavior identical). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:56:50 -04:00
Sylvain Tremblay	414fe1cf7b	feat(secret): add SudoPassword Secret type Sudo password for a Linux bootstrap admin user. Stored under key "SudoPassword" via SecretManager when a host doesn't have passwordless sudo configured. Same shape as the other single-field Secret types in this file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:56:35 -04:00
Sylvain Tremblay	413077bcd0	feat(fleet): pre-flight config diff + confirm in FleetDeviceSetupScore New first step (1/7): read /etc/fleet-agent/config.toml off the device and compare against the rendered desired config. Three branches: - missing → info, first install - matches → warn, converge anyway - differs → warn + unified diff (similar::TextDiff with 2-line context radius, '-/+' marker style) + inquire::Confirm prompt defaulting to N. Aborts with InterpretError if declined. Existing 6 steps renumbered to 2/7-7/7. The diff replaces the previous "dump both full configs" approach which was unreadable even for one-line differences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:11:10 -04:00
Sylvain Tremblay	54d1f0733b	feat(linux): add FileFetcher capability for reading remote files Mirrors FileDelivery in the opposite direction: returns Some(content) or None if the file doesn't exist. AnsibleHostConfigurator implements it via two SSH calls (sudo test -e + sudo cat), routed through sudo to handle root- or service-owned config files. Added to the LinuxHostConfiguration umbrella so any score with that bound gets it. Enables scores to pre-flight-compare desired state against current state before committing to a destructive change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:11:00 -04:00
Sylvain Tremblay	12249a7f88	refactor(fleet): inline agent binary local->remote on one recap line Folds the "-> /usr/local/bin/fleet-agent" continuation into the "Agent binary:" line. Removes the hardcoded-indent fragility (bullet prefix shifts in cli_reporter would have broken alignment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:48:37 -04:00
Sylvain Tremblay	3cbe62c807	fix(cli): surface Outcome.details on NOOP outcomes too cli_reporter only accumulated details for SUCCESS, dropping the recap on idempotent re-runs that legitimately return NOOP with populated details. FleetDeviceSetupScore is the first score to exercise this path; the filter was over-restrictive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:48:32 -04:00
Sylvain Tremblay	96f17f3ca0	refactor(examples/fleet_rpi_setup): adopt harmony_cli::run convention Drop the bespoke framed renderer, failure hint catalog, and custom env_logger setup. Score output now flows through harmony_cli's standard reporter (bullet list under "🚀 All done!"), matching the other examples. cli_logger::init() at the top of main so early logs (ensure_ansible_venv) get the same formatting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:39 -04:00
Sylvain Tremblay	c3b25c8298	feat(fleet): structured progress + recap for FleetDeviceSetupScore Replace the opaque change-log with tagged per-step info traces and a human-readable Outcome.details recap (Device ID / NATS / Labels / User / Agent binary -> remote / Service). User and Service lines carry their own ✅/🔄 state markers; final line is ✅ for noop and 🎉 for runs that applied changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:33 -04:00
Sylvain Tremblay	bf1a438b90	fix(linux): drop redundant envelope from parseable ansible errors When stdout already parses into UNREACHABLE!/FAILED! + msg, the trailing (ansible-exit=..., stderr=..., stdout=...) envelope just duplicated the same text. Strip it when stderr is empty and the verb is recognized; keep it when it adds debug signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:27 -04:00
Sylvain Tremblay	2cb884976a	feat(examples): add fleet_rpi_setup for onboarding a physical Pi Sibling of fleet_vm_setup with the libvirt provisioning step removed: the operator has already booted Pi OS Lite themselves (rpi-imager, preloaded SSH key, passwordless sudo on the admin user), so the example goes straight to applying FleetDeviceSetupScore over SSH. Defaults match the typical rpi-imager flow (--pi-user pi, --ssh-key ~/.ssh/id_ed25519); --ssh-key supports tilde expansion. The harmony dep is pulled in without the kvm feature since no VM is created here. RUST_LOG defaults to info so the score's per-step traces show up without the operator having to set the env var. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:41:05 -04:00
Sylvain Tremblay	97ec4848fd	fix(linux): make ensure_user_unit_active actually report NOOP systemctl --user enable --now is systemd-level idempotent, but the prior implementation always returned ChangeReport::CHANGED. This made every re-run of any score that touches a user-scoped unit (notably FleetDeviceSetupScore's podman.socket step) lie about its change count, defeating the noop detection the rest of the score honors. Probe is-enabled --quiet && is-active --quiet first; only call enable --now (and report CHANGED) when the unit isn't already in the desired state. Mirrors the existing ensure_linger pattern in the same file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:39:11 -04:00
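The probe-before-change pattern from 97ec4848fd can be sketched as below. The command runner is an injected closure so the example stays self-contained; in the commit the commands run over SSH. `ChangeReport` here is a simplified stand-in for the real type, and the function signature and error string are assumptions. The `systemctl --user` subcommands are the ones named in the message.

```rust
/// Simplified stand-in for the change report returned by host operations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeReport {
    Noop,
    Changed,
}

/// Probe first; only enable the unit (and report a change) when the probe fails.
/// `run` executes a shell command and returns true on exit status 0.
pub fn ensure_user_unit_active<F>(unit: &str, mut run: F) -> Result<ChangeReport, String>
where
    F: FnMut(&str) -> bool,
{
    let probe = format!(
        "systemctl --user is-enabled --quiet {unit} && systemctl --user is-active --quiet {unit}"
    );
    if run(&probe) {
        // Already enabled and active: nothing to do.
        return Ok(ChangeReport::Noop);
    }
    let enable = format!("systemctl --user enable --now {unit}");
    if run(&enable) {
        Ok(ChangeReport::Changed)
    } else {
        Err(format!("failed to enable user unit {unit}"))
    }
}
```

Reporting `Noop` when the probe passes is what keeps the change count of an idempotent re-run at zero.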
Jean-Gabriel Gill-Couture	7fa1ca2683	feat: default for ubuntu aws linux topology	2026-05-01 08:53:03 -04:00
Jean-Gabriel Gill-Couture	af67992b6e	refactor: production auth callout service with real integration tests nats-jwt: - Add NkeyPub newtype with prefix validation - Add ClaimType and Algorithm typed enums - Add impl_nats_claims! macro eliminating 4x duplicated impl blocks - Add AuthorizationRequestClaimsBuilder (completing all builder types) - Fix AuthorizationResponseBuilder: add issuer() builder method, stop mutating iss in sign() - Tighten trait bounds: encode<T: Serialize>, decode_unverified<T: DeserializeOwned> - Remove dead error variants Expired/NotYetValid - Add builder tests for all 4 claims types - Deduplicate is_zero helper harmony-nats-callout (rewritten): - AuthCalloutService: production service connecting to NATS, subscribing to .REQ.USER.AUTH, dispatching auth requests - AuthCalloutConfig with builder pattern - handler.rs: pure auth request handler (decode → validate → mint → respond) extracted from test - Fix ZitadelValidator: validate() is now async (was blocking_read deadlock in async contexts) - Remove dead fields kid_map, jwks_uri - Make danger_accept_invalid_certs configurable - permissions: InterpolatedPermissions named struct instead of 4-tuple integration-test-callout: - Converted to lib+test crate: src/lib.rs exports test utilities - Tests now exercise the REAL AuthCalloutService (not inline handler) - Extracted MockOidcServer, NatsServer, CalloutContext into library - Replace yasna with rsa crate for DER parsing - Add Drop to NatsServer for container cleanup - Add module constants for all magic values - README updated with new architecture diagram	2026-04-29 00:45:05 -04:00
Jean-Gabriel Gill-Couture	48ec80ed66	docs: add integration test README with auth flow diagram	2026-04-28 23:21:23 -04:00
Jean-Gabriel Gill-Couture	f848d94808	refactor: remove dead operator-mode code from nats crates - Remove operator-mode files: account_manager, authorizer, service, config, main.rs, plan.md from callout crate - Remove operator/activation claims from nats-jwt (builder and claims) - Inline PermissionsConfig into permissions.rs (config.rs removed) - Remove harmony-nats-callout dep from integration test (unused) - Remove unused imports in algorithm.rs tests - Clean up callout Cargo.toml (remove bin, unused deps)	2026-04-28 23:20:37 -04:00
Jean-Gabriel Gill-Couture	65daa76658	feat: NATS auth callout e2e integration test - nats-jwt crate: JWT builder types for user claims, authorization request/response, account claims, algorithm encode/decode - harmony-nats-callout crate: Zitadel OIDC JWT validator, callout service scaffold, account manager (WIP) - integration-test-callout: end-to-end test validating the full auth callout flow — device connects with Zitadel JWT → callout validates JWT → returns per-device user JWT with scoped permissions → device can pub/sub on its own subjects only - Mock OIDC server for test (JWKS + openid-configuration) - Negative test: device A cannot subscribe to device B's subjects - Added UserClaimsBuilder::audience() for account-scoped user JWTs	2026-04-28 23:15:18 -04:00
Jean-Gabriel Gill-Couture	50debfd163	chore: Some code review comments inlined	2026-04-28 16:41:15 -04:00
johnride	01d2cfa0ba	Merge pull request 'feat/iot-helm' (#275) from feat/iot-helm into feat/iot-walking-skeleton Reviewed-on: #275	2026-04-25 13:52:22 +00:00
johnride	fbe58228f2	Merge pull request 'refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-*' (#276) from feat/iot-rebrand into feat/iot-helm Reviewed-on: #276	2026-04-25 13:48:23 +00:00
Jean-Gabriel Gill-Couture	7c1fedb303	refactor: rebrand iot → fleet, operator/agent crates → harmony-fleet-* The IoT vocabulary was anchoring the codebase to one customer's domain. The reconciler pattern is generic — operator in k8s, NATS KV as desired-state bus, agents reconciling podman / OKD / KVM / anything that can register. "Fleet" captures that neutrally; IoT stays acknowledged in docs as the first customer use case. Done now, while nothing is deployed. After a partner fleet lands, changing the CRD group alone is a multi-quarter migration. Scope (nothing left over): Paths + crates - iot/ → fleet/ - iot/iot-operator-v0 → fleet/harmony-fleet-operator - iot/iot-agent-v0 → fleet/harmony-fleet-agent - harmony/src/modules/iot → harmony/src/modules/fleet - ROADMAP/iot_platform → ROADMAP/fleet_platform - examples/iot_{vm_setup, load_test, nats_install} → examples/fleet_* - -v0 suffix dropped on the operator + agent crates (semver in Cargo.toml already tracks version) Rust identifiers - enum IotScore (podman score payload) → ReconcileScore - struct IotDeviceSetupScore/Config → FleetDeviceSetupScore/Config - InterpretName::IotDeviceSetup → InterpretName::FleetDeviceSetup - HarmonyIotPool → HarmonyFleetPool (libvirt pool) - HARMONY_IOT_POOL_NAME (default "harmony-iot") → HARMONY_FLEET_POOL_NAME ("harmony-fleet") - IotSshKeypair → FleetSshKeypair - ensure_iot_ssh_keypair / ensure_harmony_iot_pool / check_iot_smoke_preflight_for_arch → fleet-prefixed variants Wire / config surfaces - CRD group `iot.nationtech.io` → `fleet.nationtech.io` - Finalizer `iot.nationtech.io/finalizer` → `fleet.nationtech.io/finalizer` - Shortnames iotdep/iotdevice → fleetdep/fleetdev - Env var IOT_AGENT_CONFIG → FLEET_AGENT_CONFIG - Env var IOT_VM_ADMIN_PASSWORD → FLEET_VM_ADMIN_PASSWORD - Binary /usr/local/bin/iot-agent → /usr/local/bin/fleet-agent - Systemd user `iot-agent` → `fleet-agent` - VM admin 
user `iot-admin` → `fleet-admin` Defaults - Namespaces iot-system/iot-demo/iot-load → fleet-system/fleet-demo/fleet-load - Helm release iot-nats → fleet-nats - Helm release iot-operator-v0 → harmony-fleet-operator - Container image localhost/iot-operator-v0:latest → localhost/harmony-fleet-operator:latest - On-disk cache $HARMONY_DATA_DIR/iot/ → $HARMONY_DATA_DIR/fleet/ (cloud-images, ssh keypairs, libvirt pool) What stayed - harmony-reconciler-contracts — already neutrally named - Wire types (DeviceInfo, DeploymentState, HeartbeatPayload, DeploymentName) — already neutral - KV buckets (device-info, device-state, device-heartbeat, desired-state) — already neutral - CRD kind names (Deployment, Device) — already neutral - NatsBasicScore / NatsHelmChartScore / HelmChart / etc. — framework-scope, unchanged Verification - cargo check --workspace --all-targets: clean - All harmony lib tests (114), fleet-operator (6), fleet-agent (7), harmony-reconciler-contracts (13): green - End-to-end load-test (20 devices / 3 CRs / 20s under fleet/scripts/load-test.sh): PASS. Image built as localhost/harmony-fleet-operator:latest, chart installed as release harmony-fleet-operator in namespace fleet-system, all CR aggregates correct. Zero stragglers: grep across the tree for \biot\b / IOT_ / \bIot[A-Z] returns empty (excluding docs explicitly talking about IoT as the first customer's domain).	2026-04-23 11:10:10 -04:00
Jean-Gabriel Gill-Couture	61cdb9c326	refactor(examples): rename iot_apply_deployment → harmony_apply_deployment Addresses the review point that the applier CLI was anchored in IoT vocabulary, but the CRD it applies is a generic declarative-reconcile intent that works for Pi podman today and OKD / KVM / anything-reconcilable tomorrow. The name now reflects what it actually does. Mechanical rename: crate, binary, `PatchParams::apply(...)` field manager, doc comments, every reference in smoke-a4.sh, the v0_1_plan.md Chapter 1 section, and the example itself. The CRD types + paths + operator name are not touched by this commit — that's the broader rebrand, planned for a dedicated branch. - examples/iot_apply_deployment/ → examples/harmony_apply_deployment/ - crate name: example_iot_apply_deployment → example_harmony_apply_deployment - binary name: iot_apply_deployment → harmony_apply_deployment - PatchParams field manager: "iot-apply-deployment" → "harmony-apply-deployment" 0 stragglers: `grep example_iot_apply_deployment` across the tree returns empty.	2026-04-23 11:00:19 -04:00
Jean-Gabriel Gill-Couture	4254a2092c	refactor(nats): share the helm-chart primitive across all NATS scores Addresses the review point that NatsBasicScore was a parallel typed-k8s_openapi path — reinventing probes, resource shapes, pod anti-affinity, JetStream storage — instead of reusing what NatsK8sScore already does via the upstream nats/nats helm chart. Every shape the project will ever ship (supercluster, single node, TLS, gateway, leaf nodes) is expressible as values on that chart. Parallel resource construction was churn waiting to diverge. The shape now: HelmChartScore (the existing helm-install primitive) sits at the base; NatsHelmChartScore (new) wraps it, pinning chart + repo and exposing values_yaml only; NatsBasicScore (single node) and NatsK8sScore (supercluster + TLS + gateways) both build on NatsHelmChartScore. Changes: - Delete harmony/src/modules/nats/node.rs (279 lines of typed k8s_openapi Deployment/Service/Namespace — gone). - New harmony/src/modules/nats/helm_chart.rs: NatsHelmChartScore pins chart_name = "nats/nats" and its official repository; values_yaml is the only varying input. Implements Score<T> for any topology with HelmCommand; caller hands it to K8sBareTopology / HAClusterTopology / K8sAnywhereTopology. - Rewrite score_nats_basic.rs as a thin preset: build a minimal single-node values_yaml (fullnameOverride, replicaCount=1, cluster.enabled=false, jetstream on/off, service type via the chart's `service.merge.spec.type` knob, optional image override). 10 unit tests on render_values covering every builder combination + image-ref splitting. Score bound moves from `T: K8sclient` to `T: HelmCommand` since installation is now helm-based. - score_nats_k8s.rs: last step in deploy_nats switches from a hand-constructed HelmChartScore to NatsHelmChartScore::new(...). Supercluster values_yaml construction untouched — a supercluster is just a more elaborate values file against the same chart. 
- bare_topology.rs: add `impl HelmCommand for K8sBareTopology` so the in-load-test flow (K8sBareTopology → NatsBasicScore → NatsHelmChartScore → HelmChartScore) compiles. Returns a bare `helm` command; KUBECONFIG resolution mirrors how HAClusterTopology does it. - mod.rs: export NatsHelmChartScore + the re-shaped NatsServiceType. - load-test.sh: the nats/nats chart provisions a StatefulSet, not a Deployment. Wait on `pod -l app.kubernetes.io/name=nats` instead of `deployment/iot-nats`; this works across workload kinds. Tests: - 2 helm_chart unit tests (chart+repo pinning, default install-upgrade semantics) - 10 score_nats_basic unit tests covering every values shape - Full load-test.sh e2e (20 devices / 3 CRs / 20s): PASS.	2026-04-23 10:58:17 -04:00
Jean-Gabriel Gill-Couture	61d3a6b757	feat(iot/chart): typed variants + CRD-keep + Pod security context All checks were successful Run Check Script / check (pull_request) Successful in 2m17s Details Three production-path improvements bundled into one chart change, all verified end-to-end (helm lint + load-test pass): 1. Switch from `HelmResourceKind::from_serializable(...)` to the typed `HelmResourceKind::{Namespace, ServiceAccount, ClusterRole, ClusterRoleBinding, Crd}` variants added to the shared harmony helm module. Serialization output is byte-equivalent; IDE discoverability + type-safety go up. 2. Annotate both CRDs with `helm.sh/resource-policy: keep`. Without this, `helm uninstall iot-operator-v0` cascade-deletes the CRDs; the kube GC then deletes every Deployment CR and every Device CR; the operator finalizer fires on each deletion and wipes the `desired-state` KV; agents tear down every container. One typo on uninstall would be fleet-wide catastrophe. `keep` makes uninstall data-preserving and idempotent — wipe requires an explicit `kubectl delete crd …`. 3. Lock down the operator Pod's securityContext: - `runAsNonRoot: true` - `readOnlyRootFilesystem: true` - `allowPrivilegeEscalation: false` - `capabilities: drop [ALL]` - `seccompProfile: RuntimeDefault` Deliberately no `runAsUser` — OpenShift's `restricted-v2` SCC assigns namespace-specific UIDs and rejects fixed ones. The image's `USER 65532:65532` (Dockerfile) gives vanilla k8s a non-root UID; OpenShift's SCC overrides with its own. Same chart works on both without custom SCC bindings. Dockerfile adds `USER 65532:65532` — required for vanilla k8s to accept `runAsNonRoot: true` without a Pod-level `runAsUser`. 65532 is the distroless/chainguard `nonroot` convention; arbitrary but safe (no overlap with common system UIDs). Tests: 2 chart unit tests locking in the keep annotation + SC shape. 
End-to-end load test at 20 devices / 3 CRs: pod comes up clean under the restricted SC, all aggregates correct, zero operator warnings.	2026-04-23 10:32:03 -04:00
Jean-Gabriel Gill-Couture	20b94dfacf	feat(harmony/helm): typed HelmResourceKind variants for RBAC + Namespace + CRD Extends HelmResourceKind with typed variants for Namespace, ServiceAccount, ClusterRole, ClusterRoleBinding, and CustomResourceDefinition. Previously only Service + Deployment had typed variants; everything else went through the `from_serializable`/`CustomYaml` escape hatch. The escape hatch stays (documented as "always prefer a typed variant") for forward-compat with types we haven't imported yet. Any consumer currently using `from_serializable` for one of the new typed variants can switch; serialization output is byte-equivalent (both paths route through serde_yaml on the same k8s_openapi struct). Motivation: every Rust operator built on harmony wants the same five resources (Namespace, SA, ClusterRole, ClusterRoleBinding, CRD) to be chart-template-ready. Typing them once here means every operator's chart.rs stays short and IDE-discoverable instead of a string-of-from_serializable-calls. Filenames carry the resource name where applicable (serviceaccount-<name>.yaml, clusterrole-<name>.yaml, etc.) so charts with multiple ClusterRoles don't collide on a single `clusterrole.yaml` file. 2 unit tests: unique-filename invariant across the five typed variants, and crd-name round-trip.	2026-04-23 10:27:11 -04:00
Jean-Gabriel Gill-Couture	3d39b670dd	feat(iot-agent): config-driven routing labels Before: the agent published only `device-id=<id>` on DeviceInfo, which collapsed every Deployment.spec.targetSelector to "target one device by id" — usable, but not the actual scalability story. The K8s-Node analogue wants kubelet-declared node labels driving DaemonSet nodeSelector; we were missing the equivalent. After: a new `[labels]` section in the agent's TOML config, set by IotDeviceSetupScore and plumbed through to every DeviceInfo publish. Config labels merge with the default `device-id` on startup. Re-running the Score with a changed label map regenerates the TOML, triggers the byte-compare idempotency path, restarts the agent; new labels propagate into Device.metadata.labels and Deployment selectors re-resolve on the operator side. Manual toml edits + `systemctl restart iot-agent` is the break-glass path. Scope: - iot/iot-agent-v0/src/config.rs: `labels: BTreeMap<String,String>` on AgentConfig, defaults to empty via #[serde(default)]. Two parse tests cover the "section present" + "section absent" cases. - iot/iot-agent-v0/src/main.rs: merge cfg.labels with the default `device-id` entry before DeviceInfo publish. Config wins on key conflicts — unusual but legal. - harmony/src/modules/iot/setup_score.rs: IotDeviceSetupConfig gains `labels: BTreeMap<String,String>` (replacing the dedicated `group` field — group is just a conventional label now, not a distinct axis). render_toml renders a [labels] section; BTreeMap iteration guarantees sorted output so the Score's byte-compare change detection stays idempotent. Three unit tests: section content, byte-identical rendering across runs, value escaping. - examples/iot_vm_setup/src/main.rs: `--labels key=val,key=val` with a parser that errors on malformed chunks, empty keys/values, or an empty map (a device with no labels is practically untargetable, better to fail at the CLI than onboard a ghost). 
Live label changes require an agent restart (same as kubelet's --node-labels on a running Node). Edit-labels-on-running-fleet is a later chapter; for v0 the restart cost is negligible. Tests: 7 iot-agent + 3 iot setup_score + existing operator/contracts suite, all green.	2026-04-23 10:25:25 -04:00
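The label handling described in 3d39b670dd (config labels merged over the default `device-id`, sorted rendering so byte-compare stays idempotent) can be sketched roughly as below. This is an illustration only: `merge_labels`, `render_labels_section`, and `toml_quote` are hypothetical names, and the escaping covers only backslashes and double quotes, which may be narrower than what the real `render_toml` does.

```rust
use std::collections::BTreeMap;

/// Quote a string as a TOML basic string.
/// Assumption: minimal escaping (backslash and double quote only).
fn toml_quote(s: &str) -> String {
    format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
}

/// Start from the default `device-id` label, then lay config labels on top.
/// Config wins on key conflicts, as the commit message states.
pub fn merge_labels(
    device_id: &str,
    config: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut merged = BTreeMap::new();
    merged.insert("device-id".to_string(), device_id.to_string());
    merged.extend(config.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}

/// Render a `[labels]` section. BTreeMap iterates in key order, so the
/// same map always produces the same bytes.
pub fn render_labels_section(labels: &BTreeMap<String, String>) -> String {
    let mut out = String::from("[labels]\n");
    for (key, value) in labels {
        out.push_str(&format!("{} = {}\n", toml_quote(key), toml_quote(value)));
    }
    out
}
```

Because the output depends only on the map contents, re-running the render with an unchanged label map yields identical text, which is what a byte-compare change detector needs.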
Jean-Gabriel Gill-Couture	a616204b1c	refactor(nats): extract typed single-node primitive; NatsBasicScore becomes a thin wrapper Some checks failed Run Check Script / check (pull_request) Failing after 54s Details Addresses the review point that NatsBasicScore was introduced as a parallel NATS path instead of sharing primitives with the rest of the module. The render logic (Deployment + Service + Namespace for one NATS server pod) is now pulled into a new `nats::node` module built on ADR 018 — typed k8s_openapi structs, no helm templating — and NatsBasicScore is a high-level preset that sets defaults on a NatsNodeSpec and runs the shared render fns. Module-level doc on `nats::node` explicitly flags that future high-level scores (clustered, TLS, gateway) should grow the spec and reuse the same primitive, and that NatsK8sScore + NatsSuperclusterScore are scheduled to migrate onto this primitive in a follow-up so the helm-templating path disappears entirely from the NATS module. 7 unit tests between node (the primitive) + score_nats_basic (the wrapper) cover service-type routing + JetStream flag propagation.	2026-04-23 09:48:42 -04:00
Jean-Gabriel Gill-Couture	1df0ba7cdc	refactor(iot): drop --system from iot-agent; add optional admin password Two changes with a single motivation — make the iot-agent runtime robust under multi-user hosts + unblock chaos-testing workflows on the VM admin user. 1. iot-agent user is no longer --system. Rootless podman needs subuid/subgid ranges in /etc/subuid + /etc/subgid before layer unpacking. Ubuntu's useradd --system deliberately skips those allocations (system users aren't expected to run user namespaces), so we were patching the gap with a hardcoded "usermod --add-subuids 100000-165535". That range collides with any other user on the host that also runs rootless containers — a real footgun. Dropping --system lets useradd's default allocator pick a non-overlapping range, and the whole ensure_subordinate_ids trait method + ansible impl goes away as dead code. 2. VmFirstBootConfig.admin_password (Option<String>). When set, cloud-init unlocks the account and enables ssh_pwauth on the guest — intended for reliability / chaos testing sessions where the operator wants to log in and break things on purpose. Default is still key-only auth. example_iot_vm_setup plumbs a --admin-password flag + IOT_VM_ADMIN_PASSWORD env var; smoke-a4 passes them through so chaos sessions are one env var away from a ready VM. 3 cloud-init unit tests cover the locked + unlocked + YAML-escape paths.	2026-04-23 09:48:36 -04:00
Jean-Gabriel Gill-Couture	24b8282b7f	feat(iot): Chapter 3 — operator helm chart (local, no registry) Some checks failed Run Check Script / check (pull_request) Failing after 50s Details Generates a self-contained helm chart directory from typed Rust (ADR 018, Template Hydration). The chart packages: - Deployment CRD (from Deployment::crd()) - Device CRD (from Device::crd()) - ServiceAccount, ClusterRole, ClusterRoleBinding with the exact verbs the operator uses; nothing aspirational - operator Deployment (image, env NATS_URL + RUST_LOG) No hand-authored yaml, no Helm templating. Re-run the chart subcommand to regenerate for different inputs. When a publishable chart is needed (user-facing `values.yaml`), layer a templating pass on this output; for the load test the plain chart is enough. New surface: - `iot-operator-v0 chart --output <dir> [--image ... --nats-url ...]` writes the chart tree and prints its path. - `iot/iot-operator-v0/Dockerfile`: minimal archlinux:base wrapper around the host-built release binary (glibc-ABI match without a two-stage Docker build). load-test.sh: drops the host-side operator spawn entirely. Phase 3 now builds the operator image, sideloads it into k3d via `podman save | docker load | k3d image import`, generates the chart via the `chart` subcommand, and `helm upgrade --install` it into the cluster. `dump_operator_log` pulls `kubectl logs` into the stable work dir so HOLD=1 + failure-tail hooks keep working. Two gotchas debugged along the way, preserved in code comments: - workspace `.dockerignore` excludes `target/`, so the image build uses a staged build context under $WORK_DIR/image-ctx. - `podman build -t foo/bar:tag` stores as `localhost/foo/bar:tag`, which k3d image import can't find under the original tag. Use `localhost/iot-operator-v0:latest` as the canonical image ref end-to-end. 
Load-test results (selector architecture, operator in helm-installed pod, same envelope as the host-side baseline):

| Scale | Duration | Writes | Rate | Errors | CR aggregates |
|-------|---------:|-------:|-----:|-------:|:-------------:|
| 20 devices / 3 CRs | 20s | 400 | 20/s | 0 | 3/3 ok |
| 10k / 1000 CRs | 120s | 1,201,967 | 10,009/s | 0 | 1000/1000 ok |

No operator warnings, no errors across the run. Image build + sideload + helm install adds ~30s to startup; steady-state throughput unchanged from host-side.	2026-04-23 06:57:56 -04:00
Jean-Gabriel Gill-Couture	173f549918	chore(iot): roadmap doc sync + code review pass Roadmap: - v0_1_plan.md Chapter 2: rewrite to describe the shipped selector + Device CRD model (matchedDeviceCount, LabelSelector, per-concern KV). Drop AgentStatus / observed_score_string / target_devices references. Update "State of the world" preamble to match 2026-04-23 reality. - chapter_4_aggregation_scale.md: SUPERSEDED banner at top with a clear what-was-kept vs. what-was-dropped summary. Original body preserved as decision-trail archaeology. Code review pass on the iot crates, behavior-preserving: - fleet_aggregator: owned_targets is now keyed by DeploymentName (matches the KV key space: globally unique, no namespace). The old DeploymentKey keying created an orphan-leak on operator restart: seed_owned_targets stashed entries under a sentinel namespace ("") that on_deployment_upsert never merged. Now seeding populates the map correctly so restart + selector change diffs properly. - fleet_aggregator: reuse the Client passed into run() for the patch_api instead of calling Client::try_default() a second time. - fleet_aggregator: delete _use_list_params / _use_deployment_spec placeholder scaffolding + unused ListParams / DeploymentSpec / ScorePayload imports. Inline one-liner serialize_score. - fleet_aggregator: clean up `then(|| ...)` → filter/map split. - device_reconciler: `is_label_value(v).then_some(()).is_some()` → plain `is_label_value(v)`. - crd: delete speculative DeviceStatus + DeviceCondition (no one writes to them; the comment in DeviceSpec documents where they'd land when a heartbeat-reflection reconciler shows up). - controller: compute `obj.name_any()` once in cleanup(). All 24 tests green. End-to-end load test (20 devices / 3 groups / 20s) PASS after the changes.	2026-04-23 06:35:36 -04:00
Jean-Gabriel Gill-Couture	8a6a9f1a03	refactor(iot): Deployment.targetSelector + Device CRD (DaemonSet-like) Kills the "CRD owns a list of device ids" smell. Deployment CR now carries a standard K8s LabelSelector; Device is a first-class cluster-scoped CR (like Node). Matching, desired-state KV writes, and status aggregation all run off selector evaluation against the Device cache; there is no list of device ids anywhere in the CRD spec. Cross-resource model: - Agent publishes DeviceInfo (with labels) to NATS `device-info` KV. - device_reconciler watches that bucket → server-side-applies a cluster-scoped Device CR with metadata.labels + spec.inventory. - Deployment controller is now just validation + finalizer cleanup. - fleet_aggregator watches Deployment CRs + Device CRs + device-state KV, maintains in-memory selector → target device sets, writes/deletes `desired-state.<device>.<deployment>` KV on match changes, patches `.status.aggregate` at 1 Hz with matchedDeviceCount + phase counters. Applied CRD shape verified on a live k3d cluster: kubectl get crd deployments.iot.nationtech.io -o json .spec.versions[0].schema.openAPIV3Schema.properties.spec → rollout / score / targetSelector (matchLabels + matchExpressions) .spec.versions[0].schema.openAPIV3Schema.properties.status.aggregate → matchedDeviceCount / succeeded / failed / pending / lastError kubectl get crd devices.iot.nationtech.io -o json .spec.scope = "Cluster" .spec.versions[0].schema.openAPIV3Schema.properties.spec → inventory (nullable, camelCased fields) Load-test run: DEVICES=20 GROUP_SIZES=10,5,5 DURATION=20 all 3 CRs hit expected matched=N / succeeded+failed+pending=N. Other changes: - k8s-openapi gets the `schemars` feature so LabelSelector derives JsonSchema. - InventorySnapshot uses `#[serde(rename_all = "camelCase")]` for consistency with the rest of the CRD schema. 
- agent publishes `device-id=<id>` as a default label so the example_iot_apply_deployment `--target-device <id>` shorthand works out-of-the-box (implemented as `--selector device-id=<id>`). - example_iot_apply_deployment gains `--selector key=value` repeatable flag. - load-test.sh explore banner exposes Device CR commands + new matchedDeviceCount column.	2026-04-22 22:55:38 -04:00
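To make the selector model in 8a6a9f1a03 concrete, here is a minimal sketch of `matchLabels` evaluation against a device label cache. It assumes plain string maps instead of the real `LabelSelector` and Device CR types, leaves out `matchExpressions`, and treats an empty selector as matching every device (the Kubernetes convention for an empty selector; whether the operator follows it is an assumption). `matches` and `matched_devices` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Label map for either a selector's matchLabels or a device's labels.
pub type Labels = BTreeMap<String, String>;

/// A device matches when every selector pair is present with the same value.
/// `all` on an empty iterator is true, so an empty selector matches everything.
pub fn matches(match_labels: &Labels, device_labels: &Labels) -> bool {
    match_labels
        .iter()
        .all(|(key, value)| device_labels.get(key) == Some(value))
}

/// Resolve a selector to the ids of matching devices (sorted by id,
/// because the cache here is a BTreeMap keyed by device id).
pub fn matched_devices<'a>(
    match_labels: &Labels,
    devices: &'a BTreeMap<String, Labels>,
) -> Vec<&'a str> {
    devices
        .iter()
        .filter(|(_, labels)| matches(match_labels, labels))
        .map(|(id, _)| id.as_str())
        .collect()
}
```

Under this model the `--target-device <id>` shorthand is just the selector `device-id=<id>`, and the length of the resolved list plays the role of `matchedDeviceCount`.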
Jean-Gabriel Gill-Couture	5e8e72df52	feat(iot-load-test): stable paths + HOLD=1 interactive mode Some checks failed Run Check Script / check (pull_request) Failing after 52s Details - Stable working dir under /tmp/iot-load-test/ — kubeconfig at /tmp/iot-load-test/kubeconfig, operator log at /tmp/iot-load-test/operator.log. No more chasing mktemp paths. - Print an explore banner before the load run so the user can `export KUBECONFIG=...` and `kubectl get deployments -w` in another terminal while the load actually runs. - HOLD=1 env var keeps the stack alive after the load completes; script blocks on sleep until Ctrl-C. Forwards --keep to the binary so CRs + KV entries stay in place for inspection. - DEBUG=1 bumps operator RUST_LOG to surface every status patch. - Keep operator.log after successful runs (cheap, often useful). - Load-test binary: --cleanup bool → --keep flag (clap bool with default_value_t = true doesn't accept `--cleanup=false`).	2026-04-22 21:59:26 -04:00
Jean-Gabriel Gill-Couture	4d0aa069e5	perf(iot-load-test): parallel CR apply + DeviceInfo seed via tokio::JoinSet Sequential apply was fine at 10 groups; becomes the startup bottleneck at 1000. 32-way concurrent CR apply lands 1000 Deployment CRs in ~1.6s; 64-way concurrent DeviceInfo seed seeds 10k devices in ~0.3s. Also zero-pad CR names and device ids to the largest width so large runs sort lexicographically in kubectl.	2026-04-22 21:55:30 -04:00
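The zero-padding mentioned in 4d0aa069e5 amounts to formatting every index at the width of the largest one. A small sketch follows, with a hypothetical `padded_names` helper and a `<prefix>-<index>` naming scheme that is assumed for illustration and not taken from the load-test binary:

```rust
/// Generate `count` names of the form `<prefix>-<index>`, with the index
/// zero-padded to the width of the largest index so that lexicographic
/// order equals numeric order.
pub fn padded_names(prefix: &str, count: usize) -> Vec<String> {
    // Width of the largest index (count - 1); at least one digit.
    let width = count.saturating_sub(1).to_string().len();
    (0..count)
        .map(|i| format!("{prefix}-{i:0width$}"))
        .collect()
}
```

With 1000 names the indices run from `000` to `999`, so a plain string sort (as in `kubectl get` output) lists them in numeric order.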
Jean-Gabriel Gill-Couture	ce7ad75dbf	feat(iot): synthetic load test for fleet_aggregator + operator NATS connect retry - example_iot_load_test: simulates N devices (default 100 across 10 groups: 55 + 9×5) pushing DeploymentState every tick to NATS, no real podman. Applies one Deployment CR per group, runs for a bounded duration, verifies each CR's .status.aggregate counters sum to the target device count. - iot/scripts/load-test.sh: minimum harness — k3d cluster + NATS via NatsBasicScore + CRD + operator + load-test binary. No VM, no agent build. - operator: connect_with_retry() on startup. The NATS TCP probe that the smoke scripts do isn't enough to guarantee the protocol handshake is ready (k3d loadbalancer can accept SYNs before the pod is serving); the load harness hit this racing against a freshly-rebuilt operator binary. - drop unused rand dep from iot-agent-v0 Cargo.toml. 100-device run: 6002 state writes in 60s at a clean 100 writes/s, all 10 CR aggregates converge to target_devices.len() (e.g. group-00 → 55 = 45 Running + 9 Failed + 1 Pending).	2026-04-22 21:43:02 -04:00
Jean-Gabriel Gill-Couture	5c65ba71cc	fix(iot-operator): watch device-state with LastPerSubject, not StartSequence(0) `bucket.watch_all_from_revision(0)` sends the JetStream consumer request with DeliverByStartSequence and an optional-missing start sequence, which the server rejects with error 10094: consumer delivery policy is deliver by start sequence, but optional start sequence is not set `watch_with_history(">")` uses DeliverPolicy::LastPerSubject instead — replays the current value of every key, then streams live updates. Same cold-start-plus-steady-state semantics, correct wire. Caught by smoke-a4 --auto: state watcher exited immediately on startup, no deployments ever reconciled.	2026-04-22 21:17:52 -04:00
Jean-Gabriel Gill-Couture	9e42c15901	refactor(iot/smoke): update smoke scripts for new KV wire layout - agent-status bucket -> device-heartbeat bucket - status.<device> key -> heartbeat.<device> - drop parity check summary from smoke-a4 (legacy path is gone) - tidy stale AgentStatus comment in agent main	2026-04-22 21:10:55 -04:00
Jean-Gabriel Gill-Couture	2d99880770	refactor(iot): operator watches device-state KV directly; drop event stream Collapses the Chapter 4 event-stream architecture into pure KV watch. The operator was maintaining a durable JetStream consumer on device-state-events in parallel with the KV bucket it was meant to shadow — the stream was an optimization over KV scanning, but with async-nats's ordered bucket watch it's redundant. Gone: - StateChangeEvent, LifecycleTransition, STREAM_DEVICE_STATE_EVENTS, state_event_subject, STATE_EVENT_WILDCARD (contracts) - Revision, AgentEpoch (contracts) — restart ordering now handled by DeploymentState.last_event_at monotonic check - PhaseCounters.apply_event + incremental diff machinery (operator) — counters recomputed per dirty CR from the states snapshot - RecordedTransition + publish_transition split (agent) — without an event to publish, the pure/publish boundary has no reason to exist - Agent sequence counter + agent_epoch generation (agent main.rs) - CR aggregate fields recent_events, last_heartbeat_at, unreported — never populated, pure speculation New shape: - fleet_aggregator.rs watches device-state via bucket.watch_all_from_revision(0) - apply_state / drop_state mutate an in-memory snapshot - patch_tick refreshes CR index from kube, recomputes aggregates for CRs marked dirty, patches CR status - DeploymentAggregate = succeeded/failed/pending + last_error only Line counts (3 iot crates): 4263 -> 3090 -> 2162 (-49% overall, -30% this pass) Tests: 24 total (13 contracts + 6 operator + 5 agent), all green.	2026-04-22 21:09:09 -04:00
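The "counters recomputed per dirty CR from the states snapshot" approach in 2d99880770 can be pictured as a plain fold over the snapshot. The sketch below is not the operator's code: the `Phase` variants, the mapping of `Running` to `succeeded`, the `(device, deployment)` key shape, and the rule for picking `last_error` are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

#[derive(Clone, Debug)]
pub struct DeploymentState {
    pub phase: Phase,
    pub last_error: Option<String>,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct DeploymentAggregate {
    pub succeeded: u32,
    pub failed: u32,
    pub pending: u32,
    pub last_error: Option<String>,
}

/// In-memory snapshot keyed by (device, deployment). Assumed shape.
pub type Snapshot = BTreeMap<(String, String), DeploymentState>;

/// Recompute one deployment's aggregate from scratch. No incremental
/// diffing: every call walks the snapshot and counts phases.
pub fn recompute(deployment: &str, states: &Snapshot) -> DeploymentAggregate {
    let mut agg = DeploymentAggregate::default();
    for ((_device, name), state) in states {
        if name.as_str() != deployment {
            continue;
        }
        match state.phase {
            Phase::Running => agg.succeeded += 1, // assumption: Running counts as succeeded
            Phase::Failed => agg.failed += 1,
            Phase::Pending => agg.pending += 1,
        }
        // Assumption: keep the error of the last device (in key order) that has one.
        if state.last_error.is_some() {
            agg.last_error = state.last_error.clone();
        }
    }
    agg
}
```

A full recompute cannot drift the way incremental counters can, which is presumably why the diff machinery could be deleted; the trade-off is a walk over the snapshot for each dirty CR.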
Jean-Gabriel Gill-Couture	d28cc6a184	refactor(iot): drop LogEvent type + log subject helpers Zero consumers, zero publishers — pure speculative surface area. Drops LogEvent struct, EventSeverity enum, STREAM_DEVICE_LOG_EVENTS, log_event_subject, logs_subject, logs_query_subject. If per-device log streaming lands later, it arrives with a real consumer attached. Contracts tests: 21 → 19 (removed two roundtrip tests for the deleted type).	2026-04-22 20:57:35 -04:00
Jean-Gabriel Gill-Couture	9b35bc5314	refactor(iot): delete legacy AgentStatus path; event-driven aggregation is now authoritative Chapter 4 shipped per-concern wire types (DeviceInfo, DeploymentState, HeartbeatPayload, StateChangeEvent) as replacements for the monolithic AgentStatus heartbeat. The parity check proved the new path matches the legacy one; legacy now goes. Removed: - AgentStatus, DeploymentPhase, EventEntry, agent-status bucket, status_key - iot-operator-v0/src/aggregate.rs (legacy full-recompute aggregator) - Parity machinery in fleet_aggregator.rs (ParityStats, parity_tick, dual-write) - Agent recent_events ring + push_event (consumed only by AgentStatus) - publish_log_event + device-log-events stream (no consumer, YAGNI) fleet_aggregator now drives CR .status.aggregate directly: event consumer maintains counters incrementally, 1 Hz patch_tick flushes only deployments in the `dirty` set. Net: ~1000 lines removed (4263 → 3216 across the three iot crates). Wire surface: 5 types → 4. Operator tasks: 4 → 2 (controller + aggregator). Tests: 21 contracts + 9 operator + 6 agent — all green.	2026-04-22 20:54:39 -04:00
Jean-Gabriel Gill-Couture	2f08643aa0	refactor(iot): DeploymentName + Revision newtypes; LifecycleTransition models deletion; fixes bugs #1 and #2 from the review Newtypes (review point #3) were the entry. Introducing them forced the event-payload redesign, and the redesign made the other two bugs obvious + trivial to fix. New contract types (harmony-reconciler-contracts::fleet): - DeploymentName: validated newtype. Rejects empty, > 253 bytes, '.' (alias an extra NATS subject token), NATS wildcards, and whitespace. Serde impl validates on deserialize so a malformed payload is rejected at the wire, not later. - AgentEpoch(u64): random-per-process. Prefixes every sequence. - Revision { agent_epoch, sequence } with lexicographic Ord. - LifecycleTransition enum: Applied { from, to, last_error } | Removed { from }. Replaces (from: Option<Phase>, to: Phase) so deletion is modeled explicitly in the wire format. Bug fixes that fell out of the redesign: #1 (drop_phase was silent on the wire): `drop_phase` now produces a RecordedTransition with Removed { from }, which the publisher serializes into a StateChangeEvent. Operator applies the Removed variant by decrementing `from` without a paired increment. Counters no longer over-count after deletions. #2 (sequence reset on agent restart): (agent_epoch, sequence) lexicographic ordering means the first post-restart event (seq=1 under a fresh epoch) outranks any pre-restart event the operator had applied. No more silently-dropped events after an agent crash. Split recommended in review point #4: - `record_apply` / `record_remove`: pure in-memory state updates returning Option<RecordedTransition>. - `publish_transition`: side-effectful wire emission. - `apply_phase` / `drop_phase`: thin composite helpers the hot path uses. Typed keys in the operator: - DevicePair { device_id, deployment: DeploymentName } replaces (String, String) so the two identifiers can't be swapped. - FleetState.deployment_namespace is keyed by DeploymentName. 
- Controller's kv_key signature takes &DeploymentName; invalid CR names surface as a clear Error rather than corrupting KV. Tests: - 27 contract tests (roundtrip every payload shape, including forward-compat parsing; validate DeploymentName rejection paths; assert Revision ordering across epochs). - 19 operator fleet_aggregator tests, including regression guards named for the specific bugs: removed_transition_decrements_without_paired_increment (#1) revision_ordering_handles_agent_restart (#2) - 8 agent reconciler tests (record_apply/record_remove purity, sequence monotonicity, agent_epoch stamping, ring buffer cap). Agent main wires a fresh AgentEpoch via rand::random::<u64>() at startup; FleetPublisher::connect takes it and includes it in every DeviceInfo + state-change event.	2026-04-22 17:42:42 -04:00
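The validated-newtype idea from 2f08643aa0 can be sketched as follows. The rejection rules are the ones listed in the commit message (empty, over 253 bytes, `.`, NATS wildcards, whitespace); the `parse` constructor, the `NameError` enum, and the `u64` field types on `Revision` are hypothetical, and the serde integration mentioned in the commit is omitted.

```rust
/// Deployment name that is safe to embed as a single NATS subject token.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct DeploymentName(String);

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum NameError {
    Empty,
    TooLong,
    ForbiddenChar(char),
}

impl DeploymentName {
    pub fn parse(raw: &str) -> Result<Self, NameError> {
        if raw.is_empty() {
            return Err(NameError::Empty);
        }
        if raw.len() > 253 {
            // `len()` counts bytes, matching the "> 253 bytes" rule.
            return Err(NameError::TooLong);
        }
        // '.' would add a subject token; '*' and '>' are NATS wildcards.
        let bad = raw
            .chars()
            .find(|c| matches!(c, '.' | '*' | '>') || c.is_whitespace());
        match bad {
            Some(c) => Err(NameError::ForbiddenChar(c)),
            None => Ok(Self(raw.to_string())),
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Derived `Ord` compares fields in declaration order, so `agent_epoch`
/// is compared before `sequence`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Revision {
    pub agent_epoch: u64,
    pub sequence: u64,
}
```

Once a value of type `DeploymentName` exists it has already passed validation, so key-building code such as a `kv_key` helper does not need to re-check it.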
Jean-Gabriel Gill-Couture	367d63cfba	test(iot/smoke-a4): clarify parity summary — matches are DEBUG-level so don't report them	2026-04-22 14:42:27 -04:00
Jean-Gabriel Gill-Couture	3b111df578	fix(iot-operator): lazy namespace refresh in event consumer + relax smoke parity check Two findings from the M4 smoke runs: 1. Event consumer dropped events for unknown-namespace deployments. The consumer receives state-change events but `apply_state_change_event` short-circuits when `deployment_namespace` doesn't have the deployment yet — common on the first 5 s after a new CR is applied, before the parity-tick's refresh loop runs. Fix: on unknown deployment, consumer eagerly does a kube `Api::list()` and populates the map. Subsequent events for that deployment are fast-path (map already has it). Also: added instrumentation on publish + receive paths so future debugging against the parity check produces actionable traces. Log level is DEBUG to keep INFO clean. 2. Parity MISMATCH during transitions is correct behavior. The legacy aggregator reads AgentStatus which the agent republishes every 30 s. Chapter 4 state-change events land in ~100 ms. So during a Pending→Running transition there's a window where the new counter shows succeeded=1 while legacy still shows pending=1 — precisely because the new path is faster, which is the point of this rework. The smoke's hard-fail-on-any-mismatch was too strict; relaxed to a diagnostic print. Steady state should still converge to zero mismatches once the next AgentStatus heartbeat lands; the summary lets the user spot sustained divergence by eye. M5 removes the legacy path entirely, making the parity check moot. Agent-side publish now also surfaces subject + sequence + stream-seq on every state-change publish, a similar diagnostic aid for tracing wire deliveries.	2026-04-22 14:38:48 -04:00
Jean-Gabriel Gill-Couture	cc8d908fcb	fix(iot-agent/fleet-publisher): await PublishAckFuture so events are durably persisted Chapter 4's parity check in smoke-a4 caught M4 dropping events — operator's consumer saw 1 of 3 state transitions, parity-mismatch assertion fired. Root cause: async-nats's jetstream.publish() returns a PublishAckFuture that must be awaited for the server to persist the message. Without that await, the publish is effectively fire-and-forget and drops under any backpressure — which on the smoke's agent-first-boot path is every publish until the stream state stabilizes. Fix awaits both the publish future (send) and the returned PublishAckFuture (server ack) for state-change + log events. State-change events are warn-on-failure (operator needs them); log events are debug-on-failure (device-side ring buffer is authoritative).	2026-04-22 14:24:58 -04:00
Jean-Gabriel Gill-Couture	6d4335771e	test(iot/smoke-a4): surface fleet-aggregator parity summary on PASS Smoke was silent about the Chapter 4 parity check because the operator log got discarded on successful runs. Add a pre-cleanup step that greps for `fleet-aggregator` log lines and prints the last 20; if any `parity MISMATCH` line is present, upgrade to `fail` — smoke exit 0 shouldn't hide a silently-wrong new aggregator.	2026-04-22 14:18:50 -04:00
Jean-Gabriel Gill-Couture	64d8295a65	feat(iot-operator): M4 — event-driven counters + duplicate-safe apply Replaces M3's per-tick KV re-walk with an incremental JetStream consumer on `device-state-events`. Cold-start still walks KV once to seed counters; steady state consumes events and applies `from -= 1; to += 1` diffs. New in `fleet_aggregator`: FleetState (shared via Arc<Mutex<_>>): - counters: per-deployment phase counts. - phase_of: per-(device, deployment) current phase, for duplicate + resync detection. - latest_sequence: per-(device, deployment) highest sequence applied, drops stale and duplicate deliveries. - deployment_namespace: name → namespace map refreshed each parity tick from the CR list (events carry only the deployment name, matching the `<device>.<deployment>` KV key format). apply_state_change_event(): - Idempotent for duplicate sequence numbers. - Idempotent for out-of-order lower-sequence events. - On from-phase disagreement with our belief, trusts the event and re-syncs (logs warn; parity check will catch any resulting drift against the legacy aggregator). - Counter decrement saturates at zero so replays can't underflow. run_event_consumer(): - Durable JetStream pull consumer on STATE_EVENT_WILDCARD, DeliverPolicy::New (cold-start already seeded state from KV; replaying from the beginning would double-count). - Explicit ack; malformed payloads are logged + acked to avoid infinite redelivery. parity_tick() no longer walks KV; it reads live counters from the shared FleetState and compares with the legacy aggregator's per-CR fold. Same match/mismatch/running-totals logging as M3. 8 new unit tests cover the event-apply invariants: first transition (no from), transition (from+to), duplicate sequence, out-of-order sequence, from-disagreement resync, unknown-deployment ignore, cold-start seeding, underflow saturation. Plus the 5 M3 tests from before: 13 aggregator tests total, all green.	2026-04-22 14:15:48 -04:00
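The duplicate-safe apply described in 64d8295a65 can be sketched as below. It covers three of the listed invariants (duplicate sequence, out-of-order lower sequence, saturating decrement) and deliberately leaves out the from-disagreement resync and the namespace lookup. Type and field names loosely follow the commit message, but the shapes are assumed and this is not the operator's implementation.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Phase {
    Pending,
    Running,
    Failed,
}

/// Assumed event shape: the transition plus a per-(device, deployment) sequence.
pub struct StateChangeEvent {
    pub device: String,
    pub deployment: String,
    pub from: Option<Phase>,
    pub to: Phase,
    pub sequence: u64,
}

#[derive(Default)]
pub struct FleetState {
    /// (deployment, phase) -> number of devices currently in that phase.
    counters: HashMap<(String, Phase), u64>,
    /// (device, deployment) -> highest sequence applied so far.
    latest_sequence: HashMap<(String, String), u64>,
}

impl FleetState {
    /// Apply one event. Returns false when the event is a duplicate or
    /// stale delivery and was ignored.
    pub fn apply(&mut self, ev: &StateChangeEvent) -> bool {
        let pair = (ev.device.clone(), ev.deployment.clone());
        if matches!(self.latest_sequence.get(&pair), Some(&seen) if ev.sequence <= seen) {
            return false;
        }
        if let Some(from) = ev.from {
            let slot = self
                .counters
                .entry((ev.deployment.clone(), from))
                .or_insert(0);
            // Saturate at zero so a replay can never underflow the counter.
            *slot = slot.saturating_sub(1);
        }
        *self
            .counters
            .entry((ev.deployment.clone(), ev.to))
            .or_insert(0) += 1;
        self.latest_sequence.insert(pair, ev.sequence);
        true
    }

    pub fn count(&self, deployment: &str, phase: Phase) -> u64 {
        self.counters
            .get(&(deployment.to_string(), phase))
            .copied()
            .unwrap_or(0)
    }
}
```

Tracking the highest applied sequence per (device, deployment) pair is what makes redelivery harmless: a redelivered event never moves a counter twice.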
Jean-Gabriel Gill-Couture	adb015bdea	feat(iot-operator): M3 — parity-check task reading Chapter 4 KV alongside legacy aggregator New module `fleet_aggregator` spawns a 5 s tick task that: - Walks the Chapter 4 KV buckets (`device-info`, `device-state`) every tick. - Computes per-CR phase counters via `compute_counters` (pure function, unit tested). - Computes the legacy aggregator's counts from the same `agent-status` snapshot map the legacy task is already maintaining. - Compares the two per CR and logs per-tick at DEBUG level (matches) or WARN (mismatches), with running totals at INFO every 60 s. Explicit `cr_targets_device` predicate is the one-line plug point for the selector-based rewrite coming from the review-fix branch: swap `target_devices.contains()` for `target_selector.matches(&info.labels)`, everything else in the aggregator is label/selector-agnostic. Refactored `aggregate::run` to accept the `StatusSnapshots` map from outside so the parity-check task reads the same agent-status view the legacy aggregator writes to. Added `aggregate::new_snapshots()` helper so `main` owns the one shared Arc. The task is strictly read-only: no CR patches, no side effects. M5 flips `.status.aggregate` over to the new counter-driven path once M4 replaces the periodic re-walk with the event-stream consumer and the parity check has stayed green under load. 5 unit tests cover the pure counter logic (target match, multi-CR fan-in, zero-target CR, phase dispatch).	2026-04-22 14:09:46 -04:00
Jean-Gabriel Gill-Couture	c123c058b7	feat(iot-agent): M2 — publish Chapter 4 wire format in parallel with AgentStatus Agent now writes the new per-concern KV shapes + event streams alongside the legacy AgentStatus. Nothing consumes the new data yet; the legacy aggregator still drives CR .status from `agent-status`. M3 will add the operator-side cold-start + consumer paths in parity mode; M5 flips the CR-patch source once counters verify against the legacy aggregator. New module `fleet_publisher.rs` owns: - Opening + idempotent-creating the three new KV buckets (`device-info`, `device-state`, `device-heartbeat`) and two JetStream streams (`device-state-events`, `device-log-events`). - Publish methods for DeviceInfo, HeartbeatPayload, DeploymentState (KV put), StateChangeEvent + LogEvent (stream publish), and delete for deployment-state cleanup. - Log-and-swallow failure mode. The operator re-walks KV on cold-start, so a missed event publish is self-healing on the next transition or operator restart. Reconciler grew: - `device_id`: Id + `fleet`: Option<Arc<FleetPublisher>> - per-(deployment) monotonic sequence counter in StatusState - `set_phase` detects actual transitions (prev_phase vs new) and emits a DeploymentState KV write + StateChangeEvent stream publish only on change. No-op re-confirmation still bumps the sequence (lets operator detect duplicate events via sequence comparison) but stays off the wire. - `drop_phase` deletes the device-state KV entry. - `push_event` also publishes a LogEvent to the stream. main.rs: - Builds FleetPublisher after connect_nats, passes into Reconciler. - Publishes DeviceInfo once at startup (empty labels; populated by the selector-targeting branch once it merges). - Spawns a heartbeat loop on 30 s cadence. - Legacy `report_status` AgentStatus task kept running unchanged. 8 unit tests added for the transition-detection + sequence + ring-buffer invariants (drive set_phase / drop_phase / push_event with fleet: None). 
18 contract tests from M1 still green.	2026-04-22 14:04:58 -04:00
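The transition-detection rule in `set_phase` (bump the sequence on every call, emit only on an actual change) might look roughly like this. It is a sketch under assumptions: the struct layout and the return-an-`Option` shape are invented for illustration, and the real reconciler publishes to NATS rather than returning the event.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Running,
    Failed,
    Pending,
}

/// Event shape following the `from: Option<Phase>, to: Phase, sequence` description.
#[derive(Debug, PartialEq, Eq)]
pub struct StateChangeEvent {
    pub deployment: String,
    pub from: Option<Phase>,
    pub to: Phase,
    pub sequence: u64,
}

/// Hypothetical per-device status state (field names are assumptions).
#[derive(Default)]
pub struct StatusState {
    phases: BTreeMap<String, Phase>,
    sequences: BTreeMap<String, u64>,
}

impl StatusState {
    /// Record a phase. Returns an event only when the phase actually changed;
    /// a no-op re-confirmation still bumps the sequence but returns `None`.
    pub fn set_phase(&mut self, deployment: &str, to: Phase) -> Option<StateChangeEvent> {
        let seq = self.sequences.entry(deployment.to_string()).or_insert(0);
        *seq += 1;
        let sequence = *seq;
        let from = self.phases.insert(deployment.to_string(), to);
        if from == Some(to) {
            return None;
        }
        Some(StateChangeEvent {
            deployment: deployment.to_string(),
            from,
            to,
            sequence,
        })
    }

    /// Forget a deployment's phase. Returns true if an entry was removed,
    /// which is when the caller would delete the `device-state` KV key.
    pub fn drop_phase(&mut self, deployment: &str) -> bool {
        self.phases.remove(deployment).is_some()
    }
}
```

With this shape, a consumer that sees sequence 1 followed by sequence 3 can tell that one call in between produced no wire traffic.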
Jean-Gabriel Gill-Couture	bfef5fad54	feat(contracts): M1 — Chapter 4 wire-format types + bucket/subject constants First milestone of the aggregation rework. Lands the contract layer without any runtime side effects: the agent + operator still run their legacy paths unchanged. New types (module `fleet`): - DeviceInfo: routing labels + inventory, rewritten on label change. Stored in KV `device-info` at `info.<device_id>`. - DeploymentState: current phase per (device, deployment). Stored in KV `device-state` at `state.<device>.<deployment>`. Authoritative snapshot; operator rebuilds counters from it on cold-start. - HeartbeatPayload: tiny liveness ping in KV `device-heartbeat`. Payload capped by a test (< 96 bytes) so it stays cheap at 1M-device rates. - StateChangeEvent: `from: Option<Phase>, to: Phase, sequence` emitted on each transition to JS stream `device-state-events` on subject `events.state.<device>.<deployment>`. Operator folds these events into in-memory counters. - LogEvent: shorter-retention user-facing event log to JS stream `device-log-events` on subject `events.log.<device>`. Transport constants + key/subject helpers in `kv` with cross-component wire-stability tests so a rename here gets caught. 10 new tests (roundtrip serde, forward-compat parse, size bound, key/subject format). Legacy `AgentStatus` tests + constants stay green; retirement is scheduled for M8 once the live path has switched over.	2026-04-22 13:57:57 -04:00
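The key and subject layouts listed in this commit can be captured by small helpers like the ones below. The bucket and stream names and the key patterns are taken from the commit message; the constant and function names are hypothetical and may not match the `kv` module.

```rust
// Bucket and stream names as given in the commit message.
pub const BUCKET_DEVICE_INFO: &str = "device-info";
pub const BUCKET_DEVICE_STATE: &str = "device-state";
pub const BUCKET_DEVICE_HEARTBEAT: &str = "device-heartbeat";
pub const STREAM_STATE_EVENTS: &str = "device-state-events";
pub const STREAM_LOG_EVENTS: &str = "device-log-events";

/// KV key for a device's DeviceInfo: `info.<device_id>`.
pub fn device_info_key(device_id: &str) -> String {
    format!("info.{device_id}")
}

/// KV key for a DeploymentState: `state.<device>.<deployment>`.
pub fn device_state_key(device: &str, deployment: &str) -> String {
    format!("state.{device}.{deployment}")
}

/// Subject for a StateChangeEvent: `events.state.<device>.<deployment>`.
pub fn state_event_subject(device: &str, deployment: &str) -> String {
    format!("events.state.{device}.{deployment}")
}

/// Subject for a LogEvent: `events.log.<device>`.
pub fn log_event_subject(device: &str) -> String {
    format!("events.log.{device}")
}
```

Pinning the literal strings in tests on both the agent and operator side is what lets a rename be caught before it silently splits the wire format.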
Jean-Gabriel Gill-Couture	0decb1ab61	docs(iot): chapter 4 — aggregation architecture at IoT scale (design draft) Design doc for the aggregation rework. Chapter 2's aggregator (O(deployments × devices) per tick) works for a 10-device smoke but doesn't scale past a partner fleet of even modest size. Replaces it with CQRS-style incrementally-maintained counters driven by JetStream state-change events, device-authoritative per-device state keys, and a separate log transport that doesn't touch JetStream. Review first, implement after. No runtime code changes in this commit. Covers data model (KV buckets, streams, subjects), counter invariants (transition-based, duplicate-safe), cold-start protocol (walk once, then consume), CR patch cadence (debounced dirty set), failure modes, scale back-of-envelope for 1M devices + 10k deployments, schema migration path (clean break, same CRD v1alpha1), and eight-milestone landing plan.	2026-04-22 12:40:06 -04:00
Jean-Gabriel Gill-Couture	c081f2cf5e	style(iot-agent): silence two clippy nits in Chapter 2 code push_str("…") → push('…'), and drop redundant .trim() before .split_whitespace() in /proc/meminfo parsing.	2026-04-21 23:23:11 -04:00
Jean-Gabriel Gill-Couture	c1dc7d56ea	docs(iot): mark Chapter 2 shipped in v0_1_plan Chapter 1 + Chapter 2 are both green end-to-end on x86_64 and aarch64. Chapter 3 (helm packaging) is next. Design sketches kept as the historical record — the running code is the source of truth for 'how'.	2026-04-21 23:01:47 -04:00
Jean-Gabriel Gill-Couture	9a08978e34	style(kvm): rustfmt the overlay args vec literal	2026-04-21 23:00:20 -04:00
Jean-Gabriel Gill-Couture	9fb3691c3d	feat(kvm): honor spec.disk_size_gb in overlay creation qemu-img create with no trailing size inherits the backing image's virtual size. The Ubuntu cloud image ships with ~2 GiB of root, which fills up as soon as we sideload a container tarball in the smoke. Pass disk_size_gb through to qemu-img and rely on cloud-initramfs-growroot (already in the base) to grow the partition on first boot. example_iot_vm_setup defaults to 16 GiB.	2026-04-21 22:41:59 -04:00
Jean-Gabriel Gill-Couture	633f015444	fix(iot/smoke-a4): probe NATS TCP port after Available condition kubectl wait --for=Available reports on pod readiness, but k3d's klipper-lb takes a few more seconds to wire the host loadbalancer port to Service endpoints. Without this extra wait the operator races the routing and dies with 'expected INFO, got nothing.'	2026-04-21 22:32:25 -04:00
Jean-Gabriel Gill-Couture	087af2f6f4	fix(iot/smoke-a4): single-archive save + post-load tagging on VM `podman save -m` produces an OCI multi-image archive format that older podman versions in the Ubuntu 24.04 cloud image cannot load: Error: payload does not match any of the supported image formats: * oci-archive: loading index: ...index.json: no such file or directory Downgrade to the single-image docker-archive format (default for `podman save`): save the source image once, load once in the VM, then `podman tag` twice to expose it under `localdev/nginx:v1` and `:v2`. Same bits on disk, two distinct tag references, so the upgrade test still sees a container-id change when the Score flips from v1 to v2.	2026-04-21 22:28:59 -04:00
Jean-Gabriel Gill-Couture	97e10927d2	fix(iot/smoke-a4): arch-match guard on cached SRC_IMAGE Running smoke-a4 with `ARCH=aarch64` after an `ARCH=x86-64` run rebinds the local `nginx:alpine` tag to arm64 (or vice versa), silently breaking the other arch's next run. Fail fast if the cached image arch doesn't match the smoke's ARCH, with the exact command to fix it (`podman pull --platform=linux/<arch> ...`).	2026-04-21 22:19:46 -04:00
Jean-Gabriel Gill-Couture	92f1519f8e	feat(podman): IfNotPresent pull + smoke-a4 tarball sideload for images Two changes that compose into one win: the smoke no longer needs a functional Docker Hub to exercise the agent → podman → container loop. harmony/src/modules/podman/topology.rs — IfNotPresent for image pull `PodmanTopology::ensure_service_running` was calling `podman pull` on every reconcile, even when the image was already in the local store. For a long-lived device agent reconciling against a public registry, that's a guaranteed rate-limit collision: Docker Hub caps unauthenticated pulls at 100 manifests per 6 h per IP, and an agent ticking every 30 s chews through that allowance in a day. Change the pull path to check the local store first: if images.get(image).exists().await? { return Ok(()); } // else: pull Matches Kubernetes' `imagePullPolicy: IfNotPresent` semantics. Correct default for the IoT platform: upgrades change the image STRING (tag or digest), so they still hit the pull branch — "use local if available, pull the new thing if the reference changed." iot/scripts/smoke-a4.sh — tarball sideload in place of registry An earlier iteration of this smoke stood up a local `registry:2` container and pushed tagged images into it. That pattern itself needs to pull `registry:2` from Docker Hub — cute demo, still Hub-dependent. Gone now. New phase 4.5 / 5c pair: 4.5: podman save the cached `nginx:alpine` under two local tags (`localdev/nginx:v1`, `localdev/nginx:v2`) into a tarball on the host. 5c: scp the tarball to the VM, `podman load` it into the iot-agent user's rootless store. Paired with the new IfNotPresent semantics, the agent's reconcile sees both images already present and never touches a registry. The upgrade test still works because `v1` and `v2` are distinct tag strings → spec drift → container id changes. Dropped the `docker` preflight (no more k3d-side registry transfer) and the `LOCAL_REGISTRY_*` env vars. 
Verified end-to-end: x86 smoke-a4 --auto PASS. - apply v1 → container up → curl 200 - .status.aggregate.succeeded = 1 (Chapter 2 aggregator working) - apply v2 → container id changes (upgrade confirmed) - delete → container removed Aarch64 run next.	2026-04-21 22:15:37 -04:00
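The IfNotPresent behaviour can be illustrated without a podman socket by abstracting the image store behind a trait. Everything here (`ImageStore`, `FakeStore`, `ensure_image`) is a hypothetical stand-in for the example; the real change lives in `PodmanTopology::ensure_service_running` and talks to podman's API.

```rust
use std::collections::BTreeSet;

/// Hypothetical abstraction over a local image store plus a registry.
pub trait ImageStore {
    fn exists(&self, image: &str) -> bool;
    fn pull(&mut self, image: &str) -> Result<(), String>;
}

/// IfNotPresent: use the local image when available, pull only when missing.
/// Returns Ok(true) if a pull happened, Ok(false) if the local copy was used.
pub fn ensure_image<S: ImageStore>(store: &mut S, image: &str) -> Result<bool, String> {
    if store.exists(image) {
        return Ok(false);
    }
    store.pull(image)?;
    Ok(true)
}

/// In-memory fake that counts registry pulls.
#[derive(Default)]
pub struct FakeStore {
    pub local: BTreeSet<String>,
    pub pulls: u32,
}

impl ImageStore for FakeStore {
    fn exists(&self, image: &str) -> bool {
        self.local.contains(image)
    }

    fn pull(&mut self, image: &str) -> Result<(), String> {
        self.pulls += 1;
        self.local.insert(image.to_string());
        Ok(())
    }
}
```

An upgrade changes the image string, so it misses the local check and reaches the pull branch exactly once; repeated reconciles of the same reference never touch the registry.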
Jean-Gabriel Gill-Couture	37e69b36cf	feat(iot-operator): aggregate agent-status into DeploymentStatus.aggregate The operator watches the \`agent-status\` bucket, keeps a per-device snapshot in memory, and folds it into each Deployment CR's \`.status.aggregate\` subtree every 5 seconds. The answer to the user's stated requirement — "CRD .status reflect-back: per-device succeeded/failed counts + recent log lines" — now lives in the CR itself, observable via \`kubectl get -o jsonpath\` or any UI that speaks k8s status subresources. Shape (in iot/iot-operator-v0/src/crd.rs) DeploymentStatus { observed_score_string, // unchanged; controller change-detect aggregate: Option<{ succeeded: u32, // devices with Phase::Running failed: u32, // devices with Phase::Failed pending: u32, // devices with Phase::Pending or // reported-but-no-phase-entry-yet unreported: u32, // target devices that never heartbeated last_error: Option<{ // most recent failing device + short msg device_id, message, at }>, recent_events: Vec<{ // last-N events across the fleet, newest first at, severity, device_id, message, deployment }>, last_heartbeat_at, // freshness signal for the whole fleet }> } New module \`iot/iot-operator-v0/src/aggregate.rs\` - \`watch_status_bucket\`: subscribes to \`status.>\` on the agent-status bucket, maintains a \`BTreeMap<device_id, AgentStatus>\` in memory. Malformed payloads + malformed keys log-and-skip; the snapshot map is always the latest good shape. - \`aggregate_loop\`: 5 s ticker. Per tick: list Deployment CRs, clone the snapshot (no lock held across network calls), compute each CR's aggregate, JSON-Merge-Patch \`.status.aggregate\`. Merge patch composes cleanly with the controller's \`observedScoreString\` patch — neither clobbers the other. 
- \`compute_aggregate\` pure fn: classification logic is in one place, four unit tests pin its behaviour (counts + unreported, reported-but-no-phase-entry = pending, event filter matches deployment name only, status-key parser). Operator wiring (\`main.rs\`) \`run()\` now opens both KV buckets at startup, spawns the controller and the aggregator concurrently via \`tokio::select!\`. Either returning an error tears the process down — kube-rs's Controller already absorbs transient reconcile errors internally, so anything escaping is genuinely fatal. Controller tweak The apply path's \`patch_status\` was rebuilding the whole \`DeploymentStatus\` struct, which would clobber the aggregator's writes. Switched to raw JSON-Merge-Patch for the \`observedScoreString\` field only. Behaviour preserved, aggregate subtree left intact. Smoke assertion (smoke-a4.sh --auto) After apply + curl succeeds, the --auto path now asserts \`kubectl get deployment.iot.nationtech.io ... -o jsonpath='{.status.aggregate.succeeded}'\` reaches 1 within 60 s. Proves the full agent → status bucket → operator aggregate → CRD status loop, end to end. Verified locally: \`cargo test -p iot-operator-v0 --lib\` 4/4 green, \`cargo check --all-targets --all-features\` clean.	2026-04-21 21:50:00 -04:00
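The classification rules pinned by the unit tests (counts plus unreported, and reported-but-no-phase-entry counted as pending) could be expressed along these lines. This is a simplified sketch: it models each device's `AgentStatus` as just a deployment-to-phase map and omits `last_error`, `recent_events`, and `last_heartbeat_at`; the signature is an assumption.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Running,
    Failed,
    Pending,
}

/// Count fields of `.status.aggregate` as described in the commit message.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Aggregate {
    pub succeeded: u32,
    pub failed: u32,
    pub pending: u32,
    pub unreported: u32,
}

/// `snapshots` maps device_id to that device's latest per-deployment phases.
pub fn compute_aggregate(
    deployment: &str,
    targets: &[&str],
    snapshots: &BTreeMap<String, BTreeMap<String, Phase>>,
) -> Aggregate {
    let mut agg = Aggregate::default();
    for device in targets {
        match snapshots.get(*device) {
            // Target device that never published a status.
            None => agg.unreported += 1,
            Some(phases) => match phases.get(deployment) {
                Some(Phase::Running) => agg.succeeded += 1,
                Some(Phase::Failed) => agg.failed += 1,
                // Reported, but no phase entry yet, counts as pending.
                Some(Phase::Pending) | None => agg.pending += 1,
            },
        }
    }
    agg
}
```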
Jean-Gabriel Gill-Couture	7dd89a7617	feat(reconciler-contracts): enrich AgentStatus with per-deployment phase + events + inventory Chapter 2 groundwork. The on-wire AgentStatus the agent publishes every 30 s was only carrying device_id + status + timestamp — not enough for the operator to answer "how are my deployments doing." Enrich it so the operator can aggregate into a useful DeploymentStatus.aggregate subtree on the CR (second commit). harmony-reconciler-contracts/src/status.rs - `AgentStatus.deployments: BTreeMap<String, DeploymentPhase>` — keyed by deployment name (CR's metadata.name). Each phase carries `{ phase: Running|Failed|Pending, last_event_at, last_error }`. - `AgentStatus.recent_events: Vec<EventEntry>` — ring buffer of the most recent reconcile events on this device. Each entry is `{ at, severity: Info|Warn|Error, message, deployment: Option }`. Bounded agent-side to keep JetStream per-message size sane. - `AgentStatus.inventory: Option<InventorySnapshot>` — hostname, arch, os, kernel, cpu_cores, memory_mb, agent_version. Published once on startup. - All three new fields are `#[serde(default)]` — mixed-fleet upgrades don't break: an old agent's payload deserializes into the new struct (deployments empty, events empty, inventory None); a new agent's payload deserializes into an old operator just losing the fields. New tests (kept forward-compat front and center): - `minimal_status_roundtrip` — empty maps / None - `enriched_status_roundtrip` — full population - `old_wire_format_parses_into_enriched_struct` — pre-Chapter-2 payload must still parse (the upgrade guarantee) - `wire_keys_present` — literal wire-format pins for smoke greps iot-agent-v0 Reconciler gains a `StatusState { deployments, recent_events }` side map with a bounded ring buffer (`EVENT_RING_CAP = 32`). Every code path that changes deployment state now also records phase + event: - `apply()`: Pending → Running on success, Failed + error event on failure.
- `remove()`: drops phase, emits "deployment deleted" info event. - `tick()` (periodic reconcile): keeps phase at Running on noop; flips to Failed + event on error (deliberately no event on successful no-change ticks — 30 s cadence would drown the ring). New helper `deployment_from_key(key)` unwraps `<device>.<deployment>` into just the deployment name. `short(s)` truncates error strings to 512 chars so the payload stays well under NATS JetStream limits. `report_status()` in main.rs now snapshots the reconciler's status state on every heartbeat and publishes the full enriched payload alongside a startup-captured InventorySnapshot. Inventory reads `/proc/sys/kernel/osrelease` + `/proc/meminfo` + `std::env::consts::ARCH` with graceful fallbacks — no new sys-info crate dep. Verified: `cargo test -p harmony-reconciler-contracts --lib` 7/7 green (5 new). Operator consumption of the new fields lands in the next commit.	2026-04-21 21:45:48 -04:00
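The three small agent-side helpers mentioned here (the bounded ring, `short`, and `deployment_from_key`) can be sketched with the standard library alone. The constant value and the 512-character limit come from the commit message; the function bodies are assumptions, in particular that events are stored as plain strings and that the device id contains no dot, so the key splits at the first one.

```rust
use std::collections::VecDeque;

/// Ring capacity quoted in the commit message.
pub const EVENT_RING_CAP: usize = 32;

/// Append an event, evicting the oldest entry once the ring is full.
pub fn push_event(ring: &mut VecDeque<String>, message: String) {
    if ring.len() == EVENT_RING_CAP {
        ring.pop_front();
    }
    ring.push_back(message);
}

/// Truncate to at most 512 characters, cutting on a char boundary.
pub fn short(s: &str) -> String {
    s.chars().take(512).collect()
}

/// Unwrap `<device>.<deployment>` into the deployment name.
/// Assumes the device id itself contains no '.'.
pub fn deployment_from_key(key: &str) -> Option<&str> {
    key.split_once('.').map(|(_, deployment)| deployment)
}
```

Counting characters rather than bytes in `short` avoids slicing through a multi-byte UTF-8 sequence in an error message.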
Jean-Gabriel Gill-Couture	ec3d3a9d63	fix(iot/smoke-a4): sideload NATS image into k3d to dodge Docker Hub rate limits Docker Hub's unauthenticated rate limit (100 pulls per 6h per IP, counted per-manifest-query) is the most reliable way for a CI-style smoke loop to produce false negatives. The NATS pod failing with '429 Too Many Requests' after a handful of runs today was that — not a real regression. Fix inside the smoke: before running the install Score, sideload the NATS image into the k3d cluster via a podman→docker→k3d bridge: - If the image isn't already in docker's store: - If it's not in podman's store either, podman pull (this is the one-time hit we can't avoid). - podman save → docker load. - k3d image import into the cluster's containerd. Steady-state this is a few-hundred-ms operation (no Hub calls, no registry traffic). Require docker in the preflight list since we depend on it for the cross-runtime bridge. Also bump the Available-wait from 60 s to 120 s — the post-import pod spin-up is fast but the scheduler + loadbalancer update take longer than I initially budgeted. VM-side nginx pulls are still at Hub's mercy; addressing that requires either (a) docker login before the smoke, (b) an authenticated registry mirror, or (c) arch-specific image pre-seeding into the VM. All Chapter-2+ follow-ups.	2026-04-21 21:37:55 -04:00
Jean-Gabriel Gill-Couture	9fd283183d	fix(iot/smoke-a4): per-arch container-wait timeouts for TCG Initial 180 s wait assumed native-KVM x86 speed. Under aarch64 TCG the same nginx:latest pull (~250 MB image + layered userns unpack) takes 4-8 min observed; 180 s was catching post-heartbeat reconcile mid-pull and reporting FAIL. Bump `CONTAINER_WAIT_STEPS` per arch: - x86 KVM: 90 iterations × 2 s = 180 s (unchanged) - aarch64 TCG: 450 × 2 s = 900 s (15 min) Apply to both the 'first-boot container' and 'upgrade container id change' loops.	2026-04-21 20:53:59 -04:00
Jean-Gabriel Gill-Couture	a098e48e29	fix(iot/smoke-a4): query podman as iot-agent, not iot-admin The agent runs rootless podman as the `iot-agent` user (system user, created by IotDeviceSetupScore). Each user has their own podman state tree under ~/.local/share/containers. The smoke was running \`podman ps\` as \`iot-admin\` (the ssh login user), so it saw an empty store even when the agent had happily created the nginx container — leading to a spurious "container never appeared" failure despite the reconciler reporting SUCCESS. Fix: go through \`sudo su - iot-agent -c\` with \`XDG_RUNTIME_DIR=/run/user/\$(id -u)\` so the command runs in the right user session. Update the hand-off command menu with the equivalent one-liner so the user can inspect the fleet's actual container state without tripping over the same gotcha. Smoke-a4 PASSes end-to-end on x86_64: - CRD apply → container materializes - Upgrade via new image → container id changes (not patched) - Delete → container removed With the previous commit (ensure_subordinate_ids), this closes Chapter 1 of ROADMAP/iot_platform/v0_1_plan.md: the full v0 loop works, hands-on driven by kubectl / a typed Rust binary / natsbox.	2026-04-21 20:25:00 -04:00
Jean-Gabriel Gill-Couture	1737374a93	fix(iot/linux): ensure_subordinate_ids so rootless podman can pull images Ubuntu 24.04 `useradd --system` does not allocate `/etc/subuid` + `/etc/subgid` ranges. Rootless podman silently fails on image-layer unpack: potentially insufficient UIDs or GIDs available in user namespace (requested 0:42 for /etc/gshadow): ... lchown /etc/gshadow: invalid argument `smoke-a1.sh` didn't hit this because it runs the agent on the host user, which has subuid/subgid populated by default. `smoke-a4.sh` drives a podman pull inside the VM — the FIRST time we actually exercise rootless-podman-on-a-fresh-system, and the failure surfaces immediately. The fix belongs in harmony, not in ad-hoc cloud-init scripts. Add `UnixUserManager::ensure_subordinate_ids` alongside the existing `ensure_user` + `ensure_linger` methods: - `domain/topology/host_configuration.rs`: new trait method. Doc explains why every rootless-container-runtime consumer needs it. - `modules/linux/ansible_configurator.rs`: impl follows `ensure_linger`'s pattern — a grep probe on /etc/subuid+/etc/subgid, then a single `usermod --add-subuids 100000-165535 --add-subgids 100000-165535` only when missing. Idempotent, no-ops on re-run. - `modules/linux/topology.rs`: forwarder for `LinuxHostTopology`. - `modules/iot/setup_score.rs`: call the new method right after `ensure_linger` in `IotDeviceSetupScore`. Any future consumer that runs rootless podman reaches for the same primitive. Verified: `cargo check --all-features` clean. End-to-end smoke-a4 regression pending (re-running after this commit).	2026-04-21 20:09:03 -04:00
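The probe-then-modify idempotency pattern described for `ensure_subordinate_ids` can be shown as a pure decision function. This is only a sketch: the real implementation runs a grep probe and `usermod` on the remote host through the Ansible configurator, while the helper names below are hypothetical and operate on file contents passed in as strings. The range values and `usermod` flags are the ones quoted in the commit message.

```rust
/// True if `user` owns a range in an /etc/subuid or /etc/subgid style file,
/// where each line has the form `name:start:count`.
pub fn has_subordinate_range(file_contents: &str, user: &str) -> bool {
    file_contents
        .lines()
        .any(|line| line.split(':').next() == Some(user))
}

/// Returns the `usermod` arguments to run, or None when both files already
/// list the user, so a re-run is a no-op.
pub fn usermod_args_if_missing(subuid: &str, subgid: &str, user: &str) -> Option<Vec<String>> {
    if has_subordinate_range(subuid, user) && has_subordinate_range(subgid, user) {
        return None;
    }
    Some(
        [
            "--add-subuids",
            "100000-165535",
            "--add-subgids",
            "100000-165535",
            user,
        ]
        .iter()
        .map(|s| s.to_string())
        .collect(),
    )
}
```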
Jean-Gabriel Gill-Couture	b226bc9d29	feat(nats): NatsBasicScore gets LoadBalancer expose mode Kubernetes NodePort Services must use a port in the apiserver's configured nodeport range (default 30000-32767). NatsBasicScore's first cut accepted any port via `.node_port(port)`, which was fine for strict use of the capital-N NodePort Service type, but made the demo's "use NATS client port 4222 directly from the host" story awkward. Replace the `node_port: Option<i32>` field with a proper `NatsServiceType` enum (ClusterIP | NodePort(i32) | LoadBalancer). Three builder methods — one per variant. LoadBalancer is the right idiom for the demo: k3d's built-in `klipper-lb` fronts LoadBalancer Services on their `port` (not their nodePort), so `k3d cluster create -p 4222:4222@loadbalancer` delivers external traffic straight to the Service's client port. No nodeport range juggling. Signatures: NatsBasicScore::new(name, namespace) // ClusterIP default .node_port(30422) // NodePort(30422) .load_balancer() // LoadBalancer .jetstream(true) .image("docker.io/library/nats:2.10-alpine") Tests: 5 pass. New assertion: `load_balancer()` produces a Service with type LoadBalancer and no pinned nodePort (apiserver assigns). Consumers: - `example_iot_nats_install` gets a `--expose {cluster-ip | node-port | load-balancer}` flag (default `load-balancer` since that's what the demo wants). The legacy `--node-port N` flag survives as the NodePort port value. - `smoke-a4.sh` asks for `--expose load-balancer`, matching its `-p 4222:4222@loadbalancer` k3d port mapping.	2026-04-21 19:10:19 -04:00
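The enum-plus-builder shape described above might look like the following. The variant names and builder method names follow the commit message; the struct fields, the `cluster_ip()` method name, the default image string, and the `in_default_node_port_range` helper are assumptions added for illustration, and the real Score also carries the logic that renders Kubernetes resources.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NatsServiceType {
    ClusterIP,
    NodePort(i32),
    LoadBalancer,
}

/// Simplified stand-in for the Score: configuration only, no k8s rendering.
#[derive(Debug, Clone)]
pub struct NatsBasicScore {
    pub name: String,
    pub namespace: String,
    pub service_type: NatsServiceType,
    pub jetstream: bool,
    pub image: String,
}

impl NatsBasicScore {
    /// ClusterIP and JetStream-on are the defaults described in the log.
    pub fn new(name: &str, namespace: &str) -> Self {
        Self {
            name: name.to_string(),
            namespace: namespace.to_string(),
            service_type: NatsServiceType::ClusterIP,
            jetstream: true,
            // Assumed default; the log only shows this string as an example.
            image: "docker.io/library/nats:2.10-alpine".to_string(),
        }
    }

    pub fn cluster_ip(mut self) -> Self {
        self.service_type = NatsServiceType::ClusterIP;
        self
    }

    pub fn node_port(mut self, port: i32) -> Self {
        self.service_type = NatsServiceType::NodePort(port);
        self
    }

    pub fn load_balancer(mut self) -> Self {
        self.service_type = NatsServiceType::LoadBalancer;
        self
    }

    pub fn jetstream(mut self, enabled: bool) -> Self {
        self.jetstream = enabled;
        self
    }

    pub fn image(mut self, image: &str) -> Self {
        self.image = image.to_string();
        self
    }
}

/// Default apiserver nodeport range quoted in the commit message.
pub fn in_default_node_port_range(port: i32) -> bool {
    (30000..=32767).contains(&port)
}
```

Modelling the expose mode as an enum makes the invalid combination (a LoadBalancer Service with a pinned nodePort) unrepresentable, which an `Option<i32>` field could not do.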
Jean-Gabriel Gill-Couture	818525824c	chore(iot): make smoke-a4.sh executable Previous commit landed the script without the +x bit (a chmod between write and commit was swallowed). Fix with git update-index --chmod=+x so the file is executable on checkout.	2026-04-21 19:06:58 -04:00
Jean-Gabriel Gill-Couture	5e8fb429ca	feat(iot): smoke-a4.sh — hands-on end-to-end demo harness Composed demo that brings up operator + in-cluster NATS + ARM (or x86) VM agent, then either hands the full stack off to the user with a command menu (default) or drives an apply + upgrade + delete regression loop (`--auto`). Phases: 1. k3d cluster with NATS port exposed via `-p 4222:4222@loadbalancer`. 2. NATS in-cluster via the new `example_iot_nats_install` binary → `NatsBasicScore` → typed k8s_openapi Namespace + Deployment + NodePort Service. 3. CRD install via `iot-operator-v0 install` (Score-based, no yaml). 4. Operator spawned host-side, connects to nats://localhost:4222. 5. VM provisioned via `example_iot_vm_setup` (reused from smoke-a3); agent inside the VM connects to nats://<libvirt-gateway>:4222. 6. Sanity: NATS pod Running, agent heartbeat `status.<device>` present in `agent-status` bucket. 7a. DEFAULT: print a command menu (kubectl watch, typed Rust applier, ssh/console, natsbox one-liners, curl) and block on Ctrl-C with a cleanup trap tearing everything down. 7b. `--auto`: apply nginx:latest, wait for container on the VM, curl, upgrade to nginx:1.26, assert container id CHANGED, curl, delete, assert container gone. Prereqs documented at the top of the script. Handles both x86-64 (native KVM) and aarch64 (TCG emulation) via `ARCH=` env. Design notes captured in ROADMAP/iot_platform/v0_1_plan.md. Uses every piece landed in this branch so far: K8sBareTopology, NatsBasicScore, the typed CR applier, the Score-based CRD install.	2026-04-21 19:03:07 -04:00
Jean-Gabriel Gill-Couture	18dd712f8e	feat(iot): example_iot_nats_install — single-node NATS via NatsBasicScore Small CLI that installs a single-node NATS server into the cluster KUBECONFIG points at, using harmony's `NatsBasicScore` composed against `K8sBareTopology`. This is the glue between `smoke-a4.sh` and the framework Score: cargo run -q -p example_iot_nats_install -- \ --namespace iot-system \ --name iot-nats \ --node-port 4222 Defaults cover the demo exactly: iot-system namespace, NodePort 4222 so the libvirt VM agent can reach NATS through the k3d loadbalancer port mapping. No reinvented topology, no hand-rolled yaml, no helm shell-out. The actual work (Namespace + Deployment + Service with the right selector/ports/probes) lives inside `NatsBasicScore::Interpret` in harmony where it can be reused by any future consumer. Part of ROADMAP/iot_platform/v0_1_plan.md Chapter 1.	2026-04-21 18:33:35 -04:00
Jean-Gabriel Gill-Couture	287ecdfb30	feat(iot): typed-Rust Deployment CR applier (example_iot_apply_deployment) Replaces what would otherwise be a yaml fixture for the hands-on demo. The CRD is already fully typed (DeploymentSpec + ScorePayload + PodmanV0Score + Rollout), so the applier uses those types directly, constructs the CR via kube::Api, and either applies it server-side or prints the JSON for `kubectl apply -f -`. CLI: iot_apply_deployment \ --namespace iot-demo \ --name hello-world \ --target-device iot-smoke-vm \ --image docker.io/library/nginx:latest \ --port 8080:80 # apply iot_apply_deployment --image nginx:1.26 # upgrade (same name, new img) iot_apply_deployment --delete # tear down iot_apply_deployment --print ... # JSON to stdout → kubectl -f - Uses server-side apply (PatchParams::apply().force()) so repeated invocations patch the existing CR cleanly — the upgrade path the demo exercises. To expose the CRD types to an external consumer, iot-operator-v0 gains a thin `src/lib.rs` that re-exports the `crd` module. The binary target now imports from the library (`use iot_operator_v0::crd;`) instead of declaring its own `mod crd;` — avoids compiling the types twice. No change in operator runtime behavior. Part of the ROADMAP/iot_platform/v0_1_plan.md Chapter 1 work.	2026-04-21 18:32:17 -04:00
Jean-Gabriel Gill-Couture	7e2882425f	feat(nats): NatsBasicScore — single-node NATS, no helm/PKI/ingress Harmony's existing NATS story starts at `NatsK8sScore`, which is designed for production multi-site superclusters: TLS-fronted gateways, cert-manager-minted certs, ingress + Route, helm chart with gateway merge blocks, NatsAdmin secret prompts. All of that is overhead for a local smoke or a single-site decentralized deployment that just needs a live JetStream server. Add `NatsBasicScore` beside it. Deliberately minimal: - Single replica - Official `nats:*-alpine` image via typed k8s_openapi Deployment - JetStream (-js) on by default, toggle via builder setter - Namespace created if missing - Service: ClusterIP by default, or NodePort via `.node_port(port)` for off-cluster clients (e.g. a libvirt VM connecting through the host's loadbalancer port) Trait bounds are just `Topology + K8sclient` — no `HelmCommand`, no `TlsRouter`, no `Nats` capability. Composes cleanly with `K8sBareTopology` (added in the previous commit) so consumers can `score.create_interpret().execute(&inventory, &topology)` against any cluster `KUBECONFIG` points at. Constructed via a small builder: NatsBasicScore::new("iot-nats", "iot-system") .node_port(4222) .jetstream(true) Under the hood the interpret runs three `K8sResourceScore`s in sequence (namespace → deployment → service). No new machinery — just composition of existing primitives. Deliberately NOT in scope for this Score: - TLS / PKI — use NatsK8sScore when you need those - Gateways / supercluster — use NatsSuperclusterScore - Auth (user/password or JWT) — add a ConfigMap mount when the Chapter 4 auth work lands Tests (4, all passing): default is ClusterIP; node_port() flips Service to NodePort with the right nodePort field; jetstream() toggle controls the `-js` arg. Part of the "compound framework value" mindset: every future Score that wants a local NATS now points at this one type instead of inventing its own yaml.	2026-04-21 18:29:16 -04:00
Jean-Gabriel Gill-Couture	6863162655	feat(k8s): K8sBareTopology — minimal topology for ad-hoc Score execution Roadmap §12.6 ("topology proliferation") is partially resolved by extracting the ad-hoc InstallTopology from iot-operator-v0/install.rs into harmony as a reusable shared type, now that a second consumer (NatsBasicScore, landing next) makes the extraction genuinely load-bearing rather than speculative. What's new: - harmony/src/modules/k8s/bare_topology.rs — K8sBareTopology carries one K8sClient, implements K8sclient + Topology (noop ensure_ready). Constructors: from_client(name, client) for callers building their own client, from_kubeconfig(name) for callers reading the standard KUBECONFIG chain. - modules::k8s::K8sBareTopology re-export. What's gone: - iot-operator-v0/src/install.rs: the ~30-line InstallTopology struct + its async_trait-decorated impls. The crate also drops async-trait and harmony-k8s as direct deps (neither is used now that the topology is shared). - Long "architectural smell" comment from install.rs — the smell is fixed; the explanation belongs at the shared type now (with the history captured in its module doc). Behavior-preserving. cargo check --all-targets --all-features clean. smoke-a1 wire path unchanged. Compounding-value move: every future Score that needs "apply a typed resource against an existing cluster" consumes K8sBareTopology instead of inventing its own Topology impl. That's the pattern v0 Harmony's design is meant to encourage.	2026-04-21 18:26:30 -04:00
Jean-Gabriel Gill-Couture	d4c8731941	docs(iot): forward plan (v0.1 and beyond) + mark v0 walking skeleton as SHIPPED v0 walking skeleton is substantially done (CRD → operator → NATS KV → on-device agent → podman reconcile; VM-as-device for x86_64 and aarch64 via TCG; power-cycle resilience; operator install via Score instead of yaml/kubectl). Time to switch the `ROADMAP/iot_platform` folder from "plan to build the skeleton" to "plan to build on top of the skeleton." - NEW `ROADMAP/iot_platform/v0_1_plan.md` — the authoritative forward plan. Five chapters in execution order: 1. Hands-on end-to-end demo the user can drive by hand (imminent, fully detailed: composed smoke, typed-Rust CR applier, natsbox command menu, in-cluster NATS). 2. Status reflect-back + inventory (enrich `AgentStatus`, operator aggregates into `.status.aggregate`). 3. Helm chart packaging (ArgoCD deferred — user's clusters have it already, bringing it into the smoke adds no validation value). 4. Zitadel + OpenBao + per-device auth. 5. Frontend (web / CLI / TUI — deferred). Chapters 2-5 are sketched; they expand to their own docs as each becomes the active chapter. - EDIT `ROADMAP/iot_platform/v0_walking_skeleton.md` — add a SHIPPED banner at the top pointing at v0_1_plan.md. Keep the 707-line design diary intact as archaeology; don't rewrite history. - Incorporates the post-v0 architectural principles that emerged from review (no yaml in framework paths, minimal ad-hoc topologies, cross-boundary types in harmony-reconciler-contracts, verify before blaming upstream).	2026-04-21 18:18:20 -04:00
johnride	a79c835b08	Merge pull request 'refactor(operator): replace gen-crd yaml pipeline with a harmony Score' (#271) from feat/install-reconcile-operator-score into feat/iot-walking-skeleton All checks were successful Run Check Script / check (pull_request) Successful in 2m16s Details Reviewed-on: #271	2026-04-21 20:53:31 +00:00
johnride	6676023aa8	Merge branch 'feat/iot-walking-skeleton' into feat/install-reconcile-operator-score All checks were successful Run Check Script / check (pull_request) Successful in 2m13s Details	2026-04-21 20:53:05 +00:00
Jean-Gabriel Gill-Couture	b8db8241d1	docs(topology): flag InstallTopology smell + add roadmap §12.6 All checks were successful Run Check Script / check (pull_request) Successful in 2m14s Details The InstallTopology in iot/iot-operator-v0/src/install.rs is architecturally a workaround: harmony's existing opinionated topologies (K8sAnywhereTopology, HAClusterTopology) have accumulated product-level side effects in ensure_ready that make them unfit for narrow actions like "apply a CRD," so the module vendored its own tiny Topology impl. If this pattern proliferates, the topology ecosystem drifts toward "one bespoke topology per Score," which is exactly the proliferation harmony's design was meant to prevent. Two documentation changes, no code/behavior change: - Inline: doc comment on `InstallTopology` flagging it as a smell, explaining the root cause, and pointing at the roadmap entry below. Anyone finding this code later (or tempted to copy the pattern) reads the warning before they do. - Roadmap §12.6 (new): "Topology proliferation — opinionated topologies leaking into narrow use cases." Captures the architectural direction (minimal `K8sBareTopology` in harmony, unbundle product setup from `ensure_ready`) without prescribing an implementation. Includes an explicit done-check: the smoke test for "this roadmap item is fixed" is that install.rs can delete its inline Topology and one-line against the shared type.	2026-04-21 16:49:48 -04:00
Jean-Gabriel Gill-Couture	588afb9ab9	refactor(operator): replace gen-crd yaml pipeline with a harmony Score All checks were successful Run Check Script / check (pull_request) Successful in 2m12s Details Review feedback: writing yaml and shelling out to kubectl is the exact anti-pattern harmony exists to eliminate. The operator already has typed Rust for its CRD (`#[derive(CustomResource)]`), and harmony-k8s already has a typed apply path. So the "install" step should be a Score, not `cargo run -- gen-crd | kubectl apply -f -`. Changes: - New `iot/iot-operator-v0/src/install.rs` — `install_crds()` builds `Deployment::crd()` via `kube::CustomResourceExt`, wraps it in `harmony::modules::k8s::resource::K8sResourceScore`, and executes the Score against a tiny local `InstallTopology` that just carries a `K8sClient` loaded from `KUBECONFIG`. The local topology exists because `K8sAnywhereTopology::ensure_ready` does a lot of product-level setup (cert-manager, tenant manager, helm probes) that isn't appropriate for a narrow "apply a CRD" action. A 30-line inline topology that implements `K8sclient` + a noop `ensure_ready` is the right-sized abstraction for now. When a larger "install the operator in-cluster" Score lands (Deployment + SA + RBAC + ClusterRoleBinding), that may justify promoting the topology to a shared crate. - Renamed subcommand `gen-crd` → `install`. Old path: print yaml to stdout for kubectl to consume. New path: apply the CRD directly via the Score, using whatever `KUBECONFIG` points at. - Deleted `iot/iot-operator-v0/deploy/crd.yaml` and `deploy/operator.yaml`. The CRD yaml was derived from Rust and committed alongside the source — a drift hazard (nothing guaranteed they stayed in sync). `operator.yaml` was never actually applied by any smoke script; it existed only for documentation. Both go. - Rewired `iot/scripts/smoke-a1.sh` phase 2 to call the `install` subcommand instead of piping yaml to kubectl.
Everything downstream (kubectl wait for Established, apiserver CEL rejection check, operator + agent + container lifecycle) unchanged. - Dropped `serde_yaml` from the operator's `Cargo.toml` — it was only used to print the CRD as yaml. Added `harmony`, `harmony-k8s`, and `async-trait` deps. Verification — `smoke-a1.sh` PASSes end-to-end on x86_64 k3d: k3d cluster → install CRD via Score → apiserver rejects bad score.type (CEL still works through the Score-applied CRD) → operator → agent → nginx container up → curl 200 → delete CR → KV + container removed. Out of scope / follow-up: a proper "install operator in-cluster" Score that also applies Namespace + SA + ClusterRole + ClusterRoleBinding + Deployment (the manifests that used to live in the deleted operator.yaml). Smoke-a1 currently runs the operator as a host-side process, so that Score isn't on the test path today.	2026-04-21 16:37:16 -04:00
johnride	8c94c8e61e	Merge pull request 'refactor(iot): extract iot-contracts crate for cross-boundary types' (#270) from feat/iot-contracts into feat/iot-walking-skeleton All checks were successful Run Check Script / check (pull_request) Successful in 2m16s Details Reviewed-on: #270	2026-04-21 20:13:15 +00:00
Jean-Gabriel Gill-Couture	75c3ef9bb8	refactor(reconciler): rename iot-contracts → harmony-reconciler-contracts All checks were successful Run Check Script / check (pull_request) Successful in 2m29s Details Review feedback: "iot" is the wrong scope label. The pattern this crate encodes — a central operator writing desired state to NATS JetStream KV, a remote agent watching KV and reconciling — is the foundation for harmony's decentralized infrastructure management, not an IoT thing. Raspberry Pi is one concrete use case; the next consumers (OKD fleet agents, edge-compute reconcilers, any host harmony can't reach directly over a control-plane API) aren't IoT either. Rename the crate to reflect what it actually is: - `iot/iot-contracts/` → `harmony-reconciler-contracts/` (moved to the repo root, alongside the other support crates). - Package name `iot-contracts` → `harmony-reconciler-contracts`. - Consumer `Cargo.toml` path references updated in operator, agent. - `use iot_contracts::…` → `use harmony_reconciler_contracts::…` across agent + operator sources. - Crate-level prose in lib.rs + kv.rs rewritten to drop the IoT framing and describe the reconciler pattern in its own terms. - harmony/Cargo.toml drops the dep entirely — after the preceding commit moved podman Score types back in-tree, harmony no longer pulls anything from this crate. No behavior change. Wire format unchanged — the two existing public modules (`kv`, `status`) are byte-identical. Verified: - `cargo check --all-targets --all-features` clean. - `cargo test -p harmony-reconciler-contracts` — 5/5 pass. - x86_64 `smoke-a3.sh` end-to-end PASS (reboot-reconnect included). Out of scope / follow-up: the operator and agent crate names (`iot-operator-v0`, `iot-agent-v0`) and `IotScore` are still IoT-branded. Evaluating whether to flip those in this branch next.	2026-04-21 15:42:07 -04:00
Jean-Gabriel Gill-Couture	954b127152	refactor(podman): move PodmanV0Score back into harmony::modules::podman Review feedback: `ContainerRuntime` is a first-class harmony capability (already lives at `harmony/src/domain/topology/container_runtime.rs`) and the Score types that describe what containers a caller wants running belong next to the trait impls, not hidden in an IoT-labeled contracts crate. Putting `PodmanService`, `PodmanV0Score`, and `IotScore` in `iot-contracts` conflated the product-shape (IoT fleet agent) with a reusable container-orchestration primitive. Move the data definitions (plus the three serde tests) back to `harmony/src/modules/podman/score.rs` where they were before the extraction in commit `24b94a3`. That file now again holds the types and their `Score<T>` / `Interpret<T>` trait impls in one place. No behavior change: - `harmony::modules::podman::{IotScore, PodmanV0Score, PodmanService}` re-exports still resolve (through the restored local module rather than a forwarded re-export from iot-contracts). - The single external consumer that imports these types — `iot-agent-v0/src/reconciler.rs` — already went through `harmony::modules::podman::*`, so no import flip needed. iot-contracts now holds only the cross-boundary bits that are genuinely reconciler-wire-format-specific (bucket names + key helpers, `AgentStatus`, `Id` re-export). A follow-up commit will rename the crate itself to reflect that scope. Verification: `cargo test -p harmony --features podman --lib podman` (3 score tests pass in their restored home), `cargo test -p iot-contracts` (5 remaining tests), `cargo check --all-features` clean.	2026-04-21 15:36:16 -04:00
Jean-Gabriel Gill-Couture	0d01a71cd5	feat(iot-contracts): type AgentStatus fields with Id + DateTime<Utc> All checks were successful Run Check Script / check (pull_request) Successful in 2m8s Details `AgentStatus.device_id` and `AgentStatus.timestamp` were stringly typed. Both now carry real types that prevent a whole class of wire-format typos while keeping the on-wire JSON shape intact. device_id: String → harmony_types::id::Id Agent config + heartbeat payload now share the same `Id` that the example IoT pipeline already uses for `IotDeviceSetupConfig`. Mixing a device id with a deployment name or arbitrary `String` is now a type error. `Id` is re-exported from `iot-contracts` so consumers don't need a direct `harmony_types` dependency just to name the field. To keep the wire format byte-compatible, `harmony_types::Id` gains `#[serde(transparent)]`. Audit: no consumer in the tree relies on the previous `{"value": "…"}` shape — `Id` is persisted by sqlite via `to_string()`, never serialized directly — so this is a latent-bug fix more than a behavior change. timestamp: String → chrono::DateTime<Utc> The agent was calling `chrono::Utc::now().to_rfc3339()` and stuffing the String into the payload. It now holds a real `DateTime<Utc>` which serde-serializes as RFC 3339 anyway. The smoke script's reboot-gate lex comparison still works: time-digit prefixes resolve before the trailing `Z` (chrono default) vs `+00:00` (prior format) difference matters. Plumbing - `iot/iot-agent-v0/src/config.rs`: `AgentSection.device_id: Id`. TOML deserializes the bare string thanks to `#[serde(transparent)]`. - `iot/iot-agent-v0/src/main.rs`: `watch_desired_state` and `report_status` take `Id` instead of `String`. - `iot/iot-contracts/Cargo.toml`: adds `harmony_types` path dep and `chrono = { workspace, features = ["serde"] }`. Verification - `cargo test -p iot-contracts`: 8/8 passes.
New assertions pin the wire format: `"device_id":"pi-01"` (not `{"value":"pi-01"}`) and `"timestamp":"2026-04-21T18:15:42Z"` (RFC 3339). - x86_64 smoke-a3.sh PASSes end-to-end including the reboot- reconnect loop — wire format remains compatible with the existing smoke-script parsing.	2026-04-21 15:12:11 -04:00
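The reboot-gate argument in 0d01a71cd5 (digit prefixes decide the comparison before the `Z` vs `+00:00` suffix is reached) can be checked with plain string ordering. This is a minimal sketch: `written_after_gate` is a hypothetical helper, not something from the smoke script, and it assumes both timestamps are UTC with identical date/time layout and whole seconds.

```rust
/// Hypothetical helper: true when `candidate` sorts lexically after `gate`.
/// Assumes both strings are RFC 3339 UTC timestamps with the same
/// `YYYY-MM-DDTHH:MM:SS` prefix layout; only the UTC suffix may differ.
fn written_after_gate(gate: &str, candidate: &str) -> bool {
    // Byte-wise comparison: the first differing digit decides the result,
    // so the suffix only matters when the two prefixes are equal.
    candidate > gate
}
```

When the two prefixes are equal to the second, the suffix does decide the outcome (`Z` sorts after `+`), so a gate built this way is only as precise as one second.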
Jean-Gabriel Gill-Couture	7cdf8cb5e7	style(iot/assets): use .args([...]) for the ssh-keygen invocation Replaces an 8-link `.arg("-t").arg("ed25519").arg("-N")…` chain with a single `.args([...])` of string literals, plus one trailing `.arg()` for the `&PathBuf` (kept separate so we don't force it through the `IntoIterator<Item=&str>` channel). No behavior change.	2026-04-21 15:11:51 -04:00
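A sketch of the `.args([...])` shape described in 7cdf8cb5e7, using only `std::process::Command`. The commit does not list the full flag set, so everything after `-t ed25519 -N` here (the empty passphrase, `-q`, `-f`) is an assumption, and `keygen_command` is a hypothetical name.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (but does not run) an ssh-keygen invocation.
/// The exact flag list is assumed for illustration.
fn keygen_command(key_path: &Path) -> Command {
    let mut cmd = Command::new("ssh-keygen");
    // String literals go through one homogeneous `.args([...])` call.
    cmd.args(["-t", "ed25519", "-N", "", "-q", "-f"]);
    // The path is an `&Path`, not a `&str`, so it gets its own `.arg()`
    // instead of being forced into the string-literal array.
    cmd.arg(key_path);
    cmd
}
```

Keeping the path in a separate `.arg()` avoids a lossy `to_str()` conversion just to fit the array's element type.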
johnride	4e787ddb71	Merge pull request 'feat/iot-arm-vm' (#269) from feat/iot-arm-vm into feat/iot-walking-skeleton All checks were successful Run Check Script / check (pull_request) Successful in 2m15s Details Reviewed-on: #269	2026-04-21 19:04:51 +00:00
Jean-Gabriel Gill-Couture	24b94a362d	refactor(iot): extract iot-contracts crate for cross-boundary types All checks were successful Run Check Script / check (pull_request) Successful in 2m13s Details Consolidate the data types, NATS bucket names, and KV key formats that were scattered across the IoT operator, on-device agent, and harmony's podman module. Each was defined in one place and quoted / reimplemented in the others, which is exactly the kind of contract drift the roadmap v0.1 §2 called for consolidating before we start layering new features on top. New crate `iot/iot-contracts`: * score.rs — `IotScore`, `PodmanV0Score`, `PodmanService` (moved from `harmony::modules::podman::score`). Pure data, no harmony deps. * kv.rs — `BUCKET_DESIRED_STATE`, `BUCKET_AGENT_STATUS` constants, `desired_state_key(device, deployment)`, `status_key(device)`. These values used to be hard-coded in five places (agent main.rs, operator main.rs, operator/deploy/operator.yaml, smoke-a1.sh, smoke-a3.sh). Tests lock the literals so a flip can't slip. * status.rs — typed `AgentStatus { device_id, status, timestamp }`. Replaces the anonymous `serde_json::json!{}` the agent was publishing, so the operator can deserialize the heartbeat payload via a shared struct when §12 v0.1 status aggregation lands. Consumer updates: * `harmony::modules::podman::score` now holds only the `Score<T>` / `Interpret<T>` trait bindings; the pure types are re-exported from iot-contracts. Trait impls can't move because the trait lives in harmony, so this is the cleanest split. * `iot-operator-v0` uses `BUCKET_DESIRED_STATE` and `desired_state_key` — the inline `kv_key` fn now delegates so the existing internal call sites stay untouched. * `iot-agent-v0` uses `BUCKET_DESIRED_STATE`, `BUCKET_AGENT_STATUS`, `status_key`, and `AgentStatus` for the heartbeat publish. No behavior change. Tests: `cargo test -p iot-contracts` passes (8/8). 
Regression: `smoke-a3.sh` on x86_64 PASSes end-to-end (reboot-reconnect loop included) — wire format is byte-identical to the pre-refactor serialization. Next consumers on deck: operator-side status aggregation (§12 v0.1 #3) and journald log streaming (§12 v0.1 #5), both of which need shared types across the operator/agent boundary and were the reason this extraction was prioritized.	2026-04-21 14:55:31 -04:00
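The `kv.rs` idea from 24b94a362d (shared constants plus key helpers, with tests that lock the literals) might look roughly like the sketch below. The commit names the constants and functions but not their values, so the bucket strings and the key layout here are hypothetical placeholders.

```rust
// Hypothetical literals: the real values are defined in the contracts crate.
pub const BUCKET_DESIRED_STATE: &str = "desired-state";
pub const BUCKET_AGENT_STATUS: &str = "agent-status";

/// Key under which the operator writes one deployment for one device.
/// The `<device>.<deployment>` layout is an assumption for illustration.
pub fn desired_state_key(device: &str, deployment: &str) -> String {
    format!("{device}.{deployment}")
}

/// Key under which an agent publishes its heartbeat (layout assumed).
pub fn status_key(device: &str) -> String {
    format!("status.{device}")
}
```

With operator, agent, and smoke scripts all reading from one module, a unit test asserting the exact strings is enough to catch an accidental rename before it reaches the wire.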
Jean-Gabriel Gill-Couture	762e3b5b99	fix(kvm): wait for port 22 after DHCP lease when first_boot is set All checks were successful Run Check Script / check (pull_request) Successful in 2m14s Details `wait_for_ip` returns as soon as libvirt sees a DHCP lease, but the guest may still be minutes away from accepting SSH connections — cloud-init is usually mid-firstboot (SSH host-key generation, runcmd, etc.). Any Score that SSHes in immediately after `ensure_vm` resolves races with sshd startup: ansible.builtin.ping failed against 192.168.122.11: UNREACHABLE! ssh: connect to host 192.168.122.11 port 22: Connection refused This is painful on native KVM (seconds) and catastrophic under TCG (1-3 min between DHCP and sshd listening). When `spec.first_boot.is_some()` — i.e. the caller asked us to run cloud-init and therefore almost certainly intends to SSH next — also block on `wait_for_tcp_port(ip, 22, budget)` before returning. The budget is reused from `wait_for_ip` (300 s x86_64 / 1800 s aarch64) because if cloud-init takes that long to bring SSH up, something is broken that a longer wait wouldn't fix. `wait_for_tcp_port` uses 1 s backoff polling with a 5 s per-attempt TCP connect timeout, so a silently dropped SYN doesn't burn half the budget on a single hung syscall. Cases without `first_boot` (caller bringing their own pre-baked image and not expecting SSH) get the old behavior: return as soon as DHCP resolves.	2026-04-21 14:08:43 -04:00
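The polling behavior described in 762e3b5b99 (1 s backoff, 5 s per-attempt connect timeout, overall budget) can be sketched with the standard library alone. This is a blocking approximation under assumptions: the real `wait_for_tcp_port` may be async and return a different error type.

```rust
use std::io;
use std::net::{IpAddr, SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Blocks until `ip:port` accepts a TCP connection or `budget` elapses.
fn wait_for_tcp_port(ip: IpAddr, port: u16, budget: Duration) -> io::Result<()> {
    let addr = SocketAddr::new(ip, port);
    let deadline = Instant::now() + budget;
    loop {
        // Bounded connect: a silently dropped SYN costs at most 5 s
        // instead of hanging for the OS default connect timeout.
        match TcpStream::connect_timeout(&addr, Duration::from_secs(5)) {
            Ok(_) => return Ok(()),
            // Budget exhausted: surface the last connect error.
            Err(err) if Instant::now() >= deadline => return Err(err),
            // Refused or timed out: back off 1 s and retry.
            Err(_) => sleep(Duration::from_secs(1)),
        }
    }
}
```

A caller would use it right after the DHCP lease resolves, for example `wait_for_tcp_port(ip, 22, Duration::from_secs(300))?` before the first SSH attempt.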
Jean-Gabriel Gill-Couture	a400ce7ec2	fix(kvm/aarch64): pad edk2 CODE firmware to 64 MiB for pflash QEMU's `virt` machine hardwires pflash unit 0 as a CFI flash device of fixed size 64 MiB. When libvirt's `<loader type='pflash'>` points at a file smaller than that, qemu refuses to start: cfi.pflash01 device '/machine/virt.flash0' requires 67108864 bytes, block backend provides 3145728 bytes Different distros ship the CODE firmware differently: - Pre-padded (upstream QEMU pc-bios/edk2-aarch64-code.fd, Debian/ Ubuntu qemu-efi-aarch64): file is exactly 64 MiB, zero-padded at the tail. Works as-is with libvirt's pflash loader. - Raw edk2 build output (Arch `edk2-aarch64 202508+`): file is ~2-4 MiB, just the firmware volume without pflash padding. Has to be padded before libvirt accepts it. Our discovery previously handed the discovered path straight to libvirt. That works on pre-padded distros and silently fails on raw-output distros. Add `ensure_code_pflash_padded` in modules/kvm/firmware.rs: - If the source is already 64 MiB, return the path unchanged — no copy, no bytes moved. - If smaller, check a cache path (pool_dir/aarch64-code-padded.fd) for a correctly-sized copy newer than the source and reuse it. - Otherwise copy + `File::set_len(64 MiB)` (sparse zero pad, one syscall), chmod 0644, return the cached path. - If larger than 64 MiB, error out — no amount of padding saves us. `ensure_vm_firmware` in topology.rs now runs the discovered code through the padder before handing it to libvirt. One padded copy per pool, reused across every aarch64 VM on that pool. Verification path: `cargo test -p harmony --lib kvm::` passes (26 tests — XML suite unchanged since this is runtime-only).	2026-04-21 14:06:12 -04:00
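The four cases listed in a400ce7ec2 (already 64 MiB, cached copy usable, pad needed, too large) can be sketched as follows. The signature and error type are assumptions; the real `ensure_code_pflash_padded` derives the cache path from the pool directory and returns harmony's own error type.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::{Path, PathBuf};

const PFLASH_SIZE: u64 = 64 * 1024 * 1024;

/// Returns a path to a CODE image that is exactly 64 MiB.
fn ensure_code_pflash_padded(source: &Path, cache: &Path) -> io::Result<PathBuf> {
    let src = fs::metadata(source)?;
    if src.len() == PFLASH_SIZE {
        return Ok(source.to_path_buf()); // pre-padded distro: use as-is
    }
    if src.len() > PFLASH_SIZE {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "firmware image is larger than 64 MiB",
        ));
    }
    // Reuse a correctly sized cached copy that is newer than the source.
    if let Ok(cached) = fs::metadata(cache) {
        if cached.len() == PFLASH_SIZE && cached.modified()? > src.modified()? {
            return Ok(cache.to_path_buf());
        }
    }
    fs::copy(source, cache)?;
    // `set_len` extends the file with zeros; on most filesystems the
    // extension is sparse, so no 64 MiB of data is actually written.
    OpenOptions::new().write(true).open(cache)?.set_len(PFLASH_SIZE)?;
    fs::set_permissions(cache, fs::Permissions::from_mode(0o644))?;
    Ok(cache.to_path_buf())
}
```

Because the original firmware bytes stay at offset 0 and only the tail is zero-filled, the result has the same layout as the pre-padded images some distros ship.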
Jean-Gabriel Gill-Couture	1bde5691fb	feat(kvm/aarch64): TCG perf overrides + entropy + cleanup fixes Three fixes landed during arm smoke debugging. Each is a real correctness / perf issue that would bite anyone running aarch64 under TCG via libvirt, independent of any particular firmware. xml.rs — qemu:commandline overrides for -cpu and -accel `pauth-impdef=on` is a QEMU property of `-cpu max`, not a libvirt `<feature>` entry. Putting it under `<cpu><feature policy='require' name='pauth-impdef'/>` is rejected by libvirt with: error: unsupported configuration: unknown CPU feature: pauth-impdef Route it instead via `<qemu:commandline>` (with the qemu namespace declared on `<domain>`). QEMU takes the LAST `-cpu` arg as authoritative, so libvirt's `-cpu max` followed by our `-cpu max,pauth-impdef=on` yields max + pauth-impdef. Same mechanism forces MTTCG: despite docs claiming QEMU ≥ 9.1 defaults to `thread=multi` on aarch64, observation on QEMU 10.2 shows cross-arch `-accel tcg` runs single-threaded (`vcpu.1.time` stays at 0 forever). Appending `-accel tcg,thread=multi` creates a real per-vcpu thread and roughly halves cold-boot wall time. Also added a `<rng model='virtio'>` device feeding host `/dev/urandom`. aarch64 cloud-init blocks minutes on first-boot SSH host-key generation without it under TCG (entropy pool never fills on its own). Cheap insurance on x86_64 too. topology.rs — 30-min wait_for_ip budget for aarch64 Cold boot under TCG on an 8-core x86 host is 10-15 min even with virtio-rng + pauth-impdef + MTTCG. The previous 900s ceiling trips healthy boots; 1800s covers slower CI workers. smoke-a3.sh — cleanup must pass --nvram `virsh undefine --remove-all-storage` refuses to remove an aarch64 domain without `--nvram`, because NVRAM files aren't considered "storage." 
Before this, a failed run left the domain definition behind with yesterday's XML — subsequent runs would replay the stale XML (ensure_vm is idempotent and doesn't redefine when the domain already exists), masking any XML change until a manual `virsh undefine` was issued. Also bump REBOOT_STEPS to match the new topology-side budget. Verified: `cargo test -p harmony --lib kvm::xml` passes (26/26), including the 5 aarch64 assertions (namespace, cpu block, pflash wiring, qemu:commandline contents for both -cpu and -accel).	2026-04-21 12:43:11 -04:00
Jean-Gabriel Gill-Couture	089fd9583d	fix(kvm/firmware): match current Arch edk2-armvirt layout Current Arch edk2-armvirt ships the pair as /usr/share/edk2/aarch64/QEMU_EFI.fd /usr/share/edk2/aarch64/QEMU_VARS.fd (plus a compatibility copy under /usr/share/edk2-armvirt/aarch64/). The previous CANDIDATES list looked for `QEMU_CODE.fd` and `vars-template-pflash.raw` — neither name matches the actual distro layout, so `discover_aarch64_firmware` reported "no firmware found" on a fully-provisioned Arch host. Add the `QEMU_EFI.fd` + `QEMU_VARS.fd` pair at both Arch paths at the top of the probe order; keep the older raw-pflash variant and the speculative CODE/VARS naming as later fallbacks. Sync the error message's "checked paths" hint with the new list so the diagnostic matches what's actually probed. Verified against /usr/share/edk2/aarch64/QEMU_{EFI,VARS}.fd on this host — `discover_aarch64_firmware` now returns the pair and `cargo run -p example_iot_vm_setup -- --arch aarch64 --bootstrap-only` completes (downloads + sha256-verifies the 598 MB arm64 image and caches it under $HARMONY_DATA_DIR/iot/cloud-images/).	2026-04-20 23:07:52 -04:00
Jean-Gabriel Gill-Couture	934fea7953	fix(iot/preflight): gate aarch64 firmware discovery behind kvm feature The on-device agent builds `harmony` with `default-features = false, features = ["podman"]`, which does not pull in the `kvm` feature. Cross-compiling iot-agent-v0 for `aarch64-unknown-linux-gnu` to put it on a Pi / arm64 VM currently fails with: error[E0433]: failed to resolve: could not find `kvm` in `modules` --> harmony/src/modules/iot/preflight.rs:18:21 use crate::modules::kvm::firmware::discover_aarch64_firmware; Gate the import and the `discover_aarch64_firmware()` call inside `check_iot_smoke_preflight_for_arch` behind `#[cfg(feature = "kvm")]`. Callers who build `harmony` without kvm (the agent) still get the `qemu-system-aarch64` PATH check — the firmware probe only matters to the host that will actually boot the VM, and that host always builds with `kvm` enabled anyway. Verification: `cargo build --release --target aarch64-unknown-linux-gnu -p iot-agent-v0` now succeeds and produces a valid ELF aarch64 binary (~13 MB).	2026-04-20 22:41:20 -04:00
Jean-Gabriel Gill-Couture	747ce18ac8	feat(iot): plumb --arch through example + smoke script Wire the VmArchitecture story all the way to the user-facing entry points so an arm64 smoke run is a single env flip. Example (`example_iot_vm_setup`): * New `--arch {x86-64|aarch64}` flag (default x86-64) backed by a `CliArch` enum that converts cleanly to `VmArchitecture`. * Preflight and cloud-image bootstrap now call the `_for_arch` variants, and the `VirtualMachineSpec.architecture` field gets the real value instead of `Default::default()`. Smoke script (`iot/scripts/smoke-a3.sh`): * Reads `ARCH=x86-64|aarch64` from env (default x86-64). * When `ARCH=aarch64`, `rustup target add aarch64-unknown-linux-gnu` + `cargo build --target ...` produces an arm64 agent binary; otherwise the existing host-target build path is kept. * Threads `--arch` to the example. * Extends the phase-4 initial-status timeout (60s → 300s) and the phase-5 post-reboot wait (240s → 900s) under TCG, which runs 3-5× slower than native KVM. New `smoke-a3-arm.sh` wrapper: exports `ARCH=aarch64` and a separate `VM_NAME` / NATS container name so an arm smoke run can coexist with an x86 one on the same host without stepping on libvirt state. Topology side (`KvmVirtualMachineHost::ensure_vm`): `wait_for_ip` timeout is now arch-derived — 300s for x86_64, 900s for aarch64 — because first-boot cloud-init under TCG routinely needs 8-12 min on a constrained worker.	2026-04-20 22:37:23 -04:00
Jean-Gabriel Gill-Couture	1dfb6cddb5	feat(iot): arm64 cloud image + arch-aware preflight Add the pinned Ubuntu 24.04 arm64 cloud image alongside the existing amd64 pin, with sha256 verification and a per-arch OnceCell cache so both images can coexist under $HARMONY_DATA_DIR/iot/cloud-images/. New entry point `ensure_ubuntu_2404_cloud_image_for_arch` selects the right URL/sha256/filename tuple by VmArchitecture; the existing `ensure_ubuntu_2404_cloud_image` becomes a back-compat shim pointing at x86_64 so current callers don't need to thread an arch through yet. Preflight gains `check_iot_smoke_preflight_for_arch`: on top of the host-generic checks, an aarch64 target additionally requires `qemu-system-aarch64` on PATH and a usable AAVMF firmware pair (same `discover_aarch64_firmware` call the topology makes at ensure_vm time — preflight surfaces it up front). Package-map helpers learn `qemu-system-aarch64` for pacman/apt/dnf.	2026-04-20 22:33:43 -04:00
Jean-Gabriel Gill-Couture	45d8c46280	feat(kvm): AAVMF firmware discovery + per-VM NVRAM copy aarch64 guests boot via UEFI — there is no SeaBIOS equivalent for the arm64 `virt` machine type. Libvirt needs two paths: - CODE (read-only firmware image, shared across VMs) - VARS (writable NVRAM, per-VM) Every distro ships these under a different filename. New module `modules/kvm/firmware.rs`: - `AarchFirmware { code, vars_template }` — typed pair. - `discover_aarch64_firmware()` walks four known-paths groups (Arch `edk2-armvirt`, Arch old naming, Debian/Ubuntu `qemu-efi-aarch64`, Fedora `edk2-aarch64`). First pair where both files exist wins. Miss → `ExecutorError` carrying the per-distro `pacman`/`apt`/`dnf` install command + the full candidate list for diagnosis. - `copy_vars_template_for_vm(fw, dest)` produces the per-VM NVRAM at `$pool/<vm>-VARS.fd` and chmods 0644 so libvirt-qemu's dynamic-ownership chown on VM start works. Wired into `KvmVirtualMachineHost::ensure_vm`: when `spec.architecture == Aarch64`, the topology runs firmware discovery + per-VM copy before composing the `VmConfig`, then hands the resolved `UefiFirmware` to the XML renderer (commit 2 already consumes it). x86_64 path unchanged. Firmware discovery is deliberately a runtime check with a clear error, not a preflight — this lets x86_64-only runs succeed on hosts without AAVMF installed. Commit 4 adds an arch-aware preflight that surfaces it upfront when a caller asks for aarch64. Verified: 26/26 kvm::xml tests still green, cargo check clean, cargo fmt clean.	2026-04-20 22:28:50 -04:00
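The "first pair where both files exist wins" rule from 45d8c46280 can be sketched as below. `first_existing_pair` is a hypothetical helper that returns `Option` for brevity; per the commit, the real `discover_aarch64_firmware` uses a built-in candidate list and turns a miss into an `ExecutorError` carrying install hints.

```rust
use std::path::{Path, PathBuf};

/// Typed CODE + VARS-template pair, mirroring the commit's `AarchFirmware`.
#[derive(Debug, PartialEq)]
struct AarchFirmware {
    code: PathBuf,
    vars_template: PathBuf,
}

/// Walks candidate (CODE, VARS) pairs in order and returns the first pair
/// where both files exist. A pair with only one file present is skipped.
fn first_existing_pair(candidates: &[(&Path, &Path)]) -> Option<AarchFirmware> {
    candidates
        .iter()
        .find(|(code, vars)| code.is_file() && vars.is_file())
        .map(|(code, vars)| AarchFirmware {
            code: code.to_path_buf(),
            vars_template: vars.to_path_buf(),
        })
}
```

Requiring both halves of a pair avoids mixing a CODE image from one package with a VARS template from another.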
Jean-Gabriel Gill-Couture	140deeab06	feat(kvm): parameterize domain XML by VmArchitecture Rewrites `domain_xml` to consume a resolved `DomainXmlParams` (domain_type / arch / machine / emulator / cpu_block / firmware) so per-arch branching happens once — at param resolution — and the XML template itself stays a single readable format-string. Per-arch values (from Linaro's "QEMU: A Tale of Performance analysis" Jan 2025 for the aarch64 TCG knobs): - x86_64 → `<domain type='kvm'>` + machine `q35` + emulator `qemu-system-x86_64` + `<cpu mode='host-model'/>`. No firmware. (Unchanged — all existing XML still emits byte-identical output on the default arch.) - aarch64 → `<domain type='qemu'>` (TCG emulation), machine `virt`, emulator `qemu-system-aarch64`, custom CPU `<model>max</model>` with `<feature policy='require' name='pauth-impdef'/>`. MTTCG (`-accel tcg,thread=multi`) is the default in QEMU ≥ 9.1 so no libvirt-side knob is needed. UEFI via `<loader readonly='yes' type='pflash'>CODE</loader>` + `<nvram>VARS</nvram>` — a `UefiFirmware` pair is required (populated by `KvmVirtualMachineHost` in commit 3). Four new unit tests verify the aarch64 path emits the right domain type, arch, machine, emulator, CPU features, and firmware elements — and that x86_64 stays BIOS-default with no loader/ nvram leakage. 26/26 `modules::kvm::xml` tests green. When a native-aarch64 runner (Ampere) shows up, it's a one-line fork inside `DomainXmlParams::for_vm` to switch to `kvm` + `host-model` for the aarch64 branch — the shape already handles it.	2026-04-20 22:26:28 -04:00
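The "resolve once, then render a single template" shape from 140deeab06 can be illustrated with a reduced parameter set. The per-arch values come from the commit message; the constructor name `for_arch`, the `/usr/bin/` emulator prefix, and the trimmed-down XML fragment are assumptions (the real `DomainXmlParams` also carries `cpu_block` and `firmware`).

```rust
#[derive(Clone, Copy)]
enum VmArchitecture {
    X86_64,
    Aarch64,
}

/// Reduced stand-in for the commit's `DomainXmlParams`.
struct DomainXmlParams {
    domain_type: &'static str,
    arch: &'static str,
    machine: &'static str,
    emulator: &'static str,
}

impl DomainXmlParams {
    /// All per-arch branching happens here, once.
    fn for_arch(arch: VmArchitecture) -> Self {
        match arch {
            VmArchitecture::X86_64 => Self {
                domain_type: "kvm",
                arch: "x86_64",
                machine: "q35",
                emulator: "qemu-system-x86_64",
            },
            VmArchitecture::Aarch64 => Self {
                domain_type: "qemu", // TCG emulation on an x86_64 host
                arch: "aarch64",
                machine: "virt",
                emulator: "qemu-system-aarch64",
            },
        }
    }
}

/// The template itself has no `if arch == ...` branches.
fn domain_xml_fragment(name: &str, p: &DomainXmlParams) -> String {
    format!(
        "<domain type='{dt}'>\n  <name>{name}</name>\n  <os><type arch='{arch}' machine='{machine}'>hvm</type></os>\n  <devices><emulator>/usr/bin/{emu}</emulator></devices>\n</domain>",
        dt = p.domain_type,
        arch = p.arch,
        machine = p.machine,
        emu = p.emulator,
    )
}
```

Switching the aarch64 branch to native KVM later would then be a change to one `match` arm, with the template untouched.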
Jean-Gabriel Gill-Couture	c7b074a96a	feat(kvm): VmArchitecture enum + UefiFirmware field on VmConfig Adds the type-safe arch dimension for the aarch64-on-x86_64 emulation work to follow. No behaviour change: every existing call site gets `VmArchitecture::X86_64` via `Default`, and the XML renderer (unchanged in this commit) emits the same bytes it always did. - `VmArchitecture { X86_64 (default), Aarch64 }` in domain/topology/virtualization.rs, with `as_str()` and `ubuntu_cloudimg_suffix()` helpers (Ubuntu uses `amd64`/`arm64` in filenames, not the `uname -m` spelling). - `VirtualMachineSpec.architecture` + `#[serde(default)]` for on-disk compat. - `VmConfig.architecture` + `VmConfig.firmware: Option<UefiFirmware>` in modules/kvm/types.rs. `UefiFirmware { code, vars }` is the typed pair libvirt's `<loader>` + `<nvram>` need for aarch64 guests; x86_64 leaves it None. `VmConfigBuilder::architecture()` / `firmware()` setters added. - `KvmVirtualMachineHost::ensure_vm` threads the arch through to VmConfig; firmware wiring is commit 3. Re-exported: `VmArchitecture`, `UefiFirmware` from `modules::kvm`. `VmArchitecture` is a type-alias re-export from domain/topology so the arch enum lives in one place. Verified: cargo check clean, fmt clean, aarch64 cross-compile of harmony + iot crates still green.	2026-04-20 22:24:29 -04:00
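A minimal sketch of the enum introduced in c7b074a96a. The variant names, the default, and the two helper names come from the commit; the exact strings returned by `as_str()` are assumed to follow the `uname -m` spelling, and the serde attributes are omitted to keep the example dependency-free.

```rust
/// Guest CPU architecture; existing call sites get `X86_64` via `Default`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VmArchitecture {
    #[default]
    X86_64,
    Aarch64,
}

impl VmArchitecture {
    /// `uname -m` style spelling (assumed return values).
    pub fn as_str(self) -> &'static str {
        match self {
            VmArchitecture::X86_64 => "x86_64",
            VmArchitecture::Aarch64 => "aarch64",
        }
    }

    /// Ubuntu cloud image filenames use Debian arch names instead.
    pub fn ubuntu_cloudimg_suffix(self) -> &'static str {
        match self {
            VmArchitecture::X86_64 => "amd64",
            VmArchitecture::Aarch64 => "arm64",
        }
    }
}
```

Keeping both spellings on the enum means callers never hand-map `aarch64` to `arm64` at download sites.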
Jean-Gabriel Gill-Couture	003d6f995a	docs(iot): plan for aarch64 VM support on x86_64 hosts	2026-04-20 22:18:05 -04:00
Jean-Gabriel Gill-Couture	599601f48a	refactor(linux): replace ansible.builtin.command with direct russh Ansible's `command` module is a Python-wrapped SSH round trip with zero added value when the operation isn't built around Ansible's idempotency primitives. `russh` is already a workspace dep and gives us the exit code + stdout + stderr in a typed struct, with one round trip. Moving the two call sites that were using `ansible.builtin.command` to russh directly: - New `modules::linux::ssh_executor::ssh_exec(host, creds, cmd)` returning `SshCommandOutput { rc, stdout, stderr }`. Loads the private key via `russh::keys::load_secret_key`, authenticates, opens an exec channel, drains all `ChannelMsg` until the channel closes, returns the collected data. Draining past `Eof` matters: some sshd implementations emit `ExitStatus` after `Eof`, and an early break loses the rc. - `ensure_linger`: `test -e /var/lib/systemd/linger/<user>` over russh for the check, then `sudo loginctl enable-linger <user>` only on miss. Two SSH round trips, no Ansible. Same semantics as the previous `stat` + `command` pair but without the Python hop. - `ensure_user_unit_active`: `id -u <user>` + `sudo -u <user> env XDG_RUNTIME_DIR=/run/user/<uid> systemctl --user enable --now <unit>`. This is the case that couldn't be done cleanly via ad-hoc `ansible.builtin.systemd` in the first place because task-level `environment:` isn't available in ad-hoc; russh makes it a one-liner. Ansible still owns: `apt` (distro dispatch + cache), `user` (idempotent account management), `copy` (file delivery with content-diff change reporting), `file` (directory/mode), `systemd` (daemon-reload + enable + start as one atomic call). Those are where `ansible`'s value is real; `command` was a category error. Verified: smoke-a3 PASS end-to-end — same 9-change initial setup, NATS status, and power-cycle recovery as before.	2026-04-20 22:08:52 -04:00
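The "drain past `Eof`" point in 599601f48a is easy to get wrong, so here is a dependency-free model of it. `ChannelEvent` is a simplified stand-in for the messages an SSH exec channel yields, not russh's actual type, and `drain_channel` is a hypothetical name.

```rust
/// Simplified stand-in for SSH channel messages (not the russh enum).
enum ChannelEvent {
    Data(Vec<u8>),
    Eof,
    ExitStatus(u32),
    Close,
}

/// Collects stdout and the exit code, reading until the channel closes.
fn drain_channel(events: impl IntoIterator<Item = ChannelEvent>) -> (Option<u32>, Vec<u8>) {
    let mut rc = None;
    let mut stdout = Vec::new();
    for event in events {
        match event {
            ChannelEvent::Data(bytes) => stdout.extend(bytes),
            // Deliberately no `break` here: some servers send the exit
            // status after Eof, and stopping early would lose it.
            ChannelEvent::Eof => {}
            ChannelEvent::ExitStatus(code) => rc = Some(code),
            ChannelEvent::Close => break,
        }
    }
    (rc, stdout)
}
```

Returning `Option<u32>` for the exit code keeps "the server never reported a status" distinguishable from "the command exited 0".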
Jean-Gabriel Gill-Couture	1072aa2850	refactor(iot): address code review — ISP, VirtualMachineHost, cleanups Structural changes (the biggest items from the review): - `HostConfigurationProvider` split into five narrower capabilities: `HostReachable`, `PackageInstaller`, `FileDelivery`, `UnixUserManager`, `SystemdManager`. Each implementation now only implements what it can actually deliver — a future cloud-init / ignition / podman-agent backend can pick a subset without inheriting systemd assumptions it can't honour. Added an umbrella trait `LinuxHostConfiguration` blanket-impl'd for any type that has all five, so Scores keep a single bound. - New `VirtualMachineHost` capability in domain/topology/: `list_vms` / `ensure_vm` / `delete_vm` / `get_vm_info`, with generic `VirtualMachineSpec` carrying a typed optional `VmFirstBootConfig` (hostname, admin user, authorized keys). `KvmHost` trait and `KvmHostTopology` deleted; `KvmVirtualMachineHost` is the concrete libvirt implementation. Cloud-init stays a KVM-impl detail — callers never see it. - `KvmVmScore` + `CloudInitVmConfig` deleted; replaced by a generic `ProvisionVmScore` in `modules::iot::vm_score` bound to `T: VirtualMachineHost`. The Score itself has no knowledge of the hypervisor or its first-boot delivery mechanism. - `IotDeviceSetupConfig.device_id` is now `harmony_types::id::Id` (timestamp-prefixed, sortable-by-creation, collision-safe). - `ensure_ready` on `KvmVirtualMachineHost` is a Noop with a TODO pointing at ROADMAP/12-code-review-april-2026.md §12.1 (phased topology). Captures the concern about eagerly probing the hypervisor even when the current run doesn't need KVM. Code quality fixes from the line-level comments: - `render_toml` / `render_systemd_unit` / `render_user_data` rewritten as `format!` with raw-string templates (no more push_str chains). - Every `Command::new(…).arg().arg().arg()` chain in the touched files converted to `.args([…])`.
- Ansible module args are now typed Rust structs (`AptArgs`, `AnsibleFileArgs`, `AnsibleUserArgs`, `AnsibleCopyArgs`, `AnsibleSystemdArgs`, `AnsibleCommandArgs`, `AnsibleStatArgs`) serialized via `serde_json::to_value`. No more `json!` macros with ad-hoc string keys. - `ensure_linger`: no more shell sentinel. Uses `ansible.builtin.stat` on `/var/lib/systemd/linger/<user>` for the idempotent change-state check, then `ansible.builtin.command loginctl enable-linger` only on miss. `loginctl` is required (not just `file state=touch`) because systemd-logind needs the dbus signal to actually start the user manager; a plain file touch doesn't wake it up and every subsequent `systemctl --user …` fails with "Failed to connect to bus". Documented in-place. - `ensure_user_unit_active`: picks up the user's UID first via `ansible.builtin.command id -u <user>` and wraps the `systemctl --user enable --now <unit>` invocation in `env XDG_RUNTIME_DIR=/run/user/<UID>`. The systemd module's task-level `environment:` keyword isn't available in ad-hoc mode; this is the cleanest equivalent. Documented the inline-playbook path as a future when we get more task-level- env callsites. - `ensure_package` comment clarified: distro dispatch is this function's job; Debian-family is the first concrete target and extending to RHEL/Fedora/Alpine is an implementation detail, not a capability change. - Kubespray line removed. Verified: from a primed `$HARMONY_DATA_DIR/iot/`, smoke-a3.sh still completes all 5 phases (bootstrap + provision + 9 setup changes + initial NATS status + power-cycle recovery).	2026-04-20 17:58:42 -04:00
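The interface split in 1072aa2850 (five narrow capabilities plus a blanket-implemented umbrella trait) follows a standard Rust pattern, sketched below. The trait names come from the commit; every method name, the `FakeHost` type, and the `onboard` function are hypothetical and exist only to make the example compile.

```rust
trait HostReachable { fn is_reachable(&self) -> bool; }
trait PackageInstaller { fn ensure_package(&self, name: &str) -> String; }
trait FileDelivery { fn deliver_file(&self, dest: &str) -> String; }
trait UnixUserManager { fn ensure_user(&self, user: &str) -> String; }
trait SystemdManager { fn enable_unit(&self, unit: &str) -> String; }

/// Umbrella trait: no methods of its own, just the sum of the five.
trait LinuxHostConfiguration:
    HostReachable + PackageInstaller + FileDelivery + UnixUserManager + SystemdManager
{
}

/// Blanket impl: any type with all five capabilities gets the umbrella.
impl<T> LinuxHostConfiguration for T where
    T: HostReachable + PackageInstaller + FileDelivery + UnixUserManager + SystemdManager
{
}

/// Hypothetical backend that records what it would do.
struct FakeHost;
impl HostReachable for FakeHost { fn is_reachable(&self) -> bool { true } }
impl PackageInstaller for FakeHost { fn ensure_package(&self, n: &str) -> String { format!("package:{n}") } }
impl FileDelivery for FakeHost { fn deliver_file(&self, d: &str) -> String { format!("file:{d}") } }
impl UnixUserManager for FakeHost { fn ensure_user(&self, u: &str) -> String { format!("user:{u}") } }
impl SystemdManager for FakeHost { fn enable_unit(&self, u: &str) -> String { format!("unit:{u}") } }

/// A Score-like function keeps a single bound instead of five.
fn onboard<T: LinuxHostConfiguration>(host: &T) -> Vec<String> {
    if !host.is_reachable() {
        return Vec::new();
    }
    vec![
        host.ensure_package("podman"),
        host.ensure_user("iot-agent"),
        host.deliver_file("/etc/iot-agent/config.toml"),
        host.enable_unit("iot-agent.service"),
    ]
}
```

A backend that cannot manage systemd units simply omits `SystemdManager`; it then fails to satisfy `LinuxHostConfiguration` at compile time while remaining usable by code that only needs, say, `FileDelivery`.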
Jean-Gabriel Gill-Couture	c2f731b6d3	feat(iot): fully self-bootstrapping smoke-a3 + reboot test The smoke test now runs end-to-end against a pristine host with only generic deps installed (libvirt, qemu, xorriso, python3, podman, cargo, kubectl) — no manual ISO downloads, ssh-keygen rituals, or chmod dances. Pairs with a hard power-cycle recovery phase that matches ROADMAP §8's "power cycle test" shape. Harmony-side bootstrap (all under $HARMONY_DATA_DIR/iot/): - `modules::iot::assets` — SHA256-verified Ubuntu 24.04 cloud image download (cached, streaming via reqwest) + ed25519 SSH keypair generation. OnceCell-cached like `ensure_ansible_venv`. - `modules::iot::libvirt_pool` — user-owned dir-backed libvirt storage pool at $HARMONY_DATA_DIR/iot/kvm/pool/. Per-VM overlay disks + seed ISOs land here; libvirt dynamic-ownership handles the libvirt-qemu chown transitions we used to do by hand. Pool is defined/built once via the `virt` crate inside a spawn_blocking, then auto-started + auto-autostarted on every process boot. - `modules::iot::preflight::check_iot_smoke_preflight()` — fail-fast checks for every runner-host prereq (`virsh`, `qemu-img`, `xorriso`, `python3`, `ssh-keygen`, libvirt-group membership, default network active). Each missing piece surfaces with the Arch/Debian/ Fedora install command inline. KvmVmScore now owns these calls internally — `CloudInitVmConfig` loses `base_image_path`, `seed_output_dir`, `authorized_key`. The Score returns the SSH private-key path in its outcome details so the caller can hand it straight to `LinuxHostTopology`. smoke-a3.sh dropped from 125 lines of manual setup to a thin orchestration script. Adds phase 5: `virsh destroy` + `sleep` + `virsh start`, then a wall-clock gate that rejects any status writes from before the reboot. Verified: real power-cycles produce timestamps ~14s after the gate (agent boot + connect latency); the gate catches in-flight writes that happen during destroy. 
Verified end-to-end from a fully nuked `$HARMONY_DATA_DIR/iot/`: - cold boot: downloads 600MB cloud image (~25s), generates SSH key, defines + starts libvirt pool, provisions VM, onboards device, verifies phase 5 power-cycle recovery - warm boot: cache hits on all bootstrap steps; same end-to-end PASS in 2-3 minutes total aarch64 cross-compile still green.	2026-04-20 14:45:26 -04:00
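The phase 5 wall-clock gate described in this commit can be pictured as a strict timestamp comparison. The sketch below is an assumption about its shape (the real check lives in `smoke-a3.sh`); `written_after_gate` is a hypothetical name, and the gate is assumed to be captured just before `virsh start`.

```rust
use std::time::SystemTime;

/// Accept a status write only if it is strictly newer than the gate.
/// Writes at or before the gate are treated as pre-reboot leftovers,
/// including in-flight writes that landed during `virsh destroy`.
fn written_after_gate(status_written_at: SystemTime, gate: SystemTime) -> bool {
    status_written_at > gate
}
```

With the roughly 14s agent boot and connect latency reported above, a genuine post-reboot write clears the gate comfortably, while a stale write is rejected.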
Jean-Gabriel Gill-Couture	63847ac059	fix(iot): end-to-end smoke-a3 greens; CI-ready All checks were successful Run Check Script / check (pull_request) Successful in 2m16s Details Eight fixes surfaced by actually running the VM-as-device flow end to end. All six commit deltas are small and self-contained. KvmVmScore + cloud-init: - Overlay disk: VM now boots off a per-VM qcow2 backed by the base image instead of writing into the base in-place. Re-runs of the same vm_name reuse the overlay (idempotent); fresh runs wipe the overlay so cloud-init starts clean. Requires `qemu-img`. - UUID instance-id: cloud-init's meta-data now carries a fresh UUID per seed build, so when the overlay gets recreated cloud-init treats it as a first boot and re-runs all per-instance modules. Without this, repeated runs silently skipped user/hostname/ssh setup. - xorriso deadlock: `.status()` with piped stderr filled the pipe buffer and SIGPIPE'd the child; switched to `.output()` which drains both. Also unlink any pre-existing seed ISO before running xorriso, since it otherwise treats the file as overwriteable input "media" and aborts with exit 32. - wait_for_ip: 180s → 300s. First boot of a cloud image on a constrained runner (or CI worker) can take 2-4 minutes. Ansible adapter — a half-dozen sharp corners of ad-hoc mode that only show up in a live run: - `--ssh-common-args=VALUE` (equals form, single token). Separate `--ssh-common-args VALUE` form has ansible's argparse re-interpret the `-o …` inside the value as its own `-o` flag and dump a help screen. Lost an afternoon to this decades ago on another project. - Skip `-a` when empty: `-a '{}'` trips ansible-core 2.17's "extra params" check on parameterless modules like `ping`. Pass no `-a` when the JSON dict is empty. - `ANSIBLE_LOAD_CALLBACK_PLUGINS=True`: ad-hoc mode silently ignores `ANSIBLE_STDOUT_CALLBACK` without this. Default callback produces multi-line JSON that's fragile to parse. 
- `ANSIBLE_PIPELINING=True`: required when `become`-ing an unprivileged user (iot-agent for the user-scope podman.socket), otherwise ansible's temp-file shuffle falls back to an ACL chmod syntax no Linux distro accepts. - Parse shell/command oneline shape: oneline callback emits `host \| VERB \| rc=N \| (stdout) … \| (stderr) …` for shell-style modules in addition to the `host \| VERB => {json}` shape. Parser now handles both and synthesises a JSON payload from the shell form. - Auto-create parent dir in ensure_file: ansible's `copy` module won't create `/etc/iot-agent/` for you; a `file state=directory` call before every `copy` is idempotent and cheap. - ensure_package uses apt directly: `ansible.builtin.package` is distro-agnostic but doesn't auto-run `apt update`, so a fresh cloud image fails with "no package matching". Switched to `ansible.builtin.apt` with `update_cache=true, cache_valid_time=3600`. Debian-family only for v0 (ROADMAP §5.3); RHEL switch is a future capability refinement. HostConfigurationProvider surface: - `FileSpec.source: FileSource`: new `Content(String)` vs `LocalPath(PathBuf)`. LocalPath ships binary files over SFTP via ansible's native mechanism instead of passing base64 content through argv (which hit ARG_MAX on the ~10MB agent). This replaces the whole base64-in-tmpfile + oneshot install-unit dance in IotDeviceSetupScore — the binary now installs in a single idempotent `ensure_file` call that reports `changed` only when bytes differ. IotDeviceSetupScore: - Dropped the base64 + oneshot install machinery (80 fewer lines). - Dropped the explicit primary `group:` on ensure_user — Debian-family useradd auto-creates a group matching the username; setting `group:` required pre-creating it. smoke-a3.sh: builds iot-agent-v0 `--release` instead of debug (400MB debug binary filled the VM's thin-provisioned 3.5GB cloud rootfs). 
Verified end-to-end three times on this host: run 1: 9 changes (fresh install — package install, user create, binary, config, restart) run 2: 0 changes (true NOOP — `already configured`) run 3: 2 changes (group swap — only TOML + agent restart) Agent reports status.iot-smoke-vm into NATS after each run.	2026-04-20 10:22:52 -04:00
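The two `oneline` output shapes mentioned in this commit (pipes appear escaped as `\|` in this listing) can be split without a regex. This is a hedged sketch rather than the adapter's parser: the `Oneline` enum and `parse_oneline` are hypothetical names, and the exact spacing of the shell shape is assumed from the description above.

```rust
/// Hypothetical parsed form of one `oneline` callback line.
#[derive(Debug, PartialEq)]
enum Oneline {
    /// `host | VERB => {json}`; the JSON is kept as raw text here.
    Json { host: String, verb: String, json: String },
    /// `host | VERB | rc=N | (stdout) … | (stderr) …`
    Shell { host: String, verb: String, rc: i32, stdout: String, stderr: String },
}

fn parse_oneline(line: &str) -> Option<Oneline> {
    let (host, rest) = line.split_once(" | ")?;
    // JSON shape: the verb is a single token directly before " => ".
    if let Some((verb, json)) = rest.split_once(" => ") {
        if !verb.contains(" | ") {
            return Some(Oneline::Json {
                host: host.to_string(),
                verb: verb.to_string(),
                json: json.to_string(),
            });
        }
    }
    // Shell shape: verb, return code, then the optional stream sections.
    let (verb, rest) = rest.split_once(" | rc=")?;
    let (rc, tail) = rest.split_once(" | ").unwrap_or((rest, ""));
    let rc: i32 = rc.trim().parse().ok()?;
    let tail = tail.strip_prefix("(stdout)").unwrap_or(tail);
    let (stdout, stderr) = tail.split_once("| (stderr)").unwrap_or((tail, ""));
    Some(Oneline::Shell {
        host: host.to_string(),
        verb: verb.to_string(),
        rc,
        stdout: stdout.trim().to_string(),
        stderr: stderr.trim().to_string(),
    })
}

/// Verbs that should be surfaced as errors.
fn is_error(verb: &str) -> bool {
    matches!(verb, "FAILED!" | "UNREACHABLE!")
}
```

A caller that wants a uniform payload could build a small JSON object from the `Shell` fields, which is the "synthesises a JSON payload" step the commit describes.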
Jean-Gabriel Gill-Couture	1577348dbb	refactor(linux): ansible ad-hoc mode + self-installing venv All checks were successful Run Check Script / check (pull_request) Successful in 2m20s Details Rewrites AnsibleHostConfigurator to avoid the two coupling points that last year's Kubespray investigation taught us to stay away from: YAML playbook generation and Ansible inventory. - No more YAML, no more inventory files. Every primitive is now one or two `ansible all -i '<ip>,' -m <module> -a '<json>'` ad-hoc invocations. JSON args go straight through Ansible's own module interface; the tmpfile-playbook-and-inventory dance is gone entirely. Harmony owns 100% of orchestration, Ansible owns only per-host idempotent module execution. `ensure_systemd_unit` collapses to two ad-hoc calls (copy + systemd) rather than a multi-task playbook. `ensure_linger` sentinels change-state through the shell module's stdout since ad-hoc has no `changed_when`. - Self-installing venv. New `modules::linux::ansible_venv`: `ensure_ansible_venv()` creates `$HARMONY_DATA_DIR/ansible-venv/` via `python3 -m venv` + `pip install ansible-core==2.17.*` on first use, cached via `tokio::sync::OnceCell`. No more "install ansible before running Harmony" step — python3 + venv is the only host requirement, and we print the exact package names for Arch/Debian/Fedora when python is missing. - smoke-a3.sh: drop `ansible-playbook` from preflight, add `python3`. Example gains `--bootstrap-ansible-only` for warming the venv ahead of the real run (turns a ~60s first-run smoke into deterministic sub-second after bootstrap). Output parsing uses the `oneline` callback (`host \| VERB => {json}`) which is trivially regex-free to split and handles FAILED!/UNREACHABLE! as errors. SSH control sockets are pinned under `$HARMONY_DATA_DIR/ansible-cp` so multiple Harmony processes don't race in /tmp.
Verified: `ensure_ansible_venv()` first call installs ansible-core 2.17.14 into the managed venv (~12s, network-bound); second call is cache-fast (<50ms). Clippy + fmt clean, aarch64 cross-compile green.	2026-04-20 08:49:15 -04:00
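As an illustration of the ad-hoc invocation shape `ansible all -i '<ip>,' -m <module> -a '<json>'`, here is a small sketch built on `std::process::Command`. The function names and the way the `oneline` callback is selected via `ANSIBLE_STDOUT_CALLBACK` are assumptions for illustration, not the adapter's actual code; nothing is executed, the command is only assembled.

```rust
use std::process::Command;

/// Assemble (but do not run) one ad-hoc Ansible call against a single host.
/// The trailing comma in the `-i` value makes Ansible treat it as an inline
/// host list rather than a path to an inventory file.
fn adhoc_command(ansible_bin: &str, ip: &str, module: &str, args_json: &str) -> Command {
    let mut cmd = Command::new(ansible_bin);
    cmd.arg("all")
        .arg("-i")
        .arg(format!("{ip},"))
        .arg("-m")
        .arg(module)
        .arg("-a")
        .arg(args_json)
        // Assumption: the callback is chosen through the environment.
        .env("ANSIBLE_STDOUT_CALLBACK", "oneline");
    cmd
}

/// Helper to inspect the assembled arguments as plain strings.
fn argv(cmd: &Command) -> Vec<String> {
    cmd.get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect()
}
```

Passing each argument separately avoids any shell quoting of the JSON payload, which is one reason a typed argv is convenient for this kind of adapter.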
Jean-Gabriel Gill-Couture	ad71568aea	feat(iot): example_iot_vm_setup + smoke-a3.sh driver for VM-as-device - New binary crate `examples/iot_vm_setup` — composes the two Scores from the previous commit (`KvmVmScore`, `IotDeviceSetupScore`) with `KvmHostTopology` + `LinuxHostTopology`. CLI flags cover everything a customer-facing "onboard this VM" invocation would need (device id, group, NATS URL+creds, SSH key paths, cloud image path, agent binary path). `--only-vm` skips the setup step when iterating on VM provisioning. - `iot/scripts/smoke-a3.sh` — end-to-end smoke that stands up a NATS podman container, builds the iot-agent, runs the example, and waits for the VM's agent to write its `status.<device-id>` key into the `agent-status` KV bucket. Preflight fails fast with copy-paste commands when any of `virsh`, `xorriso`, `ansible-playbook`, the Ubuntu cloud image, or an SSH keypair is missing — the script does not try to self-bootstrap these (would turn a 90-second smoke into a ~20-minute download-and-generate session). - Clippy cleanups: redundant closure + useless `format!`s.	2026-04-20 08:10:05 -04:00
Jean-Gabriel Gill-Couture	1c5e1018f0	feat(iot): KvmVmScore + IotDeviceSetupScore behind a HostConfigurationProvider trait Adds the plumbing so Harmony can both provision a VM to stand in for a fleet device and (re)configure any Linux host to join the fleet. The walking skeleton's "VM-as-device" test path needs all three pieces: - `domain::topology::HostConfigurationProvider` — new capability trait with `ensure_package`, `ensure_user`, `ensure_file`, `ensure_systemd_unit`, `restart_service`, `ensure_linger`, `ensure_user_unit_active`, and a reachability `ping`. Returns `ChangeReport { changed: bool }` so callers can reconcile-restart only when something actually changed. Trait doc calls out the narrow scope (not a general CM replacement) and the swappability story. - `modules::linux::AnsibleHostConfigurator` + `LinuxHostTopology` — concrete impl that shells out to `ansible-playbook --stdout-callback json`, one play per trait method, parsing the JSON for the task's `changed` flag. Deliberately the laziest reasonable adapter: when Ansible's error surface becomes painful, this is the piece we replace with a Rust-native impl behind the same trait, with zero Score churn. Runtime requirement: `ansible-playbook` (>= 2.15) on the Harmony runner host. - `modules::kvm::KvmVmScore` + cloud-init seed ISO generation — thin Score that wraps `KvmExecutor::ensure_vm` with a generated cloud-init seed ISO (hostname + authorized SSH key + sudoer user, nothing more). Uses `xorriso -as mkisofs` to build the ISO; returns the booted VM's IP. Docs note cloud-init is strictly for the VM test rig — customer Pi deployments go through rpi-imager / PXE instead. New `KvmHost` capability + `KvmHostTopology` expose the underlying `KvmExecutor`. - `modules::iot::IotDeviceSetupScore` — customer-facing Score bound to `T: Topology + HostConfigurationProvider`. 
Installs podman + systemd-container, creates the `iot-agent` system user with linger, activates user podman.socket, uploads the agent binary via a base64-in-tmpfile + oneshot unit pattern (docstring flags this as a v0.1 candidate for a proper remote-fetch), writes `/etc/iot-agent/config.toml` and the systemd unit, and restarts only if any of the config/unit/binary-install tasks reported changes. Re-running with a different `group` rewrites the TOML and bounces the agent. Scope note: this turn stops at one VM. Multi-VM + group routing is the next step — `group` in the config is a label that the agent will carry into its status bucket, but `Deployment.spec.targetGroups` isn't wired anywhere yet. `smoke-a3.sh` (VM-as-device end-to-end) lands in the next commit.	2026-04-20 08:04:51 -04:00
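The "restart only if something changed" contract behind `ChangeReport { changed: bool }` can be shown with a cut-down, synchronous stand-in for the trait. This is a sketch under stated assumptions: the real `HostConfigurationProvider` has more methods and its signatures are not shown in this log; `InMemoryHost`, `apply_config`, and the `iot-agent` unit name are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChangeReport {
    changed: bool,
}

/// Two-method stand-in for the capability trait (illustrative only).
trait HostConfigurationProvider {
    fn ensure_file(&mut self, path: &str, content: &str) -> Result<ChangeReport, String>;
    fn restart_service(&mut self, unit: &str) -> Result<(), String>;
}

/// Fake host that records state in memory instead of calling Ansible.
#[derive(Default)]
struct InMemoryHost {
    files: HashMap<String, String>,
    restarts: Vec<String>,
}

impl HostConfigurationProvider for InMemoryHost {
    fn ensure_file(&mut self, path: &str, content: &str) -> Result<ChangeReport, String> {
        // Report a change only when the stored bytes differ.
        let changed = self.files.get(path).map(String::as_str) != Some(content);
        if changed {
            self.files.insert(path.to_string(), content.to_string());
        }
        Ok(ChangeReport { changed })
    }

    fn restart_service(&mut self, unit: &str) -> Result<(), String> {
        self.restarts.push(unit.to_string());
        Ok(())
    }
}

/// Write the agent config and bounce the agent only when the file changed.
fn apply_config<P: HostConfigurationProvider>(host: &mut P, toml: &str) -> Result<bool, String> {
    let report = host.ensure_file("/etc/iot-agent/config.toml", toml)?;
    if report.changed {
        host.restart_service("iot-agent")?; // unit name is an assumption
    }
    Ok(report.changed)
}
```

Because the Score is bound to the trait rather than to Ansible, a fake like `InMemoryHost` can also stand in for the adapter in unit tests.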
Jean-Gabriel Gill-Couture	11121252a9	feat(iot-agent): reconcile PodmanV0Score into real containers The agent now finishes the walking-skeleton thread end-to-end: a Deployment CR applied in the central cluster flows through the operator into NATS KV, the agent reconciles it into a running container on the host, and deletion (or drift) runs through the same loop in reverse. Key additions: - `domain::topology::ContainerRuntime` — new capability trait for node-level container runtimes with `ensure_service_running` / `remove_service` / `list_managed_services`. Intentional scope doc notes Docker likely fits, Containerd/CRI-O likely need a separate capability; no attempt to generalise further up front. `ContainerSpec` carries a `MANAGED_BY_LABEL` so `list_managed_services` can filter out containers Harmony didn't create. - `modules::podman::PodmanTopology` (feature-gated behind `podman`) implements both `Topology` and `ContainerRuntime` over `podman_api::Podman` on the local user socket. Handles image pull, create/start, drift-triggered recreate, and a 5-minute graceful stop per ROADMAP §5.6. - `modules::podman::PodmanV0Interpret::execute` is no longer a stub — its bound is tightened to `T: Topology + ContainerRuntime` and it dispatches each `PodmanService` to the capability. `IotScore` / `PodmanV0Score` carry the same bound so agent code calls `Score::create_interpret` cleanly. - `domain::inventory::Inventory::from_localhost()` — minimal single-host inventory (hostname as label, logical CPU count, total memory). Pulls in `sysinfo 0.30` (already a transitive dep via `harmony_inventory_agent`). - `iot-agent-v0` rewired around a `Reconciler` that owns the topology + inventory + a `HashMap<key, (serialized_score, parsed_score)>` cache. KV Put → dispatch iff the serialized score changed (ROADMAP §5.5 string-compare). KV Delete/Purge → tear down the cached score's containers. 
Separate 30s reconcile tick re-runs every cached score against podman (ROADMAP §5.6 "polls podman every 30s as ground truth; KV watch events are accelerators"). Smoke test (`iot/scripts/smoke-a1.sh`) extended with phase 3b (builds + starts agent) and phase 4b (verifies the container is running and `curl http://127.0.0.1:8080/` returns nginx). Phase 5 now also asserts the container is gone after CR delete. PASS locally against a fresh k3d + NATS podman container + rootless podman on the dev host. aarch64 + x86_64 cross-compile stay green.	2026-04-18 23:18:55 -04:00
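The "dispatch iff the serialized score changed" rule can be sketched with a plain string cache. This is a simplified model of the agent's `Reconciler`, not its code: the real cache also stores the parsed score, and `ScoreCache`, `Action`, and the method names here are hypothetical.

```rust
use std::collections::HashMap;

/// What the reconciler should do in response to a KV event.
#[derive(Debug, PartialEq)]
enum Action {
    Dispatch, // new or changed score: run it against the runtime
    Skip,     // identical serialized score: nothing to do
    TearDown, // key removed: stop the cached score's containers
    Ignore,   // delete for a key that was never cached
}

#[derive(Default)]
struct ScoreCache {
    /// KV key -> last serialized score seen for that key.
    entries: HashMap<String, String>,
}

impl ScoreCache {
    /// KV Put: plain string comparison against the cached serialization.
    fn on_put(&mut self, key: &str, serialized: &str) -> Action {
        if self.entries.get(key).map(String::as_str) == Some(serialized) {
            return Action::Skip;
        }
        self.entries.insert(key.to_string(), serialized.to_string());
        Action::Dispatch
    }

    /// KV Delete or Purge: drop the entry and tear down if it existed.
    fn on_delete(&mut self, key: &str) -> Action {
        match self.entries.remove(key) {
            Some(_) => Action::TearDown,
            None => Action::Ignore,
        }
    }

    /// Keys to re-run on the periodic tick, sorted for stable ordering.
    fn keys_for_tick(&self) -> Vec<&str> {
        let mut keys: Vec<&str> = self.entries.keys().map(String::as_str).collect();
        keys.sort_unstable();
        keys
    }
}
```

The periodic tick re-runs every cached key regardless of KV events, which matches the "podman as ground truth, watch events as accelerators" framing.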
Jean-Gabriel Gill-Couture	d21bdef050	feat(iot-operator): CEL-validate score.type as a Rust identifier All checks were successful Run Check Script / check (pull_request) Successful in 2m15s Details The CRD previously accepted any string for `score.type`, so typos like `"pdoman"` or `"PodmnV0"` would be persisted by the apiserver and only surface on-device as agent-side deserialize warnings. That class of failure is distasteful and hard to debug. Replace the auto-derived schema for `ScorePayload` with a hand-rolled one that keeps the same visible shape but adds two apiserver-level guardrails: - `score.type` gets `minLength: 1` and an `x-kubernetes-validations` CEL rule requiring it to match `^[A-Za-z_][A-Za-z0-9_]*$` — a valid Rust identifier, since score variants are Rust struct names in `harmony::modules::podman::IotScore`. Message points operators at the concrete example `PodmanV0`. - `score.data` still carries only `x-kubernetes-preserve-unknown-fields: true`. The rule validates the discriminator's shape, not its value, so v0.3+ variants (OkdApplyV0, KubectlApplyV0) don't require an operator release — preserves ROADMAP §6.1's generic-router design. The `x-kubernetes-preserve-unknown-fields` extension stays scoped to `score.data` alone; every other field in the CRD has a strict schema, exactly one preserve-unknown-fields marker and exactly one validations block in the whole document. Smoke test extended: phase 2b applies a CR with `score.type: "has spaces"` and asserts the apiserver rejects it with the CEL message before the operator ever sees it. Positive phases (kubectl apply -> NATS KV put -> status observed -> delete -> KV key removed) still PASS end-to-end. Matches the `preserve_arbitrary` pattern used by ArgoCD (`Application.spec.source.helm.valuesObject`) and Flux (`HelmRelease.spec.values`), both of which similarly use narrow preserve-unknown-fields on a payload field without coupling the CRD to their variant catalog.	2026-04-18 10:35:59 -04:00
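The identifier rule enforced by the CEL validation (a leading ASCII letter or underscore, followed by any number of ASCII letters, digits, or underscores) is easy to mirror in plain Rust, for example as a client-side pre-check. This is a sketch only; `is_valid_score_type` is a hypothetical helper and is not part of the operator.

```rust
/// Mirrors the pattern `^[A-Za-z_][A-Za-z0-9_]*$` without a regex crate.
fn is_valid_score_type(s: &str) -> bool {
    let mut chars = s.chars();
    // First character: ASCII letter or underscore (also rejects "").
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    // Remaining characters: ASCII letters, digits, or underscores.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

As the commit notes, the rule checks the discriminator's shape and not its value, so a typo such as `pdoman` still passes this check and is only caught when the agent fails to deserialize it.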
Jean-Gabriel Gill-Couture	1c916340f1	test(iot-operator): A1 end-to-end smoke test + CRD/patch fixes All checks were successful Run Check Script / check (pull_request) Successful in 2m15s Details `iot/scripts/smoke-a1.sh` drives the A1 acceptance flow end-to-end: spins up NATS and a k3d cluster via podman, applies the generated CRD, runs the operator, applies a Deployment CR, asserts the expected `<device>.<deployment>` key lands in the `desired-state` KV bucket and `.status.observedScoreString` round-trips the same JSON, then deletes the CR and asserts the finalizer removes the KV key. Cleans up on exit. Two fixes surfaced while running it: 1. `ScorePayload.data: serde_json::Value` generated an empty `{}` schema, which the API server rejects. Attach a `schemars(schema_with = preserve_arbitrary)` helper that emits `x-kubernetes-preserve-unknown-fields: true`, letting the Score payload be any JSON shape. 2. `Patch::Merge` combined with `PatchParams::apply(...).force()` is rejected by kube-rs (force is Apply-only). Use a plain `Merge` patch for the status subresource — simpler and correct for v0.	2026-04-18 09:43:16 -04:00
Jean-Gabriel Gill-Couture	e50ab741fc	feat(iot-operator): Deployment CRD controller writing to NATS KV Implement the A1 task from the IoT walking-skeleton roadmap: - CRD (kube-derive): `iot.nationtech.io/v1alpha1/Deployment`, namespaced, with `targetDevices`, `score {type, data}`, `rollout.strategy`, and a status subresource carrying `observedScoreString`. - Controller: `kube::runtime::Controller` + `finalizer` helper. On Apply, writes `<device_id>.<deployment_name>` into NATS KV bucket `desired-state` and patches `.status.observedScoreString` via server-side apply. Skips KV write + status patch when the score is unchanged to avoid reconcile-loop churn. On Cleanup, removes the per-device keys before releasing the finalizer. - CLI: `gen-crd` subcommand prints the CRD YAML from the Rust types; `run` (default) starts the controller. `deploy/crd.yaml` is generated by that subcommand — single source of truth, no drift. - Deploy manifests: `deploy/operator.yaml` (Namespace, SA, ClusterRole, ClusterRoleBinding, Deployment) and generated `deploy/crd.yaml`. Agent fixes surfaced while aligning with the operator's key layout: - Watch filter: was `starts_with("desired-state.<id>.")` on `watch_all()`; bucket name is not a key prefix, so it never matched. Now uses `bucket.watch("<id>.>")` with the NATS wildcard and handles `Put`/`Delete`/`Purge` distinctly. - Multi-server connect: was joining `nats.urls` with `","` into a single malformed URL. Pass the `Vec<String>` to `ConnectOptions::connect`. - `credentials.type` is now validated (rejects unknown discriminators) so a v0.2 `zitadel` config doesn't silently fall back to shared creds. Verification on feat/iot-walking-skeleton: - cargo clippy --no-deps -D warnings: clean (agent + operator). - cargo fmt --check: clean. - x86_64 + aarch64 cross-compile: both build. - podman module unit tests: pass.	2026-04-18 09:05:56 -04:00
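The key layout and the watch-filter fix from this commit can be illustrated with a tiny subject matcher. This is a sketch, not the agent's code: the helper names are hypothetical, `subject_matches` is a simplified model of NATS wildcard semantics (`*` for one token, `>` for one or more trailing tokens), and device ids are assumed not to contain dots.

```rust
/// Key written by the operator: `<device_id>.<deployment_name>`.
fn desired_state_key(device_id: &str, deployment: &str) -> String {
    format!("{device_id}.{deployment}")
}

/// Per-device watch filter using the NATS `>` tail wildcard.
fn device_watch_filter(device_id: &str) -> String {
    format!("{device_id}.>")
}

/// Simplified NATS-style matching over dot-separated tokens.
fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            // `>` matches the rest, but needs at least one token.
            (Some(">"), Some(_)) => return true,
            // `*` matches exactly one token.
            (Some("*"), Some(_)) => continue,
            // Literal tokens must be equal.
            (Some(a), Some(b)) if a == b => continue,
            // Both exhausted together: full match.
            (None, None) => return true,
            _ => return false,
        }
    }
}
```

The last assertion style below shows why the old `desired-state.<id>.` prefix never matched: the bucket name is not part of the key, so keys start with the device id.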
Jean-Gabriel Gill-Couture	65ef540b97	feat: scaffold IoT walking skeleton — podman module, operator, and agent Some checks failed Run Check Script / check (pull_request) Has been cancelled Details - Add PodmanV0Score/IotScore (adjacent-tagged serde) and PodmanV0Interpret stub - Gate virt behind kvm feature and podman-api behind podman feature - Scaffold iot-operator-v0 (kube-rs operator stub) and iot-agent-v0 (NATS KV watch) - Add PodmanV0 to InterpretName enum - Fix aarch64 cross-compilation by making kvm/podman optional features - Align async-nats across workspace, add workspace deps for tracing/toml/tracing-subscriber - Remove unused deps (serde_yaml from agent, schemars from operator) - Add Send+Sync to CredentialSource, fix &PathBuf → &Path, remove dead_code allow - Update 5 KVM example Cargo.tomls with explicit features = ["kvm"]	2026-04-17 20:15:10 -04:00
