feat/iot-helm #275
Kills the "CRD owns a list of device ids" smell. Deployment CR now carries a standard K8s LabelSelector; Device is a first-class cluster- scoped CR (like Node). Matching, desired-state KV writes, and status aggregation all run off selector evaluation against the Device cache — no list of device ids anywhere in the CRD spec. Cross-resource model: - Agent publishes DeviceInfo (with labels) to NATS `device-info` KV. - device_reconciler watches that bucket → server-side-applies a cluster-scoped Device CR with metadata.labels + spec.inventory. - Deployment controller is now just validation + finalizer cleanup. - fleet_aggregator watches Deployment CRs + Device CRs + device-state KV, maintains in-memory selector → target device sets, writes/deletes `desired-state.<device>.<deployment>` KV on match changes, patches `.status.aggregate` at 1 Hz with matchedDeviceCount + phase counters. Applied CRD shape verified on a live k3d cluster: kubectl get crd deployments.iot.nationtech.io -o json .spec.versions[0].schema.openAPIV3Schema.properties.spec → rollout / score / targetSelector (matchLabels + matchExpressions) .spec.versions[0].schema.openAPIV3Schema.properties.status.aggregate → matchedDeviceCount / succeeded / failed / pending / lastError kubectl get crd devices.iot.nationtech.io -o json .spec.scope = "Cluster" .spec.versions[0].schema.openAPIV3Schema.properties.spec → inventory (nullable, camelCased fields) Load-test run: DEVICES=20 GROUP_SIZES=10,5,5 DURATION=20 all 3 CRs hit expected matched=N / succeeded+failed+pending=N. Other changes: - k8s-openapi gets the `schemars` feature so LabelSelector derives JsonSchema. - InventorySnapshot uses `#[serde(rename_all = "camelCase")]` for consistency with the rest of the CRD schema. - agent publishes `device-id=<id>` as a default label so the example_iot_apply_deployment `--target-device <id>` shorthand works out-of-the-box (implemented as `--selector device-id=<id>`). - example_iot_apply_deployment gains `--selector key=value` repeatable flag. - load-test.sh explore banner exposes Device CR commands + new matchedDeviceCount column.Roadmap: - v0_1_plan.md Chapter 2: rewrite to describe the shipped selector + Device CRD model (matchedDeviceCount, LabelSelector, per-concern KV). Drop AgentStatus / observed_score_string / target_devices references. Update "State of the world" preamble to match 2026-04-23 reality. - chapter_4_aggregation_scale.md: SUPERSEDED banner at top with a clear what-was-kept vs. what-was-dropped summary. Original body preserved as decision-trail archaeology. Code review pass on the iot crates, behavior-preserving: - fleet_aggregator: owned_targets is now keyed by DeploymentName (matches the KV key space — globally unique, no namespace). The old DeploymentKey keying created an orphan-leak on operator restart: seed_owned_targets stashed entries under a sentinel namespace ("") that on_deployment_upsert never merged. Now seeding populates the map correctly so restart + selector change diffs properly. - fleet_aggregator: reuse the Client passed into run() for the patch_api instead of calling Client::try_default() a second time. - fleet_aggregator: delete _use_list_params / _use_deployment_spec placeholder scaffolding + unused ListParams / DeploymentSpec / ScorePayload imports. Inline one-liner serialize_score. - fleet_aggregator: clean up `then(|| ...)` → filter/map split. - device_reconciler: `is_label_value(v).then_some(()).is_some()` → plain `is_label_value(v)`. 
- crd: delete the speculative DeviceStatus + DeviceCondition (nothing writes to them; a comment in DeviceSpec documents where they would land once a heartbeat-reflection reconciler shows up).
- controller: compute `obj.name_any()` once in cleanup().

All 24 tests green. End-to-end load test (20 devices / 3 groups / 20s) PASS after the changes.

Three production-path improvements bundled into one chart change, all verified end-to-end (helm lint + load-test pass):

1. Switch from `HelmResourceKind::from_serializable(...)` to the typed `HelmResourceKind::{Namespace, ServiceAccount, ClusterRole, ClusterRoleBinding, Crd}` variants added to the shared harmony helm module. Serialization output is byte-equivalent; IDE discoverability + type safety go up.
2. Annotate both CRDs with `helm.sh/resource-policy: keep`. Without this, `helm uninstall iot-operator-v0` cascade-deletes the CRDs; the kube GC then deletes every Deployment CR and every Device CR; the operator finalizer fires on each deletion and wipes the `desired-state` KV; agents tear down every container. One typo on uninstall would be a fleet-wide catastrophe. `keep` makes uninstall data-preserving and idempotent — a wipe requires an explicit `kubectl delete crd …`.
3. Lock down the operator Pod's securityContext:
   - `runAsNonRoot: true`
   - `readOnlyRootFilesystem: true`
   - `allowPrivilegeEscalation: false`
   - `capabilities: drop [ALL]`
   - `seccompProfile: RuntimeDefault`
   Deliberately *no* `runAsUser` — OpenShift's `restricted-v2` SCC assigns namespace-specific UIDs and rejects fixed ones. The image's `USER 65532:65532` (Dockerfile) gives vanilla k8s a non-root UID; OpenShift's SCC overrides it with its own. The same chart works on both without custom SCC bindings.

The Dockerfile adds `USER 65532:65532` — required for vanilla k8s to accept `runAsNonRoot: true` without a Pod-level `runAsUser`. 65532 is the distroless/chainguard `nonroot` convention; arbitrary but safe (no overlap with common system UIDs).

Tests: 2 chart unit tests locking in the keep annotation + securityContext shape. End-to-end load test at 20 devices / 3 CRs: the pod comes up clean under the restricted securityContext, all aggregates correct, zero operator warnings. (A typed sketch of the keep annotation and the restricted securityContext follows below.)

Overall not too bad, but there are some important items to consider.
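For illustration, a minimal sketch of the two chart shapes from points 2 and 3, written with the stock k8s_openapi types (the field names are upstream; the helper names and how the harmony helm module would actually consume these structs are assumptions, not the real code):

```rust
use std::collections::BTreeMap;

use k8s_openapi::api::core::v1::{Capabilities, SeccompProfile, SecurityContext};
use k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta;

/// Hypothetical helper: CRD metadata carrying the Helm keep policy (point 2),
/// so `helm uninstall` leaves the CRDs (and everything hanging off them) alone.
fn crd_keep_metadata(name: &str) -> ObjectMeta {
    ObjectMeta {
        name: Some(name.to_string()),
        annotations: Some(BTreeMap::from([(
            "helm.sh/resource-policy".to_string(),
            "keep".to_string(),
        )])),
        ..Default::default()
    }
}

/// Hypothetical helper: the restricted container security context (point 3).
/// Deliberately no run_as_user, so OpenShift's restricted-v2 SCC can assign
/// its own UID while the image's `USER 65532:65532` covers vanilla k8s.
fn restricted_security_context() -> SecurityContext {
    SecurityContext {
        run_as_non_root: Some(true),
        read_only_root_filesystem: Some(true),
        allow_privilege_escalation: Some(false),
        capabilities: Some(Capabilities {
            drop: Some(vec!["ALL".to_string()]),
            ..Default::default()
        }),
        seccomp_profile: Some(SeccompProfile {
            type_: "RuntimeDefault".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    }
}
```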
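Circling back to the selector model from the first commit: a rough sketch of a spec that targets devices through a LabelSelector instead of a device-id list, using kube's derive macros. Group and kind names match the commit; the API version, namespaced scope of Deployment, and the simplified field set are assumptions. This is also exactly why k8s-openapi needs the `schemars` feature: LabelSelector must derive JsonSchema to appear in the generated CRD schema.

```rust
use k8s_openapi::apimachinery::pkg::apis::meta::v1::LabelSelector;
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Sketch of a Deployment spec that selects Devices; the real CRD's
/// rollout / score fields are elided here.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "iot.nationtech.io",
    version = "v1alpha1", // assumed; the commit does not name the version
    kind = "Deployment",
    namespaced,
    status = "DeploymentStatus"
)]
#[serde(rename_all = "camelCase")]
pub struct DeploymentSpec {
    /// matchLabels + matchExpressions, evaluated against the Device cache.
    pub target_selector: LabelSelector,
}

/// `.status.aggregate` as described above.
#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "camelCase")]
pub struct DeploymentStatus {
    pub aggregate: Option<AggregateStatus>,
}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "camelCase")]
pub struct AggregateStatus {
    pub matched_device_count: u32,
    pub succeeded: u32,
    pub failed: u32,
    pub pending: u32,
    pub last_error: Option<String>,
}

/// Device is cluster-scoped like Node: no `namespaced` attribute on the derive.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "iot.nationtech.io", version = "v1alpha1", kind = "Device")]
#[serde(rename_all = "camelCase")]
pub struct DeviceSpec {
    /// Inventory reported by the agent (nullable in the generated schema);
    /// the real type is InventorySnapshot, simplified to a map here.
    pub inventory: Option<std::collections::BTreeMap<String, String>>,
}
```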
@@ -0,0 +10,4 @@
[dependencies]
harmony = { path = "../../harmony", default-features = false, features = ["podman"] }
iot-operator-v0 = { path = "../../iot/iot-operator-v0" }

That should be renamed to something like harmony-reconcile-operator, fleet-manager, or distributed-deployment-reconcile-operator. Capture the essence of decentralized fleet management (IoT, datacenters, or whatever).
@@ -271,0 +293,4 @@
ssh_exec(host, creds, &format!("sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 {user}"),

Delete that: hardcoded subuid/subgid ranges are a bug waiting to happen when multiple users share the same machine.
@@ -0,0 +1,279 @@
//! Low-level NATS single-node primitive.

This is completely wrong. Do not build a completely new way to deploy NATS. Just use the existing deployment method in the module and create a score on top of the same method for the simple deployment.
For example, if the current NATS score deploys a supercluster + TLS + multi-node via the helm chart, extract a NatsHelmChartScore, then refactor the complex score on top of it and create a simple NatsBasicScore.
@@ -0,0 +1,179 @@
//! High-level single-node NATS Score — a thin preset over the
//! low-level [`super::node::NatsNodeSpec`] primitive.

The feature is correct, but the implementation of the NATS node primitive is wrong; see the previous comment.
@@ -0,0 +22,4 @@
    heartbeat_bucket: kv::Store,
}

impl FleetPublisher {

This one looks good. We might be able to improve type safety a bit, but it's already in good shape.
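One possible direction for the type-safety nit, purely as a sketch: wrap the raw `kv::Store` in a newtype that only accepts the heartbeat payload type, so a heartbeat write can't be pointed at the wrong bucket or fed an unserialized blob. The payload fields and helper names here are illustrative (not the real harmony-reconciler-contracts definitions), and anyhow is used for brevity:

```rust
use async_nats::jetstream::kv;
use serde::Serialize;

/// Illustrative stand-in for the real HeartbeatPayload wire type.
#[derive(Serialize)]
struct HeartbeatPayload {
    device_id: String,
    timestamp_ms: u64,
}

/// Newtype over the heartbeat bucket: callers can only publish heartbeats,
/// and only as the typed payload.
struct HeartbeatBucket(kv::Store);

impl HeartbeatBucket {
    async fn publish(&self, hb: &HeartbeatPayload) -> anyhow::Result<()> {
        let bytes = serde_json::to_vec(hb)?;
        // Key by device id so each device overwrites its own entry.
        self.0.put(hb.device_id.as_str(), bytes.into()).await?;
        Ok(())
    }
}
```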
@@ -0,0 +1,342 @@
//! Generate the operator's helm chart from typed Rust.

Isn't there an existing crate or some sort of schema we can use for helm charts?
Addresses the review point that NatsBasicScore was a parallel typed-k8s_openapi path — reinventing probes, resource shapes, pod anti-affinity, JetStream storage — instead of reusing what NatsK8sScore already does via the upstream nats/nats helm chart. Every shape the project will ever ship (supercluster, single node, TLS, gateway, leaf nodes) is expressible as values on that chart. Parallel resource construction was churn waiting to diverge.

The shape now:

    HelmChartScore                    [existing helm-install primitive]
          ▲
          │ pins chart + repo
          │
    NatsHelmChartScore (new)          [exposes values_yaml only]
          ▲            ▲
          │            │
    NatsBasicScore   NatsK8sScore
    (single node)    (supercluster + TLS + gateways)

Changes:
- Delete harmony/src/modules/nats/node.rs (279 lines of typed k8s_openapi Deployment/Service/Namespace — gone).
- New harmony/src/modules/nats/helm_chart.rs: NatsHelmChartScore pins chart_name = "nats/nats" and its official repository; values_yaml is the only varying input. Implements Score<T> for any topology with HelmCommand; the caller hands it to K8sBareTopology / HAClusterTopology / K8sAnywhereTopology.
- Rewrite score_nats_basic.rs as a thin preset: build a minimal single-node values_yaml (fullnameOverride, replicaCount=1, cluster.enabled=false, jetstream on/off, service type via the chart's `service.merge.spec.type` knob, optional image override). 10 unit tests on render_values cover every builder combination + image-ref splitting. The Score bound moves from `T: K8sclient` to `T: HelmCommand` since installation is now helm-based. (A hedged sketch of such a values builder follows the test list below.)
- score_nats_k8s.rs: the last step in deploy_nats switches from a hand-constructed HelmChartScore to NatsHelmChartScore::new(...). Supercluster values_yaml construction is untouched — a supercluster is just a more elaborate values file against the same chart.
- bare_topology.rs: add `impl HelmCommand for K8sBareTopology` so the in-load-test flow (K8sBareTopology → NatsBasicScore → NatsHelmChartScore → HelmChartScore) compiles. Returns a bare `helm` command; KUBECONFIG resolution mirrors how HAClusterTopology does it.
- mod.rs: export NatsHelmChartScore + the reshaped NatsServiceType.
- load-test.sh: the nats/nats chart provisions a StatefulSet, not a Deployment. Wait on `pod -l app.kubernetes.io/name=nats` instead of `deployment/iot-nats`; this works across workload kinds.

Tests:
- 2 helm_chart unit tests (chart + repo pinning, default install-upgrade semantics)
- 10 score_nats_basic unit tests covering every values shape
- Full load-test.sh e2e (20 devices / 3 CRs / 20s): PASS.
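As a rough illustration of the single-node preset (not the actual NatsBasicScore code), a values builder along these lines. The function name is made up (the real one is render_values on the score), the key names follow the description above, and the exact nesting expected by the pinned nats/nats chart version is not verified here:

```rust
/// Build a minimal single-node values_yaml for the nats/nats chart.
/// Image override and the other builder options are omitted for brevity.
fn render_single_node_values(
    fullname: &str,
    jetstream: bool,
    service_type: Option<&str>,
) -> String {
    let mut values = format!(
        "fullnameOverride: {fullname}\n\
         replicaCount: 1\n\
         cluster:\n  enabled: false\n\
         jetstream:\n  enabled: {jetstream}\n"
    );
    if let Some(ty) = service_type {
        // Service type goes through the chart's service.merge.spec.type knob.
        values.push_str(&format!(
            "service:\n  merge:\n    spec:\n      type: {ty}\n"
        ));
    }
    values
}
```

The returned string would be the only varying input handed to NatsHelmChartScore, which pins the chart name and repository.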
The IoT vocabulary was anchoring the codebase to one customer's domain. The reconciler pattern is generic — operator in k8s, NATS KV as the desired-state bus, agents reconciling podman / OKD / KVM / anything that can register. "Fleet" captures that neutrally; IoT stays acknowledged in the docs as the first customer use case. Done now, while nothing is deployed: after a partner fleet lands, changing the CRD group alone becomes a multi-quarter migration.

Scope (nothing left over):

Paths + crates
- iot/ → fleet/
- iot/iot-operator-v0 → fleet/harmony-fleet-operator
- iot/iot-agent-v0 → fleet/harmony-fleet-agent
- harmony/src/modules/iot → harmony/src/modules/fleet
- ROADMAP/iot_platform → ROADMAP/fleet_platform
- examples/iot_{vm_setup, load_test, nats_install} → examples/fleet_*
- The -v0 suffix is dropped on the operator + agent crates (semver in Cargo.toml already tracks the version).

Rust identifiers
- enum IotScore (podman score payload) → ReconcileScore
- struct IotDeviceSetupScore/Config → FleetDeviceSetupScore/Config
- InterpretName::IotDeviceSetup → InterpretName::FleetDeviceSetup
- HarmonyIotPool → HarmonyFleetPool (libvirt pool)
- HARMONY_IOT_POOL_NAME (default "harmony-iot") → HARMONY_FLEET_POOL_NAME ("harmony-fleet")
- IotSshKeypair → FleetSshKeypair
- ensure_iot_ssh_keypair / ensure_harmony_iot_pool / check_iot_smoke_preflight_for_arch → fleet-prefixed variants

Wire / config surfaces
- CRD group `iot.nationtech.io` → `fleet.nationtech.io`
- Finalizer `iot.nationtech.io/finalizer` → `fleet.nationtech.io/finalizer`
- Shortnames iotdep/iotdevice → fleetdep/fleetdev
- Env var IOT_AGENT_CONFIG → FLEET_AGENT_CONFIG
- Env var IOT_VM_ADMIN_PASSWORD → FLEET_VM_ADMIN_PASSWORD
- Binary /usr/local/bin/iot-agent → /usr/local/bin/fleet-agent
- Systemd user `iot-agent` → `fleet-agent`
- VM admin user `iot-admin` → `fleet-admin`

Defaults
- Namespaces iot-system/iot-demo/iot-load → fleet-system/fleet-demo/fleet-load
- Helm release iot-nats → fleet-nats
- Helm release iot-operator-v0 → harmony-fleet-operator
- Container image localhost/iot-operator-v0:latest → localhost/harmony-fleet-operator:latest
- On-disk cache $HARMONY_DATA_DIR/iot/ → $HARMONY_DATA_DIR/fleet/ (cloud-images, ssh keypairs, libvirt pool)

What stayed
- harmony-reconciler-contracts — already neutrally named
- Wire types (DeviceInfo, DeploymentState, HeartbeatPayload, DeploymentName) — already neutral
- KV buckets (device-info, device-state, device-heartbeat, desired-state) — already neutral
- CRD kind names (Deployment, Device) — already neutral
- NatsBasicScore / NatsHelmChartScore / HelmChart / etc. — framework-scope, unchanged

Verification
- cargo check --workspace --all-targets: clean
- All harmony lib tests (114), fleet-operator (6), fleet-agent (7), and harmony-reconciler-contracts (13): green
- End-to-end load test (20 devices / 3 CRs / 20s under fleet/scripts/load-test.sh): PASS. Image built as localhost/harmony-fleet-operator:latest, chart installed as release harmony-fleet-operator in namespace fleet-system, all CR aggregates correct.
- Zero stragglers: a grep across the tree for \biot\b / IOT_ / \bIot[A-Z] returns empty (excluding docs explicitly talking about IoT as the first customer's domain).

Huge progress here; I'll do a final full review in the skeleton branch with everything together.