wip feat/add-new-node #268
Open
stremblay wants to merge 27 commits from feat/add-new-node into NationTech:master
No description provided.
New score that publishes all the artifacts needed to add a single control-plane or worker node to an already-running OKD cluster:

- reads existing nodes of the target role from the live cluster via kube-rs (no oc shell-out)
- derives the next hostname from their naming pattern (cp0,cp1,cp2 -> cp3; cp0-harmony -> cp3-harmony) and the next IPv4 by inferring a linear seq->last-octet mapping from the existing InternalIPs
- runs DiscoverHostForRoleScore for installation disk + bond + blacklist
- pulls the live user-data secret (master/worker-user-data-managed in openshift-machine-api) and injects the operator's captured NetworkManager config (.nmconnection bond master + slaves + blacklist) plus /etc/hostname straight into the stub ignition JSON
- pushes <role>-<hostname>.ign to the firewall HTTP root, renders a per-MAC iPXE pointing at it, and creates the DHCP reservation
- returns a detail summary; for ControlPlane, prints a loud reminder that etcd membership / API serving certs / CSR approval are still manual follow-ups

Hostname and IP are auto-derived — not user-editable — because that was the user's explicit decision. Fails loudly if the cluster has zero existing nodes of the target role or if the IP assignment isn't linear-stride-1.

Includes unit tests for hostname parsing, next-hostname derivation, IP deriver (standard case + error cases), and ignition injection (preserves existing storage.files, handles a stub with no storage key, emits mode 0600 + overwrite).
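The next-hostname derivation described above (shared prefix + numeric seq + shared suffix, fail loudly on anything unparseable) could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: `split_hostname` and `next_hostname` are hypothetical names.

```rust
// Hypothetical sketch of the next-hostname derivation; not the PR's real API.
// "cp3-harmony" -> ("cp", 3, "-harmony")
fn split_hostname(h: &str) -> Option<(String, u32, String)> {
    let start = h.find(|c: char| c.is_ascii_digit())?;
    let rest = &h[start..];
    let len = rest.find(|c: char| !c.is_ascii_digit()).unwrap_or(rest.len());
    let seq: u32 = rest[..len].parse().ok()?;
    Some((h[..start].to_string(), seq, rest[len..].to_string()))
}

fn next_hostname(existing: &[&str]) -> Option<String> {
    let parsed: Vec<_> = existing.iter().filter_map(|h| split_hostname(h)).collect();
    // fail loudly: every existing node must parse and share one naming pattern
    if parsed.is_empty() || parsed.len() != existing.len() {
        return None;
    }
    let (prefix, _, suffix) = parsed[0].clone();
    if !parsed.iter().all(|(p, _, s)| *p == prefix && *s == suffix) {
        return None;
    }
    let max = parsed.iter().map(|(_, n, _)| *n).max()?;
    Some(format!("{prefix}{}{suffix}", max + 1))
}
```

The same shape (extract seq, demand a consistent pattern, emit max+1) presumably carries over to the linear seq->last-octet IP deriver.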
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two intertwined updates to the sttest example:

Modernize the OPNsense construction to match affilium2:
- drop the local OPNSenseFirewallConfig that reused one username/password for both API and SSH
- use the canonical OPNSenseFirewallCredentials + OPNSenseApiCredentials from harmony::config::secret (each prompted / stored separately)
- extract get_opnsense() -> Arc<OPNSenseFirewall> as a reusable helper
- call OPNSenseFirewall::with_api_port(..., 9443, ...) so the client hits the OPNsense web GUI on 9443 (HAProxy owns 443)
- trim Cargo.toml to the deps the crate actually imports

Exercise the new Day-2 add-node flow end to end:
- init harmony_cli::cli_logger so the run is debuggable
- replace the lone HarmonyDiscoveryStrategy::MDNS with a SUBNET scan over 192.168.40.0/24:25000, defined once and threaded through both OKDInstallationPipeline::get_all_scores(...) and the appended AddOkdNodeScore { role: ControlPlane, ... }
- env.sh now documents the two OPNsense secrets and reminds the operator to point KUBECONFIG at ./data/okd/installation_files_sttest0/auth/kubeconfig after the installer finishes (AddOkdNodeScore needs a reachable k8s API)

The add-node tail only publishes cp3's ignition + byMAC + DHCP; etcd membership, serving certs, and CSR approval remain manual per the score's own footer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The OPNsense HttpServer impl was calling install_package("os-caddy") whenever CaddyConfig::is_installed() returned false. is_installed() maps any API failure (timeout, connection error, wrong port) to "not installed", so on Day-2 — when caddy is obviously already there and serving — a flaky `GET /api/caddy/General/get` would trigger a pointless reinstall attempt that itself stalled on a 30s timeout and brought the whole score down.
Trust the operator: if caddy isn't installed, Harmony logs a warn and lets the next upload_files / enable call surface a pointed error. Anyone bootstrapping OPNsense from scratch installs os-caddy via the plugin manager once, same as they provision API credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CaddyConfig::is_installed() used to return bool, collapsing every error — 404, 401, 5xx, timeout, TLS mismatch — into "not installed". That's what caused the Day-2 sttest run to stall on a 30s GET /api/caddy/General/get timeout, then fall into a second 30s install_package timeout, and finally die after ~4 minutes.

Change the signature to Result<bool, Error>:
- Ok(true) on 2xx
- Ok(false) only on HTTP 404 (MVC route doesn't exist -> plugin truly not installed)
- Err(_) for everything else (can't tell)

The single caller in OPNSenseFirewall::ensure_initialized now matches on the Result: Ok(true) skips install, Ok(false) runs install_package, and Err(_) fails the score immediately with a pointed message asking the operator whether the API is reachable at the configured port. This also restores the auto-install branch that commit 9b6bdaf removed — that commit was rejected; this replaces it with a correctly-gated install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two small log tweaks surfaced during Day-2 sttest runs:
- The interpret start log was `Http running...` — ambiguous when several scores push through the same Http interpret. Match the success/fail format by prefixing with the score name: [HttpScore] Http starting...
- CIDR-based inventory discovery logs `Error querying inventory agent on X:port : ...` at info level for every IP in the subnet that doesn't answer with a proper agent response. Drop to debug — most IPs on a /24 are noise; timeouts already log at debug.
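The Ok(true)/Ok(false)/Err(_) tri-state for is_installed() can be sketched as below. This is a hedged illustration, not the real client: the function here takes a pre-fetched HTTP status as `Result<u16, String>` instead of driving the OPNsense API itself.

```rust
// Sketch of the tri-state check: 2xx -> installed, 404 -> truly not
// installed, anything else -> "can't tell", which must fail the score.
fn is_installed(resp: Result<u16, String>) -> Result<bool, String> {
    match resp {
        Ok(code) if (200..300).contains(&code) => Ok(true),
        Ok(404) => Ok(false),
        Ok(code) => Err(format!("unexpected HTTP {code}: cannot tell if os-caddy is installed")),
        Err(e) => Err(format!("API unreachable ({e}): is the configured port right?")),
    }
}
```

The caller then gets three genuinely distinct branches: skip install, run install_package, or abort with a pointed question, instead of collapsing timeouts into a reinstall.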
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After DHCP/byMAC/ignition are published, OKDAddNodeScore now offers to reboot the newly-discovered node over SSH using the cluster SshKeyPair (the same secret Harmony already baked into the discovery image's /root/.ssh/authorized_keys via inventory.kickstart.j2). The operator still gets a confirm prompt; declining falls back to the old "power-cycle manually" path with the IP printed.

New plumbing:
- harmony_types::ssh::SshCredentials — plain enum (SshKey with PEM + passphrase | Password). No derives; not a Harmony Secret. Built on-the-fly from existing secrets the caller already holds.
- harmony::infra::ssh::run_command(host, port, creds, cmd) — single russh round-trip, fresh connection per call, no trait or state. Returns stdout or SshError; caller tolerates NonZeroExit when the command is expected to sever the session (reboot).

Harmony's Cargo.toml now uses russh/russh-keys via workspace instead of a pinned 0.45.0 literal to match opnsense-config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After the SSH reboot step, OKDAddNodeScore now polls the cluster every 5s (up to 15min) until the new Node reports Ready, approving any pending kubelet CSRs that come up in the meantime. This closes the last manual step for Day-2 control-plane adds — previously the score's success footer told the operator to go run `oc adm certificate approve` themselves.

Two CSR primitives on K8sClient (new harmony-k8s/src/csr.rs):
- list_pending_kubelet_csrs(): CSRs whose signer is kubelet-serving or kube-apiserver-client-kubelet AND whose status carries no Approved/Denied condition yet.
- approve_csr(name, reason, message): PATCH /approval with an Approved=True condition via kube-rs' Api::patch_approval.

The score's approval loop checks Ready *first* each iteration so a stray CSR appearing after the node is healthy is not approved.
CSR matching:
- kubelet-serving: spec.username == "system:node:<hostname>" (the kubelet signs this field itself; trust it).
- kube-apiserver-client-kubelet: every pending one is approved. The node identity lives in the embedded CSR PEM's subject CN and parsing it would pull a new x509 dep for little gain; a Day-2 add is an operator-triggered, single-window operation and this matches what `oc adm certificate approve` of the pending set does.

CP success footer now only flags API serving-cert rotation as the remaining manual step; etcd membership is explicitly noted as auto-reconciled by cluster-etcd-operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After CSR approval lands and the new node is Ready, OKDAddNodeScore now rebuilds the 6 OKD LoadBalancerServices (public 80/443, private 80/443 with node-ready health checks, 22623 CP-only, 6443 CP-only with /readyz check) from the *live* cluster's node list and runs them through LoadBalancerScore. opnsense-config's configure_service is already find-by-name + update-or-create, so existing backend servers keep their UUIDs and the new IP gets linked into the right pools without tearing down the config.

To avoid duplicating the 60-line service-build, OKDLoadBalancerScore's internal builder is promoted to a pub fn:

services_for_nodes(control_plane, workers) -> (Vec<LoadBalancerService>, Vec<LoadBalancerService>)

OKDLoadBalancerScore::new() becomes a 4-line wrapper; Day-1 bootstrap output is byte-identical. The two private helpers (control_plane_to_backend_server, nodes_to_backend_server) collapse into a single private `backends_for(hosts, port)`; the tests that referenced the old helpers now assert against services_for_nodes output directly.

Source of truth: live cluster, not topology.control_plane. The static topology vecs don't grow past bootstrap — reading from the cluster means a later add-node run sees the nodes we added in this one.
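The CSR matching rules described above reduce to a small predicate. A sketch under stated assumptions: `PendingCsr` is a stand-in struct, not the kube-rs type, though the two signer name strings are the real well-known Kubernetes signers.

```rust
// Illustrative predicate for the per-CSR accept/skip decision.
struct PendingCsr {
    signer: String,   // spec.signerName
    username: String, // spec.username
}

fn should_approve(csr: &PendingCsr, node_hostname: &str) -> bool {
    match csr.signer.as_str() {
        // Serving cert: the kubelet signs spec.username itself; trust it.
        "kubernetes.io/kubelet-serving" => {
            csr.username == format!("system:node:{node_hostname}")
        }
        // Client cert for the joining kubelet: approve every pending one,
        // matching `oc adm certificate approve` of the pending set.
        "kubernetes.io/kube-apiserver-client-kubelet" => true,
        _ => false,
    }
}
```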
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Moves HAProxy backend refresh from post-Ready (old step 13) to immediately after the DHCP reservation (new step 10.5), paralleling how DHCP primes the firewall before the node boots.

Why this is safe:
- HAProxy's health checks are designed for exactly this: cp3's backend is marked DOWN until its kubelet answers /readyz, and the other CPs keep serving traffic. No routing disruption.
- The only visible effect is a few minutes of health-check failure logs on OPNsense while cp3 comes up.

Why this is better:
- All firewall-side prep (DHCP + HAProxy) runs together, matching the mental model "get the firewall ready, then ignite the node".
- If OPNsense rejects the configure_service call, the score fails before the reboot — operator fixes and retries with no mid-transition node to babysit.
- The post-reboot phase reduces to the only thing that actually needs post-reboot state: CSR approval + wait-for-Ready.

Source of truth for the backend list stays the live cluster + the new (hostname, IP) we just DHCP-reserved — so subsequent add-node runs still see nodes added by earlier ones. `existing` is already in scope from step 1, so we only need one extra k8s call (the other role's nodes for ingress pool completeness). Introduced an existing_to_logical_hosts helper shared between that local use and the still-needed cluster_logical_hosts wrapper for the other role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OKD marks control-plane nodes as schedulable workers by default, so they come back from BOTH `node-role.kubernetes.io/master` and `node-role.kubernetes.io/worker` label-selector queries. When OKDAddNodeScore's Day-2 refresh hands those two slices to services_for_nodes, the port 80/443 "all nodes" pools were built with a naive `cps.chain(workers).collect()` and ended up with the CPs listed twice.
OPNsense faithfully wrote two `server` stanzas with the same name per CP, and `haproxy -c` rejected the staging config:

backend 'backend_0.0.0.0_80', another server named '192.168.40.21_80' was already defined at line 82

Fix: order-preserving dedupe by IP when composing all_nodes. Matches the OPNsense server-name convention ({ip}_{port}), so no two servers ever share a name regardless of what hostnames the cluster reports. Bootstrap path is unaffected — the static topology.control_plane / topology.workers Vecs in the example files don't overlap, so every insert succeeds and the output is byte-identical. New unit test pins the overlap case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous reboot command,

nohup bash -c 'sleep 2 && systemctl reboot' >/dev/null 2>&1 &

looked clever but silently did nothing on the real discovery image: run_command returned Ok instantly (the backgrounded shell exited status 0), the log said "Reboot dispatched", and the physical host stayed up. sshd closes the session the moment the foreground shell exits, and the backgrounded `nohup bash -c ...` got torn down before it ever hit `systemctl reboot`.

Replace with `systemctl --no-block reboot`: systemd takes ownership, returns exit 0 immediately, SSH closes cleanly. No bash, no &, no redirection.

Flip the Err branch from debug to warn too — with the old command an error was expected (session dying mid-reboot), with the new command an error means the reboot really didn't dispatch and the operator needs to see it.
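The order-preserving dedupe-by-IP described above is a one-pass filter over the chained slices. A minimal sketch, with `Host` standing in for the PR's logical host type:

```rust
use std::collections::HashSet;

// Stand-in for the real logical host type.
#[derive(Clone, Debug, PartialEq)]
struct Host {
    hostname: String,
    ip: String,
}

// Compose all_nodes from CPs + workers, keeping only the first
// occurrence of each IP so no two HAProxy servers share a {ip}_{port} name.
fn all_nodes_deduped(cps: &[Host], workers: &[Host]) -> Vec<Host> {
    let mut seen = HashSet::new();
    cps.iter()
        .chain(workers)
        .filter(|h| seen.insert(h.ip.clone())) // insert() is false on repeats
        .cloned()
        .collect()
}
```

Since schedulable CPs answer both label-selector queries, a CP appearing in both input slices survives exactly once, in CP-first order.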
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

With RUST_LOG=harmony=debug each iteration now reports:
- poll number + elapsed/total timeout
- whether the Node object exists yet, and if so, a one-line summary of its status.conditions (Ready=False MemoryPressure=False …)
- total pending kubelet CSRs returned by the cluster
- per-CSR accept/skip decisions with the signer and username
- tick summary: approved-this-tick, skipped-this-tick, cumulative
- sleeping X seconds before next poll

Makes it obvious when the loop is stuck vs. when it's making progress (e.g. "Node not yet present" vs "Node present, Ready=False" vs "found 0 pending CSRs" vs "approved 1 skipped 2"). No behavior change; the old info logs for "Approved CSR …" and "Node Ready after …" still fire at info.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Validates that /usr/local/http/scos/{scos-live-kernel.x86_64, scos-live-initramfs.x86_64.img, scos-live-rootfs.x86_64.img} on the firewall match what the running cluster's openshift-install declares in `coreos print-stream-json`. If any file is missing, the wrong hash, or backed by a dangling symlink, the score prints an ssh+curl+ln-sfn bundle the operator can paste to refresh directly on the firewall and loops on a Confirm until clean or they abort.

Why the symlinks: stream JSON ships versioned upstream filenames (rhcos-418.94.xxx-live-kernel-x86_64) but bootstrap.ipxe.j2 asks for the stable ones. Bundle `curl -L -O`s the versioned payload and `ln -sfn`s the stable name at it; sha256sum follows symlinks, so the same verifier catches stale-payload-behind-stable-name too.

Score runs standalone (add it to any pipeline) AND is called as step 0 of OKDAddNodeScore::execute before anything else — operators never have to remember to wire it in. A cluster upgrade between bootstrap and a Day-2 add is the specific drift this catches.
Requires `oc` on PATH and KUBECONFIG pointed at the cluster (same assumption the rest of the OKD flow already makes). `oc adm release extract --command=openshift-install --to=./data/okd/bin` runs fresh each invocation so we never verify against a stale installer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous command was

sha256sum <path> 2>/dev/null | awk '{print $1}'

which assumes a Bourne-style shell. OPNsense's root shell is FreeBSD csh, which doesn't understand `2>/dev/null` (it parses that as two separate args + stdout redirect) and in general mangles the pipe+awk combination enough that what came back was unparseable — so sha256_of's "trim output and require exactly 64 hex chars" check rejected everything as `None`, and every file on the firewall showed as `<missing>` even when it was right there and correctly hashed.

Drop the shell tricks: run the bare `sha256sum <path>` (every shell understands that), capture its output, and parse in Rust by scanning each line for a first token that looks like a 64-char hex hash. Robust to stderr noise interleaved into the SSH channel buffer too. Added 5 unit tests for the parser so the regression is locked down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`physical_hosts.version_id` is `Id::default().to_string()`, which formats `{masked_hex_timestamp}_{rand7}`. Two problems for using that as a "latest row" key:

1. `format!("{:x}", v)` is variable-length. When the masked 24-bit clock crosses a hex-width boundary (every few days at most, ±18h or 12 days from cycle start), lexicographic comparison breaks: `'_'` (0x5F) > any digit, so "1234_…" sorts HIGHER than "123456_…" even though the latter was saved later.
2. The timestamp is masked to 24 bits, wrapping every ~194 days.
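Problem 1 is easy to see concretely: with a variable-width hex prefix, the separator character decides the comparison. A tiny demonstration (the suffixes are made up):

```rust
// Variable-width hex timestamps break lexicographic "latest" ranking:
// '_' is 0x5F, which sorts above every ASCII digit (0x30..0x39), so the
// shorter (older) prefix compares GREATER at the position where its '_'
// lines up against the longer prefix's next digit.
fn older_sorts_higher() -> bool {
    let older = "1234_abcdefg"; // narrower masked-timestamp prefix
    let newer = "123456_abcdefg"; // wider prefix, inserted later
    older > newer // MAX(version_id) would pick the OLD row
}
```

This is exactly why `MAX(version_id)` can hand back a stale row, independent of the 24-bit wraparound in problem 2.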
Net effect: `MAX(version_id)` in get_all_hosts / get_latest_by_id / save's dedupe lookup can return an OLD row instead of the most recently inserted one — exactly the "stale IP in the Select" the user reported after re-adding a previously-deleted node.

Fix: order by SQLite's implicit `rowid` instead — strictly monotonic INSERT counter, always reflects real save order. physical_hosts is not WITHOUT ROWID, so rowid is always there. The version_id column stays on every row (useful in the UI / audit logs) but no longer drives ranking. Id::default() is untouched; it's used throughout the codebase and fixing the format is a separate, wider change. sqlx offline cache regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
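Returning to the csh/sha256sum fix above: the "scan each line for a first token that looks like a 64-char hex hash" parser is small enough to sketch. Illustrative only; the function name is hypothetical.

```rust
// Parse raw `sha256sum <path>` output, tolerating stderr noise
// interleaved into the SSH channel buffer: accept the first line whose
// first whitespace-separated token is exactly 64 hex chars.
fn parse_sha256(output: &str) -> Option<String> {
    output.lines().find_map(|line| {
        let tok = line.split_whitespace().next()?;
        (tok.len() == 64 && tok.chars().all(|c| c.is_ascii_hexdigit()))
            .then(|| tok.to_ascii_lowercase())
    })
}
```

Because there is no shell redirection or pipe in the remote command, the same invocation behaves identically under csh, sh, and bash, and all the fragility moves into Rust where it can be unit-tested.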