wip feat/add-new-node #268

Open
stremblay wants to merge 27 commits from feat/add-new-node into master
No description provided.
stremblay added 4 commits 2026-04-21 17:50:17 +00:00
Pulls the two Askama bond templates out of OpenShiftNmStateNetworkManager into
a new pub(crate) networkmanager_cfg module and parameterizes the bond master
on LaggProtocol (plus a new disabled-interface builder for blacklisted NICs).
LACP output is kept byte-identical to the previous template so existing
post-install bonding is unchanged; the new builders unblock the upcoming
Day-2 add-node flow which needs to pick modes per operator input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The filename parameter was &'static str because bootstrap only ever rendered bootstrap.ign /
master.ign / worker.ign. The upcoming add-node flow generates per-host
filenames like master-cp3.ign, so the lifetime has to be bound to the
caller's String. All existing callers pass string literals, which coerce
through unchanged.
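
The change can be sketched like this (struct and function names are hypothetical stand-ins, not the actual bootstrap types):

```rust
// Hypothetical sketch: the rendered filename used to be `&'static str`;
// binding it to the caller's lifetime also accepts runtime-built Strings.
struct RenderedFile<'a> {
    filename: &'a str,
}

fn render<'a>(filename: &'a str) -> RenderedFile<'a> {
    RenderedFile { filename }
}

fn main() {
    // Existing callers pass literals (&'static str), which coerce unchanged.
    assert_eq!(render("master.ign").filename, "master.ign");
    // The add-node flow builds per-host names at runtime.
    let per_host = format!("master-{}.ign", "cp3");
    assert_eq!(render(&per_host).filename, "master-cp3.ign");
}
```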

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New score that publishes all the artifacts needed to add a single
control-plane or worker node to an already-running OKD cluster:

  - reads existing nodes of the target role from the live cluster
    via kube-rs (no oc shell-out)
  - derives the next hostname from their naming pattern (cp0, cp1, cp2
    -> cp3; cp0-harmony -> cp3-harmony) and the next IPv4 by inferring
    a linear seq->last-octet mapping from the existing InternalIPs
  - runs DiscoverHostForRoleScore for installation disk + bond + blacklist
  - pulls the live user-data secret (master/worker-user-data-managed in
    openshift-machine-api) and injects the operator's captured
    NetworkManager config (.nmconnection bond master + slaves +
    blacklist) plus /etc/hostname straight into the stub ignition JSON
  - pushes <role>-<hostname>.ign to the firewall HTTP root, renders a
    per-MAC iPXE pointing at it, and creates the DHCP reservation
  - returns a detail summary; for ControlPlane, prints a loud reminder
    that etcd membership / API serving certs / CSR approval are still
    manual follow-ups

Hostname and IP are auto-derived — not user-editable — because that was
the user's explicit decision. Fails loudly if the cluster has zero
existing nodes of the target role or if the IP assignment isn't
linear-stride-1.
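
The two derivation rules can be sketched as follows (helper names are hypothetical; the real score reads InternalIPs via kube-rs):

```rust
// Split "cp3-harmony" into ("cp", 3, "-harmony"); None if no index digits.
fn split_name(name: &str) -> Option<(String, u32, String)> {
    let (head, suffix) = match name.find('-') {
        Some(i) => (&name[..i], name[i..].to_string()),
        None => (name, String::new()),
    };
    let pos = head.find(|c: char| c.is_ascii_digit())?;
    let n: u32 = head[pos..].parse().ok()?;
    Some((head[..pos].to_string(), n, suffix))
}

// Next hostname: keep prefix/suffix, increment the highest index.
fn next_hostname(existing: &[&str]) -> Option<String> {
    let mut parsed = Vec::new();
    for name in existing {
        parsed.push(split_name(name)?);
    }
    let (prefix, _, suffix) = parsed.first()?.clone();
    let max = parsed.iter().map(|(_, n, _)| *n).max()?;
    Some(format!("{prefix}{}{suffix}", max + 1))
}

// Next IP last octet: require a linear stride-1 seq -> octet mapping,
// failing loudly (None) instead of guessing when it doesn't hold.
fn next_last_octet(pairs: &[(u32, u8)], next_seq: u32) -> Option<u8> {
    let (s0, o0) = *pairs.first()?;
    let base = o0 as i64 - s0 as i64;
    for &(s, o) in pairs {
        if o as i64 - s as i64 != base {
            return None;
        }
    }
    u8::try_from(base + next_seq as i64).ok()
}

fn main() {
    assert_eq!(next_hostname(&["cp0", "cp1", "cp2"]).unwrap(), "cp3");
    assert_eq!(
        next_hostname(&["cp0-harmony", "cp1-harmony", "cp2-harmony"]).unwrap(),
        "cp3-harmony"
    );
    // cp0 -> .21, cp1 -> .22, cp2 -> .23 implies cp3 -> .24
    assert_eq!(next_last_octet(&[(0, 21), (1, 22), (2, 23)], 3), Some(24));
    assert_eq!(next_last_octet(&[(0, 21), (1, 23)], 2), None);
}
```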

Includes unit tests for hostname parsing, next-hostname derivation,
IP deriver (standard case + error cases), and ignition injection
(preserves existing storage.files, handles a stub with no storage key,
emits mode 0600 + overwrite).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(examples): add runnable okd_add_node example
All checks were successful
Run Check Script / check (pull_request) Successful in 2m8s
9944b6c02b
Wires AddOkdNodeScore into a minimal example that assumes an
already-running OKD cluster and an OPNsense firewall serving PXE. The
topology builder mirrors examples/sttest's shape but leaves
control_plane/workers/bootstrap_host empty — the score reads existing
nodes directly from the cluster via kube-rs, so static topology lists
aren't consulted.

env.sh documents that KUBECONFIG must point at the target cluster's
kubeconfig; the score relies on K8sClient::try_default() to pick it up
when HAClusterTopology.kubeconfig is None.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stremblay added 2 commits 2026-04-21 18:07:24 +00:00
The example was adding a worker; switch it to ControlPlane since that's
the case the operator most often needs a runbook for. AddOkdNodeScore
already prints the CP follow-up caveat (etcd, serving certs, CSR) in
its success output, so this also makes the caveat visible on a plain
`cargo run`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
refactor(okd): drop /etc/hostname injection, rely on DHCP
All checks were successful
Run Check Script / check (pull_request) Successful in 2m10s
6a672eb4ae
AddOkdNodeScore was baking /etc/hostname into the stub ignition as
belt-and-suspenders. The initial bootstrap flow (cp0/cp1/cp2) never
did this and it works fine: RHCOS/SCOS NetworkManager honors the
DHCP-provided hostname, and the score already creates a DHCP static
reservation with the chosen name before the operator reboots. Removing
the redundant path (function, call site, and its unit test) so the
add-node flow matches bootstrap and has one less code path to audit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stremblay added 7 commits 2026-04-21 20:03:46 +00:00
Two intertwined updates to the sttest example:

Modernize the OPNsense construction to match affilium2:
  - drop the local OPNSenseFirewallConfig that reused one username/password
    for both API and SSH
  - use the canonical OPNSenseFirewallCredentials + OPNSenseApiCredentials
    from harmony::config::secret (each prompted / stored separately)
  - extract get_opnsense() -> Arc<OPNSenseFirewall> as a reusable helper
  - call OPNSenseFirewall::with_api_port(..., 9443, ...) so the client
    hits the OPNsense web GUI on 9443 (HAProxy owns 443)
  - trim Cargo.toml to the deps the crate actually imports

Exercise the new Day-2 add-node flow end to end:
  - init harmony_cli::cli_logger so the run is debuggable
  - replace the lone HarmonyDiscoveryStrategy::MDNS with a SUBNET scan
    over 192.168.40.0/24:25000, defined once and threaded through both
    OKDInstallationPipeline::get_all_scores(...) and the appended
    AddOkdNodeScore { role: ControlPlane, ... }
  - env.sh now documents the two OPNsense secrets and reminds the
    operator to point KUBECONFIG at
    ./data/okd/installation_files_sttest0/auth/kubeconfig after the
    installer finishes (AddOkdNodeScore needs a reachable k8s API)

The add-node tail only publishes cp3's ignition + byMAC + DHCP; etcd
membership, serving certs, and CSR approval remain manual per the
score's own footer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The OPNsense HttpServer impl was calling install_package("os-caddy")
whenever CaddyConfig::is_installed() returned false. is_installed()
maps any API failure (timeout, connection error, wrong port) to
"not installed", so on Day-2 — when caddy is obviously already there
and serving — a flaky `GET /api/caddy/General/get` would trigger a
pointless reinstall attempt that itself stalled on a 30s timeout and
brought the whole score down.

Trust the operator: if caddy isn't installed, Harmony logs a warn and
lets the next upload_files / enable call surface a pointed error.
Anyone bootstrapping OPNsense from scratch installs os-caddy via the
plugin manager once, same as they provision API credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CaddyConfig::is_installed() used to return bool, collapsing every error —
404, 401, 5xx, timeout, TLS mismatch — into "not installed". That's what
caused the Day-2 sttest run to stall on a 30s GET /api/caddy/General/get
timeout, then fall into a second 30s install_package timeout, and finally
die after ~4 minutes.

Change the signature to Result<bool, Error>:
  - Ok(true)  on 2xx
  - Ok(false) only on HTTP 404 (MVC route doesn't exist -> plugin truly
    not installed)
  - Err(_)    for everything else (can't tell)

The single caller in OPNSenseFirewall::ensure_initialized now matches on
the Result: Ok(true) skips install, Ok(false) runs install_package, and
Err(_) fails the score immediately with a pointed message asking the
operator whether the API is reachable at the configured port. This also
restores the auto-install branch that commit 9b6bdaf removed — that
commit was rejected; this replaces it with a correctly-gated install.
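
The gating can be sketched like this (the Error variants and HTTP-status plumbing are stand-ins, not the actual opnsense-config types):

```rust
// Stand-in error type for the sketch.
#[derive(Debug)]
enum Error {
    Http(u16),
    Timeout,
}

// Collapse an HTTP outcome into the three cases the caller matches on.
fn is_installed(status: Result<u16, Error>) -> Result<bool, Error> {
    match status {
        Ok(code) if (200..300).contains(&code) => Ok(true),
        Ok(404) => Ok(false), // MVC route absent: plugin truly not installed
        Ok(code) => Err(Error::Http(code)), // 401/5xx/...: can't tell
        Err(e) => Err(e), // timeout, TLS mismatch, connection error
    }
}

fn main() {
    assert_eq!(is_installed(Ok(200)).unwrap(), true);
    assert_eq!(is_installed(Ok(404)).unwrap(), false);
    assert!(is_installed(Ok(500)).is_err());
    assert!(is_installed(Err(Error::Timeout)).is_err());
}
```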

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AddOkdNodeScore identified the just-discovered host by diffing
"mappings before discovery" against "mappings after". On a re-run where
the operator re-picks the same host (DiscoverHostForRoleScore's
overwrite prompt does DELETE+INSERT, keeping the same host_id), the
diff is empty and the score bails with "no new host mapped for this
role".

Add InventoryRepository::get_latest_host_for_role(role) that returns
the highest-id row for a role, and use it after discovery. The
auto-increment id makes this robust whether the host is brand new or a
replacement, and drops the HashSet snapshot plumbing from the caller.

SQLite impl mirrors get_role_mapping's shape (ORDER BY id DESC LIMIT 1,
deserialize NetworkConfig JSON). .sqlx/ offline cache regenerated.
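
An in-memory model of the ranking (row shape is illustrative, not the actual schema): ORDER BY id DESC LIMIT 1 over an auto-increment id picks the latest mapping whether the host is brand new or a DELETE+INSERT replacement.

```rust
// rows are (id, role, host_id); mimic ORDER BY id DESC LIMIT 1.
fn latest_host_for_role(rows: &[(u64, &str, &str)], role: &str) -> Option<String> {
    rows.iter()
        .filter(|(_, r, _)| *r == role)
        .max_by_key(|(id, _, _)| *id)
        .map(|(_, _, host)| host.to_string())
}

fn main() {
    // Re-run scenario: host-a was DELETE+INSERTed for ControlPlane,
    // so the same host_id reappears with a higher id.
    let rows = [
        (1, "ControlPlane", "host-a"),
        (2, "Worker", "host-b"),
        (3, "ControlPlane", "host-a"),
    ];
    assert_eq!(latest_host_for_role(&rows, "ControlPlane").unwrap(), "host-a");
    assert_eq!(latest_host_for_role(&rows, "Worker").unwrap(), "host-b");
    assert_eq!(latest_host_for_role(&rows, "Bootstrap"), None);
}
```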

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small log tweaks surfaced during Day-2 sttest runs:

- The interpret start log was `Http running...` — ambiguous when several
  scores push through the same Http interpret. Match the success/fail
  format by prefixing with the score name:
      [HttpScore] Http starting...

- CIDR-based inventory discovery logs `Error querying inventory agent on
  X:port : ...` at info level for every IP in the subnet that doesn't
  answer with a proper agent response. Drop to debug — most IPs on a
  /24 are noise; timeouts already log at debug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the rest of the OKD score namespace (OKDSetup01InventoryScore,
OKDBootstrap02…, OKDLoadBalancerScore, …). The Add prefix made this one
the odd name out. Renames the companion interpret (AddOkdNodeInterpret →
OKDAddNodeInterpret), the Score::name() debug string, and the
InterpretName::Custom tag for symmetry. AddNodeRole stays as-is (not a
Score, and its prefix still reads fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix: opnsense port is 8443
All checks were successful
Run Check Script / check (pull_request) Successful in 2m9s
cb78b111a7
stremblay changed title from feat/add-new-node to wip feat/add-new-node 2026-04-22 15:14:09 +00:00
stremblay added 4 commits 2026-04-22 17:03:28 +00:00
After DHCP/byMAC/ignition are published, OKDAddNodeScore now offers to
reboot the newly-discovered node over SSH using the cluster SshKeyPair
(the same secret Harmony already baked into the discovery image's
/root/.ssh/authorized_keys via inventory.kickstart.j2). The operator
still gets a confirm prompt; declining falls back to the old
"power-cycle manually" path with the IP printed.

New plumbing:
  - harmony_types::ssh::SshCredentials — plain enum (SshKey with PEM +
    passphrase | Password). No derives; not a Harmony Secret. Built
    on-the-fly from existing secrets the caller already holds.
  - harmony::infra::ssh::run_command(host, port, creds, cmd) — single
    russh round-trip, fresh connection per call, no trait or state.
    Returns stdout or SshError; caller tolerates NonZeroExit when the
    command is expected to sever the session (reboot).

Harmony's Cargo.toml now uses russh/russh-keys via workspace instead of
a pinned 0.45.0 literal to match opnsense-config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the SSH reboot step, OKDAddNodeScore now polls the cluster every
5s (up to 15min) until the new Node reports Ready, approving any
pending kubelet CSRs that come up in the meantime. This closes the
last manual step for Day-2 control-plane adds — previously the score's
success footer told the operator to go run `oc adm certificate
approve` themselves.

Two CSR primitives on K8sClient (new harmony-k8s/src/csr.rs):
  - list_pending_kubelet_csrs(): CSRs whose signer is kubelet-serving
    or kube-apiserver-client-kubelet AND whose status carries no
    Approved/Denied condition yet.
  - approve_csr(name, reason, message): PATCH /approval with an
    Approved=True condition via kube-rs' Api::patch_approval.

The score's approval loop checks Ready *first* each iteration so a
stray CSR appearing after the node is healthy is not approved. CSR
matching:
  - kubelet-serving: spec.username == "system:node:<hostname>" (the
    API server fills spec.username from the authenticated requester,
    so it can be trusted).
  - kube-apiserver-client-kubelet: every pending one is approved.
    The node identity lives in the embedded CSR PEM's subject CN and
    parsing it would pull a new x509 dep for little gain; a Day-2 add
    is an operator-triggered, single-window operation and this matches
    what `oc adm certificate approve` of the pending set does.

CP success footer now only flags API serving-cert rotation as the
remaining manual step; etcd membership is explicitly noted as
auto-reconciled by cluster-etcd-operator.
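
The matching rules alone can be sketched like this (the real loop drives kube-rs' CertificateSigningRequest API and checks Ready first; the signer names are the standard Kubernetes ones):

```rust
// Decide whether a pending kubelet CSR should be approved for the
// node we just added.
fn should_approve(signer: &str, username: &str, hostname: &str) -> bool {
    match signer {
        // Serving cert: match the requester against the new node.
        "kubernetes.io/kubelet-serving" => {
            username == format!("system:node:{hostname}")
        }
        // Client cert: approve every pending one (node identity lives
        // in the embedded CSR PEM; parsing it would need an x509 dep).
        "kubernetes.io/kube-apiserver-client-kubelet" => true,
        _ => false,
    }
}

fn main() {
    assert!(should_approve("kubernetes.io/kubelet-serving", "system:node:cp3", "cp3"));
    assert!(!should_approve("kubernetes.io/kubelet-serving", "system:node:cp2", "cp3"));
    assert!(should_approve("kubernetes.io/kube-apiserver-client-kubelet", "system:serviceaccount:bootstrapper", "cp3"));
    assert!(!should_approve("example.com/other-signer", "system:node:cp3", "cp3"));
}
```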

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After CSR approval lands and the new node is Ready, OKDAddNodeScore
now rebuilds the 6 OKD LoadBalancerServices (public 80/443, private
80/443 with node-ready health checks, 22623 CP-only, 6443 CP-only with
/readyz check) from the *live* cluster's node list and runs them
through LoadBalancerScore. opnsense-config's configure_service is
already find-by-name + update-or-create, so existing backend servers
keep their UUIDs and the new IP gets linked into the right pools
without tearing down the config.

To avoid duplicating the 60-line service-build, OKDLoadBalancerScore's
internal builder is promoted to a pub fn:
    services_for_nodes(control_plane, workers)
        -> (Vec<LoadBalancerService>, Vec<LoadBalancerService>)
OKDLoadBalancerScore::new() becomes a 4-line wrapper; Day-1 bootstrap
output is byte-identical. The two private helpers
(control_plane_to_backend_server, nodes_to_backend_server) collapse
into a single private `backends_for(hosts, port)`; the tests that
referenced the old helpers now assert against services_for_nodes
output directly.

Source of truth: live cluster, not topology.control_plane. The static
topology vecs don't grow past bootstrap — reading from the cluster
means a later add-node run sees the nodes we added in this one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
refactor(add-node): refresh HAProxy before reboot, not after Ready
All checks were successful
Run Check Script / check (pull_request) Successful in 2m12s
21e97c7ca9
Moves HAProxy backend refresh from post-Ready (old step 13) to
immediately after the DHCP reservation (new step 10.5), paralleling
how DHCP primes the firewall before the node boots.

Why this is safe:
  - HAProxy's health checks are designed for exactly this: cp3's
    backend is marked DOWN until its kubelet answers /readyz, and
    the other CPs keep serving traffic. No routing disruption.
  - The only visible effect is a few minutes of health-check failure
    logs on OPNsense while cp3 comes up.

Why this is better:
  - All firewall-side prep (DHCP + HAProxy) runs together, matching
    the mental model "get the firewall ready, then ignite the node".
  - If OPNsense rejects the configure_service call, the score fails
    before the reboot — operator fixes and retries with no
    mid-transition node to babysit.
  - The post-reboot phase reduces to the only thing that actually
    needs post-reboot state: CSR approval + wait-for-Ready.

Source of truth for the backend list stays the live cluster + the
new (hostname, IP) we just DHCP-reserved — so subsequent add-node
runs still see nodes added by earlier ones. `existing` is already
in scope from step 1, so we only need one extra k8s call (the other
role's nodes for ingress pool completeness). Introduced an
existing_to_logical_hosts helper shared between that local use and
the still-needed cluster_logical_hosts wrapper for the other role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stremblay added 10 commits 2026-04-22 22:57:01 +00:00
OKD marks control-plane nodes as schedulable workers by default, so
they come back from BOTH `node-role.kubernetes.io/master` and
`node-role.kubernetes.io/worker` label-selector queries. When
OKDAddNodeScore's Day-2 refresh hands those two slices to
services_for_nodes, the port 80/443 "all nodes" pools were built with
a naive `cps.chain(workers).collect()` and ended up with the CPs
listed twice. OPNsense faithfully wrote two `server` stanzas with the
same name per CP, and `haproxy -c` rejected the staging config:

  backend 'backend_0.0.0.0_80', another server named
  '192.168.40.21_80' was already defined at line 82

Fix: order-preserving dedupe by IP when composing all_nodes. Matches
the OPNsense server-name convention ({ip}_{port}), so no two servers
ever share a name regardless of what hostnames the cluster reports.

Bootstrap path is unaffected — the static topology.control_plane /
topology.workers Vecs in the example files don't overlap, so every
insert succeeds and the output is byte-identical. New unit test pins
the overlap case.
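
A minimal sketch of the dedupe (operating on bare IP strings; the real code composes LogicalHosts): chain CPs then workers and keep the first occurrence of each IP, so schedulable CPs returned by both label selectors are not emitted twice.

```rust
use std::collections::HashSet;

// Order-preserving dedupe by IP when composing the "all nodes" pool.
fn all_nodes_dedup<'a>(cps: &[&'a str], workers: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    cps.iter()
        .chain(workers.iter())
        .copied()
        .filter(|ip| seen.insert(*ip))
        .collect()
}

fn main() {
    // Schedulable CPs show up in both slices.
    let cps = ["192.168.40.21", "192.168.40.22"];
    let workers = ["192.168.40.21", "192.168.40.22", "192.168.40.30"];
    assert_eq!(
        all_nodes_dedup(&cps, &workers),
        ["192.168.40.21", "192.168.40.22", "192.168.40.30"]
    );
}
```

Since OPNsense server names are {ip}_{port}, uniqueness by IP guarantees no two server stanzas share a name regardless of hostnames.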

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous reboot command,

    nohup bash -c 'sleep 2 && systemctl reboot' >/dev/null 2>&1 &

looked clever but silently did nothing on the real discovery image:
run_command returned Ok instantly (the backgrounded shell exited
status 0), the log said "Reboot dispatched", and the physical host
stayed up. sshd closes the session the moment the foreground shell
exits and the backgrounded `nohup bash -c ...` got torn down before
it ever hit `systemctl reboot`.

Replace with `systemctl --no-block reboot`: systemd takes ownership,
returns exit 0 immediately, SSH closes cleanly. No bash, no &, no
redirection. Flip the Err branch from debug to warn too — with the
old command an error was expected (session dying mid-reboot), with
the new command an error means the reboot really didn't dispatch and
the operator needs to see it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With RUST_LOG=harmony=debug each iteration now reports:
  - poll number + elapsed/total timeout
  - whether the Node object exists yet, and if so, a one-line summary
    of its status.conditions (Ready=False MemoryPressure=False …)
  - total pending kubelet CSRs returned by the cluster
  - per-CSR accept/skip decisions with the signer and username
  - tick summary: approved-this-tick, skipped-this-tick, cumulative
  - sleeping X seconds before next poll

Makes it obvious when the loop is stuck vs. when it's making progress
(e.g. "Node not yet present" vs "Node present, Ready=False" vs "found
0 pending CSRs" vs "approved 1 skipped 2"). No behavior change; the
old info logs for "Approved CSR …" and "Node Ready after …" still
fire at info.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polling every 5 seconds was copied from K8sClient::wait_for_node_ready
but is overkill here: the kubelet-Ready and CSR-generation transitions
each take minutes, so 5s just hammers the API server and floods the
debug logs. 30s is plenty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validates that /usr/local/http/scos/{scos-live-kernel.x86_64,
scos-live-initramfs.x86_64.img, scos-live-rootfs.x86_64.img} on the
firewall match what the running cluster's openshift-install declares
in `coreos print-stream-json`. If any file is missing, the wrong hash,
or backed by a dangling symlink, the score prints an
ssh+curl+ln-sfn bundle the operator can paste to refresh directly on
the firewall and loops on a Confirm until clean or they abort.

Why the symlinks: stream JSON ships versioned upstream filenames
(rhcos-418.94.xxx-live-kernel-x86_64) but bootstrap.ipxe.j2 asks for
the stable ones. Bundle `curl -L -O`s the versioned payload and
`ln -sfn`s the stable name at it; sha256sum follows symlinks so the
same verifier catches stale-payload-behind-stable-name too.

Score runs standalone (add it to any pipeline) AND is called as step
0 of OKDAddNodeScore::execute before anything else — operators never
have to remember to wire it in. A cluster upgrade between bootstrap
and a Day-2 add is the specific drift this catches.

Requires `oc` on PATH and KUBECONFIG pointed at the cluster (same
assumption the rest of the OKD flow already makes). `oc adm release
extract --command=openshift-install --to=./data/okd/bin` runs fresh
each invocation so we never verify against a stale installer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous command was
    sha256sum <path> 2>/dev/null | awk '{print $1}'
which assumes a Bourne-style shell. OPNsense's root shell is FreeBSD
csh, which doesn't understand `2>/dev/null` (csh reads it as a literal
`2` argument plus a stdout redirect) and in general mangles the
pipe+awk combination enough that what came back was unparseable — so
sha256_of's "trim output and require exactly 64 hex chars" check
rejected everything as `None`, and every file on the firewall showed
as `<missing>` even when it was right there and correctly hashed.

Drop the shell tricks: run the bare `sha256sum <path>` (every shell
understands that), capture its output, and parse in Rust by scanning
each line for a first token that looks like a 64-char hex hash.
Robust to stderr noise interleaved into the SSH channel buffer too.
Added 5 unit tests for the parser so the regression is locked down.
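
The parser's shape, sketched (function name hypothetical): take the first whitespace token of each line and accept it only if it is exactly 64 hex chars, so stderr noise interleaved into the channel buffer is skipped.

```rust
// Scan sha256sum output for the first line whose leading token is a
// 64-char hex digest; normalize to lowercase.
fn parse_sha256(output: &str) -> Option<String> {
    for line in output.lines() {
        if let Some(tok) = line.split_whitespace().next() {
            if tok.len() == 64 && tok.chars().all(|c| c.is_ascii_hexdigit()) {
                return Some(tok.to_ascii_lowercase());
            }
        }
    }
    None
}

fn main() {
    let hash = "d".repeat(64);
    let noisy = format!("Warning: sshd banner line\n{hash}  /usr/local/http/scos/file\n");
    assert_eq!(parse_sha256(&noisy).unwrap(), hash);
    assert_eq!(parse_sha256("sha256sum: no such file or directory\n"), None);
}
```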

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, if the SSH reboot command errored (e.g. node already
vanished, network hiccup), we logged a warn and fell straight into
the 15-minute CSR/Ready loop against a node that wasn't rebooting —
guaranteed timeout, 15 useless minutes.

Replace with a Confirm prompt that mirrors the "operator declined
SSH reboot" path: "power-cycle manually and press enter to continue,
or decline to abort". Declining returns an InterpretError; confirming
hands off to the CSR loop as usual, which is correct because by then
the operator has taken over.

Also: cap the SSH connect at 15s in harmony::infra::ssh. Kernel TCP
retries were burning 2+ minutes on dead hosts before we even got to
the failure branch; for Day-2 flows across a LAN that's far too slow
to react. New SshError::ConnectTimeout variant reports the host+port
+budget so the message is actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes to the CSR/Ready loop, both learned the hard way:

1. Transient API errors no longer abort the score.
   During a CP reboot the apiserver connection you're talking to can
   drop (HAProxy flipped backends, etcd flapped quorum, whatever),
   and kube-rs surfaces it as e.g. `ServiceError: client error
   (SendRequest)`. Previously that propagated out of the loop with
   `?`, the score failed, Harmony exited — while the node was still
   mid-ignition, leaving a half-joined cluster no one was watching
   the CSRs for. Now get_resource, list_pending_kubelet_csrs, and
   approve_csr all warn + fall through to the next tick. Only the
   15-minute deadline can stop us.

2. Sleep first, then poll.
   The reboot was just dispatched; the node is mid-reboot and can't
   possibly be Ready, and no CSRs exist yet. Polling immediately
   wastes an API round-trip every run. Swap the loop so the
   sleep(30s) happens at the top, giving the node time to boot
   before we ask about it.
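
The reordered loop, sketched with a stand-in `poll` closure in place of the real get_resource/CSR calls: sleep sits at the top, transient errors fall through to the next tick, and only the deadline ends the wait.

```rust
// Poll until Ready, tolerating transient API errors; returns false
// only when the tick budget (the 15-minute deadline) is exhausted.
fn wait_until_ready(
    max_ticks: u32,
    mut poll: impl FnMut(u32) -> Result<bool, String>,
) -> bool {
    for tick in 0..max_ticks {
        // a real 30s sleep would sit here, before the first poll
        match poll(tick) {
            Ok(true) => return true, // node Ready
            Ok(false) => {}          // not yet; next tick
            Err(e) => eprintln!("transient API error, retrying: {e}"),
        }
    }
    false
}

fn main() {
    // Ready on the third poll, with a transient error on the second.
    let ready = wait_until_ready(10, |tick| match tick {
        1 => Err("SendRequest".to_string()),
        t if t >= 2 => Ok(true),
        _ => Ok(false),
    });
    assert!(ready);
    // A node that never answers only ends via the deadline.
    assert!(!wait_until_ready(3, |_| Err("down".to_string())));
}
```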

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(inventory): order physical_hosts by rowid, not version_id
All checks were successful
Run Check Script / check (pull_request) Successful in 2m15s
feadb088c8
`physical_hosts.version_id` is `Id::default().to_string()`, which
formats `{masked_hex_timestamp}_{rand7}`. Two problems for using that
as a "latest row" key:

  1. `format!("{:x}", v)` is variable-length. When the masked 24-bit
     clock crosses a hex-width boundary (every few days at most, ±18h
     or 12 days from cycle start), lexicographic comparison breaks:
     `'_'` (0x5F) > any digit, so "1234_…" sorts HIGHER than
     "123456_…" even though the latter was saved later.
  2. The timestamp is masked to 24 bits, wrapping every ~194 days.

Net effect: `MAX(version_id)` in get_all_hosts / get_latest_by_id /
save's dedupe lookup can return an OLD row instead of the most
recently inserted one — exactly the "stale IP in the Select" the user
reported after re-adding a previously-deleted node.
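
The ordering bug is easy to demonstrate with stand-in keys: once the hex timestamp grows a digit, TEXT comparison prefers the shorter, OLDER key because '_' sorts above every digit.

```rust
fn main() {
    let older = "1234_aaaaaaa";   // 4 hex digits, smaller timestamp
    let newer = "123456_aaaaaaa"; // 6 hex digits, larger timestamp

    // '_' is 0x5F, above '0'..'9' (0x30..0x39).
    assert!('_' > '9');
    // So lexicographic MAX(version_id) picks the OLDER row.
    assert!(older > newer);
}
```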

Fix: order by SQLite's implicit `rowid` instead — strictly monotonic
INSERT counter, always reflects real save order. physical_hosts is
not WITHOUT ROWID, so rowid is always there. The version_id column
stays on every row (useful in the UI / audit logs) but no longer
drives ranking.

Id::default() is untouched; it's used throughout the codebase and
fixing the format is a separate, wider change.

sqlx offline cache regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference: NationTech/harmony#268