`physical_hosts.version_id` is `Id::default().to_string()`, which
formats `{masked_hex_timestamp}_{rand7}`. Two problems with using that
as a "latest row" key:
1. `format!("{:x}", v)` is variable-length. When the masked 24-bit
   clock crosses a hex-width boundary (e.g. 0x10000 at ~18h and
   0x100000 at ~12 days into each wrap cycle), lexicographic
   comparison breaks: `'_'` (0x5F) sorts above any decimal digit, so
   "1234_…" sorts HIGHER than "123456_…" even though the latter was
   saved later.
2. The timestamp is masked to 24 bits, wrapping every ~194 days.
Net effect: `MAX(version_id)` in get_all_hosts / get_latest_by_id /
save's dedupe lookup can return an OLD row instead of the most
recently inserted one — exactly the "stale IP in the Select" the user
reported after re-adding a previously-deleted node.
Fix: order by SQLite's implicit `rowid` instead — a monotonically
increasing insert counter that reflects real save order.
physical_hosts is not WITHOUT ROWID, so rowid is always there. The
version_id column
stays on every row (useful in the UI / audit logs) but no longer
drives ranking.
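The new ranking, sketched (column names and the lookup key are
assumptions; the real sqlx query lives in the repo):

  // latest row wins by insert order, not by version_id string compare
  let latest = sqlx::query(
      "SELECT * FROM physical_hosts WHERE id = ?1 \
       ORDER BY rowid DESC LIMIT 1",
  )
  .bind(host_id)
  .fetch_optional(pool)
  .await?;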
Id::default() is untouched; it's used throughout the codebase and
fixing the format is a separate, wider change.
sqlx offline cache regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes to the CSR/Ready loop, both learned the hard way:
1. Transient API errors no longer abort the score.
During a CP reboot the apiserver connection you're talking to can
drop (HAProxy flipped backends, etcd flapped quorum, whatever),
and kube-rs surfaces it as e.g. `ServiceError: client error
(SendRequest)`. Previously that propagated out of the loop with
`?`, the score failed, Harmony exited — while the node was still
mid-ignition, leaving a half-joined cluster no one was watching
the CSRs for. Now get_resource, list_pending_kubelet_csrs, and
approve_csr all warn + fall through to the next tick. Only the
15-minute deadline can stop us.
2. Sleep first, then poll.
The reboot was just dispatched; the node is mid-reboot and can't
possibly be Ready, and no CSRs exist yet. Polling immediately
wastes an API round-trip every run. Swap the loop so the
sleep(30s) happens at the top, giving the node time to boot
before we ask about it.
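Shape of the reshaped loop, as a sketch (helper names like is_ready
and timeout_error are hypothetical; the 30s tick and 15-minute
deadline are from this commit):

  let deadline = Instant::now() + Duration::from_secs(15 * 60);
  loop {
      // sleep FIRST: the reboot was just dispatched, nothing to poll yet
      tokio::time::sleep(Duration::from_secs(30)).await;
      if Instant::now() > deadline {
          return Err(timeout_error()); // the only hard stop
      }
      let node = match client.get_resource::<Node>(&hostname).await {
          Ok(n) => n,
          Err(e) => {
              // CP reboots can drop the apiserver connection mid-poll
              warn!("transient API error, retrying next tick: {e}");
              continue;
          }
      };
      if node.as_ref().is_some_and(is_ready) {
          return Ok(());
      }
      // list_pending_kubelet_csrs / approve_csr get the same
      // warn-and-continue treatment below
  }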
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, if the SSH reboot command errored (e.g. node already
vanished, network hiccup), we logged a warn and fell straight into
the 15-minute CSR/Ready loop against a node that wasn't rebooting —
guaranteed timeout, 15 useless minutes.
Replace with a Confirm prompt that mirrors the "operator declined
SSH reboot" path: "power-cycle manually and press enter to continue,
or decline to abort". Declining returns an InterpretError; confirming
hands off to the CSR loop as usual, which is correct because by then
the operator has taken over.
Also: cap the SSH connect at 15s in harmony::infra::ssh. Kernel TCP
retries were burning 2+ minutes on dead hosts before we even got to
the failure branch; for Day-2 flows across a LAN that's far too slow
to react. New SshError::ConnectTimeout variant reports the host+port
+budget so the message is actionable.
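The cap, sketched with tokio's timeout (the Io variant and exact
wiring are assumptions; host+port+budget on ConnectTimeout are from
this commit):

  let budget = Duration::from_secs(15);
  let stream = tokio::time::timeout(budget, TcpStream::connect((host, port)))
      .await
      .map_err(|_| SshError::ConnectTimeout {
          host: host.to_string(),
          port,
          budget,
      })? // elapsed: fail fast instead of waiting out kernel TCP retries
      .map_err(SshError::Io)?; // refused/unreachable still surface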
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous command was
sha256sum <path> 2>/dev/null | awk '{print $1}'
which assumes a Bourne-style shell. OPNsense's root shell is FreeBSD
csh, which doesn't understand `2>/dev/null` (csh reads the `2` as an
extra argument and `>/dev/null` as a stdout redirect) and in general
mangles the pipe+awk combination enough that what came back was
unparseable — so
sha256_of's "trim output and require exactly 64 hex chars" check
rejected everything as `None`, and every file on the firewall showed
as `<missing>` even when it was right there and correctly hashed.
Drop the shell tricks: run the bare `sha256sum <path>` (every shell
understands that), capture its output, and parse in Rust by scanning
each line for a first token that looks like a 64-char hex hash.
Robust to stderr noise interleaved into the SSH channel buffer too.
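The parser, roughly (a sketch; the real function plus its 5 tests are
in the commit):

  /// First token of each output line that is a 64-char hex hash.
  fn parse_sha256(output: &str) -> Option<String> {
      output
          .lines()
          .filter_map(|line| line.split_whitespace().next())
          .find(|tok| {
              tok.len() == 64 && tok.chars().all(|c| c.is_ascii_hexdigit())
          })
          .map(str::to_ascii_lowercase)
  }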
Added 5 unit tests for the parser so the regression is locked down.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validates that /usr/local/http/scos/{scos-live-kernel.x86_64,
scos-live-initramfs.x86_64.img, scos-live-rootfs.x86_64.img} on the
firewall match what the running cluster's openshift-install declares
in `coreos print-stream-json`. If any file is missing, has the wrong
hash, or is backed by a dangling symlink, the score prints an
ssh+curl+ln-sfn bundle the operator can paste to refresh the files
directly on the firewall, then loops on a Confirm until the check is
clean or the operator aborts.
Why the symlinks: stream JSON ships versioned upstream filenames
(rhcos-418.94.xxx-live-kernel-x86_64) but bootstrap.ipxe.j2 asks for
the stable ones. Bundle `curl -L -O`s the versioned payload and
`ln -sfn`s the stable name at it; sha256sum follows symlinks so the
same verifier catches stale-payload-behind-stable-name too.
Score runs standalone (add it to any pipeline) AND is called as step
0 of OKDAddNodeScore::execute before anything else — operators never
have to remember to wire it in. A cluster upgrade between bootstrap
and a Day-2 add is the specific drift this catches.
Requires `oc` on PATH and KUBECONFIG pointed at the cluster (same
assumption the rest of the OKD flow already makes). `oc adm release
extract --command=openshift-install --to=./data/okd/bin` runs fresh
each invocation so we never verify against a stale installer.
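Hash extraction from the stream JSON, roughly (the path follows the
published CoreOS stream schema; serde_json and the variable names are
assumptions):

  let stream: serde_json::Value = serde_json::from_str(&raw_stream_json)?;
  let pxe = &stream["architectures"]["x86_64"]["artifacts"]["metal"]
      ["formats"]["pxe"];
  for key in ["kernel", "initramfs", "rootfs"] {
      let expected = pxe[key]["sha256"].as_str().unwrap_or_default();
      let versioned_url = pxe[key]["location"].as_str().unwrap_or_default();
      // expected feeds the <missing>/wrong-hash check; versioned_url
      // feeds the curl -L -O line of the paste bundle
  }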
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polling every 5 seconds was copied from K8sClient::wait_for_node_ready
but is overkill here: the kubelet-Ready and CSR-generation transitions
each take minutes, so 5s just hammers the API server and floods the
debug logs. 30s is plenty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With RUST_LOG=harmony=debug each iteration now reports:
- poll number + elapsed/total timeout
- whether the Node object exists yet, and if so, a one-line summary
of its status.conditions (Ready=False MemoryPressure=False …)
- total pending kubelet CSRs returned by the cluster
- per-CSR accept/skip decisions with the signer and username
- tick summary: approved-this-tick, skipped-this-tick, cumulative
- sleeping X seconds before next poll
Makes it obvious when the loop is stuck vs. when it's making progress
(e.g. "Node not yet present" vs "Node present, Ready=False" vs "found
0 pending CSRs" vs "approved 1 skipped 2"). No behavior change; the
old info logs for "Approved CSR …" and "Node Ready after …" still
fire at info.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous reboot command,
nohup bash -c 'sleep 2 && systemctl reboot' >/dev/null 2>&1 &
looked clever but silently did nothing on the real discovery image:
run_command returned Ok instantly (the backgrounded shell exited
status 0), the log said "Reboot dispatched", and the physical host
stayed up. sshd closes the session the moment the foreground shell
exits, and the backgrounded `nohup bash -c ...` was torn down with it
before it ever reached `systemctl reboot`.
Replace with `systemctl --no-block reboot`: systemd takes ownership,
returns exit 0 immediately, SSH closes cleanly. No bash, no &, no
redirection. Flip the Err branch from debug to warn too — with the
old command an error was expected (session dying mid-reboot), with
the new command an error means the reboot really didn't dispatch and
the operator needs to see it.
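Call-site shape after the change (a sketch; run_command's signature
follows the ssh plumbing commit below):

  match ssh::run_command(&host, 22, &creds, "systemctl --no-block reboot").await {
      Ok(_) => info!("Reboot dispatched to {host}"),
      // with --no-block, a failure means the reboot truly didn't dispatch
      Err(e) => warn!("Reboot dispatch to {host} failed: {e}"),
  }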
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OKD marks control-plane nodes as schedulable workers by default, so
they come back from BOTH `node-role.kubernetes.io/master` and
`node-role.kubernetes.io/worker` label-selector queries. When
OKDAddNodeScore's Day-2 refresh hands those two slices to
services_for_nodes, the port 80/443 "all nodes" pools were built with
a naive `cps.chain(workers).collect()` and ended up with the CPs
listed twice. OPNsense faithfully wrote two `server` stanzas with the
same name per CP, and `haproxy -c` rejected the staging config:
backend 'backend_0.0.0.0_80', another server named
'192.168.40.21_80' was already defined at line 82
Fix: order-preserving dedupe by IP when composing all_nodes. Matches
the OPNsense server-name convention ({ip}_{port}), so no two servers
ever share a name regardless of what hostnames the cluster reports.
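The dedupe, sketched (LogicalHost with a cloneable `ip` field is an
assumption about the repo's types):

  use std::collections::HashSet;

  // first occurrence wins, insertion order preserved
  fn all_nodes(cps: &[LogicalHost], workers: &[LogicalHost]) -> Vec<LogicalHost> {
      let mut seen = HashSet::new();
      cps.iter()
          .chain(workers.iter())
          .filter(|h| seen.insert(h.ip.clone()))
          .cloned()
          .collect()
  }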
Bootstrap path is unaffected — the static topology.control_plane /
topology.workers Vecs in the example files don't overlap, so every
insert succeeds and the output is byte-identical. New unit test pins
the overlap case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves HAProxy backend refresh from post-Ready (old step 13) to
immediately after the DHCP reservation (new step 10.5), paralleling
how DHCP primes the firewall before the node boots.
Why this is safe:
- HAProxy's health checks are designed for exactly this: cp3's
backend is marked DOWN until its kubelet answers /readyz, and
the other CPs keep serving traffic. No routing disruption.
- The only visible effect is a few minutes of health-check failure
logs on OPNsense while cp3 comes up.
Why this is better:
- All firewall-side prep (DHCP + HAProxy) runs together, matching
the mental model "get the firewall ready, then ignite the node".
- If OPNsense rejects the configure_service call, the score fails
before the reboot — operator fixes and retries with no
mid-transition node to babysit.
- The post-reboot phase reduces to the only thing that actually
needs post-reboot state: CSR approval + wait-for-Ready.
Source of truth for the backend list stays the live cluster + the
new (hostname, IP) we just DHCP-reserved — so subsequent add-node
runs still see nodes added by earlier ones. `existing` is already
in scope from step 1, so we only need one extra k8s call (the other
role's nodes for ingress pool completeness). Introduced an
existing_to_logical_hosts helper shared between that local use and
the still-needed cluster_logical_hosts wrapper for the other role.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After CSR approval lands and the new node is Ready, OKDAddNodeScore
now rebuilds the 6 OKD LoadBalancerServices (public 80/443, private
80/443 with node-ready health checks, 22623 CP-only, 6443 CP-only with
/readyz check) from the *live* cluster's node list and runs them
through LoadBalancerScore. opnsense-config's configure_service is
already find-by-name + update-or-create, so existing backend servers
keep their UUIDs and the new IP gets linked into the right pools
without tearing down the config.
To avoid duplicating the 60-line service-build, OKDLoadBalancerScore's
internal builder is promoted to a pub fn:
services_for_nodes(control_plane, workers)
-> (Vec<LoadBalancerService>, Vec<LoadBalancerService>)
OKDLoadBalancerScore::new() becomes a 4-line wrapper; Day-1 bootstrap
output is byte-identical. The two private helpers
(control_plane_to_backend_server, nodes_to_backend_server) collapse
into a single private `backends_for(hosts, port)`; the tests that
referenced the old helpers now assert against services_for_nodes
output directly.
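The collapsed helper, roughly (BackendServer field names are
assumptions; the {ip}_{port} server name matches the OPNsense
convention noted in the dedupe commit above):

  // one server stanza per host, named {ip}_{port}
  fn backends_for(hosts: &[LogicalHost], port: u16) -> Vec<BackendServer> {
      hosts
          .iter()
          .map(|h| BackendServer {
              name: format!("{}_{}", h.ip, port),
              address: h.ip.to_string(),
              port,
          })
          .collect()
  }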
Source of truth: live cluster, not topology.control_plane. The static
topology vecs don't grow past bootstrap — reading from the cluster
means a later add-node run sees the nodes we added in this one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the SSH reboot step, OKDAddNodeScore now polls the cluster every
5s (up to 15min) until the new Node reports Ready, approving any
pending kubelet CSRs that come up in the meantime. This closes the
last manual step for Day-2 control-plane adds — previously the score's
success footer told the operator to go run `oc adm certificate
approve` themselves.
Two CSR primitives on K8sClient (new harmony-k8s/src/csr.rs):
- list_pending_kubelet_csrs(): CSRs whose signer is kubelet-serving
or kube-apiserver-client-kubelet AND whose status carries no
Approved/Denied condition yet.
- approve_csr(name, reason, message): PATCH /approval with an
Approved=True condition via kube-rs' Api::patch_approval.
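The pending filter, sketched against k8s-openapi's types (the
predicate split is an assumption; the signer names and the
no-condition rule are from this commit):

  use k8s_openapi::api::certificates::v1::CertificateSigningRequest;

  const KUBELET_SIGNERS: [&str; 2] = [
      "kubernetes.io/kubelet-serving",
      "kubernetes.io/kube-apiserver-client-kubelet",
  ];

  fn is_pending_kubelet_csr(csr: &CertificateSigningRequest) -> bool {
      let signer_ok = KUBELET_SIGNERS.contains(&csr.spec.signer_name.as_str());
      // pending == no Approved/Denied condition recorded yet
      let undecided = csr
          .status
          .as_ref()
          .and_then(|s| s.conditions.as_ref())
          .map_or(true, |cs| {
              !cs.iter()
                  .any(|c| matches!(c.type_.as_str(), "Approved" | "Denied"))
          });
      signer_ok && undecided
  }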
The score's approval loop checks Ready *first* each iteration so a
stray CSR appearing after the node is healthy is not approved. CSR
matching:
- kubelet-serving: spec.username == "system:node:<hostname>" (the
kubelet signs this field itself; trust it).
- kube-apiserver-client-kubelet: every pending one is approved.
The node identity lives in the embedded CSR PEM's subject CN, and
parsing that would pull in a new x509 dep for little gain. A Day-2
add is an operator-triggered, single-window operation, and approving
every pending client CSR matches what `oc adm certificate approve`
on the pending set does anyway.
CP success footer now only flags API serving-cert rotation as the
remaining manual step; etcd membership is explicitly noted as
auto-reconciled by cluster-etcd-operator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After DHCP/byMAC/ignition are published, OKDAddNodeScore now offers to
reboot the newly-discovered node over SSH using the cluster SshKeyPair
(the same secret Harmony already baked into the discovery image's
/root/.ssh/authorized_keys via inventory.kickstart.j2). The operator
still gets a confirm prompt; declining falls back to the old
"power-cycle manually" path with the IP printed.
New plumbing:
- harmony_types::ssh::SshCredentials — plain enum (SshKey with PEM +
passphrase | Password). No derives; not a Harmony Secret. Built
on-the-fly from existing secrets the caller already holds.
- harmony::infra::ssh::run_command(host, port, creds, cmd) — single
russh round-trip, fresh connection per call, no trait or state.
Returns stdout or SshError; caller tolerates NonZeroExit when the
command is expected to sever the session (reboot).
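The new shapes, roughly (field names inside the variants are
assumptions; the commit pins only SshKey-with-PEM+passphrase and
Password):

  pub enum SshCredentials {
      SshKey { pem: String, passphrase: Option<String> },
      Password { username: String, password: String },
  }

  // fresh russh connection per call; no trait, no retained state
  pub async fn run_command(
      host: &str,
      port: u16,
      creds: &SshCredentials,
      cmd: &str,
  ) -> Result<String, SshError> {
      todo!("single russh round-trip; returns stdout or SshError")
  }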
Harmony's Cargo.toml now uses russh/russh-keys via workspace instead of
a pinned 0.45.0 literal to match opnsense-config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the rest of the OKD score namespace (OKDSetup01InventoryScore,
OKDBootstrap02…, OKDLoadBalancerScore, …). The Add prefix made this one
the odd name out. Renames the companion interpret (AddOkdNodeInterpret →
OKDAddNodeInterpret), the Score::name() debug string, and the
InterpretName::Custom tag for symmetry. AddNodeRole stays as-is (not a
Score, and its prefix still reads fine).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small log tweaks that surfaced during Day-2 sttest runs:
- The interpret start log was `Http running...` — ambiguous when several
scores push through the same Http interpret. Match the success/fail
format by prefixing with the score name:
[HttpScore] Http starting...
- CIDR-based inventory discovery logs `Error querying inventory agent on
X:port : ...` at info level for every IP in the subnet that doesn't
answer with a proper agent response. Drop to debug — most IPs on a
/24 are noise; timeouts already log at debug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AddOkdNodeScore identified the just-discovered host by diffing
"mappings before discovery" against "mappings after". On a re-run where
the operator re-picks the same host (DiscoverHostForRoleScore's
overwrite prompt does DELETE+INSERT, keeping the same host_id), the
diff is empty and the score bails with "no new host mapped for this
role".
Add InventoryRepository::get_latest_host_for_role(role) that returns
the highest-id row for a role, and use it after discovery. The
auto-increment id makes this robust whether the host is brand new or a
replacement, and drops the HashSet snapshot plumbing from the caller.
SQLite impl mirrors get_role_mapping's shape (ORDER BY id DESC LIMIT 1,
deserialize NetworkConfig JSON). .sqlx/ offline cache regenerated.
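The SQLite impl, sketched (the table/column names are assumptions
from the commit text; the real schema is in the migrations):

  use sqlx::{Row, SqlitePool};

  async fn get_latest_host_for_role(
      pool: &SqlitePool,
      role: &str,
  ) -> Result<Option<(String, String)>, sqlx::Error> {
      // highest auto-increment id == most recent insert, whether the
      // host is brand new or an overwrite (DELETE+INSERT lands last)
      let row = sqlx::query(
          "SELECT host_id, network_config FROM role_mappings \
           WHERE role = ?1 ORDER BY id DESC LIMIT 1",
      )
      .bind(role)
      .fetch_optional(pool)
      .await?;
      // caller deserializes the NetworkConfig JSON, as get_role_mapping does
      Ok(row.map(|r| (r.get("host_id"), r.get("network_config"))))
  }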
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CaddyConfig::is_installed() used to return bool, collapsing every error —
404, 401, 5xx, timeout, TLS mismatch — into "not installed". That's what
caused the Day-2 sttest run to stall on a 30s GET /api/caddy/General/get
timeout, then fall into a second 30s install_package timeout, and finally
die after ~4 minutes.
Change the signature to Result<bool, Error>:
- Ok(true) on 2xx
- Ok(false) only on HTTP 404 (MVC route doesn't exist -> plugin truly
not installed)
- Err(_) for everything else (can't tell)
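The gate, sketched (assumes a reqwest-style response in hand;
Error::UnexpectedStatus is a hypothetical variant name):

  fn installed_from_status(resp: &reqwest::Response) -> Result<bool, Error> {
      match resp.status().as_u16() {
          200..=299 => Ok(true),  // plugin answered: installed
          404 => Ok(false),       // MVC route missing: truly not installed
          code => Err(Error::UnexpectedStatus(code)), // can't tell, surface it
      }
  }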
The single caller in OPNSenseFirewall::ensure_initialized now matches on
the Result: Ok(true) skips install, Ok(false) runs install_package, and
Err(_) fails the score immediately with a pointed message asking the
operator whether the API is reachable at the configured port. This also
restores the auto-install branch that commit 9b6bdaf removed — that
commit was rejected; this replaces it with a correctly-gated install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The OPNsense HttpServer impl was calling install_package("os-caddy")
whenever CaddyConfig::is_installed() returned false. is_installed()
maps any API failure (timeout, connection error, wrong port) to
"not installed", so on Day-2 — when caddy is obviously already there
and serving — a flaky `GET /api/caddy/General/get` would trigger a
pointless reinstall attempt that itself stalled on a 30s timeout and
brought the whole score down.
Trust the operator: if caddy isn't installed, Harmony logs a warn and
lets the next upload_files / enable call surface a pointed error.
Anyone bootstrapping OPNsense from scratch installs os-caddy via the
plugin manager once, same as they provision API credentials.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two intertwined updates to the sttest example:
Modernize the OPNsense construction to match affilium2:
- drop the local OPNSenseFirewallConfig that reused one username/password
for both API and SSH
- use the canonical OPNSenseFirewallCredentials + OPNSenseApiCredentials
from harmony::config::secret (each prompted / stored separately)
- extract get_opnsense() -> Arc<OPNSenseFirewall> as a reusable helper
- call OPNSenseFirewall::with_api_port(..., 9443, ...) so the client
hits the OPNsense web GUI on 9443 (HAProxy owns 443)
- trim Cargo.toml to the deps the crate actually imports
Exercise the new Day-2 add-node flow end to end:
- init harmony_cli::cli_logger so the run is debuggable
- replace the lone HarmonyDiscoveryStrategy::MDNS with a SUBNET scan
over 192.168.40.0/24:25000, defined once and threaded through both
OKDInstallationPipeline::get_all_scores(...) and the appended
AddOkdNodeScore { role: ControlPlane, ... }
- env.sh now documents the two OPNsense secrets and reminds the
operator to point KUBECONFIG at
./data/okd/installation_files_sttest0/auth/kubeconfig after the
installer finishes (AddOkdNodeScore needs a reachable k8s API)
The add-node tail only publishes cp3's ignition + byMAC + DHCP; etcd
membership, serving certs, and CSR approval remain manual per the
score's own footer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AddOkdNodeScore was baking /etc/hostname into the stub ignition as
belt-and-suspenders. The initial bootstrap flow (cp0/cp1/cp2) never
did this and it works fine: RHCOS/SCOS NetworkManager honors the
DHCP-provided hostname, and the score already creates a DHCP static
reservation with the chosen name before the operator reboots. Remove
the redundant path (function, call site, and its unit test) so the
add-node flow matches bootstrap and has one less code path to audit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example was adding a worker; switch it to ControlPlane since that's
the case the operator most often needs a runbook for. AddOkdNodeScore
already prints the CP follow-up caveat (etcd, serving certs, CSR) in
its success output, so this also makes the caveat visible on a plain
`cargo run`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires AddOkdNodeScore into a minimal example that assumes an
already-running OKD cluster and an OPNsense firewall serving PXE. The
topology builder mirrors examples/sttest's shape but leaves
control_plane/workers/bootstrap_host empty — the score reads existing
nodes directly from the cluster via kube-rs, so static topology lists
aren't consulted.
env.sh documents that KUBECONFIG must point at the target cluster's
kubeconfig; the score relies on K8sClient::try_default() to pick it up
when HAClusterTopology.kubeconfig is None.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New score that publishes all the artifacts needed to add a single
control-plane or worker node to an already-running OKD cluster:
- reads existing nodes of the target role from the live cluster
via kube-rs (no oc shell-out)
- derives the next hostname from their naming pattern (cp0,cp1,cp2
-> cp3; cp0-harmony -> cp3-harmony) and the next IPv4 by inferring
a linear seq->last-octet mapping from the existing InternalIPs
- runs DiscoverHostForRoleScore for installation disk + bond + blacklist
- pulls the live user-data secret (master/worker-user-data-managed in
openshift-machine-api) and injects the operator's captured
NetworkManager config (.nmconnection bond master + slaves +
blacklist) plus /etc/hostname straight into the stub ignition JSON
- pushes <role>-<hostname>.ign to the firewall HTTP root, renders a
per-MAC iPXE pointing at it, and creates the DHCP reservation
- returns a detail summary; for ControlPlane, prints a loud reminder
that etcd membership / API serving certs / CSR approval are still
manual follow-ups
Hostname and IP are auto-derived — not user-editable — because that was
the user's explicit decision. Fails loudly if the cluster has zero
existing nodes of the target role or if the IP assignment isn't
linear-stride-1.
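The hostname derivation, sketched under the commit's stated rules
(cp0,cp1,cp2 -> cp3; cp0-harmony -> cp3-harmony); assumes one
homogeneous "<prefix><seq><suffix>" pattern per role:

  fn next_hostname(existing: &[&str]) -> Option<String> {
      let mut parsed: Vec<(String, u32, String)> = existing
          .iter()
          .map(|&name| {
              // split at the first digit run: prefix, seq, suffix
              let start = name.find(|c: char| c.is_ascii_digit())?;
              let digits: String = name[start..]
                  .chars()
                  .take_while(char::is_ascii_digit)
                  .collect();
              let end = start + digits.len();
              Some((
                  name[..start].to_string(),
                  digits.parse().ok()?,
                  name[end..].to_string(),
              ))
          })
          .collect::<Option<Vec<_>>>()?; // any unparsable name fails loudly
      parsed.sort_by_key(|&(_, seq, _)| seq);
      let (prefix, seq, suffix) = parsed.pop()?;
      Some(format!("{prefix}{}{suffix}", seq + 1))
  }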
Includes unit tests for hostname parsing, next-hostname derivation,
IP deriver (standard case + error cases), and ignition injection
(preserves existing storage.files, handles a stub with no storage key,
emits mode 0600 + overwrite).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was &'static str because bootstrap only ever rendered bootstrap.ign /
master.ign / worker.ign. The upcoming add-node flow generates per-host
filenames like master-cp3.ign, so the lifetime has to be bound to the
caller's String. All existing callers pass string literals, which coerce
through unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls the two Askama bond templates out of OpenShiftNmStateNetworkManager into
a new pub(crate) networkmanager_cfg module and parameterizes the bond master
on LaggProtocol (plus a new disabled-interface builder for blacklisted NICs).
LACP output is kept byte-identical to the previous template so existing
post-install bonding is unchanged; the new builders unblock the upcoming
Day-2 add-node flow which needs to pick modes per operator input.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>