feat/opnsense-bootstrap-score #285

Open

stremblay wants to merge 38 commits from feat/opnsense-bootstrap-score into master

Author	SHA1	Message	Date
Sylvain Tremblay	f2ab47de6e	fix(opnsense-config): preserve symlinks in upload_folder Some checks failed Run Check Script / check (pull_request) Failing after 32s Details `tokio::fs::DirEntry::file_type()` does not follow symlinks, so the existing is_file()/is_dir() branches both returned false for them and the loop silently skipped every symlink. Caller-visible consequence: uploading ./data/okd/installer_image/ (three versioned SCOS files + three stable-name symlinks pointing at them) ended up with only the versioned files on the firewall under /usr/local/http/scos/. The byMAC iPXE files chainload via the stable names (scos-live-kernel.x86_64 etc.), so PXE boots dangled on a 404 until an operator created the symlinks by hand. Adds a third is_symlink() branch that reads the link target with tokio::fs::read_link and recreates it remotely via `ln -sfn` over SSH. `ln` rather than SFTP SSH_FXP_SYMLINK because OpenSSH's server inverts the (path, target) argument order versus the spec — `ln` is unambiguous across shells (including the firewall's tcsh). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 15:03:10 -04:00
Sylvain Tremblay	a283f92388	feat(okd): extend recovery Score to also re-create byMAC iPXE files Renames OKDReapplyDhcpBindingsScore to OKDReapplyFromInventoryScore (file: reapply_from_inventory.rs) to reflect the broader scope. Per selected role, the Score now: 1. Re-writes dnsmasq Host entries via DhcpHostBindingScore (existing behavior). 2. Re-creates byMAC/01-<mac>.ipxe boot files via IPxeMacBootFileScore, rendering BootstrapIpxeTpl with the role-appropriate ignition file (bootstrap.ign / master.ign / worker.ign). Pulls installation_device + MAC from each host's PhysicalHost + HostConfig row in the inventory DB; errors loudly if either is missing rather than silently producing a half-written byMAC tree. Same constructors (interactive / for_roles / all_roles) and same inquire multi-select UX, with the prompt wording broadened from "DHCP bindings" to "firewall config". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 14:47:55 -04:00
Sylvain Tremblay	28e6755d5f	feat(okd): OKDReapplyDhcpBindingsScore — re-apply DHCP from inventory DB Recovery Score for the case where the OPNsense firewall has been reinstalled but the harmony inventory database still holds the discovered physical hosts. Looks up DB hosts per role, zips them with HAClusterTopology slots, runs DhcpHostBindingScore to re-create the dnsmasq Host entries (DHCP reservation + A record) without doing network discovery, iPXE, or reboot work. Interactive via inquire multi-select (default) or explicit role list via for_roles() / all_roles(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 14:18:20 -04:00
Sylvain Tremblay	a9baa4f15d	perf(opnsense-config): bump SFTP upload chunk size to 256 KB Some checks failed Run Check Script / check (pull_request) Failing after 55s Details The previous `FramedRead<File, BytesCodec>` loop in `upload_folder` defaulted to ~8 KB reads, each turning into its own SFTP WRITE round-trip. Replace it with an explicit 256 KB chunked read+write_all loop. Same correctness, fewer awaits, fewer protocol packets per file. Drops the unused `tokio_stream` and `tokio_util::codec` imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:44:01 -04:00
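Reviewer note on a9baa4f15d: the explicit chunked read + write_all loop described above can be sketched as follows. This is a synchronous `std::io` stand-in, not the PR's code: the real `upload_folder` is async and writes to an SFTP handle, and the names `CHUNK_SIZE` and `copy_chunked` are hypothetical. Only the 256 KB figure comes from the commit message.

```rust
use std::io::{self, Read, Write};

/// Chunk size from the commit message (256 KB). The constant name is hypothetical.
const CHUNK_SIZE: usize = 256 * 1024;

/// Copy `src` into `dst` using one reusable buffer of CHUNK_SIZE bytes.
/// Each non-empty read is followed by exactly one `write_all`, so the number
/// of writes scales with file_size / CHUNK_SIZE rather than with a small
/// default read size.
fn copy_chunked<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<u64> {
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut total: u64 = 0;
    loop {
        // `read` may return fewer bytes than the buffer holds; 0 means EOF.
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // Write only the bytes that were actually read.
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

With in-memory data, `copy_chunked(&data[..], &mut out)` returns the number of bytes copied and leaves `out` equal to `data`.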
Sylvain Tremblay	d0fe5802d0	feat(opnsense): pause for operator network reconnect between LAN rebind and reboot The LAN rebind step moves the firewall to a new subnet, severing the dev machine's connection. The terminal reboot step that immediately followed would then time out trying to POST `core/firmware/reboot` from a machine no longer on the firewall's subnet. Insert a blocking `inquire::Confirm` prompt between step 5 and step 6 (only when `target_lan` is `Some(_)`) that: - tells the operator the new firewall address and prefix, - prompts them to renew DHCP or set a static IP in the new subnet, - waits for explicit confirmation before continuing. Declining returns `InterpretError` so the bootstrap fails loudly in a clearly-recoverable mid-state — re-running after reconnection picks up at the reboot step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:23:51 -04:00
Sylvain Tremblay	1dfd0dcecc	feat(opnsense): reboot + verify reachability at end of OPNsenseBootstrapScore Append a terminal step to `OPNsenseBootstrapScore.execute()`: POST `core/firmware/reboot` against the firewall's final address, then wait for the API to go unreachable, come back, and settle. Hard-fail if the firewall does not reappear at `final_ip:target_api_port`. The dance has touched firmware, the optional LAN bridge, the DHCP pool, and the LAN IP itself — a clean reboot guarantees the running kernel/config matches what was persisted, and the post-reboot probe makes reachability a contract the rest of the harmony pipeline can rely on instead of a best-effort warn-only check. Reuses the existing `wait_for_reboot_cycle` waiter from `firmware_upgrade.rs` (promoted to `pub`) and the same `core/firmware/reboot` POST shape that `perform_firmware_upgrade` uses mid-pipeline. Idempotency is unchanged: `decide()` still short-circuits to NOOP when creds exist + target is reachable + vanilla is gone, so a re-run does not trigger a second reboot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:05:46 -04:00
Sylvain Tremblay	46dca31108	test(opnsense-vm-integration): exercise OPNsenseLanBridgeScore end-to-end Add the LAN-bridge step at position #12 in the score pipeline (single-member bridge on `vtnet0`), plus end-of-test reachability assertions for HTTPS at 9443 and SSH at 22 — both must succeed before the test reports PASS. The SSH check guards specifically against the regression where sshd stays bound to the old (now IP-less) LAN device after the bridge step and the next run times out trying to reconnect. State snapshot now captures `net.link.bridge.*` sysctl values and asserts each expected key (`inherit_mac`, `pfil_member`, `pfil_bridge`, `pfil_local_phys`) has the expected value, rather than just counting ≥4 entries (the namespace isn't owned exclusively by the Score). Verified PASSED end-to-end on three sequential runs (clean run, idempotent re-run, post-reboot probe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:50:23 -04:00
Sylvain Tremblay	bd86fffae7	feat(opnsense): OPNsenseLanBridgeScore + bootstrap_score.lan_bridge field Add the standalone `OPNsenseLanBridgeScore` (Score<OPNSenseFirewall>) plus a new `lan_bridge: Option<LanBridgeParams>` field on `OPNsenseBootstrapScore`. Both call the shared `ensure_lan_bridge_step` so behaviour stays in lockstep. `LanBridgeParams::members` takes physical NIC names (e.g. `["igc0", "igc2","igc3","igc4"]`). The Score resolves them to logical interfaces, auto-promoting unmapped NICs to fresh `optN` slots before bridging. WAN's NIC is rejected with a hard error. `members: None` triggers an interactive `MultiSelect` annotated with each NIC's current logical assignment. Inside `OPNsenseBootstrapScore`, the bridge step runs AFTER the firmware upgrade and BEFORE the optional LAN-IP rebind — so a `target_lan` rebind naturally targets `bridge0` rather than the now-unbound physical port. Defaults: `reassign_lan=true`, `perf_tunables=true`, `enable_stp=false`. Perf tunables run BEFORE the bridge create so `net.link.bridge.inherit_mac=1` is live when the first member is attached (otherwise the bridge gets an auto-generated MAC and the host's L2 path silently breaks after the LAN-IP move). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:50:12 -04:00
Sylvain Tremblay	2bbd612277	feat(opnsense): atomic LAN-bridge SSH helper in bootstrap.rs Add `ensure_lan_bridge_atomic_via_ssh`, `ensure_physical_nic_assigned_via_ssh`, `set_lan_member_via_ssh`, and the `AtomicBridgeOutcome` enum. These are the SSH-driven primitives the LAN-bridge step will compose into a single Score in the next commit. The atomic helper runs one PHP-on-SSH script that, in a single `Config::save()`: - promotes unassigned physical NICs to fresh `<interfaces><optN>` entries - moves the current LAN-bound NIC to a new `optN` (when `reassign_lan` is set), so the bridge references the OPT rather than `lan` itself — required to avoid the circular `lan↔bridge0` member reference that silently breaks L2 forwarding when both point at each other - creates or updates `<bridges><bridged>` with the resolved logical members - repoints `<interfaces><lan><if>` to the bridge The detached configctl chain (`nohup … & </dev/null`) brings the new topology up without deadlocking the initiating SSH channel — sshd is restarted in the same chain so it rebinds to the new LAN device. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:49:58 -04:00
Sylvain Tremblay	037f08a1f6	feat(opnsense-config): BridgeConfig + InterfaceSettingsConfig wrappers Thin wrappers over the generated bridge and interfaces/settings models, exposed through the `Config` singleton via `Config::bridge()` and `Config::interface_settings()` — same accessor pattern as the existing `caddy()` / `lagg()` / `dnsmasq()` helpers. `InterfaceSettingsConfig::ensure_offloads_disabled()` is the entry point used by the LAN-bridge step to disable TSO/LRO globally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:49:45 -04:00
Sylvain Tremblay	9b31c302f2	fix(opnsense-api): tolerate object/array shapes for Disablevlanhwfilter OPNsense's `GET /api/interfaces/settings/get` returns `disablevlanhwfilter` as a `BaseListField::getNodeOptions()` select-widget structure rather than a plain string. Because the option keys are numeric strings (`"0"`/`"1"`/ `"2"`), PHP's `json_encode` collapses them into a JSON array — so the array index IS the wire code. The deserializer now accepts: - plain string (the `setItem` round-trip path), - object form (`{key: {value, selected}}`), - array form (index = wire code). Wire codes are also fixed to `"0"`/`"1"`/`"2"` (from the XML `value="…"` attribute, per `BaseModel::parseOptionData`), not the element names `"opt0"`/`"opt1"`/`"opt2"`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:49:36 -04:00
Sylvain Tremblay	92b0d0053a	feat(opnsense-api): generate bridge + bridge_settings_api models Add codegen output for the OPNsense Interfaces/Bridge MVC model and its settings controller. Pure generated code — no hand-written logic; mirrors the structure of the other models under `src/generated/`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 08:49:25 -04:00
Sylvain Tremblay	9d8dab60db	docs(opnsense): add reviewer-facing links to ethname / pin docstrings Some checks failed Run Check Script / check (pull_request) Failing after 43s Details The module docstring on OPNsensePinNicNamesScore already explained the NIC-shuffle problem in prose, but didn't actually link to the sources that justify the design. Anyone reviewing the code or auditing the vendored script had to Google their way to the OPNsense forum thread and the upstream repo. Adds: - https://forum.opnsense.org/index.php?topic=27023.0 (the canonical thread, with franco's endorsement of ethname) - https://forums.freebsd.org/threads/how-to-associate-an-interface-name-with-its-mac.89337/ (broader FreeBSD context for the enumeration issue) - https://github.com/eborisch/ethname (upstream repo) - https://www.freshports.org/sysutils/ethname/ (FreeBSD port entry) Also restructured the pin_nic_names module docstring into "Why this exists" / "Background reading" / "What it does" / "Two ways to use this" sections so reviewers can find the rationale faster. The ETHNAME_SCRIPT const in bootstrap.rs gets the upstream URL inline too, so the script's purpose is self-evident at every call site. No code changes. cargo doc renders the links live; cargo check / fmt / clippy stay clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:16:32 -04:00
Sylvain Tremblay	68a2487b2b	fix(opnsense): set LAN DHCP range via REST API before flipping the LAN IP Some checks failed Run Check Script / check (pull_request) Failing after 43s Details Previous attempts to update the DHCP range as part of the LAN-rebind step poked at config.xml from PHP, which didn't match the actual XML schema on OPNsense 26.x. The result: LAN IP moved to 192.168.200.1 but DHCP was still trying to hand out 192.168.1.100–199 leases → no clients could obtain an address on the new subnet, and the bootstrapping operator was kicked off their own LAN. Use OPNsense's own REST API instead. The existing `opnsense_config::modules::dnsmasq::DhcpConfigDnsMasq::set_dhcp_range` already does the right thing — it finds the dnsmasq range bound to `interface == "lan"` (or creates one), updates `start_addr`/`end_addr`, then asks OPNsense to reconfigure dnsmasq. Validation and dependent service restarts go through OPNsense's model classes, not our XPath guesses. Sequencing matters: the API endpoint lives on the firewall's current LAN IP, so the range update has to be hit before the LAN IP flip kills our HTTP connection. New flow in `OPNsenseBootstrapScore` step 5: 5a. set_lan_dhcp_range_via_api(...) ← OPNsense API on vanilla_ip:9443 5b. change_lan_ip_via_ssh(...) ← flips LAN IP, detached configctl `change_lan_ip_via_ssh` is simplified back to a single concern: PHP rewrites `interfaces.lan.ipaddr`/`subnet`, then a detached `configctl interface reconfigure lan` + service-restart chain applies the change. No more multi-backend XML guessing inside the PHP. The DHCP pool follows OPNsense's install-default convention `<net>.100` – `<net>.199` regardless of prefix length. Operators who want a different range can resize via the WebUI / API after bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 12:53:54 -04:00
Sylvain Tremblay	84e610ca60	style(opnsense): align pin-step log lines with the rest of harmony The pin-step lines I added in `9eeede18` invented two notations: [1/3] / [2/3] / [3/3] in pin_nic_names_step (a/c) / (b/c) / (c/c) in install_ethname_via_ssh Neither matches the established convention. OKDAddNodeScore and the existing OPNsenseBootstrapScore beats use plain prose verbs with no ordinal markers — "Logged in to ...", "Enabled SSH ...", "Moved web GUI port ...", "LAN rebind X -> Y", "Persisted OPNSenseApiCredentials + ...". Top-level Score code carries a [ScoreName/host] tag; low-level SSH helpers (e.g. change_lan_ip_via_ssh) log untagged short prose. Rewrite the six pin-step log lines to follow that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 11:22:42 -04:00
Sylvain Tremblay	9eeede18b8	feat(opnsense): pin physical NIC names to MAC addresses via vendored ethname Some checks failed Run Check Script / check (pull_request) Failing after 40s Details On multi-NIC FreeBSD/OPNsense boxes (Wize 5070 and similar), PCIe enumeration order shuffles igc0/igc1/... across reboots. OPNsense binds wan/lan assignments to interface names, so a shuffle silently re-points them at the wrong physical ports and breaks firewall rules. Validated fix from OPNsense forum #27023 (endorsed by franco): the upstream `ethname` rc.d script (MIT, © Eric Borisch 2016–2019, frozen at v2.0.1) does a two-stage rename in early boot — before `netif` — mapping MACs to fixed interface names. Vendor the 280-line script inline rather than `pkg install ethname`. `pkg install` on a fresh ISO often fails because the firmware lags the live pkg repo, and the firmware-upgrade reboot is precisely the boot we need to defend against. Vendoring sidesteps the chicken-and-egg. Adds: harmony/data/opnsense/ethname.sh vendored upstream script (verbatim) harmony/data/opnsense/ethname.LICENSE preserves MIT terms bootstrap.rs: ETHNAME_SCRIPT (const, include_str!) DEFAULT_PHYSICAL_DRIVER_PREFIXES (const) list_physical_nics_via_ssh / read_ethname_mac_set_via_ssh / install_ethname_via_ssh (pub SSH helpers) pin_nic_names module: pin_nic_names_step — the shared one-shot logic OPNsensePinNicNamesScore — Score<OPNsenseBootstrapTopology> for ad-hoc re-pinning / standalone use OPNsenseBootstrapScore composes pin_nic_names_step internally as a mandatory step between the web UI dance and API key mint — every firewall bootstrapped through harmony gets pinned NIC names automatically, no caller code change required. Idempotent: re-running on a firewall whose MAC set already matches /etc/rc.conf.d/ethname is a NOOP. The existence probe for the config file is wrapped in `sh -c '...'` because OPNsense's root login shell is /bin/csh (tcsh); bare Bourne if/then/else fails there. 
Simple `&&` chains (the pattern in the other SSH helpers) work in both shells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 10:49:59 -04:00
Sylvain Tremblay	d9c9ffc6fa	fix(opnsense-vm-integration): drop stray todo!("stop here") in print_setup Some checks failed Run Check Script / check (pull_request) Failing after 38s Details Leaked into commit `92717441` when an uncommitted debug breadcrumb in the user's working tree was staged alongside the post-review cleanup. `--setup` would panic at runtime before printing the closing newline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 07:48:44 -04:00
Sylvain Tremblay	92717441b6	refactor(opnsense): post-review cleanup — named timeouts, shared upgradestatus helper, doc fixes Some checks failed Run Check Script / check (pull_request) Failing after 46s Details - examples/opnsense_vm_integration: flip firmware_upgrade back to Disabled (the Score pipeline already runs OPNsenseFirmwareUpgradeScore explicitly, bootstrap-time upgrade was redundant); rewrite module docstring to match post-refactor behavior. - examples/opnsense_pair_integration: add TODO near abort_wizard noting the example should migrate to compose OPNsenseBootstrapScore. - harmony::modules::opnsense::firmware_upgrade: pull magic timeouts into named module-scope consts with one-line rationale; reuse the new shared check_firmware_task_done helper for upgradestatus polling. - opnsense-config: add check_firmware_task_done helper + name install_package's poll interval / max attempts; install_package now shares the helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 06:55:23 -04:00
Sylvain Tremblay	27f18d601a	refactor(opnsense): drop the wizard-abort call from OPNsenseBootstrapScore The call to `OPNsenseBootstrap::abort_wizard()` (POST /api/core/initial_setup/abort) failed with 403 Forbidden on every run: that endpoint requires a session-CSRF token under cookie auth and we don't fetch one before calling it (only `login()` extracts a token, and it's tied to the login form). The 403 was logged as WARN and silently ignored — and empirically the wizard flag doesn't block ANY of the following steps (SSH enable, web GUI port change, API key mint via SSH, LAN rebind, firmware upgrade). So the call was producing log noise for no observable benefit. Drop the call from the Score's interpret flow. The `OPNsenseBootstrap::abort_wizard()` helper stays in the library — a future caller that wants to do it properly (GET an authenticated page, extract its CSRF token, include it in the abort POST) can still use it. Only downside: a human operator who later opens the WebUI manually will see the OPNsense first-run wizard prompt once and have to dismiss it. Acceptable trade for clean automated bootstrap logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 08:14:35 -04:00
Sylvain Tremblay	5f34fd5d35	fix(opnsense): drop redundant summary block from Prompt-mode confirmation In FirmwareUpgradeMode::Prompt, the summary block was being printed twice — once via the `info!("{tag} Pending firmware …:\n{summary}")` line just above the mode-gating match, and again inside the inquire::Confirm prompt's header text. The prompt now asks only the yes/no question; the operator reads the summary from the info! log line one row above. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 08:09:55 -04:00
Sylvain Tremblay	51854e205c	feat(opnsense): OPNsensePackageInstallScore + linear pipeline in vm_integration Every other operational primitive in harmony has a Score wrapper between the low-level call and the user-facing composition layer. Package installation didn't — `Config::install_package` was being called naked from the integration example with a hand-rolled `match install_package() { Ok=>… Err=>compose-firmware-upgrade-Score-and-retry }` glue. That's exactly the "imperative orchestration in the caller" pattern harmony's CLAUDE.md tells us to push into Scores. This commit: - Adds `OPNsensePackageInstallScore { packages: Vec<String> }` in a new `harmony/src/modules/opnsense/package_install.rs`. The Interpret iterates packages, skips ones already installed via `is_package_installed`, calls `install_package` on the rest, surfaces newly-installed vs. already-present in `Outcome::success_with_details`. Idempotent on re-runs. - Adds the `OPNsensePackageInstall` variant to `InterpretName` + Display. - The Score deliberately has NO firmware-upgrade fallback baked in. If install fails because firmware is stale, `install_package`'s error message already points the operator at `OPNsenseFirmwareUpgradeScore`. Composition is the operator's job — same as every other Score pair relationship in harmony. - Rewrites `examples/opnsense_vm_integration::run_integration` to drop the ~40-line try/Err/retry block. The two new Scores (firmware upgrade + package install) are prepended to `build_all_scores`, so the pipeline becomes a linear vec: vec![ OPNsenseFirmwareUpgradeScore { mode: Auto, .. }, OPNsensePackageInstallScore { packages: vec!["os-haproxy"] }, webgui, lb, dhcp, … (existing config scores) ] Both `run_cli` invocations (run 1 and the idempotency run 2) exercise the new Scores. Both naturally NOOP on the second pass: upgrade because `firmware/status == "none"`, install because `is_package_installed("os-haproxy") == true`. 
Three unit tests in the new module cover Score name, serialization, and empty-package-list handling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 07:59:56 -04:00
Sylvain Tremblay	689ab8d21a	fix(opnsense): install_package polls upgradestatus + inspects log; example uses the Score via run_cli Two related changes the operator asked for, replacing the previous "firmware/running ready/busy" heuristic that felt unclean. 1. install_package: drop the `firmware/info` + `firmware/running` ready-idle polling. Restore the OPNsense-native pattern: poll `/api/core/firmware/upgradestatus` until `status == "done"` (same endpoint OPNsense's own WebUI uses for its install progress popup), then verify via `firmware/info` whether the package actually got installed. On failure, surface pkg's actual error from the `log` field of the upgradestatus response (last 8 non-empty lines) plus a "run OPNsenseFirmwareUpgradeScore first" hint. Tolerate transient upgradestatus errors as the 26.1.6 release notes document the endpoint as unstable; 120 × 3 s ceiling is the safety net. Now produces the clear, fast-fail message the operator remembers from before the branch, but with the actual pkg failure reason ("pkg: No packages available to install matching 'os-haproxy'", or whatever the underlying issue is) included. 2. opnsense_vm_integration: the post-install-failure fallback now composes `OPNsenseFirmwareUpgradeScore { mode: Auto }` into a `Vec<Box<dyn Score<OPNSenseFirewall>>>` and dispatches it via `harmony_cli::run_cli`, matching the way the rest of `run_integration` runs its Scores. Replaces the direct call to the bare `perform_firmware_upgrade()` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 07:42:58 -04:00
Sylvain Tremblay	a703442d8d	fix(opnsense): recognize "ready" as the idle value from firmware/running I added Signal B polling to install_package + wait_for_task_or_reboot checking for `status == ""` or `"none"` as the "configd idle" condition, but OPNsense's `configctl firmware running` script (core/scripts/firmware/running.sh) actually outputs `"ready"` when no firmware operation holds the lock and `"busy"` when one does. So Signal B never fired against a real OPNsense — the loop kept seeing `status: "ready"` (= idle) and treating it as "still running". For install_package this meant a doomed install still consumed the full 6-minute timeout. For wait_for_task_or_reboot it was masked by Signal A (version moved) almost always winning first, but the bug was the same. Recognize "ready" (case-insensitive) plus defensive "" / "none" as idle. Verified against the upstream script: if ${FLOCK} -n 9; then echo "ready" else echo "busy" fi Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 07:29:21 -04:00
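Reviewer note on a703442d8d: a minimal sketch of the idle check this commit describes, assuming the `status` string has already been pulled out of the `firmware/running` response. The function name `is_firmware_idle` is hypothetical, and trimming whitespace is an assumption not stated in the commit message.

```rust
/// Returns true when the `firmware/running` status means "no firmware
/// operation holds the lock".
/// "ready" is matched case-insensitively; "" and "none" are accepted as
/// defensive fallbacks, as described in the commit message.
/// Assumption: surrounding whitespace is trimmed first.
fn is_firmware_idle(status: &str) -> bool {
    let s = status.trim();
    s.is_empty() || s == "none" || s.eq_ignore_ascii_case("ready")
}
```

Under this sketch, `"busy"` (what the upstream script prints while the lock is held) is the only documented value that counts as still running.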
Sylvain Tremblay	d264c84c40	refactor(opnsense-vm-integration): use perform_firmware_upgrade in the install fallback The Err arm after the first `install_package("os-haproxy")` attempt used to POST `firmware/update` (= `pkg update`, repo-metadata refresh) and sleep 5s before retrying. That's a weaker, hand-rolled subset of what `OPNsenseFirmwareUpgradeScore` / `perform_firmware_upgrade` already does properly. Replace with a call to `perform_firmware_upgrade(..., FirmwareUpgradeMode::Auto, ...)`. That does the full canonical flow: firmware/check → firmware/status → firmware/update or upgrade → poll (with multi-signal completion + reboot tolerance) → verify the product_version moved. After it returns, the firewall is at the latest firmware AND its package index is current, so the retry of `install_package("os-haproxy")` finds the right packages and succeeds. This is what the operator asked for: "[on install failure] it should call the firmware update score." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 07:23:47 -04:00
Sylvain Tremblay	9eb36985ea	fix(opnsense-config): fast-fail install_package via firmware/running idle signal `install_package`'s poll loop only watched `firmware/info` for the positive "package installed" signal. When OPNsense's background install task failed silently (typical on stale repo metadata: pkg can't find os-haproxy that matches the firmware), the package never appeared in `firmware/info`, so the loop consumed its entire 6-minute ceiling before returning Err. The caller's fallback ("refresh metadata + retry") couldn't fire for 6+ minutes — looked like a hang. Add a second poll signal each iteration: `firmware/running` reports the name of the currently active configd task (empty when idle). When the install task vanishes (empty for 2 consecutive polls) AND the package still isn't in `firmware/info`, we know the install ended without succeeding. Fail fast with: "OPNsense install task for <pkg> ended without installing the package. The repository metadata is likely stale — try refreshing it via firmware/update, or run OPNsenseFirmwareUpgradeScore first, then retry." Typical failed install now detects within ~10s instead of 6min. The 120 × 3s ceiling stays as a safety net for "task running but never completes" pathologies. This restores the fast-fail behavior the OLD pre-refactor install_package had (via its bail-on-upgradestatus-404 path), with a proper, stable signal instead of relying on the documented-unstable endpoint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 07:23:29 -04:00
Sylvain Tremblay	2c658b8ce8	feat(opnsense): FirmwareUpgradeMode enum (Auto / AutoMinor / Prompt / Disabled) Replaces the boolean `OPNsenseBootstrapScore::upgrade_firmware` knob with a four-variant enum that decides per pending upgrade whether to apply it automatically, skip major-series upgrades, prompt the operator, or skip entirely. Also exposed via `OPNsenseFirmwareUpgradeScore::mode` for standalone composition. - `Auto` (default): apply every pending update and upgrade. - `AutoMinor`: apply in-series updates (status == "update"); skip major-series upgrades (status == "upgrade"). Uses OPNsense's own `status` field for the major/minor distinction — no version-string parsing. - `Prompt`: print a per-iteration summary and ask via inquire::Confirm. Errors out with `PromptRequiresTty` when run headless (no TTY) so CI contexts must pick `Auto` / `AutoMinor` / `Disabled` explicitly. - `Disabled`: skip the upgrade step entirely. The summary surfaced for Prompt (and logged in Auto/AutoMinor too) includes: - status_msg (OPNsense's "108 updates available, 349.2 MiB, reboot required" line) - whether the OPNsense product package itself is being upgraded ("Main OPNsense: 26.1 → 26.1.8") or whether the update only touches plugins/packages ("Main OPNsense: staying at <ver>") - kind (update vs upgrade) - reboot required (yes/no) Two new helpers — `extract_opnsense_version_change` and `render_upgrade_summary` — pull the version diff out of `status.all_packages` / `status.all_sets` (looking for the `opnsense` or `opnsense-update` entry) and assemble the human-readable block. Wired through: - `OPNsenseFirmwareUpgradeScore::mode` (default Auto). - `OPNsenseBootstrapScore::firmware_upgrade` (replaces `upgrade_firmware: bool`; same default behavior). - `examples/opnsense_vm_integration` opts out with `firmware_upgrade: FirmwareUpgradeMode::Disabled`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 07:00:57 -04:00
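Reviewer note on 2c658b8ce8: the per-mode gating described above can be sketched as a pure function over the mode and OPNsense's `status` field. The four variant names come from the commit message; the `Decision` enum and the `decide_upgrade` function are hypothetical, and the sketch simplifies by treating every status other than `"update"` / `"upgrade"` as "nothing to apply" (the real code may handle `"error"` differently). The TTY check behind `PromptRequiresTty` is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FirmwareUpgradeMode {
    Auto,
    AutoMinor,
    Prompt,
    Disabled,
}

/// Hypothetical result type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Apply,
    Skip,
    AskOperator,
}

/// Decide what to do with one pending firmware action.
/// `status` is OPNsense's own field: "update" = in-series, "upgrade" = major series.
fn decide_upgrade(mode: FirmwareUpgradeMode, status: &str) -> Decision {
    let pending = matches!(status, "update" | "upgrade");
    match mode {
        // Disabled skips the step regardless of what is pending.
        FirmwareUpgradeMode::Disabled => Decision::Skip,
        // Nothing actionable reported (simplification, see lead-in).
        _ if !pending => Decision::Skip,
        FirmwareUpgradeMode::Auto => Decision::Apply,
        // AutoMinor applies in-series updates only; no version-string parsing.
        FirmwareUpgradeMode::AutoMinor => {
            if status == "update" {
                Decision::Apply
            } else {
                Decision::Skip
            }
        }
        FirmwareUpgradeMode::Prompt => Decision::AskOperator,
    }
}
```

Keeping the decision separate from the prompt and the API calls makes the mode table easy to unit-test without a firewall or a terminal.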
Sylvain Tremblay	6115eb1a23	fix(opnsense): trust the reboot as definitive task completion When a firmware update has status_reboot=1, the reboot IS the final step of the install — once it completes, the task is done by definition. But my multi-signal polling loop kept trying to verify completion via signals A/B/C after the reboot, and all three are unreliable post-reboot: A. firmware/info `product_version` doesn't change if the update was package-only (no version bump). B. firmware/running keeps reporting the previous task as active until the next firmware/check kicks it — the operator observed "clicking 'check for updates' in the UI unstuck it", confirming OPNsense retains stale task state until a fresh check resets it. C. firmware/upgradestatus 404s (documented unstable on 26.1) when no task is registered, which is the state after a real upgrade. Net: in iteration 2 of a real upgrade run (26.1.8 → 26.1.x with 2 more packages), the wait loop was stuck silently polling for several minutes after the firewall had already rebooted and was fully operational. Now: when the TCP probe detects unreachable and wait_for_reboot_cycle returns, immediately return TaskOutcome { rebooted: true } instead of re-entering the polling loop. The outer perform_firmware_upgrade loop already calls firmware/check at the top of the next iteration (which both refreshes OPNsense's task state AND tells us if more updates are pending) and reads firmware/info after wait_for_task_or_reboot returns to verify the version moved. Those are the real post-reboot completion signals — the in-loop polling was redundant and harmful. The non-reboot path (status_reboot=0 updates, e.g. pure metadata refresh) is unchanged: signals A/B/C still run because there's no reboot to use as a definitive completion event. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:44:47 -04:00
Sylvain Tremblay	1b718ef6c8	refactor(opnsense-api): truncate HTTP response body in WARN logs `OpnsenseClient::handle_response_typed` logged the entire response body in its WARN line on non-success status codes. OPNsense's 404 page is ~12 lines of HTML; every transient 404 (and the firmware/upgradestatus endpoint 404s constantly on 26.1, per release notes calling it "known to be unstable") dumped a multi-line block into the log. Route the body through a new truncate_for_log helper that keeps the first non-empty line, caps the result at 200 chars, and appends an ellipsis if anything was elided. JSON error responses (typically one short line) stay intact; HTML pages collapse to "<!DOCTYPE html>…". The `Error::Api { body, .. }` value passed to callers is unchanged, so code that wants to inspect the full body still can. Three unit tests cover: short-line passthrough, HTML collapsing, length-capped ellipsis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:44:47 -04:00
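Reviewer note on 1b718ef6c8: a sketch of a helper with the behaviour the commit message describes (first non-empty line, 200-character cap, ellipsis when anything was elided). The helper name and the 200-character limit come from the commit message; the signature, the constant name, and the choice to trim each line are assumptions, since the actual implementation is not shown on this page.

```rust
/// Cap taken from the commit message; the constant name is hypothetical.
const MAX_LOG_CHARS: usize = 200;

/// Reduce an HTTP response body to a single short line for a WARN log.
/// Assumed signature; the real helper may differ.
fn truncate_for_log(body: &str) -> String {
    // First line that is not blank once trimmed (empty string if none).
    let first = body
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty())
        .unwrap_or("");
    // Cap by characters rather than bytes so a multi-byte character is never split.
    let capped: String = first.chars().take(MAX_LOG_CHARS).collect();
    // Something was elided if the first line was cut short or more content followed it.
    let elided = capped.len() < first.len() || body.trim() != first;
    if elided {
        format!("{}…", capped)
    } else {
        capped
    }
}
```

A one-line JSON error passes through unchanged, while a multi-line HTML 404 page collapses to `<!DOCTYPE html>…`.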
Sylvain Tremblay	c4d46f1817	fix(opnsense): multi-signal completion detection in firmware upgrade poll loop The previous wait loop polled only `firmware/upgradestatus` for `status == "done"`, with a TCP-probe fallback to detect reboots. Two ways this got stuck against real OPNsense 26.1: 1. After the upgrade reboot completed, OPNsense had no active background task to track, so `upgradestatus` 404'd indefinitely. 2. The TCP-probe fallback could miss the brief unreachable window between two 5s-apart polls — if both polls saw the API up, we never set `rebooted=true` and never bailed. Net: the upgrade ran fine (26.1 → 26.1.8 applied), but our code waited forever for a "done" signal that never came. Now the wait loop polls THREE signals per iteration and exits on any: A. firmware/info `product_version` differs from version_before_action B. firmware/running `status` empty for 2 consecutive polls (configd reports no active task) C. firmware/upgradestatus `status == "done"` (when the endpoint works) Plus the TCP probe still detects mid-task reboots and waits for the firewall to come back — but it's no longer the sole exit path. After a reboot, signal A almost always wins on the first post-recovery poll. perform_firmware_upgrade now snapshots the version before each action and passes it as `version_before_action` so signal A has a baseline that's valid even after a previous iteration already bumped the version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:12:26 -04:00
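The three-signal exit condition from this commit can be modeled as a small state machine that is fed one sample per poll. This is a sketch under stated assumptions: `PollSample`, `CompletionDetector`, and `CompletionSignal` are hypothetical names, `None` stands for a request that failed (for example a 404), and resetting the idle streak on a failed `firmware/running` request is a guess, not something the commit states.

```rust
/// Which signal ended the wait (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
enum CompletionSignal {
    VersionChanged,    // A: firmware/info product_version differs
    RunningIdle,       // B: firmware/running status empty twice in a row
    UpgradeStatusDone, // C: firmware/upgradestatus status == "done"
}

/// One poll's worth of observations; `None` means the request failed.
#[derive(Default)]
struct PollSample<'a> {
    product_version: Option<&'a str>,
    running_status: Option<&'a str>,
    upgrade_status: Option<&'a str>,
}

struct CompletionDetector {
    version_before_action: String,
    idle_polls: u32,
}

impl CompletionDetector {
    fn new(version_before_action: &str) -> Self {
        Self { version_before_action: version_before_action.to_string(), idle_polls: 0 }
    }

    /// Returns the first signal that fires for this sample, if any.
    fn observe(&mut self, sample: &PollSample<'_>) -> Option<CompletionSignal> {
        // Update signal B's streak on every poll, even if A or C fires.
        match sample.running_status {
            Some(status) if status.is_empty() => self.idle_polls += 1,
            _ => self.idle_polls = 0,
        }
        let baseline = self.version_before_action.as_str();
        if matches!(sample.product_version, Some(v) if v != baseline) {
            return Some(CompletionSignal::VersionChanged);
        }
        if self.idle_polls >= 2 {
            return Some(CompletionSignal::RunningIdle);
        }
        if sample.upgrade_status == Some("done") {
            return Some(CompletionSignal::UpgradeStatusDone);
        }
        None
    }
}
```

Keeping the decision pure (no HTTP inside `observe`) makes the exit logic table-testable independently of the polling loop and the TCP probe.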
Sylvain Tremblay	89f7455399	fix(opnsense): rewrite perform_firmware_upgrade per OPNsense's actual async API
The previous version did `POST firmware/check`, slept 5s, read `firmware/status`, and used a fragile "status_msg contains 'up to date'" heuristic to decide whether to upgrade. Two related bugs:
1. `firmware/check` is async — it returns immediately with a msg_uuid and runs in the background. 5s is far less than the metadata refresh takes; on a fresh boot the status still reads "requires to check for update first to provide more information" when we look.
2. That message doesn't match any of my "up to date" keywords, so my pending-check returned true and triggered `firmware/upgrade` against a system that had no actionable upgrade plan. firmware/upgrade returned immediately (also async), status stayed "none", and the helper reported success without anything having happened.
Rewrite per OPNsense's actual API (verified against FirmwareController.php and firmware.volt):
1. GET firmware/info → capture initial product_version
2. Loop ≤ 5 iterations (a kernel upgrade can unlock further package updates that need their own pass):
   a. POST firmware/check (async)
   b. Poll firmware/upgradestatus until status == "done"
   c. POST firmware/status → read `status` enum ("none"/"update"/"upgrade"/"error")
   d. If "none": done. (First iteration → NOOP. Later → success.)
   e. If "update" / "upgrade": POST firmware/{that}, poll upgradestatus until done, handle reboot (auto-reboot or explicit firmware/reboot if status_reboot == "1"), then GET firmware/info to verify product_version changed.
3. Return UpgradeOutcome with initial_version, final_version, iterations, rebooted flag.
The upgrade Score's Outcome now reports the actual version transition ("Firmware upgraded: 25.7.4 → 26.1.6 in 2 iteration(s) (rebooted: true)"). Mid-upgrade reboots are detected by `upgradestatus` going unreachable + a TCP probe confirming the API is down (vs. just a 404 from the documented-unstable endpoint).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:59:29 -04:00
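The control flow of the rewritten loop (capture version, check, act on the `status` enum, repeat up to 5 times) can be sketched against a trait instead of a real HTTP client. Everything here is illustrative: the `FirmwareApi` trait, its method names, and `ScriptedFirewall` are hypothetical, steps a to c are collapsed into one call, and returning an error when updates are still pending after 5 passes is an assumption the commit does not spell out.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FirmwareStatus { None, Update, Upgrade, Error }

/// Hypothetical abstraction over the REST calls named in the commit.
trait FirmwareApi {
    fn product_version(&mut self) -> String;             // GET firmware/info
    fn check_and_read_status(&mut self) -> FirmwareStatus; // check, wait, status
    fn apply(&mut self, action: FirmwareStatus) -> bool;  // returns "rebooted"
}

#[derive(Debug, PartialEq, Eq)]
struct UpgradeOutcome {
    initial_version: String,
    final_version: String,
    iterations: u32,
    rebooted: bool,
}

const MAX_ITERATIONS: u32 = 5;

fn perform_firmware_upgrade(api: &mut impl FirmwareApi) -> Result<UpgradeOutcome, String> {
    let initial_version = api.product_version();
    let mut rebooted = false;
    let mut iterations = 0;
    while iterations < MAX_ITERATIONS {
        match api.check_and_read_status() {
            // Nothing pending: NOOP on the first pass, success afterwards.
            FirmwareStatus::None => {
                let final_version = api.product_version();
                return Ok(UpgradeOutcome { initial_version, final_version, iterations, rebooted });
            }
            FirmwareStatus::Error => return Err("firmware/status reported an error".to_string()),
            action => {
                rebooted |= api.apply(action);
                iterations += 1;
            }
        }
    }
    Err(format!("updates still pending after {MAX_ITERATIONS} iterations"))
}

/// Test double: replays a scripted list of statuses and pretends that
/// only a full upgrade reboots the box (an assumption of this fake).
struct ScriptedFirewall { statuses: VecDeque<FirmwareStatus>, applied: u32 }

impl FirmwareApi for ScriptedFirewall {
    fn product_version(&mut self) -> String { format!("26.1.{}", self.applied) }
    fn check_and_read_status(&mut self) -> FirmwareStatus {
        self.statuses.pop_front().unwrap_or(FirmwareStatus::None)
    }
    fn apply(&mut self, action: FirmwareStatus) -> bool {
        self.applied += 1;
        action == FirmwareStatus::Upgrade
    }
}
```

Feeding the fake the script `[Upgrade, Update, None]` yields an outcome with two iterations and `rebooted: true`, which is the shape of the "in 2 iteration(s)" report quoted in the commit.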
Sylvain Tremblay	9e0224264c	feat(opnsense): OPNsenseFirmwareUpgradeScore + bake into bootstrap by default
Adds a `Score<OPNSenseFirewall>` that brings an OPNsense firewall to the latest available firmware/package level via the REST API:
POST firmware/check → refresh upstream metadata
GET firmware/status → check what's actionable
POST firmware/upgrade → trigger if anything is pending
poll firmware/status → wait for status to return to "none"
POST firmware/reboot → if status_reboot == "1" and we haven't already auto-rebooted mid-upgrade + wait unreachable → reachable → 30s settle
The core logic lives in `perform_firmware_upgrade()` so it can be called from elsewhere. `OPNsenseBootstrapScore` now exposes `upgrade_firmware: bool` (default `true`) and calls the same helper after credentials are persisted, before any optional LAN rebind. The firewall thus ends bootstrap on its latest firmware, exactly the right beat operationally: no production traffic yet, operator already babysitting, all subsequent Scores run against current code.
Why not extend `OPNSenseLaunchUpgrade` (the existing SSH-based Score)? It calls a shell script (`opnsense-update.sh`), has a `todo!()` Serialize impl, no idempotency check, and holds an `Arc<Config>` directly instead of reading from a topology. The new score uses the REST API end-to-end, idempotency-checks via `status_upgrade_action`, and slots cleanly into normal Score<T> composition. `OPNSenseLaunchUpgrade` stays alongside it for now; affilium2 keeps working unchanged. We can deprecate the SSH one in a follow-up once the API one has flown against real firewalls.
`opnsense_vm_integration` explicitly opts out of the bootstrap-time upgrade (`upgrade_firmware: false`) — the VM image is a known firmware version, and we don't want each integration run to spend 10+ minutes pulling firmware updates.
New `InterpretName::OPNsenseFirmwareUpgrade` variant. Unit tests cover score name, default api_port (9443), and serialization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:33:23 -04:00
Sylvain Tremblay	a0401ca6c4	refactor(opnsense-vm-integration): collapse firmware-update fallback now that install_package is resilient After `fix(opnsense-config): poll firmware/info, not the unstable upgradestatus, in install_package`, the library's install_package call handles its own polling correctly and tolerates transient API errors. So the example's fallback no longer needs to track firmware/status, issue an explicit reboot, or wait for an unreachable→reachable cycle — all of that logic was duplicating what should be (and now is) the library's responsibility. Collapse the ~100-line Err arm to ~15 lines: when the first install_package attempt fails, kick `firmware/update` (== `pkg update`, refresh repository metadata), sleep 5s, and retry. The original failure mode (first install fails because a freshly bootstrapped firewall has no pkg metadata yet) is what this fallback exists to address; nothing more. Net deletion of ~85 lines from the example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:12:39 -04:00
Sylvain Tremblay	27a4492e9a	fix(opnsense-config): poll firmware/info, not the unstable upgradestatus, in install_package The legacy polling loop in `Config::install_package` hit `/api/core/firmware/upgradestatus` until `status == "done"`. That endpoint is marked "known to be unstable" in OPNsense 26.1.6 release notes (the WebUI itself traps its generic error popup) and reliably 404s on a freshly bootstrapped 26.1 system. The loop's error handling used `.map_err(Error::Api)?` so a single 404 short-circuited the whole install — even when the underlying install_package POST succeeded. Switch to polling `/api/core/firmware/info` and looking for the package in the response with `installed == "1"`. That's the same check the existing code did AFTER the loop; moving it INTO the loop removes the dependency on `upgradestatus` entirely. Transient errors from the firmware/info call are now logged at debug! and tolerated as "keep polling" (the API may briefly be unreachable if a package install triggers a reboot — extremely rare for plugins, but defensible to handle). The unused `UpgradeStatus` struct is dropped along with the legacy loop. Behavior on success is identical (returns Ok, same info! log). On timeout the error message is more descriptive (`"did not appear as installed within 360 seconds"`) than the previous `"installation did not complete successfully"` which was actually printed for both "polling timed out" and "the package isn't in firmware/info" cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:11:32 -04:00
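The key behavioral change in this commit, treating a failed probe as "keep polling" rather than aborting with `?`, can be shown with a generic loop. This is a minimal synchronous sketch, not the library's async code: `wait_until_installed` and its closure parameter are hypothetical, the probe stands in for "fetch `firmware/info` and look for the package with `installed == "1"`", and the budget is counted in polls rather than the 360 seconds the commit mentions.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `is_installed` until it reports true or the budget runs out.
/// Returns the number of polls it took on success.
fn wait_until_installed<F>(
    mut is_installed: F,
    max_polls: u32,
    interval: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> Result<bool, String>,
{
    for poll in 1..=max_polls {
        match is_installed() {
            Ok(true) => return Ok(poll),
            Ok(false) => {} // not there yet, keep going
            // A transient API error is logged and tolerated instead of
            // short-circuiting the whole install.
            Err(e) => eprintln!("transient error on poll {poll}, still polling: {e}"),
        }
        if poll < max_polls {
            sleep(interval);
        }
    }
    Err(format!("package did not appear as installed within {max_polls} polls"))
}
```

A probe that fails once, reports "not installed" once, and then succeeds returns `Ok(3)` here, whereas the legacy loop described above would have failed on the first error.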
Sylvain Tremblay	3f719f7bcd	fix(opnsense-vm-integration): use firmware/status + explicit reboot, not unstable upgradestatus The firmware-update fallback used to poll `/api/core/firmware/upgradestatus` for a "done" signal, then wait_for_https + a 10s sleep before retrying the package install. That endpoint is documented as "known to be unstable" in OPNsense 26.1.6 release notes (the WebUI itself traps its generic error popup), so the polling loop never breaks out via the success path — it just times out. wait_for_https then succeeds during the brief window before OPNsense actually starts rebooting, and the install retry gets killed mid-reboot with a `reqwest::Request` timeout. Switch to `/api/core/firmware/status`, which is the stable endpoint and returns a definitive `status_reboot` field ('1' if a reboot is required after the in-progress update/upgrade, computed from `needs_reboot` and `upgrade_needs_reboot` per FirmwareController.php). Poll until the update finishes (status == "none") or the API becomes unreachable (auto-reboot during update), then read `status_reboot` and trigger an explicit `POST /api/core/firmware/reboot` if needed. The wait-for-unreachable window after the reboot is then a tight 60s — we know the reboot just happened. No more blind multi-minute timeouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:06:44 -04:00
Sylvain Tremblay	9a29a48968	fix(opnsense-vm-integration): default HARMONY_SECRET_STORE to file The previous refactor only defaulted HARMONY_SECRET_NAMESPACE; running --full / --boot then panicked because `init_secret_manager` falls back to the Infisical backend when HARMONY_SECRET_STORE is unset (see harmony_secret::lib:82), and that requires HARMONY_SECRET_INFISICAL_URL. Default HARMONY_SECRET_STORE to "file" the same way so `cargo run -p opnsense-vm-integration -- --full` works out of the box without sourcing an env.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:42:33 -04:00
Sylvain Tremblay	baf15d587e	refactor(opnsense-vm-integration): compose OPNsenseBootstrapScore instead of inline dance Drop the procedural login → abort wizard → SSH → port → API-key sequence in `boot_vm` and `run_integration`, and replace the bootstrap leg with a single `harmony_cli::run_cli` invocation of `OPNsenseBootstrapScore` against `OPNsenseBootstrapTopology`. The diagnose_via_ssh fallback and the SSH-22 polling loop go away too — both are covered by the Score's own idempotency probe and the per-step error messages the Score emits. Credentials now round-trip through `SecretManager` rather than through local variables: the Score persists `OPNSenseApiCredentials` + `OPNSenseFirewallCredentials` from `--boot` / `--full`, and `run_integration` reads them back when constructing the production `OPNSenseFirewall` topology and the typed `OpnsenseClient` used by the verification step. `SecretManager` panics on a missing `HARMONY_SECRET_NAMESPACE`, so main() sets a binary-specific default if the operator hasn't already exported one. `harmony_secret` is added as a direct dependency. No behavior change for `--check` / `--download` / `--clean` / `--status`. `--boot` and `--full` now emit `[OPNsenseBootstrap/192.168.1.1]`-prefixed log lines from the Score's Interpret. Subsequent `--boot` runs against an already-bootstrapped VM NOOP through the idempotency check instead of re-running the dance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:36:07 -04:00
Sylvain Tremblay	4693930e63	feat(opnsense): OPNsenseBootstrapScore + OPNsenseBootstrapTopology Solve the OPNsense bootstrap chicken-and-egg problem with a Score-shaped abstraction. Until now, every binary deploying onto a fresh OPNsense had to copy ~80 lines of procedural orchestration (login → abort wizard → SSH → port move → API key mint → LAN flip) into its own main.rs, because the bootstrap creates the very credentials that `OPNSenseFirewall` needs to construct. The trick: a separate, minimal `OPNsenseBootstrapTopology` that holds only {vanilla_ip, default_username, default_password}. The new `Score<OPNsenseBootstrapTopology>` runs the dance from `Interpret::execute`, persists `OPNSenseApiCredentials` and `OPNSenseFirewallCredentials` to `SecretManager`, and optionally rebinds the LAN. The calling binary then builds a normal `OPNSenseFirewall` from the now-stored credentials and runs `Score<OPNSenseFirewall>` composition against it — two Maestro<T> phases in sequence, SecretManager as the bridge. Idempotency is handled by a 4-boolean decision matrix (api_creds_exist, ssh_creds_exist, vanilla_reachable, target_reachable) extracted into a pure helper and table-tested. The Score is safe to re-run: NOOP when already bootstrapped, DANCE on first-run or partial resume, FAILURE with clear recovery instructions when target is up but secrets are lost (factory-reset and re-run). Output follows the precedent of `OKDAddNodeScore`: - `[OPNsenseBootstrap/{vanilla_ip}]`-prefixed log lines, one info! per state change - Runbook-shaped Outcome::success_with_details listing where the firewall now lives, where credentials were stored, and the manual reconnect step if a LAN rebind happened - Multi-sentence InterpretError messages including the recovery path Includes a new `OPNsenseBootstrap` variant on `InterpretName`. Unit tests cover Score name, serialization, the full idempotency decision matrix, and `ensure_ready` failure when the firewall is unreachable. 
Scope: abstraction-only. Example main.rs files keep their current procedural shape; refactoring them to compose the new Score is a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:30:02 -04:00
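A pure decision helper over the four booleans might look like the sketch below. The commit names the inputs and the three outcomes (NOOP, DANCE, FAILURE) but not the full table, so the exact mapping here is a guess: in particular, treating "vanilla address reachable" as always meaning DANCE, and the wording of the failure messages, are assumptions. `BootstrapProbe`, `Decision`, and `decide` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
struct BootstrapProbe {
    api_creds_exist: bool,
    ssh_creds_exist: bool,
    vanilla_reachable: bool,
    target_reachable: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Noop,
    Dance,
    Failure(&'static str),
}

/// Pure function: no I/O, so every row of the matrix can be table-tested.
fn decide(p: BootstrapProbe) -> Decision {
    let creds_complete = p.api_creds_exist && p.ssh_creds_exist;
    match (creds_complete, p.vanilla_reachable, p.target_reachable) {
        // Already bootstrapped: secrets stored and the target answers.
        (true, _, true) => Decision::Noop,
        // First run or partial resume: the factory address still answers.
        (_, true, _) => Decision::Dance,
        // Target is up but the stored secrets are gone.
        (false, false, true) => {
            Decision::Failure("target is up but secrets are lost: factory-reset and re-run")
        }
        // Nothing answers at either address.
        (_, false, false) => {
            Decision::Failure("firewall unreachable at both the vanilla and target addresses")
        }
    }
}
```

Because the `match` is over a tuple of booleans, the compiler checks that all combinations are covered, which is one reason to extract the matrix into a helper like this.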
Sylvain Tremblay	38f16b8d46	refactor(opnsense): promote SSH bootstrap helpers to harmony::modules::opnsense::bootstrap `create_api_key_ssh` and `change_lan_ip_via_ssh` were defined identically in both `opnsense_vm_integration` and `opnsense_pair_integration` example main.rs files. Lift them into `harmony::modules::opnsense::bootstrap` as `pub` free functions so future callers (including a forthcoming `OPNsenseBootstrapScore`) reuse a single canonical implementation. Also add `probe_https`, a one-shot reachability probe with a short timeout, which the bootstrap Score will use for its idempotency check. Behavior in the two examples is unchanged — they pass `"root"`/`"opnsense"` at their call sites, matching the hard-coded values the deleted local helpers used. Username/password are now parameters (validated against PHP-injection-prone characters), and `new_ip` in `change_lan_ip_via_ssh` is strict-parsed as `IpAddr` before interpolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:29:45 -04:00
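The input hardening mentioned at the end of this commit can be illustrated with the standard library alone. The IP check uses `std::net::IpAddr`'s `FromStr`, which rejects anything that is not exactly an address. The credential check is a guess at what "PHP-injection-prone characters" means: the `FORBIDDEN` list, the rejection of control characters and empty strings, and both function names are hypothetical.

```rust
use std::net::IpAddr;

/// Strict-parse the address before it is interpolated into a command.
/// `IpAddr::from_str` does not trim or accept trailing garbage.
fn parse_new_ip(raw: &str) -> Result<IpAddr, String> {
    raw.parse::<IpAddr>()
        .map_err(|e| format!("refusing to interpolate {raw:?}: {e}"))
}

/// Assumed deny-list of characters that could break out of a quoted
/// PHP or shell string; the PR's actual list is not shown in the commit.
const FORBIDDEN: &[char] = &['\'', '"', '\\', '$', '`', ';'];

fn validate_credential(value: &str) -> Result<&str, String> {
    match value.chars().find(|c| FORBIDDEN.contains(c) || c.is_control()) {
        Some(bad) => Err(format!("credential contains forbidden character {bad:?}")),
        None if value.is_empty() => Err("credential must not be empty".to_string()),
        None => Ok(value),
    }
}
```

The defaults the examples pass (`"root"` and `"opnsense"`) go through unchanged, while a value such as `10.0.0.1'; system('id'); //` fails the IP parse before it can reach the remote command.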

38 Commits