`tokio::fs::DirEntry::file_type()` does not follow symlinks, so the
existing is_file()/is_dir() branches both returned false for them and
the loop silently skipped every symlink.
Caller-visible consequence: uploading
./data/okd/installer_image/ (three versioned SCOS files + three
stable-name symlinks pointing at them) ended up with only the
versioned files on the firewall under /usr/local/http/scos/. The
byMAC iPXE files chainload via the stable names
(scos-live-kernel.x86_64 etc.), so PXE boots dangled on a 404 until
an operator created the symlinks by hand.
Adds a third is_symlink() branch that reads the link target with
tokio::fs::read_link and recreates it remotely via `ln -sfn` over
SSH. `ln` rather than SFTP SSH_FXP_SYMLINK because OpenSSH's server
inverts the (path, target) argument order versus the spec — `ln` is
unambiguous across shells (including the firewall's tcsh).
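
A minimal sketch of the three-way branch and the remote command, written
against synchronous `std::fs` types so it compiles without dependencies
(the real loop uses `tokio::fs`). `EntryKind`, `classify`, `shell_quote`
and `ln_command` are hypothetical names, not the ones in `upload_folder`,
and the single-quote escaping is an assumption about what the remote
shell needs.

```rust
use std::fs::FileType;

/// What the upload loop should do with one directory entry.
#[derive(Debug, PartialEq)]
enum EntryKind {
    File,
    Dir,
    Symlink,
    Other,
}

/// A `FileType` obtained without following links describes the entry
/// itself, so a symlink reports neither `is_file()` nor `is_dir()`.
fn classify(ft: &FileType) -> EntryKind {
    if ft.is_symlink() {
        EntryKind::Symlink
    } else if ft.is_file() {
        EntryKind::File
    } else if ft.is_dir() {
        EntryKind::Dir
    } else {
        EntryKind::Other
    }
}

/// Wrap a string in single quotes, escaping embedded single quotes.
/// (Assumption: this quoting style is acceptable to the remote shell.)
fn shell_quote(s: &str) -> String {
    format!("'{}'", s.replace('\'', "'\\''"))
}

/// Build the remote command that recreates a symlink: target first,
/// link path second, as `ln` expects.
fn ln_command(target: &str, link_path: &str) -> String {
    format!("ln -sfn {} {}", shell_quote(target), shell_quote(link_path))
}
```

The link target would come from `read_link` on the local entry, and the
resulting string is what gets executed over SSH.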
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames OKDReapplyDhcpBindingsScore to OKDReapplyFromInventoryScore
(file: reapply_from_inventory.rs) to reflect the broader scope. Per
selected role, the Score now:
1. Re-writes dnsmasq Host entries via DhcpHostBindingScore (existing
behavior).
2. Re-creates byMAC/01-<mac>.ipxe boot files via IPxeMacBootFileScore,
rendering BootstrapIpxeTpl with the role-appropriate ignition file
(bootstrap.ign / master.ign / worker.ign).
Pulls installation_device + MAC from each host's PhysicalHost +
HostConfig row in the inventory DB; errors loudly if either is missing
rather than silently producing a half-written byMAC tree.
Same constructors (interactive / for_roles / all_roles) and same
inquire multi-select UX, with the prompt wording broadened from "DHCP
bindings" to "firewall config".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recovery Score for the case where the OPNsense firewall has been
reinstalled but the harmony inventory database still holds the
discovered physical hosts. Looks up DB hosts per role, zips them with
HAClusterTopology slots, runs DhcpHostBindingScore to re-create the
dnsmasq Host entries (DHCP reservation + A record) without doing
network discovery, iPXE, or reboot work.
Interactive via inquire multi-select (default) or explicit role list
via for_roles() / all_roles().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous `FramedRead<File, BytesCodec>` loop in `upload_folder`
defaulted to ~8 KB reads, each turning into its own SFTP WRITE
round-trip. Replace it with an explicit 256 KB chunked read+write_all
loop. Same correctness, fewer awaits, fewer protocol packets per file.
Drops the unused `tokio_stream` and `tokio_util::codec` imports.
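
For illustration, a synchronous analogue of the chunked loop using only
`std::io`. The real code is async and writes to an SFTP file handle;
`copy_chunked` and `CHUNK_SIZE` are hypothetical names used for this
sketch only.

```rust
use std::io::{self, Read, Write};

/// Chunk size from the commit message: 256 KB per read/write pair.
const CHUNK_SIZE: usize = 256 * 1024;

/// Copy everything from `reader` to `writer` in fixed-size chunks and
/// return the number of bytes copied.
fn copy_chunked<R: Read, W: Write>(reader: &mut R, writer: &mut W) -> io::Result<u64> {
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // write_all retries short writes until the whole chunk is out.
        writer.write_all(&buf[..n])?;
        total += n as u64;
    }
    writer.flush()?;
    Ok(total)
}
```

With a 256 KB buffer, a 600 KB file needs three read/write pairs, where
~8 KB reads would need roughly 75.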
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LAN rebind step moves the firewall to a new subnet, severing the dev
machine's connection. The terminal reboot step that immediately followed
would then time out trying to POST `core/firmware/reboot` from a machine
no longer on the firewall's subnet.
Insert a blocking `inquire::Confirm` prompt between step 5 and step 6
(only when `target_lan` is `Some(_)`) that:
- tells the operator the new firewall address and prefix,
- prompts them to renew DHCP or set a static IP in the new subnet,
- waits for explicit confirmation before continuing.
Declining returns `InterpretError` so the bootstrap fails loudly in a
clearly-recoverable mid-state — re-running after reconnection picks up
at the reboot step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append a terminal step to `OPNsenseBootstrapScore.execute()`: POST
`core/firmware/reboot` against the firewall's final address, then wait for
the API to go unreachable, come back, and settle. Hard-fail if the firewall
does not reappear at `final_ip:target_api_port`.
The dance has touched firmware, the optional LAN bridge, the DHCP pool, and
the LAN IP itself — a clean reboot guarantees the running kernel/config
matches what was persisted, and the post-reboot probe makes reachability
a contract the rest of the harmony pipeline can rely on instead of a
best-effort warn-only check.
Reuses the existing `wait_for_reboot_cycle` waiter from `firmware_upgrade.rs`
(promoted to `pub`) and the same `core/firmware/reboot` POST shape that
`perform_firmware_upgrade` uses mid-pipeline. Idempotency is unchanged:
`decide()` still short-circuits to NOOP when creds exist + target is
reachable + vanilla is gone, so a re-run does not trigger a second reboot.
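
A rough sketch of the "unreachable, then reachable, then settle" wait,
with the probe and the sleep injected as closures so it can be exercised
without a firewall. This is not the signature of `wait_for_reboot_cycle`;
`reboot_cycle_sketch`, `RebootWait` and the per-phase poll budget are
assumptions made for the example.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum RebootWait {
    /// The API went down and came back.
    Completed,
    /// The API never became unreachable within the budget.
    NeverWentDown,
    /// The API went down but did not reappear within the budget.
    NeverCameBack,
}

fn reboot_cycle_sketch(
    mut probe: impl FnMut() -> bool,
    mut sleep: impl FnMut(Duration),
    max_polls_per_phase: usize,
    poll_interval: Duration,
    settle: Duration,
) -> RebootWait {
    // Phase 1: wait until the API stops answering.
    let mut went_down = false;
    for _ in 0..max_polls_per_phase {
        if !probe() {
            went_down = true;
            break;
        }
        sleep(poll_interval);
    }
    if !went_down {
        return RebootWait::NeverWentDown;
    }
    // Phase 2: wait until it answers again, then let services settle.
    for _ in 0..max_polls_per_phase {
        sleep(poll_interval);
        if probe() {
            sleep(settle);
            return RebootWait::Completed;
        }
    }
    RebootWait::NeverCameBack
}
```

A caller would treat anything other than `Completed` as the hard failure
described above.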
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the LAN-bridge step at position #12 in the score pipeline (single-member
bridge on `vtnet0`), plus end-of-test reachability assertions for HTTPS at
9443 and SSH at 22 — both must succeed before the test reports PASS. The
SSH check guards specifically against the regression where sshd stays bound
to the old (now IP-less) LAN device after the bridge step and the next run
times out trying to reconnect.
State snapshot now captures `net.link.bridge.*` sysctl values and asserts
each expected key (`inherit_mac`, `pfil_member`, `pfil_bridge`,
`pfil_local_phys`) has the expected value, rather than just counting
≥4 entries (the namespace isn't owned exclusively by the Score).
Verified PASSED end-to-end on three sequential runs (clean run, idempotent
re-run, post-reboot probe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the standalone `OPNsenseLanBridgeScore` (Score<OPNSenseFirewall>) plus a
new `lan_bridge: Option<LanBridgeParams>` field on `OPNsenseBootstrapScore`.
Both call the shared `ensure_lan_bridge_step` so behaviour stays in lockstep.
`LanBridgeParams::members` takes **physical** NIC names (e.g. `["igc0",
"igc2","igc3","igc4"]`). The Score resolves them to logical interfaces,
auto-promoting unmapped NICs to fresh `optN` slots before bridging. WAN's
NIC is rejected with a hard error. `members: None` triggers an interactive
`MultiSelect` annotated with each NIC's current logical assignment.
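
The member resolution can be pictured roughly as below. This is a sketch
under assumptions: the map from physical NIC to logical interface, the
"lowest unused `optN`" allocation rule, and the name
`resolve_bridge_members` are all hypothetical, and the `reassign_lan`
handling is left out.

```rust
use std::collections::BTreeMap;

/// Resolve physical NIC names to logical interface names, assigning a
/// fresh `optN` to any NIC that has no logical interface yet.
/// `assignments` maps physical NIC -> logical name (e.g. "igc0" -> "lan").
fn resolve_bridge_members(
    members: &[&str],
    assignments: &mut BTreeMap<String, String>,
    wan_nic: &str,
) -> Result<Vec<String>, String> {
    // Reject the WAN NIC before touching any state.
    if members.iter().any(|nic| *nic == wan_nic) {
        return Err(format!("{wan_nic} is the WAN NIC and cannot be bridged"));
    }
    let mut logical = Vec::new();
    for nic in members {
        if let Some(name) = assignments.get(*nic) {
            logical.push(name.clone());
            continue;
        }
        // Assumption: pick the lowest optN not already in use.
        let mut n = 1;
        let name = loop {
            let candidate = format!("opt{n}");
            if !assignments.values().any(|v| *v == candidate) {
                break candidate;
            }
            n += 1;
        };
        assignments.insert(nic.to_string(), name.clone());
        logical.push(name);
    }
    Ok(logical)
}
```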
Inside `OPNsenseBootstrapScore`, the bridge step runs AFTER the firmware
upgrade and BEFORE the optional LAN-IP rebind — so a `target_lan` rebind
naturally targets `bridge0` rather than the now-unbound physical port.
Defaults: `reassign_lan=true`, `perf_tunables=true`, `enable_stp=false`.
Perf tunables run BEFORE the bridge create so `net.link.bridge.inherit_mac=1`
is live when the first member is attached (otherwise the bridge gets an
auto-generated MAC and the host's L2 path silently breaks after the LAN-IP
move).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `ensure_lan_bridge_atomic_via_ssh`, `ensure_physical_nic_assigned_via_ssh`,
`set_lan_member_via_ssh`, and the `AtomicBridgeOutcome` enum. These are the
SSH-driven primitives the LAN-bridge step will compose into a single Score
in the next commit.
The atomic helper runs one PHP-on-SSH script that, in a single `Config::save()`:
- promotes unassigned physical NICs to fresh `<interfaces><optN>` entries
- moves the current LAN-bound NIC to a new `optN` (when `reassign_lan` is set),
so the bridge references the OPT rather than `lan` itself — required to
avoid the circular `lan↔bridge0` member reference that silently breaks
L2 forwarding when both point at each other
- creates or updates `<bridges><bridged>` with the resolved logical members
- repoints `<interfaces><lan><if>` to the bridge
The detached configctl chain (`nohup … & </dev/null`) brings the new
topology up without deadlocking the initiating SSH channel — sshd is
restarted in the same chain so it rebinds to the new LAN device.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin wrappers over the generated bridge and interfaces/settings models,
exposed through the `Config` singleton via `Config::bridge()` and
`Config::interface_settings()` — same accessor pattern as the existing
`caddy()` / `lagg()` / `dnsmasq()` helpers.
`InterfaceSettingsConfig::ensure_offloads_disabled()` is the entry point
used by the LAN-bridge step to disable TSO/LRO globally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPNsense's `GET /api/interfaces/settings/get` returns `disablevlanhwfilter`
as a `BaseListField::getNodeOptions()` select-widget structure rather than
a plain string. Because the option keys are numeric strings (`"0"`/`"1"`/
`"2"`), PHP's `json_encode` collapses them into a JSON **array** — so the
array index IS the wire code.
The deserializer now accepts:
- plain string (the `setItem` round-trip path),
- object form (`{key: {value, selected}}`),
- array form (index = wire code).
Wire codes are also fixed to `"0"`/`"1"`/`"2"` (from the XML `value="…"`
attribute, per `BaseModel::parseOptionData`), not the element names
`"opt0"`/`"opt1"`/`"opt2"`.
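
To make the three shapes concrete, here is a dependency-free model of
them and of how the selected wire code falls out of each. The real
deserializer works on JSON via serde; `WireValue` and
`selected_wire_code` are hypothetical names for this sketch.

```rust
/// The three response shapes listed above, reduced to plain Rust values.
enum WireValue {
    /// Plain string, already the wire code.
    Plain(String),
    /// Object form: (option key, selected flag) pairs.
    Object(Vec<(String, bool)>),
    /// Array form: selected flag per entry; the index is the wire code.
    Array(Vec<bool>),
}

/// Return the wire code of the selected option, if any.
fn selected_wire_code(value: &WireValue) -> Option<String> {
    match value {
        WireValue::Plain(code) => Some(code.clone()),
        WireValue::Object(entries) => entries
            .iter()
            .find(|entry| entry.1)
            .map(|entry| entry.0.clone()),
        WireValue::Array(flags) => flags
            .iter()
            .position(|selected| *selected)
            .map(|index| index.to_string()),
    }
}
```

All three forms normalise to the same `"0"`/`"1"`/`"2"` string, which is
what gets sent back on `setItem`.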
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add codegen output for the OPNsense Interfaces/Bridge MVC model and its
settings controller. Pure generated code — no hand-written logic; mirrors
the structure of the other models under `src/generated/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The module docstring on OPNsensePinNicNamesScore already explained the
NIC-shuffle problem in prose, but didn't actually link to the sources
that justify the design. Anyone reviewing the code or auditing the
vendored script had to Google their way to the OPNsense forum thread
and the upstream repo.
Adds:
- https://forum.opnsense.org/index.php?topic=27023.0 (the canonical
thread, with franco's endorsement of ethname)
- https://forums.freebsd.org/threads/how-to-associate-an-interface-name-with-its-mac.89337/
(broader FreeBSD context for the enumeration issue)
- https://github.com/eborisch/ethname (upstream repo)
- https://www.freshports.org/sysutils/ethname/ (FreeBSD port entry)
Also restructured the pin_nic_names module docstring into "Why this
exists" / "Background reading" / "What it does" / "Two ways to use
this" sections so reviewers can find the rationale faster. The
ETHNAME_SCRIPT const in bootstrap.rs gets the upstream URL inline too,
so the script's purpose is self-evident at every call site.
No code changes. cargo doc renders the links live; cargo check / fmt /
clippy stay clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempts to update the DHCP range as part of the LAN-rebind step
poked at config.xml from PHP, which didn't match the actual XML schema on
OPNsense 26.x. The result: LAN IP moved to 192.168.200.1 but DHCP was still
trying to hand out 192.168.1.100–199 leases → no clients could obtain an
address on the new subnet, and the bootstrapping operator was kicked off
their own LAN.
Use OPNsense's own REST API instead. The existing
`opnsense_config::modules::dnsmasq::DhcpConfigDnsMasq::set_dhcp_range` already
does the right thing — it finds the dnsmasq range bound to `interface ==
"lan"` (or creates one), updates `start_addr`/`end_addr`, then asks OPNsense
to reconfigure dnsmasq. Validation and dependent service restarts go through
OPNsense's model classes, not our XPath guesses.
Sequencing matters: the API endpoint lives on the firewall's current LAN IP,
so the range update has to happen *before* the LAN IP flip kills our HTTP
connection. New flow in `OPNsenseBootstrapScore` step 5:
    5a. set_lan_dhcp_range_via_api(...)  ← OPNsense API on vanilla_ip:9443
    5b. change_lan_ip_via_ssh(...)       ← flips LAN IP, detached configctl
`change_lan_ip_via_ssh` is simplified back to a single concern: PHP rewrites
`interfaces.lan.ipaddr`/`subnet`, then a detached `configctl interface
reconfigure lan` + service-restart chain applies the change. No more
multi-backend XML guessing inside the PHP.
The DHCP pool follows OPNsense's install-default convention `<net>.100` –
`<net>.199` regardless of prefix length. Operators who want a different
range can resize via the WebUI / API after bootstrap.
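
As a sketch of the pool convention, reading `<net>` as the first three
octets of the new LAN address (that reading, and the name
`default_dhcp_pool`, are assumptions for this example):

```rust
use std::net::Ipv4Addr;

/// Derive the `<net>.100` to `<net>.199` pool from the LAN address by
/// keeping the first three octets, whatever the prefix length is.
fn default_dhcp_pool(lan_ip: Ipv4Addr) -> (Ipv4Addr, Ipv4Addr) {
    let [a, b, c, _] = lan_ip.octets();
    (Ipv4Addr::new(a, b, c, 100), Ipv4Addr::new(a, b, c, 199))
}
```

For a LAN address of 192.168.200.1 this yields 192.168.200.100 through
192.168.200.199.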
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pin-step lines I added in 9eeede18 invented two notations:
    [1/3] / [2/3] / [3/3]  in pin_nic_names_step
    (a/c) / (b/c) / (c/c)  in install_ethname_via_ssh
Neither matches the established convention. OKDAddNodeScore and the
existing OPNsenseBootstrapScore beats use plain prose verbs with no
ordinal markers — "Logged in to ...", "Enabled SSH ...", "Moved web GUI
port ...", "LAN rebind X -> Y", "Persisted OPNSenseApiCredentials + ...".
Top-level Score code carries a [ScoreName/host] tag; low-level SSH
helpers (e.g. change_lan_ip_via_ssh) log untagged short prose.
Rewrite the six pin-step log lines to follow that.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Leaked into commit 92717441 when an uncommitted debug breadcrumb in
the user's working tree was staged alongside the post-review cleanup.
`--setup` would panic at runtime before printing the closing newline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- examples/opnsense_vm_integration: flip firmware_upgrade back to Disabled
(the Score pipeline already runs OPNsenseFirmwareUpgradeScore explicitly,
bootstrap-time upgrade was redundant); rewrite module docstring to match
post-refactor behavior.
- examples/opnsense_pair_integration: add TODO near abort_wizard noting the
example should migrate to compose OPNsenseBootstrapScore.
- harmony::modules::opnsense::firmware_upgrade: pull magic timeouts into
named module-scope consts with one-line rationale; reuse the new shared
check_firmware_task_done helper for upgradestatus polling.
- opnsense-config: add check_firmware_task_done helper + name install_package's
poll interval / max attempts; install_package now shares the helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The call to `OPNsenseBootstrap::abort_wizard()` (POST
/api/core/initial_setup/abort) failed with 403 Forbidden on every run:
that endpoint requires a session-CSRF token under cookie auth and we
don't fetch one before calling it (only `login()` extracts a token, and
it's tied to the login form). The 403 was logged as WARN and silently
ignored — and empirically the wizard flag doesn't block ANY of the
following steps (SSH enable, web GUI port change, API key mint via SSH,
LAN rebind, firmware upgrade). So the call was producing log noise for
no observable benefit.
Drop the call from the Score's interpret flow. The
`OPNsenseBootstrap::abort_wizard()` helper stays in the library — a
future caller that wants to do it properly (GET an authenticated page,
extract its CSRF token, include it in the abort POST) can still use it.
Only downside: a human operator who later opens the WebUI manually
will see the OPNsense first-run wizard prompt once and have to dismiss
it. Acceptable trade for clean automated bootstrap logs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In FirmwareUpgradeMode::Prompt, the summary block was being printed
twice — once via the `info!("{tag} Pending firmware …:\n{summary}")`
line just above the mode-gating match, and again inside the
inquire::Confirm prompt's header text. The prompt now asks only the
yes/no question; the operator reads the summary from the info! log
line one row above.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every other operational primitive in harmony has a Score wrapper between
the low-level call and the user-facing composition layer. Package
installation didn't — `Config::install_package` was being called naked
from the integration example with a hand-rolled `match install_package()
{ Ok=>… Err=>compose-firmware-upgrade-Score-and-retry }` glue. That's
exactly the "imperative orchestration in the caller" pattern harmony's
CLAUDE.md tells us to push into Scores.
This commit:
- Adds `OPNsensePackageInstallScore { packages: Vec<String> }` in
a new `harmony/src/modules/opnsense/package_install.rs`. The
Interpret iterates packages, skips ones already installed via
`is_package_installed`, calls `install_package` on the rest,
surfaces newly-installed vs. already-present in
`Outcome::success_with_details`. Idempotent on re-runs.
- Adds the `OPNsensePackageInstall` variant to `InterpretName` +
Display.
- The Score deliberately has NO firmware-upgrade fallback baked in.
If install fails because firmware is stale, `install_package`'s
error message already points the operator at
`OPNsenseFirmwareUpgradeScore`. Composition is the operator's job
— same as every other Score pair relationship in harmony.
- Rewrites `examples/opnsense_vm_integration::run_integration` to
drop the ~40-line try/Err/retry block. The two new Scores
(firmware upgrade + package install) are prepended to
`build_all_scores`, so the pipeline becomes a linear vec:
      vec![
          OPNsenseFirmwareUpgradeScore { mode: Auto, .. },
          OPNsensePackageInstallScore { packages: vec!["os-haproxy"] },
          webgui, lb, dhcp, … (existing config scores)
      ]
Both `run_cli` invocations (run 1 and the idempotency run 2)
exercise the new Scores. Both naturally NOOP on the second pass:
upgrade because `firmware/status == "none"`, install because
`is_package_installed("os-haproxy") == true`.
Three unit tests in the new module cover Score name, serialization,
and empty-package-list handling.
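
The skip-or-install loop, reduced to a sketch with the two OPNsense
calls injected as closures. `install_missing` is a hypothetical name and
the `String` error type is a simplification; the real Interpret goes
through `Config` and reports via `Outcome::success_with_details`.

```rust
/// Install every package that is not present yet.
/// Returns (newly installed, already present).
fn install_missing(
    packages: &[String],
    mut is_installed: impl FnMut(&str) -> Result<bool, String>,
    mut install: impl FnMut(&str) -> Result<(), String>,
) -> Result<(Vec<String>, Vec<String>), String> {
    let mut newly_installed = Vec::new();
    let mut already_present = Vec::new();
    for pkg in packages {
        if is_installed(pkg.as_str())? {
            // Idempotent path: nothing to do for this package.
            already_present.push(pkg.clone());
        } else {
            // No fallback here: an install error is returned as-is.
            install(pkg.as_str())?;
            newly_installed.push(pkg.clone());
        }
    }
    Ok((newly_installed, already_present))
}
```

On a second run every package lands in the "already present" list, which
is the NOOP behaviour described above.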
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes the operator asked for, replacing the previous
"firmware/running ready/busy" heuristic that felt unclean.
1. **install_package**: drop the `firmware/info` + `firmware/running`
ready-idle polling. Restore the OPNsense-native pattern: poll
`/api/core/firmware/upgradestatus` until `status == "done"` (same
endpoint OPNsense's own WebUI uses for its install progress popup),
then verify via `firmware/info` whether the package actually got
installed. On failure, surface pkg's actual error from the `log`
field of the upgradestatus response (last 8 non-empty lines) plus a
"run OPNsenseFirmwareUpgradeScore first" hint. Tolerate transient
upgradestatus errors as the 26.1.6 release notes document the
endpoint as unstable; 120 × 3 s ceiling is the safety net.
Now produces the clear, fast-fail message the operator remembers
from before the branch, but with the actual pkg failure reason
("pkg: No packages available to install matching 'os-haproxy'", or
whatever the underlying issue is) included.
2. **opnsense_vm_integration**: the post-install-failure fallback now
composes `OPNsenseFirmwareUpgradeScore { mode: Auto }` into a
`Vec<Box<dyn Score<OPNSenseFirewall>>>` and dispatches it via
`harmony_cli::run_cli`, matching the way the rest of
`run_integration` runs its Scores. Replaces the direct call to the
bare `perform_firmware_upgrade()` helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I added Signal B polling to install_package + wait_for_task_or_reboot
checking for `status == ""` or `"none"` as the "configd idle" condition,
but OPNsense's `configctl firmware running` script
(core/scripts/firmware/running.sh) actually outputs `"ready"` when no
firmware operation holds the lock and `"busy"` when one does.
So Signal B never fired against a real OPNsense — the loop kept seeing
`status: "ready"` (= idle) and treating it as "still running". For
install_package this meant a doomed install still consumed the full
6-minute timeout. For wait_for_task_or_reboot it was masked by
Signal A (version moved) almost always winning first, but the bug was
the same.
Recognize "ready" (case-insensitive) plus defensive "" / "none" as
idle. Verified against the upstream script:
    if ${FLOCK} -n 9; then
        echo "ready"
    else
        echo "busy"
    fi
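
The resulting idle check amounts to something like the following.
`is_firmware_idle` is a hypothetical name, and trimming surrounding
whitespace is an extra assumption not stated above.

```rust
/// True when the `firmware/running` status means "no firmware operation
/// holds the lock": "ready" in any case, or the defensive "" / "none".
fn is_firmware_idle(status: &str) -> bool {
    // Assumption: tolerate stray whitespace or a trailing newline.
    let s = status.trim();
    s.is_empty() || s == "none" || s.eq_ignore_ascii_case("ready")
}
```

`"busy"` and any unrecognised value count as "still running".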
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Err arm after the first `install_package("os-haproxy")` attempt
used to POST `firmware/update` (= `pkg update`, repo-metadata refresh)
and sleep 5s before retrying. That's a weaker, hand-rolled subset of
what `OPNsenseFirmwareUpgradeScore` / `perform_firmware_upgrade`
already does properly.
Replace with a call to `perform_firmware_upgrade(...,
FirmwareUpgradeMode::Auto, ...)`. That does the full canonical flow:
firmware/check → firmware/status → firmware/update or upgrade →
poll (with multi-signal completion + reboot tolerance) → verify the
product_version moved. After it returns, the firewall is at the
latest firmware AND its package index is current, so the retry of
`install_package("os-haproxy")` finds the right packages and succeeds.
This is what the operator asked for: "[on install failure] it should
call the firmware update score."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`install_package`'s poll loop only watched `firmware/info` for the
positive "package installed" signal. When OPNsense's background
install task failed silently (typical on stale repo metadata: pkg
can't find os-haproxy that matches the firmware), the package never
appeared in `firmware/info`, so the loop consumed its entire 6-minute
ceiling before returning Err. The caller's fallback ("refresh
metadata + retry") couldn't fire for 6+ minutes — looked like a hang.
Add a second poll signal each iteration: `firmware/running` reports
the name of the currently active configd task (empty when idle). When
the install task vanishes (empty for 2 consecutive polls) AND the
package still isn't in `firmware/info`, we know the install ended
without succeeding. Fail fast with:
"OPNsense install task for <pkg> ended without installing the
package. The repository metadata is likely stale — try refreshing
it via firmware/update, or run OPNsenseFirmwareUpgradeScore first,
then retry."
A typical failed install is now detected within ~10s instead of 6min. The
120 × 3s ceiling stays as a safety net for "task running but never
completes" pathologies.
This restores the fast-fail behavior the OLD pre-refactor
install_package had (via its bail-on-upgradestatus-404 path), with a
proper, stable signal instead of relying on the documented-unstable
endpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the boolean `OPNsenseBootstrapScore::upgrade_firmware` knob
with a four-variant enum that decides per pending upgrade whether to
apply it automatically, skip major-series upgrades, prompt the
operator, or skip entirely. Also exposed via
`OPNsenseFirmwareUpgradeScore::mode` for standalone composition.
- `Auto` (default): apply every pending update and upgrade.
- `AutoMinor`: apply in-series updates (status == "update"); skip
major-series upgrades (status == "upgrade"). Uses OPNsense's own
`status` field for the major/minor distinction — no version-string
parsing.
- `Prompt`: print a per-iteration summary and ask via inquire::Confirm.
Errors out with `PromptRequiresTty` when run headless (no TTY) so
CI contexts must pick `Auto` / `AutoMinor` / `Disabled` explicitly.
- `Disabled`: skip the upgrade step entirely.
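
A compact sketch of how the four variants map onto OPNsense's `status`
value. Only the variant names and `PromptRequiresTty` come from this
change; `Decision`, `DecideError`, `decide_upgrade`, and the choice to
raise the TTY error only when something is pending are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FirmwareUpgradeMode {
    Auto,
    AutoMinor,
    Prompt,
    Disabled,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Apply,
    Skip,
    AskOperator,
}

#[derive(Debug, PartialEq)]
enum DecideError {
    PromptRequiresTty,
}

/// `status` is OPNsense's own field: "update" (in-series), "upgrade"
/// (major series), or anything else for "nothing actionable".
fn decide_upgrade(
    mode: FirmwareUpgradeMode,
    status: &str,
    has_tty: bool,
) -> Result<Decision, DecideError> {
    let pending = matches!(status, "update" | "upgrade");
    match mode {
        FirmwareUpgradeMode::Disabled => Ok(Decision::Skip),
        _ if !pending => Ok(Decision::Skip),
        FirmwareUpgradeMode::Auto => Ok(Decision::Apply),
        FirmwareUpgradeMode::AutoMinor if status == "update" => Ok(Decision::Apply),
        FirmwareUpgradeMode::AutoMinor => Ok(Decision::Skip),
        FirmwareUpgradeMode::Prompt if has_tty => Ok(Decision::AskOperator),
        FirmwareUpgradeMode::Prompt => Err(DecideError::PromptRequiresTty),
    }
}
```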
The summary surfaced for Prompt (and logged in Auto/AutoMinor too)
includes:
- status_msg (OPNsense's "108 updates available, 349.2 MiB, reboot
required" line)
- whether the OPNsense product package itself is being upgraded
("Main OPNsense: 26.1 → 26.1.8") or whether the update only
touches plugins/packages ("Main OPNsense: staying at <ver>")
- kind (update vs upgrade)
- reboot required (yes/no)
Two new helpers — `extract_opnsense_version_change` and
`render_upgrade_summary` — pull the version diff out of
`status.all_packages` / `status.all_sets` (looking for the `opnsense`
or `opnsense-update` entry) and assemble the human-readable block.
Wired through:
- `OPNsenseFirmwareUpgradeScore::mode` (default Auto).
- `OPNsenseBootstrapScore::firmware_upgrade` (replaces
`upgrade_firmware: bool`; same default behavior).
- `examples/opnsense_vm_integration` opts out with
`firmware_upgrade: FirmwareUpgradeMode::Disabled`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a firmware update has status_reboot=1, the reboot IS the final
step of the install — once it completes, the task is done by
definition. But my multi-signal polling loop kept trying to verify
completion via signals A/B/C *after* the reboot, and all three are
unreliable post-reboot:
A. firmware/info `product_version` doesn't change if the update was
package-only (no version bump).
B. firmware/running keeps reporting the previous task as active
until the next firmware/check kicks it — the operator observed
"clicking 'check for updates' in the UI unstuck it", confirming
OPNsense retains stale task state until a fresh check resets it.
C. firmware/upgradestatus 404s (documented unstable on 26.1) when
no task is registered, which is the state after a real upgrade.
Net: in iteration 2 of a real upgrade run (26.1.8 → 26.1.x with 2 more
packages), the wait loop was stuck silently polling for several
minutes after the firewall had already rebooted and was fully
operational.
Now: when the TCP probe detects unreachable and wait_for_reboot_cycle
returns, immediately return TaskOutcome { rebooted: true } instead of
re-entering the polling loop. The outer perform_firmware_upgrade loop
already calls firmware/check at the top of the next iteration (which
both refreshes OPNsense's task state AND tells us if more updates are
pending) and reads firmware/info after wait_for_task_or_reboot returns
to verify the version moved. Those are the real post-reboot
completion signals — the in-loop polling was redundant and harmful.
The non-reboot path (status_reboot=0 updates, e.g. pure metadata
refresh) is unchanged: signals A/B/C still run because there's no
reboot to use as a definitive completion event.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`OpnsenseClient::handle_response_typed` logged the entire response body
in its WARN line on non-success status codes. OPNsense's 404 page is
~12 lines of HTML; every transient 404 (and the firmware/upgradestatus
endpoint 404s constantly on 26.1, per release notes calling it
"known to be unstable") dumped a multi-line block into the log.
Route the body through a new truncate_for_log helper that keeps the
first non-empty line, caps the result at 200 chars, and appends an
ellipsis if anything was elided. JSON error responses (typically one
short line) stay intact; HTML pages collapse to "<!DOCTYPE html>…".
The `Error::Api { body, .. }` value passed to callers is unchanged, so
code that wants to inspect the full body still can.
Three unit tests cover: short-line passthrough, HTML collapsing,
length-capped ellipsis.
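
A sketch of what such a helper can look like. It follows the description
above but is not the committed code: whether the 200-char cap counts the
ellipsis, and trimming each line, are assumptions.

```rust
/// Maximum number of characters kept from the first non-empty line.
const MAX_LOG_CHARS: usize = 200;

/// Reduce a response body to one short line suitable for a WARN log.
fn truncate_for_log(body: &str) -> String {
    // Keep only the first line that has any content.
    let first_line = body
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty())
        .unwrap_or("");
    // Cap by characters, not bytes, so multi-byte text is never split.
    let mut out: String = first_line.chars().take(MAX_LOG_CHARS).collect();
    // Mark the result if later lines or extra characters were dropped.
    if out.as_str() != body.trim() {
        out.push('…');
    }
    out
}
```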
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous wait loop polled only `firmware/upgradestatus` for
`status == "done"`, with a TCP-probe fallback to detect reboots. Two
ways this got stuck against real OPNsense 26.1:
1. After the upgrade reboot completed, OPNsense had no active
background task to track, so `upgradestatus` 404'd indefinitely.
2. The TCP-probe fallback could miss the brief unreachable window
between two 5s-apart polls — if both polls saw the API up, we never
set `rebooted=true` and never bailed.
Net: the upgrade ran fine (26.1 → 26.1.8 applied), but our code waited
forever for a "done" signal that never came.
Now the wait loop polls THREE signals per iteration and exits on any:
A. firmware/info `product_version` differs from version_before_action
B. firmware/running `status` empty for 2 consecutive polls (configd
reports no active task)
C. firmware/upgradestatus `status == "done"` (when the endpoint works)
Plus the TCP probe still detects mid-task reboots and waits for the
firewall to come back — but it's no longer the sole exit path. After a
reboot, signal A almost always wins on the first post-recovery poll.
perform_firmware_upgrade now snapshots the version before each action
and passes it as `version_before_action` so signal A has a baseline
that's valid even after a previous iteration already bumped the
version.
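
The per-iteration evaluation of signals A/B/C could be modelled as below.
This is an illustrative sketch: `PollSample`, `CompletionTracker` and
`Signal` are hypothetical, and a `None` field stands for "that endpoint
failed this round".

```rust
#[derive(Debug, PartialEq)]
enum Signal {
    VersionMoved,      // A
    TaskIdle,          // B
    UpgradeStatusDone, // C
}

/// One round of polling; `None` means the endpoint did not answer.
struct PollSample<'a> {
    product_version: Option<&'a str>,
    running_status: Option<&'a str>,
    upgrade_status: Option<&'a str>,
}

struct CompletionTracker {
    version_before_action: String,
    idle_polls: u32,
}

impl CompletionTracker {
    fn new(version_before_action: &str) -> Self {
        Self { version_before_action: version_before_action.to_string(), idle_polls: 0 }
    }

    /// Return the first signal that fires for this sample, if any.
    fn observe(&mut self, sample: &PollSample<'_>) -> Option<Signal> {
        if let Some(v) = sample.product_version {
            if v != self.version_before_action.as_str() {
                return Some(Signal::VersionMoved);
            }
        }
        // B needs two consecutive empty statuses; anything else resets.
        match sample.running_status {
            Some("") => self.idle_polls += 1,
            _ => self.idle_polls = 0,
        }
        if self.idle_polls >= 2 {
            return Some(Signal::TaskIdle);
        }
        if sample.upgrade_status == Some("done") {
            return Some(Signal::UpgradeStatusDone);
        }
        None
    }
}
```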
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version did `POST firmware/check`, slept 5s, read
`firmware/status`, and used a fragile "status_msg contains 'up to date'"
heuristic to decide whether to upgrade. Two related bugs:
1. `firmware/check` is async: it returns immediately with a msg_uuid and
runs in the background. 5s is far shorter than the metadata refresh
takes; on a fresh boot the status still reads
"requires to check for update first to provide more information"
when we look.
2. That message doesn't match any of my "up to date" keywords, so my
pending-check returned true and triggered `firmware/upgrade` against
a system that had no actionable upgrade plan. firmware/upgrade
returned immediately (also async), status stayed "none", and the
helper reported success without anything having happened.
Rewrite per OPNsense's actual API (verified against FirmwareController.php
and firmware.volt):
  1. GET firmware/info → capture initial product_version
  2. Loop ≤ 5 iterations (a kernel upgrade can unlock further package
     updates that need their own pass):
     a. POST firmware/check (async)
     b. Poll firmware/upgradestatus until status == "done"
     c. POST firmware/status → read `status` enum
        ("none"/"update"/"upgrade"/"error")
     d. If "none": done. (First iteration → NOOP. Later → success.)
     e. If "update" / "upgrade": POST firmware/{that}, poll
        upgradestatus until done, handle reboot (auto-reboot or explicit
        firmware/reboot if status_reboot == "1"), then GET firmware/info
        to verify product_version changed.
  3. Return UpgradeOutcome with initial_version, final_version,
     iterations, rebooted flag.
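
The control flow of step 2 can be sketched with the API calls replaced
by closures. Names here (`FirmwareStatus`, `run_upgrade_loop`) are
hypothetical, unknown status strings are mapped to `Error` by
assumption, and what happens when the iteration cap is reached is not
specified above, so the sketch simply stops and reports the count.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FirmwareStatus {
    None,
    Update,
    Upgrade,
    Error,
}

fn parse_status(s: &str) -> FirmwareStatus {
    match s {
        "none" => FirmwareStatus::None,
        "update" => FirmwareStatus::Update,
        "upgrade" => FirmwareStatus::Upgrade,
        // Assumption: anything unexpected is treated like "error".
        _ => FirmwareStatus::Error,
    }
}

const MAX_ITERATIONS: usize = 5;

/// `check_status` stands for steps a-c, `apply` for step e.
/// Returns how many update/upgrade passes were applied (0 means NOOP).
fn run_upgrade_loop(
    mut check_status: impl FnMut() -> FirmwareStatus,
    mut apply: impl FnMut(FirmwareStatus),
) -> Result<usize, String> {
    for applied in 0..MAX_ITERATIONS {
        match check_status() {
            FirmwareStatus::None => return Ok(applied),
            FirmwareStatus::Error => {
                return Err("firmware/status reported an error".to_string())
            }
            action => apply(action),
        }
    }
    // Assumption: hitting the cap is reported as "applied MAX_ITERATIONS".
    Ok(MAX_ITERATIONS)
}
```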
The upgrade Score's Outcome now reports the actual version transition
("Firmware upgraded: 25.7.4 → 26.1.6 in 2 iteration(s) (rebooted: true)").
Mid-upgrade reboots are detected by `upgradestatus` going unreachable +
a TCP probe confirming the API is down (vs. just a 404 from the
documented-unstable endpoint).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `Score<OPNSenseFirewall>` that brings an OPNsense firewall to the
latest available firmware/package level via the REST API:
    POST firmware/check    → refresh upstream metadata
    GET  firmware/status   → check what's actionable
    POST firmware/upgrade  → trigger if anything is pending
    poll firmware/status   → wait for status to return to "none"
    POST firmware/reboot   → if status_reboot == "1" and we haven't
                             already auto-rebooted mid-upgrade
    + wait unreachable → reachable → 30s settle
The core logic lives in `perform_firmware_upgrade()` so it can be
called from elsewhere. `OPNsenseBootstrapScore` now exposes
`upgrade_firmware: bool` (default `true`) and calls the same helper
after credentials are persisted, before any optional LAN rebind. The
firewall thus ends bootstrap on its latest firmware, exactly the right
beat operationally: no production traffic yet, operator already
babysitting, all subsequent Scores run against current code.
Why not extend `OPNSenseLaunchUpgrade` (the existing SSH-based
Score)? It calls a shell script (`opnsense-update.sh`), has a
`todo!()` Serialize impl, no idempotency check, and holds an
`Arc<Config>` directly instead of reading from a topology. The new
score uses the REST API end-to-end, idempotency-checks via
`status_upgrade_action`, and slots cleanly into normal Score<T>
composition. `OPNSenseLaunchUpgrade` stays alongside it for now;
affilium2 keeps working unchanged. We can deprecate the SSH one in a
follow-up once the API one has flown against real firewalls.
`opnsense_vm_integration` explicitly opts out of the bootstrap-time
upgrade (`upgrade_firmware: false`) — the VM image is a known
firmware version, and we don't want each integration run to spend 10+
minutes pulling firmware updates.
New `InterpretName::OPNsenseFirmwareUpgrade` variant. Unit tests
cover score name, default api_port (9443), and serialization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After `fix(opnsense-config): poll firmware/info, not the unstable
upgradestatus, in install_package`, the library's install_package call
handles its own polling correctly and tolerates transient API errors.
So the example's fallback no longer needs to track firmware/status,
issue an explicit reboot, or wait for an unreachable→reachable cycle —
all of that logic was duplicating what should be (and now is) the
library's responsibility.
Collapse the ~100-line Err arm to ~15 lines: when the first
install_package attempt fails, kick `firmware/update` (== `pkg update`,
refresh repository metadata), sleep 5s, and retry. The original
failure mode (first install fails because a freshly bootstrapped
firewall has no pkg metadata yet) is what this fallback exists to
address; nothing more.
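A minimal sketch of the collapsed fallback shape, not the example's actual code. The two closures are hypothetical stand-ins for the `install_package` call and the `firmware/update` request, and the `String` error type is an assumption made to keep the sketch dependency-free:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `install`; if it fails, runs `refresh`, waits `pause`, and retries
/// `install` exactly once. Both closures are hypothetical placeholders for
/// the real API calls.
fn install_with_refresh<I, R>(
    mut install: I,
    mut refresh: R,
    pause: Duration,
) -> Result<(), String>
where
    I: FnMut() -> Result<(), String>,
    R: FnMut() -> Result<(), String>,
{
    match install() {
        Ok(()) => Ok(()),
        Err(first) => {
            // The first attempt failed, presumably because the freshly
            // bootstrapped firewall has no pkg metadata yet.
            refresh().map_err(|e| {
                format!("install failed ({first}); metadata refresh failed ({e})")
            })?;
            sleep(pause);
            // A single retry; its result is returned as-is.
            install()
        }
    }
}
```

With `pause` set to five seconds this mirrors the "kick, sleep 5s, retry" sequence; a second failure is returned to the caller unchanged rather than retried again.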
Net deletion of ~85 lines from the example.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The legacy polling loop in `Config::install_package` hit
`/api/core/firmware/upgradestatus` until `status == "done"`. That
endpoint is marked "known to be unstable" in the OPNsense 26.1.6 release
notes (the WebUI itself traps its generic error popup) and reliably
404s on a freshly bootstrapped 26.1 system. The loop's error handling
used `.map_err(Error::Api)?` so a single 404 short-circuited the whole
install — even when the underlying install_package POST succeeded.
Switch to polling `/api/core/firmware/info` and looking for the package
in the response with `installed == "1"`. That's the same check the
existing code did AFTER the loop; moving it INTO the loop removes the
dependency on `upgradestatus` entirely. Transient errors from the
firmware/info call are now logged at debug! and tolerated as
"keep polling" (the API may briefly be unreachable if a package install
triggers a reboot, which is extremely rare for plugins but cheap to
tolerate).
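The polling shape described above can be sketched roughly as follows. This is an illustration under assumptions, not the library's code: `PackageInfo`, its field names, the `fetch_info` closure, and the attempt-count timeout are all hypothetical, and the real implementation deserializes the firmware/info response rather than receiving a ready-made list:

```rust
use std::thread::sleep;
use std::time::Duration;

/// One package entry, reduced to the two fields the check needs.
/// The field names are assumptions for this sketch.
struct PackageInfo {
    name: String,
    installed: String,
}

/// Polls `fetch_info` until `package` shows up with installed == "1".
/// Errors from `fetch_info` are treated as transient and polling continues.
fn wait_until_installed<F>(
    mut fetch_info: F,
    package: &str,
    max_attempts: u32,
    interval: Duration,
) -> Result<(), String>
where
    F: FnMut() -> Result<Vec<PackageInfo>, String>,
{
    for _ in 0..max_attempts {
        match fetch_info() {
            Ok(packages) => {
                let done = packages
                    .iter()
                    .any(|p| p.name == package && p.installed == "1");
                if done {
                    return Ok(());
                }
            }
            // Transient failure (for example a 404 or a brief outage):
            // ignore it and keep polling.
            Err(_e) => {}
        }
        sleep(interval);
    }
    Err(format!(
        "{package} did not appear as installed within {max_attempts} attempts"
    ))
}
```

The success check and the loop condition are the same test, so no second endpoint is needed to decide when to stop.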
The unused `UpgradeStatus` struct is dropped along with the legacy loop.
Behavior on success is identical (returns Ok, same info! log). On
timeout the error message (`"did not appear as installed within 360
seconds"`) is more descriptive than the previous `"installation did not
complete successfully"`, which was printed both when polling timed out
and when the package was missing from firmware/info.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The firmware-update fallback used to poll `/api/core/firmware/upgradestatus`
for a "done" signal, then run wait_for_https and sleep 10s before retrying
the package install. That endpoint is documented as "known to be
unstable" in the OPNsense 26.1.6 release notes (the WebUI itself traps its
generic error popup), so the polling loop never broke out via the
success path; it just timed out. wait_for_https then succeeded during
the brief window before OPNsense actually started rebooting, and the
install retry got killed mid-reboot with a `reqwest::Request` timeout.
Switch to `/api/core/firmware/status`, which is the stable endpoint and
returns a definitive `status_reboot` field ('1' if a reboot is required
after the in-progress update/upgrade, computed from `needs_reboot` and
`upgrade_needs_reboot` per FirmwareController.php). Poll until the
update finishes (status == "none") or the API becomes unreachable
(auto-reboot during update), then read `status_reboot` and trigger an
explicit `POST /api/core/firmware/reboot` if needed. The
wait-for-unreachable window after the reboot is then a tight 60s — we
know the reboot just happened. No more blind multi-minute timeouts.
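The per-poll decision can be pictured as a small pure function. This is a sketch under assumptions: the `Poll` and `Next` types are hypothetical, and only the `status` and `status_reboot` field names and the values "none" and "1" come from the description above:

```rust
/// Result of one poll of firmware/status. `Unreachable` models a failed
/// request, which is taken to mean the firewall rebooted on its own.
enum Poll {
    Status { status: String, status_reboot: String },
    Unreachable,
}

/// What the caller should do after a poll (hypothetical names).
#[derive(Debug, PartialEq)]
enum Next {
    /// Update still in progress.
    KeepPolling,
    /// Update finished and a reboot is required: POST firmware/reboot.
    RequestReboot,
    /// API vanished mid-update: wait for it to come back.
    WaitForReturn,
    /// Update finished, no reboot needed.
    Done,
}

fn next_step(poll: &Poll) -> Next {
    match poll {
        Poll::Unreachable => Next::WaitForReturn,
        Poll::Status { status, status_reboot } if status.as_str() == "none" => {
            if status_reboot.as_str() == "1" {
                Next::RequestReboot
            } else {
                Next::Done
            }
        }
        // Any other status value means the update is still running.
        Poll::Status { .. } => Next::KeepPolling,
    }
}
```

Keeping the decision separate from the HTTP calls makes each branch testable without a firewall.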
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous refactor only defaulted HARMONY_SECRET_NAMESPACE; running
--full / --boot then panicked because `init_secret_manager` falls back
to the Infisical backend when HARMONY_SECRET_STORE is unset (see
harmony_secret::lib:82), and that requires HARMONY_SECRET_INFISICAL_URL.
Default HARMONY_SECRET_STORE to "file" the same way so `cargo run -p
opnsense-vm-integration -- --full` works out of the box without
sourcing an env.sh.
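One way to express "default the variable only if the operator has not exported it" is sketched below. The helper name `set_default_env` is hypothetical, and the `unsafe` block is there because `std::env::set_var` is an unsafe function in the Rust 2024 edition (on older editions it only produces a warning):

```rust
use std::env;

/// Sets `key` to `default` only when the variable is not already present.
/// Intended to be called at the top of main(), before any threads exist.
fn set_default_env(key: &str, default: &str) {
    if env::var_os(key).is_none() {
        // SAFETY: assumed to run single-threaded at program start, so no
        // other thread can be reading the environment concurrently.
        unsafe { env::set_var(key, default) };
    }
}
```

A call such as `set_default_env("HARMONY_SECRET_STORE", "file")` leaves an operator-exported value untouched and fills in the default otherwise.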
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the procedural login → abort wizard → SSH → port → API-key sequence
in `boot_vm` and `run_integration`, and replace the bootstrap leg with a
single `harmony_cli::run_cli` invocation of `OPNsenseBootstrapScore`
against `OPNsenseBootstrapTopology`. The diagnose_via_ssh fallback and
the SSH-22 polling loop go away too — both are covered by the Score's
own idempotency probe and the per-step error messages the Score emits.
Credentials now round-trip through `SecretManager` rather than through
local variables: the Score persists `OPNSenseApiCredentials` +
`OPNSenseFirewallCredentials` from `--boot` / `--full`, and
`run_integration` reads them back when constructing the production
`OPNSenseFirewall` topology and the typed `OpnsenseClient` used by the
verification step.
`SecretManager` panics on a missing `HARMONY_SECRET_NAMESPACE`, so main()
sets a binary-specific default if the operator hasn't already exported
one. `harmony_secret` is added as a direct dependency.
No behavior change for `--check` / `--download` / `--clean` / `--status`.
`--boot` and `--full` now emit `[OPNsenseBootstrap/192.168.1.1]`-prefixed
log lines from the Score's Interpret. Subsequent `--boot` runs against
an already-bootstrapped VM are a NOOP via the idempotency check instead
of re-running the dance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Solve the OPNsense bootstrap chicken-and-egg problem with a Score-shaped
abstraction. Until now, every binary deploying onto a fresh OPNsense had
to copy ~80 lines of procedural orchestration (login → abort wizard →
SSH → port move → API key mint → LAN flip) into its own main.rs,
because the bootstrap creates the very credentials that `OPNSenseFirewall`
needs to construct.
The trick: a separate, minimal `OPNsenseBootstrapTopology` that holds only
{vanilla_ip, default_username, default_password}. The new
`Score<OPNsenseBootstrapTopology>` runs the dance from
`Interpret::execute`, persists `OPNSenseApiCredentials` and
`OPNSenseFirewallCredentials` to `SecretManager`, and optionally rebinds
the LAN. The calling binary then builds a normal `OPNSenseFirewall` from
the now-stored credentials and runs `Score<OPNSenseFirewall>` composition
against it — two Maestro<T> phases in sequence, SecretManager as the
bridge.
Idempotency is handled by a 4-boolean decision matrix
(api_creds_exist, ssh_creds_exist, vanilla_reachable, target_reachable)
extracted into a pure helper and table-tested. The Score is safe to
re-run: NOOP when already bootstrapped, DANCE on a first run or partial
resume, FAILURE with clear recovery instructions when the target is up but
the secrets are lost (factory-reset and re-run).
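A pure helper of this kind might look roughly like the sketch below. The exact mapping from the four booleans to an outcome is an assumption reconstructed from the three cases named above; the `Decision` type, the function name, and the messages are hypothetical:

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Noop,
    Dance,
    Failure(&'static str),
}

/// Maps the four probes to an action. The mapping is an assumed
/// reading of the description, not the Score's actual table.
fn decide(
    api_creds_exist: bool,
    ssh_creds_exist: bool,
    vanilla_reachable: bool,
    target_reachable: bool,
) -> Decision {
    let all_creds = api_creds_exist && ssh_creds_exist;
    match (all_creds, vanilla_reachable, target_reachable) {
        // Secrets stored and the firewall answers at its final address.
        (true, _, true) => Decision::Noop,
        // The factory-default address answers: first run or partial resume.
        (_, true, _) => Decision::Dance,
        // Only the target answers, but some secrets are missing.
        (false, false, true) => Decision::Failure(
            "target is up but stored credentials are missing; factory-reset and re-run",
        ),
        // Nothing answers at either address.
        (_, false, false) => Decision::Failure("firewall unreachable"),
    }
}
```

Because the function takes only booleans, every one of the sixteen input combinations can be listed in a table test.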
Output follows the precedent of `OKDAddNodeScore`:
- `[OPNsenseBootstrap/{vanilla_ip}]`-prefixed log lines, one info! per
state change
- Runbook-shaped Outcome::success_with_details listing where the firewall
now lives, where credentials were stored, and the manual reconnect step
if a LAN rebind happened
- Multi-sentence InterpretError messages including the recovery path
Includes a new `OPNsenseBootstrap` variant on `InterpretName`. Unit tests
cover Score name, serialization, the full idempotency decision matrix,
and `ensure_ready` failure when the firewall is unreachable.
Scope: abstraction-only. Example main.rs files keep their current
procedural shape; refactoring them to compose the new Score is a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`create_api_key_ssh` and `change_lan_ip_via_ssh` were defined identically in
both `opnsense_vm_integration` and `opnsense_pair_integration` example
main.rs files. Lift them into `harmony::modules::opnsense::bootstrap` as
`pub` free functions so future callers (including a forthcoming
`OPNsenseBootstrapScore`) reuse a single canonical implementation.
Also add `probe_https`, a one-shot reachability probe with a short timeout,
which the bootstrap Score will use for its idempotency check.
Behavior in the two examples is unchanged — they pass `"root"`/`"opnsense"`
at their call sites, matching the hard-coded values the deleted local
helpers used. Username and password are now parameters (validated against
PHP-injection-prone characters), and `new_ip` in `change_lan_ip_via_ssh` is
strictly parsed as `IpAddr` before interpolation.
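The two input checks can be illustrated with standard-library code only. This is a sketch: the function names and the set of rejected characters are assumptions, and the real helpers may reject a different set:

```rust
use std::net::IpAddr;

/// Rejects values containing characters that could break out of a quoted
/// PHP string. The forbidden set below is an assumption for illustration.
fn validate_credential(field: &str, value: &str) -> Result<(), String> {
    const FORBIDDEN: &[char] = &['\'', '"', '\\', '$', '`', ';'];
    if value.is_empty() {
        return Err(format!("{field} must not be empty"));
    }
    if let Some(c) = value
        .chars()
        .find(|c| FORBIDDEN.contains(c) || c.is_control())
    {
        return Err(format!("{field} contains forbidden character {c:?}"));
    }
    Ok(())
}

/// Parses the address strictly: anything that is not exactly an IPv4 or
/// IPv6 literal (trailing text, whitespace) is rejected before it can be
/// interpolated into a command.
fn parse_lan_ip(new_ip: &str) -> Result<IpAddr, String> {
    new_ip
        .parse::<IpAddr>()
        .map_err(|e| format!("invalid IP address {new_ip:?}: {e}"))
}
```

Parsing into `IpAddr` and formatting the parsed value back out means only a canonical address string ever reaches the remote shell.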
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>