# Harmony Architecture — Three Open Challenges
Three problems that, if solved well, would make Harmony the most capable infrastructure automation framework in existence.
## 1. Topology Evolution During Deployment
### The problem
A bare-metal OKD deployment is a multi-hour process where the infrastructure's capabilities change as the deployment progresses:
- Phase 0: Network only → OPNsense reachable, Brocade reachable, no hosts
- Phase 1: Discovery → PXE boots work, hosts appear via mDNS, no k8s
- Phase 2: Bootstrap → openshift-install running, API partially available
- Phase 3: Control plane → k8s API available, operators converging, no workers
- Phase 4: Workers → Full cluster, apps can be deployed
- Phase 5: Day-2 → Monitoring, alerting, tenant onboarding
Today, `HAClusterTopology` implements all capability traits from the start. If a Score calls `k8s_client()` during Phase 0, it hits `DummyInfra`, which panics. The type system says "this is valid," but the runtime says "this will crash."
### Why it matters
- Scores that require k8s compile and register happily at Phase 0, then panic if accidentally executed too early
- The pipeline is ordered by convention (Stage 01 → 02 → 03 → ...) but nothing enforces that Stage 04 can't run before Stage 02
- Adding new capabilities (like "cluster has monitoring installed") requires editing the topology struct, not declaring the capability was acquired
### Design direction
The topology should evolve through phases where capabilities are acquired, not assumed. Two possible approaches:
#### A. Phase-gated topology (runtime)
The topology tracks which phase it's in. Capability methods check the phase before executing and return a meaningful error instead of panicking:
```rust
impl K8sclient for HAClusterTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        if self.phase < Phase::ControlPlaneReady {
            return Err(format!(
                "k8s API not available yet (current phase: {:?})",
                self.phase
            ));
        }
        // ... actual implementation
    }
}
```
Scores that fail due to phase mismatch get a clear error message, not a panic. The Maestro can validate phase requirements before executing a Score.
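The phase comparison above assumes a totally ordered phase type. A minimal, self-contained sketch of that idea — `Phase`, `PhasedTopology`, and `require` are hypothetical names, not Harmony's actual API:

```rust
// Hypothetical sketch of phase-gated capability checks. The derived
// Ord gives phases the ordering that `self.phase < needed` relies on.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Phase {
    NetworkOnly,
    Discovery,
    Bootstrap,
    ControlPlaneReady,
    WorkersReady,
    DayTwo,
}

pub struct PhasedTopology {
    pub phase: Phase,
}

impl PhasedTopology {
    /// Return a descriptive error instead of panicking when a
    /// capability is requested before its phase has been reached.
    pub fn require(&self, needed: Phase) -> Result<(), String> {
        if self.phase < needed {
            return Err(format!(
                "capability requires phase {:?}, but current phase is {:?}",
                needed, self.phase
            ));
        }
        Ok(())
    }
}
```

Each capability method would call `require(...)` as its first statement, so a too-early Score gets a readable error rather than a `DummyInfra` panic.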
#### B. Typestate topology (compile-time)
Use Rust's type system to make invalid phase transitions unrepresentable:
```rust
struct Topology<P: Phase> { ... }

impl Topology<NetworkReady> {
    fn bootstrap(self) -> Topology<Bootstrapping> { ... }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> { ... }
}

// Only ClusterReady implements K8sclient
impl K8sclient for Topology<ClusterReady> { ... }
```
This is the "correct" Rust approach but requires significant refactoring and may be too rigid for real deployments where phases overlap.
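For illustration, the sketch above can be made runnable with `PhantomData` as the phase marker. All names here are toy stand-ins; the point is that calling `k8s_client()` before `promote()` is a compile error, not a panic:

```rust
use std::marker::PhantomData;

// Toy typestate topology: phases are zero-sized marker types and each
// transition consumes the old phase, so stale handles cannot be reused.
struct NetworkReady;
struct Bootstrapping;
struct ClusterReady;

struct Topology<P> {
    _phase: PhantomData<P>,
}

impl Topology<NetworkReady> {
    fn new() -> Self {
        Topology { _phase: PhantomData }
    }
    fn bootstrap(self) -> Topology<Bootstrapping> {
        Topology { _phase: PhantomData }
    }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> {
        Topology { _phase: PhantomData }
    }
}

impl Topology<ClusterReady> {
    // Only the ClusterReady phase exposes this; earlier phases simply
    // do not have the method, so misuse fails at compile time.
    fn k8s_client(&self) -> &'static str {
        "connected"
    }
}
```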
**Recommendation:** Start with (A) — runtime phase tracking. It's additive (no breaking changes), catches the DummyInfra panic problem immediately, and provides the data needed for (B) later.
## 2. Runtime Plan & Validation Phase
### The problem
Harmony validates Scores at compile time: if a Score requires `DhcpServer` + `TftpServer`, the topology must implement both traits or the program won't compile. This is powerful but insufficient.
What compile-time cannot check:
- Is the OPNsense API actually reachable right now?
- Does VLAN 100 already exist (so we can skip creating it)?
- Is there already a DHCP entry for this MAC address?
- Will this firewall rule conflict with an existing one?
- Is there enough disk space on the TFTP server for the boot images?
Today, these are discovered at execution time, deep inside an Interpret's `execute()` method. A failure at minute 45 of a deployment is expensive.
### Why it matters
- No way to preview what Harmony will do before it does it
- No way to detect conflicts or precondition failures early
- Operators must read logs to understand what happened — there's no structured "here's what I did" report
- Re-running a deployment is scary because you don't know what will be re-applied vs skipped
### Design direction
Add a `validate` phase to the Score/Interpret lifecycle:
```rust
#[async_trait]
pub trait Interpret<T>: Debug + Send {
    /// Check preconditions and return what this interpret WOULD do.
    /// Default implementation returns "will execute" (opt-in validation).
    async fn validate(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<ValidationReport, InterpretError> {
        Ok(ValidationReport::will_execute(self.get_name()))
    }

    /// Execute the interpret (existing method, unchanged).
    async fn execute(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<Outcome, InterpretError>;

    // ... existing methods
}
```
A `ValidationReport` would contain:
- Status: `WillCreate`, `WillUpdate`, `WillDelete`, `AlreadyApplied`, `Blocked(reason)`
- Details: human-readable description of planned changes
- Preconditions: list of checks performed and their results
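A possible shape for these types — a sketch, not Harmony's real definitions; the `WillExecute` variant is an assumption added as the default for interprets that opt out of validation:

```rust
// Hypothetical ValidationReport shape. Variant names follow the list
// above; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub enum ValidationStatus {
    WillExecute, // assumed default when no validation logic exists
    WillCreate,
    WillUpdate,
    WillDelete,
    AlreadyApplied,
    Blocked(String),
}

#[derive(Debug, Clone)]
pub struct ValidationReport {
    pub interpret_name: String,
    pub status: ValidationStatus,
    /// Human-readable description of planned changes.
    pub details: Vec<String>,
    /// (check name, passed?) for each precondition examined.
    pub preconditions: Vec<(String, bool)>,
}

impl ValidationReport {
    /// Default report for interprets that don't implement validate().
    pub fn will_execute(name: String) -> Self {
        Self {
            interpret_name: name,
            status: ValidationStatus::WillExecute,
            details: Vec::new(),
            preconditions: Vec::new(),
        }
    }
}
```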
The Maestro would run validation for all registered Scores before executing any of them, producing a plan that the operator reviews.
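The "validate everything before executing anything" flow could look like the following self-contained sketch, where `Plannable` and `build_plan` are simplified stand-ins for the real Interpret trait and Maestro:

```rust
// Illustrative plan-before-execute loop: any blocked Score aborts
// planning before anything runs, and a successful pass yields a
// reviewable list of planned changes.
trait Plannable {
    fn name(&self) -> String;
    /// Ok(description of planned change) or Err(blocking reason).
    fn validate(&self) -> Result<String, String>;
}

fn build_plan(scores: &[Box<dyn Plannable>]) -> Result<Vec<String>, String> {
    let mut plan = Vec::new();
    for score in scores {
        match score.validate() {
            Ok(line) => plan.push(format!("{}: {}", score.name(), line)),
            Err(reason) => {
                return Err(format!("{} blocked: {}", score.name(), reason))
            }
        }
    }
    Ok(plan)
}
```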
This is opt-in: Scores that don't implement `validate()` get a default "will execute" report. Over time, each Score adds validation logic. The OPNsense Scores are ideal first candidates, since they can query current state via the API.
### Relationship to state
This approach does not require a state file. Validation queries the infrastructure directly — the same philosophy Harmony already follows. The "plan" is computed fresh every time by asking the infrastructure what exists right now.
### Concrete use case: WebGuiConfigScore → LoadBalancerScore
`LoadBalancerScore` configures HAProxy to bind on port 443. But OPNsense's webgui defaults to port 443 — creating a port conflict. `WebGuiConfigScore` moves the webgui to port 9443 first.
Today this is solved by ordering convention: `WebGuiConfigScore` is registered before `LoadBalancerScore` in the Score list. If someone reorders them, HAProxy silently fails to bind.
This is the simplest example of an implicit Score dependency that the current system cannot express or enforce. The `score_with_dep.rs` sketch explores declaring these dependencies at the type level, and Challenge #1 (phase-gated topology) would also help: a topology in the "webgui on 443" phase could reject `LoadBalancerScore` at validation time.
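In the spirit of the score_with_dep.rs design note, a minimal sketch of declared dependencies plus an up-front ordering check — trait and function names here are hypothetical, not the sketch's actual API:

```rust
use std::collections::HashSet;

// Each Score names the Scores that must run before it; the runner
// rejects a mis-ordered list before execution instead of letting
// HAProxy silently fail to bind at minute 45.
trait ScoreMeta {
    fn name(&self) -> &'static str;
    fn depends_on(&self) -> &'static [&'static str] {
        &[]
    }
}

fn check_ordering(scores: &[&dyn ScoreMeta]) -> Result<(), String> {
    let mut seen: HashSet<&str> = HashSet::new();
    for score in scores {
        for dep in score.depends_on() {
            if !seen.contains(dep) {
                return Err(format!("{} must run after {}", score.name(), dep));
            }
        }
        seen.insert(score.name());
    }
    Ok(())
}
```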
## 3. TUI as Primary Interface
### The problem
The TUI (`harmony_tui`) exists, built with ratatui, crossterm, and tui-logger, but it's underused. The CLI (`harmony_cli`) is the primary interface. During a multi-hour deployment, operators watch scrolling log output with no structure, no ability to drill into a specific Score's progress, and no overview of where they are in the pipeline.
### Why it matters
- Log output during interactive prompts corrupts the terminal
- No way to see "I'm on Stage 3 of 7, 2 hours elapsed, 3 Scores completed successfully"
- No way to inspect a Score's configuration or outcome without reading logs
- The pipeline feels like a black box during execution
### Design direction
The TUI should provide three views:
**Pipeline view** — the default. Shows the ordered list of Scores with their status:
```
OKD HA Cluster Deployment            [Stage 3/7 — 1h 42m elapsed]
──────────────────────────────────────────────────────────────────
✅ OKDIpxeScore                        2m 14s
✅ OKDSetup01InventoryScore            8m 03s
✅ OKDSetup02BootstrapScore           34m 21s
▶  OKDSetup03ControlPlaneScore        ... running
⏳ OKDSetupPersistNetworkBondScore
⏳ OKDSetup04WorkersScore
⏳ OKDSetup06InstallationReportScore
```
**Detail view** — press Enter on a Score to see its Outcome details, sub-score executions, and logs.
**Log view** — the current tui-logger panel, filtered to the selected Score.
The TUI already has the Score widget and log integration. What's missing is the pipeline-level orchestration view and the duration/status data — which the recently added `Score::interpret` timing now provides.
### Immediate enablers
The instrumentation event system (`HarmonyEvent`) already captures start/finish events with execution IDs. The TUI subscriber just needs to:
- Track the ordered list of Scores from the Maestro
- Update status as
- Update status as `InterpretExecutionStarted`/`Finished` events arrive
- Render the pipeline view using ratatui
This doesn't require architectural changes — it's a TUI feature built on existing infrastructure.
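The subscriber's core is a small fold from events to statuses. A sketch under assumed names — `PipelineEvent`, `ScoreStatus`, and `PipelineState` are illustrative stand-ins for the real `HarmonyEvent` variants, not Harmony's types:

```rust
use std::collections::HashMap;

// Hypothetical TUI-side status tracker: folds start/finish events into
// per-Score statuses that the pipeline view can render each frame.
#[derive(Debug, Clone, PartialEq)]
enum ScoreStatus {
    Pending,
    Running,
    Done,
}

enum PipelineEvent {
    ExecutionStarted { score: String },
    ExecutionFinished { score: String },
}

struct PipelineState {
    statuses: HashMap<String, ScoreStatus>,
}

impl PipelineState {
    /// Initialise every Score from the Maestro's ordered list as Pending.
    fn new(ordered_scores: &[&str]) -> Self {
        Self {
            statuses: ordered_scores
                .iter()
                .map(|s| (s.to_string(), ScoreStatus::Pending))
                .collect(),
        }
    }

    fn apply(&mut self, event: PipelineEvent) {
        match event {
            PipelineEvent::ExecutionStarted { score } => {
                self.statuses.insert(score, ScoreStatus::Running);
            }
            PipelineEvent::ExecutionFinished { score } => {
                self.statuses.insert(score, ScoreStatus::Done);
            }
        }
    }
}
```

The ratatui render loop then just maps each status to a glyph (✅/▶/⏳) and draws the list; no Maestro changes are needed.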