# Harmony Architecture — Three Open Challenges
Three problems that, if solved well, would make Harmony the most capable infrastructure automation framework in existence.
## 1. Topology Evolution During Deployment
### The problem
A bare-metal OKD deployment is a multi-hour process where the infrastructure's capabilities change as the deployment progresses:
- Phase 0: Network only → OPNsense reachable, Brocade reachable, no hosts
- Phase 1: Discovery → PXE boots work, hosts appear via mDNS, no k8s
- Phase 2: Bootstrap → openshift-install running, API partially available
- Phase 3: Control plane → k8s API available, operators converging, no workers
- Phase 4: Workers → Full cluster, apps can be deployed
- Phase 5: Day-2 → Monitoring, alerting, tenant onboarding
Today, `HAClusterTopology` implements all capability traits from the start. If a Score calls `k8s_client()` during Phase 0, it hits `DummyInfra`, which panics. The type system says "this is valid," but the runtime says "this will crash."
### Why it matters
- Scores that require k8s compile and register happily at Phase 0, then panic if accidentally executed too early
- The pipeline is ordered by convention (Stage 01 → 02 → 03 → ...) but nothing enforces that Stage 04 can't run before Stage 02
- Adding new capabilities (like "cluster has monitoring installed") requires editing the topology struct, not declaring the capability was acquired
### Design direction
The topology should evolve through phases where capabilities are acquired, not assumed. Two possible approaches:
#### A. Phase-gated topology (runtime)
The topology tracks which phase it's in. Capability methods check the phase before executing and return a meaningful error instead of panicking:
```rust
impl K8sclient for HAClusterTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        if self.phase < Phase::ControlPlaneReady {
            return Err(format!(
                "k8s API not available yet (current phase: {:?})",
                self.phase
            ));
        }
        // ... actual implementation
    }
}
```
Scores that fail due to phase mismatch get a clear error message, not a panic. The Maestro can validate phase requirements before executing a Score.
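The phase comparison above assumes a totally ordered phase type. A minimal, self-contained sketch of that idea — `Phase`, `PhasedTopology`, and `require` are hypothetical names, not Harmony's actual API:

```rust
// Hypothetical sketch of phase-gated capability checks. The derived
// Ord gives phases the ordering that `self.phase < needed` relies on.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Phase {
    NetworkOnly,
    Discovery,
    Bootstrap,
    ControlPlaneReady,
    WorkersReady,
    DayTwo,
}

pub struct PhasedTopology {
    pub phase: Phase,
}

impl PhasedTopology {
    /// Return a descriptive error instead of panicking when a
    /// capability is requested before its phase has been reached.
    pub fn require(&self, needed: Phase) -> Result<(), String> {
        if self.phase < needed {
            return Err(format!(
                "capability requires phase {:?}, but current phase is {:?}",
                needed, self.phase
            ));
        }
        Ok(())
    }
}
```

Each capability method would call `require(...)` as its first statement, so a too-early Score gets a readable error rather than a `DummyInfra` panic.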
#### B. Typestate topology (compile-time)
Use Rust's type system to make invalid phase transitions unrepresentable:
```rust
struct Topology<P: Phase> { ... }

impl Topology<NetworkReady> {
    fn bootstrap(self) -> Topology<Bootstrapping> { ... }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> { ... }
}

// Only ClusterReady implements K8sclient
impl K8sclient for Topology<ClusterReady> { ... }
```
This is the "correct" Rust approach but requires significant refactoring and may be too rigid for real deployments where phases overlap.
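For illustration, the sketch above can be made runnable with `PhantomData` as the phase marker. All names here are toy stand-ins; the point is that calling `k8s_client()` before `promote()` is a compile error, not a panic:

```rust
use std::marker::PhantomData;

// Toy typestate topology: phases are zero-sized marker types and each
// transition consumes the old phase, so stale handles cannot be reused.
struct NetworkReady;
struct Bootstrapping;
struct ClusterReady;

struct Topology<P> {
    _phase: PhantomData<P>,
}

impl Topology<NetworkReady> {
    fn new() -> Self {
        Topology { _phase: PhantomData }
    }
    fn bootstrap(self) -> Topology<Bootstrapping> {
        Topology { _phase: PhantomData }
    }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> {
        Topology { _phase: PhantomData }
    }
}

impl Topology<ClusterReady> {
    // Only the ClusterReady phase exposes this; earlier phases simply
    // do not have the method, so misuse fails at compile time.
    fn k8s_client(&self) -> &'static str {
        "connected"
    }
}
```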
**Recommendation:** Start with (A) — runtime phase tracking. It's additive (no breaking changes), catches the DummyInfra panic problem immediately, and provides the data needed for (B) later.
## 2. Runtime Plan & Validation Phase
### The problem
Harmony validates Scores at compile time: if a Score requires `DhcpServer` + `TftpServer`, the topology must implement both traits or the program won't compile. This is powerful but insufficient.
What compile-time cannot check:
- Is the OPNsense API actually reachable right now?
- Does VLAN 100 already exist (so we can skip creating it)?
- Is there already a DHCP entry for this MAC address?
- Will this firewall rule conflict with an existing one?
- Is there enough disk space on the TFTP server for the boot images?
Today, these are discovered at execution time, deep inside an Interpret's `execute()` method. A failure at minute 45 of a deployment is expensive.
### Why it matters
- No way to preview what Harmony will do before it does it
- No way to detect conflicts or precondition failures early
- Operators must read logs to understand what happened — there's no structured "here's what I did" report
- Re-running a deployment is scary because you don't know what will be re-applied vs skipped
### Design direction
Add a `validate` phase to the Score/Interpret lifecycle:
```rust
#[async_trait]
pub trait Interpret<T>: Debug + Send {
    /// Check preconditions and return what this interpret WOULD do.
    /// Default implementation returns "will execute" (opt-in validation).
    async fn validate(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<ValidationReport, InterpretError> {
        Ok(ValidationReport::will_execute(self.get_name()))
    }

    /// Execute the interpret (existing method, unchanged).
    async fn execute(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<Outcome, InterpretError>;

    // ... existing methods
}
```
A `ValidationReport` would contain:
- Status: `WillCreate`, `WillUpdate`, `WillDelete`, `AlreadyApplied`, `Blocked(reason)`
- Details: human-readable description of planned changes
- Preconditions: list of checks performed and their results
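A possible shape for these types — a sketch, not Harmony's real definitions; the `WillExecute` variant is an assumption added as the default for interprets that opt out of validation:

```rust
// Hypothetical ValidationReport shape. Variant names follow the list
// above; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub enum ValidationStatus {
    WillExecute, // assumed default when no validation logic exists
    WillCreate,
    WillUpdate,
    WillDelete,
    AlreadyApplied,
    Blocked(String),
}

#[derive(Debug, Clone)]
pub struct ValidationReport {
    pub interpret_name: String,
    pub status: ValidationStatus,
    /// Human-readable description of planned changes.
    pub details: Vec<String>,
    /// (check name, passed?) for each precondition examined.
    pub preconditions: Vec<(String, bool)>,
}

impl ValidationReport {
    /// Default report for interprets that don't implement validate().
    pub fn will_execute(name: String) -> Self {
        Self {
            interpret_name: name,
            status: ValidationStatus::WillExecute,
            details: Vec::new(),
            preconditions: Vec::new(),
        }
    }
}
```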
The Maestro would run validation for all registered Scores before executing any of them, producing a plan that the operator reviews.
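The "validate everything before executing anything" flow could look like the following self-contained sketch, where `Plannable` and `build_plan` are simplified stand-ins for the real Interpret trait and Maestro:

```rust
// Illustrative plan-before-execute loop: any blocked Score aborts
// planning before anything runs, and a successful pass yields a
// reviewable list of planned changes.
trait Plannable {
    fn name(&self) -> String;
    /// Ok(description of planned change) or Err(blocking reason).
    fn validate(&self) -> Result<String, String>;
}

fn build_plan(scores: &[Box<dyn Plannable>]) -> Result<Vec<String>, String> {
    let mut plan = Vec::new();
    for score in scores {
        match score.validate() {
            Ok(line) => plan.push(format!("{}: {}", score.name(), line)),
            Err(reason) => {
                return Err(format!("{} blocked: {}", score.name(), reason))
            }
        }
    }
    Ok(plan)
}
```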
This is opt-in: Scores that don't implement `validate()` get a default "will execute" report. Over time, each Score adds validation logic. The OPNsense Scores are ideal first candidates, since they can query current state via the API.
### Relationship to state
This approach does not require a state file. Validation queries the infrastructure directly — the same philosophy Harmony already follows. The "plan" is computed fresh every time by asking the infrastructure what exists right now.
### Concrete use case: WebGuiConfigScore → LoadBalancerScore
`LoadBalancerScore` configures HAProxy to bind on port 443. But OPNsense's webgui defaults to port 443 — creating a port conflict. `WebGuiConfigScore` moves the webgui to port 9443 first.
Today this is solved by ordering convention: `WebGuiConfigScore` is registered before `LoadBalancerScore` in the Score list. If someone reorders them, HAProxy silently fails to bind.
This is the simplest example of an implicit Score dependency that the current system cannot express or enforce. The `score_with_dep.rs` sketch explores declaring these dependencies at the type level, and Challenge #1 (phase-gated topology) would also help: a topology in the "webgui on 443" phase could reject `LoadBalancerScore` at validation time.
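In the spirit of the score_with_dep.rs design note, a minimal sketch of declared dependencies plus an up-front ordering check — trait and function names here are hypothetical, not the sketch's actual API:

```rust
use std::collections::HashSet;

// Each Score names the Scores that must run before it; the runner
// rejects a mis-ordered list before execution instead of letting
// HAProxy silently fail to bind at minute 45.
trait ScoreMeta {
    fn name(&self) -> &'static str;
    fn depends_on(&self) -> &'static [&'static str] {
        &[]
    }
}

fn check_ordering(scores: &[&dyn ScoreMeta]) -> Result<(), String> {
    let mut seen: HashSet<&str> = HashSet::new();
    for score in scores {
        for dep in score.depends_on() {
            if !seen.contains(dep) {
                return Err(format!("{} must run after {}", score.name(), dep));
            }
        }
        seen.insert(score.name());
    }
    Ok(())
}
```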
## 3. TUI as Primary Interface
### The problem
The TUI (`harmony_tui`) exists, built with ratatui, crossterm, and tui-logger, but it's underused. The CLI (`harmony_cli`) is the primary interface. During a multi-hour deployment, operators watch scrolling log output with no structure, no ability to drill into a specific Score's progress, and no overview of where they are in the pipeline.
### Why it matters
- Log output during interactive prompts corrupts the terminal
- No way to see "I'm on Stage 3 of 7, 2 hours elapsed, 3 Scores completed successfully"
- No way to inspect a Score's configuration or outcome without reading logs
- The pipeline feels like a black box during execution
### Design direction
The TUI should provide three views:
**Pipeline view** — the default. Shows the ordered list of Scores with their status:
```
OKD HA Cluster Deployment            [Stage 3/7 — 1h 42m elapsed]
──────────────────────────────────────────────────────────────────
✅ OKDIpxeScore                        2m 14s
✅ OKDSetup01InventoryScore            8m 03s
✅ OKDSetup02BootstrapScore           34m 21s
▶  OKDSetup03ControlPlaneScore        ... running
⏳ OKDSetupPersistNetworkBondScore
⏳ OKDSetup04WorkersScore
⏳ OKDSetup06InstallationReportScore
```
**Detail view** — press Enter on a Score to see its Outcome details, sub-score executions, and logs.
**Log view** — the current tui-logger panel, filtered to the selected Score.
The TUI already has the Score widget and log integration. What's missing is the pipeline-level orchestration view and the duration/status data — which the recently added `Score::interpret` timing now provides.
### Immediate enablers
The instrumentation event system (`HarmonyEvent`) already captures start/finish events with execution IDs. The TUI subscriber just needs to:
- Track the ordered list of Scores from the Maestro
- Update status as
- Update status as `InterpretExecutionStarted`/`Finished` events arrive
- Render the pipeline view using ratatui
This doesn't require architectural changes — it's a TUI feature built on existing infrastructure.
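The subscriber's core is a small fold from events to statuses. A sketch under assumed names — `PipelineEvent`, `ScoreStatus`, and `PipelineState` are illustrative stand-ins for the real `HarmonyEvent` variants, not Harmony's types:

```rust
use std::collections::HashMap;

// Hypothetical TUI-side status tracker: folds start/finish events into
// per-Score statuses that the pipeline view can render each frame.
#[derive(Debug, Clone, PartialEq)]
enum ScoreStatus {
    Pending,
    Running,
    Done,
}

enum PipelineEvent {
    ExecutionStarted { score: String },
    ExecutionFinished { score: String },
}

struct PipelineState {
    statuses: HashMap<String, ScoreStatus>,
}

impl PipelineState {
    /// Initialise every Score from the Maestro's ordered list as Pending.
    fn new(ordered_scores: &[&str]) -> Self {
        Self {
            statuses: ordered_scores
                .iter()
                .map(|s| (s.to_string(), ScoreStatus::Pending))
                .collect(),
        }
    }

    fn apply(&mut self, event: PipelineEvent) {
        match event {
            PipelineEvent::ExecutionStarted { score } => {
                self.statuses.insert(score, ScoreStatus::Running);
            }
            PipelineEvent::ExecutionFinished { score } => {
                self.statuses.insert(score, ScoreStatus::Done);
            }
        }
    }
}
```

The ratatui render loop then just maps each status to a glyph (✅/▶/⏳) and draws the list; no Maestro changes are needed.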