# Harmony Architecture — Three Open Challenges
Three problems that, if solved well, would make Harmony the most capable infrastructure automation framework in existence.
## 1. Topology Evolution During Deployment
### The problem
A bare-metal OKD deployment is a multi-hour process where the infrastructure's capabilities change as the deployment progresses:
```
Phase 0: Network only   → OPNsense reachable, Brocade reachable, no hosts
Phase 1: Discovery      → PXE boots work, hosts appear via mDNS, no k8s
Phase 2: Bootstrap      → openshift-install running, API partially available
Phase 3: Control plane  → k8s API available, operators converging, no workers
Phase 4: Workers        → Full cluster, apps can be deployed
Phase 5: Day-2          → Monitoring, alerting, tenant onboarding
```
Today, `HAClusterTopology` implements _all_ capability traits from the start. If a Score calls `k8s_client()` during Phase 0, it hits `DummyInfra` which panics. The type system says "this is valid" but the runtime says "this will crash."
### Why it matters
- Scores that require k8s compile and register happily at Phase 0, then panic if accidentally executed too early
- The pipeline is ordered by convention (Stage 01 → 02 → 03 → ...), but nothing enforces that Stage 04 can't run before Stage 02
- Adding new capabilities (like "cluster has monitoring installed") requires editing the topology struct rather than declaring that the capability was acquired
### Design direction
The topology should evolve through **phases** where capabilities are _acquired_, not assumed. Two possible approaches:
**A. Phase-gated topology (runtime)**
The topology tracks which phase it's in. Capability methods check the phase before executing and return a meaningful error instead of panicking:
```rust
impl K8sclient for HAClusterTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        if self.phase < Phase::ControlPlaneReady {
            return Err(format!(
                "k8s API not available yet (current phase: {:?})",
                self.phase
            ));
        }
        // ... actual implementation
    }
}
```
Scores that fail due to phase mismatch get a clear error message, not a panic. The Maestro can validate phase requirements before executing a Score.
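For the `self.phase < Phase::ControlPlaneReady` comparison above to work, `Phase` needs a total ordering. A minimal sketch, where the variant names are assumptions derived from the phase list at the top of this document, not existing Harmony types:

```rust
// Sketch only: deriving PartialOrd/Ord on a fieldless enum orders the
// variants by declaration order, which matches the deployment timeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Phase {
    NetworkOnly,       // Phase 0
    Discovery,         // Phase 1
    Bootstrap,         // Phase 2
    ControlPlaneReady, // Phase 3
    WorkersReady,      // Phase 4
    DayTwo,            // Phase 5
}

fn main() {
    // A capability gate rejects calls made before the required phase.
    let current = Phase::Bootstrap;
    assert!(current < Phase::ControlPlaneReady);
    assert!(Phase::DayTwo >= Phase::ControlPlaneReady);
    println!("phase ordering ok");
}
```

Because the ordering follows declaration order, inserting a new phase in the right place is enough to update every gate.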
**B. Typestate topology (compile-time)**
Use Rust's type system to make invalid phase transitions unrepresentable:
```rust
struct Topology<P: Phase> { ... }

impl Topology<NetworkReady> {
    fn bootstrap(self) -> Topology<Bootstrapping> { ... }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> { ... }
}

// Only ClusterReady implements K8sclient
impl K8sclient for Topology<ClusterReady> { ... }
```
This is the "correct" Rust approach but requires significant refactoring and may be too rigid for real deployments where phases overlap.
**Recommendation**: Start with (A) — runtime phase tracking. It's additive (no breaking changes), catches the DummyInfra panic problem immediately, and provides the data needed for (B) later.
---
## 2. Runtime Plan & Validation Phase
### The problem
Harmony validates Scores at compile time: if a Score requires `DhcpServer + TftpServer`, the topology must implement both traits or the program won't compile. This is powerful but insufficient.
What compile-time _cannot_ check:

- Is the OPNsense API actually reachable right now?
- Does VLAN 100 already exist (so we can skip creating it)?
- Is there already a DHCP entry for this MAC address?
- Will this firewall rule conflict with an existing one?
- Is there enough disk space on the TFTP server for the boot images?
Today, these are discovered at execution time, deep inside an Interpret's `execute()` method. A failure at minute 45 of a deployment is expensive.
### Why it matters
- No way to preview what Harmony will do before it does it
- No way to detect conflicts or precondition failures early
- Operators must read logs to understand what happened — there's no structured "here's what I did" report
- Re-running a deployment is scary because you don't know what will be re-applied vs skipped
### Design direction
Add a **validate** phase to the Score/Interpret lifecycle:
```rust
#[async_trait]
pub trait Interpret<T>: Debug + Send {
    /// Check preconditions and return what this interpret WOULD do.
    /// Default implementation returns "will execute" (opt-in validation).
    async fn validate(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<ValidationReport, InterpretError> {
        Ok(ValidationReport::will_execute(self.get_name()))
    }

    /// Execute the interpret (existing method, unchanged).
    async fn execute(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<Outcome, InterpretError>;

    // ... existing methods
}
```
A `ValidationReport` would contain:

- **Status**: `WillCreate`, `WillUpdate`, `WillDelete`, `AlreadyApplied`, `Blocked(reason)`
- **Details**: human-readable description of planned changes
- **Preconditions**: list of checks performed and their results
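A minimal sketch of that shape. The struct and variant names are assumptions based on the list above, not existing Harmony types, and mapping the default "will execute" report onto `WillCreate` is a simplification:

```rust
// Hypothetical report types; only the field list above is from the design.
#[derive(Debug, PartialEq)]
enum ValidationStatus {
    WillCreate,
    WillUpdate,
    WillDelete,
    AlreadyApplied,
    Blocked(String),
}

#[derive(Debug)]
struct ValidationReport {
    name: String,
    status: ValidationStatus,
    details: String,
    /// (description of the check, whether it passed)
    preconditions: Vec<(String, bool)>,
}

impl ValidationReport {
    /// The opt-in default: "this interpret will execute".
    fn will_execute(name: impl Into<String>) -> Self {
        Self {
            name: name.into(),
            status: ValidationStatus::WillCreate,
            details: "will execute".into(),
            preconditions: Vec::new(),
        }
    }
}

fn main() {
    let report = ValidationReport::will_execute("VlanScore");
    assert_eq!(report.status, ValidationStatus::WillCreate);
    println!("{}: {}", report.name, report.details);
}
```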
The Maestro would run validation for all registered Scores before executing any of them, producing a plan that the operator reviews.
This is opt-in: Scores that don't implement `validate()` get a default "will execute" report. Over time, each Score adds validation logic. The OPNsense Scores are ideal first candidates since they can query current state via the API.
### Relationship to state
This approach does _not_ require a state file. Validation queries the infrastructure directly — the same philosophy Harmony already follows. The "plan" is computed fresh every time by asking the infrastructure what exists right now.
### Concrete use case: WebGuiConfigScore → LoadBalancerScore
`LoadBalancerScore` configures HAProxy to bind on port 443. But OPNsense's webgui defaults to port 443 — creating a port conflict. `WebGuiConfigScore` moves the webgui to 9443 first.
Today this is solved by ordering convention: `WebGuiConfigScore` is registered before `LoadBalancerScore` in the Score list. If someone reorders them, HAProxy silently fails to bind.
This is the simplest example of an implicit Score dependency that the current system cannot express or enforce. The `score_with_dep.rs` sketch explores declaring these dependencies at the type level, and Challenge #1 (phase-gated topology) would also help — a topology in "webgui on 443" phase could reject `LoadBalancerScore` at validation time.
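With a validate phase in place, this conflict could surface as a blocked report before execution rather than a silent bind failure. An illustrative sketch only: `webgui_port()` stands in for a hypothetical probe of OPNsense state, and the report is reduced to a string result:

```rust
// Hypothetical probe of the current OPNsense webgui listen port.
// In reality this would query the firewall via its API or SSH.
fn webgui_port() -> u16 {
    443 // pretend WebGuiConfigScore has not run yet
}

// Simplified validate(): refuse to plan HAProxy while 443 is taken.
fn validate_load_balancer() -> Result<String, String> {
    match webgui_port() {
        443 => Err(
            "Blocked: OPNsense webgui still binds 443; \
             run WebGuiConfigScore first"
                .to_string(),
        ),
        _ => Ok("WillCreate: HAProxy frontend on port 443".to_string()),
    }
}

fn main() {
    // The plan phase reports the conflict instead of failing at minute 45.
    let result = validate_load_balancer();
    assert!(result.is_err());
    println!("{}", result.unwrap_err());
}
```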
---
## 3. TUI as Primary Interface
### The problem
The TUI (`harmony_tui`) exists with ratatui, crossterm, and tui-logger, but it's underused. The CLI (`harmony_cli`) is the primary interface. During a multi-hour deployment, operators watch scrolling log output with no structure, no ability to drill into a specific Score's progress, and no overview of where they are in the pipeline.
### Why it matters
- Log output during interactive prompts corrupts the terminal
- No way to see "I'm on Stage 3 of 7, 2 hours elapsed, 3 Scores completed successfully"
- No way to inspect a Score's configuration or outcome without reading logs
- The pipeline feels like a black box during execution
### Design direction
The TUI should provide three views:
**Pipeline view** — the default. Shows the ordered list of Scores with their status:
```
OKD HA Cluster Deployment              [Stage 3/7 — 1h 42m elapsed]
──────────────────────────────────────────────────────────────────
✅ OKDIpxeScore                        2m 14s
✅ OKDSetup01InventoryScore            8m 03s
✅ OKDSetup02BootstrapScore           34m 21s
▶  OKDSetup03ControlPlaneScore        ... running
⏳ OKDSetupPersistNetworkBondScore
⏳ OKDSetup04WorkersScore
⏳ OKDSetup06InstallationReportScore
```
**Detail view** — press Enter on a Score to see its Outcome details, sub-score executions, and logs.
**Log view** — the current tui-logger panel, filtered to the selected Score.
The TUI already has the Score widget and log integration. What's missing is the pipeline-level orchestration view and the duration/status data — which the recently added `Score::interpret` timing now provides.
### Immediate enablers
The instrumentation event system (`HarmonyEvent`) already captures start/finish with execution IDs. The TUI subscriber just needs to:

1. Track the ordered list of Scores from the Maestro
2. Update status as `InterpretExecutionStarted`/`Finished` events arrive
3. Render the pipeline view using ratatui
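Step 2 amounts to folding events into per-Score status. A small sketch of that state machine; the event and field names here are assumptions modeled on the names mentioned above, not the actual `HarmonyEvent` definition:

```rust
use std::collections::HashMap;

// Status shown in the pipeline view (✅ / ▶ / ⏳).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScoreStatus {
    Pending,
    Running,
    Done,
}

// Hypothetical event shape; real HarmonyEvent carries more data.
enum HarmonyEvent {
    InterpretExecutionStarted { execution_id: u64, score: String },
    InterpretExecutionFinished { execution_id: u64, score: String },
}

// Fold one event into the per-Score status map.
fn apply(statuses: &mut HashMap<String, ScoreStatus>, event: HarmonyEvent) {
    match event {
        HarmonyEvent::InterpretExecutionStarted { score, .. } => {
            statuses.insert(score, ScoreStatus::Running);
        }
        HarmonyEvent::InterpretExecutionFinished { score, .. } => {
            statuses.insert(score, ScoreStatus::Done);
        }
    }
}

fn main() {
    // Step 1: seed the map from the Maestro's ordered Score list.
    let mut statuses: HashMap<String, ScoreStatus> =
        [("OKDIpxeScore".to_string(), ScoreStatus::Pending)].into();

    // Step 2: events flip Pending → Running → Done.
    apply(
        &mut statuses,
        HarmonyEvent::InterpretExecutionStarted {
            execution_id: 1,
            score: "OKDIpxeScore".to_string(),
        },
    );
    assert_eq!(statuses["OKDIpxeScore"], ScoreStatus::Running);
}
```

Step 3 is then a pure render of this map in Score order, which keeps the ratatui code free of event-handling logic.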
This doesn't require architectural changes — it's a TUI feature built on existing infrastructure.