# Harmony Architecture — Three Open Challenges
Three problems that, if solved well, would make Harmony the most capable infrastructure automation framework in existence.
## 1. Topology Evolution During Deployment
### The problem
A bare-metal OKD deployment is a multi-hour process where the infrastructure's capabilities change as the deployment progresses:
```
Phase 0: Network only   → OPNsense reachable, Brocade reachable, no hosts
Phase 1: Discovery      → PXE boots work, hosts appear via mDNS, no k8s
Phase 2: Bootstrap      → openshift-install running, API partially available
Phase 3: Control plane  → k8s API available, operators converging, no workers
Phase 4: Workers        → Full cluster, apps can be deployed
Phase 5: Day-2          → Monitoring, alerting, tenant onboarding
```
Today, `HAClusterTopology` implements _all_ capability traits from the start. If a Score calls `k8s_client()` during Phase 0, it hits `DummyInfra` which panics. The type system says "this is valid" but the runtime says "this will crash."
### Why it matters
- Scores that require k8s compile and register happily at Phase 0, then panic if accidentally executed too early
- The pipeline is ordered by convention (Stage 01 → 02 → 03 → ...), but nothing enforces that Stage 04 can't run before Stage 02
- Adding new capabilities (like "cluster has monitoring installed") requires editing the topology struct rather than declaring that the capability was acquired
### Design direction
The topology should evolve through **phases** where capabilities are _acquired_, not assumed. Two possible approaches:
**A. Phase-gated topology (runtime)**
The topology tracks which phase it's in. Capability methods check the phase before executing and return a meaningful error instead of panicking:
```rust
impl K8sclient for HAClusterTopology {
    async fn k8s_client(&self) -> Result<Arc<K8sClient>, String> {
        if self.phase < Phase::ControlPlaneReady {
            return Err(format!(
                "k8s API not available yet (current phase: {:?})",
                self.phase
            ));
        }
        // ... actual implementation
    }
}
```
Scores that fail due to phase mismatch get a clear error message, not a panic. The Maestro can validate phase requirements before executing a Score.
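For the `self.phase < Phase::ControlPlaneReady` comparison above to work, `Phase` needs a total ordering. A minimal sketch, where the variant names are assumptions derived from the phase list at the top of this document, not existing Harmony types:

```rust
// Sketch only: deriving PartialOrd/Ord on a fieldless enum orders the
// variants by declaration order, which matches the deployment timeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Phase {
    NetworkOnly,       // Phase 0
    Discovery,         // Phase 1
    Bootstrap,         // Phase 2
    ControlPlaneReady, // Phase 3
    WorkersReady,      // Phase 4
    DayTwo,            // Phase 5
}

fn main() {
    // A capability gate rejects calls made before the required phase.
    let current = Phase::Bootstrap;
    assert!(current < Phase::ControlPlaneReady);
    assert!(Phase::DayTwo >= Phase::ControlPlaneReady);
    println!("phase ordering ok");
}
```

Because the ordering follows declaration order, inserting a new phase in the right place is enough to update every gate.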
**B. Typestate topology (compile-time)**
Use Rust's type system to make invalid phase transitions unrepresentable:
```rust
struct Topology<P: Phase> { ... }

impl Topology<NetworkReady> {
    fn bootstrap(self) -> Topology<Bootstrapping> { ... }
}

impl Topology<Bootstrapping> {
    fn promote(self) -> Topology<ClusterReady> { ... }
}

// Only ClusterReady implements K8sclient
impl K8sclient for Topology<ClusterReady> { ... }
```
This is the "correct" Rust approach but requires significant refactoring and may be too rigid for real deployments where phases overlap.
**Recommendation**: Start with (A) — runtime phase tracking. It's additive (no breaking changes), catches the DummyInfra panic problem immediately, and provides the data needed for (B) later.
---
## 2. Runtime Plan & Validation Phase
### The problem
Harmony validates Scores at compile time: if a Score requires `DhcpServer + TftpServer`, the topology must implement both traits or the program won't compile. This is powerful but insufficient.
What compile-time _cannot_ check:

- Is the OPNsense API actually reachable right now?
- Does VLAN 100 already exist (so we can skip creating it)?
- Is there already a DHCP entry for this MAC address?
- Will this firewall rule conflict with an existing one?
- Is there enough disk space on the TFTP server for the boot images?
Today, these are discovered at execution time, deep inside an Interpret's `execute()` method. A failure at minute 45 of a deployment is expensive.
### Why it matters
- No way to preview what Harmony will do before it does it
- No way to detect conflicts or precondition failures early
- Operators must read logs to understand what happened — there's no structured "here's what I did" report
- Re-running a deployment is scary because you don't know what will be re-applied vs skipped
### Design direction
Add a **validate** phase to the Score/Interpret lifecycle:
```rust
#[async_trait]
pub trait Interpret<T>: Debug + Send {
    /// Check preconditions and return what this interpret WOULD do.
    /// Default implementation returns "will execute" (opt-in validation).
    async fn validate(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<ValidationReport, InterpretError> {
        Ok(ValidationReport::will_execute(self.get_name()))
    }

    /// Execute the interpret (existing method, unchanged).
    async fn execute(
        &self,
        inventory: &Inventory,
        topology: &T,
    ) -> Result<Outcome, InterpretError>;

    // ... existing methods
}
```
A `ValidationReport` would contain:

- **Status**: `WillCreate`, `WillUpdate`, `WillDelete`, `AlreadyApplied`, `Blocked(reason)`
- **Details**: human-readable description of planned changes
- **Preconditions**: list of checks performed and their results
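A minimal sketch of that shape. The struct and variant names are assumptions based on the list above, not existing Harmony types, and mapping the default "will execute" report onto `WillCreate` is a simplification:

```rust
// Hypothetical report types; only the field list above is from the design.
#[derive(Debug, PartialEq)]
enum ValidationStatus {
    WillCreate,
    WillUpdate,
    WillDelete,
    AlreadyApplied,
    Blocked(String),
}

#[derive(Debug)]
struct ValidationReport {
    name: String,
    status: ValidationStatus,
    details: String,
    /// (description of the check, whether it passed)
    preconditions: Vec<(String, bool)>,
}

impl ValidationReport {
    /// The opt-in default: "this interpret will execute".
    fn will_execute(name: impl Into<String>) -> Self {
        Self {
            name: name.into(),
            status: ValidationStatus::WillCreate,
            details: "will execute".into(),
            preconditions: Vec::new(),
        }
    }
}

fn main() {
    let report = ValidationReport::will_execute("VlanScore");
    assert_eq!(report.status, ValidationStatus::WillCreate);
    println!("{}: {}", report.name, report.details);
}
```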
The Maestro would run validation for all registered Scores before executing any of them, producing a plan that the operator reviews.
This is opt-in: Scores that don't implement `validate()` get a default "will execute" report. Over time, each Score adds validation logic. The OPNsense Scores are ideal first candidates since they can query current state via the API.
### Relationship to state
This approach does _not_ require a state file. Validation queries the infrastructure directly — the same philosophy Harmony already follows. The "plan" is computed fresh every time by asking the infrastructure what exists right now.
### Concrete use case: WebGuiConfigScore → LoadBalancerScore
`LoadBalancerScore` configures HAProxy to bind on port 443. But OPNsense's webgui defaults to port 443 — creating a port conflict. `WebGuiConfigScore` moves the webgui to 9443 first.
Today this is solved by ordering convention: `WebGuiConfigScore` is registered before `LoadBalancerScore` in the Score list. If someone reorders them, HAProxy silently fails to bind.
This is the simplest example of an implicit Score dependency that the current system cannot express or enforce. The `score_with_dep.rs` sketch explores declaring these dependencies at the type level, and Challenge #1 (phase-gated topology) would also help — a topology in "webgui on 443" phase could reject `LoadBalancerScore` at validation time.
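With a validate phase in place, this conflict could surface as a blocked report before execution rather than a silent bind failure. An illustrative sketch only: `webgui_port()` stands in for a hypothetical probe of OPNsense state, and the report is reduced to a string result:

```rust
// Hypothetical probe of the current OPNsense webgui listen port.
// In reality this would query the firewall via its API or SSH.
fn webgui_port() -> u16 {
    443 // pretend WebGuiConfigScore has not run yet
}

// Simplified validate(): refuse to plan HAProxy while 443 is taken.
fn validate_load_balancer() -> Result<String, String> {
    match webgui_port() {
        443 => Err(
            "Blocked: OPNsense webgui still binds 443; \
             run WebGuiConfigScore first"
                .to_string(),
        ),
        _ => Ok("WillCreate: HAProxy frontend on port 443".to_string()),
    }
}

fn main() {
    // The plan phase reports the conflict instead of failing at minute 45.
    let result = validate_load_balancer();
    assert!(result.is_err());
    println!("{}", result.unwrap_err());
}
```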
---
## 3. TUI as Primary Interface
### The problem
The TUI (`harmony_tui`) exists with ratatui, crossterm, and tui-logger, but it's underused. The CLI (`harmony_cli`) is the primary interface. During a multi-hour deployment, operators watch scrolling log output with no structure, no ability to drill into a specific Score's progress, and no overview of where they are in the pipeline.
### Why it matters
- Log output during interactive prompts corrupts the terminal
- No way to see "I'm on Stage 3 of 7, 2 hours elapsed, 3 Scores completed successfully"
- No way to inspect a Score's configuration or outcome without reading logs
- The pipeline feels like a black box during execution
### Design direction
The TUI should provide three views:
**Pipeline view** — the default. Shows the ordered list of Scores with their status:
```
OKD HA Cluster Deployment              [Stage 3/7 — 1h 42m elapsed]
──────────────────────────────────────────────────────────────────
✅ OKDIpxeScore                        2m 14s
✅ OKDSetup01InventoryScore            8m 03s
✅ OKDSetup02BootstrapScore           34m 21s
▶  OKDSetup03ControlPlaneScore        ... running
⏳ OKDSetupPersistNetworkBondScore
⏳ OKDSetup04WorkersScore
⏳ OKDSetup06InstallationReportScore
```
**Detail view** — press Enter on a Score to see its Outcome details, sub-score executions, and logs.
**Log view** — the current tui-logger panel, filtered to the selected Score.
The TUI already has the Score widget and log integration. What's missing is the pipeline-level orchestration view and the duration/status data — which the recently added `Score::interpret` timing now provides.
### Immediate enablers
The instrumentation event system (`HarmonyEvent`) already captures start/finish with execution IDs. The TUI subscriber just needs to:

1. Track the ordered list of Scores from the Maestro
2. Update status as `InterpretExecutionStarted`/`Finished` events arrive
3. Render the pipeline view using ratatui
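Step 2 amounts to folding events into per-Score status. A small sketch of that state machine; the event and field names here are assumptions modeled on the names mentioned above, not the actual `HarmonyEvent` definition:

```rust
use std::collections::HashMap;

// Status shown in the pipeline view (✅ / ▶ / ⏳).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScoreStatus {
    Pending,
    Running,
    Done,
}

// Hypothetical event shape; real HarmonyEvent carries more data.
enum HarmonyEvent {
    InterpretExecutionStarted { execution_id: u64, score: String },
    InterpretExecutionFinished { execution_id: u64, score: String },
}

// Fold one event into the per-Score status map.
fn apply(statuses: &mut HashMap<String, ScoreStatus>, event: HarmonyEvent) {
    match event {
        HarmonyEvent::InterpretExecutionStarted { score, .. } => {
            statuses.insert(score, ScoreStatus::Running);
        }
        HarmonyEvent::InterpretExecutionFinished { score, .. } => {
            statuses.insert(score, ScoreStatus::Done);
        }
    }
}

fn main() {
    // Step 1: seed the map from the Maestro's ordered Score list.
    let mut statuses: HashMap<String, ScoreStatus> =
        [("OKDIpxeScore".to_string(), ScoreStatus::Pending)].into();

    // Step 2: events flip Pending → Running → Done.
    apply(
        &mut statuses,
        HarmonyEvent::InterpretExecutionStarted {
            execution_id: 1,
            score: "OKDIpxeScore".to_string(),
        },
    );
    assert_eq!(statuses["OKDIpxeScore"], ScoreStatus::Running);
}
```

Step 3 is then a pure render of this map in Score order, which keeps the ratatui code free of event-handling logic.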
This doesn't require architectural changes — it's a TUI feature built on existing infrastructure.