Compare commits


1 Commit

bfde5f58ed adr: Higher order topologies
These types of Topologies will orchestrate behavior in regular Topologies.

For example, a FailoverTopology is a Higher-Order Topology: it orchestrates its capabilities between a primary and a replica topology. A great use case for this is a database deployment. The FailoverTopology will deploy both instances, connect them, and then be able to execute the appropriate actions to promote the replica to primary and revert back to the original state.

Other use cases are ShardedTopology, DecentralizedTopology, etc.
2025-12-09 11:23:30 -05:00
3 changed files with 272 additions and 177 deletions

View File

@@ -0,0 +1,114 @@
# Architecture Decision Record: Higher-Order Topologies
**Initial Author:** Jean-Gabriel Gill-Couture
**Initial Date:** 2025-12-08
**Last Updated Date:** 2025-12-08
## Status
Implemented
## Context
Harmony models infrastructure as **Topologies** (deployment targets like `K8sAnywhereTopology`, `LinuxHostTopology`) implementing **Capabilities** (tech traits like `PostgreSQL`, `Docker`).
**Higher-Order Topologies** (e.g., `FailoverTopology<T>`) compose/orchestrate capabilities *across* multiple underlying topologies (e.g., primary+replica `T`).
A naive design requires a manual `impl Capability for HigherOrderTopology<T>` *per T, per capability*, causing:
- **Impl explosion**: N topologies × M capabilities = N×M boilerplate.
- **ISP violation**: Topologies forced to impl unrelated capabilities.
- **Maintenance hell**: A new topology needs impls for *all* orchestrated capabilities; a new capability needs impls for *all* topologies and higher-order types.
- **Barrier to extension**: Users can't easily add topologies without todos/panics.
This makes scaling Harmony impractical as the ecosystem grows (see the sketch below).
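For concreteness, a minimal sketch of this naive, per-T design, reusing the `PostgreSQL` capability and the topology types introduced later in this ADR and its companion example (bodies are illustrative only):
````rust
// Naive approach (rejected): one hand-written impl per concrete sub-topology.
// With N topologies and M capabilities this grows to N×M near-identical blocks.
#[async_trait]
impl PostgreSQL for FailoverTopology<K8sAnywhereTopology> {
    async fn deploy(&self, config: &PostgreSQLConfig) -> Result<String, String> {
        let primary = self.primary.deploy(config).await?;
        let replica = self.replica.deploy(config).await?;
        Ok(format!("{} | {}", primary, replica))
    }
    async fn get_replication_certs(&self, cluster_name: &str) -> Result<ReplicationCerts, String> {
        self.primary.get_replication_certs(cluster_name).await
    }
}

// ...and the same near-identical block repeated for every other sub-topology that
// provides PostgreSQL (e.g. a future AWSTopology), and again for every other capability.
````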
## Decision
Use **blanket trait impls** on higher-order topologies to *automatically* derive orchestration:
````rust
/// Higher-Order Topology: Orchestrates capabilities across sub-topologies.
pub struct FailoverTopology<T> {
    /// Primary sub-topology.
    primary: T,
    /// Replica sub-topology.
    replica: T,
}

/// Automatically provides PostgreSQL failover for *any* `T: PostgreSQL`.
/// Delegates to primary for queries; orchestrates deploy across both.
#[async_trait]
impl<T: PostgreSQL> PostgreSQL for FailoverTopology<T> {
    async fn deploy(&self, config: &PostgreSQLConfig) -> Result<String, String> {
        // Deploy primary; extract certs/endpoint;
        // deploy replica with pg_basebackup + TLS passthrough.
        // (Full implementation elided here; each step is logged. See `failover.rs`.)
        todo!()
    }

    // Delegate queries to primary.
    async fn get_replication_certs(&self, cluster_name: &str) -> Result<ReplicationCerts, String> {
        self.primary.get_replication_certs(cluster_name).await
    }

    // ...
}

/// Similarly for other capabilities.
#[async_trait]
impl<T: Docker> Docker for FailoverTopology<T> {
    // Failover Docker orchestration.
}
````
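For illustration, a slightly fuller sketch of the `deploy` orchestration elided above, using only the trait methods defined in this ADR; the trait bounds follow the companion example, and the cluster name and replica-seeding step are hypothetical simplifications:
````rust
#[async_trait]
impl<T: PostgreSQL + Send + Sync> PostgreSQL for FailoverTopology<T> {
    async fn deploy(&self, config: &PostgreSQLConfig) -> Result<String, String> {
        // 1. Deploy the primary and wait for it to report success.
        let primary = self.primary.deploy(config).await?;
        // 2. Fetch the replication certificates from the primary so the replica
        //    can connect over TLS (the cluster name here is a placeholder).
        let _certs = self.primary.get_replication_certs("pg-cluster").await?;
        // 3. Deploy the replica; a full implementation would seed it from the
        //    primary (e.g. pg_basebackup) using the certificates above.
        let replica = self.replica.deploy(config).await?;
        Ok(format!("primary: {}, replica: {}", primary, replica))
    }

    async fn get_replication_certs(&self, cluster_name: &str) -> Result<ReplicationCerts, String> {
        // Queries are delegated to the primary.
        self.primary.get_replication_certs(cluster_name).await
    }
}
````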
**Key properties:**
- **Auto-derivation**: `Failover<K8sAnywhere>` gets `PostgreSQL` iff `K8sAnywhere: PostgreSQL`.
- **No boilerplate**: One blanket impl per capability *per higher-order type*.
## Rationale
- **Composition via generics**: Rust trait solver auto-selects impls; zero runtime cost.
- **Compile-time safety**: Missing `T: Capability` → compile error (no panics).
- **Scalable**: O(capabilities) impls per higher-order; new `T` auto-works.
- **ISP-respecting**: Capabilities only surface if sub-topology provides.
- **Centralized logic**: Orchestration (e.g., cert propagation) in one place.
**Example usage:**
````rust
// ✅ Works: K8sAnywhere: PostgreSQL → Failover provides failover PG
let pg_failover: FailoverTopology<K8sAnywhereTopology> = ...;
pg_failover.deploy(&config).await;
// ✅ Works: LinuxHost: Docker → Failover provides failover Docker
let docker_failover: FailoverTopology<LinuxHostTopology> = ...;
docker_failover.deploy_docker(...).await;
// ❌ Compile fail: K8sAnywhere !: Docker
let invalid: FailoverTopology<K8sAnywhereTopology>;
invalid.deploy_docker(...); // `T: Docker` bound unsatisfied
````
## Consequences
**Pros:**
- **Extensible**: New topology `AWSTopology: PostgreSQL` → instant `Failover<AWSTopology>: PostgreSQL`.
- **Lean**: No useless impls (e.g., no `K8sAnywhere: Docker`).
- **Observable**: Logs trace every step.
**Cons:**
- **Monomorphization**: Generics generate code per T (mitigated: few Ts).
- **Delegation opacity**: Relies on rustdoc/logs for internals.
## Alternatives considered
| Approach | Pros | Cons |
|----------|------|------|
| **Manual per-T impls**<br>`impl PG for Failover<K8s> {..}`<br>`impl PG for Failover<Linux> {..}` | Explicit control | N×M explosion; violates ISP; hard to extend. |
| **Dynamic trait objects**<br>`Box<dyn AnyCapability>` | Runtime flex | Perf hit; type erasure; error-prone dispatch. |
| **Mega-topology trait**<br>All-in-one `OrchestratedTopology` | Simple wiring | Monolithic; poor composition. |
| **Registry dispatch**<br>Runtime capability lookup | Decoupled | Complex; no compile safety; perf/debug overhead. |
**Selected**: Blanket impls leverage Rust generics for safe, zero-cost composition.
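For contrast, a minimal sketch of the rejected trait-object style (the `DynFailoverTopology` type and `deploy_pg` method are hypothetical): capability presence becomes a runtime question instead of a trait bound, and every call pays dynamic dispatch.
````rust
// Hypothetical sketch of the dynamic alternative: capabilities stored as
// boxed trait objects, so "does this topology support PostgreSQL?" is only
// answered at runtime (Option/None) rather than at compile time.
pub struct DynFailoverTopology {
    primary_pg: Option<Box<dyn PostgreSQL + Send + Sync>>,
    replica_pg: Option<Box<dyn PostgreSQL + Send + Sync>>,
}

impl DynFailoverTopology {
    pub async fn deploy_pg(&self, config: &PostgreSQLConfig) -> Result<String, String> {
        // Missing capabilities surface as runtime errors, not compile errors.
        let primary = self.primary_pg.as_ref().ok_or("primary lacks PostgreSQL")?;
        let replica = self.replica_pg.as_ref().ok_or("replica lacks PostgreSQL")?;
        let p = primary.deploy(config).await?;
        let r = replica.deploy(config).await?;
        Ok(format!("{} | {}", p, r))
    }
}
````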
## Additional Notes
- Applies to `MultisiteTopology<T>`, `ShardedTopology<T>`, etc.; see the sketch below.
- `FailoverTopology` in `failover.rs` is the first implementation.
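A hypothetical sketch of how the same pattern could extend to a `ShardedTopology<T>` (this type does not exist yet; shard routing is omitted):
````rust
/// Hypothetical Higher-Order Topology: spreads a capability across N shards.
pub struct ShardedTopology<T> {
    pub shards: Vec<T>,
}

/// Same blanket-impl pattern: `ShardedTopology<T>` is PostgreSQL iff `T` is.
#[async_trait]
impl<T: PostgreSQL + Send + Sync> PostgreSQL for ShardedTopology<T> {
    async fn deploy(&self, config: &PostgreSQLConfig) -> Result<String, String> {
        // Deploy every shard; a real implementation would also wire up
        // routing/partitioning between them.
        let mut results = Vec::new();
        for shard in &self.shards {
            results.push(shard.deploy(config).await?);
        }
        Ok(results.join(" | "))
    }

    async fn get_replication_certs(&self, cluster_name: &str) -> Result<ReplicationCerts, String> {
        // Hypothetical choice: delegate to the first shard.
        self.shards
            .first()
            .ok_or_else(|| "ShardedTopology has no shards".to_string())?
            .get_replication_certs(cluster_name)
            .await
    }
}
````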

View File

@@ -0,0 +1,153 @@
//! Example of Higher-Order Topologies in Harmony.
//! Demonstrates how `FailoverTopology<T>` automatically provides failover for *any* capability
//! supported by a sub-topology `T` via blanket trait impls.
//!
//! Key insight: No manual impls per T or capability -- scales effortlessly.
//! Users can:
//! - Write new `Topology` (impl capabilities on a struct).
//! - Compose with `FailoverTopology` (gets capabilities if T has them).
//! - Compile fails if capability missing (safety).
use async_trait::async_trait;
use tokio;
/// Capability trait: Deploy and manage PostgreSQL.
#[async_trait]
pub trait PostgreSQL {
async fn deploy(&self, config: &PostgreSQLConfig) -> Result<String, String>;
async fn get_replication_certs(&self, cluster_name: &str) -> Result<ReplicationCerts, String>;
}
/// Capability trait: Deploy Docker.
#[async_trait]
pub trait Docker {
async fn deploy_docker(&self) -> Result<String, String>;
}
/// Configuration for PostgreSQL deployments.
#[derive(Clone)]
pub struct PostgreSQLConfig;
/// Replication certificates.
#[derive(Clone)]
pub struct ReplicationCerts;
/// Concrete topology: Kubernetes Anywhere (supports PostgreSQL).
#[derive(Clone)]
pub struct K8sAnywhereTopology;
#[async_trait]
impl PostgreSQL for K8sAnywhereTopology {
async fn deploy(&self, _config: &PostgreSQLConfig) -> Result<String, String> {
// Real impl: Use k8s helm chart, operator, etc.
Ok("K8sAnywhere PostgreSQL deployed".to_string())
}
async fn get_replication_certs(&self, _cluster_name: &str) -> Result<ReplicationCerts, String> {
Ok(ReplicationCerts)
}
}
/// Concrete topology: Linux Host (supports Docker).
#[derive(Clone)]
pub struct LinuxHostTopology;
#[async_trait]
impl Docker for LinuxHostTopology {
async fn deploy_docker(&self) -> Result<String, String> {
// Real impl: Install/configure Docker on host.
Ok("LinuxHost Docker deployed".to_string())
}
}
/// Higher-Order Topology: Composes multiple sub-topologies (primary + replica).
/// Automatically derives *all* capabilities of `T` with failover orchestration.
///
/// - If `T: PostgreSQL`, then `FailoverTopology<T>: PostgreSQL` (blanket impl).
/// - Same for `Docker`, etc. No boilerplate!
/// - Compile-time safe: Missing `T: Capability` → error.
#[derive(Clone)]
pub struct FailoverTopology<T> {
/// Primary sub-topology.
pub primary: T,
/// Replica sub-topology.
pub replica: T,
}
/// Blanket impl: Failover PostgreSQL if T provides PostgreSQL.
/// Delegates reads to primary; deploys to both.
#[async_trait]
impl<T: PostgreSQL + Send + Sync + Clone> PostgreSQL for FailoverTopology<T> {
async fn deploy(&self, config: &PostgreSQLConfig) -> Result<String, String> {
// Orchestrate: Deploy primary first, then replica (e.g., via pg_basebackup).
let primary_result = self.primary.deploy(config).await?;
let replica_result = self.replica.deploy(config).await?;
Ok(format!("Failover PG deployed: {} | {}", primary_result, replica_result))
}
async fn get_replication_certs(&self, cluster_name: &str) -> Result<ReplicationCerts, String> {
// Delegate to primary (replica follows).
self.primary.get_replication_certs(cluster_name).await
}
}
/// Blanket impl: Failover Docker if T provides Docker.
#[async_trait]
impl<T: Docker + Send + Sync + Clone> Docker for FailoverTopology<T> {
async fn deploy_docker(&self) -> Result<String, String> {
// Orchestrate across primary + replica.
let primary_result = self.primary.deploy_docker().await?;
let replica_result = self.replica.deploy_docker().await?;
Ok(format!("Failover Docker deployed: {} | {}", primary_result, replica_result))
}
}
#[tokio::main]
async fn main() {
let config = PostgreSQLConfig;
println!("=== ✅ PostgreSQL Failover (K8sAnywhere supports PG) ===");
let pg_failover = FailoverTopology {
primary: K8sAnywhereTopology,
replica: K8sAnywhereTopology,
};
let result = pg_failover.deploy(&config).await.unwrap();
println!("Result: {}", result);
println!("\n=== ✅ Docker Failover (LinuxHost supports Docker) ===");
let docker_failover = FailoverTopology {
primary: LinuxHostTopology,
replica: LinuxHostTopology,
};
let result = docker_failover.deploy_docker().await.unwrap();
println!("Result: {}", result);
println!("\n=== ❌ Would fail to compile (K8sAnywhere !: Docker) ===");
// let invalid = FailoverTopology {
// primary: K8sAnywhereTopology,
// replica: K8sAnywhereTopology,
// };
// invalid.deploy_docker().await.unwrap(); // Error: `K8sAnywhereTopology: Docker` not satisfied!
// Very clear error message:
// error[E0599]: the method `deploy_docker` exists for struct `FailoverTopology<K8sAnywhereTopology>`, but its trait bounds were not satisfied
// --> src/main.rs:90:9
// |
// 4 | pub struct FailoverTopology<T> {
// | ------------------------------ method `deploy_docker` not found for this struct because it doesn't satisfy `FailoverTopology<K8sAnywhereTopology>: Docker`
// ...
// 37 | struct K8sAnywhereTopology;
// | -------------------------- doesn't satisfy `K8sAnywhereTopology: Docker`
// ...
// 90 | invalid.deploy_docker(); // `T: Docker` bound unsatisfied
// | ^^^^^^^^^^^^^ method cannot be called on `FailoverTopology<K8sAnywhereTopology>` due to unsatisfied trait bounds
// |
// note: trait bound `K8sAnywhereTopology: Docker` was not satisfied
// --> src/main.rs:61:9
// |
// 61 | impl<T: Docker + Send + Sync> Docker for FailoverTopology<T> {
// | ^^^^^^ ------ -------------------
// | |
// | unsatisfied trait bound introduced here
// note: the trait `Docker` must be implemented
}

View File

@@ -1,21 +1,15 @@
use async_trait::async_trait;
use derive_new::new;
use harmony_types::id::Id;
use log::{debug, info};
use log::info;
use serde::Serialize;
use crate::{
data::Version,
hardware::PhysicalHost,
infra::inventory::InventoryRepositoryFactory,
interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome},
inventory::{HostRole, Inventory},
modules::{
dhcp::DhcpHostBindingScore, http::IPxeMacBootFileScore,
inventory::DiscoverHostForRoleScore, okd::templates::BootstrapIpxeTpl,
},
inventory::Inventory,
score::Score,
topology::{HAClusterTopology, HostBinding},
topology::HAClusterTopology,
};
// -------------------------------------------------------------------------------------------------
@@ -58,159 +52,6 @@ impl OKDSetup04WorkersInterpret {
info!("[Workers] Rendering per-MAC PXE for workers and rebooting");
Ok(())
}
/// Ensures that three physical hosts are discovered and available for the ControlPlane role.
/// It will trigger discovery if not enough hosts are found.
async fn get_nodes(
&self,
inventory: &Inventory,
topology: &HAClusterTopology,
) -> Result<Vec<PhysicalHost>, InterpretError> {
const REQUIRED_HOSTS: usize = 2;
let repo = InventoryRepositoryFactory::build().await?;
let mut control_plane_hosts = repo.get_host_for_role(&HostRole::Worker).await?;
while control_plane_hosts.len() < REQUIRED_HOSTS {
info!(
"Discovery of {} control plane hosts in progress, current number {}",
REQUIRED_HOSTS,
control_plane_hosts.len()
);
// This score triggers the discovery agent for a specific role.
DiscoverHostForRoleScore {
role: HostRole::Worker,
}
.interpret(inventory, topology)
.await?;
control_plane_hosts = repo.get_host_for_role(&HostRole::Worker).await?;
}
if control_plane_hosts.len() < REQUIRED_HOSTS {
Err(InterpretError::new(format!(
"OKD Requires at least {} control plane hosts, but only found {}. Cannot proceed.",
REQUIRED_HOSTS,
control_plane_hosts.len()
)))
} else {
// Take exactly the number of required hosts to ensure consistency.
Ok(control_plane_hosts
.into_iter()
.take(REQUIRED_HOSTS)
.collect())
}
}
/// Configures DHCP host bindings for all control plane nodes.
async fn configure_host_binding(
&self,
inventory: &Inventory,
topology: &HAClusterTopology,
nodes: &Vec<PhysicalHost>,
) -> Result<(), InterpretError> {
info!("[Worker] Configuring host bindings for worker nodes.");
// Ensure the topology definition matches the number of physical nodes found.
if topology.control_plane.len() != nodes.len() {
return Err(InterpretError::new(format!(
"Mismatch between logical control plane hosts defined in topology ({}) and physical nodes found ({}).",
topology.control_plane.len(),
nodes.len()
)));
}
// Create a binding for each physical host to its corresponding logical host.
let bindings: Vec<HostBinding> = topology
.control_plane
.iter()
.zip(nodes.iter())
.map(|(logical_host, physical_host)| {
info!(
"Creating binding: Logical Host '{}' -> Physical Host ID '{}'",
logical_host.name, physical_host.id
);
HostBinding {
logical_host: logical_host.clone(),
physical_host: physical_host.clone(),
}
})
.collect();
DhcpHostBindingScore {
host_binding: bindings,
domain: Some(topology.domain_name.clone()),
}
.interpret(inventory, topology)
.await?;
Ok(())
}
/// Renders and deploys a per-MAC iPXE boot file for each control plane node.
async fn configure_ipxe(
&self,
inventory: &Inventory,
topology: &HAClusterTopology,
nodes: &Vec<PhysicalHost>,
) -> Result<(), InterpretError> {
info!("[Worker] Rendering per-MAC iPXE configurations.");
// The iPXE script content is the same for all control plane nodes,
// pointing to the 'master.ign' ignition file.
let content = BootstrapIpxeTpl {
http_ip: &topology.http_server.get_ip().to_string(),
scos_path: "scos",
ignition_http_path: "okd_ignition_files",
installation_device: "/dev/sda", // This might need to be configurable per-host in the future
ignition_file_name: "worker.ign", // Worker nodes use the worker ignition file
}
.to_string();
debug!("[Worker] iPXE content template:\n{content}");
// Create and apply an iPXE boot file for each node.
for node in nodes {
let mac_address = node.get_mac_address();
if mac_address.is_empty() {
return Err(InterpretError::new(format!(
"Physical host with ID '{}' has no MAC addresses defined.",
node.id
)));
}
info!(
"[Worker] Applying iPXE config for node ID '{}' with MACs: {:?}",
node.id, mac_address
);
IPxeMacBootFileScore {
mac_address,
content: content.clone(),
}
.interpret(inventory, topology)
.await?;
}
Ok(())
}
/// Prompts the user to reboot the target control plane nodes.
async fn reboot_targets(&self, nodes: &Vec<PhysicalHost>) -> Result<(), InterpretError> {
let node_ids: Vec<String> = nodes.iter().map(|n| n.id.to_string()).collect();
info!("[Worker] Requesting reboot for control plane nodes: {node_ids:?}",);
let confirmation = inquire::Confirm::new(
&format!("Please reboot the {} worker nodes ({}) to apply their PXE configuration. Press enter when ready.", nodes.len(), node_ids.join(", ")),
)
.prompt()
.map_err(|e| InterpretError::new(format!("User prompt failed: {e}")))?;
if !confirmation {
return Err(InterpretError::new(
"User aborted the operation.".to_string(),
));
}
Ok(())
}
}
#[async_trait]
@@ -233,23 +74,10 @@ impl Interpret<HAClusterTopology> for OKDSetup04WorkersInterpret {
async fn execute(
&self,
inventory: &Inventory,
topology: &HAClusterTopology,
_inventory: &Inventory,
_topology: &HAClusterTopology,
) -> Result<Outcome, InterpretError> {
self.render_and_reboot().await?;
// 1. Ensure we have 2 physical hosts for the worker nodes.
let nodes = self.get_nodes(inventory, topology).await?;
// 2. Create DHCP reservations for the worker nodes.
self.configure_host_binding(inventory, topology, &nodes)
.await?;
// 3. Create iPXE files for each worker node to boot from the worker ignition.
self.configure_ipxe(inventory, topology, &nodes).await?;
// 4. Reboot the nodes to start the OS installation.
self.reboot_targets(&nodes).await?;
Ok(Outcome::success("Workers provisioned".into()))
}
}