Compare commits
1 commit: feat/worke (c7cbd9eeac)

docs/modules/Multisite_PostgreSQL.md (new file, 105 lines)
@@ -0,0 +1,105 @@
# Design Document: Harmony PostgreSQL Module

**Status:** Draft
**Last Updated:** 2025-12-01
**Context:** Multi-site Data Replication & Orchestration

## 1. Overview

The Harmony PostgreSQL Module provides a high-level abstraction for deploying and managing high-availability PostgreSQL clusters across geographically distributed Kubernetes/OKD sites.

Instead of manually configuring complex replication slots, firewalls, and operator settings on each cluster, users define a single intent (a **Score**), and Harmony orchestrates the underlying infrastructure (the **Arrangement**) to establish a Primary-Replica architecture.

Currently, the implementation relies on the **CloudNativePG (CNPG)** operator as the backing engine.

## 2. Architecture

### 2.1 The Abstraction Model

Following **ADR 003 (Infrastructure Abstraction)**, Harmony separates the *intent* from the *implementation*.

1. **The Score (Intent):** The user defines a `MultisitePostgreSQL` resource. This describes *what* is needed (e.g., "A Postgres 15 cluster with 10GB storage, Primary on Site A, Replica on Site B").
2. **The Interpret (Action):** Harmony's `MultisitePostgreSQLInterpret` processes this Score and orchestrates the deployment on both sites to reach the state defined in the Score.
3. **The Capability (Implementation):** The PostgreSQL Capability is implemented by the `K8sTopology`; the Interpret can deploy it, configure it, and fetch information about it. The concrete implementation relies on the mature CloudNativePG operator to manage all the required Kubernetes resources.

### 2.2 Network Connectivity (TLS Passthrough)

One of the critical challenges in multi-site orchestration is secure connectivity between clusters that may have dynamic IPs or strict firewalls.

To solve this, we use **OKD/OpenShift Routes with TLS Passthrough**.

* **Mechanism:** The Primary site exposes a `Route` configured for `termination: passthrough`.
* **Routing:** The OpenShift HAProxy router inspects the **SNI (Server Name Indication)** field of the incoming TLS handshake to route traffic to the correct PostgreSQL Pod.
* **Security:** TLS is **not** terminated at the ingress router. The encrypted stream is passed directly to the PostgreSQL instance. Mutual TLS (mTLS) authentication is handled natively by CNPG between the Primary and Replica instances.
* **Dynamic IPs:** Because connections are established via DNS hostnames (the Route URL), this architecture is resilient to dynamic IP changes at the Primary site.

#### Traffic Flow Diagram

```text
[ Site B: Replica ]                        [ Site A: Primary ]
        |                                           |
 (CNPG Instance)  --[Encrypted TCP]-->   (OKD HAProxy Router)
        |               (Port 443)                  |
        |                                           |
        |                                   [SNI Inspection]
        |                                           |
        |                                           v
        |                               (PostgreSQL Primary Pod)
        |                                      (Port 5432)
```
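
For illustration, the passthrough Route exposed on the Primary site might look like the sketch below. The Route name, the Service name `finance-db-rw` (CNPG's read-write Service naming convention), and the port mapping are assumptions; Harmony generates the actual resource.

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: postgres-finance-db
  namespace: tenant-a
spec:
  host: postgres-finance-db.apps.site-paris.example.com
  to:
    kind: Service
    name: finance-db-rw        # CNPG read-write Service (assumed name)
  port:
    targetPort: 5432           # PostgreSQL port on the backing Service
  tls:
    termination: passthrough   # TLS is forwarded, not terminated, at the router
```

With `termination: passthrough`, the router only reads the SNI value to select a backend; the certificates stay entirely between the CNPG instances.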

## 3. Design Decisions

### Why CloudNativePG?

We selected CloudNativePG because it relies exclusively on standard Kubernetes primitives and uses the native PostgreSQL replication protocol (WAL shipping/streaming). This aligns with Harmony's goal of being "K8s Native."

### Why TLS Passthrough instead of VPN/NodePort?

* **NodePort:** Requires static IPs and opening non-standard ports on the firewall, which violates our security constraints.
* **VPN (e.g., WireGuard/Tailscale):** While secure, it introduces significant complexity (sidecars, key management) and external dependencies.
* **TLS Passthrough:** Leverages the existing Ingress/Router infrastructure already present in OKD. It requires zero additional software and respects multi-tenancy (Routes are namespaced).

### Configuration Philosophy (YAGNI)

The current design exposes a **generic configuration surface**. Users can configure standard parameters (storage size, CPU/memory requests, Postgres version).

**We explicitly do not expose advanced CNPG or PostgreSQL configurations at this stage.**

* **Reasoning:** We aim to keep the API surface small and manageable.
* **Future Path:** We plan to implement a "pass-through" mechanism to allow sending raw config maps or custom parameters to the underlying engine (CNPG) *only when a concrete use case arises*. Until then, we adhere to the **YAGNI (You Ain't Gonna Need It)** principle to avoid premature optimization and API bloat.

## 4. Usage Guide

To deploy a multi-site cluster, apply the `MultisitePostgreSQL` resource to the Harmony Control Plane.

### Example Manifest

```yaml
apiVersion: harmony.io/v1alpha1
kind: MultisitePostgreSQL
metadata:
  name: finance-db
  namespace: tenant-a
spec:
  version: "15"
  storage: "10Gi"
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"

  # Topology Definition
  topology:
    primary:
      site: "site-paris" # The name of the cluster in Harmony
    replicas:
      - site: "site-newyork"
```

### What happens next?

1. Harmony detects the CR.
2. **On Site Paris:** It deploys a CNPG Cluster (Primary) and creates a Passthrough Route `postgres-finance-db.apps.site-paris.example.com`.
3. **On Site New York:** It deploys a CNPG Cluster (Replica) configured with `externalClusters` pointing to the Paris Route (see the illustrative sketch after this list).
4. Data begins replicating immediately over the encrypted channel.
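
For illustration only, the replica-side CNPG `Cluster` described in step 3 might resemble the sketch below. The `streaming_replica` user and the secret names (`finance-db-replication`, `finance-db-ca`) follow CNPG conventions and are assumptions here; users never author this resource directly, Harmony derives it from the Score.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: finance-db
  namespace: tenant-a
spec:
  instances: 1
  storage:
    size: 10Gi

  # This cluster acts as a replica of the external "site-paris" cluster.
  replica:
    enabled: true
    source: site-paris

  bootstrap:
    pg_basebackup:
      source: site-paris

  externalClusters:
    - name: site-paris
      connectionParameters:
        host: postgres-finance-db.apps.site-paris.example.com
        port: "443"                   # traffic enters through the passthrough Route
        user: streaming_replica
        sslmode: verify-full
      sslKey:
        name: finance-db-replication  # client certificate propagated by Harmony (assumed name)
        key: tls.key
      sslCert:
        name: finance-db-replication
        key: tls.crt
      sslRootCert:
        name: finance-db-ca           # Primary CA propagated by Harmony (assumed name)
        key: ca.crt
```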

## 5. Troubleshooting

* **Connection Refused:** Ensure the Primary site's Route has been successfully admitted by the Ingress Controller (see the status excerpt below).
* **Certificate Errors:** CNPG manages mTLS automatically. If errors persist, ensure the CA secrets were correctly propagated by Harmony from the Primary to the Replica namespaces.
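
For reference, an admitted Route reports a condition of type `Admitted` in its status. The excerpt below is illustrative; the `routerName` value depends on the cluster's IngressController.

```yaml
status:
  ingress:
    - host: postgres-finance-db.apps.site-paris.example.com
      routerName: default          # depends on the cluster's ingress configuration
      conditions:
        - type: Admitted
          status: "True"
```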
@@ -1,21 +1,15 @@
use async_trait::async_trait;
use derive_new::new;
use harmony_types::id::Id;
use log::{debug, info};
use log::info;
use serde::Serialize;

use crate::{
    data::Version,
    hardware::PhysicalHost,
    infra::inventory::InventoryRepositoryFactory,
    interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome},
    inventory::{HostRole, Inventory},
    modules::{
        dhcp::DhcpHostBindingScore, http::IPxeMacBootFileScore,
        inventory::DiscoverHostForRoleScore, okd::templates::BootstrapIpxeTpl,
    },
    inventory::Inventory,
    score::Score,
    topology::{HAClusterTopology, HostBinding},
    topology::HAClusterTopology,
};

// -------------------------------------------------------------------------------------------------
@@ -58,159 +52,6 @@ impl OKDSetup04WorkersInterpret {
        info!("[Workers] Rendering per-MAC PXE for workers and rebooting");
        Ok(())
    }

    /// Ensures that the required number of physical hosts are discovered and available for the Worker role.
    /// It will trigger discovery if not enough hosts are found.
    async fn get_nodes(
        &self,
        inventory: &Inventory,
        topology: &HAClusterTopology,
    ) -> Result<Vec<PhysicalHost>, InterpretError> {
        const REQUIRED_HOSTS: usize = 2;
        let repo = InventoryRepositoryFactory::build().await?;
        let mut control_plane_hosts = repo.get_host_for_role(&HostRole::Worker).await?;

        while control_plane_hosts.len() < REQUIRED_HOSTS {
            info!(
                "Discovery of {} control plane hosts in progress, current number {}",
                REQUIRED_HOSTS,
                control_plane_hosts.len()
            );
            // This score triggers the discovery agent for a specific role.
            DiscoverHostForRoleScore {
                role: HostRole::Worker,
            }
            .interpret(inventory, topology)
            .await?;
            control_plane_hosts = repo.get_host_for_role(&HostRole::Worker).await?;
        }

        if control_plane_hosts.len() < REQUIRED_HOSTS {
            Err(InterpretError::new(format!(
                "OKD Requires at least {} control plane hosts, but only found {}. Cannot proceed.",
                REQUIRED_HOSTS,
                control_plane_hosts.len()
            )))
        } else {
            // Take exactly the number of required hosts to ensure consistency.
            Ok(control_plane_hosts
                .into_iter()
                .take(REQUIRED_HOSTS)
                .collect())
        }
    }

    /// Configures DHCP host bindings for all worker nodes.
    async fn configure_host_binding(
        &self,
        inventory: &Inventory,
        topology: &HAClusterTopology,
        nodes: &Vec<PhysicalHost>,
    ) -> Result<(), InterpretError> {
        info!("[Worker] Configuring host bindings for worker nodes.");

        // Ensure the topology definition matches the number of physical nodes found.
        if topology.control_plane.len() != nodes.len() {
            return Err(InterpretError::new(format!(
                "Mismatch between logical control plane hosts defined in topology ({}) and physical nodes found ({}).",
                topology.control_plane.len(),
                nodes.len()
            )));
        }

        // Create a binding for each physical host to its corresponding logical host.
        let bindings: Vec<HostBinding> = topology
            .control_plane
            .iter()
            .zip(nodes.iter())
            .map(|(logical_host, physical_host)| {
                info!(
                    "Creating binding: Logical Host '{}' -> Physical Host ID '{}'",
                    logical_host.name, physical_host.id
                );
                HostBinding {
                    logical_host: logical_host.clone(),
                    physical_host: physical_host.clone(),
                }
            })
            .collect();

        DhcpHostBindingScore {
            host_binding: bindings,
            domain: Some(topology.domain_name.clone()),
        }
        .interpret(inventory, topology)
        .await?;

        Ok(())
    }

    /// Renders and deploys a per-MAC iPXE boot file for each worker node.
    async fn configure_ipxe(
        &self,
        inventory: &Inventory,
        topology: &HAClusterTopology,
        nodes: &Vec<PhysicalHost>,
    ) -> Result<(), InterpretError> {
        info!("[Worker] Rendering per-MAC iPXE configurations.");

        // The iPXE script content is the same for all worker nodes,
        // pointing to the 'worker.ign' ignition file.
        let content = BootstrapIpxeTpl {
            http_ip: &topology.http_server.get_ip().to_string(),
            scos_path: "scos",
            ignition_http_path: "okd_ignition_files",
            installation_device: "/dev/sda", // This might need to be configurable per-host in the future
            ignition_file_name: "worker.ign", // Worker nodes use the worker ignition file
        }
        .to_string();

        debug!("[Worker] iPXE content template:\n{content}");

        // Create and apply an iPXE boot file for each node.
        for node in nodes {
            let mac_address = node.get_mac_address();
            if mac_address.is_empty() {
                return Err(InterpretError::new(format!(
                    "Physical host with ID '{}' has no MAC addresses defined.",
                    node.id
                )));
            }
            info!(
                "[Worker] Applying iPXE config for node ID '{}' with MACs: {:?}",
                node.id, mac_address
            );

            IPxeMacBootFileScore {
                mac_address,
                content: content.clone(),
            }
            .interpret(inventory, topology)
            .await?;
        }

        Ok(())
    }

    /// Prompts the user to reboot the target worker nodes.
    async fn reboot_targets(&self, nodes: &Vec<PhysicalHost>) -> Result<(), InterpretError> {
        let node_ids: Vec<String> = nodes.iter().map(|n| n.id.to_string()).collect();
        info!("[Worker] Requesting reboot for control plane nodes: {node_ids:?}",);

        let confirmation = inquire::Confirm::new(
            &format!("Please reboot the {} worker nodes ({}) to apply their PXE configuration. Press enter when ready.", nodes.len(), node_ids.join(", ")),
        )
        .prompt()
        .map_err(|e| InterpretError::new(format!("User prompt failed: {e}")))?;

        if !confirmation {
            return Err(InterpretError::new(
                "User aborted the operation.".to_string(),
            ));
        }

        Ok(())
    }
}

#[async_trait]
@@ -233,23 +74,10 @@ impl Interpret<HAClusterTopology> for OKDSetup04WorkersInterpret {

    async fn execute(
        &self,
        inventory: &Inventory,
        topology: &HAClusterTopology,
        _inventory: &Inventory,
        _topology: &HAClusterTopology,
    ) -> Result<Outcome, InterpretError> {
        self.render_and_reboot().await?;
        // 1. Ensure we have 2 physical hosts for the worker nodes.
        let nodes = self.get_nodes(inventory, topology).await?;

        // 2. Create DHCP reservations for the worker nodes.
        self.configure_host_binding(inventory, topology, &nodes)
            .await?;

        // 3. Create iPXE files for each worker node to boot from the worker ignition.
        self.configure_ipxe(inventory, topology, &nodes).await?;

        // 4. Reboot the nodes to start the OS installation.
        self.reboot_targets(&nodes).await?;

        Ok(Outcome::success("Workers provisioned".into()))
    }
}