doc: Initial documentation for the MultisitePostgreSQL module
All checks were successful
Run Check Script / check (pull_request) Successful in 1m12s

This commit is contained in:
Jean-Gabriel Gill-Couture 2025-12-02 11:44:24 -05:00
parent 83c1cc82b6
commit c7cbd9eeac

View File

@ -0,0 +1,105 @@
# Design Document: Harmony PostgreSQL Module
**Status:** Draft
**Last Updated:** 2025-12-01
**Context:** Multi-site Data Replication & Orchestration
## 1. Overview
The Harmony PostgreSQL Module provides a high-level abstraction for deploying and managing high-availability PostgreSQL clusters across geographically distributed Kubernetes/OKD sites.
Instead of manually configuring complex replication slots, firewalls, and operator settings on each cluster, users define a single intent (a **Score**), and Harmony orchestrates the underlying infrastructure (the **Arrangement**) to establish a Primary-Replica architecture.
Currently, the implementation relies on the **CloudNativePG (CNPG)** operator as the backing engine.
## 2. Architecture
### 2.1 The Abstraction Model
Following **ADR 003 (Infrastructure Abstraction)**, Harmony separates the *intent* from the *implementation*.
1. **The Score (Intent):** The user defines a `MultisitePostgreSQL` resource. This describes *what* is needed (e.g., "A Postgres 15 cluster with 10GB storage, Primary on Site A, Replica on Site B").
2. **The Interpret (Action):** Harmony MultisitePostgreSQLInterpret processes this Score and orchestrates the deployment on both sites to reach the state defined in the Score.
3. **The Capability (Implementation):** The PostgreSQL Capability is implemented by the K8sTopology and the interpret can deploy it, configure it and fetch information about it. The concrete implementation will rely on the mature CloudnativePG operator to manage all the Kubernetes resources required.
### 2.2 Network Connectivity (TLS Passthrough)
One of the critical challenges in multi-site orchestration is secure connectivity between clusters that may have dynamic IPs or strict firewalls.
To solve this, we utilize **OKD/OpenShift Routes with TLS Passthrough**.
* **Mechanism:** The Primary site exposes a `Route` configured for `termination: passthrough`.
* **Routing:** The OpenShift HAProxy router inspects the **SNI (Server Name Indication)** header of the incoming TCP connection to route traffic to the correct PostgreSQL Pod.
* **Security:** SSL is **not** terminated at the ingress router. The encrypted stream is passed directly to the PostgreSQL instance. Mutual TLS (mTLS) authentication is handled natively by CNPG between the Primary and Replica instances.
* **Dynamic IPs:** Because connections are established via DNS hostnames (the Route URL), this architecture is resilient to dynamic IP changes at the Primary site.
#### Traffic Flow Diagram
```text
[ Site B: Replica ] [ Site A: Primary ]
| |
(CNPG Instance) --[Encrypted TCP]--> (OKD HAProxy Router)
| (Port 443) |
| |
| [SNI Inspection]
| |
| v
| (PostgreSQL Primary Pod)
| (Port 5432)
```
## 3. Design Decisions
### Why CloudNativePG?
We selected CloudNativePG because it relies exclusively on standard Kubernetes primitives and uses the native PostgreSQL replication protocol (WAL shipping/Streaming). This aligns with Harmony's goal of being "K8s Native."
### Why TLS Passthrough instead of VPN/NodePort?
* **NodePort:** Requires static IPs and opening non-standard ports on the firewall, which violates our security constraints.
* **VPN (e.g., Wireguard/Tailscale):** While secure, it introduces significant complexity (sidecars, key management) and external dependencies.
* **TLS Passthrough:** Leverages the existing Ingress/Router infrastructure already present in OKD. It requires zero additional software and respects multi-tenancy (Routes are namespaced).
### Configuration Philosophy (YAGNI)
The current design exposes a **generic configuration surface**. Users can configure standard parameters (Storage size, CPU/Memory requests, Postgres version).
**We explicitly do not expose advanced CNPG or PostgreSQL configurations at this stage.**
* **Reasoning:** We aim to keep the API surface small and manageable.
* **Future Path:** We plan to implement a "pass-through" mechanism to allow sending raw config maps or custom parameters to the underlying engine (CNPG) *only when a concrete use case arises*. Until then, we adhere to the **YAGNI (You Ain't Gonna Need It)** principle to avoid premature optimization and API bloat.
## 4. Usage Guide
To deploy a multi-site cluster, apply the `MultisitePostgreSQL` resource to the Harmony Control Plane.
### Example Manifest
```yaml
apiVersion: harmony.io/v1alpha1
kind: MultisitePostgreSQL
metadata:
name: finance-db
namespace: tenant-a
spec:
version: "15"
storage: "10Gi"
resources:
requests:
cpu: "500m"
memory: "1Gi"
# Topology Definition
topology:
primary:
site: "site-paris" # The name of the cluster in Harmony
replicas:
- site: "site-newyork"
```
### What happens next?
1. Harmony detects the CR.
2. **On Site Paris:** It deploys a CNPG Cluster (Primary) and creates a Passthrough Route `postgres-finance-db.apps.site-paris.example.com`.
3. **On Site New York:** It deploys a CNPG Cluster (Replica) configured with `externalClusters` pointing to the Paris Route.
4. Data begins replicating immediately over the encrypted channel.
## 5. Troubleshooting
* **Connection Refused:** Ensure the Primary site's Route is successfully admitted by the Ingress Controller.
* **Certificate Errors:** CNPG manages mTLS automatically. If errors persist, ensure the CA secrets were correctly propagated by Harmony from Primary to Replica namespaces.