Design Document: Harmony PostgreSQL Module

Status: Draft | Last Updated: 2025-12-01 | Context: Multi-site Data Replication & Orchestration

1. Overview

The Harmony PostgreSQL Module provides a high-level abstraction for deploying and managing high-availability PostgreSQL clusters across geographically distributed Kubernetes/OKD sites.

Instead of manually configuring complex replication slots, firewalls, and operator settings on each cluster, users define a single intent (a Score), and Harmony orchestrates the underlying infrastructure (the Arrangement) to establish a Primary-Replica architecture.

Currently, the implementation relies on the CloudNativePG (CNPG) operator as the backing engine.

2. Architecture

2.1 The Abstraction Model

Following ADR 003 (Infrastructure Abstraction), Harmony separates the intent from the implementation.

  1. The Score (Intent): The user defines a MultisitePostgreSQL resource. This describes what is needed (e.g., "A Postgres 15 cluster with 10GB storage, Primary on Site A, Replica on Site B"). An abbreviated example follows this list.
  2. The Interpret (Action): Harmony's MultisitePostgreSQLInterpret processes this Score and orchestrates the deployment on both sites to reach the declared state.
  3. The Capability (Implementation): The PostgreSQL Capability is implemented by the K8sTopology; through it, the Interpret can deploy the cluster, configure it, and fetch information about it. The concrete implementation relies on the mature CloudNativePG operator to manage all the required Kubernetes resources.
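
For concreteness, the Score is declared as a Kubernetes-style custom resource. The abbreviated manifest below (field names taken from the full example in the Usage Guide, section 4) illustrates the intent the user expresses; everything else is derived by the Interpret:

apiVersion: harmony.io/v1alpha1
kind: MultisitePostgreSQL
spec:
  version: "15"
  topology:
    primary:
      site: "site-paris"      # Primary CNPG cluster lands here
    replicas:
      - site: "site-newyork"  # Replica cluster streams from the primary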

2.2 Network Connectivity (TLS Passthrough)

One of the critical challenges in multi-site orchestration is secure connectivity between clusters that may have dynamic IPs or strict firewalls.

To solve this, we utilize OKD/OpenShift Routes with TLS Passthrough.

  • Mechanism: The Primary site exposes a Route configured for termination: passthrough.
  • Routing: The OpenShift HAProxy router inspects the SNI (Server Name Indication) field of the incoming TLS handshake to route traffic to the correct PostgreSQL Pod.
  • Security: SSL is not terminated at the ingress router. The encrypted stream is passed directly to the PostgreSQL instance. Mutual TLS (mTLS) authentication is handled natively by CNPG between the Primary and Replica instances.
  • Dynamic IPs: Because connections are established via DNS hostnames (the Route URL), this architecture is resilient to dynamic IP changes at the Primary site.

Traffic Flow Diagram

[ Site B: Replica ]                 [ Site A: Primary ]
      |                                     |
(CNPG Instance) --[Encrypted TCP]--> (OKD HAProxy Router)
      |           (Port 443)                |
      |                                     |
      |                            [SNI Inspection]
      |                                     |
      |                                     v
      |                            (PostgreSQL Primary Pod)
      |                                   (Port 5432)
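
As a sketch of this mechanism, the passthrough Route created on the Primary site could look roughly like the manifest below. The Service name (finance-db-rw, CNPG's read-write Service) and the hostname are assumptions based on the example in section 4; the resources Harmony actually generates may differ.

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: postgres-finance-db
  namespace: tenant-a
spec:
  host: postgres-finance-db.apps.site-paris.example.com
  to:
    kind: Service
    name: finance-db-rw            # CNPG read-write Service in front of the primary Pod (assumed name)
  port:
    targetPort: 5432               # PostgreSQL port on the backing Service
  tls:
    termination: passthrough       # the router only inspects SNI; TLS is not terminated here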

3. Design Decisions

Why CloudNativePG?

We selected CloudNativePG because it relies exclusively on standard Kubernetes primitives and uses the native PostgreSQL replication protocol (WAL shipping/Streaming). This aligns with Harmony's goal of being "K8s Native."
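
To illustrate this reliance on standard primitives, a primary-side Cluster roughly matching the example in section 4 could look like the sketch below. This is not necessarily the exact manifest Harmony generates; the image name and instance count are assumptions.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: finance-db
  namespace: tenant-a
spec:
  instances: 1                                     # single instance shown for brevity
  imageName: ghcr.io/cloudnative-pg/postgresql:15  # PostgreSQL 15, per the Score's version field
  storage:
    size: 10Gi
  resources:
    requests:
      cpu: 500m
      memory: 1Gi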

Why TLS Passthrough instead of VPN/NodePort?

  • NodePort: Requires static IPs and opening non-standard ports on the firewall, which violates our security constraints.
  • VPN (e.g., Wireguard/Tailscale): While secure, it introduces significant complexity (sidecars, key management) and external dependencies.
  • TLS Passthrough: Leverages the existing Ingress/Router infrastructure already present in OKD. It requires zero additional software and respects multi-tenancy (Routes are namespaced).

Configuration Philosophy (YAGNI)

The current design deliberately exposes a small, generic configuration surface. Users can configure standard parameters (storage size, CPU/memory requests, PostgreSQL version).

We explicitly do not expose advanced CNPG or PostgreSQL configurations at this stage.

  • Reasoning: We aim to keep the API surface small and manageable.
  • Future Path: We plan to implement a "pass-through" mechanism to allow sending raw config maps or custom parameters to the underlying engine (CNPG) only when a concrete use case arises. Until then, we adhere to the YAGNI (You Ain't Gonna Need It) principle to avoid premature optimization and API bloat.

4. Usage Guide

To deploy a multi-site cluster, apply the MultisitePostgreSQL resource to the Harmony Control Plane.

Example Manifest

apiVersion: harmony.io/v1alpha1
kind: MultisitePostgreSQL
metadata:
  name: finance-db
  namespace: tenant-a
spec:
  version: "15"
  storage: "10Gi"
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
  
  # Topology Definition
  topology:
    primary:
      site: "site-paris" # The name of the cluster in Harmony
    replicas:
      - site: "site-newyork"

What happens next?

  1. Harmony detects the CR.
  2. On Site Paris: It deploys a CNPG Cluster (Primary) and creates a Passthrough Route postgres-finance-db.apps.site-paris.example.com.
  3. On Site New York: It deploys a CNPG Cluster (Replica) configured with externalClusters pointing to the Paris Route (a sketch of this manifest follows this list).
  4. Data begins replicating immediately over the encrypted channel.
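
For step 3, a hedged sketch of the Replica-side Cluster, using CNPG's replica and externalClusters mechanisms, is shown below. The external cluster name, secret names, and port are illustrative; the manifest Harmony actually produces may differ.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: finance-db
  namespace: tenant-a
spec:
  instances: 1
  storage:
    size: 10Gi
  replica:
    enabled: true                    # this Cluster runs as a replica of the external primary
    source: finance-db-paris         # must match an externalClusters entry below
  bootstrap:
    pg_basebackup:
      source: finance-db-paris       # initial data copy is streamed from the Primary site
  externalClusters:
    - name: finance-db-paris
      connectionParameters:
        host: postgres-finance-db.apps.site-paris.example.com  # the passthrough Route on Site Paris
        port: "443"                  # router port; SNI routing forwards to the primary Pod on 5432
        user: streaming_replica
        sslmode: verify-full
      sslCert:                       # client certificate, key and CA propagated by Harmony (assumed secret names)
        name: finance-db-replication
        key: tls.crt
      sslKey:
        name: finance-db-replication
        key: tls.key
      sslRootCert:
        name: finance-db-ca
        key: ca.crt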

5. Troubleshooting

  • Connection Refused: Ensure the Primary site's Route has been successfully admitted by the Ingress Controller (see the status excerpt below).
  • Certificate Errors: CNPG manages mTLS automatically. If errors persist, ensure the CA secrets were correctly propagated by Harmony from Primary to Replica namespaces.
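
For the first case, a Route that the Ingress Controller has admitted reports a status condition like the excerpt below (the structure is the standard Route status; the host and router name are illustrative):

status:
  ingress:
    - host: postgres-finance-db.apps.site-paris.example.com
      routerName: default
      conditions:
        - type: Admitted
          status: "True"             # "False" with a reason indicates why the router rejected the Route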