adr: 17-1 nats clusters interconnection using islands of trust. mTLS via shared ca-bundle with each cluster distributing its own CA.
Some checks failed
Run Check Script / check (pull_request) Failing after 51s
adr/017-1-Nats-Clusters-Interconnection-Topology.md
@@ -0,0 +1,189 @@
### 1. ADR 017-1: NATS Cluster Interconnection & Trust Topology

# Architecture Decision Record: NATS Cluster Interconnection & Trust Topology

**Status:** Proposed
**Date:** 2026-01-12
**Precedes:** [017-Staleness-Detection-for-Failover.md]

## Context

In ADR 017, we defined the failover mechanisms for the Harmony mesh. However, for a Primary (Site A) and a Replica (Site B) to communicate securely, or for the Global Mesh to function across disparate locations, we must establish a robust Transport Layer Security (TLS) strategy.

Our primary deployment platform is OKD (Kubernetes). While OKD provides an internal `service-ca`, it is designed primarily for intra-cluster service-to-service communication. It lacks the flexibility required for:

1. **Public/External Gateway Identities:** NATS Gateways need to identify themselves via public DNS names or external IPs, not just internal `.svc` cluster domains.
2. **Cross-Cluster Trust:** We need a mechanism that lets Cluster A trust Cluster B without the two sharing a single root private key.

## Decision

We will implement an **"Islands of Trust"** topology using **cert-manager** on OKD.

### 1. Per-Cluster Certificate Authorities (CA)

* We explicitly **reject** the use of a single "Supercluster CA" shared across all sites.
* Instead, every Harmony Cluster (Site A, Site B, etc.) will generate its own unique Self-Signed Root CA managed by `cert-manager` inside that cluster.
* **Lifecycle:** Root CAs will have a long duration (e.g., 10 years) to minimize rotation friction, while Leaf Certificates (NATS servers) will remain short-lived (e.g., 90 days) and rotate automatically.

> Note: Whether each Harmony deployment should have a single CA for all the workloads it manages, or a separate CA for each service that requires interconnection, has not been decided yet. This ADR leans towards one CA per service, which allows maximum flexibility, but the direction might change. The alternative, where each cluster/Harmony deployment has a single identity, could make mTLS between tenants very simple.

### 2. Trust Federation via Bundle Exchange

To enable secure communication (mTLS) between clusters (e.g., for NATS Gateways or Leaf Nodes):

* **No Private Keys are shared.**
* We will aggregate the **Public CA Certificates** of all trusted clusters into a shared `ca-bundle.pem`.
* This bundle is distributed to the NATS configuration of every node.
* **Verification Logic:** When Site A connects to Site B, Site A verifies Site B's certificate against the bundle. Since Site B's CA certificate is in the bundle, the chain validates and the connection is accepted. Because the connection is mutual TLS, Site B validates Site A's certificate against the same bundle.

### 3. Tooling

* We will use **cert-manager** (deployed via Operator on OKD) rather than OKD's built-in `service-ca`. This provides us with standard CRDs (`Issuer`, `Certificate`) to manage the lifecycle, rotation, and complex SANs (Subject Alternative Names) required for external connectivity.
* Harmony will manage installation, configuration, and bundle creation across all sites.

## Rationale

**Security Blast Radius (The "Key Leak" Scenario)**

If we used a single global CA and its private key were compromised at any one site (e.g., physical theft of a server from a basement), the attacker could impersonate *any* site in the global mesh.

By using Per-Cluster CAs:

* If Site A is compromised, only Site A's identity is stolen.
* We can "evict" Site A from the mesh simply by removing Site A's public CA certificate from the `ca-bundle.pem` on the remaining healthy clusters and reloading. The attacker can no longer authenticate.

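
The eviction step above can be sketched as a bundle rebuild that simply leaves out the compromised site's CA file. This is a minimal sketch, not Harmony's implementation: the helper names `build_bundle` and `count_certs` and the file name `site-c.crt` are hypothetical, and each remaining cluster would still need the new bundle published and its NATS servers reloaded.

```bash
#!/usr/bin/env bash
set -euo pipefail

# build_bundle OUT FILE...: concatenate PEM CA certificates into OUT.
# A newline is forced after each file so PEM blocks never fuse together.
build_bundle() {
  local out="$1"; shift
  : > "$out"
  local f
  for f in "$@"; do
    cat "$f" >> "$out"
    # Append a newline only if the file does not already end with one.
    [ -z "$(tail -c1 "$f")" ] || printf '\n' >> "$out"
  done
}

# count_certs FILE: number of certificates found in a PEM bundle.
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1" || true
}

# Hypothetical usage: evict Site A by rebuilding from the remaining CAs only.
#   build_bundle ca-bundle.pem site-b.crt site-c.crt
#   count_certs ca-bundle.pem
```

Rebuilding from the list of still-trusted files, rather than editing the bundle in place, keeps the operation idempotent and easy to audit.
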

**Decentralized Autonomy**

This aligns with the "Humane Computing" vision. A local cluster owns its identity. It does not depend on a central authority to issue its certificates. It can function in isolation (offline) indefinitely without needing to "phone home" to renew credentials.

## Consequences

**Positive**

* **High Security:** Compromise of one cluster's CA does not compromise the rest of the global mesh.
* **Flexibility:** Easier to integrate with third-party clusters or partners by simply adding their public CA certificate to the bundle.
* **Standardization:** `cert-manager` is the de facto standard for certificate management on Kubernetes, making the configuration portable to non-OKD K8s clusters if needed.

**Negative**

* **Configuration Complexity:** We must maintain a mechanism that distributes the `ca-bundle.pem` containing the public CA certificates to all sites. This should be automated (e.g., via a Harmony Agent) to ensure timely updates and revocation.
* **Revocation Latency:** Revoking a compromised cluster requires updating and reloading the bundle on all other clusters. This is slower than OCSP/CRL but acceptable for infrastructure-level trust if automation is in place.

---

# 2. Concrete overview of the process, how it can be implemented manually across multiple OKD clusters

All of this will be automated via Harmony, but to make the process clear, it is outlined in detail here:

## 1. Deploying and Configuring cert-manager on OKD

While OKD has a built-in `service-ca` controller, it is "opinionated" and primarily signs certs for internal services (like `my-svc.my-namespace.svc`). It is **not suitable** for the Harmony Global Mesh because you cannot easily control the Subject Alternative Names (SANs) for external routes (e.g., `nats.site-a.nationtech.io`), nor can you easily export its CA to other clusters.

**The Solution:** Use the **cert-manager Operator for Red Hat OpenShift**.

### Step 1: Install the Operator

1. Log in to the OKD Web Console.
2. Navigate to **Operators** -> **OperatorHub**.
3. Search for **"cert-manager"**.
4. Choose the **"cert-manager Operator for Red Hat OpenShift"** (Red Hat provided) or the community version.
5. Click **Install**. Use the default settings (Namespace: `cert-manager-operator`).

### Step 2: Create the "Island" CA (The Issuer)

Once installed, you define your cluster's unique identity. Apply this YAML to your NATS namespace.

```yaml
# filepath: k8s/01-issuer.yaml
# Bootstrap issuer: only used to self-sign the Root CA below
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: harmony-selfsigned-issuer
  namespace: harmony-nats
spec:
  selfSigned: {}
---
# This generates the unique Root CA for THIS specific cluster
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: harmony-root-ca
  namespace: harmony-nats
spec:
  isCA: true
  commonName: "harmony-site-a-ca" # CHANGE THIS per cluster (e.g., harmony-site-b-ca)
  duration: 87600h # 10 years
  renewBefore: 2160h # 3 months before expiry
  secretName: harmony-root-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: harmony-selfsigned-issuer
    kind: Issuer
    group: cert-manager.io
---
# This Issuer uses the Root CA generated above to sign NATS certs
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: harmony-ca-issuer
  namespace: harmony-nats
spec:
  ca:
    secretName: harmony-root-ca-secret
```

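The `duration` and `renewBefore` values above are written in hours because cert-manager parses them as Go-style durations, where the hour is the largest unit. As a small sanity check of the arithmetic, the sketch below converts days to that form; `hours_from_days` is a hypothetical helper, and a year is assumed to be 365 days.

```bash
# hours_from_days DAYS: print DAYS as an hour-based duration string.
hours_from_days() {
  echo "$(( $1 * 24 ))h"
}

hours_from_days 3650   # Root CA duration: 10 years of 365 days -> 87600h
hours_from_days 90     # Leaf duration and Root CA renewBefore  -> 2160h
hours_from_days 15     # Leaf renewBefore                       -> 360h
```
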

### Step 3: Generate the NATS Server Certificate

This certificate will be used by the NATS server. It includes both internal DNS names (for local clients) and external DNS names (for the global mesh).

```yaml
# filepath: k8s/02-nats-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nats-server-cert
  namespace: harmony-nats
spec:
  secretName: nats-server-tls
  duration: 2160h # 90 days
  renewBefore: 360h # 15 days
  issuerRef:
    name: harmony-ca-issuer
    kind: Issuer
  # CRITICAL: Define all names this server can be reached by
  dnsNames:
    - "nats"
    - "nats.harmony-nats.svc"
    - "nats.harmony-nats.svc.cluster.local"
    - "*.nats.harmony-nats.svc.cluster.local"
    - "nats-gateway.site-a.nationtech.io" # External Route for Mesh
```

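Before wiring the certificate into NATS, it can help to confirm that it was actually issued and that the SAN list matches the names above. The following is a sketch under assumptions: `oc` is already logged in to the cluster, OpenSSL 1.1.1 or newer is installed locally (for the `-ext` option), and the function names `wait_for_cert` and `show_sans` are hypothetical.

```bash
# wait_for_cert NAME [NAMESPACE]: block until cert-manager marks the
# Certificate as Ready, or fail after two minutes.
wait_for_cert() {
  oc wait --for=condition=Ready "certificate/$1" \
    -n "${2:-harmony-nats}" --timeout=120s
}

# show_sans SECRET [NAMESPACE]: decode the issued certificate from its
# Secret and print its Subject Alternative Names.
show_sans() {
  oc get secret "$1" -n "${2:-harmony-nats}" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -ext subjectAltName
}

# Hypothetical usage:
#   wait_for_cert nats-server-cert
#   show_sans nats-server-tls
```
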

## 2. Implementing the "Islands of Trust" (Trust Bundle)

To make Site A and Site B talk, you need to exchange each cluster's **public CA certificate**. The private keys never leave their cluster.

1. **Extract Public CA from Site A** (run while logged in to Site A):
   ```bash
   oc get secret harmony-root-ca-secret -n harmony-nats -o jsonpath='{.data.ca\.crt}' | base64 -d > site-a.crt
   ```
2. **Extract Public CA from Site B** (run while logged in to Site B):
   ```bash
   oc get secret harmony-root-ca-secret -n harmony-nats -o jsonpath='{.data.ca\.crt}' | base64 -d > site-b.crt
   ```
3. **Create the Bundle:**
   Combine them into one file.
   ```bash
   cat site-a.crt site-b.crt > ca-bundle.crt
   ```
4. **Upload Bundle to Both Clusters:**
   Create a ConfigMap or Secret in *both* clusters containing this combined bundle.
   ```bash
   oc create configmap nats-trust-bundle --from-file=ca.crt=ca-bundle.crt -n harmony-nats
   ```
5. **Configure NATS:**
   Mount this ConfigMap and point NATS to it.

   ```conf
   # nats.conf snippet
   tls {
     cert_file: "/etc/nats-certs/tls.crt"
     key_file: "/etc/nats-certs/tls.key"
     # Point to the bundle containing BOTH Site A and Site B public CAs
     ca_file: "/etc/nats-trust/ca.crt"
   }
   ```
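
Two practical details of steps 4 and 5 can be sketched as follows. `oc create configmap` fails once the ConfigMap exists, so later bundle updates need an idempotent form, and the ConfigMap must be mounted at the path that `ca_file` points to. The function names and the workload name `statefulset/nats` are assumptions for illustration; adjust them to however NATS is actually deployed.

```bash
# publish_bundle [BUNDLE] [NAMESPACE]: create or update the trust bundle
# ConfigMap. Rendering with --dry-run and piping to `oc apply` makes the
# command safe to re-run whenever the bundle changes.
publish_bundle() {
  local bundle="${1:-ca-bundle.crt}" ns="${2:-harmony-nats}"
  oc create configmap nats-trust-bundle \
    --from-file=ca.crt="$bundle" -n "$ns" \
    --dry-run=client -o yaml | oc apply -f -
}

# mount_bundle [WORKLOAD] [NAMESPACE]: mount the ConfigMap at /etc/nats-trust.
# "statefulset/nats" is a hypothetical workload name.
mount_bundle() {
  oc set volume "${1:-statefulset/nats}" -n "${2:-harmony-nats}" \
    --add --name=nats-trust --type=configmap \
    --configmap-name=nats-trust-bundle --mount-path=/etc/nats-trust
}
```
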

This setup ensures that Site A can verify Site B's certificate (signed by `harmony-site-b-ca`) because Site B's CA certificate is in Site A's trust store, and vice versa, without either cluster ever sharing the private key behind its CA.

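
The `tls` block shown earlier secures client connections; NATS gateway connections are configured in their own `gateway` block, which takes its own `tls` map. The sketch below renders such a block for one site. It is illustrative only: `render_gateway_conf` is a hypothetical helper, port 7222 is the conventional gateway port rather than a value from this ADR, and the Site B hostname is assumed by analogy with the Site A route.

```bash
# render_gateway_conf LOCAL_SITE REMOTE_SITE REMOTE_HOST
# Prints a NATS gateway block that reuses the leaf certificate and the
# shared trust bundle mounted in the previous step.
render_gateway_conf() {
  local local_site="$1" remote_site="$2" remote_host="$3"
  cat <<EOF
gateway {
  name: "${local_site}"
  port: 7222
  tls {
    cert_file: "/etc/nats-certs/tls.crt"
    key_file: "/etc/nats-certs/tls.key"
    ca_file: "/etc/nats-trust/ca.crt"
  }
  gateways: [
    { name: "${remote_site}", url: "nats://${remote_host}:7222" }
  ]
}
EOF
}

# Hypothetical usage for Site A pointing at Site B:
render_gateway_conf site-a site-b nats-gateway.site-b.nationtech.io
```

Site B would render the mirror image of this block, with the names swapped and Site A's external hostname as the remote URL.
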