harmony/adr/017-Multi-Cluster-Federation-And-Granular-Authorization.md

# ADR 017: Multi-Cluster Federation and Granular Authorization
* **Status:** Accepted
* **Date:** 2025-12-20
* **Context:** ADR-016 Harmony Agent and Global Mesh For Decentralized Workload Management
## Context and Problem Statement
As Harmony expands to manage workloads across multiple independent clusters, we face two specific requirements regarding network topology and security:
1. **Granular Authorization:** We need the ability to authorize a specific node (Leaf Node) from the decentralized mesh to access only a specific subset of topics within a cluster. For example, a collaborator connecting to our Public Cloud should not have visibility into other collaborators' data or our internal system events. The same applies in reverse: when we connect to a collaborator's cluster, we must not gain access to their internal data.
2. **Hybrid/Failover Topologies:** Users need to run their own private meshes (e.g., 5 interconnected office locations) with full internal trust, while simultaneously maintaining a connection to the Harmony Public Cloud for failover or bursting. In this scenario, the Public Cloud must not have access to the private mesh's internal data, and the private mesh should only access specific "public" topics on the cloud.
We need to decide how the Harmony Agent and the underlying NATS infrastructure will handle these "Split-Horizon" and "Zero-Trust" scenarios without complicating the application logic within the Agent itself.
## Decision Drivers
* **Security:** Default-deny posture is required for cross-domain connections.
* **Isolation:** Private on-premise clusters must remain private; data leakage to the public cloud is unacceptable.
* **Simplicity:** The Harmony Agent code should not be burdened with managing multiple complex TCP connections or application-layer routing logic.
* **Efficiency:** Failover traffic should only occur when necessary; network chatter (gossip) from the Public Cloud should not propagate to private nodes.
## Considered Options
1. **Application-Layer Multi-Homing:** The Harmony Agent maintains two distinct active connections (one to Local, one to Cloud) and handles routing logic in Go code.
2. **VPN/VPC Peering:** Connect private networks to the public cloud at the network layer (IPSec/WireGuard).
3. **NATS Account Federation & Leaf Nodes:** Utilize NATS native multi-tenancy (Accounts) and topology bridging (Leaf Nodes) to handle routing and permissions at the protocol layer.
## Decision Outcome
We will adopt **Option 3: NATS Account Federation & Leaf Nodes**.
We will leverage NATS's built-in "Accounts" to create logical isolation boundaries and "Service Imports/Exports" to bridge specific data streams between private and public clusters.
### 1. Authorization Strategy (Subject-Based Permissions)
We will not implement custom authorization logic inside the Harmony Agent for topic access. Instead, we will enforce **Subject-Based Permissions** at the NATS server level.
* **Mechanism:** When a Leaf Node (e.g., a Customer Site) connects to the Public Cloud, it must authenticate using a scoped credential (JWT/NKey).
* **Policy:** This credential will carry a permission policy that strictly defines:
* `PUB`: Allowed subjects (e.g., `harmony.public.requests`).
* `SUB`: Allowed subjects (e.g., `harmony.public.responses`).
* `DENY`: Implicitly everything else (including `$SYS.>` and other customers' streams).
* **Enforcement:** The NATS server rejects unauthorized subscriptions or publications at the protocol level before they reach the application layer.
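As a sketch, the scoped credential above can be expressed in a static `server.conf` on the Public Cloud side. The user name and password here are placeholders; in the decentralized setup described above, the equivalent policy would be embedded in the JWT/NKey credential issued to the Leaf Node:

```conf
# Sketch: scoped credential for a connecting Leaf Node (placeholder identity).
authorization {
  users: [
    {
      user: customer_site_a        # hypothetical leaf-node identity
      password: "s3cret"           # placeholder; real deployments use JWT/NKeys
      permissions: {
        publish:   { allow: ["harmony.public.requests"] }
        subscribe: { allow: ["harmony.public.responses"] }
        # With allow lists present, NATS denies everything else by default,
        # including $SYS.> and other customers' subjects.
      }
    }
  ]
}
```

Note that no explicit deny list is needed: once an allow list exists, the server's posture for that user is default-deny.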
### 2. Hybrid Topology Strategy (Federation)
We will use **NATS Account Federation** to solve the "Hybrid Cloud" requirement. We will treat the Private Mesh and Public Cloud as separate "Accounts."
* **The Private Mesh (Business Account):**
* Consists of the customer's internal nodes (e.g., 5 locations).
* Nodes share full trust and visibility of `local.*` subjects.
* Configured as a **Leaf Node** connection to the Public Cloud.
* **The Public Cloud (SaaS Account):**
* **Exports** a specific stream/service (e.g., `public.compute.rentals`).
* Does *not* join the customer's cluster mesh. It acts as a hub.
* **The Bridge (Import):**
* The Private Mesh **Imports** the public stream.
* The Private Mesh maps this import to a local subject, e.g., `external.cloud`.
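A minimal static-config sketch of this topology, using the account and subject names from the bullets above (the leaf URL and credentials path are hypothetical). With a `prefix` import, the exported stream surfaces on the private side under the `external.cloud.>` namespace:

```conf
# Public Cloud server: two accounts, one exported stream (sketch).
accounts {
  SAAS: {
    exports: [
      { stream: "public.compute.rentals" }    # visible only to importers
    ]
  }
  BUSINESS: {
    imports: [
      # Bridge: pull the public stream in under the local "external.cloud" prefix.
      { stream: { account: SAAS, subject: "public.compute.rentals" },
        prefix: "external.cloud" }
    ]
  }
}
```

```conf
# Private Mesh server: leaf-node connection into the Public Cloud (sketch).
leafnodes {
  remotes: [
    { url: "tls://cloud.harmony.example:7422",    # hypothetical endpoint
      credentials: "/etc/harmony/cloud.creds" }   # scoped JWT/NKey credential
  ]
}
```

Nothing in the `BUSINESS` account is exported, so the isolation is structural: the cloud has no subject it could subscribe to on the private side.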
### 3. Failover Logic
The Harmony Agent will implement failover purely by selecting the target subject; NATS handles the physical routing of the message.
1. **Primary Attempt:** Agent publishes to `local.compute`. NATS routes this only to the 5 internal sites.
2. **Failover Condition:** If internal capacity is exhausted (determined by Agent heuristics), the Agent publishes to `external.cloud`.
3. **Routing:** NATS transparently routes `external.cloud` messages over the Leaf Node connection to the Public Cloud.
4. **Security:** Because the Public Cloud does not import `local.compute`, it never sees internal traffic.
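The subject-selection logic is deliberately trivial. A minimal Go sketch of the decision step (the `freeSlots` heuristic and constant names are illustrative stand-ins, not the Agent's real API):

```go
package main

import "fmt"

// Subjects as laid out in the federation design above; in the real Agent
// these would come from configuration.
const (
	localSubject = "local.compute"  // stays inside the private mesh
	cloudSubject = "external.cloud" // routed over the leaf node to the cloud
)

// chooseSubject implements the failover rule: prefer local capacity,
// burst to the Public Cloud only when the mesh is exhausted.
// freeSlots stands in for the Agent's real capacity heuristics.
func chooseSubject(freeSlots int) string {
	if freeSlots > 0 {
		return localSubject
	}
	return cloudSubject
}

func main() {
	fmt.Println(chooseSubject(5)) // local.compute
	fmt.Println(chooseSubject(0)) // external.cloud
}
```

Because routing is resolved by the subject alone, the Agent never holds a second connection or decides *where* a message goes, only *what* it publishes to.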
## Consequences
### Positive
* **Zero-Trust by Design:** The Public Cloud cannot spy on the Private Mesh because no streams are exported from Private to Public.
* **Reduced Complexity:** The Harmony Agent remains simple; it just publishes to different topics. It does not need to manage connection pools or complex authentication handshakes for multiple clouds.
* **Bandwidth Efficiency:** Gossip traffic (cluster topology updates) does not cross the Leaf Node boundary, so the Private Mesh does not receive heartbeat chatter from the massive Public Cloud.
### Negative
* **Configuration Overhead:** Setting up NATS Accounts, Imports, and Exports requires more complex configuration files (server.conf) and understanding of JWT/NKeys compared to a flat mesh.
* **Observability:** Tracing a message as it crosses Account boundaries (from Private `external.cloud` to Public `public.compute`) requires centralized monitoring that understands Federation.
## References
* [NATS Documentation: Subject-Based Access Control](https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/subject_access)
* [NATS Documentation: Leaf Nodes](https://docs.nats.io/running-a-nats-service/configuration/leafnodes)
* [NATS Documentation: Services & Streams (Imports/Exports)](https://docs.nats.io/running-a-nats-service/configuration/securing_nats/accounts)