adr/019-Nats-credentials-management.md
Initial Date: 2025-02-06

## Status

Proposed

## Context
The Harmony Agent requires a persistent connection to the NATS Supercluster to perform Key-Value (KV) operations (Read/Write/Watch).

Service Requirements: The agent must authenticate with sufficient privileges to manage KV buckets and interact with the JetStream API.

Infrastructure: NATS is deployed as a multi-site Supercluster. Authentication must be consistent across sites so that agents can fail over and data can replicate.

Reference: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro

Technical Constraint: In NATS, JetStream functionality is not global by default; it must be explicitly enabled and capped at the Account level before KV buckets can be created and persisted.

## Issues

1. The "System Account" Trap

The Hole: Using the system account for the Harmony Agent.

The Risk: The NATS System Account carries server-internal traffic such as heartbeats and monitoring. JetStream cannot be enabled on it, so it cannot (and should not) own KV buckets.

2. Multi-Site Authorization Sync

The Hole: Defining users in local nats.conf files via Helm.

The Risk: If an agent at Site-2 fails over to Site-3 but Site-3's local configuration lacks the testUser credentials, the agent is locked out at the moment it needs to reconnect, during an outage.

3. KV Replication Factor

The Hole: Not specifying the Replicas count for the KV bucket.

The Risk: A KV bucket created with the default of 1 replica is stored on a single server at the site where it was created. If that site goes down, the data is unavailable, and possibly lost, despite the Supercluster.
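
The effect of the replica count can be illustrated with majority-quorum arithmetic. The sketch below assumes the Raft-style majority rule that JetStream applies to replicated streams and the documented replica range of 1 to 5; the function names are illustrative and are not part of any NATS API.

```python
def quorum(replicas: int) -> int:
    """Number of replicas that must be reachable for writes to proceed."""
    if not 1 <= replicas <= 5:
        # Assumption: JetStream accepts replica counts from 1 to 5.
        raise ValueError("replicas must be between 1 and 5")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still remains."""
    return replicas - quorum(replicas)


# R=1 tolerates no failure at all; R=3 tolerates the loss of one replica.
for r in (1, 3, 5):
    print(f"R={r}: quorum={quorum(r)}, tolerated failures={tolerated_failures(r)}")
```

With R=1 there is no redundancy, so losing the one server that holds the bucket makes it unavailable.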

4. Subject-Level Permissions

The Hole: Only granting `TEST.*` permissions.

The Risk: NATS KV uses internal subjects (e.g., `$KV.<bucket_name>.>`). Without access to these, the agent's KV operations are rejected with a "Permissions Violation" even though it authenticated successfully.
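
The gap in issue 4 can be shown with a minimal sketch of NATS subject wildcard matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. The helper names `subject_matches` and `is_allowed` and the example key `agent.status` are hypothetical; the real check is performed inside the NATS server.

```python
from typing import List


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style pattern matches a concrete subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one
            # remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def is_allowed(allow_list: List[str], subject: str) -> bool:
    """Return True if any pattern in the allow list matches the subject."""
    return any(subject_matches(pattern, subject) for pattern in allow_list)


# Subject a KV write to key "agent.status" in bucket HARMONY would use.
kv_subject = "$KV.HARMONY.agent.status"
print(is_allowed(["TEST.*"], kv_subject))         # False: wrong first token
print(is_allowed(["$KV.HARMONY.>"], kv_subject))  # True
```

A grant of `TEST.*` therefore never covers a KV subject, no matter which bucket or key is used.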

## Proposed Solution

To enable reliable, secure communication between the Harmony Agent and the NATS Supercluster, we will implement Account-isolated JetStream using NKey Authentication (or mTLS).

1. Dedicated Account Architecture

We will move away from the "Global/Default" account. A dedicated HARMONY account will be defined identically across all sites in the Supercluster, so that agents can authenticate at any site and the metadata for the KV bucket can replicate across the gateways.

System Account: Reserved for NATS internal health and Supercluster routing.

Harmony Account: Dedicated to Harmony Agent data, with JetStream explicitly enabled.

2. Authentication: Use the Harmony secret store, mounted into the NATS container

This takes advantage of the solution that is already implemented.

3. JetStream & KV Configuration

To ensure the KV bucket is available across the Supercluster, the following configuration must be applied:

Replication Factor (R=3): KV buckets will be created with a replication factor of 3 so that data survives the loss of a replica, with the intent of keeping copies at Site-1, Site-2, and Site-3. Note that JetStream places all replicas of a stream within a single cluster; keeping a copy in each gateway-connected cluster normally relies on stream mirrors or sources, so replica placement should be verified against the deployed topology.

Permissions: The agent will be granted scoped access to:

`$KV.HARMONY.>` (Data operations)

`$JS.API.CONSUMER.>` and `$JS.API.STREAM.>` (Management operations)
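
As a rough illustration, the account definition could be rendered from a single source so that every site receives the same text. The sketch below emits a nats.conf-style `accounts` block. The function name, the placeholder NKey, and the `_INBOX.>` subscribe grant are assumptions that do not come from this ADR; depending on the client library, further `$JS.API` subjects may be required.

```python
import json
from typing import List


def render_harmony_account(user_nkey: str, bucket: str = "HARMONY") -> str:
    """Render a nats.conf-style block for the HARMONY account (sketch)."""
    # Publish grants listed in this ADR.
    publish_allow: List[str] = [
        "$KV." + bucket + ".>",
        "$JS.API.CONSUMER.>",
        "$JS.API.STREAM.>",
    ]
    # Assumption: replies and watch deliveries arrive on inbox subjects.
    subscribe_allow: List[str] = ["_INBOX.>"]
    lines = [
        "accounts {",
        "  HARMONY {",
        "    jetstream: enabled",
        "    users: [",
        "      {",
        '        nkey: "' + user_nkey + '"',
        "        permissions: {",
        "          publish: { allow: " + json.dumps(publish_allow) + " }",
        "          subscribe: { allow: " + json.dumps(subscribe_allow) + " }",
        "        }",
        "      }",
        "    ]",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


# "UPLACEHOLDER" is not a valid NKey; a real public user key starts with "U".
print(render_harmony_account("UPLACEHOLDER"))
```

Rendering from one function keeps the per-site output identical, which is the property the dedicated-account design depends on.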

## Consequence of Decision

Pros

Resilience: Agents can fail over to any site in the Supercluster and find their credentials and data.

Security: By using a dedicated account, the Harmony Agent cannot see or interfere with NATS system traffic.

Scalability: We can add Site-4 or Site-5 simply by copying the HARMONY account definition.

Cons / Risks

Configuration Drift: If one site's ConfigMap is updated without the others, authentication will fail during a site failover.

Complexity: Requires a "Management" step to ensure the account exists on all NATS instances before the agent attempts to connect.
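
One way to catch configuration drift early is to fingerprint each site's account definition and compare the results before a failover is ever needed. The sketch below assumes the definitions have already been parsed into plain dictionaries (the parsing of each site's ConfigMap is out of scope); the function names and the site data are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(account_def: dict) -> str:
    """Stable SHA-256 digest of an account definition."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(account_def, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_sites(site_defs: Dict[str, dict], reference_site: str) -> List[str]:
    """Return the sites whose definition differs from the reference site."""
    expected = fingerprint(site_defs[reference_site])
    return sorted(
        site
        for site, definition in site_defs.items()
        if fingerprint(definition) != expected
    )


# Hypothetical data: Site-3 is missing one of the publish grants.
harmony = {"jetstream": True, "publish": ["$KV.HARMONY.>", "$JS.API.STREAM.>"]}
sites = {
    "site-1": harmony,
    "site-2": dict(harmony),
    "site-3": {"jetstream": True, "publish": ["$KV.HARMONY.>"]},
}
print(drifted_sites(sites, "site-1"))
```

A check like this could run as part of the "Management" step, so that a mismatch is reported before an agent attempts to connect.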