# Architecture Decision Record: Staleness-Based Failover Mechanism & Observability
**Status:** Proposed
**Date:** 2026-01-09
**Extends:** [016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md](https://git.nationtech.io/NationTech/harmony/raw/branch/master/adr/016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md)
## Context
In ADR 016, we established the **Harmony Agent** and the **Global Orchestration Mesh** (powered by NATS JetStream) as the foundation for our decentralized infrastructure. We defined the high-level need for a `FailoverStrategy` that can support both financial consistency (CP) and AI availability (AP).
However, a specific implementation challenge remains: **How do we reliably detect node failure without losing the ability to debug the event later?**
Standard distributed systems often use "Key Expiration" (TTL) for heartbeats. If a key disappears, the node is presumed dead. While simple, this approach is catastrophic for post-mortem analysis. When the key expires, the evidence of *when* and *how* the failure occurred evaporates.
For NationTech's vision of **Humane Computing**, where micro datacenters might be heating a family home or running a local business, reliability and diagnosability are paramount. If a cluster fails over, we owe it to the user to provide a clear, historical log of exactly what happened. We cannot build a "wonderful future for computers" on ephemeral, untraceable errors.
## Decision
We will implement a **Staleness Detection** mechanism rather than a Key Expiration mechanism. We will leverage NATS JetStream Key-Value (KV) stores with **History Enabled** to create an immutable audit trail of cluster health.
### 1. The "Black Box" Flight Recorder (NATS Configuration)
We will utilize a persistent NATS KV bucket named `harmony_failover`.
* **Storage:** File (Persistent).
* **History:** Set to `64` (or higher). This allows us to query the last 64 heartbeat entries to visualize the exact degradation of the primary node before failure.
* **TTL:** None. Data never disappears; it only becomes "stale."
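For illustration, a minimal sketch of provisioning this bucket, assuming a Rust agent using the `async-nats` and `tokio` crates; the server URL is a placeholder and the exact `kv::Config` field names may vary between client versions.
```rust
use async_nats::jetstream::{self, kv, stream::StorageType};

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    // Placeholder URL: in practice the Agent connects to the Global Orchestration Mesh.
    let client = async_nats::connect("nats://127.0.0.1:4222").await?;
    let js = jetstream::new(client);

    let _store = js
        .create_key_value(kv::Config {
            bucket: "harmony_failover".to_string(),
            history: 64,                // keep the last 64 heartbeat revisions
            storage: StorageType::File, // persistent: survives server restarts
            // max_age is left at its zero default: entries never expire, they only go stale.
            ..Default::default()
        })
        .await?;

    println!("harmony_failover bucket ready");
    Ok(())
}
```
Leaving `max_age` at its zero default is what turns the bucket into a flight recorder: the server never reaps entries, they simply become stale from the Replica's point of view.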
### 2. Data Structures
We will define two primary schemas to manage the state.
**A. The Rules of Engagement (`cluster_config`)**
This persistent key defines the behavior of the mesh. It allows us to tune failover sensitivity dynamically without redeploying the Agent binary.
```json
{
  "primary_site_id": "site-a-basement",
  "replica_site_id": "site-b-cloud",
  "failover_timeout_ms": 5000,   // Time before Replica takes over
  "heartbeat_interval_ms": 1000  // Frequency of Primary updates
}
```
> **Note:** The location for this configuration data structure is TBD. See https://git.nationtech.io/NationTech/harmony/issues/206
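For illustration, a hypothetical serde mapping of this key for a Rust agent; the struct name `ClusterConfig` is ours, while the field names mirror the JSON above.
```rust
use serde::Deserialize;

// Field names follow the `cluster_config` JSON above; the struct name
// `ClusterConfig` is illustrative, not an existing Harmony type.
#[derive(Debug, Deserialize)]
struct ClusterConfig {
    primary_site_id: String,
    replica_site_id: String,
    /// Milliseconds without a heartbeat before the Replica promotes itself.
    failover_timeout_ms: u64,
    /// Milliseconds between heartbeat writes by the Primary.
    heartbeat_interval_ms: u64,
}

fn main() {
    let raw = r#"{
        "primary_site_id": "site-a-basement",
        "replica_site_id": "site-b-cloud",
        "failover_timeout_ms": 5000,
        "heartbeat_interval_ms": 1000
    }"#;

    let cfg: ClusterConfig = serde_json::from_str(raw).expect("well-formed cluster_config");
    assert_eq!(cfg.heartbeat_interval_ms, 1000);
    println!("{cfg:?}");
}
```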
**B. The Heartbeat (`primary_heartbeat`)**
The Primary writes this; the Replica watches it.
```json
{
  "site_id": "site-a-basement",
  "status": "HEALTHY",
  "counter": 10452,
  "timestamp": 1704661549000
}
```
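Likewise, a hypothetical typed view of the heartbeat payload; the struct name `Heartbeat` is ours, and the fields mirror the JSON above.
```rust
use serde::{Deserialize, Serialize};

// Field names follow the `primary_heartbeat` JSON above; the struct name
// `Heartbeat` is illustrative, not an existing Harmony type.
#[derive(Debug, Serialize, Deserialize)]
struct Heartbeat {
    site_id: String,
    /// e.g. "HEALTHY"; kept as a plain string here, a richer enum is possible.
    status: String,
    /// Monotonically increasing write counter.
    counter: u64,
    /// Client-side unix epoch milliseconds, for debugging only.
    timestamp: u64,
}

fn main() {
    let hb = Heartbeat {
        site_id: "site-a-basement".into(),
        status: "HEALTHY".into(),
        counter: 10_452,
        timestamp: 1_704_661_549_000,
    };

    // Round trip: this is the payload the Primary writes and the Replica parses.
    let json = serde_json::to_string(&hb).expect("serializable heartbeat");
    let back: Heartbeat = serde_json::from_str(&json).expect("parsable heartbeat");
    assert_eq!(back.counter, hb.counter);
    println!("{json}");
}
```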
### 3. The Failover Algorithm
**The Primary (Site A) Logic:**
The Primary's ability to write to the mesh is its "License to Operate."
1. **Write Loop:** Attempts to write `primary_heartbeat` every `heartbeat_interval_ms`.
2. **Self-Preservation (Fencing):** If the write fails (NATS Ack timeout or NATS unreachable), the Primary **immediately self-demotes**. It assumes it is network-isolated. This prevents Split Brain scenarios where a partitioned Primary continues to accept writes while the Replica promotes itself.
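For illustration, a minimal sketch of the Primary loop under these rules, assuming a Rust agent with the `async-nats`, `tokio`, and `serde_json` crates; `self_demote`, the NATS URL, and the choice of the heartbeat interval as the ack deadline are placeholders, not existing Harmony APIs.
```rust
use std::time::Duration;
use async_nats::jetstream;
use serde_json::json;
use tokio::time;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::connect("nats://127.0.0.1:4222").await?;
    let kv = jetstream::new(client).get_key_value("harmony_failover").await?;

    let heartbeat_interval = Duration::from_millis(1_000); // heartbeat_interval_ms
    let mut counter: u64 = 0;

    loop {
        counter += 1;
        let payload = json!({
            "site_id": "site-a-basement",
            "status": "HEALTHY",
            "counter": counter,
            "timestamp": unix_millis(),
        });

        // The write doubles as the fencing check: if NATS does not acknowledge
        // it in time, assume we are partitioned and step down.
        let write = kv.put("primary_heartbeat", payload.to_string().into());
        match time::timeout(heartbeat_interval, write).await {
            Ok(Ok(_revision)) => {} // ack received: "License to Operate" renewed
            _ => {
                self_demote(); // placeholder: stop accepting writes, drop leadership
                break;
            }
        }

        time::sleep(heartbeat_interval).await;
    }
    Ok(())
}

fn unix_millis() -> u64 {
    std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .expect("system clock before unix epoch")
        .as_millis() as u64
}

fn self_demote() {
    eprintln!("heartbeat write failed: self-demoting to avoid split brain");
}
```
Because the ack deadline in this sketch is shorter than `failover_timeout_ms`, a partitioned Primary steps down before the Replica would promote itself.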
**The Replica (Site B) Logic:**
The Replica acts as the watchdog.
1. **Watch:** Subscribes to updates on `primary_heartbeat`.
2. **Staleness Check:** Maintains a local timer. Every time a heartbeat arrives, the timer resets.
3. **Promotion:** If the timer exceeds `failover_timeout_ms`, the Replica declares the Primary dead and promotes itself to Leader.
4. **Yielding:** If the Replica is Leader, but suddenly receives a valid, new heartbeat from the configured `primary_site_id` (indicating the Primary has recovered), the Replica will voluntarily **demote** itself to restore the preferred topology.
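For illustration, a matching sketch of the Replica watchdog under the same assumptions; `promote` and `demote` stand in for the Agent's real leadership hooks, and the timeout wrapped around the KV watch plays the role of the local staleness timer.
```rust
use std::time::Duration;
use async_nats::jetstream;
use futures::StreamExt;
use tokio::time;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::connect("nats://127.0.0.1:4222").await?;
    let kv = jetstream::new(client).get_key_value("harmony_failover").await?;

    let failover_timeout = Duration::from_millis(5_000); // failover_timeout_ms
    let mut watch = kv.watch("primary_heartbeat").await?;
    let mut is_leader = false;

    loop {
        // Staleness check: waiting for the next heartbeat doubles as the timer reset.
        match time::timeout(failover_timeout, watch.next()).await {
            Ok(Some(Ok(entry))) => {
                // A fresh heartbeat arrived. If we had promoted ourselves,
                // yield back to the recovered Primary.
                if is_leader {
                    demote();
                    is_leader = false;
                }
                println!("heartbeat revision {}", entry.revision);
            }
            Ok(Some(Err(e))) => eprintln!("watch error: {e}"),
            Ok(None) => break, // watch stream closed
            Err(_elapsed) => {
                // No heartbeat within failover_timeout_ms: declare the Primary dead.
                if !is_leader {
                    promote();
                    is_leader = true;
                }
            }
        }
    }
    Ok(())
}

fn promote() {
    println!("primary is stale: promoting replica to leader");
}

fn demote() {
    println!("primary recovered: demoting replica to follower");
}
```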
## Rationale
**Observability as a First-Class Citizen**
By keeping the last 64 heartbeats, we can run `nats kv history` to see the exact timeline. Did the Primary stop suddenly (a crash), or did the heartbeats become erratic and slow before stopping (network congestion)? This data is critical for optimizing the "Micro Data Centers" described in our vision, where internet connections in residential areas may vary in quality.
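For illustration, the same trail can be pulled programmatically; a minimal sketch assuming the `async-nats` KV `history` API (entry field names may differ between client versions), equivalent in spirit to the `nats kv history` command mentioned above.
```rust
use async_nats::jetstream;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), async_nats::Error> {
    let client = async_nats::connect("nats://127.0.0.1:4222").await?;
    let kv = jetstream::new(client).get_key_value("harmony_failover").await?;

    // Walk the retained revisions (up to the bucket's history depth of 64)
    // to reconstruct how the Primary degraded before the failover.
    let mut history = kv.history("primary_heartbeat").await?;
    while let Some(entry) = history.next().await {
        let entry = entry?;
        println!(
            "rev {:>6} at {}: {}",
            entry.revision,
            entry.created,
            String::from_utf8_lossy(&entry.value),
        );
    }
    Ok(())
}
```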
**Energy Efficiency & Resource Optimization**
NationTech aims to "maximize the value of our energy." A "flapping" cluster (constantly failing over and back) wastes immense energy in data re-synchronization and startup costs. By making the `failover_timeout_ms` configurable via `cluster_config`, we can tune a cluster heating a greenhouse to be less sensitive (slower failover is fine) compared to a cluster running a payment gateway.
**Decentralized Trust**
This architecture relies on NATS as the consensus engine. If the Primary is part of the NATS majority, it lives. If it isn't, it dies. This removes ambiguity and allows us to scale to thousands of independent sites without a central "God mode" controller managing every single failover.
## Consequences
**Positive**
* **Auditability:** Every failover event leaves a permanent trace in the KV history.
* **Safety:** The "Write Ack" check on the Primary provides a strong guarantee against Split Brain in `AbsoluteConsistency` mode.
* **Dynamic Tuning:** We can adjust timeouts for specific environments (e.g., high-latency satellite links) by updating a JSON key, requiring no downtime.
**Negative**
* **Storage Overhead:** Keeping history requires marginally more disk space on the NATS servers, though for 64 small JSON payloads, this is negligible.
* **Clock Skew:** While we rely on NATS server-side timestamps for ordering, extreme clock skew on the client side could confuse the debug logs (though not the failover logic itself).
## Alignment with Vision
This architecture supports the NationTech goal of a **"Beautifully Integrated Design."** It takes the complex, high-stakes problem of distributed consensus and wraps it in a mechanism that is robust enough for enterprise banking yet flexible enough to manage a basement server heating a swimming pool. It bridges the gap between the reliability of Web2 clouds and the decentralized nature of Web3 infrastructure.