Architecture Decision Record: Staleness-Based Failover Mechanism & Observability
Status: Proposed Date: 2026-01-09 Follows: 016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md
Context
In ADR 016, we established the Harmony Agent and the Global Orchestration Mesh (powered by NATS JetStream) as the foundation for our decentralized infrastructure. We defined the high-level need for a FailoverStrategy that can support both financial consistency (CP) and AI availability (AP).
However, a specific implementation challenge remains: How do we reliably detect node failure without losing the ability to debug the event later?
Standard distributed systems often use "Key Expiration" (TTL) for heartbeats. If a key disappears, the node is presumed dead. While simple, this approach is catastrophic for post-mortem analysis. When the key expires, the evidence of when and how the failure occurred evaporates.
For NationTech’s vision of Humane Computing—where micro datacenters might be heating a family home or running a local business—reliability and diagnosability are paramount. If a cluster fails over, we owe it to the user to provide a clear, historical log of exactly what happened. We cannot build a "wonderful future for computers" on ephemeral, untraceable errors.
Decision
We will implement a Staleness Detection mechanism rather than a Key Expiration mechanism. We will leverage NATS JetStream Key-Value (KV) stores with History Enabled to create an immutable audit trail of cluster health.
1. The "Black Box" Flight Recorder (NATS Configuration)
We will utilize a persistent NATS KV bucket named harmony_failover.
- Storage: File (Persistent).
- History: Set to 64 (or higher). This allows us to query the last 64 heartbeat entries to visualize the exact degradation of the primary node before failure.
- TTL: None. Data never disappears; it only becomes "stale."
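Assuming the standard nats CLI is installed and pointed at the mesh, the bucket described above could be created along these lines (a sketch; the bucket name matches this ADR, the flags mirror the settings listed):

```shell
# Create a persistent KV bucket with 64 entries of history per key.
# File storage survives server restarts. No TTL is set, so values
# never expire -- they only go stale.
nats kv add harmony_failover --storage=file --history=64
```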
2. Data Structures
We will define two primary schemas to manage the state.
A. The Rules of Engagement (cluster_config)
This persistent key defines the behavior of the mesh. It allows us to tune failover sensitivity dynamically without redeploying the Agent binary.
{
"primary_site_id": "site-a-basement",
"replica_site_id": "site-b-cloud",
"failover_timeout_ms": 5000, // Time before Replica takes over
"heartbeat_interval_ms": 1000 // Frequency of Primary updates
}
Note: The location of this configuration data structure is TBD; see #206.
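As a sketch of how the Agent might parse and sanity-check this key (field names come from the schema above; the `ClusterConfig` class and the timeout-vs-interval check are illustrative assumptions, not part of this ADR):

```python
import json
from dataclasses import dataclass

@dataclass
class ClusterConfig:
    primary_site_id: str
    replica_site_id: str
    failover_timeout_ms: int
    heartbeat_interval_ms: int

    @classmethod
    def from_json(cls, raw: str) -> "ClusterConfig":
        cfg = cls(**json.loads(raw))
        # A timeout at or below the heartbeat interval would cause
        # constant flapping; reject obviously broken configurations.
        if cfg.failover_timeout_ms <= cfg.heartbeat_interval_ms:
            raise ValueError("failover_timeout_ms must exceed heartbeat_interval_ms")
        return cfg

cfg = ClusterConfig.from_json(
    '{"primary_site_id": "site-a-basement", "replica_site_id": "site-b-cloud",'
    ' "failover_timeout_ms": 5000, "heartbeat_interval_ms": 1000}'
)
print(cfg.failover_timeout_ms)  # → 5000
```

Because the key is read at runtime, a change to `failover_timeout_ms` takes effect on the next read without redeploying the Agent.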
B. The Heartbeat (primary_heartbeat)
The Primary writes this; the Replica watches it.
{
"site_id": "site-a-basement",
"status": "HEALTHY",
"counter": 10452,
"timestamp": 1704661549000
}
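A minimal staleness check over this schema might look like the following (the `Heartbeat` class and `is_stale` helper are illustrative; the key point is that the record itself never expires, only this derived judgement changes):

```python
import json
from dataclasses import dataclass

@dataclass
class Heartbeat:
    site_id: str
    status: str
    counter: int
    timestamp: int  # Unix epoch milliseconds, as written by the Primary

def is_stale(hb: Heartbeat, now_ms: int, failover_timeout_ms: int) -> bool:
    """A heartbeat is stale once its age exceeds the failover timeout.

    Unlike TTL expiry, the value stays in the KV store forever; we
    merely re-evaluate its age against the configured timeout.
    """
    return (now_ms - hb.timestamp) > failover_timeout_ms

hb = Heartbeat(**json.loads(
    '{"site_id": "site-a-basement", "status": "HEALTHY",'
    ' "counter": 10452, "timestamp": 1704661549000}'
))
print(is_stale(hb, now_ms=1704661551000, failover_timeout_ms=5000))  # 2 s old → False
print(is_stale(hb, now_ms=1704661555001, failover_timeout_ms=5000))  # 6 s old → True
```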
3. The Failover Algorithm
The Primary (Site A) Logic: The Primary's ability to write to the mesh is its "License to Operate."
- Write Loop: Attempts to write primary_heartbeat every heartbeat_interval_ms.
- Self-Preservation (Fencing): If the write fails (NATS Ack timeout or NATS unreachable), the Primary immediately self-demotes. It assumes it is network-isolated. This prevents Split Brain scenarios where a partitioned Primary continues to accept writes while the Replica promotes itself.
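The fencing rule above can be sketched with the KV write injected as a dependency, so the demotion path is easy to exercise without a live mesh (the `Primary` class and `WriteError` exception are hypothetical names, not the Agent's real API):

```python
from typing import Callable

class WriteError(Exception):
    """Raised when the KV put is not acknowledged (Ack timeout or NATS unreachable)."""

class Primary:
    def __init__(self, write_heartbeat: Callable[[dict], None]) -> None:
        self.write_heartbeat = write_heartbeat
        self.role = "PRIMARY"
        self.counter = 0

    def tick(self, now_ms: int) -> None:
        """One iteration of the write loop, run every heartbeat_interval_ms."""
        if self.role != "PRIMARY":
            return
        self.counter += 1
        try:
            self.write_heartbeat({
                "site_id": "site-a-basement",
                "status": "HEALTHY",
                "counter": self.counter,
                "timestamp": now_ms,
            })
        except WriteError:
            # Fencing: an unacknowledged write means we may be cut off from
            # the NATS majority, so we stop serving immediately rather than
            # risk Split Brain while the Replica promotes itself.
            self.role = "DEMOTED"

def failing_write(_hb: dict) -> None:
    raise WriteError()

node = Primary(failing_write)
node.tick(now_ms=1704661549000)
print(node.role)  # → DEMOTED
```

Note that the Primary never consults a clock to decide its own fate: the write acknowledgement alone is its "License to Operate."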
The Replica (Site B) Logic: The Replica acts as the watchdog.
- Watch: Subscribes to updates on primary_heartbeat.
- Staleness Check: Maintains a local timer. Every time a heartbeat arrives, the timer resets.
- Promotion: If the timer exceeds failover_timeout_ms, the Replica declares the Primary dead and promotes itself to Leader.
- Yielding: If the Replica is Leader but suddenly receives a valid, new heartbeat from the configured primary_site_id (indicating the Primary has recovered), the Replica will voluntarily demote itself to restore the preferred topology.
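The watchdog above can be sketched as a small state machine driven by an explicit clock instead of a real NATS watch (the `Replica` class and its method names are illustrative):

```python
class Replica:
    """Watchdog sketch for the Replica (Site B)."""

    def __init__(self, primary_site_id: str, failover_timeout_ms: int) -> None:
        self.primary_site_id = primary_site_id
        self.failover_timeout_ms = failover_timeout_ms
        self.role = "REPLICA"
        self.last_heartbeat_ms: int | None = None

    def on_heartbeat(self, site_id: str, now_ms: int) -> None:
        if site_id != self.primary_site_id:
            return  # Only the configured Primary's heartbeats count.
        self.last_heartbeat_ms = now_ms  # Staleness Check: reset the timer.
        if self.role == "LEADER":
            # Yielding: the Primary has recovered; restore the topology.
            self.role = "REPLICA"

    def on_tick(self, now_ms: int) -> None:
        if self.role == "REPLICA" and self.last_heartbeat_ms is not None:
            if now_ms - self.last_heartbeat_ms > self.failover_timeout_ms:
                # Promotion: heartbeat is stale; declare the Primary dead.
                self.role = "LEADER"

r = Replica("site-a-basement", failover_timeout_ms=5000)
r.on_heartbeat("site-a-basement", now_ms=0)
r.on_tick(now_ms=3000)
print(r.role)  # within the timeout → REPLICA
r.on_tick(now_ms=6000)
print(r.role)  # heartbeat is 6 s old → LEADER
r.on_heartbeat("site-a-basement", now_ms=7000)
print(r.role)  # Primary recovered, Replica yields → REPLICA
```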
Rationale
Observability as a First-Class Citizen
By keeping the last 64 heartbeats, we can run nats kv history to see the exact timeline. Did the Primary stop suddenly (a crash), or did the heartbeats become erratic and slow before stopping (network congestion)? This data is critical for optimizing the "Micro Data Centers" described in our vision, where internet connections in residential areas may vary in quality.
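For instance, the recovered history could be post-processed to distinguish those two failure modes (the `classify_failure` helper and its 2x-interval threshold are illustrative assumptions, not part of this ADR):

```python
def classify_failure(timestamps_ms: list[int], interval_ms: int = 1000) -> str:
    """Given heartbeat timestamps recovered from the KV history (oldest
    first), guess how the Primary died: steady until it stopped ("crash"),
    or with widening gaps before the end ("degradation")."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if not gaps:
        return "insufficient-data"
    # If any of the final few gaps drifted well past the configured
    # heartbeat interval, the node was struggling before it stopped.
    if any(g > 2 * interval_ms for g in gaps[-3:]):
        return "degradation"
    return "crash"

# Steady 1 s heartbeats that simply stop:
print(classify_failure([0, 1000, 2000, 3000, 4000]))   # → crash
# Heartbeats slowing down before the end:
print(classify_failure([0, 1000, 2100, 4600, 9000]))   # → degradation
```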
Energy Efficiency & Resource Optimization
NationTech aims to "maximize the value of our energy." A "flapping" cluster (constantly failing over and back) wastes immense energy in data re-synchronization and startup costs. By making the failover_timeout_ms configurable via cluster_config, we can tune a cluster heating a greenhouse to be less sensitive (slower failover is fine) compared to a cluster running a payment gateway.
Decentralized Trust
This architecture relies on NATS as the consensus engine. If the Primary is part of the NATS majority, it lives. If it isn't, it dies. This removes ambiguity and allows us to scale to thousands of independent sites without a central "God mode" controller managing every single failover.
Consequences
Positive
- Auditability: Every failover event leaves a permanent trace in the KV history.
- Safety: The "Write Ack" check on the Primary provides a strong guarantee against Split Brain in AbsoluteConsistency mode.
- Dynamic Tuning: We can adjust timeouts for specific environments (e.g., high-latency satellite links) by updating a JSON key, requiring no downtime.
Negative
- Storage Overhead: Keeping history requires marginally more disk space on the NATS servers, though for 64 small JSON payloads, this is negligible.
- Clock Skew: While we rely on NATS server-side timestamps for ordering, extreme clock skew on the client side could confuse the debug logs (though not the failover logic itself).
Alignment with Vision
This architecture supports the NationTech goal of a "Beautifully Integrated Design." It takes the complex, high-stakes problem of distributed consensus and wraps it in a mechanism that is robust enough for enterprise banking yet flexible enough to manage a basement server heating a swimming pool. It bridges the gap between the reliability of Web2 clouds and the decentralized nature of Web3 infrastructure.