Architecture Decision Record: Staleness-Based Failover Mechanism & Observability
Status: Proposed Date: 2026-01-09 Follows: 016-Harmony-Agent-And-Global-Mesh-For-Decentralized-Workload-Management.md
Context
In ADR 016, we established the Harmony Agent and the Global Orchestration Mesh (powered by NATS JetStream) as the foundation for our decentralized infrastructure. We defined the high-level need for a FailoverStrategy that can support both financial consistency (CP) and AI availability (AP).
However, a specific implementation challenge remains: How do we reliably detect node failure without losing the ability to debug the event later?
Standard distributed systems often use "Key Expiration" (TTL) for heartbeats. If a key disappears, the node is presumed dead. While simple, this approach is catastrophic for post-mortem analysis. When the key expires, the evidence of when and how the failure occurred evaporates.
For NationTech’s vision of Humane Computing—where micro datacenters might be heating a family home or running a local business—reliability and diagnosability are paramount. If a cluster fails over, we owe it to the user to provide a clear, historical log of exactly what happened. We cannot build a "wonderful future for computers" on ephemeral, untraceable errors.
Decision
We will implement a Staleness Detection mechanism rather than a Key Expiration mechanism. We will leverage NATS JetStream Key-Value (KV) stores with History Enabled to create an immutable audit trail of cluster health.
1. The "Black Box" Flight Recorder (NATS Configuration)
We will utilize a persistent NATS KV bucket named harmony_failover.
- Storage: File (Persistent).
- History: Set to 64 (or higher). This allows us to query the last 64 heartbeat entries to visualize the exact degradation of the primary node before failure.
- TTL: None. Data never disappears; it only becomes "stale."
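Assuming the standard nats CLI is installed and pointed at the mesh, the bucket described above could be created along these lines (a sketch; the bucket name matches this ADR, the flags mirror the settings listed):

```shell
# Create a persistent KV bucket with 64 entries of history per key.
# File storage survives server restarts. No TTL is set, so values
# never expire -- they only go stale.
nats kv add harmony_failover --storage=file --history=64
```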
2. Data Structures
We will define two primary schemas to manage the state.
A. The Rules of Engagement (cluster_config)
This persistent key defines the behavior of the mesh. It allows us to tune failover sensitivity dynamically without redeploying the Agent binary.
{
"primary_site_id": "site-a-basement",
"replica_site_id": "site-b-cloud",
"failover_timeout_ms": 5000, // Time before Replica takes over
"heartbeat_interval_ms": 1000 // Frequency of Primary updates
}
Note: The location of this configuration data structure is TBD; see #206.
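As a sketch of how the Agent might parse and sanity-check this key (field names come from the schema above; the `ClusterConfig` class and the timeout-vs-interval check are illustrative assumptions, not part of this ADR):

```python
import json
from dataclasses import dataclass

@dataclass
class ClusterConfig:
    primary_site_id: str
    replica_site_id: str
    failover_timeout_ms: int
    heartbeat_interval_ms: int

    @classmethod
    def from_json(cls, raw: str) -> "ClusterConfig":
        cfg = cls(**json.loads(raw))
        # A timeout at or below the heartbeat interval would cause
        # constant flapping; reject obviously broken configurations.
        if cfg.failover_timeout_ms <= cfg.heartbeat_interval_ms:
            raise ValueError("failover_timeout_ms must exceed heartbeat_interval_ms")
        return cfg

cfg = ClusterConfig.from_json(
    '{"primary_site_id": "site-a-basement", "replica_site_id": "site-b-cloud",'
    ' "failover_timeout_ms": 5000, "heartbeat_interval_ms": 1000}'
)
print(cfg.failover_timeout_ms)  # → 5000
```

Because the key is read at runtime, a change to `failover_timeout_ms` takes effect on the next read without redeploying the Agent.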
B. The Heartbeat (primary_heartbeat)
The Primary writes this; the Replica watches it.
{
"site_id": "site-a-basement",
"status": "HEALTHY",
"counter": 10452,
"timestamp": 1704661549000
}
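A minimal staleness check over this schema might look like the following (the `Heartbeat` class and `is_stale` helper are illustrative; the key point is that the record itself never expires, only this derived judgement changes):

```python
import json
from dataclasses import dataclass

@dataclass
class Heartbeat:
    site_id: str
    status: str
    counter: int
    timestamp: int  # Unix epoch milliseconds, as written by the Primary

def is_stale(hb: Heartbeat, now_ms: int, failover_timeout_ms: int) -> bool:
    """A heartbeat is stale once its age exceeds the failover timeout.

    Unlike TTL expiry, the value stays in the KV store forever; we
    merely re-evaluate its age against the configured timeout.
    """
    return (now_ms - hb.timestamp) > failover_timeout_ms

hb = Heartbeat(**json.loads(
    '{"site_id": "site-a-basement", "status": "HEALTHY",'
    ' "counter": 10452, "timestamp": 1704661549000}'
))
print(is_stale(hb, now_ms=1704661551000, failover_timeout_ms=5000))  # 2 s old → False
print(is_stale(hb, now_ms=1704661555001, failover_timeout_ms=5000))  # 6 s old → True
```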
3. The Failover Algorithm
The Primary (Site A) Logic: The Primary's ability to write to the mesh is its "License to Operate."
- Write Loop: Attempts to write primary_heartbeat every heartbeat_interval_ms.
- Self-Preservation (Fencing): If the write fails (NATS Ack timeout or NATS unreachable), the Primary immediately self-demotes. It assumes it is network-isolated. This prevents Split Brain scenarios where a partitioned Primary continues to accept writes while the Replica promotes itself.
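The fencing rule above can be sketched with the KV write injected as a dependency, so the demotion path is easy to exercise without a live mesh (the `Primary` class and `WriteError` exception are hypothetical names, not the Agent's real API):

```python
from typing import Callable

class WriteError(Exception):
    """Raised when the KV put is not acknowledged (Ack timeout or NATS unreachable)."""

class Primary:
    def __init__(self, write_heartbeat: Callable[[dict], None]) -> None:
        self.write_heartbeat = write_heartbeat
        self.role = "PRIMARY"
        self.counter = 0

    def tick(self, now_ms: int) -> None:
        """One iteration of the write loop, run every heartbeat_interval_ms."""
        if self.role != "PRIMARY":
            return
        self.counter += 1
        try:
            self.write_heartbeat({
                "site_id": "site-a-basement",
                "status": "HEALTHY",
                "counter": self.counter,
                "timestamp": now_ms,
            })
        except WriteError:
            # Fencing: an unacknowledged write means we may be cut off from
            # the NATS majority, so we stop serving immediately rather than
            # risk Split Brain while the Replica promotes itself.
            self.role = "DEMOTED"

def failing_write(_hb: dict) -> None:
    raise WriteError()

node = Primary(failing_write)
node.tick(now_ms=1704661549000)
print(node.role)  # → DEMOTED
```

Note that the Primary never consults a clock to decide its own fate: the write acknowledgement alone is its "License to Operate."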
The Replica (Site B) Logic: The Replica acts as the watchdog.
- Watch: Subscribes to updates on primary_heartbeat.
- Staleness Check: Maintains a local timer. Every time a heartbeat arrives, the timer resets.
- Promotion: If the timer exceeds failover_timeout_ms, the Replica declares the Primary dead and promotes itself to Leader.
- Yielding: If the Replica is Leader but suddenly receives a valid, new heartbeat from the configured primary_site_id (indicating the Primary has recovered), the Replica will voluntarily demote itself to restore the preferred topology.
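The watchdog above can be sketched as a small state machine driven by an explicit clock instead of a real NATS watch (the `Replica` class and its method names are illustrative):

```python
class Replica:
    """Watchdog sketch for the Replica (Site B)."""

    def __init__(self, primary_site_id: str, failover_timeout_ms: int) -> None:
        self.primary_site_id = primary_site_id
        self.failover_timeout_ms = failover_timeout_ms
        self.role = "REPLICA"
        self.last_heartbeat_ms: int | None = None

    def on_heartbeat(self, site_id: str, now_ms: int) -> None:
        if site_id != self.primary_site_id:
            return  # Only the configured Primary's heartbeats count.
        self.last_heartbeat_ms = now_ms  # Staleness Check: reset the timer.
        if self.role == "LEADER":
            # Yielding: the Primary has recovered; restore the topology.
            self.role = "REPLICA"

    def on_tick(self, now_ms: int) -> None:
        if self.role == "REPLICA" and self.last_heartbeat_ms is not None:
            if now_ms - self.last_heartbeat_ms > self.failover_timeout_ms:
                # Promotion: heartbeat is stale; declare the Primary dead.
                self.role = "LEADER"

r = Replica("site-a-basement", failover_timeout_ms=5000)
r.on_heartbeat("site-a-basement", now_ms=0)
r.on_tick(now_ms=3000)
print(r.role)  # within the timeout → REPLICA
r.on_tick(now_ms=6000)
print(r.role)  # heartbeat is 6 s old → LEADER
r.on_heartbeat("site-a-basement", now_ms=7000)
print(r.role)  # Primary recovered, Replica yields → REPLICA
```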
Rationale
Observability as a First-Class Citizen
By keeping the last 64 heartbeats, we can run nats kv history to see the exact timeline. Did the Primary stop suddenly (a crash), or did the heartbeats become erratic and slow before stopping (network congestion)? This data is critical for optimizing the "Micro Data Centers" described in our vision, where internet connections in residential areas may vary in quality.
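For instance, the recovered history could be post-processed to distinguish those two failure modes (the `classify_failure` helper and its 2x-interval threshold are illustrative assumptions, not part of this ADR):

```python
def classify_failure(timestamps_ms: list[int], interval_ms: int = 1000) -> str:
    """Given heartbeat timestamps recovered from the KV history (oldest
    first), guess how the Primary died: steady until it stopped ("crash"),
    or with widening gaps before the end ("degradation")."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if not gaps:
        return "insufficient-data"
    # If any of the final few gaps drifted well past the configured
    # heartbeat interval, the node was struggling before it stopped.
    if any(g > 2 * interval_ms for g in gaps[-3:]):
        return "degradation"
    return "crash"

# Steady 1 s heartbeats that simply stop:
print(classify_failure([0, 1000, 2000, 3000, 4000]))   # → crash
# Heartbeats slowing down before the end:
print(classify_failure([0, 1000, 2100, 4600, 9000]))   # → degradation
```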
Energy Efficiency & Resource Optimization
NationTech aims to "maximize the value of our energy." A "flapping" cluster (constantly failing over and back) wastes immense energy in data re-synchronization and startup costs. By making the failover_timeout_ms configurable via cluster_config, we can tune a cluster heating a greenhouse to be less sensitive (slower failover is fine) compared to a cluster running a payment gateway.
Decentralized Trust
This architecture relies on NATS as the consensus engine. If the Primary is part of the NATS majority, it lives. If it isn't, it dies. This removes ambiguity and allows us to scale to thousands of independent sites without a central "God mode" controller managing every single failover.
Consequences
Positive
- Auditability: Every failover event leaves a permanent trace in the KV history.
- Safety: The "Write Ack" check on the Primary provides a strong guarantee against Split Brain in AbsoluteConsistency mode.
- Dynamic Tuning: We can adjust timeouts for specific environments (e.g., high-latency satellite links) by updating a JSON key, requiring no downtime.
Negative
- Storage Overhead: Keeping history requires marginally more disk space on the NATS servers, though for 64 small JSON payloads, this is negligible.
- Clock Skew: While we rely on NATS server-side timestamps for ordering, extreme clock skew on the client side could confuse the debug logs (though not the failover logic itself).
Alignment with Vision
This architecture supports the NationTech goal of a "Beautifully Integrated Design." It takes the complex, high-stakes problem of distributed consensus and wraps it in a mechanism that is robust enough for enterprise banking yet flexible enough to manage a basement server heating a swimming pool. It bridges the gap between the reliability of Web2 clouds and the decentralized nature of Web3 infrastructure.