harmony/docs/adr/017-3-revised-staleness-inspired-by-kubernetes.md
Jean-Gabriel Gill-Couture, 2026-03-22

Context & Use Case

We are implementing a High Availability (HA) Failover Strategy for decentralized "Micro Data Centers." The core challenge is managing stateful workloads (PostgreSQL) over unreliable networks.

We solve this using a Local Fencing First approach, backed by NATS JetStream Strict Ordering for the final promotion authority.

In CAP theorem terms, we are building a CP system, intentionally sacrificing availability. In practical terms, we expect an average of two primary outages per year with a failover delay of around 2 minutes. To be precise, 2 outages * 2 minutes = 4 minutes of downtime per year = 99.99924% uptime, just above five nines.

The Algorithm: Local Fencing & Remote Promotion

The safety (data consistency) of the system relies on the time gap between the Primary giving up (Fencing) and the Replica taking over (Promotion).

To avoid clock skew issues between agents and the datastore (NATS), all timestamp comparisons are done using JetStream metadata. That is, a Harmony agent never calls Instant::now() to get a timestamp; it uses my_last_heartbeat.metadata.timestamp (conceptually).
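A minimal sketch of this rule, with a hypothetical `HeartbeatMsg` standing in for a real JetStream message (the field name `server_timestamp_ms` is illustrative; the real metadata type differs):

```rust
use std::time::Duration;

/// Stand-in for a JetStream message: the only clock we trust is the
/// server-assigned `metadata.timestamp`, never a local Instant::now().
struct HeartbeatMsg {
    /// Server-side publish time, in milliseconds (illustrative encoding).
    server_timestamp_ms: u64,
}

/// Staleness of the Primary as observed purely from datastore metadata.
fn primary_staleness(replica_hb: &HeartbeatMsg, primary_hb: &HeartbeatMsg) -> Duration {
    Duration::from_millis(
        replica_hb
            .server_timestamp_ms
            .saturating_sub(primary_hb.server_timestamp_ms),
    )
}

fn main() {
    let primary = HeartbeatMsg { server_timestamp_ms: 10_000 };
    let replica = HeartbeatMsg { server_timestamp_ms: 16_500 };
    // 6.5s of staleness, computed without consulting any local clock.
    println!("{}", primary_staleness(&replica, &primary).as_millis()); // prints 6500
}
```

Because both timestamps come from the same NATS server clock, agents with skewed local clocks still agree on staleness.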

1. Configuration

  • heartbeat_timeout (e.g., 1s): Max time allowed for a NATS write/DB check.
  • failure_threshold (e.g., 2): Consecutive failures before self-fencing.
  • failover_timeout (e.g., 5s): Time since last NATS update of Primary heartbeat before Replica promotes.
    • This timeout must be carefully configured to allow enough time for the primary to fence itself (after heartbeat_timeout * failure_threshold) BEFORE the replica gets promoted to avoid a split brain with two primaries.
      • Implementing this will rely on the actual deployment configuration. For example, a CNPG based PostgreSQL cluster might require a longer gap (such as 30s) than other technologies.
    • Expires when replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout
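The configuration constraint above can be sketched as a validation check. The struct and field names are illustrative, not the real Harmony config keys; the `safety_margin` of 3s is an arbitrary example value:

```rust
use std::time::Duration;

/// Tunables from section 1 (illustrative names).
struct FailoverConfig {
    heartbeat_timeout: Duration,
    failure_threshold: u32,
    failover_timeout: Duration,
    /// Deployment-specific fencing slack (e.g. ~30s for a CNPG demotion).
    safety_margin: Duration,
}

impl FailoverConfig {
    /// The Primary is guaranteed to have started fencing after
    /// heartbeat_timeout * failure_threshold; the Replica must wait at
    /// least that long, plus the deployment's safety margin, before promoting.
    fn is_split_brain_safe(&self) -> bool {
        self.failover_timeout
            >= self.heartbeat_timeout * self.failure_threshold + self.safety_margin
    }
}

fn main() {
    let cfg = FailoverConfig {
        heartbeat_timeout: Duration::from_secs(1),
        failure_threshold: 2,
        failover_timeout: Duration::from_secs(5),
        safety_margin: Duration::from_secs(3),
    };
    // 5s >= 1s * 2 + 3s holds for the example values.
    println!("{}", cfg.is_split_brain_safe()); // prints true
}
```

With a CNPG-style `safety_margin` of 30s, the same check would reject a 5s `failover_timeout` and force a longer one.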

2. The Primary (Self-Preservation)

The Primary is aggressive about killing itself.

  • It attempts a heartbeat.
  • If the round trip takes longer than heartbeat_timeout, the attempt is cancelled locally: the heartbeat did not make it back in time.
  • This counts as a failure and increments the consecutive_failures counter.
  • If consecutive_failures hits failure_threshold, FENCING occurs immediately: the database is stopped.

This means that the Primary will fence itself after heartbeat_timeout * failure_threshold.
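A minimal sketch of this self-preservation loop, with heartbeat outcomes stubbed out as a boolean instead of real NATS writes (all names are illustrative):

```rust
/// The Primary's self-preservation state (section 2), stubbed for illustration.
struct Primary {
    failure_threshold: u32,
    consecutive_failures: u32,
    fenced: bool,
}

impl Primary {
    /// `succeeded` means the heartbeat round-tripped within heartbeat_timeout;
    /// a late reply is cancelled locally and counted as a failure.
    fn record_heartbeat(&mut self, succeeded: bool) {
        if succeeded {
            // Any success resets the counter (this is what rides out flapping).
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.failure_threshold {
                self.fence(); // stop the database immediately
            }
        }
    }

    fn fence(&mut self) {
        self.fenced = true;
    }
}

fn main() {
    let mut p = Primary { failure_threshold: 2, consecutive_failures: 0, fenced: false };
    // One failure followed by a success does not fence (case 4, flapping)...
    p.record_heartbeat(false);
    p.record_heartbeat(true);
    assert!(!p.fenced);
    // ...but two consecutive failures do (case 2).
    p.record_heartbeat(false);
    p.record_heartbeat(false);
    println!("{}", p.fenced); // prints true
}
```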

3. The Replica (The Watchdog)

The Replica is patient.

  • It watches the NATS stream to measure if replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout
  • It only attempts promotion if the failover_timeout (5s) has passed.
  • Crucial: Careful configuration of the failover_timeout is required. This is the only way to avoid a split brain in case of a network partition where the Primary cannot write its heartbeats in time anymore.
    • In short, failover_timeout should be tuned to be heartbeat_timeout * failure_threshold + safety_margin. This safety_margin will vary by use case. For example, a CNPG cluster may need 30 seconds to demote a Primary to Replica when fencing is triggered, so safety_margin should be at least 30s in that setup.
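The Replica's decision reduces to a single predicate over the two server-side timestamps, sketched here with a hypothetical millisecond encoding of `metadata.timestamp`:

```rust
use std::time::Duration;

/// Replica-side promotion check (section 3): compare only server-assigned
/// heartbeat timestamps, never a local clock.
fn should_promote(replica_ts_ms: u64, primary_ts_ms: u64, failover_timeout: Duration) -> bool {
    let staleness = Duration::from_millis(replica_ts_ms.saturating_sub(primary_ts_ms));
    staleness > failover_timeout
}

fn main() {
    let failover_timeout = Duration::from_secs(5);
    // Primary heartbeat only 3s stale: keep waiting.
    println!("{}", should_promote(13_000, 10_000, failover_timeout)); // prints false
    // 6s stale: past failover_timeout, attempt promotion.
    println!("{}", should_promote(16_000, 10_000, failover_timeout)); // prints true
}
```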

Since we forcibly cancel heartbeat attempts after heartbeat_timeout, we are guaranteed that the Primary will have started the fencing process within heartbeat_timeout * failure_threshold of its last successful heartbeat.

But in a network-split scenario where the failed Primary is still reachable by clients yet can no longer write its heartbeats, there is no way to know whether the demotion has actually completed.

For example, in a CNPG cluster, the failed Primary's agent will attempt to change the CNPG cluster state to read-only. But if anything fails after that attempt (permission error, k8s API failure, CNPG bug, etc.), it is possible that the PostgreSQL instance keeps accepting writes.

While this is not a theoretical failure of the agent's algorithm, it is a practical failure where data corruption occurs.

This can be fixed by detecting the demotion failure and escalating the aggressiveness of the fencing procedure. Harmony being an infrastructure orchestrator, it can easily exert radical measures if given the proper credentials, such as forcibly powering off a server, disconnecting its network in the switch configuration, forcibly killing a pod/container/process, etc.

However, these details are out of scope of this algorithm, as they simply fall under the "fencing procedure".

The implementation of the fencing procedure itself is not relevant. This algorithm's responsibility stops at calling the fencing procedure in the appropriate situation.

4. The Demotion Handshake (Return to Normalcy)

When the original Primary recovers:

  1. It becomes healthy locally but sees current_primary = Replica. It waits.
  2. The Replica (current leader) detects the Original Primary is back (via NATS heartbeats).
  3. Replica performs a Clean Demotion:
    • Stops DB.
    • Writes current_primary = None to NATS.
  4. Original Primary sees current_primary = None and can launch the promotion procedure.

Depending on the implementation, the promotion procedure may require a transition phase. Typically, for a PostgreSQL use case the promoting primary will make sure it has caught up on WAL replication before starting to accept writes.
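The handshake above can be sketched as transitions on the shared `current_primary` key. Modeling it as an `Option` is an illustrative simplification of the real NATS state (node names and function names are hypothetical):

```rust
/// `current_primary` as stored in NATS; None means "no leader, promotion allowed".
type CurrentPrimary = Option<&'static str>;

/// Step 3: the acting leader demotes cleanly by clearing the key
/// (after stopping its database).
fn clean_demotion(_state: CurrentPrimary) -> CurrentPrimary {
    None
}

/// Steps 1 and 4: a recovered node only claims leadership once the key is None.
fn try_promote(state: CurrentPrimary, me: &'static str) -> CurrentPrimary {
    match state {
        None => Some(me),  // safe to start the promotion procedure
        Some(_) => state,  // someone else leads: wait
    }
}

fn main() {
    // 1. The recovered Primary sees the Replica leading and waits.
    let state = Some("replica-1");
    assert_eq!(try_promote(state, "primary-1"), Some("replica-1"));
    // 2-3. The Replica detects the Primary is back and demotes cleanly.
    let state = clean_demotion(state);
    // 4. Now the original Primary can promote (after catching up on WAL).
    println!("{:?}", try_promote(state, "primary-1")); // prints Some("primary-1")
}
```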


Failure Modes & Behavior Analysis

Case 1: Immediate Outage (Power Cut)

  • Primary: Dies instantly. Fencing is implicit (machine is off).
  • Replica: Waits for failover_timeout (5s). Sees staleness. Promotes self.
  • Outcome: Clean failover after 5s.

// TODO detail what happens when the primary comes back up. We will likely have to tie PostgreSQL's lifecycle (liveness/readiness probes) with the agent to ensure it does not come back up as primary.

Case 2: High Network Latency on the Primary (The "Split Brain" Trap)

  • Scenario: Network latency spikes to 5s on the Primary, while the Replica's latency stays below heartbeat_timeout.
  • T=0 to T=2 (Primary): Tries to write. Latency (5s) > Timeout (1s). Fails twice.
  • T=2 (Primary): consecutive_failures = 2. Primary Fences Self. (Service is DOWN).
  • T=2 to T=5 (Cluster): Read-Only Phase. No Primary exists.
  • T=5 (Replica): failover_timeout reached. Replica promotes self.
  • Outcome: Safe failover. The "Read-Only Gap" (T=2 to T=5) ensures no Split Brain occurred.

Case 3: Replica Network Lag (False Positive)

  • Scenario: Replica has high latency, greater than failover_timeout; Primary is fine.
  • Replica: Thinks Primary is dead. Tries to promote by setting cluster_state.current_primary = replica_id.
  • NATS: Rejects the write. The stream's last sequence has advanced past what the lagging Replica observed, because the Primary is still publishing heartbeats successfully.
  • Outcome: Promotion denied. Primary stays leader.

Case 4: Network Instability (Flapping)

  • Scenario: Intermittent packet loss.
  • Primary: Fails 1 heartbeat, succeeds the next. consecutive_failures resets.
  • Replica: Sees a slight delay in updates, but never reaches failover_timeout.
  • Outcome: No Fencing, No Promotion. System rides out the noise.

Contextual notes

  • Clock skew: Tokio relies on monotonic clocks, so tokio::time::sleep(...) is not affected by system clock corrections (such as NTP). But monotonic clocks are known to jump forward in some cases, such as VM live migrations. This could cause a false timeout of a single heartbeat. If failure_threshold = 1, this can mean a false negative on the node's health and a potentially useless demotion.
  • heartbeat_timeout == heartbeat_interval: We intentionally do not provide two separate settings for the timeout before a heartbeat is considered failed and the interval between heartbeats. It could make sense in some configurations where low network latency is required to have a small heartbeat_timeout == 50ms and a larger heartbeat_interval == 2s, but we do not have a practical use case for it yet. And having a timeout larger than the interval does not make sense in any situation we can think of at the moment. So we decided to have a single value for both, which makes the algorithm easier to reason about and implement.