Here are some rough notes on the previous design:
- We found an issue where the primary could flap when network latency is larger than the primary's self-fencing timeout.
- e.g. network latency to get the NATS ack is 30 seconds (extreme, but it can happen), and self-fencing happens after 50 seconds. At second 50 self-fencing occurs, and at second 60 the ack comes in. At this point we reject the ack as already failed because of the timeout, so the fencing stands. But then network latency drops back to 5 seconds and lets one successful heartbeat through; the primary comes back to healthy, and the same thing repeats, so the primary flaps.
- At least this does not cause split brain, since the replica never times out and wins the leadership write: we validate strict write ordering and force consensus on writes.
Also, we were seeing that the implementation became more complex. There are a lot of timers to handle, and that becomes hard to reason about in edge cases.
So, we came up with a slightly different approach, inspired by k8s liveness probes.
We now want to use failure and success threshold counters. However, on the replica side, all we can do is use a timer. The timer we can use is the time since the last primary heartbeat's JetStream metadata timestamp. We could also try to mitigate clock skew by measuring the gap between our internal clock and the JetStream metadata timestamp when writing our own heartbeat (not for now, but worth thinking about, though I feel like it is useless).
So the current working design is this:
configure:
- number of consecutive successes to mark the node as UP
- number of consecutive failures to mark the node as DOWN
- note that successes/failures must be consecutive. One success in a row of failures is enough to keep the service up. This allows for various configuration profiles, from very strict availability to very lenient, depending on the number of failures tolerated and successes required to keep the service up.
- failure_threshold at 100 will let a service fail (or time out) up to 99 heartbeats in a row and stay up
- success_threshold at 100 will not bring a service back up until it has succeeded 100 heartbeats in a row
- failure_threshold at 1 will fail the service at the slightest network latency spike or packet loss
- success_threshold at 1 will bring the service up very quickly and may cause flapping in unstable network conditions
# heartbeat session log
# failure threshold : 3
# success threshold : 2
STATUS UP :
t=1 probe : fail f=1 s=0
t=2 probe : fail f=2 s=0
t=3 probe : ok f=0 s=1
t=4 probe : fail f=1 s=0
Scenario:
- failure threshold = 2
- heartbeat timeout = 1s
- total before fencing = 2 * 1 = 2s
- staleness detection timer = 2 * total before fencing
Can we use this simple multiplication, i.e. the staleness detection timer (the time the replica waits since the last primary heartbeat before promoting itself) is double the time the primary takes before starting the fencing process?
Context
We are designing a Staleness-Based Failover Algorithm for the Harmony Agent. The goal is to manage High Availability (HA) for stateful workloads (like PostgreSQL) across decentralized, variable-quality networks ("Micro Data Centers").
We are moving away from complex, synchronized clocks in favor of a Counter-Based Liveness approach (inspired by Kubernetes probes) for the Primary, and a Time-Based Watchdog for the Replica.
1. The Algorithm
The Primary (Self-Health & Fencing)
The Primary validates its own "License to Operate" via a heartbeat loop.
- Loop: Every `heartbeat_interval` (e.g., 1s), it attempts to write a heartbeat to NATS and check the local DB.
- Counters: It maintains `consecutive_failures` and `consecutive_successes`.
- State Transition:
  - To UNHEALTHY: If `consecutive_failures >= failure_threshold`, the Primary Fences Self (stops DB, releases locks).
  - To HEALTHY: If `consecutive_successes >= success_threshold`, the Primary Un-fences (starts DB, acquires locks).
- Reset Logic: A single success resets the failure counter to 0, and vice versa.
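The counter logic can be sketched as a small state machine (illustrative Rust; names like `Liveness` are ours, not the agent's actual types). The `main` replays the heartbeat session log from the notes above (failure threshold 3, success threshold 2):

```rust
// Minimal sketch of the counter-based liveness state machine.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Unhealthy,
}

struct Liveness {
    failure_threshold: u32,
    success_threshold: u32,
    consecutive_failures: u32,
    consecutive_successes: u32,
    state: Health,
}

impl Liveness {
    fn new(failure_threshold: u32, success_threshold: u32) -> Self {
        Self {
            failure_threshold,
            success_threshold,
            consecutive_failures: 0,
            consecutive_successes: 0,
            state: Health::Healthy,
        }
    }

    /// Record one probe result and return the resulting state.
    fn record(&mut self, ok: bool) -> Health {
        if ok {
            self.consecutive_successes += 1;
            self.consecutive_failures = 0; // a single success resets failures
            if self.consecutive_successes >= self.success_threshold {
                self.state = Health::Healthy; // un-fence: start DB, acquire locks
            }
        } else {
            self.consecutive_failures += 1;
            self.consecutive_successes = 0; // and vice versa
            if self.consecutive_failures >= self.failure_threshold {
                self.state = Health::Unhealthy; // fence self: stop DB, release locks
            }
        }
        self.state
    }
}

fn main() {
    let mut probe = Liveness::new(3, 2);
    assert_eq!(probe.record(false), Health::Healthy); // t=1 fail f=1
    assert_eq!(probe.record(false), Health::Healthy); // t=2 fail f=2
    assert_eq!(probe.record(true), Health::Healthy);  // t=3 ok   f=0 s=1
    assert_eq!(probe.record(false), Health::Healthy); // t=4 fail f=1
    // Three consecutive failures from here would trip the threshold:
    probe.record(false);                              // f=2
    assert_eq!(probe.record(false), Health::Unhealthy); // f=3 -> fence
}
```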
The Replica (Staleness Detection)
The Replica acts as a passive watchdog observing the NATS stream.
- Calculation: It calculates a `MaxStaleness` timeout: `MaxStaleness = (failure_threshold × heartbeat_interval) × SafetyMultiplier`. (We use a SafetyMultiplier of 2 to ensure the Primary has definitely fenced itself before we take over.)
- Action: If `Time.now() - LastPrimaryHeartbeat > MaxStaleness`, the Replica assumes the Primary is dead and Promotes Self.
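The replica-side check reduces to a single comparison (a sketch; `should_promote` is a hypothetical helper, not the agent's API):

```rust
use std::time::Duration;

/// Hypothetical helper: the replica-side staleness check.
fn should_promote(
    time_since_last_primary_heartbeat: Duration,
    failure_threshold: u32,
    heartbeat_interval: Duration,
    safety_multiplier: u32,
) -> bool {
    let max_staleness = heartbeat_interval * failure_threshold * safety_multiplier;
    time_since_last_primary_heartbeat > max_staleness
}

fn main() {
    // failure_threshold = 2, interval = 1s, SafetyMultiplier = 2 -> MaxStaleness = 4s.
    let interval = Duration::from_secs(1);
    assert!(!should_promote(Duration::from_secs(3), 2, interval, 2));
    assert!(should_promote(Duration::from_secs(5), 2, interval, 2));
}
```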
2. Configuration Trade-offs
The separation of success and failure thresholds allows us to tune the "personality" of the cluster.
Scenario A: The "Nervous" Cluster (High Sensitivity)
- Config: `failure_threshold: 1`, `success_threshold: 1`
- Behavior: Fails over immediately upon a single missed packet or slow disk write.
- Pros: Maximum availability for perfect networks.
- Cons: High Flapping Risk. In a residential network, a microwave turning on might cause a failover.
Scenario B: The "Tank" Cluster (High Stability)
- Config: `failure_threshold: 10`, `success_threshold: 1`
- Behavior: The node must be consistently broken for 10 seconds (assuming a 1s interval) to give up.
- Pros: Extremely stable on bad networks (e.g., Starlink, 4G). Ignores transient spikes.
- Cons: Slow Failover. Users experience 10+ seconds of downtime before the Replica even thinks about taking over.
Scenario C: The "Sticky" Cluster (Hysteresis)
- Config: `failure_threshold: 5`, `success_threshold: 5`
- Behavior: Hard to kill, hard to bring back.
- Pros: Prevents "Yo-Yo" effects. If a node fails, it must prove it is really stable (5 clean checks in a row) before re-joining the cluster.
3. Failure Modes & Behavior Analysis
Here is how the algorithm handles specific edge cases:
Case 1: Immediate Outage (Power Cut / Kernel Panic)
- Event: Primary vanishes instantly. No more writes to NATS.
- Primary: Does nothing (it's dead).
- Replica: Sees the `LastPrimaryHeartbeat` timestamp age. Once it crosses `MaxStaleness`, it promotes itself.
- Outcome: Clean failover after the timeout duration.
Case 2: Network Instability (Packet Loss / Jitter)
- Event: The Primary fails to write to NATS for 2 cycles due to Wi-Fi interference, then succeeds on the 3rd.
- Config: `failure_threshold: 5`.
- Primary:
  - t=1: Fail (Counter=1)
  - t=2: Fail (Counter=2)
  - t=3: Success (Counter resets to 0). State remains HEALTHY.
- Replica: Sees a gap in heartbeats, but the timestamp never exceeds `MaxStaleness`.
- Outcome: No downtime, no failover. The system correctly identified this as noise, not failure.
Case 3: High Latency (The "Slow Death")
- Event: Primary is under heavy load; heartbeats take 1.5s to complete (interval is 1s).
- Primary: The `timeout` on the heartbeat logic triggers. `consecutive_failures` rises. Eventually, it hits `failure_threshold` and fences itself to prevent data corruption.
- Replica: Sees the heartbeats stop (or arrive too late). The timestamp ages out.
- Outcome: Primary fences self -> Replica waits for safety buffer -> Replica promotes. Split-brain is avoided because the Primary killed itself before the Replica acted (due to the SafetyMultiplier).
Case 4: Replica Network Partition
- Event: Replica loses internet connection; Primary is fine.
- Replica: Sees `LastPrimaryHeartbeat` age out (because it can't reach NATS). It wants to promote itself.
- Constraint: To promote, the Replica must write to NATS. Since it is partitioned, the NATS write fails.
- Outcome: The Replica remains in Standby (or fails to promote). The Primary continues serving traffic. Cluster integrity is preserved.
Context & Use Case
We are implementing a High Availability (HA) Failover Strategy for decentralized "Micro Data Centers." The core challenge is managing stateful workloads (PostgreSQL) over unreliable networks.
We solve this using a Local Fencing First approach, backed by NATS JetStream Strict Ordering for the final promotion authority.
In CAP theorem terms, we are building a CP system, intentionally sacrificing availability. In practical terms, we expect an average of two primary outages per year, with a failover delay of around 2 minutes each. This translates to an uptime of over five nines: 2 outages × 2 minutes = 4 minutes of downtime per year ≈ 99.99924% uptime.
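A quick sanity check on that arithmetic:

```rust
fn main() {
    // 2 outages/year × ~2 minutes of failover delay each.
    let downtime_minutes_per_year = 2.0 * 2.0;
    let minutes_per_year = 365.0 * 24.0 * 60.0; // 525,600
    let uptime_pct = (1.0 - downtime_minutes_per_year / minutes_per_year) * 100.0;

    // "Five nines" is 99.999%; this lands just above it, at ~99.99924%.
    assert!(uptime_pct > 99.999);
    assert!(uptime_pct < 99.9993);
}
```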
The Algorithm: Local Fencing & Remote Promotion
The safety (data consistency) of the system relies on the time gap between the Primary giving up (Fencing) and the Replica taking over (Promotion).
To avoid clock skew issues between agents and the datastore (NATS), all timestamp comparisons will be done using JetStream metadata. I.e., a Harmony agent will never use `Instant::now()` to get a timestamp; it will use `my_last_heartbeat.metadata.timestamp` (conceptually).
1. Configuration
- `heartbeat_timeout` (e.g., 1s): Max time allowed for a NATS write/DB check.
- `failure_threshold` (e.g., 2): Consecutive failures before self-fencing.
- `failover_timeout` (e.g., 5s): Time since the last NATS update of the Primary heartbeat before the Replica promotes.
  - Expires when `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`.
  - This timeout must be carefully configured to allow enough time for the Primary to fence itself (after `heartbeat_timeout * failure_threshold`) BEFORE the Replica gets promoted, to avoid a split brain with two primaries.
  - Tuning this depends on the actual deployment configuration. For example, a CNPG-based PostgreSQL cluster might require a longer gap (such as 30s) than other technologies.
2. The Primary (Self-Preservation)
The Primary is aggressive about killing itself.
- It attempts a heartbeat.
- If the network latency > `heartbeat_timeout`, the attempt is cancelled locally because the heartbeat did not make it back in time.
- This counts as a failure and increments the `consecutive_failures` counter.
- If `consecutive_failures` hits the threshold, FENCING occurs immediately. The database is stopped.
This means that the Primary will fence itself after `heartbeat_timeout * failure_threshold`.
3. The Replica (The Watchdog)
The Replica is patient.
- It watches the NATS stream to measure whether `replica_heartbeat.metadata.timestamp - primary_heartbeat.metadata.timestamp > failover_timeout`.
- It only attempts promotion if the `failover_timeout` (5s) has passed.
- Crucial: Careful configuration of the `failover_timeout` is required. This is the only way to avoid a split brain in case of a network partition where the Primary cannot write its heartbeats in time anymore.
- In short, `failover_timeout` should be tuned to be `heartbeat_timeout * failure_threshold + safety_margin`. This `safety_margin` will vary by use case. For example, a CNPG cluster may need 30 seconds to demote a Primary to Replica when fencing is triggered, so `safety_margin` should be at least 30s in that setup.
Since we forcibly fail timeouts after `heartbeat_timeout`, we are guaranteed that the Primary will have started the fencing process after `heartbeat_timeout * failure_threshold`.
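The tuning rule can be written down directly (a sketch; `min_failover_timeout` is an illustrative helper, and the 30s figure is the CNPG demotion example from above):

```rust
use std::time::Duration;

/// Illustrative helper: minimum safe failover_timeout for a deployment.
fn min_failover_timeout(
    heartbeat_timeout: Duration,
    failure_threshold: u32,
    safety_margin: Duration,
) -> Duration {
    heartbeat_timeout * failure_threshold + safety_margin
}

fn main() {
    // CNPG-style example: fencing starts after 1s × 2 = 2s,
    // and the demotion itself may take up to 30s.
    let failover_timeout =
        min_failover_timeout(Duration::from_secs(1), 2, Duration::from_secs(30));
    assert_eq!(failover_timeout, Duration::from_secs(32));
    // The replica must always wait longer than the fencing deadline.
    assert!(failover_timeout > Duration::from_secs(1) * 2);
}
```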
But, in a network split scenario where the failed primary is still accessible by clients but cannot write its heartbeat successfully, there is no way to know if the demotion has actually completed.
For example, in a CNPG cluster, the failed Primary agent will attempt to change the CNPG cluster state to read-only. But if anything fails after that attempt (permission error, k8s api failure, CNPG bug, etc) it is possible that the PostgreSQL instance keeps accepting writes.
While this is not a theoretical failure of the agent's algorithm, this is a practical failure where data corruption occurs.
This can be fixed by detecting the demotion failure and escalating the aggressiveness of the fencing procedure. Harmony being an infrastructure orchestrator, it can easily exert radical measures if given the proper credentials, such as forcibly powering off a server, disconnecting its network in the switch configuration, or forcibly killing a pod/container/process.
However, these details are out of scope of this algorithm, as they simply fall under the "fencing procedure".
The implementation of the fencing procedure itself is not relevant. This algorithm's responsibility stops at calling the fencing procedure in the appropriate situation.
4. The Demotion Handshake (Return to Normalcy)
When the original Primary recovers:
- It becomes healthy locally but sees `current_primary = Replica`. It waits.
- The Replica (current leader) detects the Original Primary is back (via NATS heartbeats).
- Replica performs a Clean Demotion:
  - Stops DB.
  - Writes `current_primary = None` to NATS.
- Original Primary sees `current_primary = None` and can launch the promotion procedure.
Depending on the implementation, the promotion procedure may require a transition phase. Typically, for a PostgreSQL use case the promoting primary will make sure it has caught up on WAL replication before starting to accept writes.
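The handshake can be simulated as a tiny transition over the shared `current_primary` value (illustrative only: in reality that value lives in NATS and each step is a consensus-ordered write, not a local variable):

```rust
#[derive(Debug, Clone, PartialEq)]
enum CurrentPrimary {
    Node(String),
    None,
}

fn main() {
    // After a failover, the replica is the current leader.
    let mut current_primary = CurrentPrimary::Node("replica-1".into());

    // The original primary is healthy again, but it is not the leader: it waits.
    assert_ne!(current_primary, CurrentPrimary::Node("primary-1".into()));

    // The replica performs a clean demotion: stop DB, then publish None.
    current_primary = CurrentPrimary::None;

    // The original primary observes None and launches the promotion procedure
    // (e.g., catching up on WAL replication before accepting writes).
    if current_primary == CurrentPrimary::None {
        current_primary = CurrentPrimary::Node("primary-1".into());
    }
    assert_eq!(current_primary, CurrentPrimary::Node("primary-1".into()));
}
```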
Failure Modes & Behavior Analysis
Case 1: Immediate Outage (Power Cut)
- Primary: Dies instantly. Fencing is implicit (machine is off).
- Replica: Waits for `failover_timeout` (5s). Sees staleness. Promotes self.
- Outcome: Clean failover after 5s.
// TODO detail what happens when the primary comes back up. We will likely have to tie PostgreSQL's lifecycle (liveness/readiness probes) with the agent to ensure it does not come back up as primary.
Case 2: High Network Latency on the Primary (The "Split Brain" Trap)
- Scenario: Network latency spikes to 5s on the Primary, well above its `heartbeat_timeout` (1s).
- T=0 to T=2 (Primary): Tries to write. Latency (5s) > Timeout (1s). Fails twice.
- T=2 (Primary): `consecutive_failures` = 2. Primary Fences Self. (Service is DOWN.)
- T=2 to T=5 (Cluster): Read-Only Phase. No Primary exists.
- T=5 (Replica): `failover_timeout` reached. Replica promotes self.
- Outcome: Safe failover. The "Read-Only Gap" (T=2 to T=5) ensures no Split Brain occurred.
Case 3: Replica Network Lag (False Positive)
- Scenario: Replica has high latency, greater than `failover_timeout`; Primary is fine.
- Replica: Thinks the Primary is dead. Tries to promote by setting `cluster_state.current_primary = replica_id`.
- NATS: Rejects the write because the Primary is still updating the sequence numbers successfully.
- Outcome: Promotion denied. Primary stays leader.
Case 4: Network Instability (Flapping)
- Scenario: Intermittent packet loss.
- Primary: Fails 1 heartbeat, succeeds the next. `consecutive_failures` resets.
- Replica: Sees a slight delay in updates, but never reaches `failover_timeout`.
- Outcome: No Fencing, No Promotion. System rides out the noise.
Contextual notes
- Clock skew: Tokio relies on monotonic clocks. This means that `tokio::time::sleep(...)` will not be affected by system clock corrections (such as NTP). But monotonic clocks are known to jump forward in some cases, such as VM live migrations. This could mean a false timeout of a single heartbeat. If `failure_threshold = 1`, this can mean a false negative on the node's health, and a potentially useless demotion.