harmony/harmony_agent/README.md
Sylvain Tremblay 5b04cc96d7
wip: we want to initialize to the right seq number after a restart
2026-02-03 14:50:03 -05:00

TODO

DONE:

  1. store trait subscribe definition missing callback - Fixed with SubscriptionCallback type
  2. data integrity BUG fixed: NATS store now uses JetStream metadata (entry.created, entry.revision)
  3. fix replica workflow not transitioning to "failed" when failure_threshold is exceeded
  4. fix replica workflow to hold copy of cluster state - cluster_state field added to HarmonyAgent
  5. heartbeat metadata now passed to workflow via on_heartbeat_stored() callback
  6. failover_timeout added to AgentConfig
  7. NATS store now detects revision conflicts and returns a SequenceMismatch error
  8. startup reconciliation implemented via on_startup() method

REMAINING:

  • review all code and list implementation issues
  • review both workflows for each state transition
  • Complete replica workflow staleness detection (needs implementation in Watching state)
  • Implement state recovery from Failed state for both workflows
  • Implement subscribe in NATS store with watch() API
  • Implement config validation for failover_timeout constraints
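One possible shape for the missing subscribe mechanism, with a minimal in-memory store standing in for NATS. All names here (SubscriptionCallback, WatchEvent, MemStore) are illustrative; the real implementation would drive the callbacks from the JetStream watch() stream:

```rust
use std::sync::{Arc, Mutex};

// Illustrative event type: what a watch update would carry.
struct WatchEvent {
    key: String,
    value: Vec<u8>,
    revision: u64,
}

// Callback type the store trait is currently missing (hypothetical name).
type SubscriptionCallback = Box<dyn Fn(WatchEvent) + Send + Sync>;

trait KvStore {
    fn subscribe(&mut self, key_prefix: &str, cb: SubscriptionCallback);
}

// Minimal in-memory implementation for illustration only.
struct MemStore {
    subs: Vec<(String, SubscriptionCallback)>,
}

impl MemStore {
    fn new() -> Self {
        Self { subs: Vec::new() }
    }

    // Simulates an update arriving from the store backend.
    fn publish(&self, key: &str, value: &[u8], revision: u64) {
        for (prefix, cb) in &self.subs {
            if key.starts_with(prefix.as_str()) {
                cb(WatchEvent {
                    key: key.to_string(),
                    value: value.to_vec(),
                    revision,
                });
            }
        }
    }
}

impl KvStore for MemStore {
    fn subscribe(&mut self, key_prefix: &str, cb: SubscriptionCallback) {
        self.subs.push((key_prefix.to_string(), cb));
    }
}

fn main() {
    let seen = Arc::new(Mutex::new(Vec::new()));
    let seen2 = Arc::clone(&seen);
    let mut store = MemStore::new();
    // An agent subscribes to cluster_state updates and records revisions.
    store.subscribe("cluster_state", Box::new(move |ev| {
        seen2.lock().unwrap().push(ev.revision);
    }));
    store.publish("cluster_state", b"primary=a", 7);
    assert_eq!(*seen.lock().unwrap(), vec![7]);
}
```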

TODO

  1. store trait subscribe definition missing callback
  2. BUG: data integrity issue: NATS store not actually using JetStream metadata
  3. review all code and list implementation issues
  4. review both workflows for each state transition
  5. fix replica workflow not transitioning to "failed" when failure_threshold is exceeded
  6. fix replica workflow to also hold a copy of the cluster state (the agent itself should probably hold it; every agent should subscribe to the cluster_state object and keep it in memory so workflows can process against it efficiently)

CRITICAL - Data Integrity Issues

  1. NATS Store set_strict doesn't enforce CAS (store/nats.rs)

    • Currently uses put() which overwrites unconditionally
    • Must use update() with revision parameter for proper compare-and-set
    • Without this, concurrent promotion attempts can cause split brain
  2. NATS Store uses local clock instead of JetStream metadata (store/nats.rs)

    • Lines 55, 68: Using SystemTime::now() violates ADR-017-3
    • NATS Entry has .revision and .created fields that must be used
    • This defeats the entire purpose of store-provided timestamps
  3. Heartbeat metadata not passed to ReplicaWorkflow (agent_loop.rs::run_heartbeat_loop)

    • Line ~156: TODO comment confirms missing metadata passing
    • Replica cannot calculate staleness without metadata.timestamp
    • Failover logic is broken
  4. No actual cluster state watching exists

    • Replica workflow declares ClusterState but never updates it
    • No subscription to primary heartbeat or cluster_state key
    • Replica cannot detect primary liveness
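The CAS semantics that set_strict needs (issue 1) can be sketched with a small in-memory stand-in. MockKv and its methods are illustrative, not the async-nats API; only the update-with-revision behavior mirrors JetStream:

```rust
use std::collections::HashMap;

// Hypothetical in-memory stand-in illustrating CAS semantics.
struct MockKv {
    data: HashMap<String, (u64, String)>, // key -> (revision, value)
}

#[derive(Debug, PartialEq)]
enum KvError {
    SequenceMismatch { expected: u64, actual: u64 },
}

impl MockKv {
    fn new() -> Self {
        Self { data: HashMap::new() }
    }

    // Unconditional write, like put(): last writer wins.
    fn put(&mut self, key: &str, value: &str) -> u64 {
        let rev = self.data.get(key).map(|(r, _)| r + 1).unwrap_or(1);
        self.data.insert(key.to_string(), (rev, value.to_string()));
        rev
    }

    // Conditional write, like update(): fails unless the caller holds
    // the current revision. This is what set_strict must do.
    fn update(&mut self, key: &str, value: &str, expected_rev: u64) -> Result<u64, KvError> {
        let actual = self.data.get(key).map(|(r, _)| *r).unwrap_or(0);
        if actual != expected_rev {
            return Err(KvError::SequenceMismatch { expected: expected_rev, actual });
        }
        let rev = actual + 1;
        self.data.insert(key.to_string(), (rev, value.to_string()));
        Ok(rev)
    }
}

fn main() {
    let mut kv = MockKv::new();
    let rev = kv.put("cluster_state", "primary=a");
    // Replica B promotes with the revision it read: succeeds.
    let rev2 = kv.update("cluster_state", "primary=b", rev).unwrap();
    // Replica C races with the now-stale revision: rejected, no split brain.
    assert!(kv.update("cluster_state", "primary=c", rev).is_err());
    assert_eq!(rev2, 2);
}
```

With put(), both promotion attempts would "succeed" and the second writer would silently win, which is exactly the split-brain scenario described above.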

HIGH - Missing Core Functionality

  1. Replica Workflow incomplete - All key logic is TODO:

    • Watching primary staleness (line 114)
    • Promotion attempt (line 118)
    • Original primary recovery detection (line 127)
    • Demotion/handshake (line 131)
  2. Missing replica "Failed" state

    • ReplicaState enum has no Failed variant
    • User's TODO #5 correctly identifies this gap
    • What happens if replica's own heartbeats fail repeatedly?
  3. Primary Workflow incomplete - Key logic missing:

    • No NATS check before recovering from Fenced state (line 95)
    • No NATS check in Yielding state for demotion handshake (line 101)
    • No actual fencing failure handling
  4. Store subscribe not implemented (store/mod.rs)

    • Returns todo!() in NATS implementation
    • No callback mechanism defined in trait
    • Without this, agents cannot react to state changes
  5. Cluster state not tracked centrally

    • User's TODO #6 correctly identifies this
    • Each agent should maintain a local copy of cluster_state
    • No subscription mechanism to update this local copy
  6. No validation of configuration constraints

    • Should validate: failover_timeout > heartbeat_timeout * failure_threshold + safety_margin
    • Invalid config could cause split brain
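The constraint above can be enforced at startup. A sketch, with illustrative struct fields and names:

```rust
use std::time::Duration;

// Illustrative config shape; field names are assumptions.
struct AgentConfig {
    heartbeat_timeout: Duration,
    failure_threshold: u32,
    safety_margin: Duration,
    failover_timeout: Duration,
}

// Enforces: failover_timeout > heartbeat_timeout * failure_threshold + safety_margin.
fn validate(cfg: &AgentConfig) -> Result<(), String> {
    let min = cfg.heartbeat_timeout * cfg.failure_threshold + cfg.safety_margin;
    if cfg.failover_timeout <= min {
        return Err(format!(
            "failover_timeout must exceed {:?} (heartbeat_timeout * failure_threshold + safety_margin)",
            min
        ));
    }
    Ok(())
}

fn main() {
    // 5s * 3 + 2s = 17s minimum; 10s is too short, 30s is safe.
    let mut cfg = AgentConfig {
        heartbeat_timeout: Duration::from_secs(5),
        failure_threshold: 3,
        safety_margin: Duration::from_secs(2),
        failover_timeout: Duration::from_secs(10),
    };
    assert!(validate(&cfg).is_err());
    cfg.failover_timeout = Duration::from_secs(30);
    assert!(validate(&cfg).is_ok());
}
```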

MEDIUM - Incorrect State Transitions

  1. Primary immediately transitions Failed -> Fenced (workflow/primary.rs:120-121)

    • Two state transitions happen in one heartbeat cycle
    • Should stay in Failed until fencing actually completes
    • What if fencing fails? State machine won't reflect it
  2. No fencing failure handling

    • If on_failover() fails, node thinks it's fenced but DB is still accepting writes
    • ADR mentions escalating to radical measures, but no callback for failure
  3. Replica Watching state does nothing

    • Line 115: Just logs, checks nothing
    • Should be checking staleness of primary heartbeat
  4. Demotion handshake not implemented

    • ADR section 4 details this but code doesn't implement it
    • How does original primary know it should yield?

LOW - Observability & Reliability

  1. No graceful shutdown mechanism

    • run_heartbeat_loop runs forever
    • No signal handling (SIGTERM, SIGINT)
  2. Async task errors silently ignored

    • tokio::spawn at lines 74, 83, 123
    • No JoinHandle retention or error handling
  3. No metrics/observability

    • Only log output
    • No Prometheus metrics for state transitions, failure counts, etc.
  4. Hardcoded main() function (agent_loop.rs::main)

    • Not production-ready entry point
    • Should load config from environment or file
  5. Store factory pattern missing

    • TODO comment at line 54 confirms this
    • Can't switch between stores via config
  6. No backoff/retry logic for NATS operations

    • Transient failures could trigger unnecessary fencing
  7. AgentInfo status is hardcoded to "HEALTHY"

    • Line 137 in store_heartbeat
    • Should reflect the actual workflow state
  8. Unused fields in structs

    • HeartbeatState.last_seq set but never read
    • ClusterState.current_primary set but never read

ADR-017-3 Compliance Issues

  1. ADR violation: Clock skew not avoided

    • While ADR says use store metadata, code uses local time
  2. Failover timeout not configurable

    • Defined in ADR but not in AgentConfig
    • Needed for replica staleness calculation
  3. Safety margin concept exists in ADR but not in code

    • Configuration should include this margin
  4. No handling of Case 3 (Replica Network Lag)

    • ADR describes NATS rejection prevention
    • But set_strict implementation accepts any write

Code Quality Issues

  1. Inconsistent error handling

    • Some paths return Err, others todo!(), others ignore
  2. Unnecessary Clone bounds

    • DeploymentConfig.clone() used frequently
    • Could be optimized with Arc
  3. Owned keys in error types

    • KvStore::get returns an owned String key in its error, allocating on every failure; borrowing would avoid this
  4. No integration points mentioned

    • PostgreSQL lifecycle control implementation missing
    • Fencing via CNPG not connected

Production Readiness Checklist Summary

For battle testing preparation, you need:

Immediate (blockers):

  • Fix NATS store metadata usage (issues #1, #2)
  • Implement set_strict with actual CAS (#1)
  • Implement replica primary watching (#4, #5)
  • Add failover_timeout config + staleness logic (#3, #24)
  • Implement subscribe mechanism with callbacks (#8)

High priority:

  • Complete all workflow transitions (#5, #7, #11-14)
  • Add cluster state tracking (#6, #9)
  • Add configuration validation (#10)
  • Add Replica Failed state (#6)

Before deployment:

  • Implement graceful shutdown (#15)
  • Add error handling for spawned tasks (#16)
  • Remove hardcoded main function (#18)
  • Implement store factory (#19)
  • Add Prometheus metrics (#17)

Documentation:

  • Document all configuration parameters and their trade-offs
  • Add runbooks for each failure mode
  • Document battle test scenarios to cover

Addendum: Missing Critical Issues

1. CRITICAL: Heartbeat "Lying" (Data Integrity)

  • Location: agent_loop.rs line 137 inside store_heartbeat.
  • The Bug: status: "HEALTHY".to_string() is hardcoded.
  • The Impact: The agent loop runs regardless of the workflow state. If the Primary transitions to Fenced or Failed, it continues to write a heartbeat saying "I am HEALTHY".
  • The Fix: The store_heartbeat function must accept the current status from the workflow (e.g., self.workflow.status()) to serialize into the JSON. A fenced agent must broadcast "FENCED" or stop writing entirely.

2. CRITICAL: Async Task Race Conditions (State Machine Corruption)

  • Location: workflow/primary.rs lines 74, 83, 123 (tokio::spawn).
  • The Bug: The callbacks (on_active, on_failover) are spawned as fire-and-forget background tasks.
  • Scenario:
    1. Primary fails -> transitions to Fenced -> spawns on_failover (takes 5s).
    2. Network recovers immediately -> transitions to Healthy -> spawns on_active (takes 1s).
    3. on_active finishes before on_failover.
    4. on_failover finishes last, killing the DB after the agent decided it was healthy.
  • The Fix: You need a JoinHandle or a cancellation token. When transitioning states, any pending conflicting background tasks must be aborted before starting the new one.
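The cancellation idea can be sketched with a plain thread and an atomic flag; the real agent would use tokio's AbortHandle or a cancellation token, and the names here are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spawns a long-running callback (e.g. on_failover) that checks a
// cancellation flag between steps. Returns true only if it ran to
// completion and performed its side effect.
fn spawn_cancellable(cancel: Arc<AtomicBool>) -> thread::JoinHandle<bool> {
    thread::spawn(move || {
        for _ in 0..50 {
            if cancel.load(Ordering::SeqCst) {
                return false; // aborted before completing the side effect
            }
            thread::sleep(Duration::from_millis(10));
        }
        true // ran to completion
    })
}

fn main() {
    // Transitioning to a new state must abort the pending conflicting task
    // before spawning the next one, so a late on_failover cannot kill the
    // DB after on_active already declared the node healthy.
    let cancel = Arc::new(AtomicBool::new(false));
    let on_failover = spawn_cancellable(Arc::clone(&cancel));
    cancel.store(true, Ordering::SeqCst); // state changed: abort old task
    assert_eq!(on_failover.join().unwrap(), false);
}
```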

3. CRITICAL: Zombie Leader Prevention (Split Brain Risk)

  • Location: agent_loop.rs loop logic.
  • The Bug: There is no "Stop the World" gate.
  • Scenario: If store_heartbeat fails (NATS unreachable), the code returns Err, triggers handle_heartbeat_failure, and the loop continues.
  • The Risk: If the NATS write fails because of a CAS error (meaning a Replica has already promoted), this Primary is now a Zombie. It must immediately cease all operations. The current loop just sleeps and tries again.
  • The Fix: If store_heartbeat returns a SequenceMismatch error, the agent must treat this as a fatal demotion event, immediately fencing itself, rather than just incrementing a failure counter.
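Classifying store errors makes the fatal path explicit. A sketch with illustrative enum names:

```rust
#[derive(Debug, PartialEq)]
enum StoreError {
    SequenceMismatch, // CAS lost: another node advanced the cluster state
    Transient,        // network blip, timeout, etc.
}

#[derive(Debug, PartialEq)]
enum LoopAction {
    Continue,         // heartbeat stored, keep going
    CountFailure,     // transient error: increment the failure counter
    FenceImmediately, // CAS lost: we have been replaced, fence now
}

// A SequenceMismatch is not a retryable failure: it means a replica has
// already promoted, so this primary must fence itself immediately rather
// than sleep and retry.
fn classify(result: Result<(), StoreError>) -> LoopAction {
    match result {
        Ok(()) => LoopAction::Continue,
        Err(StoreError::SequenceMismatch) => LoopAction::FenceImmediately,
        Err(StoreError::Transient) => LoopAction::CountFailure,
    }
}

fn main() {
    assert_eq!(classify(Ok(())), LoopAction::Continue);
    assert_eq!(classify(Err(StoreError::Transient)), LoopAction::CountFailure);
    assert_eq!(
        classify(Err(StoreError::SequenceMismatch)),
        LoopAction::FenceImmediately
    );
}
```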

4. HIGH: NATS Bucket Name Collision

  • Location: agent_loop.rs (Config) vs store/nats.rs.
  • The Bug: FailoverCNPGConfig has cnpg_cluster_name, and AgentConfig has cluster_id.
  • The Impact: If you run two different Harmony clusters on the same NATS server, and they use the same bucket name logic (or hardcoded names), they will overwrite each other's state.
  • The Fix: The NATS KV bucket name must be namespaced dynamically, e.g., format!("harmony_{}", config.cluster_id).
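The namespacing fix is a one-liner; sketched here as a helper function (the name is illustrative):

```rust
// Derive the KV bucket name from the cluster id so two Harmony clusters
// sharing one NATS server cannot overwrite each other's state.
fn bucket_name(cluster_id: &str) -> String {
    format!("harmony_{}", cluster_id)
}

fn main() {
    assert_eq!(bucket_name("prod"), "harmony_prod");
    assert_ne!(bucket_name("prod"), bucket_name("staging"));
}
```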

5. HIGH: Startup State Reconciliation

  • Location: HarmonyAgent::new.
  • The Bug: Agents always start in Initializing.
  • Scenario: The process crashes while it is the Leader. It restarts. It enters Initializing. It doesn't know it should be the leader.
  • The Impact: The cluster might be leaderless until the failover_timeout expires, causing unnecessary downtime.
  • The Fix: On startup, the agent must fetch the ClusterState from NATS. If current_primary == my_id, it should jump directly to Healthy/Leader state (possibly after a sanity check).
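The reconciliation decision is a pure function of the fetched cluster state. A sketch with illustrative types:

```rust
#[derive(Debug, PartialEq)]
enum StartState {
    Initializing,
    Healthy, // resume leadership (after a sanity check)
}

// Illustrative subset of the cluster state fetched from NATS on boot.
struct ClusterState {
    current_primary: Option<String>,
}

// If the stored state says we were the primary before the crash, skip
// Initializing and resume as leader instead of waiting out failover_timeout.
fn initial_state(cluster: Option<&ClusterState>, my_id: &str) -> StartState {
    match cluster {
        Some(cs) if cs.current_primary.as_deref() == Some(my_id) => StartState::Healthy,
        _ => StartState::Initializing,
    }
}

fn main() {
    let cs = ClusterState { current_primary: Some("node-a".to_string()) };
    assert_eq!(initial_state(Some(&cs), "node-a"), StartState::Healthy);
    assert_eq!(initial_state(Some(&cs), "node-b"), StartState::Initializing);
    assert_eq!(initial_state(None, "node-a"), StartState::Initializing);
}
```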

Summary of Tasks to Add

Please add these to your master list before starting implementation:

  1. Dynamic Heartbeat Status: Pass workflow state to store_heartbeat to prevent Fenced nodes from reporting "HEALTHY".
  2. Async Task Cancellation: Implement AbortHandle for on_active/on_failover tasks to prevent race conditions during rapid state flapping.
  3. Fatal CAS Handling: Treat SequenceMismatch in store_heartbeat as an immediate "I have been replaced" signal (Zombie detection).
  4. NATS Namespace Isolation: Ensure KV bucket names include cluster_id.
  5. Startup Reconciliation: Check NATS on boot to restore previous state if valid.
  6. Vacuum / Stop-the-World: think about vacuum and other stop-the-world operations.