TODO
DONE:
- ✅ store trait subscribe definition missing callback - Fixed with SubscriptionCallback type
- ✅ BUG: data integrity issue: nats store now using jetstream metadata (entry.created, entry.revision)
- ✅ fix replica workflow not transitioning to "failed" when failure_threshold is exceeded
- ✅ fix replica workflow to hold copy of cluster state - cluster_state field added to HarmonyAgent
- ✅ heartbeat metadata now passed to workflow via on_heartbeat_stored() callback
- ✅ failover_timeout added to AgentConfig
- ✅ NATS store properly detects SequenceMismatch and returns SequenceMismatch error
- ✅ startup reconciliation implemented via on_startup() method
REMAINING:
- review all code and list implementation issues
- review both workflows for each state transition
- Complete replica workflow staleness detection (needs implementation in Watching state)
- Implement state recovery from Failed state for both workflows
- Implement subscribe in NATS store with watch() API
- Implement config validation for failover_timeout constraints
CRITICAL - Data Integrity Issues
1. NATS Store `set_strict` doesn't enforce CAS (`store/nats.rs`)
   - Currently uses `put()`, which overwrites unconditionally
   - Must use `update()` with a revision parameter for proper compare-and-set
   - Without this, concurrent promotion attempts can cause split brain
2. NATS Store uses the local clock instead of JetStream metadata (`store/nats.rs`)
   - Lines 55, 68: using `SystemTime::now()` violates ADR-017-3
   - The NATS `Entry` has `.revision` and `.created` fields that must be used
   - This defeats the entire purpose of store-provided timestamps
3. Heartbeat metadata not passed to ReplicaWorkflow (`agent_loop.rs::run_heartbeat_loop`)
   - Line ~156: a TODO comment confirms the missing metadata passing
   - The replica cannot calculate staleness without `metadata.timestamp`
   - Failover logic is broken
4. No actual cluster state watching exists
   - The replica workflow declares `ClusterState` but never updates it
   - No subscription to the primary heartbeat or the `cluster_state` key
   - The replica cannot detect primary liveness
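The CAS contract that `set_strict` needs (issue #1) can be illustrated with an in-memory stand-in. `MemStore`, its method shapes, and the error variant fields here are hypothetical; the real implementation would call NATS KV `update()` with the entry's revision rather than `put()`:

```rust
use std::collections::HashMap;

// Hypothetical error type mirroring what set_strict should return.
#[derive(Debug, PartialEq)]
enum StoreError {
    SequenceMismatch { expected: u64, actual: u64 },
}

// In-memory stand-in for a revisioned KV bucket (NATS JetStream KV
// assigns a monotonically increasing revision to every key update).
struct MemStore {
    entries: HashMap<String, (Vec<u8>, u64)>, // key -> (value, revision)
}

impl MemStore {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Compare-and-set: the write only succeeds if the caller's expected
    // revision matches the stored one. This is what set_strict must do
    // (via NATS update()) instead of an unconditional put().
    fn set_strict(&mut self, key: &str, value: &[u8], expected: u64) -> Result<u64, StoreError> {
        let current = self.entries.get(key).map(|(_, rev)| *rev).unwrap_or(0);
        if current != expected {
            return Err(StoreError::SequenceMismatch { expected, actual: current });
        }
        let next = current + 1;
        self.entries.insert(key.to_string(), (value.to_vec(), next));
        Ok(next)
    }
}
```

With this semantics, two replicas racing to promote with the same expected revision cannot both win: the second write fails with `SequenceMismatch`, which is exactly the split-brain guard the current `put()`-based code lacks.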
HIGH - Missing Core Functionality
5. Replica Workflow incomplete - all key logic is TODO:
   - Watching primary staleness (line 114)
   - Promotion attempt (line 118)
   - Original primary recovery detection (line 127)
   - Demotion/handshake (line 131)
6. Missing replica "Failed" state
   - The `ReplicaState` enum has no `Failed` variant
   - User's TODO #5 correctly identifies this gap
   - What happens if the replica's own heartbeats fail repeatedly?
7. Primary Workflow incomplete - key logic missing:
   - No NATS check before recovering from the `Fenced` state (line 95)
   - No NATS check in the `Yielding` state for the demotion handshake (line 101)
   - No actual fencing failure handling
8. Store `subscribe` not implemented (`store/mod.rs`)
   - Returns `todo!()` in the NATS implementation
   - No callback mechanism defined in the trait
   - Without this, agents cannot react to state changes
9. Cluster state not tracked centrally
   - User's TODO #6 correctly identifies this
   - Each agent should maintain a local copy of `cluster_state`
   - No subscription mechanism to update this local copy
10. No validation of configuration constraints
   - Should validate: `failover_timeout > heartbeat_timeout * failure_threshold + safety_margin`
   - An invalid config could cause split brain
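The constraint in #10 is cheap to enforce at config-load time. This sketch assumes hypothetical field names for the timing parameters (the real `AgentConfig` may differ):

```rust
use std::time::Duration;

// Hypothetical view of the relevant timing fields; names are
// illustrative, not the actual AgentConfig definition.
struct TimingConfig {
    heartbeat_timeout: Duration,
    failure_threshold: u32,
    failover_timeout: Duration,
    safety_margin: Duration,
}

impl TimingConfig {
    // A replica must wait strictly longer than the primary's worst-case
    // failure-detection window (heartbeat_timeout * failure_threshold)
    // plus a safety margin before promoting, or two nodes can briefly
    // both believe they are the leader.
    fn validate(&self) -> Result<(), String> {
        let detection_window =
            self.heartbeat_timeout * self.failure_threshold + self.safety_margin;
        if self.failover_timeout <= detection_window {
            return Err(format!(
                "failover_timeout ({:?}) must exceed heartbeat_timeout * failure_threshold + safety_margin ({:?})",
                self.failover_timeout, detection_window
            ));
        }
        Ok(())
    }
}
```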
MEDIUM - Incorrect State Transitions
11. Primary immediately transitions `Failed -> Fenced` (workflow/primary.rs:120-121)
   - Two state transitions happen in one heartbeat cycle
   - Should stay in `Failed` until fencing actually completes
   - What if fencing fails? The state machine won't reflect it
12. No fencing failure handling
   - If `on_failover()` fails, the node thinks it is fenced but the DB is still accepting writes
   - The ADR mentions escalating to radical measures, but there is no callback for failure
13. Replica `Watching` state does nothing
   - Line 115: just logs, checks nothing
   - Should be checking the staleness of the primary heartbeat
14. Demotion handshake not implemented
   - ADR section 4 details this, but the code doesn't implement it
   - How does the original primary know it should yield?
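Once store-provided timestamps are available (issue #2), the staleness check missing from the `Watching` state (#13) reduces to a small pure function. The names here are illustrative:

```rust
use std::time::Duration;

// Staleness check for the replica's Watching state. Both timestamps
// must come from the store (JetStream entry.created), never from the
// local clock, so clock skew between nodes cannot distort the
// comparison, per ADR-017-3.
fn primary_is_stale(
    last_heartbeat_at: Duration, // store timestamp of the primary's last heartbeat
    now: Duration,               // store timestamp observed on the replica's own write
    failover_timeout: Duration,
) -> bool {
    now.checked_sub(last_heartbeat_at)
        .map(|age| age > failover_timeout)
        // A heartbeat "from the future" means reordered reads, not staleness.
        .unwrap_or(false)
}
```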
LOW - Observability & Reliability
15. No graceful shutdown mechanism
   - `run_heartbeat_loop` runs forever
   - No signal handling (SIGTERM, SIGINT)
16. Async task errors silently ignored
   - `tokio::spawn` at lines 74, 83, 123
   - No `JoinHandle` retention or error handling
17. No metrics/observability
   - Only log output
   - No Prometheus metrics for state transitions, failure counts, etc.
18. Hardcoded main() function (`agent_loop.rs::main`)
   - Not a production-ready entry point
   - Should load config from environment or file
19. Store factory pattern missing
   - TODO comment at line 54 confirms this
   - Can't switch between stores via config
20. No backoff/retry logic for NATS operations
   - Transient failures could trigger unnecessary fencing
21. `AgentInfo` status is hardcoded to "HEALTHY"
   - Line 137 in `store_heartbeat`
   - Should reflect the actual workflow state
22. Unused fields in structs
   - `HeartbeatState.last_seq` is set but never read
   - `ClusterState.current_primary` is set but never read
ADR-017-3 Compliance Issues
23. ADR violation: clock skew not avoided
   - While the ADR says to use store metadata, the code uses local time
24. Failover timeout not configurable
   - Defined in the ADR but not in `AgentConfig`
   - Needed for replica staleness calculation
25. Safety margin concept exists in the ADR but not in the code
   - Configuration should include this margin
26. No handling of Case 3 (Replica Network Lag)
   - The ADR describes NATS rejection prevention
   - But the `set_strict` implementation accepts any write
Code Quality Issues
27. Inconsistent error handling
   - Some paths return `Err`, others `todo!()`, others ignore errors
28. Unnecessary `Clone` bounds
   - `DeploymentConfig.clone()` is used frequently
   - Could be optimized with `Arc`
29. Missing lifetime annotations
   - `KvStore::get` returns a `String` key in errors - inefficient
30. No integration points mentioned
   - PostgreSQL lifecycle control implementation missing
   - Fencing via CNPG not connected
Production Readiness Checklist Summary
For battle testing preparation, you need:
Immediate (blockers):
- Fix NATS store metadata usage (issues #1, #2)
- Implement `set_strict` with actual CAS (#1)
- Implement replica primary watching (#4, #5)
- Add `failover_timeout` config + staleness logic (#3, #24)
- Implement the subscribe mechanism with callbacks (#8)
High priority:
- Complete all workflow transitions (#5, #7, #11-14)
- Add cluster state tracking (#6, #9)
- Add configuration validation (#10)
- Add the replica `Failed` state (#6)
Before deployment:
- Implement graceful shutdown (#15)
- Add error handling for spawned tasks (#16)
- Remove the hardcoded main function (#18)
- Implement the store factory (#19)
- Add Prometheus metrics (#17)
Documentation:
- Document all configuration parameters and their trade-offs
- Add runbooks for each failure mode
- Document the battle test scenarios to cover
Addendum: Missing Critical Issues
1. CRITICAL: Heartbeat "Lying" (Data Integrity)
   - Location: `agent_loop.rs` line 137, inside `store_heartbeat`.
   - The Bug: `status: "HEALTHY".to_string()` is hardcoded.
   - The Impact: The agent loop runs regardless of the workflow state. If the Primary transitions to `Fenced` or `Failed`, it continues to write a heartbeat saying "I am HEALTHY".
   - The Fix: The `store_heartbeat` function must accept the current status from the workflow (e.g., `self.workflow.status()`) to serialize into the JSON. A fenced agent must broadcast "FENCED" or stop writing entirely.
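A minimal sketch of that fix, with a hypothetical `PrimaryState` enum standing in for the real workflow states: the heartbeat status string is derived from the state machine, and a fenced agent returns `None` to signal that no heartbeat should be written at all (one of the two options named above).

```rust
// Hypothetical mirror of the primary workflow states; the real enum
// lives in workflow/primary.rs and may have more variants.
#[derive(Clone, Copy)]
enum PrimaryState {
    Healthy,
    Failed,
    Fenced,
}

// Derive the heartbeat status from the state machine instead of the
// hardcoded "HEALTHY". None means "do not write a heartbeat at all".
fn heartbeat_status(state: PrimaryState) -> Option<&'static str> {
    match state {
        PrimaryState::Healthy => Some("HEALTHY"),
        PrimaryState::Failed => Some("FAILED"),
        // A fenced node must stop advertising itself entirely.
        PrimaryState::Fenced => None,
    }
}
```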
2. CRITICAL: Async Task Race Conditions (State Machine Corruption)
   - Location: `workflow/primary.rs` lines 74, 83, 123 (`tokio::spawn`).
   - The Bug: The callbacks (`on_active`, `on_failover`) are spawned as fire-and-forget background tasks.
   - Scenario:
     - Primary fails -> transitions to `Fenced` -> spawns `on_failover` (takes 5s).
     - Network recovers immediately -> transitions to `Healthy` -> spawns `on_active` (takes 1s).
     - `on_active` finishes before `on_failover`; `on_failover` finishes last, killing the DB after the agent decided it was healthy.
   - The Fix: You need a `JoinHandle` or a cancellation token. When transitioning states, any pending conflicting background tasks must be aborted before starting the new one.
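The cancel-before-respawn rule can be sketched without tokio. This std-only stand-in uses an `AtomicBool` as a cancellation flag (with tokio you would retain the `JoinHandle`/`AbortHandle` instead); all names here are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// One guard per in-flight state callback. Transitioning to a new state
// cancels the old guard before spawning the next callback, so a stale
// on_failover can no longer apply its side effects after on_active ran.
struct CallbackGuard {
    cancelled: Arc<AtomicBool>,
}

impl CallbackGuard {
    fn new() -> Self {
        Self { cancelled: Arc::new(AtomicBool::new(false)) }
    }

    // Token handed to the spawned task; it checks this before acting.
    fn token(&self) -> Arc<AtomicBool> {
        Arc::clone(&self.cancelled)
    }

    // Called on every state transition to invalidate the old callback.
    fn cancel(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }
}

// A long-running callback applies its effect only if not cancelled.
// `effect` stands in for the real side effect (e.g., stopping the DB).
fn finish_callback(token: &AtomicBool, effect: &mut Vec<&'static str>, name: &'static str) {
    if !token.load(Ordering::SeqCst) {
        effect.push(name);
    }
}
```

In the real agent the check would sit at the end of the spawned `on_failover` task; tokio's `AbortHandle::abort()` is the stronger option because it also stops the task mid-flight.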
3. CRITICAL: Zombie Leader Prevention (Split Brain Risk)
   - Location: `agent_loop.rs` loop logic.
   - The Bug: There is no "Stop the World" gate.
   - Scenario: If `store_heartbeat` fails (NATS unreachable), the code returns `Err`, triggers `handle_heartbeat_failure`, and the loop continues.
   - The Risk: If the NATS write fails because of a CAS error (meaning a Replica has already promoted), this Primary is now a Zombie. It must immediately cease all operations. The current loop just sleeps and tries again.
   - The Fix: If `store_heartbeat` returns a `SequenceMismatch` error, the agent must treat this as a fatal demotion event, immediately fencing itself rather than just incrementing a failure counter.
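The decision logic might look like this sketch. The error and action names are hypothetical; the point is that `SequenceMismatch` routes to fencing, never to the retry counter:

```rust
// Hypothetical store error and the decision it should drive in the
// heartbeat loop. A SequenceMismatch means another node has advanced
// the revision (i.e., a replica promoted), so it is not retryable.
#[derive(Debug, PartialEq)]
enum StoreError {
    SequenceMismatch,
    Unreachable,
}

#[derive(Debug, PartialEq)]
enum HeartbeatAction {
    Retry,     // transient failure: count it, back off, try again
    FenceSelf, // fatal: we have been replaced, stop the world
}

fn on_heartbeat_error(err: &StoreError) -> HeartbeatAction {
    match err {
        StoreError::SequenceMismatch => HeartbeatAction::FenceSelf,
        StoreError::Unreachable => HeartbeatAction::Retry,
    }
}
```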
4. HIGH: NATS Bucket Name Collision
   - Location: `agent_loop.rs` (Config) vs `store/nats.rs`.
   - The Bug: `FailoverCNPGConfig` has `cnpg_cluster_name`, and `AgentConfig` has `cluster_id`.
   - The Impact: If you run two different Harmony clusters on the same NATS server and they use the same bucket name logic (or hardcoded names), they will overwrite each other's state.
   - The Fix: The NATS KV bucket name must be namespaced dynamically, e.g., `format!("harmony_{}", config.cluster_id)`.
5. HIGH: Startup State Reconciliation
   - Location: `HarmonyAgent::new`.
   - The Bug: Agents always start in `Initializing`.
   - Scenario: The process crashes while it is the Leader. It restarts. It enters `Initializing`. It doesn't know it should be the leader.
   - The Impact: The cluster might be leaderless until the `failover_timeout` expires, causing unnecessary downtime.
   - The Fix: On startup, the agent must fetch the `ClusterState` from NATS. If `current_primary == my_id`, it should jump directly to the `Healthy`/`Leader` state (possibly after a sanity check).
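That startup decision can be isolated as a pure function over the fetched cluster state (names illustrative; the real agent would add the sanity check before resuming leadership):

```rust
// Hypothetical initial-state decision: given the ClusterState fetched
// from NATS on boot, pick where the workflow starts instead of always
// beginning in Initializing.
#[derive(Debug, PartialEq)]
enum StartState {
    Initializing,
    Leader, // resume leadership (after a sanity check in the real agent)
}

fn initial_state(current_primary: Option<&str>, my_id: &str) -> StartState {
    match current_primary {
        // We were the recorded primary before the crash: resume.
        Some(primary) if primary == my_id => StartState::Leader,
        // Someone else leads, or no state exists yet: start fresh.
        _ => StartState::Initializing,
    }
}
```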
Summary of Tasks to Add
Please add these to your master list before starting implementation:
- Dynamic Heartbeat Status: pass the workflow state to `store_heartbeat` to prevent fenced nodes from reporting "HEALTHY".
- Async Task Cancellation: implement `AbortHandle` for `on_active`/`on_failover` tasks to prevent race conditions during rapid state flapping.
- Fatal CAS Handling: treat `SequenceMismatch` in `store_heartbeat` as an immediate "I have been replaced" signal (Zombie detection).
- NATS Namespace Isolation: ensure KV bucket names include `cluster_id`.
- Startup Reconciliation: check NATS on boot to restore the previous state if valid.
- Think about vacuum / stop-the-world operations.