harmony/harmony_agent/README.md
## TODO
DONE:
1. ✅ store trait subscribe definition missing callback - Fixed with SubscriptionCallback type
2. ✅ BUG: data integrity issue: nats store now using jetstream metadata (entry.created, entry.revision)
3. ✅ fix replica workflow not transitioning to "failed" when failure_threshold is exceeded
4. ✅ fix replica workflow to hold copy of cluster state - cluster_state field added to HarmonyAgent
5. ✅ heartbeat metadata now passed to workflow via on_heartbeat_stored() callback
6. ✅ failover_timeout added to AgentConfig
7. ✅ NATS store properly detects SequenceMismatch and returns SequenceMismatch error
8. ✅ startup reconciliation implemented via on_startup() method
REMAINING:
- review all code and list implementation issues
- review both workflows for each state transition
- Complete replica workflow staleness detection (needs implementation in Watching state)
- Implement state recovery from Failed state for both workflows
- Implement subscribe in NATS store with watch() API
- Implement config validation for failover_timeout constraints
## CRITICAL - Data Integrity Issues
1. **NATS Store `set_strict` doesn't enforce CAS** (`store/nats.rs`)
- Currently uses `put()` which overwrites unconditionally
- Must use `update()` with revision parameter for proper compare-and-set
- Without this, concurrent promotion attempts can cause split brain
2. **NATS Store uses local clock instead of JetStream metadata** (`store/nats.rs`)
- Lines 55, 68: Using `SystemTime::now()` violates ADR-017-3
- NATS Entry has `.revision` and `.created` fields that must be used
- This defeats the entire purpose of store-provided timestamps
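The compare-and-set semantics that `set_strict` needs can be sketched against an in-memory stand-in. In the real NATS store this would be JetStream's `update()` with the expected revision; the `Entry`, `StoreError`, and `MemStore` names here are illustrative:

```rust
use std::collections::HashMap;
use std::time::SystemTime;

// Illustrative stand-ins for the store types. With JetStream, `revision` and
// `created` are assigned by the server; agents must read them back instead of
// calling SystemTime::now() locally.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub value: String,
    pub revision: u64,
    pub created: SystemTime,
}

#[derive(Debug, PartialEq)]
pub enum StoreError {
    SequenceMismatch { expected: u64, actual: u64 },
}

#[derive(Default)]
pub struct MemStore {
    entries: HashMap<String, Entry>,
}

impl MemStore {
    /// Compare-and-set: the write succeeds only if the caller's expected
    /// revision matches the stored one; otherwise SequenceMismatch, which is
    /// what prevents two concurrent promotions from both "winning".
    pub fn set_strict(
        &mut self,
        key: &str,
        value: String,
        expected_revision: u64,
    ) -> Result<u64, StoreError> {
        let current = self.entries.get(key).map(|e| e.revision).unwrap_or(0);
        if current != expected_revision {
            return Err(StoreError::SequenceMismatch {
                expected: expected_revision,
                actual: current,
            });
        }
        let revision = current + 1;
        // The store (playing the server's role here) assigns both the new
        // revision and the timestamp.
        self.entries.insert(
            key.to_string(),
            Entry { value, revision, created: SystemTime::now() },
        );
        Ok(revision)
    }
}
```

An unconditional `put()` has no `expected_revision` argument at all, which is exactly why it cannot implement these semantics.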
3. **Heartbeat metadata not passed to ReplicaWorkflow** (`agent_loop.rs::run_heartbeat_loop`)
- Line ~156: TODO comment confirms missing metadata passing
- Replica cannot calculate staleness without metadata.timestamp
- Failover logic is broken
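Once the metadata flows through, the staleness check itself is a pure function. A minimal sketch, assuming the replica receives the store-provided heartbeat timestamp (the `primary_is_stale` name is illustrative):

```rust
use std::time::{Duration, SystemTime};

/// True when the primary's last heartbeat, as timestamped by the store
/// (never by the writer's local clock), is older than failover_timeout.
pub fn primary_is_stale(
    last_heartbeat: SystemTime,
    now: SystemTime,
    failover_timeout: Duration,
) -> bool {
    match now.duration_since(last_heartbeat) {
        Ok(age) => age > failover_timeout,
        // A heartbeat "from the future" indicates clock skew, not staleness.
        Err(_) => false,
    }
}
```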
4. **No actual cluster state watching exists**
- Replica workflow declares `ClusterState` but never updates it
- No subscription to primary heartbeat or cluster_state key
- Replica cannot detect primary liveness
## HIGH - Missing Core Functionality
5. **Replica Workflow incomplete** - All key logic is TODO:
- Watching primary staleness (line 114)
- Promotion attempt (line 118)
- Original primary recovery detection (line 127)
- Demotion/handshake (line 131)
6. **Missing replica "Failed" state**
- `ReplicaState` enum has no `Failed` variant
- User's TODO #5 correctly identifies this gap
- What happens if replica's own heartbeats fail repeatedly?
7. **Primary Workflow incomplete** - Key logic missing:
- No NATS check before recovering from `Fenced` state (line 95)
- No NATS check in `Yielding` state for demotion handshake (line 101)
- No actual fencing failure handling
8. **Store `subscribe` not implemented** (`store/mod.rs`)
- Returns `todo!()` in NATS implementation
- No callback mechanism defined in trait
- Without this, agents cannot react to state changes
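One possible shape for the missing callback mechanism, using the `SubscriptionCallback` type name from the DONE list above. The in-memory fan-out is a sketch; the NATS implementation would drive the callbacks from a `watch()` stream on the bucket instead:

```rust
use std::collections::HashMap;

/// Callback invoked whenever a watched key changes.
pub type SubscriptionCallback = Box<dyn Fn(&str, &[u8]) + Send + Sync>;

#[derive(Default)]
pub struct SubscribableStore {
    subscribers: HashMap<String, Vec<SubscriptionCallback>>,
}

impl SubscribableStore {
    /// Register a callback for a key (e.g. "cluster_state").
    pub fn subscribe(&mut self, key: &str, cb: SubscriptionCallback) {
        self.subscribers.entry(key.to_string()).or_default().push(cb);
    }

    /// Called on every observed write; fans the new value out so each agent
    /// can keep its local cluster_state copy current.
    pub fn notify(&self, key: &str, value: &[u8]) {
        if let Some(cbs) = self.subscribers.get(key) {
            for cb in cbs {
                cb(key, value);
            }
        }
    }
}
```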
9. **Cluster state not tracked centrally**
- User's TODO #6 correctly identifies this
- Each agent should maintain a local copy of cluster_state
- No subscription mechanism to update this local copy
10. **No validation of configuration constraints**
- Should validate: `failover_timeout > heartbeat_timeout * failure_threshold + safety_margin`
- Invalid config could cause split brain
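The constraint translates directly into a validation method. A sketch, assuming `AgentConfig` carries these four fields (the `safety_margin` field is the ADR concept from issue #25 and may not exist in the struct yet):

```rust
use std::time::Duration;

pub struct AgentConfig {
    pub heartbeat_timeout: Duration,
    pub failure_threshold: u32,
    pub failover_timeout: Duration,
    pub safety_margin: Duration,
}

impl AgentConfig {
    /// Rejects configurations where a replica could promote before the
    /// primary has had time to detect its own failure and fence itself.
    pub fn validate(&self) -> Result<(), String> {
        let min = self.heartbeat_timeout * self.failure_threshold + self.safety_margin;
        if self.failover_timeout <= min {
            return Err(format!(
                "failover_timeout ({:?}) must exceed \
                 heartbeat_timeout * failure_threshold + safety_margin ({:?})",
                self.failover_timeout, min
            ));
        }
        Ok(())
    }
}
```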
## MEDIUM - Incorrect State Transitions
11. **Primary immediately transitions `Failed -> Fenced`** (`workflow/primary.rs:120-121`)
- Two state transitions happen in one heartbeat cycle
- Should stay in `Failed` until fencing actually completes
- What if fencing fails? State machine won't reflect it
12. **No fencing failure handling**
- If `on_failover()` fails, node thinks it's fenced but DB is still accepting writes
- ADR mentions escalating to radical measures, but no callback for failure
13. **Replica `Watching` state does nothing**
- Line 115: Just logs, checks nothing
- Should be checking staleness of primary heartbeat
14. **Demotion handshake not implemented**
- ADR section 4 details this but code doesn't implement it
- How does original primary know it should yield?
## LOW - Observability & Reliability
15. **No graceful shutdown mechanism**
- `run_heartbeat_loop` runs forever
- No signal handling (SIGTERM, SIGINT)
16. **Async task errors silently ignored**
- `tokio::spawn` at lines 74, 83, 123
- No `JoinHandle` retention or error handling
17. **No metrics/observability**
- Only log output
- No Prometheus metrics for state transitions, failure counts, etc.
18. **Hardcoded main() function** (`agent_loop.rs::main`)
- Not production-ready entry point
- Should load config from environment or file
19. **Store factory pattern missing**
- TODO comment at line 54 confirms this
- Can't switch between stores via config
20. **No backoff/retry logic for NATS operations**
- Transient failures could trigger unnecessary fencing
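A sketch of the retry discipline. In the async agent loop the sleep would be tokio's, and crucially only transient errors may be retried; a `SequenceMismatch` must never be (see the zombie-leader addendum below). The `with_backoff` name is illustrative:

```rust
use std::time::Duration;

/// Retry an operation with exponential backoff before treating the failure
/// as real. `op` is attempted up to `max_attempts` times; delays double
/// each round starting from `base_delay`.
pub fn with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}
```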
21. **`AgentInfo` status is hardcoded to "HEALTHY"**
- Line 137 in `store_heartbeat`
- Should reflect actual workflow state
22. **Unused fields in structs**
- `HeartbeatState.last_seq` set but never read
- `ClusterState.current_primary` set but never read
## ADR-017-3 Compliance Issues
23. **ADR violation: Clock skew not avoided**
- While ADR says use store metadata, code uses local time
24. **Failover timeout not configurable**
- Defined in ADR but not in `AgentConfig`
- Needed for replica staleness calculation
25. **Safety margin concept exists in ADR but not in code**
- Configuration should include this margin
26. **No handling of Case 3 (Replica Network Lag)**
- ADR describes NATS rejection prevention
- But `set_strict` implementation accepts any write
## Code Quality Issues
27. **Inconsistent error handling**
- Some paths return `Err`, others `todo!()`, others ignore
28. **Unnecessary `Clone` bounds**
- `DeploymentConfig.clone()` used frequently
- Could be optimized with `Arc`
29. **Missing lifetime annotations**
- `KvStore::get` returns `String` key in error - inefficient
30. **No integration points mentioned**
- PostgreSQL lifecycle control implementation missing
- Fencing via CNPG not connected
## Production Readiness Checklist Summary
For battle testing preparation, you need:
**Immediate (blockers):**
- Fix NATS store metadata usage (issues #1, #2)
- Implement strict set_strict with actual CAS (#1)
- Implement replica primary watching (#4, #5)
- Add failover_timeout config + staleness logic (#3, #24)
- Implement subscribe mechanism with callbacks (#8)
**High priority:**
- Complete all workflow transitions (#5, #7, #11-14)
- Add cluster state tracking (#6, #9)
- Add configuration validation (#10)
- Add Replica Failed state (#6)
**Before deployment:**
- Implement graceful shutdown (#15)
- Add error handling for spawned tasks (#16)
- Remove hardcoded main function (#18)
- Implement store factory (#19)
- Add Prometheus metrics (#17)
**Documentation:**
- Document all configuration parameters and their trade-offs
- Add runbooks for each failure mode
- Document battle test scenarios to cover
### Addendum: Missing Critical Issues
#### 1. CRITICAL: Heartbeat "Lying" (Data Integrity)
* **Location:** `agent_loop.rs` line 137 inside `store_heartbeat`.
* **The Bug:** `status: "HEALTHY".to_string()` is hardcoded.
* **The Impact:** The agent loop runs regardless of the workflow state. If the Primary transitions to `Fenced` or `Failed`, it continues to write a heartbeat saying "I am HEALTHY".
* **The Fix:** The `store_heartbeat` function must accept the current status from the `workflow` (e.g., `self.workflow.status()`) to serialize into the JSON. A fenced agent must broadcast "FENCED" or stop writing entirely.
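The mapping from workflow state to published status is trivial once it exists. A sketch with illustrative state names (the real enum lives in the workflow module):

```rust
/// Illustrative workflow states; the heartbeat must serialize the current
/// one rather than a hardcoded "HEALTHY".
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WorkflowState {
    Initializing,
    Healthy,
    Failed,
    Fenced,
}

/// What store_heartbeat should publish for a given workflow state.
/// A fenced node must never advertise itself as healthy.
pub fn heartbeat_status(state: WorkflowState) -> &'static str {
    match state {
        WorkflowState::Initializing => "INITIALIZING",
        WorkflowState::Healthy => "HEALTHY",
        WorkflowState::Failed => "FAILED",
        WorkflowState::Fenced => "FENCED",
    }
}
```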
#### 2. CRITICAL: Async Task Race Conditions (State Machine Corruption)
* **Location:** `workflow/primary.rs` lines 74, 83, 123 (`tokio::spawn`).
* **The Bug:** The callbacks (`on_active`, `on_failover`) are spawned as fire-and-forget background tasks.
* **Scenario:**
1. Primary fails -> transitions to `Fenced` -> spawns `on_failover` (takes 5s).
2. Network recovers immediately -> transitions to `Healthy` -> spawns `on_active` (takes 1s).
3. `on_active` finishes *before* `on_failover`.
4. `on_failover` finishes last, killing the DB *after* the agent decided it was healthy.
* **The Fix:** You need a `JoinHandle` or a cancellation token. When transitioning states, any pending conflicting background tasks must be aborted before starting the new one.
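The discipline can be illustrated with std threads and an atomic cancellation token; in the agent itself the same idea applies to tokio tasks via `JoinHandle`/`AbortHandle`. The `CallbackSlot` name and API are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::JoinHandle;

/// Holds at most one in-flight state-transition callback. Replacing it
/// cancels and joins the previous one, so effects cannot land out of order.
pub struct CallbackSlot {
    cancel: Arc<AtomicBool>,
    handle: Option<JoinHandle<()>>,
}

impl CallbackSlot {
    pub fn new() -> Self {
        Self { cancel: Arc::new(AtomicBool::new(false)), handle: None }
    }

    /// Cancel the previous callback, wait for it to finish, then start `f`
    /// with a fresh cancellation token it is expected to check.
    pub fn replace(&mut self, f: impl FnOnce(&AtomicBool) + Send + 'static) {
        self.cancel.store(true, Ordering::SeqCst);
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
        let token = Arc::new(AtomicBool::new(false));
        self.cancel = token.clone();
        self.handle = Some(std::thread::spawn(move || f(&token)));
    }

    /// Cancel and wait for any remaining callback (e.g. on shutdown).
    pub fn shutdown(&mut self) {
        self.cancel.store(true, Ordering::SeqCst);
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
    }
}
```

In the flapping scenario above, `replace` guarantees the slow `on_failover` is cancelled and fully drained before `on_active` starts, so it can no longer kill the DB after the agent decided it was healthy.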
#### 3. CRITICAL: Zombie Leader Prevention (Split Brain Risk)
* **Location:** `agent_loop.rs` loop logic.
* **The Bug:** There is no "Stop the World" gate.
* **Scenario:** If `store_heartbeat` fails (NATS unreachable), the code returns `Err`, triggers `handle_heartbeat_failure`, and the loop *continues*.
* **The Risk:** If the NATS write fails because of a CAS error (meaning a Replica has already promoted), this Primary is now a Zombie. It *must* immediately cease all operations. The current loop just sleeps and tries again.
* **The Fix:** If `store_heartbeat` returns a `SequenceMismatch` error, the agent must treat this as a fatal demotion event, immediately fencing itself, rather than just incrementing a failure counter.
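The classification can be made explicit in one place so the heartbeat loop cannot accidentally treat demotion as a retryable blip. Enum and function names here are illustrative:

```rust
/// Errors the heartbeat write can surface, simplified for illustration.
#[derive(Debug, PartialEq)]
pub enum HeartbeatError {
    /// Someone else advanced the key: a replica has promoted.
    SequenceMismatch,
    /// NATS unreachable, timeout, etc.
    Transient,
}

#[derive(Debug, PartialEq)]
pub enum HeartbeatAction {
    /// Count the failure and retry next cycle.
    CountFailure,
    /// We have been replaced: fence immediately, do not retry.
    FenceSelf,
}

/// A CAS conflict is not a transient failure; it is proof of demotion and
/// must stop the world instead of incrementing a counter.
pub fn classify_heartbeat_error(err: &HeartbeatError) -> HeartbeatAction {
    match err {
        HeartbeatError::SequenceMismatch => HeartbeatAction::FenceSelf,
        HeartbeatError::Transient => HeartbeatAction::CountFailure,
    }
}
```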
#### 4. HIGH: NATS Bucket Name Collision
* **Location:** `agent_loop.rs` (Config) vs `store/nats.rs`.
* **The Bug:** `FailoverCNPGConfig` has `cnpg_cluster_name`, and `AgentConfig` has `cluster_id`.
* **The Impact:** If you run two different Harmony clusters on the same NATS server, and they use the same bucket name logic (or hardcoded names), they will overwrite each other's state.
* **The Fix:** The NATS KV bucket name must be namespaced dynamically, e.g., `format!("harmony_{}", config.cluster_id)`.
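A one-line helper keeps the naming convention in a single place (the `harmony_` prefix is illustrative):

```rust
/// Derive the KV bucket name from the cluster id so two Harmony clusters
/// sharing one NATS server cannot overwrite each other's state.
pub fn bucket_name(cluster_id: &str) -> String {
    format!("harmony_{cluster_id}")
}
```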
#### 5. HIGH: Startup State Reconciliation
* **Location:** `HarmonyAgent::new`.
* **The Bug:** Agents always start in `Initializing`.
* **Scenario:** The process crashes while it is the `Leader`. It restarts. It enters `Initializing`. It doesn't know it *should* be the leader.
* **The Impact:** The cluster might be leaderless until the `failover_timeout` expires, causing unnecessary downtime.
* **The Fix:** On startup, the agent must fetch the `ClusterState` from NATS. If `current_primary == my_id`, it should jump directly to `Healthy`/`Leader` state (possibly after a sanity check).
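The reconciliation decision reduces to a pure function of the fetched state and the agent's identity. A sketch with illustrative type names (the sanity check before actually resuming leadership is deliberately out of scope here):

```rust
/// Illustrative subset of the persisted cluster state.
pub struct ClusterState {
    pub current_primary: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum StartState {
    Initializing,
    /// Resume leadership (after a sanity check) instead of waiting for
    /// failover_timeout to elapse.
    ResumeAsPrimary,
}

/// On boot, consult the cluster state fetched from the store: if we were
/// the recorded primary before the crash, skip straight to leader recovery.
pub fn initial_state(state: Option<&ClusterState>, my_id: &str) -> StartState {
    match state.and_then(|s| s.current_primary.as_deref()) {
        Some(primary) if primary == my_id => StartState::ResumeAsPrimary,
        _ => StartState::Initializing,
    }
}
```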
### Summary of Tasks to Add
Please add these to your master list before starting implementation:
28. **Dynamic Heartbeat Status:** Pass workflow state to `store_heartbeat` to prevent Fenced nodes from reporting "HEALTHY".
29. **Async Task Cancellation:** Implement `AbortHandle` for `on_active`/`on_failover` tasks to prevent race conditions during rapid state flapping.
30. **Fatal CAS Handling:** Treat `SequenceMismatch` in `store_heartbeat` as an immediate "I have been replaced" signal (Zombie detection).
31. **NATS Namespace Isolation:** Ensure KV bucket names include `cluster_id`.
32. **Startup Reconciliation:** Check NATS on boot to restore previous state if valid.
* **Think about vacuum / stop-the-world operations**