TODO

DONE:

1. ✅ store trait `subscribe` definition missing callback - fixed with `SubscriptionCallback` type
2. ✅ BUG: data integrity issue - NATS store now uses JetStream metadata (`entry.created`, `entry.revision`)
3. ✅ fixed replica workflow not transitioning to `Failed` when `failure_threshold` is exceeded
4. ✅ replica workflow now holds a copy of cluster state - `cluster_state` field added to `HarmonyAgent`
5. ✅ heartbeat metadata now passed to workflow via `on_heartbeat_stored()` callback
6. ✅ `failover_timeout` added to `AgentConfig`
7. ✅ NATS store properly detects a revision mismatch and returns a `SequenceMismatch` error
8. ✅ startup reconciliation implemented via `on_startup()` method

REMAINING:

- review all code and list implementation issues
- review both workflows for each state transition
- complete replica workflow staleness detection (needs implementation in the `Watching` state)
- implement state recovery from the `Failed` state for both workflows
- implement `subscribe` in the NATS store with the `watch()` API
- implement config validation for `failover_timeout` constraints

## CRITICAL - Data Integrity Issues

1. **NATS Store `set_strict` doesn't enforce CAS** (`store/nats.rs`)
   - Currently uses `put()`, which overwrites unconditionally
   - Must use `update()` with a revision parameter for proper compare-and-set
   - Without this, concurrent promotion attempts can cause split brain
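To make the put-vs-update difference concrete, here is a minimal std-only sketch against an in-memory stand-in for a revisioned KV store (`InMemoryKv` and `KvError` are illustrative types, not the project's actual API):

```rust
use std::collections::HashMap;

/// Illustrative in-memory stand-in for a revisioned KV store (not the
/// project's real API): key -> (revision, value).
pub struct InMemoryKv {
    entries: HashMap<String, (u64, Vec<u8>)>,
}

#[derive(Debug, PartialEq)]
pub enum KvError {
    SequenceMismatch { expected: u64, actual: u64 },
}

impl InMemoryKv {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Unconditional write, like `put()`: last writer wins, silently.
    pub fn put(&mut self, key: &str, value: Vec<u8>) -> u64 {
        let rev = self.entries.get(key).map(|(r, _)| r + 1).unwrap_or(1);
        self.entries.insert(key.to_string(), (rev, value));
        rev
    }

    /// Compare-and-set, like `update()`: succeeds only if the caller's
    /// expected revision still matches the stored one.
    pub fn update(&mut self, key: &str, value: Vec<u8>, expected: u64) -> Result<u64, KvError> {
        let actual = self.entries.get(key).map(|(r, _)| *r).unwrap_or(0);
        if actual != expected {
            return Err(KvError::SequenceMismatch { expected, actual });
        }
        let rev = actual + 1;
        self.entries.insert(key.to_string(), (rev, value));
        Ok(rev)
    }
}
```

With CAS semantics, the second of two concurrent promoters loses the race instead of silently overwriting the winner.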

2. **NATS Store uses local clock instead of JetStream metadata** (`store/nats.rs`)
   - Lines 55, 68: using `SystemTime::now()` violates ADR-017-3
   - The NATS `Entry` has `.revision` and `.created` fields that must be used
   - This defeats the entire purpose of store-provided timestamps

3. **Heartbeat metadata not passed to ReplicaWorkflow** (`agent_loop.rs::run_heartbeat_loop`)
   - Line ~156: a TODO comment confirms the missing metadata passing
   - The replica cannot calculate staleness without `metadata.timestamp`
   - Failover logic is broken

4. **No actual cluster state watching exists**
   - The replica workflow declares `ClusterState` but never updates it
   - No subscription to the primary heartbeat or the `cluster_state` key
   - The replica cannot detect primary liveness

## HIGH - Missing Core Functionality

5. **Replica Workflow incomplete** - all key logic is TODO:
   - Watching primary staleness (line 114)
   - Promotion attempt (line 118)
   - Original primary recovery detection (line 127)
   - Demotion/handshake (line 131)

6. **Missing replica `Failed` state**
   - The `ReplicaState` enum has no `Failed` variant
   - Correctly identified in the TODO list above
   - What happens if the replica's own heartbeats fail repeatedly?

7. **Primary Workflow incomplete** - key logic missing:
   - No NATS check before recovering from the `Fenced` state (line 95)
   - No NATS check in the `Yielding` state for the demotion handshake (line 101)
   - No actual fencing failure handling

8. **Store `subscribe` not implemented** (`store/mod.rs`)
   - Returns `todo!()` in the NATS implementation
   - No callback mechanism defined in the trait
   - Without this, agents cannot react to state changes
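One possible shape for the missing callback mechanism, as a std-only sketch. All names here (`SubscriptionCallback`, `Subscriptions`) are illustrative; the real trait's fix may differ:

```rust
/// Illustrative callback signature for store change notifications; the
/// real trait's `SubscriptionCallback` may differ.
pub type SubscriptionCallback = Box<dyn FnMut(String, Vec<u8>) + Send>;

/// Toy dispatcher standing in for a `watch()`-backed subscription.
pub struct Subscriptions {
    callbacks: Vec<(String, SubscriptionCallback)>,
}

impl Subscriptions {
    pub fn new() -> Self {
        Self { callbacks: Vec::new() }
    }

    /// Register interest in a key.
    pub fn subscribe(&mut self, key: &str, cb: SubscriptionCallback) {
        self.callbacks.push((key.to_string(), cb));
    }

    /// Called by the store when a key changes (e.g. from a watch event):
    /// fan the update out to every matching subscriber.
    pub fn notify(&mut self, key: &str, value: &[u8]) {
        for (k, cb) in self.callbacks.iter_mut() {
            if k == key {
                cb(key.to_string(), value.to_vec());
            }
        }
    }
}
```

In the NATS implementation, `notify` would be driven by the KV `watch()` stream rather than called directly.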

9. **Cluster state not tracked centrally**
   - Correctly identified in the TODO list above
   - Each agent should maintain a local copy of `cluster_state`
   - No subscription mechanism exists to update this local copy

10. **No validation of configuration constraints**
    - Should validate: `failover_timeout > heartbeat_timeout * failure_threshold + safety_margin`
    - An invalid config could cause split brain
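The constraint above is easy to enforce at startup. A minimal sketch, assuming the timing fields live together (the `FailoverTiming` struct here is a hypothetical subset of `AgentConfig`):

```rust
use std::time::Duration;

/// Hypothetical subset of AgentConfig covering only the failover timing knobs.
pub struct FailoverTiming {
    pub heartbeat_timeout: Duration,
    pub failure_threshold: u32,
    pub safety_margin: Duration,
    pub failover_timeout: Duration,
}

/// Reject configs where a replica could promote before the primary has
/// conclusively detected its own failure and fenced itself.
pub fn validate_failover_timing(cfg: &FailoverTiming) -> Result<(), String> {
    let min = cfg.heartbeat_timeout * cfg.failure_threshold + cfg.safety_margin;
    if cfg.failover_timeout <= min {
        return Err(format!(
            "failover_timeout ({:?}) must exceed heartbeat_timeout * failure_threshold + safety_margin ({:?})",
            cfg.failover_timeout, min
        ));
    }
    Ok(())
}
```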

## MEDIUM - Incorrect State Transitions

11. **Primary immediately transitions `Failed -> Fenced`** (`workflow/primary.rs:120-121`)
    - Two state transitions happen in one heartbeat cycle
    - Should stay in `Failed` until fencing actually completes
    - What if fencing fails? The state machine won't reflect it
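A sketch of the intended one-transition-per-cycle rule (the `PrimaryState` variants are assumptions based on this document, not the actual enum):

```rust
/// Assumed primary states, based on this document (the real enum may differ).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PrimaryState {
    Healthy,
    Failed,
    Fenced,
}

/// One transition per heartbeat cycle: `Failed` becomes `Fenced` only
/// once fencing is confirmed complete, never in the same cycle the
/// failure was detected.
pub fn next_state(current: PrimaryState, fencing_complete: bool) -> PrimaryState {
    match current {
        PrimaryState::Failed if fencing_complete => PrimaryState::Fenced,
        // Stay in Failed while on_failover() is still running (or has failed).
        PrimaryState::Failed => PrimaryState::Failed,
        other => other,
    }
}
```

If fencing fails, the node simply stays in `Failed`, and the state machine keeps reflecting reality.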

12. **No fencing failure handling**
    - If `on_failover()` fails, the node thinks it's fenced but the DB is still accepting writes
    - The ADR mentions escalating to radical measures, but there is no callback for failure

13. **Replica `Watching` state does nothing**
    - Line 115: just logs, checks nothing
    - Should be checking the staleness of the primary heartbeat
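The staleness check itself is small once store-provided metadata is available. A sketch, with the metadata shape assumed from this document (JetStream's `entry.created`/`entry.revision`):

```rust
use std::time::{Duration, SystemTime};

/// Store-provided heartbeat metadata (shape assumed from this document:
/// JetStream's entry.created and entry.revision, not the writer's clock).
pub struct HeartbeatMetadata {
    pub created: SystemTime,
    pub revision: u64,
}

/// True if the primary's last heartbeat is older than `failover_timeout`.
/// `now` is injected rather than read inside, which keeps the check testable.
pub fn primary_is_stale(meta: &HeartbeatMetadata, now: SystemTime, failover_timeout: Duration) -> bool {
    match now.duration_since(meta.created) {
        Ok(age) => age > failover_timeout,
        // Entry appears to be from the future (clock anomaly): not stale.
        Err(_) => false,
    }
}
```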

14. **Demotion handshake not implemented**
    - ADR section 4 details this, but the code doesn't implement it
    - How does the original primary know it should yield?

## LOW - Observability & Reliability

15. **No graceful shutdown mechanism**
    - `run_heartbeat_loop` runs forever
    - No signal handling (SIGTERM, SIGINT)

16. **Async task errors silently ignored**
    - `tokio::spawn` at lines 74, 83, 123
    - No `JoinHandle` retention or error handling

17. **No metrics/observability**
    - Only log output
    - No Prometheus metrics for state transitions, failure counts, etc.

18. **Hardcoded `main()` function** (`agent_loop.rs::main`)
    - Not a production-ready entry point
    - Should load config from the environment or a file

19. **Store factory pattern missing**
    - TODO comment at line 54 confirms this
    - Can't switch between stores via config

20. **No backoff/retry logic for NATS operations**
    - Transient failures could trigger unnecessary fencing

21. **`AgentInfo` status is hardcoded to "HEALTHY"**
    - Line 137 in `store_heartbeat`
    - Should reflect the actual workflow state

22. **Unused fields in structs**
    - `HeartbeatState.last_seq` is set but never read
    - `ClusterState.current_primary` is set but never read

## ADR-017-3 Compliance Issues

23. **ADR violation: clock skew not avoided**
    - The ADR says to use store metadata, but the code uses local time

24. **Failover timeout not configurable**
    - Defined in the ADR but not in `AgentConfig`
    - Needed for the replica staleness calculation

25. **Safety margin concept exists in the ADR but not in the code**
    - Configuration should include this margin

26. **No handling of Case 3 (Replica Network Lag)**
    - The ADR describes NATS rejection prevention
    - But the `set_strict` implementation accepts any write

## Code Quality Issues

27. **Inconsistent error handling**
    - Some paths return `Err`, others `todo!()`, others ignore errors entirely

28. **Unnecessary `Clone` bounds**
    - `DeploymentConfig.clone()` is used frequently
    - Could be optimized with `Arc`

29. **Missing lifetime annotations**
    - `KvStore::get` returns an owned `String` key in its error type - inefficient

30. **No integration points mentioned**
    - PostgreSQL lifecycle control implementation is missing
    - Fencing via CNPG is not connected

## Production Readiness Checklist Summary

For battle testing preparation, you need:

**Immediate (blockers):**
- Fix NATS store metadata usage (issues #1, #2)
- Implement `set_strict` with actual CAS (#1)
- Implement replica primary watching (#4, #5)
- Add `failover_timeout` config + staleness logic (#3, #24)
- Implement the subscribe mechanism with callbacks (#8)

**High priority:**
- Complete all workflow transitions (#5, #7, #11-14)
- Add cluster state tracking (#6, #9)
- Add configuration validation (#10)
- Add the replica `Failed` state (#6)

**Before deployment:**
- Implement graceful shutdown (#15)
- Add error handling for spawned tasks (#16)
- Remove the hardcoded main function (#18)
- Implement the store factory (#19)
- Add Prometheus metrics (#17)

**Documentation:**
- Document all configuration parameters and their trade-offs
- Add runbooks for each failure mode
- Document the battle test scenarios to cover

### Addendum: Missing Critical Issues

#### 1. CRITICAL: Heartbeat "Lying" (Data Integrity)

* **Location:** `agent_loop.rs` line 137 inside `store_heartbeat`.
* **The Bug:** `status: "HEALTHY".to_string()` is hardcoded.
* **The Impact:** The agent loop runs regardless of the workflow state. If the Primary transitions to `Fenced` or `Failed`, it continues to write a heartbeat saying "I am HEALTHY".
* **The Fix:** The `store_heartbeat` function must accept the current status from the workflow (e.g., `self.workflow.status()`) to serialize into the JSON. A fenced agent must broadcast "FENCED" or stop writing entirely.

#### 2. CRITICAL: Async Task Race Conditions (State Machine Corruption)

* **Location:** `workflow/primary.rs` lines 74, 83, 123 (`tokio::spawn`).
* **The Bug:** The callbacks (`on_active`, `on_failover`) are spawned as fire-and-forget background tasks.
* **Scenario:**
  1. Primary fails -> transitions to `Fenced` -> spawns `on_failover` (takes 5s).
  2. Network recovers immediately -> transitions to `Healthy` -> spawns `on_active` (takes 1s).
  3. `on_active` finishes *before* `on_failover`.
  4. `on_failover` finishes last, killing the DB *after* the agent decided it was healthy.
* **The Fix:** You need a `JoinHandle` or a cancellation token. When transitioning states, any pending conflicting background tasks must be aborted before starting the new one.
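A minimal std-only sketch of the cancellation pattern (a stand-in for tokio's `AbortHandle`/`CancellationToken`; `CancelToken` and `TransitionGuard` are illustrative names):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Minimal cancellation token; the spawned task polls is_cancelled()
/// (or, with tokio, the handle is aborted outright).
#[derive(Clone)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    pub fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Each state transition replaces the previous token, so an in-flight
/// on_failover() from the prior state is cancelled before on_active()
/// from the new state starts.
pub struct TransitionGuard {
    current: Option<CancelToken>,
}

impl TransitionGuard {
    pub fn new() -> Self {
        Self { current: None }
    }

    pub fn begin_transition(&mut self) -> CancelToken {
        if let Some(prev) = self.current.take() {
            prev.cancel(); // abort the conflicting background task
        }
        let token = CancelToken::new();
        self.current = Some(token.clone());
        token
    }
}
```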

#### 3. CRITICAL: Zombie Leader Prevention (Split Brain Risk)

* **Location:** `agent_loop.rs` loop logic.
* **The Bug:** There is no "stop the world" gate.
* **Scenario:** If `store_heartbeat` fails (NATS unreachable), the code returns `Err`, triggers `handle_heartbeat_failure`, and the loop *continues*.
* **The Risk:** If the NATS write fails because of a CAS error (meaning a Replica has already promoted), this Primary is now a zombie. It *must* immediately cease all operations. The current loop just sleeps and tries again.
* **The Fix:** If `store_heartbeat` returns a `SequenceMismatch` error, the agent must treat this as a fatal demotion event and immediately fence itself, rather than just incrementing a failure counter.
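The gate can be expressed as a small classification step between the store call and the failure counter. A sketch (the `StoreError` and `HeartbeatOutcome` shapes are assumptions, not the project's actual types):

```rust
/// Assumed store error shape (the real error enum may differ).
#[derive(Debug)]
pub enum StoreError {
    SequenceMismatch,
    Unreachable,
}

#[derive(Debug, PartialEq)]
pub enum HeartbeatOutcome {
    Ok,
    /// Transient failure: count it against failure_threshold and retry.
    CountFailure,
    /// Another node owns the key: fence immediately, never retry.
    FatalDemotion,
}

/// The gate the loop is missing: a CAS rejection is not a transient
/// error, it is proof that a replica has already promoted over us.
pub fn classify_heartbeat_result(result: Result<(), StoreError>) -> HeartbeatOutcome {
    match result {
        Ok(()) => HeartbeatOutcome::Ok,
        Err(StoreError::SequenceMismatch) => HeartbeatOutcome::FatalDemotion,
        Err(StoreError::Unreachable) => HeartbeatOutcome::CountFailure,
    }
}
```

The loop would then branch on the outcome: `FatalDemotion` triggers immediate fencing, `CountFailure` increments the counter as today.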

#### 4. HIGH: NATS Bucket Name Collision

* **Location:** `agent_loop.rs` (config) vs `store/nats.rs`.
* **The Bug:** `FailoverCNPGConfig` has `cnpg_cluster_name`, and `AgentConfig` has `cluster_id`.
* **The Impact:** If you run two different Harmony clusters on the same NATS server and they use the same bucket name logic (or hardcoded names), they will overwrite each other's state.
* **The Fix:** The NATS KV bucket name must be namespaced dynamically, e.g., `format!("harmony_{}", config.cluster_id)`.
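One way to sketch this, including sanitization (NATS bucket names are limited to alphanumerics, `-`, and `_`; the function name is illustrative):

```rust
/// Derive a per-cluster KV bucket name so two Harmony clusters sharing
/// one NATS server can never clobber each other's state. NATS bucket
/// names are limited to [A-Za-z0-9_-], so anything else is replaced.
pub fn kv_bucket_name(cluster_id: &str) -> String {
    let safe: String = cluster_id
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '-' { c } else { '_' })
        .collect();
    format!("harmony_{}", safe)
}
```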

#### 5. HIGH: Startup State Reconciliation

* **Location:** `HarmonyAgent::new`.
* **The Bug:** Agents always start in `Initializing`.
* **Scenario:** The process crashes while it is the Leader. It restarts. It enters `Initializing`. It doesn't know it *should* be the leader.
* **The Impact:** The cluster might be leaderless until the `failover_timeout` expires, causing unnecessary downtime.
* **The Fix:** On startup, the agent must fetch the `ClusterState` from NATS. If `current_primary == my_id`, it should jump directly to the `Healthy`/`Leader` state (possibly after a sanity check).
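The decision the `on_startup()` path needs to make can be sketched as a pure function over the fetched state (`PersistedClusterState` and `StartRole` are assumed shapes, not the project's actual types):

```rust
/// Assumed shape of the persisted cluster state fetched from NATS on boot.
pub struct PersistedClusterState {
    pub current_primary: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum StartRole {
    /// We were primary before the crash: resume leadership (after a sanity check).
    ResumePrimary,
    /// Someone else is primary, or the state is missing: start as a watching replica.
    Replica,
}

/// Decide the starting role from the persisted state instead of always
/// entering Initializing blind.
pub fn reconcile_startup(state: Option<&PersistedClusterState>, my_id: &str) -> StartRole {
    match state.and_then(|s| s.current_primary.as_deref()) {
        Some(primary) if primary == my_id => StartRole::ResumePrimary,
        _ => StartRole::Replica,
    }
}
```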

### Summary of Tasks to Add

Please add these to your master list before starting implementation (numbered to follow the existing issues #1-30):

31. **Dynamic Heartbeat Status:** Pass the workflow state to `store_heartbeat` to prevent fenced nodes from reporting "HEALTHY".
32. **Async Task Cancellation:** Implement `AbortHandle` for `on_active`/`on_failover` tasks to prevent race conditions during rapid state flapping.
33. **Fatal CAS Handling:** Treat `SequenceMismatch` in `store_heartbeat` as an immediate "I have been replaced" signal (zombie detection).
34. **NATS Namespace Isolation:** Ensure KV bucket names include `cluster_id`.
35. **Startup Reconciliation:** Check NATS on boot to restore the previous state if valid.

* **Think about vacuum / stop-the-world operations**