harmony/docs/fleet-operator-recovery-scenarios.md

# Fleet operator recovery scenarios

The operator is **stateless across restarts**. It can be killed, upgraded, or
rescheduled at any time. On restart it cold-rebuilds from two durable sources —
kube CRs and NATS KV — with no customer-visible "unknown state" window.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Operator process                         │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │ Deployment   │  │ Device       │  │ Device-state KV      │  │
│  │ CR watcher   │  │ CR watcher   │  │ watcher (history)    │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         │                 │                      │              │
│         ▼                 ▼                      ▼              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    FleetState (mutex)                     │  │
│  │  deployments: {key → CachedDeployment}                   │  │
│  │  devices:     {name → labels}                            │  │
│  │  states:      {(device,deployment) → DeploymentState}    │  │
│  │  owned_targets: {deployment → {device…}}                 │  │
│  │  dirty:       {key…}  ──► patch_tick (1 Hz)              │  │
│  └──────────────────────────────────────────────────────────┘  │
│         │                                                       │
│         ▼                                                       │
│  ┌──────────────────┐     ┌────────────────────────────────┐   │
│  │ desired-state KV │     │ Deployment.status.aggregate    │   │
│  │ put / delete     │     │ (patched to kube at 1 Hz)      │   │
│  └──────────────────┘     └────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ OperatorLiveness (3 atomic latches, shared w/ dashboard) │  │
│  │   deployments_ready ─┐                                   │  │
│  │   devices_ready ─────┼──► all three → Converged          │  │
│  │   states_ready ──────┘    (else: Recovering → banner)    │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

### Cold-start sequence

```
Operator starts
  │
  ├─ seed_owned_targets()       reads desired-state KV keys
  │                               → populates owned_targets
  │                                 (so first reconcile diffs correctly)
  │
  ├─ spawn 3 watchers (JoinSet) │
  │   ├─ Deployment CR watcher  │  replay → InitDone → mark_deployments_ready
  │   │                         │           + gc_orphaned_desired_state
  │   ├─ Device CR watcher      │  replay → InitDone → mark_devices_ready
  │   └─ Device-state KV watch  │  replay → seen_current → mark_states_ready
  │                             │  (or: empty bucket → mark immediately)
  │
  └─ spawn patch_tick (1 Hz)    flushes dirty CR status patches
```

### Convergence banner

```
Dashboard reads OperatorLiveness.phase():

  Recovering ──────► Converged
  ┌──────────────┐   ┌─────┐
  │ ⚠ rebuilding │   │     │  (empty div, polling stops)
  │ polls /2s    │   │     │
  └──────────────┘   └─────┘
```

The banner self-clears when all three latches trip. Typically sub-second.

### Key invariants

| Invariant | Mechanism |
|---|---|
| Writes are byte-deterministic | Same CR → same serialized score JSON. Two operators write identical bytes. |
| No leader election needed | Idempotent multi-writer at current fleet size. (HA = D3, deferred.) |
| Cancel = clean teardown | `JoinSet` owns all watchers; dropping `run`'s future aborts children. |
| `owned_targets` seeded before first reconcile | `seed_owned_targets()` reads KV keys so diff doesn't orphan entries. |
| Orphan GC on `InitDone` | Deployment CRs deleted while operator was down → purged at convergence. |
| Out-of-order state dedup | `apply_state` rejects entries with older `last_event_at`. |

---

## Scenarios

### 1. Cold restart, healthy fleet

Operator killed and restarted. Devices kept reporting to KV throughout.

```
Time ──────────────────────────────────────────────────►

Operator:   [running]─── KILL ───[restart]──────────────────►
                                │
                                ├─ seed_owned_targets (from KV keys)
                                ├─ replay CRs → InitDone
                                ├─ replay device-state → seen_current
                                └─ Converged

KV:         [desired-state ✓]  [device-state ✓]  (untouched)

Expected:   desired-state re-written identically (agents byte-compare → no-op)
            health counts rebuilt from device-state KV
            liveness → Converged
```

**Test:** `aggregator_converges_from_kv_after_restart`

### 2. Stale KV — CR deleted while operator down

A Deployment CR is force-deleted (finalizer bypassed: namespace teardown,
`--force`) while no operator is running. The desired-state KV entry is orphaned.

```
Time ──────────────────────────────────────────────────►

Operator:   [running]─── KILL ──────────────[restart]──────────►
                                │                  │
                                │                  ├─ seed_owned_targets
                                │                  │   (picks up orphan key)
                                │                  │
                                │                  ├─ CR replay: InitDone
                                │                  │   deployment NOT in cache
                                │                  │
                                │                  └─ gc_orphaned_desired_state
                                │                      deletes orphan from KV
                                │
CR:         [exists]─── force-delete ──[gone]─────────────────►
KV:         [desired-state ✓]──────────[orphan]───[purged]────►

Expected:   orphan desired-state deleted at convergence
            agents stop running the dead deployment
```

**Test:** `aggregator_gcs_desired_state_for_deleted_cr`

### 3. Two operators racing (rolling deploy overlap)

Brief period where two operator replicas run simultaneously.

```
Time ──────────────────────────────────────────────────►

Operator A: [running]────────────────────────────────────────►
Operator B:        [start]───────────────────────────────────►
                   │
                   ├─ both seed_owned_targets (same KV keys)
                   ├─ both replay same CRs
                   ├─ both compute same matched_devices
                   └─ both put identical score_json to same KV keys

KV:         [desired-state: bytes_A]  ==  [desired-state: bytes_B]

Expected:   KV value stable regardless of write order
            no orphan, no flapping, no leader election needed
```

**Test:** `two_aggregators_produce_identical_desired_state`

### 4. Partial KV — device offline during restart

A device hasn't reported state (no `device-state` entry) when the operator
restarts.

```
Time ──────────────────────────────────────────────────►

Operator:   [running]─── KILL ───[restart]──────────────────►

Device A:   [reporting ✓]────────────────────────────────────►
Device B:   [offline]────────────────────────────────────────►
            (no device-state entry in KV)

Expected:   Device A → Running (rebuilt from KV)
            Device B → Pending (no state entry = pending in aggregate)
            Device B recovers to Running when it reports
```

**Test:** `device_offline_during_restart_counts_as_pending`

### 5. Chaos — kill under write load

Operator repeatedly killed and restarted while deployments are being created.

```
Time ──────────────────────────────────────────────────►

Deployments:  [create d0] [create d1] [create d2] [create d3] [create d4]
                                           ▲
Operator:     [running]──────── KILL ──────┘──[restart]──── KILL ──[final]──►
                                                                     │
                                                                     └─ converge
                                                                        all 5

Expected:   final operator converges full desired-state set within 30s
```

**Test:** `chaos_kill_under_write_load_converges`

---

## Running the regression tests

```bash
HARMONY_FLEET_E2E=1 cargo test -p harmony-fleet-e2e --test operator_recovery
```

**Prerequisites:** k3d + podman on PATH. The shared harness brings up NATS in
a fresh namespace.

**How it works:** tests drive `fleet_aggregator::run` in-process against the
real NATS + k3d. `spawn_aggregator` returns a `JoinHandle`; `kill` aborts it
and awaits (ensuring the `JoinSet` children tear down before the next instance
starts).