feat/iot-aggregation-scale #274
Closed
johnride
wants to merge 18 commits from
feat/iot-aggregation-scale into feat/iot-operator-helm-chart
First milestone of the aggregation rework. Lands the contract layer without any runtime side effects: the agent and operator still run their legacy paths unchanged.

New types (module `fleet`):
- `DeviceInfo`: routing labels + inventory, rewritten on label change. Stored in KV `device-info` at `info.<device_id>`.
- `DeploymentState`: current phase per (device, deployment). Stored in KV `device-state` at `state.<device>.<deployment>`. Authoritative snapshot; the operator rebuilds counters from it on cold start.
- `HeartbeatPayload`: tiny liveness ping in KV `device-heartbeat`. Payload size is capped by a test (< 96 bytes) so it stays cheap at 1M-device rates.
- `StateChangeEvent`: `from: Option<Phase>, to: Phase, sequence`, emitted on each transition to JetStream stream `device-state-events` on subject `events.state.<device>.<deployment>`. The operator folds these events into in-memory counters.
- `LogEvent`: shorter-retention, user-facing event log published to JetStream stream `device-log-events` on subject `events.log.<device>`.

Transport constants and key/subject helpers live in `kv`, with cross-component wire-stability tests so a rename there gets caught. 10 new tests (roundtrip serde, forward-compat parse, size bound, key/subject format). Legacy `AgentStatus` tests and constants stay green; retirement is scheduled for M8 once the live path has switched over.

---

The agent now writes the new per-concern KV shapes and event streams alongside the legacy `AgentStatus`. Nothing consumes the new data yet; the legacy aggregator still drives CR `.status` from `agent-status`. M3 will add the operator-side cold-start and consumer paths in parity mode; M5 flips the CR-patch source once counters verify against the legacy aggregator.

New module `fleet_publisher.rs` owns:
- Opening and idempotently creating the three new KV buckets (`device-info`, `device-state`, `device-heartbeat`) and the two JetStream streams (`device-state-events`, `device-log-events`).
- Publish methods for `DeviceInfo`, `HeartbeatPayload`, `DeploymentState` (KV put), `StateChangeEvent` + `LogEvent` (stream publish), and delete for deployment-state cleanup.
- Log-and-swallow failure mode: the operator re-walks KV on cold start, so a missed event publish is self-healing on the next transition or operator restart.

Reconciler grew:
- `device_id: Id` + `fleet: Option<Arc<FleetPublisher>>`.
- A per-deployment monotonic sequence counter in `StatusState`.
- `set_phase` detects actual transitions (prev_phase vs new) and emits a `DeploymentState` KV write + `StateChangeEvent` stream publish only on change. A no-op re-confirmation still bumps the sequence (lets the operator detect duplicate events via sequence comparison) but stays off the wire.
- `drop_phase` deletes the device-state KV entry.
- `push_event` also publishes a `LogEvent` to the stream.

main.rs:
- Builds `FleetPublisher` after `connect_nats` and passes it into the `Reconciler`.
- Publishes `DeviceInfo` once at startup (empty labels; populated by the selector-targeting branch once it merges).
- Spawns a heartbeat loop on a 30 s cadence.
- The legacy `report_status` `AgentStatus` task keeps running unchanged.

8 unit tests added for the transition-detection + sequence + ring-buffer invariants (driving `set_phase` / `drop_phase` / `push_event` with `fleet: None`). 18 contract tests from M1 still green.

---

New module `fleet_aggregator` spawns a 5 s tick task that:
- Walks the Chapter 4 KV buckets (`device-info`, `device-state`) every tick.
- Computes per-CR phase counters via `compute_counters` (pure function, unit tested).
- Computes the legacy aggregator's counts from the same `agent-status` snapshot map the legacy task is already maintaining.
- Compares the two per CR and logs per tick at DEBUG (matches) or WARN (mismatches), with running totals at INFO every 60 s.
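The per-CR counter fold is a pure function, so its shape is easy to sketch. The types and names below are illustrative stand-ins, not the real `fleet` structs; the actual `compute_counters` works over the KV-decoded `DeviceInfo`/`DeploymentState` values:

```rust
use std::collections::HashMap;

/// Illustrative stand-ins for the real fleet types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Phase { Pending, Running, Failed }

/// One (device, deployment) -> phase entry, as read from the `device-state` bucket.
struct DeploymentState { device: String, deployment: String, phase: Phase }

/// Pure fold: per-deployment phase counters, counting only devices the CR targets.
/// The targeting predicate is the single plug point for the selector-based rewrite.
fn compute_counters(
    states: &[DeploymentState],
    cr_targets_device: impl Fn(&str, &str) -> bool, // (deployment, device) -> targeted?
) -> HashMap<String, HashMap<Phase, usize>> {
    let mut counters: HashMap<String, HashMap<Phase, usize>> = HashMap::new();
    for s in states {
        if cr_targets_device(&s.deployment, &s.device) {
            *counters
                .entry(s.deployment.clone())
                .or_default()
                .entry(s.phase)
                .or_insert(0) += 1;
        }
    }
    counters
}
```

Because the fold takes the targeting predicate as a parameter, swapping `target_devices.contains()` for a label-selector match changes one closure and nothing else.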
The explicit `cr_targets_device` predicate is the one-line plug point for the selector-based rewrite coming from the review-fix branch: swap `target_devices.contains()` for `target_selector.matches(&info.labels)`; everything else in the aggregator is label/selector-agnostic.

Refactored `aggregate::run` to accept the `StatusSnapshots` map from outside, so the parity-check task reads the same agent-status view the legacy aggregator writes to. Added an `aggregate::new_snapshots()` helper so `main` owns the one shared `Arc`. The task is strictly read-only: no CR patches, no side effects. M5 flips `.status.aggregate` over to the new counter-driven path once M4 replaces the periodic re-walk with the event-stream consumer and the parity check has stayed green under load. 5 unit tests cover the pure counter logic (target match, multi-CR fan-in, zero-target CR, phase dispatch).

---

Replaces M3's per-tick KV re-walk with an incremental JetStream consumer on `device-state-events`. Cold start still walks KV once to seed counters; steady state consumes events and applies `from -= 1; to += 1` diffs.

New in `fleet_aggregator`:

`FleetState` (shared via `Arc<Mutex<_>>`):
- `counters`: per-deployment phase counts.
- `phase_of`: per-(device, deployment) current phase, for duplicate + resync detection.
- `latest_sequence`: per-(device, deployment) highest sequence applied; drops stale and duplicate deliveries.
- `deployment_namespace`: name → namespace map, refreshed each parity tick from the CR list (events carry only the deployment name, matching the `<device>.<deployment>` KV key format).

`apply_state_change_event()`:
- Idempotent for duplicate sequence numbers.
- Idempotent for out-of-order lower-sequence events.
- On from-phase disagreement with our belief, trusts the event and re-syncs (logs a warning; the parity check will catch any resulting drift against the legacy aggregator).
- Counter decrement saturates at zero so replays can't underflow.
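These apply invariants boil down to a small pure state transition. A minimal sketch, with simplified hypothetical types (the real `FleetState` also tracks `phase_of`, namespaces, and logging):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Phase { Pending, Running, Failed }

/// Simplified stand-in for the shared aggregator state (illustrative only).
#[derive(Default)]
struct FleetState {
    /// Per-deployment phase counts.
    counters: HashMap<String, HashMap<Phase, u64>>,
    /// Highest sequence applied per (device, deployment).
    latest_sequence: HashMap<(String, String), u64>,
}

struct StateChangeEvent {
    device: String,
    deployment: String,
    from: Option<Phase>,
    to: Phase,
    sequence: u64,
}

impl FleetState {
    /// Apply one event: drop duplicate/stale deliveries, saturate decrements at zero.
    fn apply_state_change_event(&mut self, ev: &StateChangeEvent) {
        let key = (ev.device.clone(), ev.deployment.clone());
        // Idempotent for duplicate and out-of-order lower-sequence deliveries.
        if self.latest_sequence.get(&key).is_some_and(|&s| s >= ev.sequence) {
            return;
        }
        self.latest_sequence.insert(key, ev.sequence);

        let per_phase = self.counters.entry(ev.deployment.clone()).or_default();
        if let Some(from) = ev.from {
            // Saturating decrement: a replayed or resynced event can't underflow.
            let c = per_phase.entry(from).or_insert(0);
            *c = c.saturating_sub(1);
        }
        *per_phase.entry(ev.to).or_insert(0) += 1;
    }
}
```

The sequence check makes redelivery by the JetStream consumer harmless, and `saturating_sub` keeps a stray replay from wrapping a zero counter.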
`run_event_consumer()`:
- Durable JetStream pull consumer on `STATE_EVENT_WILDCARD` with `DeliverPolicy::New` (cold start already seeded state from KV; replaying from the beginning would double-count).
- Explicit ack; malformed payloads are logged and acked to avoid infinite redelivery.

`parity_tick()` no longer walks KV: it reads live counters from the shared `FleetState` and compares them with the legacy aggregator's per-CR fold. Same match/mismatch/running-totals logging as M3.

8 new unit tests cover the event-apply invariants: first transition (no from), transition (from + to), duplicate sequence, out-of-order sequence, from-disagreement resync, unknown-deployment ignore, cold-start seeding, underflow saturation. Plus the 5 M3 tests from before: 13 aggregator tests total, all green.

---

`bucket.watch_all_from_revision(0)` sends the JetStream consumer request with `DeliverByStartSequence` and an optional-missing start sequence, which the server rejects with error 10094:

> consumer delivery policy is deliver by start sequence, but optional start sequence is not set

`watch_with_history(">")` uses `DeliverPolicy::LastPerSubject` instead: it replays the current value of every key, then streams live updates. Same cold-start-plus-steady-state semantics, correct wire. Caught by `smoke-a4 --auto`: the state watcher exited immediately on startup and no deployments were ever reconciled.

Superseded by #275 and #276.