feat(fleet-operator): aggregator recovery signal + orphan GC + recovery e2e (Ch2) #328
Open
johnride
wants to merge 3 commits from
feat/fleet-ch2-operator-recovery into feat/fleet-device-exec-logs
pull from: feat/fleet-ch2-operator-recovery
merge into: NationTech:feat/fleet-device-exec-logs
NationTech:master
NationTech:feat/fleet-device-exec-logs
NationTech:feat/zitadel-web-pkce-and-human-user
NationTech:feat/jwt-bearer-openbao-auth
NationTech:feat/fleet-ch5-graceful-deploy-upgrade
NationTech:feat/fleet-ch4-agent-upgrade
NationTech:feat/fleet-ch3-log-streaming
NationTech:feat/add-claims-for-openbao
NationTech:refactor/move-zitadel-jwt-to-module
NationTech:feat/fleet-operator-real-data
NationTech:docs/fleet-secrets-device-access
NationTech:chore/fleet-operator-prune-mock-dtos
NationTech:chore/rename-release-to-publish
NationTech:refactor/config-namespace-env-var
NationTech:feat/fleet-staging-openbao
NationTech:feat/auth-add-next-url-redirect
NationTech:pr/harmony-sso-example
NationTech:feat/unified-config-and-secrets
NationTech:ci/fleet-argo-cd
NationTech:ci/fleet-operator-release-pipeline
NationTech:feat/on-device-key-gen
NationTech:feat/install-gitea
NationTech:feat/v0-3-logs-companion
NationTech:refactor/smoke-companion-minimal
NationTech:feat/smoke-test-contract
NationTech:feat/iobench-redpanda-profile
NationTech:feat/v0-3-dashboard-role-enforcement
NationTech:feat/v0-3-init-containers
NationTech:feat/v0-3-operator-restart-baseline
NationTech:feat/fleet-e2e-x86
NationTech:feat/ceph-score
NationTech:feat/opnsense-bootstrap-score
NationTech:feat/fleet-e2e
NationTech:feat/fleet-e2e-harness-and-ping
NationTech:feat/dashboard-auth
NationTech:feat/fleet-operator-web-frontend
NationTech:feat/deploy_fleet_server_side
NationTech:feat/openwebui
NationTech:feat/iot-aggregation-scale
NationTech:feat/iot-operator-helm-chart
NationTech:feat/removesideeffect
NationTech:feat/test-alert-receivers-sttest
NationTech:feat/brocade-client-add-vlans
NationTech:feat/agent-desired-state
NationTech:feat/opnsense-dns-implementation
NationTech:feat/named-config-instances
NationTech:worktree-bridge-cse_012j1jB37XfjXvDGHUjHrKSj
NationTech:chore/leftover-adr
NationTech:feat/config_e2e_zitadel_openbao
NationTech:example/vllm
NationTech:feat/config_sqlite
NationTech:chore/roadmap
NationTech:feature/kvm-module
NationTech:feat/rustfs
NationTech:feat/harmony_assets
NationTech:feat/brocade_assisted_setup
NationTech:feat/cluster_alerting_score
NationTech:e2e-tests-multicluster
NationTech:fix/refactor_alert_receivers
NationTech:feat/change-node-readiness-strategy
NationTech:feat/zitadel
NationTech:feat/improve-inventory-discovery
NationTech:fix/monitoring_abstractions_openshift
NationTech:feat/nats-jetstream
NationTech:adr-nats-creds
NationTech:feat/st_test
NationTech:feat/dockerAutoinstall
NationTech:chore/cleanup_hacluster
NationTech:doc/cert-management
NationTech:feat/certificate_management
NationTech:adr/017-staleness-failover
NationTech:fix/nats_non_root
NationTech:feat/rebuild_inventory
NationTech:fix/opnsense_update
NationTech:feat/unshedulable_control_planes
NationTech:feat/worker_okd_install
NationTech:doc-and-braindump
NationTech:fix/pxe_install
NationTech:switch-client
NationTech:okd_enable_user_workload_monitoring
NationTech:configure-switch
NationTech:fix/clippy
NationTech:feat/gen-ca-cert
NationTech:feat/okd_default_ingress_class
NationTech:fix/add_routes_to_domain
NationTech:secrets-prompt-editor
NationTech:feat/multisiteApplication
NationTech:feat/ceph-install-score
NationTech:feat/ceph-osd-score
NationTech:feat/ceph_validate_health
NationTech:better-indicatif-progress-grouped
NationTech:feat/crd-alertmanager-configs
NationTech:better-cli
NationTech:opnsense_upgrade
NationTech:feat/monitoring-application-feature
NationTech:dev/postgres
NationTech:feat/cd/localdeploymentdemo
NationTech:feat/webhook_receiver
NationTech:feat/kube-prometheus
NationTech:feat/init_k8s_tenant
NationTech:feat/discord-webhook-receiver
NationTech:feat/kube-prometheus-monitor
NationTech:feat/tenantScore
NationTech:feat/teams-integration
NationTech:feat/slack-notifs
NationTech:monitoring
NationTech:runtime-profiles
3 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 086d905586 |
fix(fleet): clarify recovery tests + add missing scenario 4 + dedup test
Some checks failed
Run Check Script / check (pull_request) Failing after 52s
- Rewrite e2e tests with explicit SETUP/ACTION/ASSERT structure per scenario. Each test header documents devices, deployments, and expected end state. - Add scenario 4 test: device offline during restart counts as pending (2 devices, 1 deployment, only 1 reports → succeeded=1, pending=1). - Add apply_state_rejects_older_timestamp unit test (previously claimed in docs but missing). - Fix split_once → rsplit_once in parse_state_key and seed_owned_targets (device IDs with dots would silently drop entries). |
|||
| 81b0f79f55 |
docs(fleet): rewrite operator recovery scenarios with diagrams
ASCII diagrams per scenario replace dense paragraphs. Architecture overview shows the three watchers, FleetState, and convergence latches. Cold-start sequence and key invariants in scannable tables. |
|||
| 56602b505c |
feat(fleet-operator): aggregator recovery signal + orphan GC + recovery e2e (Ch2)
All checks were successful
Run Check Script / check (pull_request) Successful in 2m35s
Operator restart + aggregator recovery (v0.3 plan Ch2). The aggregator already cold-rebuilds from NATS KV + CR watches; this makes recovery observable, closes an orphan gap, and pins each failure shape with a regression test. - OperatorLiveness: a shared in-process latch (Recovering → Converged) the aggregator sets once all three cold-start sources replay (Deployment/Device watcher InitDone, device-state KV seen_current; empty-bucket short-circuit). The in-process dashboard reads it and shows a self-clearing banner via an HTMX self-poll (/__recovery), so the customer sees progress, not a blank. - gc_orphaned_desired_state: at convergence, purge desired-state whose Deployment CR no longer exists (force-deleted while the operator was down, finalizer bypassed). Belt-and-suspenders with the controller finalizer. - run() now owns its watchers in a JoinSet, so cancelling the aggregator aborts its children — no orphan tasks outliving a restart (matters for the restart-simulation tests and clean process teardown). Also made run() Send (hoisted a .await out of a tracing macro) so it can be spawned. - docs/fleet-operator-recovery-scenarios.md enumerates the failure shapes and maps each to its test. - harmony-fleet-e2e/tests/operator_recovery.rs: regression test per scenario (cold restart converges from KV; orphan GC; two operators write identical bytes; chaos kill under write load converges <30s) + AdminKv::put_device_state. Writes stay idempotent + byte-deterministic, so two operators racing agree without leader election (operator HA = D3, deferred). |