feat/ceph-score #297
Open
stremblay wants to merge 19 commits from feat/ceph-score into master
No description provided.
Adds the missing surface to drive Rook's declarative centralized config without dropping out to imperative `ceph config set` calls in the toolbox.

The new `ceph_config: Option<BTreeMap<String, BTreeMap<String, String>>>` field on CephClusterSpec mirrors the Rook v1.18 `spec.cephConfig` shape: outer key is the Ceph "WHO" target ("global", "osd.*", "mon.*", "mgr.*", "client.rgw.<store>", "osd.0", ...), inner is `option-name -> value`. All values are strings per Rook (Ceph parses them). Rook applies these after MONs reach quorum and re-applies on drift.

`CephClusterSpec::set_config(who, key, value)` is a chainable helper that lazily allocates the maps so callers can write `.set_config("osd.*", "osd_max_backfills", "1")` instead of building the nested BTreeMaps by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Rook operator Helm chart ships toolbox.enabled=false by default, so the rook-ceph-tools Deployment is never created. That breaks two downstream consumers:

- CephVerifyClusterHealth, which looks up the Deployment and execs `ceph health` inside it
- RookCephClusterScore's new post-apply readiness wait (next commit), which polls the same path

Add an `enable_toolbox: bool` field on RookCephOperatorScore (default true via both default_okd() and default_k8s()) that sets the Helm value `toolbox.enabled` to the requested string. Users who genuinely don't want the toolbox can opt out, but the typical Harmony flow needs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

K8sResourceScore returns once the API server has accepted a CR, not once the operator has reconciled it. So previously, RookCephClusterScore's interpret() returned in ~5 seconds while the actual cluster was still 2-15 minutes from being usable, causing the immediately-following CephVerifyClusterHealth to fail with "rook-ceph-tools not found" or HEALTH_WARN ≠ HEALTH_OK on a single-shot check.

After applying the CephCluster CR, the Score now waits for:

1. The rook-ceph-tools Deployment to have ≥1 ready replica (10 min timeout). Gating exec on this is mandatory because exec_app_capture_output panics (`.expect("No matching pod")`) if called when no toolbox pod exists yet.
2. `ceph health` to return HEALTH_OK (20 min timeout). Fresh clusters sit in HEALTH_WARN for a few minutes while mons reach quorum, mgrs come up, and OSDs bootstrap their PGs. The wait logs every status transition so the user can tell what's happening.

Only after both waits succeed does the Score apply the dependent CRs (block pools, filesystems, object stores, users). Failing fast at the cluster stage is better than applying CRs the operator can never reconcile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The example is meant to target a real pico OKD / external cluster, not a throwaway local K3D. Default `K8sAnywhereConfig::from_env()` reads HARMONY_USE_LOCAL_K3D and treats unset/true as "spin up a local K3D", which is the wrong behavior here.

env.sh sets:

- HARMONY_USE_LOCAL_K3D=false    force external cluster
- HARMONY_PROFILE=staging    required when use_local_k3d=false (current_target() panics otherwise)
- HARMONY_USE_SYSTEM_KUBECONFIG=true    use $HOME/.kube/config
- HARMONY_SECRET_NAMESPACE/STORE/DATABASE_URL    per-example state
- RUST_LOG=harmony=debug    to see the wait-loop progress

Leaves KUBECONFIG and HARMONY_K8S_CONTEXT commented-out as overrides the user can uncomment when their kubeconfig isn't in the default location.

Usage:

    source env.sh && cargo run -p example-install-rook-ceph

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two vars in the previous env.sh were either dead code or actively broken for this example's flow:

HARMONY_PROFILE
    Only read by K8sAnywhereTopology::current_target(), which is only invoked by Scores implementing MultiTargetTopology (ntfy, application packaging, app monitoring). None of the Scores in install_rook_ceph require that trait, so the value is never read and the panic case is never reached. Removed.

HARMONY_USE_SYSTEM_KUBECONFIG=true
    Setting this to true is actively worse: try_get_or_install_k8s_client hits `todo!()` at k8s_anywhere.rs:900 when this branch is taken. The correct way to point at an existing kubeconfig is the standard KUBECONFIG env var (or the default $HOME/.kube/config fallback in get_kube_config_path()). Removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

helm install returns once chart resources are created, not once the operator Deployment is Ready, and not once the API server's discovery cache has picked up the Rook CRDs. The kube-rs client that K8sAnywhereTopology hands out is shared and OnceCell-initialized, so its own discovery cache was populated before the chart added any ceph.rook.io/v1 resources. Applying CephCluster immediately after the operator install therefore tended to fail with "Cannot resolve GVK ceph.rook.io/v1/CephCluster".

This is the same race CNPG handles at postgresql/score_k8s.rs:180-203 via wait_until_deployment_ready + wait_for_crd + invalidate_discovery. RookCephClusterScore now does the same dance, at the top of execute(), before any CR apply:

1. wait_until_deployment_ready("rook-ceph-operator", 300s)
2. wait_for_crd("cephclusters.ceph.rook.io", 60s)
3. invalidate_discovery()
4. Apply CephCluster
5. (existing) wait for toolbox ready
6. (existing) wait for HEALTH_OK
7. Apply pools / fs / object stores / users

The subsequent pool/fs/object-store/user CRD applies happen many minutes later (after HEALTH_OK), by which point the discovery cache has long since refreshed; no per-apply invalidation needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous defaults shipped an unpinned Helm chart_version and a guessed Ceph image tag ("v19.2.3": no build suffix, no source-of-truth reference). Both are unacceptable for a production install path.

Replaced with the official pairing per the Rook 1.19 documentation:

- RookCephOperatorScore::default_okd().chart_version = Some("v1.19.5")
  Latest stable release of the rook-ceph Helm chart at https://charts.rook.io/release as of 2026-05; verified via the chart repo's index.yaml.
- CephVersionSpec::default().image = "quay.io/ceph/ceph:v19.2.3-20250717"
  The full version+build tag the Rook 1.19 upgrade docs explicitly recommend for production at https://rook.io/docs/rook/v1.19/Upgrade/ceph-upgrade/, with the date-stamped suffix that pins an immutable container image.

Pinning here means heterogeneous-daemon-version scenarios are impossible by construction, and upgrades become a deliberate change to this code rather than an unobservable container pull side-effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous fix passed --set toolbox.enabled=true to the rook-ceph operator chart. That was wrong: in Rook v1.19 the toolbox is no longer part of the operator chart; it's a standalone manifest at deploy/examples/toolbox.yaml. The Helm value was silently ignored, so the rook-ceph-tools Deployment was never created.

Symptom on a real install: cluster came up healthy (mons + mgrs + OSDs all Running) but `oc -n rook-ceph get deploy/rook-ceph-tools` returned NotFound, and RookCephClusterScore's wait_for_toolbox_ready timed out after 10 minutes.

This commit:

- Adds a new toolbox.rs module that ports the canonical rook/rook@v1.19.5 toolbox.yaml verbatim to a typed k8s_openapi::Deployment, including the inline toolbox.sh bash script (~50 lines) that re-renders /etc/ceph/ceph.conf when mon endpoints change.
- Sources the container image from the CephCluster spec's cephVersion.image so the toolbox stays in lockstep with the cluster's Ceph version automatically, with no second pin to keep in sync.
- Has RookCephClusterScore apply the typed Deployment via K8sResourceScore::single immediately after applying CephCluster, then waits for it to be Ready as before.
- Removes the now-dead enable_toolbox field and toolbox.enabled Helm value from RookCephOperatorScore, plus the misleading doc claim that the chart deploys the toolbox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Server-Side Apply rejects re-applies of a resource whose fields another field manager has taken ownership of. Rook is exactly such a manager: after reconciling a CephCluster, it claims ownership of fields like .spec.mgr.modules (the operator dynamically toggles modules) and likely .spec.storage.* (discovered nodes), .spec.dashboard.* (port assignments), etc. Re-running the example against an existing cluster therefore failed:

    ApiError: Apply failed with 1 conflict: conflict with "rook" using ceph.rook.io/v1: .spec.mgr.modules

The kube-rs apply flow used by K8sResourceScore was hardcoding `PatchParams::apply(FIELD_MANAGER)` without `.force`. apply_dynamic_many already supports force_conflicts but the typed path didn't expose it.

Changes:

- K8sResourceScore gains a `force_conflicts: bool` field (default false, so all existing call sites keep their semantics) plus a chainable builder `with_force_conflicts(true)`. When set, execute() round-trips each typed resource through serde_json to DynamicObject and routes via apply_dynamic_many with force=true.
- RookCephClusterScore opts in via with_force_conflicts(true) on every Rook CR apply (CephCluster, CephBlockPool, CephFilesystem, CephObjectStore, CephObjectStoreUser). The toolbox Deployment and auto-generated StorageClasses keep the default (no force): they're only managed by Harmony, no other field manager to conflict with.

For declarative IaC this is the correct semantic: Harmony's declared state is authoritative; any operator-side mutations to fields we set get overridden on the next reconcile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
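
For reviewers, the lazily allocating `set_config` helper described in the first commit message could look roughly like the sketch below. This is an illustration only: the struct is a one-field stand-in for the real CephClusterSpec, and the by-value `self` signature is an assumption, since the commit message only says the helper is chainable.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for CephClusterSpec (assumption: the real struct has
/// many more fields; only `ceph_config` is modelled here).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CephClusterSpec {
    /// Outer key: Ceph "who" target; inner map: option name -> string value.
    pub ceph_config: Option<BTreeMap<String, BTreeMap<String, String>>>,
}

impl CephClusterSpec {
    /// Chainable setter. Allocates the outer map and the per-target inner
    /// map on first use, then stores the value as a string.
    /// Assumption: takes and returns `self` by value.
    pub fn set_config(mut self, who: &str, key: &str, value: &str) -> Self {
        self.ceph_config
            .get_or_insert_with(BTreeMap::new)
            .entry(who.to_string())
            .or_default()
            .insert(key.to_string(), value.to_string());
        self
    }
}

/// Usage: two chained calls, no hand-built nested maps.
pub fn example_spec() -> CephClusterSpec {
    CephClusterSpec::default()
        .set_config("osd.*", "osd_max_backfills", "1")
        .set_config("global", "osd_pool_default_size", "3")
}
```

Keeping the field as an `Option` means a spec that never calls `set_config` still has `ceph_config` unset rather than an empty map.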
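
The HEALTH_OK wait from the readiness-wait commit (poll until healthy, log each status transition, give up after a timeout) can be sketched as follows. This is a synchronous, std-only approximation under stated assumptions: the function name `wait_for_health_ok` and the `probe` closure are hypothetical, and the real Score presumably runs an async loop that execs `ceph health` in the toolbox pod.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls `probe` until it reports "HEALTH_OK" or `timeout` elapses.
/// `probe` is a hypothetical stand-in for exec-ing `ceph health` in the
/// rook-ceph-tools pod and returning the status string.
pub fn wait_for_health_ok<F>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<(), String>
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    let mut last: Option<String> = None;
    loop {
        let status = probe();
        // Log only when the status changes, so the output shows transitions
        // rather than one line per poll.
        if last.as_deref() != Some(status.as_str()) {
            println!("ceph health: {status}");
            last = Some(status.clone());
        }
        if status == "HEALTH_OK" {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(format!(
                "timed out after {timeout:?}; last status was {status}"
            ));
        }
        thread::sleep(interval);
    }
}
```

With the timeouts quoted in the commit message, a caller would pass something like `Duration::from_secs(20 * 60)` for `timeout`; the polling interval is not stated in the source.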