feat(monitoring): Ceph alerts integrated with OKD's native alerting stack #265
## Summary
Adds Ceph alert rules that fire through OKD's existing Alertmanager and show
up in the console's Alerting tab — no parallel Prometheus stack, no UWM
required. Also splits the old `storage.json` dashboard into two focused views
and fixes a handful of dashboard queries that broke on OKD 4.20.
## What's new
- `OpenshiftPrometheusRuleScore` — generic PrometheusRule CRD installer (namespace + name + `Vec<RuleGroup>` + optional label override). Reuses the existing `PrometheusRule`/`RuleGroup` types.
- `ceph_alert_rule_groups()` — 13 Ceph alerts across 5 groups (health, OSD, capacity, PGs, nodes) with severity labels and summary/description annotations (shape sketched below).
- `examples/okd_ceph_alerts/` — runnable example, mirrors the `cluster_dashboards` example.
- `ceph.json` dashboard (Ceph admin view with `$pool`/`$osd` variables, PG states, latency, recovery, capacity forecast, OSD/Pool tables); rewritten `storage.json` as a pure K8s persistent-storage view (PVC usage top-N from kubelet stats, storage-class breakdown).
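For a feel of the data `ceph_alert_rule_groups()` returns, here is a minimal stand-in sketch. The structs only mirror the PrometheusRule CRD's `spec.groups[]` schema, not the crate's actual `RuleGroup` type, and the group name and pending duration are illustrative; `ceph_health_status == 1` is the standard ceph-mgr exporter signal for `HEALTH_WARN`:

```rust
// Stand-in types mirroring the PrometheusRule CRD's spec.groups[] schema;
// the crate's real `PrometheusRule`/`RuleGroup` types may differ in detail.
struct AlertRule {
    alert: &'static str,
    expr: &'static str,     // PromQL
    pending: &'static str,  // the CRD's `for` field
    severity: &'static str, // label
    summary: &'static str,  // annotation
}

struct RuleGroup {
    name: &'static str,
    rules: Vec<AlertRule>,
}

fn main() {
    // One of the 13 alerts: ceph-mgr's exporter reports
    // ceph_health_status == 1 while the cluster is in HEALTH_WARN.
    let health = RuleGroup {
        name: "ceph-health", // group name illustrative
        rules: vec![AlertRule {
            alert: "CephHealthWarn",
            expr: "ceph_health_status == 1",
            pending: "5m", // illustrative pending duration
            severity: "warning",
            summary: "Ceph cluster is in HEALTH_WARN",
        }],
    };
    let r = &health.rules[0];
    println!(
        "{}: {} ({}) when `{}` for {}: {}",
        health.name, r.alert, r.severity, r.expr, r.pending, r.summary
    );
}
```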
## Why this approach
The `rook-ceph` namespace already has the `openshift.io/cluster-monitoring: "true"` label that feeds platform Prometheus. Dropping PrometheusRule CRDs there is all that's needed — simplest path that meets the goal.
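A sketch of how little wiring that implies; the struct and field names below are stand-ins, not the crate's real API (see `examples/okd_ceph_alerts/` for actual usage):

```rust
// Illustrative stand-in for the new score. The point: targeting the
// already-labeled rook-ceph namespace is the only deployment decision needed.
struct OpenshiftPrometheusRuleScore {
    namespace: String,        // must carry openshift.io/cluster-monitoring: "true"
    name: String,             // name of the generated PrometheusRule object
    group_names: Vec<String>, // stands in for the real Vec<RuleGroup>
    labels: Option<String>,   // optional label override
}

fn main() {
    let score = OpenshiftPrometheusRuleScore {
        namespace: "rook-ceph".into(),
        name: "ceph-alerts".into(), // object name illustrative
        group_names: vec!["ceph-health".into(), "ceph-osd".into()],
        labels: None,
    };
    println!(
        "PrometheusRule {}/{}: {} group(s), label override: {:?}",
        score.namespace,
        score.name,
        score.group_names.len(),
        score.labels
    );
}
```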
None of the four existing alerting scores fit:
`OpenshiftClusterAlertScore` only handles receivers, and the three CRD/Helm/RHOB scores each deploy their own Prometheus. The new score is small (~100 lines) and composes with them.
## Notes
- `CephHealthWarn` fires on any `HEALTH_WARN` cluster — expected, good E2E test.
- Some dashboard panels rely on optional metrics (`ceph_num_objects_*`, `ceph_osd_recovery_*`) — they will show "No data" if absent, easy to tune.
- `OpenshiftPrometheusRuleScore` is generic and also works for non-Ceph rules in UWM-watched namespaces (sketch below).
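To illustrate the generic case, a hypothetical non-Ceph group in the same stand-in shape as the earlier sketch; the PVC alert is my example, not something this PR ships, built on the standard kubelet PVC metrics:

```rust
// Hypothetical non-Ceph rule group, using stand-in types as before.
struct AlertRule {
    alert: &'static str,
    expr: &'static str,
}
struct RuleGroup {
    name: &'static str,
    rules: Vec<AlertRule>,
}

fn pvc_rule_group() -> RuleGroup {
    RuleGroup {
        name: "pvc-capacity",
        rules: vec![AlertRule {
            alert: "PvcAlmostFull",
            // Standard kubelet PVC stats: fire when a volume is >90% used.
            expr: "kubelet_volume_stats_used_bytes \
                   / kubelet_volume_stats_capacity_bytes > 0.9",
        }],
    }
}

fn main() {
    let g = pvc_rule_group();
    println!("{}: {} when {}", g.name, g.rules[0].alert, g.rules[0].expr);
}
```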