feat(monitoring): Ceph alerts integrated with OKD's native alerting stack #265
## Summary
Adds Ceph alert rules that fire through OKD's existing Alertmanager and show
up in the console's Alerting tab — no parallel Prometheus stack, no UWM
required. Also splits the old `storage.json` dashboard into two focused views
and fixes a handful of dashboard queries that broke on OKD 4.20.
## What's new
- `OpenshiftPrometheusRuleScore` — generic PrometheusRule CRD installer (namespace + name + `Vec<RuleGroup>` + optional label override). Reuses the existing `PrometheusRule`/`RuleGroup` types.
- `ceph_alert_rule_groups()` — 13 Ceph alerts across 5 groups (health, OSD, capacity, PGs, nodes) with severity labels and summary/description annotations (shape sketched below).
- `examples/okd_ceph_alerts/` — runnable example, mirrors the `cluster_dashboards` example.
- `ceph.json` dashboard (Ceph admin view with `$pool`/`$osd` variables, PG states, latency, recovery, capacity forecast, OSD/Pool tables); rewritten `storage.json` as a pure K8s persistent-storage view (PVC usage top-N from kubelet stats, storage-class breakdown).
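For a feel of the data `ceph_alert_rule_groups()` returns, here is a minimal stand-in sketch. The structs only mirror the PrometheusRule CRD's `spec.groups[]` schema, not the crate's actual `RuleGroup` type, and the group name and pending duration are illustrative; `ceph_health_status == 1` is the standard ceph-mgr exporter signal for `HEALTH_WARN`:

```rust
// Stand-in types mirroring the PrometheusRule CRD's spec.groups[] schema;
// the crate's real `PrometheusRule`/`RuleGroup` types may differ in detail.
struct AlertRule {
    alert: &'static str,
    expr: &'static str,     // PromQL
    pending: &'static str,  // the CRD's `for` field
    severity: &'static str, // label
    summary: &'static str,  // annotation
}

struct RuleGroup {
    name: &'static str,
    rules: Vec<AlertRule>,
}

fn main() {
    // One of the 13 alerts: ceph-mgr's exporter reports
    // ceph_health_status == 1 while the cluster is in HEALTH_WARN.
    let health = RuleGroup {
        name: "ceph-health", // group name illustrative
        rules: vec![AlertRule {
            alert: "CephHealthWarn",
            expr: "ceph_health_status == 1",
            pending: "5m", // illustrative pending duration
            severity: "warning",
            summary: "Ceph cluster is in HEALTH_WARN",
        }],
    };
    let r = &health.rules[0];
    println!(
        "{}: {} ({}) when `{}` for {}: {}",
        health.name, r.alert, r.severity, r.expr, r.pending, r.summary
    );
}
```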
## Why this approach
The `rook-ceph` namespace already has the `openshift.io/cluster-monitoring: "true"` label that feeds platform Prometheus. Dropping PrometheusRule CRDs there is all that's needed — simplest path that meets the goal.
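A sketch of how little wiring that implies; the struct and field names below are stand-ins, not the crate's real API (see `examples/okd_ceph_alerts/` for actual usage):

```rust
// Illustrative stand-in for the new score. The point: targeting the
// already-labeled rook-ceph namespace is the only deployment decision needed.
struct OpenshiftPrometheusRuleScore {
    namespace: String,        // must carry openshift.io/cluster-monitoring: "true"
    name: String,             // name of the generated PrometheusRule object
    group_names: Vec<String>, // stands in for the real Vec<RuleGroup>
    labels: Option<String>,   // optional label override
}

fn main() {
    let score = OpenshiftPrometheusRuleScore {
        namespace: "rook-ceph".into(),
        name: "ceph-alerts".into(), // object name illustrative
        group_names: vec!["ceph-health".into(), "ceph-osd".into()],
        labels: None,
    };
    println!(
        "PrometheusRule {}/{}: {} group(s), label override: {:?}",
        score.namespace,
        score.name,
        score.group_names.len(),
        score.labels
    );
}
```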
None of the four existing alerting scores fit:
`OpenshiftClusterAlertScore` only handles receivers, and the three CRD/Helm/RHOB scores each deploy their own Prometheus. The new score is small (~100 lines) and composes with them.
## Notes
- `CephHealthWarn` fires on any `HEALTH_WARN` cluster — expected, good E2E test.
- Some dashboard panels rely on optional metrics (`ceph_num_objects_*`, `ceph_osd_recovery_*`) — they will show "No data" if absent, easy to tune.
- `OpenshiftPrometheusRuleScore` is generic and also works for non-Ceph rules in UWM-watched namespaces (sketch below).
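To illustrate the generic case, a hypothetical non-Ceph group in the same stand-in shape as the earlier sketch; the PVC alert is my example, not something this PR ships, built on the standard kubelet PVC metrics:

```rust
// Hypothetical non-Ceph rule group, using stand-in types as before.
struct AlertRule {
    alert: &'static str,
    expr: &'static str,
}
struct RuleGroup {
    name: &'static str,
    rules: Vec<AlertRule>,
}

fn pvc_rule_group() -> RuleGroup {
    RuleGroup {
        name: "pvc-capacity",
        rules: vec![AlertRule {
            alert: "PvcAlmostFull",
            // Standard kubelet PVC stats: fire when a volume is >90% used.
            expr: "kubelet_volume_stats_used_bytes \
                   / kubelet_volume_stats_capacity_bytes > 0.9",
        }],
    }
}

fn main() {
    let g = pvc_rule_group();
    println!("{}: {} when {}", g.name, g.rules[0].alert, g.rules[0].expr);
}
```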