feat(monitoring): Ceph alerts integrated with OKD's native alerting stack #265

Merged
stremblay merged 3 commits from feat/ceph-alerts into master 2026-04-20 18:14:45 +00:00

Summary

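Adds Ceph alert rules that fire through OKD's existing Alertmanager and show
up in the console's Alerting tab, with no parallel Prometheus stack and no
user-workload monitoring (UWM) required. Also splits the old storage.json dashboard into two focused views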
required. Also splits the old storage.json dashboard into two focused views
and fixes a handful of dashboard queries that broke on OKD 4.20.

What's new

  • OpenshiftPrometheusRuleScore — generic PrometheusRule CRD installer
    (namespace + name + Vec<RuleGroup> + optional label override). Reuses the
    existing PrometheusRule/RuleGroup types.
  • ceph_alert_rule_groups() — 13 Ceph alerts across 5 groups (health,
    OSD, capacity, PGs, nodes) with severity labels and summary/description
    annotations.
  • examples/okd_ceph_alerts/ — runnable example, mirrors the
    cluster_dashboards example.
  • New ceph.json dashboard (Ceph admin view with $pool/$osd
    variables, PG states, latency, recovery, capacity forecast, OSD/Pool
    tables); rewritten storage.json as pure K8s persistent-storage view
    (PVC usage top-N from kubelet stats, storage-class breakdown).
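
As a rough illustration of what one entry in such a rule group might look like, here is a minimal sketch. The `Rule` and `RuleGroup` structs below are hypothetical stand-ins, not the project's real types, and the group name, expression, and `for` duration are assumptions (only the alert name `CephHealthWarn` comes from this PR). `ceph_health_status` is the Ceph exporter's health gauge, where 1 means HEALTH_WARN.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the project's rule type; real field names may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Rule {
    pub alert: String,
    pub expr: String,
    pub for_duration: String,
    pub labels: BTreeMap<String, String>,
    pub annotations: BTreeMap<String, String>,
}

/// Hypothetical stand-in for the project's existing `RuleGroup` type.
#[derive(Debug, Clone, PartialEq)]
pub struct RuleGroup {
    pub name: String,
    pub rules: Vec<Rule>,
}

/// Builds one alert with a severity label and summary/description annotations.
pub fn alert(
    name: &str,
    expr: &str,
    for_duration: &str,
    severity: &str,
    summary: &str,
    description: &str,
) -> Rule {
    let mut labels = BTreeMap::new();
    labels.insert("severity".to_string(), severity.to_string());

    let mut annotations = BTreeMap::new();
    annotations.insert("summary".to_string(), summary.to_string());
    annotations.insert("description".to_string(), description.to_string());

    Rule {
        alert: name.to_string(),
        expr: expr.to_string(),
        for_duration: for_duration.to_string(),
        labels,
        annotations,
    }
}

/// Example "health" group; the expression and the 5m duration are assumptions.
pub fn example_health_group() -> RuleGroup {
    RuleGroup {
        name: "ceph-health".to_string(),
        rules: vec![alert(
            "CephHealthWarn",
            "ceph_health_status == 1",
            "5m",
            "warning",
            "Ceph cluster is in HEALTH_WARN",
            "ceph_health_status has reported HEALTH_WARN for more than 5 minutes.",
        )],
    }
}
```

Keeping severity in labels and the human-readable text in annotations follows the usual Prometheus convention: labels drive Alertmanager routing, while annotations only affect what is displayed.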

Why this approach

The rook-ceph namespace already has the
openshift.io/cluster-monitoring: "true" label that feeds platform
Prometheus. Creating PrometheusRule resources there is all that's needed;
it is the simplest path that meets the goal.
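
To make the paragraph above concrete, the following sketch renders the kind of manifest that would end up in the namespace. It is not the score's actual implementation: the `Rule` struct, the function name, and the resource and group names in the usage note are hypothetical. Only the `monitoring.coreos.com/v1` / `PrometheusRule` kind and the `rook-ceph` namespace are taken as given.

```rust
/// Hypothetical minimal rule: alert name, PromQL expression, severity.
pub struct Rule {
    pub alert: String,
    pub expr: String,
    pub severity: String,
}

/// Renders a PrometheusRule manifest with a single rule group as YAML text.
/// A real implementation would use a YAML serializer; string building is
/// used here only to keep the sketch dependency-free.
pub fn render_prometheus_rule(
    namespace: &str,
    name: &str,
    group: &str,
    rules: &[Rule],
) -> String {
    let mut out = String::new();
    out.push_str("apiVersion: monitoring.coreos.com/v1\n");
    out.push_str("kind: PrometheusRule\n");
    out.push_str("metadata:\n");
    out.push_str(&format!("  name: {}\n", name));
    out.push_str(&format!("  namespace: {}\n", namespace));
    out.push_str("spec:\n");
    out.push_str("  groups:\n");
    out.push_str(&format!("    - name: {}\n", group));
    out.push_str("      rules:\n");
    for rule in rules {
        out.push_str(&format!("        - alert: {}\n", rule.alert));
        // Debug formatting wraps the expression in double quotes, which is
        // adequate for simple expressions without unusual characters.
        out.push_str(&format!("          expr: {:?}\n", rule.expr));
        out.push_str("          labels:\n");
        out.push_str(&format!("            severity: {}\n", rule.severity));
    }
    out
}
```

Calling `render_prometheus_rule("rook-ceph", "ceph-alerts", "ceph-health", &rules)` yields a manifest that could be applied with `oc apply -f`; the names `ceph-alerts` and `ceph-health` are placeholders.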

None of the four existing alerting scores fit: OpenshiftClusterAlertScore
only handles receivers, and the three CRD/Helm/RHOB scores each deploy their
own Prometheus. The new score is small (~100 lines) and composes with them.

Notes

  • CephHealthWarn fires whenever the cluster reports HEALTH_WARN. This is
    expected, and it makes a convenient end-to-end test of the alert path.
  • A few alert exprs reference metrics whose names vary between Rook
    versions (ceph_num_objects_*, ceph_osd_recovery_*). If a metric is
    absent, the alert shows "No data"; the expression is easy to adjust.
  • OpenshiftPrometheusRuleScore is generic and also works for non-Ceph
    rules in UWM-watched namespaces.
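
One possible way to spot the version-dependent metrics mentioned above before deploying is to compare each expression against the metric names the cluster actually exports. The sketch below is a heuristic, not part of this PR: both function names are hypothetical, and the alert name `CephRecoveryStalled` and metric `ceph_osd_recovery_ops` in the usage note are made up for illustration.

```rust
use std::collections::BTreeSet;

/// Extracts identifiers starting with `ceph_` from a PromQL expression.
/// This is a token-level heuristic: it does not parse PromQL, so a quoted
/// label value that starts with `ceph_` would also be picked up.
pub fn ceph_metrics_in(expr: &str) -> BTreeSet<String> {
    expr.split(|c: char| !(c.is_ascii_alphanumeric() || c == '_' || c == ':'))
        .filter(|token| token.starts_with("ceph_"))
        .map(str::to_string)
        .collect()
}

/// Returns the names of alerts whose expression references at least one
/// `ceph_` metric that is not in `available`. Each alert is a
/// `(name, expr)` pair.
pub fn alerts_with_missing_metrics<'a>(
    alerts: &[(&'a str, &'a str)],
    available: &BTreeSet<String>,
) -> Vec<&'a str> {
    alerts
        .iter()
        .filter(|(_, expr)| {
            ceph_metrics_in(expr)
                .iter()
                .any(|metric| !available.contains(metric))
        })
        .map(|(name, _)| *name)
        .collect()
}
```

For example, with only `ceph_health_status` available, the pair `("CephRecoveryStalled", "rate(ceph_osd_recovery_ops[5m]) == 0")` would be reported while `("CephHealthWarn", "ceph_health_status == 1")` would not. The set of available names would have to come from the cluster itself, for instance from the Prometheus metric-name listing.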
stremblay added 3 commits 2026-04-20 18:05:08 +00:00
stremblay merged commit bae162a3e4 into master 2026-04-20 18:14:45 +00:00
stremblay deleted branch feat/ceph-alerts 2026-04-20 18:14:46 +00:00
Reference: NationTech/harmony#265