feat(monitoring): Datadog 15-key-metrics dashboard + Ceph "what's wrong" drilldown #266

Merged
stremblay merged 3 commits from feat/datadog-k8s-metrics into master 2026-04-21 11:21:29 +00:00
Owner

Two complementary additions to ClusterDashboardsScore.

  1. New dashboard — Datadog — 15 Key Kubernetes Metrics

Implements the 15 metrics from Datadog's Optimal Kubernetes Performance whitepaper (dashboards/datadog-15-k8s-metrics.json, registered in score.rs). 33 panels in 7 collapsible rows, organised by the whitepaper's taxonomy:

  • Cluster state (1–3): node conditions, Deployment/DaemonSet desired-vs-current, available-vs-unavailable.
  • Resources (4–10): memory, CPU, disk — usage vs requests vs limits at cluster, top-pod, and per-node levels.
  • Control plane (11–15): etcd leader + transitions, API-server rate/latency/errors, workqueue durations, scheduler attempts and latency.

Style-matched to the suite (Prometheus-Cluster datasource, schema 36, row sectioning, $namespace/$node variables). Two Datadog metric names were renamed/removed in K8s ≥ 1.23 and remapped to the OKD-native equivalents: apiserver_request_latencies_* →
apiserver_request_duration_seconds_bucket, scheduler_e2e_scheduling_duration_seconds_* → scheduler_scheduling_attempt_duration_seconds_*.

  1. Ceph dashboard — "Active Issues" section

The Health tile turned yellow/red without telling operators why. Added a new section between Cluster Status and Capacity:

  • Always visible: critical / warning firing-alert counts as stat tiles.
  • Collapsed drilldown with two side-by-side tables:
    • Active Ceph health checks (ceph_health_detail == 1) — Severity + Check columns, one row per active native Ceph check (OSD_DOWN, TOO_MANY_PGS, POOL_NEARFULL…). Direct answer to "why isn't it green?".
    • Firing Ceph alerts (Alertmanager view) — catches rules without a ceph_health_detail counterpart (CephDaysUntilFull, CephNodeRootDiskUsage).

Friendly empty states; subsequent rows shifted to stay compact when collapsed.

Two complementary additions to ClusterDashboardsScore. 1. New dashboard — Datadog — 15 Key Kubernetes Metrics Implements the 15 metrics from Datadog's Optimal Kubernetes Performance whitepaper (dashboards/datadog-15-k8s-metrics.json, registered in score.rs). 33 panels in 7 collapsible rows, organised by the whitepaper's taxonomy: - Cluster state (1–3): node conditions, Deployment/DaemonSet desired-vs-current, available-vs-unavailable. - Resources (4–10): memory, CPU, disk — usage vs requests vs limits at cluster, top-pod, and per-node levels. - Control plane (11–15): etcd leader + transitions, API-server rate/latency/errors, workqueue durations, scheduler attempts and latency. Style-matched to the suite (Prometheus-Cluster datasource, schema 36, row sectioning, $namespace/$node variables). Two Datadog metric names were renamed/removed in K8s ≥ 1.23 and remapped to the OKD-native equivalents: apiserver_request_latencies_* → apiserver_request_duration_seconds_bucket, scheduler_e2e_scheduling_duration_seconds_* → scheduler_scheduling_attempt_duration_seconds_*. 2. Ceph dashboard — "Active Issues" section The Health tile turned yellow/red without telling operators why. Added a new section between Cluster Status and Capacity: - Always visible: critical / warning firing-alert counts as stat tiles. - Collapsed drilldown with two side-by-side tables: - Active Ceph health checks (ceph_health_detail == 1) — Severity + Check columns, one row per active native Ceph check (OSD_DOWN, TOO_MANY_PGS, POOL_NEARFULL…). Direct answer to "why isn't it green?". - Firing Ceph alerts (Alertmanager view) — catches rules without a ceph_health_detail counterpart (CephDaysUntilFull, CephNodeRootDiskUsage). Friendly empty states; subsequent rows shifted to stay compact when collapsed.
stremblay added 2 commits 2026-04-20 19:50:04 +00:00
stremblay added 1 commit 2026-04-20 19:58:57 +00:00
feat: improve ceph dashboard
All checks were successful
Run Check Script / check (pull_request) Successful in 2m17s
349c2a1358
stremblay merged commit 52db82865d into master 2026-04-21 11:21:29 +00:00
stremblay deleted branch feat/datadog-k8s-metrics 2026-04-21 11:21:30 +00:00
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: NationTech/harmony#266
No description provided.