feat(monitoring): Datadog 15-key-metrics dashboard + Ceph "what's wrong" drilldown #266
Two complementary additions to ClusterDashboardsScore.
Implements the 15 metrics from Datadog's Optimal Kubernetes Performance whitepaper (dashboards/datadog-15-k8s-metrics.json, registered in score.rs). 33 panels in 7 collapsible rows, organised by the whitepaper's taxonomy:
Style-matched to the suite (Prometheus-Cluster datasource, schema 36, row sectioning, $namespace/$node variables). Two Datadog metric names were renamed/removed in K8s ≥ 1.23 and remapped to the OKD-native equivalents: apiserver_request_latencies_* → apiserver_request_duration_seconds_bucket, scheduler_e2e_scheduling_duration_seconds_* → scheduler_scheduling_attempt_duration_seconds_*.
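For reviewers checking panel queries against the whitepaper, the two renames can be sketched as a prefix lookup. This is only an illustration: `REMAPS` and `remap_metric` are hypothetical names and are not part of score.rs or the dashboard JSON, and the sketch assumes the suffix after the renamed prefix (for example `_bucket`) carries over unchanged.

```rust
/// Legacy Datadog-whitepaper metric prefixes and the replacements named in this PR.
/// Hypothetical table for illustration; not taken from score.rs.
const REMAPS: [(&str, &str); 2] = [
    (
        "apiserver_request_latencies",
        "apiserver_request_duration_seconds",
    ),
    (
        "scheduler_e2e_scheduling_duration_seconds",
        "scheduler_scheduling_attempt_duration_seconds",
    ),
];

/// Rewrite a metric name if it starts with a legacy prefix.
/// Names that match no legacy prefix are returned unchanged.
pub fn remap_metric(name: &str) -> String {
    for &(old, new) in REMAPS.iter() {
        if let Some(suffix) = name.strip_prefix(old) {
            // Keep whatever followed the old prefix (e.g. "_bucket", "_sum").
            return format!("{}{}", new, suffix);
        }
    }
    name.to_string()
}
```

Under these assumptions, `remap_metric("apiserver_request_latencies_bucket")` yields `apiserver_request_duration_seconds_bucket`, while a metric that was never renamed passes through untouched.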
The Health tile turned yellow/red without telling operators why. Added a new section between Cluster Status and Capacity:
Friendly empty states; subsequent rows shifted to stay compact when collapsed.