"description":"Count of Ceph alert rules currently in firing state with severity=critical. Drives the red tile on the Health stat to concrete action. 0 when the cluster is healthy.",
"description":"Count of Ceph alert rules currently in firing state with severity=warning. Matches what drives the yellow HEALTH_WARN tile on this dashboard.",
"type":"row","id":104,"title":"Issue details — click to expand","collapsed":true,
"gridPos":{"h":1,"w":24,"x":0,"y":11},
"panels":[
{
"type":"table","id":105,"title":"Active Ceph health checks (ceph health detail)",
"description":"Exactly what `ceph health detail` would show. One row per active health check; the Check column is the Ceph check code (OSD_DOWN, POOL_NEARFULL, PG_DEGRADED, MON_CLOCK_SKEW, etc.). Severity is the Ceph-native HEALTH_WARN / HEALTH_ERR label emitted by the mgr prometheus module. An empty table means Ceph reports no active health checks — the Health tile above should be HEALTH_OK. This is the primary answer to 'why isn't it green?'.",
"description":"Instant-query view of every Ceph alert currently firing — the same set that pages oncall through Alertmanager. Usually matches the health-checks table above, plus derived alerts that have no direct ceph_health_detail counterpart (CephDaysUntilFull, CephNodeRootDiskUsage). The ALERTS metric carries labels only, not annotations: alert name plus daemon/pool/instance labels should be enough to identify the problem; run `oc -n openshift-monitoring get prometheusrule ceph-alerts -o yaml` or check Alertmanager for the full summary/description.",
"id":100,"type":"row","title":"Cluster State — metrics 1–3 (Node status, Desired vs current pods, Available vs unavailable pods)",
"collapsed":false,
"gridPos":{"h":1,"w":24,"x":0,"y":0}
},
{
"id":1,"type":"stat","title":"Ready Nodes",
"description":"Metric 1 — Node status. Count of nodes with condition Ready=true. A node that drops out of Ready can no longer accept new pods; scheduling freezes until it recovers or is drained.",
"targets":[{"expr":"count(kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\"} == 1) or vector(0)","refId":"A","legendFormat":""}],
"description":"Nodes under disk pressure. Kubelet runs GC (removing unused images and dead containers) and, if space stays low, starts evicting pods.",
"targets":[{"expr":"count(kube_node_status_condition{condition=\"NetworkUnavailable\",status=\"true\"} == 1) or vector(0)","refId":"A","legendFormat":""}],
"id":7,"type":"timeseries","title":"Deployments — Desired vs Current pods",
"description":"Metric 2 — Desired vs current pods (Deployments). A persistent gap means pods cannot be scheduled: check node capacity, PodDisruptionBudgets, and image pull failures.",
"id":10,"type":"timeseries","title":"DaemonSets — Desired vs Scheduled",
"description":"Metric 2 — Desired vs current pods (DaemonSets). DaemonSets should have one pod per matching node; a gap means the pod cannot be placed (taints, resources, node selectors).",
"id":11,"type":"timeseries","title":"DaemonSets — Available vs Unavailable",
"description":"Metric 3 — Available/unavailable (DaemonSets). Unavailable DaemonSet pods often mean per-node infrastructure pods (CNI, logging, monitoring agents) are failing on specific nodes.",
"id":20,"type":"timeseries","title":"Cluster memory — usage vs requests vs limits",
"description":"Metrics 4–5 — aggregate. Compares how much memory containers actually consume (working set) to what they requested and what they are limited to. A pod that crosses its limit is OOMKilled.",
"description":"Metric 4 — pod-level. Pods approaching 100% of their memory limit will be OOMKilled. If a pod persistently sits near the limit, either raise the limit or optimize memory use.",
"expr":"topk(15,\n 100 * sum by(namespace, pod)(container_memory_working_set_bytes{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"})\n /\n sum by(namespace, pod)(kube_pod_container_resource_limits{namespace=~\"$namespace\",resource=\"memory\",container!=\"\"})\n)",
"id":22,"type":"timeseries","title":"Node memory — requests vs allocatable",
"description":"Metric 6 — per node. Compares the sum of pod memory requests placed on each node to the node's allocatable memory. If requests approach allocatable, the scheduler can no longer place new pods on that node.",
"description":"How full each node is in terms of scheduled (requested) memory. ≥ 100% means no further pods requesting memory can be scheduled there.",
"expr":"100 *\n sum by(node)(kube_pod_container_resource_requests{resource=\"memory\",container!=\"\",node=~\"$node\"})\n /\n sum by(node)(kube_node_status_allocatable{resource=\"memory\",node=~\"$node\"})",
"id":300,"type":"row","title":"Resources — CPU (metrics 8–10)",
"collapsed":false,
"gridPos":{"h":1,"w":24,"x":0,"y":38}
},
{
"id":30,"type":"timeseries","title":"Cluster CPU — usage vs requests vs limits",
"description":"Metrics 9–10 — aggregate. Unlike memory, CPU is compressible: exceeding a limit causes throttling (slow), not OOMKill. A persistent gap between usage and limits is fine; a persistent gap between usage and requests wastes capacity.",
"id":31,"type":"timeseries","title":"Top 15 pods — CPU usage / CPU limit (%)",
"description":"Metric 9 — pod-level. Pods that sit above 100% for long windows are being throttled by the kernel, which causes latency spikes even though the pod is not killed.",
"expr":"topk(15,\n 100 * sum by(namespace, pod)(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"}[5m]))\n /\n sum by(namespace, pod)(kube_pod_container_resource_limits{namespace=~\"$namespace\",resource=\"cpu\",container!=\"\"})\n)",
"id":32,"type":"timeseries","title":"Node CPU — requests vs allocatable",
"description":"Metric 8 — per node. Same shape as memory: once requests saturate allocatable CPU, no more pods requesting CPU can be placed on the node.",
"expr":"100 *\n sum by(node)(kube_pod_container_resource_requests{resource=\"cpu\",container!=\"\",node=~\"$node\"})\n /\n sum by(node)(kube_node_status_allocatable{resource=\"cpu\",node=~\"$node\"})",
"description":"Metric 7 — node level. Disk is non-compressible: when it is exhausted, kubelet raises DiskPressure and evicts pods. Alert well before 100%.",
"expr":"100 * (1 - (\n sum by(instance)(node_filesystem_avail_bytes{mountpoint=~\"/|/var\",fstype!~\"tmpfs|overlay|squashfs|ramfs\"})\n /\n sum by(instance)(node_filesystem_size_bytes{mountpoint=~\"/|/var\",fstype!~\"tmpfs|overlay|squashfs|ramfs\"})\n))",
"description":"Metric 7 — volume level. Persistent volumes that fill up cause write errors inside applications. Alert at ~80% so there is time to expand or free space.",
"expr":"topk(20,\n 100 * max by(namespace, persistentvolumeclaim)(kubelet_volume_stats_used_bytes{namespace=~\"$namespace\"})\n /\n max by(namespace, persistentvolumeclaim)(kubelet_volume_stats_capacity_bytes{namespace=~\"$namespace\"})\n)",
"description":"Metric 11 — etcd_server_has_leader. Minimum across members. 0 means at least one member does not see a leader — the cluster may be partitioned or mid-election.",
"description":"Metric 12 — etcd_server_leader_changes_seen_total increase over 1h. Frequent elections usually mean network flapping or resource exhaustion on a member.",
"id":600,"type":"row","title":"Control plane — API Server (metric 13)",
"collapsed":false,
"gridPos":{"h":1,"w":24,"x":0,"y":77}
},
{
"id":60,"type":"timeseries","title":"API server request rate by verb",
"description":"Metric 13 — request count. Non-streaming calls per second by verb. Read-heavy (GET/LIST) load is usually controllers; write-heavy (POST/PUT/PATCH/DELETE) is user activity or autoscaling.",
"id":61,"type":"timeseries","title":"API server latency p50 / p95 / p99",
"description":"Metric 13 — request duration. Rising p99 with flat p50 is classic tail-latency degradation — look at a single slow resource or an overloaded admission webhook.",
{"expr":"histogram_quantile(0.50, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"WATCH|CONNECT\"}[5m])) by (le))","refId":"A","legendFormat":"p50"},
{"expr":"histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"WATCH|CONNECT\"}[5m])) by (le))","refId":"B","legendFormat":"p95"},
{"expr":"histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"WATCH|CONNECT\"}[5m])) by (le))","refId":"C","legendFormat":"p99"}
"id":70,"type":"timeseries","title":"Workqueue wait (queue_duration) — p99 by queue",
"description":"Metric 14 — how long items sit in each controller's workqueue before being picked up. A rising line indicates the controller can no longer keep up with cluster changes.",
"id":72,"type":"timeseries","title":"Scheduler — attempts per second by result",
"description":"Metric 15 — scheduler_schedule_attempts_total. 'unschedulable' = no node meets the pod's requirements (resources, taints, selectors); 'error' = a bug or stale cache in the scheduler.",
"description":"Metric 15 — scheduler attempt duration. The PDF's scheduler_e2e_scheduling_duration_seconds was removed in Kubernetes 1.23; the modern equivalent is scheduler_scheduling_attempt_duration_seconds (time from picking a pod off the queue to binding it). A rising p99 often correlates with an overloaded apiserver or large, highly-constrained pod fleets.",