Compare commits

..

19 Commits

Author SHA1 Message Date
b1ff4e4a0f feat: add json versions of dashboards
Some checks failed
Run Check Script / check (pull_request) Failing after 30s
2026-04-03 14:08:24 -04:00
ee8f033143 feat: add the route creation to the score
Some checks failed
Run Check Script / check (pull_request) Failing after 16s
2026-03-15 18:00:26 -04:00
1298ac9a18 feat: all 8 dashboards now available 2026-03-15 17:34:22 -04:00
53e361e84e feat: create a dashboard. needs refactoring for multiple dashboards 2026-03-15 16:42:40 -04:00
220e0c2bb8 feat: add creation of the prometheus datasource 2026-03-15 14:34:29 -04:00
82e47d22a2 feat: add creation of the secret with the token to access prometheus datasource 2026-03-15 14:01:55 -04:00
fb17d7ed40 feat: add creation of grafana crd 2026-03-15 13:43:17 -04:00
d4bf80779e chore: refactor: extracting stuff into functions 2026-03-14 14:29:31 -04:00
28dadf3a70 feat: starting to implement score for cluster_dashboards 2026-03-14 14:07:47 -04:00
15c454aa65 feat: creating dashboards 2026-03-14 12:02:35 -04:00
f9a3e51529 wip: working on cluster monitoring dashboard 2026-03-13 16:46:39 -04:00
d10598d01e Merge pull request 'okdload balancer using 1936 port http healthcheck' (#240) from feat/okd_loadbalancer_betterhealthcheck into master
Some checks failed
Run Check Script / check (push) Successful in 1m26s
Compile and package harmony_composer / package_harmony_composer (push) Failing after 1m56s
Reviewed-on: #240
2026-03-10 17:45:51 +00:00
61ba7257d0 fix: remove broken test
All checks were successful
Run Check Script / check (pull_request) Successful in 1m22s
2026-03-10 13:40:24 -04:00
b0e9594d92 Merge branch 'master' into feat/okd_loadbalancer_betterhealthcheck
Some checks failed
Run Check Script / check (pull_request) Failing after 47s
2026-03-07 23:06:50 +00:00
bfb86f63ce fix: xml field for vlan
All checks were successful
Run Check Script / check (pull_request) Successful in 1m31s
2026-03-07 11:29:44 -05:00
d920de34cf fix: configure health_check: None for public_services
All checks were successful
Run Check Script / check (pull_request) Successful in 1m35s
2026-03-05 14:55:00 -05:00
4276b9137b fix: put the hc on private_services, not public_services
All checks were successful
Run Check Script / check (pull_request) Successful in 1m32s
2026-03-05 14:35:33 -05:00
6ab88ab8d9 Merge branch 'master' into feat/okd_loadbalancer_betterhealthcheck 2026-03-04 10:46:57 -05:00
53d0704a35 wip: okdload balancer using 1936 port http healthcheck
Some checks failed
Run Check Script / check (pull_request) Failing after 31s
2026-03-02 20:47:41 -05:00
163 changed files with 16515 additions and 5673 deletions

View File

@@ -1,318 +0,0 @@
# Architecture Decision Record: Monitoring and Alerting Architecture
Initial Author: Willem Rolleman, Jean-Gabriel Carrier
Initial Date: March 9, 2026
Last Updated Date: March 9, 2026
## Status
Accepted
Supersedes: [ADR-010](010-monitoring-and-alerting.md)
## Context
Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:
1. **Cluster-level monitoring**: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.
2. **Tenant-level monitoring**: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.
3. **Application-level monitoring**: Developers deploying applications want zero-config monitoring that "just works" for their services.
The monitoring landscape is fragmented:
- **OKD/OpenShift**: Built-in Prometheus with AlertmanagerConfig CRDs
- **KubePrometheus**: Helm-based stack with PrometheusRule CRDs
- **RHOB (Red Hat Observability)**: Operator-based with MonitoringStack CRDs
- **Standalone Prometheus**: Raw Prometheus deployments
Each system has different CRDs, different installation methods, and different configuration APIs.
## Decision
We implement a **trait-based architecture with compile-time capability verification** that provides:
1. **Type-safe abstractions** via parameterized traits: `AlertReceiver<S>`, `AlertRule<S>`, `ScrapeTarget<S>`
2. **Compile-time topology compatibility** via the `Observability<S>` capability bound
3. **Three levels of abstraction**: Cluster, Tenant, and Application monitoring
4. **Pre-built alert rules** as functions that return typed structs
### Core Traits
```rust
// domain/topology/monitoring.rs
/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory)
        -> Result<PreparationOutcome, PreparationError>;
    async fn install_receivers(&self, sender: &S, inventory: &Inventory,
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory)
        -> Result<...>;
}
```
### Alert Sender Types
Each monitoring stack is a distinct `AlertSender`:
| Sender | Module | Use Case |
|--------|--------|----------|
| `OpenshiftClusterAlertSender` | `monitoring/okd/` | OKD/OpenShift built-in monitoring |
| `KubePrometheus` | `monitoring/kube_prometheus/` | Helm-deployed kube-prometheus-stack |
| `Prometheus` | `monitoring/prometheus/` | Standalone Prometheus via Helm |
| `RedHatClusterObservability` | `monitoring/red_hat_cluster_observability/` | RHOB operator |
| `Grafana` | `monitoring/grafana/` | Grafana-managed alerting |
### Three Levels of Monitoring
#### 1. Cluster-Level Monitoring
For cluster administrators. Full control over monitoring infrastructure.
```rust
// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}
```
**Characteristics:**
- Cluster-scoped CRDs and resources
- Can add external scrape targets (outside cluster)
- Manages Alertmanager configuration
- Requires cluster-admin privileges
#### 2. Tenant-Level Monitoring
For teams confined to namespaces. The topology determines tenant context.
```rust
// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "default".to_string());
        // Install rules in tenant namespace
    }
}
```
**Characteristics:**
- Namespace-scoped resources
- Cannot modify cluster-level monitoring config
- May have restricted receiver types
- Runtime validation of permissions (cannot be fully compile-time)
#### 3. Application-Level Monitoring
For developers. Zero-config, opinionated monitoring.
```rust
// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}
```
**Characteristics:**
- Automatic ServiceMonitor creation
- Opinionated notification channel (Ntfy)
- Tenant-aware via topology
- Minimal configuration required
## Rationale
### Why Generic Traits Instead of Unified Types?
Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:
```rust
// OKD uses AlertmanagerConfig with different structure
AlertmanagerConfig { spec: { receivers: [...] } }
// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }
// KubePrometheus uses Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }
```
A unified type would either:
1. Be a lowest-common-denominator (loses stack-specific features)
2. Be a complex union type (hard to use, easy to misconfigure)
Generic traits let each stack express its configuration naturally while providing a consistent interface.
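This per-sender specialization can be sketched in a few lines. The types below (`OkdSender`, `RhobSender`, a `String`-returning `build`) are hypothetical simplifications of the real traits, kept dependency-free to show the shape of the pattern: one receiver type, one impl per sender, each emitting its stack's configuration style.

```rust
use std::fmt::Debug;

// Hypothetical stand-ins: each AlertReceiver<S> impl emits the shape its sender expects.
pub trait AlertSender: Send + Sync + Debug {}

#[derive(Debug)]
pub struct OkdSender;
impl AlertSender for OkdSender {}

#[derive(Debug)]
pub struct RhobSender;
impl AlertSender for RhobSender {}

pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> String; // simplified: the real trait returns an install plan
}

pub struct DiscordReceiver {
    pub webhook_url: String,
}

// OKD-style: webhook URL inline in the AlertmanagerConfig.
impl AlertReceiver<OkdSender> for DiscordReceiver {
    fn build(&self) -> String {
        format!("AlertmanagerConfig: url={}", self.webhook_url)
    }
}

// RHOB-style: webhook URL referenced through a secret key.
impl AlertReceiver<RhobSender> for DiscordReceiver {
    fn build(&self) -> String {
        "MonitoringStack: apiURL from secret key 'webhook-url'".to_string()
    }
}

fn main() {
    let r = DiscordReceiver { webhook_url: "https://discord.example/hook".to_string() };
    // Fully-qualified calls select the per-sender specialization.
    println!("{}", <DiscordReceiver as AlertReceiver<OkdSender>>::build(&r));
    println!("{}", <DiscordReceiver as AlertReceiver<RhobSender>>::build(&r));
}
```

The same `DiscordReceiver` value produces two different configurations depending on which sender's impl is selected, without any runtime dispatch on a "stack kind" enum.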
### Why Compile-Time Capability Bounds?
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore { ... }
```
This fails at compile time if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
### Why Not a MonitoringStack Abstraction (V2 Approach)?
The V2 approach proposed a unified `MonitoringStack` that hides sender selection:
```rust
// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)
```
**Problems:**
1. Hides which sender you're using, losing compile-time guarantees
2. "Version selection" actually chooses between fundamentally different systems
3. Would need to handle all stack-specific features through a generic interface
The current approach is explicit: you choose `OpenshiftClusterAlertSender` and the compiler verifies compatibility.
### Why Runtime Validation for Tenants?
Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.
Options considered:
1. **Compile-time tenant markers** - Would require modeling entire RBAC hierarchy in types. Over-engineering.
2. **Runtime validation** - Current approach. Fails with clear K8s permission errors if insufficient access.
3. **No tenant support** - Would exclude a major use case.
Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.
> Note: we will eventually have compile-time validation for such things. Rust macros are powerful, and we could discover the actual capabilities we're dealing with, similar to the approach sqlx takes in its `query!` macro.
## Consequences
### Pros
1. **Type Safety**: Invalid configurations are caught at compile time
2. **Extensibility**: Adding a new monitoring stack requires implementing traits, not modifying core code
3. **Clear Separation**: Cluster/Tenant/Application levels have distinct entry points
4. **Reusable Rules**: Pre-built alert rules as functions (`high_pvc_fill_rate_over_two_days()`)
5. **CRD Accuracy**: Type definitions match actual Kubernetes CRDs exactly
### Cons
1. **Implementation Explosion**: `DiscordReceiver` implements `AlertReceiver<S>` for each sender type (3+ implementations)
2. **Learning Curve**: Understanding the trait hierarchy takes time
3. **clone_box Boilerplate**: Required for trait object cloning (3 lines per impl)
### Mitigations
- Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
- Learning curve is documented with examples at each level
- clone_box boilerplate is minimal and copy-paste
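The `clone_box` pattern mentioned above can be shown in a self-contained sketch. The types here are hypothetical stand-ins for Harmony's real traits; the point is that a single blanket `Clone` impl, written once, turns the three-line `clone_box` boilerplate into cloneable trait objects for every sender:

```rust
use std::fmt::Debug;

// Hypothetical stand-ins for Harmony's real types, for illustration only.
pub trait AlertSender: Send + Sync + Debug {}

#[derive(Debug)]
pub struct Prometheus;
impl AlertSender for Prometheus {}

pub trait AlertReceiver<S: AlertSender>: Debug + Send + Sync {
    fn name(&self) -> String;
    // The "3 lines per impl" of boilerplate this ADR refers to:
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

// Written once, this makes Box<dyn AlertReceiver<S>> Clone for every sender.
impl<S: AlertSender> Clone for Box<dyn AlertReceiver<S>> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

#[derive(Debug, Clone)]
pub struct DiscordReceiver {
    pub name: String,
}

impl AlertReceiver<Prometheus> for DiscordReceiver {
    fn name(&self) -> String {
        self.name.clone()
    }
    fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
        Box::new(self.clone())
    }
}

fn main() {
    let receiver: Box<dyn AlertReceiver<Prometheus>> =
        Box::new(DiscordReceiver { name: "critical-alerts".to_string() });
    // Cloning the trait object now works through clone_box.
    let copy = receiver.clone();
    println!("{}", copy.name());
}
```

This works because `Box` is a fundamental type, so a crate may implement the foreign `Clone` trait for boxes of its own trait objects.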
## Alternatives Considered
### Unified MonitoringStack Type
See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.
### Helm-Only Approach
Use `HelmScore` directly for each monitoring deployment. Rejected because:
- No type safety for alert rules
- Cannot compose with application features
- No tenant awareness
### Separate Modules Per Use Case
Have `cluster_monitoring/`, `tenant_monitoring/`, `app_monitoring/` as separate modules. Rejected because:
- Massive code duplication
- No shared abstraction for receivers/rules
- Adding a feature requires three implementations
## Implementation Notes
### Module Structure
```
modules/monitoring/
├── mod.rs # Public exports
├── alert_channel/ # Receivers (Discord, Webhook)
├── alert_rule/ # Rules and pre-built alerts
│ ├── prometheus_alert_rule.rs
│ └── alerts/ # Library of pre-built rules
│ ├── k8s/ # K8s-specific (pvc, pod, memory)
│ └── infra/ # Infrastructure (opnsense, dell)
├── okd/ # OpenshiftClusterAlertSender
├── kube_prometheus/ # KubePrometheus
├── prometheus/ # Prometheus
├── red_hat_cluster_observability/ # RHOB
├── grafana/ # Grafana
├── application_monitoring/ # Application-level scores
└── scrape_target/ # External scrape targets
```
### Adding a New Alert Sender
1. Create sender type: `pub struct MySender; impl AlertSender for MySender { ... }`
2. Implement `Observability<MySender>` for topologies that support it
3. Create CRD types in `crd/` subdirectory
4. Implement `AlertReceiver<MySender>` for existing receivers
5. Implement `AlertRule<MySender>` for `AlertManagerRuleGroup`
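Steps 1 and 5 can be compressed into a sketch. `MySender` and `UpRule` are hypothetical, and `build_rule` is simplified to return a `String` here; the real trait returns `serde_json::Value` and lives in `domain/topology/monitoring.rs`:

```rust
use std::fmt::Debug;

// Step 1: a hypothetical new sender type.
pub trait AlertSender: Send + Sync + Debug {
    fn name(&self) -> String;
}

#[derive(Debug)]
pub struct MySender;

impl AlertSender for MySender {
    fn name(&self) -> String {
        "MySender".to_string()
    }
}

// Simplified AlertRule: the real trait returns serde_json::Value.
pub trait AlertRule<S: AlertSender>: Debug + Send + Sync {
    fn build_rule(&self) -> String;
    fn name(&self) -> String;
}

#[derive(Debug)]
pub struct UpRule;

// Step 5: emit the stack-specific rule document for MySender.
impl AlertRule<MySender> for UpRule {
    fn build_rule(&self) -> String {
        r#"{"alert":"InstanceDown","expr":"up == 0","for":"5m"}"#.to_string()
    }
    fn name(&self) -> String {
        "instance-down".to_string()
    }
}

fn main() {
    let rule: Box<dyn AlertRule<MySender>> = Box::new(UpRule);
    println!("{}: {}", rule.name(), rule.build_rule());
}
```

Steps 2-4 follow the same pattern against the real `Observability<MySender>` and `AlertReceiver<MySender>` traits.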
### Adding a New Alert Rule
```rust
pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyAlert", "up == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "Service is down")
}
```
No trait implementation needed - `AlertManagerRuleGroup` already handles conversion.
## Related ADRs
- [ADR-013](013-monitoring-notifications.md): Notification channel selection (ntfy)
- [ADR-011](011-multi-tenant-cluster.md): Multi-tenant cluster architecture

View File

@@ -1,21 +0,0 @@
[package]
name = "example-monitoring-v2"
edition = "2024"
version.workspace = true
readme.workspace = true
license.workspace = true
[dependencies]
harmony = { path = "../../harmony" }
harmony_cli = { path = "../../harmony_cli" }
harmony-k8s = { path = "../../harmony-k8s" }
harmony_types = { path = "../../harmony_types" }
kube = { workspace = true }
schemars = "0.8"
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }
serde_yaml = { workspace = true }
url = { workspace = true }
log = { workspace = true }
async-trait = { workspace = true }
k8s-openapi = { workspace = true }

View File

@@ -1,91 +0,0 @@
# Monitoring v2 - Improved Architecture
This example demonstrates the improved monitoring architecture that addresses the "WTF/minute" issues in the original design.
## Key Improvements
### 1. **Single AlertChannel Trait with Generic Sender**
The original design required 9-12 implementations for each alert channel (Discord, Webhook, etc.) - one for each sender type. The new design uses a single trait with generic sender parameterization:
```rust
pub trait AlertChannel<Sender: AlertSender> {
    async fn install_config(&self, sender: &Sender) -> Result<Outcome, InterpretError>;
    fn name(&self) -> String;
    fn as_any(&self) -> &dyn std::any::Any;
}
```
**Benefits:**
- One Discord implementation works with all sender types
- Type safety at compile time
- No runtime dispatch overhead
### 2. **MonitoringStack Abstraction**
Instead of manually selecting CRDPrometheus vs KubePrometheus vs RHOBObservability, you now have a unified MonitoringStack that handles versioning:
```rust
let monitoring_stack = MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .set_namespace("monitoring")
    .add_alert_channel(discord_receiver)
    .set_scrape_targets(vec![...]);
```
**Benefits:**
- Single source of truth for monitoring configuration
- Easy to switch between monitoring versions
- Automatic version-specific configuration
### 3. **TenantMonitoringScore - True Composition**
The original monitoring_with_tenant example just put tenant and monitoring as separate items in a vec. The new design truly composes them:
```rust
let tenant_score = TenantMonitoringScore::new("test-tenant", monitoring_stack);
```
This creates a single score that:
- Has tenant context
- Has monitoring configuration
- Automatically installs monitoring scoped to tenant namespace
**Benefits:**
- No more "two separate things" confusion
- Automatic tenant namespace scoping
- Clear ownership: tenant owns its monitoring
### 4. **Versioned Monitoring APIs**
Clear versioning makes it obvious which monitoring stack you're using:
```rust
pub enum MonitoringApiVersion {
    V1Helm, // Old Helm charts
    V2CRD,  // Current CRDs
    V3RHOB, // RHOB (future)
}
```
**Benefits:**
- No guessing which API version you're using
- Easy to migrate between versions
- Backward compatibility path
## Comparison
### Original Design (monitoring_with_tenant)
- Manual selection of each component
- Manual installation of both components
- Need to remember to pass both to harmony_cli::run
- Monitoring not scoped to tenant automatically
### New Design (monitoring_v2)
- Single composed score
- One score does it all
## Usage
```sh
cd examples/monitoring_v2
cargo run
```
## Migration Path
To migrate from the old design to the new:
1. Replace individual alert channel implementations with AlertChannel<Sender>
2. Use MonitoringStack instead of manual *Prometheus selection
3. Use TenantMonitoringScore instead of separate TenantScore + monitoring scores
4. Select monitoring version via MonitoringApiVersion

View File

@@ -1,343 +0,0 @@
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use log::debug;
use serde::{Deserialize, Serialize};
use serde_yaml::{Mapping, Value};
use harmony::data::Version;
use harmony::interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome};
use harmony::inventory::Inventory;
use harmony::score::Score;
use harmony::topology::{Topology, tenant::TenantManager};
use harmony_k8s::K8sClient;
use harmony_types::k8s_name::K8sName;
use harmony_types::net::Url;
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
    fn namespace(&self) -> String;
}

#[derive(Debug)]
pub struct CRDPrometheus {
    pub namespace: String,
    pub client: Arc<K8sClient>,
}

impl AlertSender for CRDPrometheus {
    fn name(&self) -> String {
        "CRDPrometheus".to_string()
    }

    fn namespace(&self) -> String {
        self.namespace.clone()
    }
}

#[derive(Debug)]
pub struct RHOBObservability {
    pub namespace: String,
    pub client: Arc<K8sClient>,
}

impl AlertSender for RHOBObservability {
    fn name(&self) -> String {
        "RHOBObservability".to_string()
    }

    fn namespace(&self) -> String {
        self.namespace.clone()
    }
}

#[derive(Debug)]
pub struct KubePrometheus {
    pub config: Arc<Mutex<KubePrometheusConfig>>,
}

impl Default for KubePrometheus {
    fn default() -> Self {
        Self::new()
    }
}

impl KubePrometheus {
    pub fn new() -> Self {
        Self {
            config: Arc::new(Mutex::new(KubePrometheusConfig::new())),
        }
    }
}

impl AlertSender for KubePrometheus {
    fn name(&self) -> String {
        "KubePrometheus".to_string()
    }

    fn namespace(&self) -> String {
        self.config
            .lock()
            .unwrap()
            .namespace
            .clone()
            .unwrap_or_else(|| "monitoring".to_string())
    }
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct KubePrometheusConfig {
    pub namespace: Option<String>,
    #[serde(skip)]
    pub alert_receiver_configs: Vec<AlertManagerChannelConfig>,
}

impl KubePrometheusConfig {
    pub fn new() -> Self {
        Self {
            namespace: None,
            alert_receiver_configs: Vec::new(),
        }
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AlertManagerChannelConfig {
    pub channel_receiver: serde_yaml::Value,
    pub channel_route: serde_yaml::Value,
}

impl Default for AlertManagerChannelConfig {
    fn default() -> Self {
        Self {
            channel_receiver: serde_yaml::Value::Mapping(Default::default()),
            channel_route: serde_yaml::Value::Mapping(Default::default()),
        }
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ScrapeTargetConfig {
    pub service_name: String,
    pub port: String,
    pub path: String,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MonitoringApiVersion {
    V1Helm,
    V2CRD,
    V3RHOB,
}

#[derive(Debug, Clone)]
pub struct MonitoringStack {
    pub version: MonitoringApiVersion,
    pub namespace: String,
    pub alert_channels: Vec<Arc<dyn AlertSender>>,
    pub scrape_targets: Vec<ScrapeTargetConfig>,
}

impl MonitoringStack {
    pub fn new(version: MonitoringApiVersion) -> Self {
        Self {
            version,
            namespace: "monitoring".to_string(),
            alert_channels: Vec::new(),
            scrape_targets: Vec::new(),
        }
    }

    pub fn set_namespace(mut self, namespace: &str) -> Self {
        self.namespace = namespace.to_string();
        self
    }

    pub fn add_alert_channel(mut self, channel: impl AlertSender + 'static) -> Self {
        self.alert_channels.push(Arc::new(channel));
        self
    }

    pub fn set_scrape_targets(mut self, targets: Vec<(&str, &str, String)>) -> Self {
        self.scrape_targets = targets
            .into_iter()
            .map(|(name, port, path)| ScrapeTargetConfig {
                service_name: name.to_string(),
                port: port.to_string(),
                path,
            })
            .collect();
        self
    }
}

pub trait AlertChannel<Sender: AlertSender> {
    fn install_config(&self, sender: &Sender);
    fn name(&self) -> String;
}
#[derive(Debug, Clone)]
pub struct DiscordWebhook {
    pub name: K8sName,
    pub url: Url,
    pub selectors: Vec<HashMap<String, String>>,
}

impl DiscordWebhook {
    fn get_config(&self) -> AlertManagerChannelConfig {
        let mut route = Mapping::new();
        route.insert(
            Value::String("receiver".to_string()),
            Value::String(self.name.to_string()),
        );
        route.insert(
            Value::String("matchers".to_string()),
            Value::Sequence(vec![Value::String("alertname!=Watchdog".to_string())]),
        );

        let mut receiver = Mapping::new();
        receiver.insert(
            Value::String("name".to_string()),
            Value::String(self.name.to_string()),
        );

        let mut discord_config = Mapping::new();
        discord_config.insert(
            Value::String("webhook_url".to_string()),
            Value::String(self.url.to_string()),
        );
        receiver.insert(
            Value::String("discord_configs".to_string()),
            Value::Sequence(vec![Value::Mapping(discord_config)]),
        );

        AlertManagerChannelConfig {
            channel_receiver: Value::Mapping(receiver),
            channel_route: Value::Mapping(route),
        }
    }
}

impl AlertChannel<CRDPrometheus> for DiscordWebhook {
    fn install_config(&self, sender: &CRDPrometheus) {
        debug!("Installing Discord webhook for CRDPrometheus in namespace: {}", sender.namespace());
        debug!("Config: {:?}", self.get_config());
        debug!("Installed!");
    }

    fn name(&self) -> String {
        "discord-webhook".to_string()
    }
}
impl AlertChannel<RHOBObservability> for DiscordWebhook {
    fn install_config(&self, sender: &RHOBObservability) {
        debug!("Installing Discord webhook for RHOBObservability in namespace: {}", sender.namespace());
        debug!("Config: {:?}", self.get_config());
        debug!("Installed!");
    }

    fn name(&self) -> String {
        // Was "webhook-receiver"; aligned with the other DiscordWebhook impls.
        "discord-webhook".to_string()
    }
}

impl AlertChannel<KubePrometheus> for DiscordWebhook {
    fn install_config(&self, sender: &KubePrometheus) {
        debug!("Installing Discord webhook for KubePrometheus in namespace: {}", sender.namespace());
        // Lock the config mutex once: the original code took a second lock while
        // the first guard was still alive, which deadlocks a std::sync::Mutex.
        let mut config = sender.config.lock().unwrap();
        let ns = config.namespace.clone().unwrap_or_else(|| "monitoring".to_string());
        debug!("Namespace: {}", ns);
        config.alert_receiver_configs.push(self.get_config());
        debug!("Installed!");
    }

    fn name(&self) -> String {
        "discord-webhook".to_string()
    }
}
fn default_monitoring_stack() -> MonitoringStack {
    MonitoringStack::new(MonitoringApiVersion::V2CRD)
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TenantMonitoringScore {
    pub tenant_id: harmony_types::id::Id,
    pub tenant_name: String,
    #[serde(skip)]
    #[serde(default = "default_monitoring_stack")]
    pub monitoring_stack: MonitoringStack,
}

impl TenantMonitoringScore {
    pub fn new(tenant_name: &str, monitoring_stack: MonitoringStack) -> Self {
        Self {
            tenant_id: harmony_types::id::Id::default(),
            tenant_name: tenant_name.to_string(),
            monitoring_stack,
        }
    }
}

impl<T: Topology + TenantManager> Score<T> for TenantMonitoringScore {
    fn create_interpret(&self) -> Box<dyn Interpret<T>> {
        Box::new(TenantMonitoringInterpret {
            score: self.clone(),
        })
    }

    fn name(&self) -> String {
        format!("{} monitoring [TenantMonitoringScore]", self.tenant_name)
    }
}

#[derive(Debug)]
pub struct TenantMonitoringInterpret {
    pub score: TenantMonitoringScore,
}

#[async_trait::async_trait]
impl<T: Topology + TenantManager> Interpret<T> for TenantMonitoringInterpret {
    async fn execute(
        &self,
        _inventory: &Inventory,
        topology: &T,
    ) -> Result<Outcome, InterpretError> {
        let tenant_config = topology.get_tenant_config().await.unwrap();
        let tenant_ns = tenant_config.name.clone();
        match self.score.monitoring_stack.version {
            MonitoringApiVersion::V1Helm => {
                debug!("Installing Helm monitoring for tenant {}", tenant_ns);
            }
            MonitoringApiVersion::V2CRD => {
                debug!("Installing CRD monitoring for tenant {}", tenant_ns);
            }
            MonitoringApiVersion::V3RHOB => {
                debug!("Installing RHOB monitoring for tenant {}", tenant_ns);
            }
        }
        Ok(Outcome::success(format!(
            "Installed monitoring stack for tenant {} with version {:?}",
            self.score.tenant_name,
            self.score.monitoring_stack.version
        )))
    }

    fn get_name(&self) -> InterpretName {
        InterpretName::Custom("TenantMonitoringInterpret")
    }

    fn get_version(&self) -> Version {
        Version::from("1.0.0").unwrap()
    }

    fn get_status(&self) -> InterpretStatus {
        InterpretStatus::SUCCESS
    }

    fn get_children(&self) -> Vec<harmony_types::id::Id> {
        Vec::new()
    }
}

View File

@@ -31,16 +31,3 @@ Ready to build your own components? These guides show you how.
- [**Writing a Score**](./guides/writing-a-score.md): Learn how to create your own `Score` and `Interpret` logic to define a new desired state.
- [**Writing a Topology**](./guides/writing-a-topology.md): Learn how to model a new environment (like AWS, GCP, or custom hardware) as a `Topology`.
- [**Adding Capabilities**](./guides/adding-capabilities.md): See how to add a `Capability` to your custom `Topology`.
- [**Coding Guide**](./coding-guide.md): Conventions and best practices for writing Harmony code.
## 5. Module Documentation
Deep dives into specific Harmony modules and features.
- [**Monitoring and Alerting**](./monitoring.md): Comprehensive guide to cluster, tenant, and application-level monitoring with support for OKD, KubePrometheus, RHOB, and more.
## 6. Architecture Decision Records
Important architectural decisions are documented in the `adr/` directory:
- [Full ADR Index](../adr/)

View File

@@ -1,443 +0,0 @@
# Monitoring and Alerting in Harmony
Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.
## Overview
Harmony's monitoring module supports three distinct use cases:
| Level | Who Uses It | What It Provides |
|-------|-------------|------------------|
| **Cluster** | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| **Tenant** | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| **Application** | Application developers | Zero-config monitoring that "just works" |
Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.
## Core Concepts
### AlertSender
An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:
| Sender | Description | Use When |
|--------|-------------|----------|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |
### AlertReceiver
An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has different configuration formats.
```rust
pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
}
```
Built-in receivers:
- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks
### AlertRule
An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.
```rust
pub trait AlertRule<S: AlertSender> {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
}
```
### Observability Capability
Topologies implement `Observability<S>` to indicate they support a specific alert sender:
```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```
This provides **compile-time verification**: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.
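Reduced to a standalone sketch, the mechanism is an ordinary trait bound. The names below are illustrative stand-ins for Harmony's types: `deploy` plays the role of a `Score`'s interpret entry point, and only topologies implementing the capability satisfy its bound, so an unsupported topology is rejected before anything runs:

```rust
// Illustrative stand-ins for Harmony's capability-bound pattern.
trait AlertSender {}

struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {}

// The capability a topology must provide.
trait Observability<S: AlertSender> {
    fn describe(&self) -> String;
}

struct OkdTopology;

impl Observability<OpenshiftClusterAlertSender> for OkdTopology {
    fn describe(&self) -> String {
        "okd".to_string()
    }
}

// The Score's bound, reduced to a free function.
fn deploy<T: Observability<OpenshiftClusterAlertSender>>(topology: &T) -> String {
    topology.describe()
}

fn main() {
    println!("{}", deploy(&OkdTopology)); // compiles
    // Calling deploy with a topology lacking the impl fails to compile:
    // "the trait `Observability<OpenshiftClusterAlertSender>` is not implemented"
}
```

The error surfaces at the call site, at compile time, rather than as a runtime misconfiguration.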
---
## Level 1: Cluster Monitoring
Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:
- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)
### Example: OKD Cluster Alerts
```rust
use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{alerts::k8s::pvc::high_pvc_fill_rate_over_two_days, prometheus_alert_rule::AlertManagerRuleGroup},
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};

let severity_matcher = AlertMatcher {
    label: "severity".to_string(),
    operator: MatchOp::Eq,
    value: "critical".to_string(),
};

let rule_group = AlertManagerRuleGroup::new(
    "cluster-rules",
    vec![high_pvc_fill_rate_over_two_days()],
);

let external_exporter = PrometheusNodeExporter {
    job_name: "firewall".to_string(),
    metrics_path: "/metrics".to_string(),
    listen_address: ip!("192.168.1.1"),
    port: 9100,
    ..Default::default()
};

harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(OpenshiftClusterAlertScore {
        sender: OpenshiftClusterAlertSender,
        receivers: vec![Box::new(DiscordReceiver {
            name: "critical-alerts".to_string(),
            url: hurl!("https://discord.com/api/webhooks/..."),
            route: AlertRoute {
                matchers: vec![severity_matcher],
                ..AlertRoute::default("critical-alerts".to_string())
            },
        })],
        rules: vec![Box::new(rule_group)],
        scrape_targets: Some(vec![Box::new(external_exporter)]),
    })],
    None,
).await?;
```
### What This Does
1. **Enables cluster monitoring** - Activates OKD's built-in Prometheus
2. **Enables user workload monitoring** - Allows namespace-scoped rules
3. **Configures Alertmanager** - Adds Discord receiver with route matching
4. **Deploys alert rules** - Creates `AlertingRule` CRD with PVC fill rate alert
5. **Adds external scrape target** - Configures Prometheus to scrape the firewall
### Compile-Time Safety
The `OpenshiftClusterAlertScore` requires:
```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore
```
If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.
---
## Level 2: Tenant Monitoring
In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:
- Resources are deployed into the tenant's namespace
- Cluster-level monitoring configuration cannot be modified
- The topology determines the namespace context at runtime
### How It Works
The topology's `Observability` implementation handles tenant scoping:
```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_rules(&self, sender, inventory, rules) {
// Topology knows if it's tenant-scoped
let namespace = self.get_tenant_config().await
.map(|t| t.name)
.unwrap_or_else(|| "monitoring".to_string());
// Rules are installed in the appropriate namespace
for rule in rules.unwrap_or_default() {
let score = KubePrometheusRuleScore {
sender: sender.clone(),
rule,
namespace: namespace.clone(), // Tenant namespace
};
            score.create_interpret().execute(inventory, self).await?;
        }
        Ok(())
    }
}
```
### Tenant vs Cluster Resources
| Resource | Cluster-Level | Tenant-Level |
|----------|---------------|--------------|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |
### Runtime Validation
Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.
Tenant scoping cannot be fully verified at compile time: the tenant context depends on who runs the code and which permissions they hold, and that information is only available at runtime.
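As a hedged sketch, a tenant-scoped deployment can translate the API server's 403 into an actionable error instead of retrying. `DeployError` and `classify_api_error` are illustrative names, not harmony APIs:

```rust
// Illustrative only: how an RBAC denial from the Kubernetes API might be
// surfaced as a clear, typed error in a tenant-scoped deployment.
#[derive(Debug)]
enum DeployError {
    PermissionDenied(String),
    Other(String),
}

fn classify_api_error(http_status: u16, reason: &str) -> DeployError {
    match http_status {
        // Kubernetes returns HTTP 403 when RBAC denies the request
        403 => DeployError::PermissionDenied(format!(
            "tenant lacks permission for a cluster-level operation: {reason}"
        )),
        _ => DeployError::Other(reason.to_string()),
    }
}

fn main() {
    let err = classify_api_error(
        403,
        "prometheusrules.monitoring.coreos.com is forbidden at cluster scope",
    );
    // The deployment reports the denial instead of failing opaquely
    println!("{err:?}");
}
```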
---
## Level 3: Application Monitoring
Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.
### Example
```rust
use std::sync::Arc;

use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};
// Define your application (wrapped in Arc so it can be shared with the feature)
let my_app = Arc::new(MyApplication::new());
// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};
// Install with the application
my_app.add_feature(monitoring);
```
### What Application Monitoring Provides
1. **Automatic ServiceMonitor** - Creates a ServiceMonitor for your application's pods
2. **Ntfy Notification Channel** - Auto-installs and configures Ntfy for push notifications
3. **Tenant Awareness** - Automatically scopes to the correct namespace
4. **Sensible Defaults** - Pre-configured alert routes and receivers
### Under the Hood
```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
ApplicationFeature<T> for Monitoring
{
async fn ensure_installed(&self, topology: &T) -> Result<...> {
// 1. Get tenant namespace (or use app name)
let namespace = topology.get_tenant_config().await
.map(|ns| ns.name.clone())
.unwrap_or_else(|| self.application.name());
// 2. Create ServiceMonitor for the app
let app_service_monitor = ServiceMonitor {
metadata: ObjectMeta {
name: Some(self.application.name()),
namespace: Some(namespace.clone()),
..Default::default()
},
spec: ServiceMonitorSpec::default(),
};
        // 3. Install Ntfy for notifications (host resolution elided here)
        let ntfy = NtfyScore { namespace, host };
ntfy.interpret(&Inventory::empty(), topology).await?;
// 4. Wire up webhook receiver to Ntfy
let ntfy_receiver = WebhookReceiver { ... };
// 5. Execute monitoring score
alerting_score.interpret(&Inventory::empty(), topology).await?;
}
}
```
---
## Pre-Built Alert Rules
Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:
### Kubernetes Alerts (`alerts/k8s/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::k8s::{
pod::pod_failed,
pvc::high_pvc_fill_rate_over_two_days,
memory_usage::alert_high_memory_usage,
};
let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
pod_failed(),
high_pvc_fill_rate_over_two_days(),
alert_high_memory_usage(),
]);
```
Available rules:
- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold
### Infrastructure Alerts (`alerts/infra/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::infra::opnsense::high_http_error_rate;
let rules = AlertManagerRuleGroup::new("infra-rules", vec![
high_http_error_rate(),
]);
```
### Creating Custom Rules
```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
pub fn my_custom_alert() -> PrometheusAlertRule {
PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
.for_duration("5m")
.label("severity", "critical")
.annotation("summary", "My service is down")
.annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
---
## Alert Receivers
### Discord Webhook
```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};
let discord = DiscordReceiver {
name: "ops-alerts".to_string(),
url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
route: AlertRoute {
receiver: "ops-alerts".to_string(),
matchers: vec![AlertMatcher {
label: "severity".to_string(),
operator: MatchOp::Eq,
value: "critical".to_string(),
}],
group_by: vec!["alertname".to_string()],
repeat_interval: Some("30m".to_string()),
continue_matching: false,
children: vec![],
},
};
```
### Generic Webhook
```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;
let webhook = WebhookReceiver {
name: "custom-webhook".to_string(),
url: hurl!("https://api.example.com/alerts"),
route: AlertRoute::default("custom-webhook".to_string()),
};
```
---
## Adding a New Monitoring Stack
To add support for a new monitoring stack:
1. **Create the sender type** in `modules/monitoring/my_sender/mod.rs`:
```rust
#[derive(Debug, Clone)]
pub struct MySender;
impl AlertSender for MySender {
fn name(&self) -> String { "MySender".to_string() }
}
```
2. **Define CRD types** in `modules/monitoring/my_sender/crd/`:
```rust
#[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
#[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
pub struct MyAlertRuleSpec { ... }
```
3. **Implement Observability** in `domain/topology/k8s_anywhere/observability/my_sender.rs`:
```rust
impl Observability<MySender> for K8sAnywhereTopology {
async fn install_receivers(&self, sender, inventory, receivers) { ... }
async fn install_rules(&self, sender, inventory, rules) { ... }
// ...
}
```
4. **Implement receiver conversions** for existing receivers:
```rust
impl AlertReceiver<MySender> for DiscordReceiver {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
// Convert DiscordReceiver to MySender's format
}
}
```
5. **Create score types**:
```rust
pub struct MySenderAlertScore {
pub sender: MySender,
pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
pub rules: Vec<Box<dyn AlertRule<MySender>>>,
}
```
---
## Architecture Principles
### Type Safety Over Flexibility
Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified "MonitoringStack" type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.
### Compile-Time Capability Verification
The `Observability<S>` bound ensures you can't deploy OKD alerts to a KubePrometheus cluster. The compiler catches platform mismatches before deployment.
### Explicit Over Implicit
Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.
### Three Levels, One Foundation
Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.
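The shared foundation can be sketched with simplified stand-ins (not the real API): one receiver type satisfies the same trait for every score that shares its sender type parameter.

```rust
// Simplified stand-ins showing one receiver trait serving multiple scores.
// Names mirror the doc's concepts but are not the actual harmony types.
trait AlertReceiver<S> {
    fn receiver_name(&self) -> String;
}

struct KubePrometheus; // the chosen monitoring stack (sender type)

struct DiscordReceiver {
    name: String,
}
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn receiver_name(&self) -> String {
        self.name.clone()
    }
}

// Cluster- and application-level scores both accept the same trait objects:
struct ClusterAlertScore<S> {
    receivers: Vec<Box<dyn AlertReceiver<S>>>,
}
struct AppMonitoring<S> {
    receivers: Vec<Box<dyn AlertReceiver<S>>>,
}

fn main() {
    let cluster = ClusterAlertScore::<KubePrometheus> {
        receivers: vec![Box::new(DiscordReceiver { name: "ops".into() })],
    };
    let app = AppMonitoring::<KubePrometheus> {
        receivers: vec![Box::new(DiscordReceiver { name: "ops".into() })],
    };
    // The same receiver value could back either score
    println!(
        "{} / {}",
        cluster.receivers[0].receiver_name(),
        app.receivers[0].receiver_name()
    );
}
```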
---
## Related Documentation
- [ADR-020: Monitoring and Alerting Architecture](../adr/020-monitoring-alerting-architecture.md)
- [ADR-013: Monitoring Notifications (ntfy)](../adr/013-monitoring-notifications.md)
- [ADR-011: Multi-Tenant Cluster Architecture](../adr/011-multi-tenant-cluster.md)
- [Coding Guide](coding-guide.md)
- [Core Concepts](concepts.md)


@@ -7,7 +7,7 @@ use harmony::{
         monitoring::alert_channel::webhook_receiver::WebhookReceiver,
         tenant::TenantScore,
     },
-    topology::{K8sAnywhereTopology, monitoring::AlertRoute, tenant::TenantConfig},
+    topology::{K8sAnywhereTopology, tenant::TenantConfig},
 };
 use harmony_types::id::Id;
 use harmony_types::net::Url;
@@ -33,14 +33,9 @@ async fn main() {
         service_port: 3000,
     });
-    let receiver_name = "sample-webhook-receiver".to_string();
     let webhook_receiver = WebhookReceiver {
-        name: receiver_name.clone(),
+        name: "sample-webhook-receiver".to_string(),
         url: Url::Url(url::Url::parse("https://webhook-doesnt-exist.com").unwrap()),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
     };
     let app = ApplicationScore {


@@ -0,0 +1,16 @@
[workspace]
[package]
name = "example-cluster-dashboards"
edition = "2021"
version = "0.1.0"
license = "GNU AGPL v3"
publish = false
[dependencies]
harmony = { path = "../../harmony" }
harmony_cli = { path = "../../harmony_cli" }
harmony_types = { path = "../../harmony_types" }
tokio = { version = "1.40", features = ["macros", "rt-multi-thread"] }
log = "0.4"
env_logger = "0.11"


@@ -0,0 +1,21 @@
use harmony::{
inventory::Inventory,
modules::monitoring::cluster_dashboards::ClusterDashboardsScore,
topology::K8sAnywhereTopology,
};
#[tokio::main]
async fn main() {
harmony_cli::cli_logger::init();
let cluster_dashboards_score = ClusterDashboardsScore::default();
harmony_cli::run(
Inventory::autoload(),
K8sAnywhereTopology::from_env(),
vec![Box::new(cluster_dashboards_score)],
None,
)
.await
.unwrap();
}


@@ -1,45 +1,37 @@
use std::{
collections::HashMap,
sync::{Arc, Mutex},
};
use std::collections::HashMap;
use harmony::{
inventory::Inventory,
modules::monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{
alerts::{
infra::dell_server::{
alert_global_storage_status_critical,
alert_global_storage_status_non_recoverable,
global_storage_status_degraded_non_critical,
modules::{
monitoring::{
alert_channel::discord_alert_channel::DiscordWebhook,
alert_rule::prometheus_alert_rule::AlertManagerRuleGroup,
kube_prometheus::{
helm_prometheus_alert_score::HelmPrometheusAlertingScore,
types::{
HTTPScheme, MatchExpression, Operator, Selector, ServiceMonitor,
ServiceMonitorEndpoint,
},
k8s::pvc::high_pvc_fill_rate_over_two_days,
},
prometheus_alert_rule::AlertManagerRuleGroup,
},
kube_prometheus::{
helm::config::KubePrometheusConfig,
kube_prometheus_alerting_score::KubePrometheusAlertingScore,
types::{
HTTPScheme, MatchExpression, Operator, Selector, ServiceMonitor,
ServiceMonitorEndpoint,
prometheus::alerts::{
infra::dell_server::{
alert_global_storage_status_critical, alert_global_storage_status_non_recoverable,
global_storage_status_degraded_non_critical,
},
k8s::pvc::high_pvc_fill_rate_over_two_days,
},
},
topology::{K8sAnywhereTopology, monitoring::AlertRoute},
topology::K8sAnywhereTopology,
};
use harmony_types::{k8s_name::K8sName, net::Url};
 #[tokio::main]
 async fn main() {
-    let receiver_name = "test-discord".to_string();
-    let discord_receiver = DiscordReceiver {
-        name: receiver_name.clone(),
+    let discord_receiver = DiscordWebhook {
+        name: K8sName("test-discord".to_string()),
         url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
+        selectors: vec![],
     };
let high_pvc_fill_rate_over_two_days_alert = high_pvc_fill_rate_over_two_days();
@@ -78,15 +70,10 @@ async fn main() {
         endpoints: vec![service_monitor_endpoint],
         ..Default::default()
     };
-    let config = Arc::new(Mutex::new(KubePrometheusConfig::new()));
-    let alerting_score = KubePrometheusAlertingScore {
+    let alerting_score = HelmPrometheusAlertingScore {
         receivers: vec![Box::new(discord_receiver)],
         rules: vec![Box::new(additional_rules), Box::new(additional_rules2)],
         service_monitors: vec![service_monitor],
-        scrape_targets: None,
-        config,
     };
     harmony_cli::run(


@@ -1,32 +1,24 @@
use std::{
collections::HashMap,
str::FromStr,
sync::{Arc, Mutex},
};
use std::{collections::HashMap, str::FromStr};
use harmony::{
inventory::Inventory,
modules::{
monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{
alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
prometheus_alert_rule::AlertManagerRuleGroup,
},
alert_channel::discord_alert_channel::DiscordWebhook,
alert_rule::prometheus_alert_rule::AlertManagerRuleGroup,
kube_prometheus::{
helm::config::KubePrometheusConfig,
kube_prometheus_alerting_score::KubePrometheusAlertingScore,
helm_prometheus_alert_score::HelmPrometheusAlertingScore,
types::{
HTTPScheme, MatchExpression, Operator, Selector, ServiceMonitor,
ServiceMonitorEndpoint,
},
},
},
prometheus::alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
tenant::TenantScore,
},
topology::{
K8sAnywhereTopology,
monitoring::AlertRoute,
tenant::{ResourceLimits, TenantConfig, TenantNetworkPolicy},
},
};
@@ -50,13 +42,10 @@ async fn main() {
         },
     };
-    let receiver_name = "test-discord".to_string();
-    let discord_receiver = DiscordReceiver {
-        name: receiver_name.clone(),
+    let discord_receiver = DiscordWebhook {
+        name: K8sName("test-discord".to_string()),
         url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
+        selectors: vec![],
     };
let high_pvc_fill_rate_over_two_days_alert = high_pvc_fill_rate_over_two_days();
@@ -85,14 +74,10 @@ async fn main() {
         ..Default::default()
     };
-    let config = Arc::new(Mutex::new(KubePrometheusConfig::new()));
-    let alerting_score = KubePrometheusAlertingScore {
+    let alerting_score = HelmPrometheusAlertingScore {
         receivers: vec![Box::new(discord_receiver)],
         rules: vec![Box::new(additional_rules)],
         service_monitors: vec![service_monitor],
-        scrape_targets: None,
-        config,
     };
     harmony_cli::run(


@@ -1,64 +1,35 @@
use std::collections::HashMap;
use harmony::{
inventory::Inventory,
modules::monitoring::{
alert_channel::discord_alert_channel::DiscordReceiver,
alert_rule::{
alerts::{
infra::opnsense::high_http_error_rate, k8s::pvc::high_pvc_fill_rate_over_two_days,
},
prometheus_alert_rule::AlertManagerRuleGroup,
},
okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
},
topology::{
K8sAnywhereTopology,
monitoring::{AlertMatcher, AlertRoute, MatchOp},
alert_channel::discord_alert_channel::DiscordWebhook,
okd::cluster_monitoring::OpenshiftClusterAlertScore,
},
topology::K8sAnywhereTopology,
};
use harmony_macros::{hurl, ip};
use harmony_macros::hurl;
use harmony_types::k8s_name::K8sName;
 #[tokio::main]
 async fn main() {
-    let platform_matcher = AlertMatcher {
-        label: "prometheus".to_string(),
-        operator: MatchOp::Eq,
-        value: "openshift-monitoring/k8s".to_string(),
-    };
-    let severity = AlertMatcher {
-        label: "severity".to_string(),
-        operator: MatchOp::Eq,
-        value: "critical".to_string(),
-    };
     let high_http_error_rate = high_http_error_rate();
     let additional_rules = AlertManagerRuleGroup::new("test-rule", vec![high_http_error_rate]);
-    let scrape_target = PrometheusNodeExporter {
-        job_name: "firewall".to_string(),
-        metrics_path: "/metrics".to_string(),
-        listen_address: ip!("192.168.1.1"),
-        port: 9100,
-        ..Default::default()
-    };
+    let mut sel = HashMap::new();
+    sel.insert(
+        "openshift_io_alert_source".to_string(),
+        "platform".to_string(),
+    );
+    let mut sel2 = HashMap::new();
+    sel2.insert("openshift_io_alert_source".to_string(), "".to_string());
+    let selectors = vec![sel, sel2];
     harmony_cli::run(
         Inventory::autoload(),
         K8sAnywhereTopology::from_env(),
         vec![Box::new(OpenshiftClusterAlertScore {
-            receivers: vec![Box::new(DiscordReceiver {
-                name: "crit-wills-discord-channel-example".to_string(),
-                url: hurl!("https://test.io"),
-                route: AlertRoute {
-                    matchers: vec![severity],
-                    ..AlertRoute::default("crit-wills-discord-channel-example".to_string())
-                },
+            receivers: vec![Box::new(DiscordWebhook {
+                name: K8sName("wills-discord-webhook-example".to_string()),
+                url: hurl!("https://something.io"),
+                selectors: selectors,
             })],
-            sender: harmony::modules::monitoring::okd::OpenshiftClusterAlertSender,
             rules: vec![Box::new(additional_rules)],
-            scrape_targets: Some(vec![Box::new(scrape_target)]),
         })],
         None,
     )


@@ -6,7 +6,10 @@ use harmony::{
     data::{FileContent, FilePath},
     modules::{
         inventory::HarmonyDiscoveryStrategy,
-        okd::{installation::OKDInstallationPipeline, ipxe::OKDIpxeScore},
+        okd::{
+            installation::OKDInstallationPipeline, ipxe::OKDIpxeScore,
+            load_balancer::OKDLoadBalancerScore,
+        },
     },
     score::Score,
     topology::HAClusterTopology,
@@ -32,6 +35,7 @@ async fn main() {
     scores
         .append(&mut OKDInstallationPipeline::get_all_scores(HarmonyDiscoveryStrategy::MDNS).await);
+    scores.push(Box::new(OKDLoadBalancerScore::new(&topology)));
     harmony_cli::run(inventory, topology, scores, None)
         .await
         .unwrap();


@@ -6,9 +6,9 @@ use harmony::{
         application::{
             ApplicationScore, RustWebFramework, RustWebapp, features::rhob_monitoring::Monitoring,
         },
-        monitoring::alert_channel::discord_alert_channel::DiscordReceiver,
+        monitoring::alert_channel::discord_alert_channel::DiscordWebhook,
     },
-    topology::{K8sAnywhereTopology, monitoring::AlertRoute},
+    topology::K8sAnywhereTopology,
 };
 use harmony_types::{k8s_name::K8sName, net::Url};
@@ -22,21 +22,18 @@ async fn main() {
         service_port: 3000,
     });
-    let receiver_name = "test-discord".to_string();
-    let discord_receiver = DiscordReceiver {
-        name: receiver_name.clone(),
+    let discord_receiver = DiscordWebhook {
+        name: K8sName("test-discord".to_string()),
         url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
+        selectors: vec![],
     };
     let app = ApplicationScore {
         features: vec![
-            // Box::new(Monitoring {
-            //     application: application.clone(),
-            //     alert_receiver: vec![Box::new(discord_receiver)],
-            // }),
+            Box::new(Monitoring {
+                application: application.clone(),
+                alert_receiver: vec![Box::new(discord_receiver)],
+            }),
             // TODO add backups, multisite ha, etc
         ],
         application,


@@ -8,13 +8,13 @@ use harmony::{
             features::{Monitoring, PackagingDeployment},
         },
         monitoring::alert_channel::{
-            discord_alert_channel::DiscordReceiver, webhook_receiver::WebhookReceiver,
+            discord_alert_channel::DiscordWebhook, webhook_receiver::WebhookReceiver,
         },
     },
-    topology::{K8sAnywhereTopology, monitoring::AlertRoute},
+    topology::K8sAnywhereTopology,
 };
 use harmony_macros::hurl;
-use harmony_types::{k8s_name::K8sName, net::Url};
+use harmony_types::k8s_name::K8sName;
 #[tokio::main]
 async fn main() {
@@ -26,23 +26,15 @@ async fn main() {
         service_port: 3000,
     });
-    let receiver_name = "test-discord".to_string();
-    let discord_receiver = DiscordReceiver {
-        name: receiver_name.clone(),
-        url: Url::Url(url::Url::parse("https://discord.doesnt.exist.com").unwrap()),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
+    let discord_receiver = DiscordWebhook {
+        name: K8sName("test-discord".to_string()),
+        url: hurl!("https://discord.doesnt.exist.com"),
+        selectors: vec![],
     };
-    let receiver_name = "sample-webhook-receiver".to_string();
     let webhook_receiver = WebhookReceiver {
-        name: receiver_name.clone(),
+        name: "sample-webhook-receiver".to_string(),
         url: hurl!("https://webhook-doesnt-exist.com"),
-        route: AlertRoute {
-            ..AlertRoute::default(receiver_name)
-        },
     };
     let app = ApplicationScore {
@@ -50,10 +42,10 @@ async fn main() {
             Box::new(PackagingDeployment {
                 application: application.clone(),
             }),
-            // Box::new(Monitoring {
-            //     application: application.clone(),
-            //     alert_receiver: vec![Box::new(discord_receiver), Box::new(webhook_receiver)],
-            // }),
+            Box::new(Monitoring {
+                application: application.clone(),
+                alert_receiver: vec![Box::new(discord_receiver), Box::new(webhook_receiver)],
+            }),
             // TODO add backups, multisite ha, etc
         ],
         application,


@@ -1,8 +1,11 @@
 use harmony::{
     inventory::Inventory,
-    modules::application::{
-        ApplicationScore, RustWebFramework, RustWebapp,
-        features::{Monitoring, PackagingDeployment},
+    modules::{
+        application::{
+            ApplicationScore, RustWebFramework, RustWebapp,
+            features::{Monitoring, PackagingDeployment},
+        },
+        monitoring::alert_channel::discord_alert_channel::DiscordWebhook,
     },
     topology::K8sAnywhereTopology,
 };
@@ -27,14 +30,14 @@ async fn main() {
             Box::new(PackagingDeployment {
                 application: application.clone(),
             }),
-            // Box::new(Monitoring {
-            //     application: application.clone(),
-            //     alert_receiver: vec![Box::new(DiscordWebhook {
-            //         name: K8sName("test-discord".to_string()),
-            //         url: hurl!("https://discord.doesnt.exist.com"),
-            //         selectors: vec![],
-            //     })],
-            // }),
+            Box::new(Monitoring {
+                application: application.clone(),
+                alert_receiver: vec![Box::new(DiscordWebhook {
+                    name: K8sName("test-discord".to_string()),
+                    url: hurl!("https://discord.doesnt.exist.com"),
+                    selectors: vec![],
+                })],
+            }),
         ],
         application,
     };


@@ -44,6 +44,7 @@ fn build_large_score() -> LoadBalancerScore {
         ],
         listening_port: SocketAddr::V4(SocketAddrV4::new(ipv4!("192.168.0.0"), 49387)),
         health_check: Some(HealthCheck::HTTP(
+            Some(1993),
             "/some_long_ass_path_to_see_how_it_is_displayed_but_it_has_to_be_even_longer"
                 .to_string(),
             HttpMethod::GET,


@@ -1,4 +1,4 @@
-use std::{collections::BTreeMap, process::Command, sync::Arc};
+use std::{collections::BTreeMap, process::Command, sync::Arc, time::Duration};
 use async_trait::async_trait;
 use base64::{Engine, engine::general_purpose};
@@ -8,7 +8,7 @@ use k8s_openapi::api::{
     core::v1::{Pod, Secret},
     rbac::v1::{ClusterRoleBinding, RoleRef, Subject},
 };
-use kube::api::{GroupVersionKind, ObjectMeta};
+use kube::api::{DynamicObject, GroupVersionKind, ObjectMeta};
 use log::{debug, info, trace, warn};
 use serde::Serialize;
 use tokio::sync::OnceCell;
@@ -29,7 +29,28 @@ use crate::{
score_cert_management::CertificateManagementScore,
},
k3d::K3DInstallationScore,
k8s::ingress::{K8sIngressScore, PathType},
monitoring::{
grafana::{grafana::Grafana, helm::helm_grafana::grafana_helm_chart_score},
kube_prometheus::crd::{
crd_alertmanager_config::CRDPrometheus,
crd_grafana::{
Grafana as GrafanaCRD, GrafanaCom, GrafanaDashboard,
GrafanaDashboardDatasource, GrafanaDashboardSpec, GrafanaDatasource,
GrafanaDatasourceConfig, GrafanaDatasourceJsonData,
GrafanaDatasourceSecureJsonData, GrafanaDatasourceSpec, GrafanaSpec,
},
crd_prometheuses::LabelSelector,
prometheus_operator::prometheus_operator_helm_chart_score,
rhob_alertmanager_config::RHOBObservability,
service_monitor::ServiceMonitor,
},
},
okd::{crd::ingresses_config::Ingress as IngressResource, route::OKDTlsPassthroughScore},
prometheus::{
k8s_prometheus_alerting_score::K8sPrometheusCRDAlertingScore,
prometheus::PrometheusMonitoring, rhob_alerting_score::RHOBAlertingScore,
},
},
score::Score,
topology::{TlsRoute, TlsRouter, ingress::Ingress},
@@ -38,6 +59,7 @@ use crate::{
use super::super::{
DeploymentTarget, HelmCommand, K8sclient, MultiTargetTopology, PreparationError,
PreparationOutcome, Topology,
oberservability::monitoring::AlertReceiver,
tenant::{
TenantConfig, TenantManager,
k8s::K8sTenantManager,
@@ -144,6 +166,216 @@ impl TlsRouter for K8sAnywhereTopology {
}
}
#[async_trait]
impl Grafana for K8sAnywhereTopology {
async fn ensure_grafana_operator(
&self,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
debug!("ensure grafana operator");
let client = self.k8s_client().await.unwrap();
let grafana_gvk = GroupVersionKind {
group: "grafana.integreatly.org".to_string(),
version: "v1beta1".to_string(),
kind: "Grafana".to_string(),
};
let name = "grafanas.grafana.integreatly.org";
let ns = "grafana";
let grafana_crd = client
.get_resource_json_value(name, Some(ns), &grafana_gvk)
.await;
match grafana_crd {
Ok(_) => {
return Ok(PreparationOutcome::Success {
details: "Found grafana CRDs in cluster".to_string(),
});
}
Err(_) => {
return self
.install_grafana_operator(inventory, Some("grafana"))
.await;
}
};
}
async fn install_grafana(&self) -> Result<PreparationOutcome, PreparationError> {
let ns = "grafana";
let mut label = BTreeMap::new();
label.insert("dashboards".to_string(), "grafana".to_string());
let label_selector = LabelSelector {
match_labels: label.clone(),
match_expressions: vec![],
};
let client = self.k8s_client().await?;
let grafana = self.build_grafana(ns, &label);
client.apply(&grafana, Some(ns)).await?;
//TODO change this to a ensure ready or something better than just a timeout
client
.wait_until_deployment_ready(
"grafana-grafana-deployment",
Some("grafana"),
Some(Duration::from_secs(30)),
)
.await?;
let sa_name = "grafana-grafana-sa";
let token_secret_name = "grafana-sa-token-secret";
let sa_token_secret = self.build_sa_token_secret(token_secret_name, sa_name, ns);
client.apply(&sa_token_secret, Some(ns)).await?;
let secret_gvk = GroupVersionKind {
group: "".to_string(),
version: "v1".to_string(),
kind: "Secret".to_string(),
};
let secret = client
.get_resource_json_value(token_secret_name, Some(ns), &secret_gvk)
.await?;
let token = format!(
"Bearer {}",
self.extract_and_normalize_token(&secret).unwrap()
);
debug!("creating grafana clusterrole binding");
let clusterrolebinding =
self.build_cluster_rolebinding(sa_name, "cluster-monitoring-view", ns);
client.apply(&clusterrolebinding, Some(ns)).await?;
debug!("creating grafana datasource crd");
let thanos_url = format!(
"https://{}",
self.get_domain("thanos-querier-openshift-monitoring")
.await
.unwrap()
);
let thanos_openshift_datasource = self.build_grafana_datasource(
"thanos-openshift-monitoring",
ns,
&label_selector,
&thanos_url,
&token,
);
client.apply(&thanos_openshift_datasource, Some(ns)).await?;
debug!("creating grafana dashboard crd");
let dashboard = self.build_grafana_dashboard(ns, &label_selector);
client.apply(&dashboard, Some(ns)).await?;
debug!("creating grafana ingress");
let grafana_ingress = self.build_grafana_ingress(ns).await;
grafana_ingress
.interpret(&Inventory::empty(), self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: "Installed grafana composants".to_string(),
})
}
}
#[async_trait]
impl PrometheusMonitoring<CRDPrometheus> for K8sAnywhereTopology {
async fn install_prometheus(
&self,
sender: &CRDPrometheus,
_inventory: &Inventory,
_receivers: Option<Vec<Box<dyn AlertReceiver<CRDPrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let client = self.k8s_client().await?;
for monitor in sender.service_monitor.iter() {
client
.apply(monitor, Some(&sender.namespace))
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
}
Ok(PreparationOutcome::Success {
details: "successfuly installed prometheus components".to_string(),
})
}
async fn ensure_prometheus_operator(
&self,
sender: &CRDPrometheus,
_inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let po_result = self.ensure_prometheus_operator(sender).await?;
match po_result {
PreparationOutcome::Success { details: _ } => {
debug!("Detected prometheus crds operator present in cluster.");
return Ok(po_result);
}
PreparationOutcome::Noop => {
debug!("Skipping Prometheus CR installation due to missing operator.");
return Ok(po_result);
}
}
}
}
#[async_trait]
impl PrometheusMonitoring<RHOBObservability> for K8sAnywhereTopology {
async fn install_prometheus(
&self,
sender: &RHOBObservability,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<RHOBObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let po_result = self.ensure_cluster_observability_operator(sender).await?;
if po_result == PreparationOutcome::Noop {
debug!("Skipping Prometheus CR installation due to missing operator.");
return Ok(po_result);
}
let result = self
.get_cluster_observability_operator_prometheus_application_score(
sender.clone(),
receivers,
)
.await
.interpret(inventory, self)
.await;
match result {
Ok(outcome) => match outcome.status {
InterpretStatus::SUCCESS => Ok(PreparationOutcome::Success {
details: outcome.message,
}),
InterpretStatus::NOOP => Ok(PreparationOutcome::Noop),
_ => Err(PreparationError::new(outcome.message)),
},
Err(err) => Err(PreparationError::new(err.to_string())),
}
}
async fn ensure_prometheus_operator(
&self,
sender: &RHOBObservability,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
}
impl Serialize for K8sAnywhereTopology {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
@@ -348,6 +580,23 @@ impl K8sAnywhereTopology {
}
}
fn extract_and_normalize_token(&self, secret: &DynamicObject) -> Option<String> {
let token_b64 = secret
.data
.get("token")
.or_else(|| secret.data.get("data").and_then(|d| d.get("token")))
.and_then(|v| v.as_str())?;
let bytes = general_purpose::STANDARD.decode(token_b64).ok()?;
let s = String::from_utf8(bytes).ok()?;
let cleaned = s
.trim_matches(|c: char| c.is_whitespace() || c == '\0')
.to_string();
Some(cleaned)
}
pub async fn get_k8s_distribution(&self) -> Result<KubernetesDistribution, PreparationError> {
self.k8s_client()
.await?
@@ -407,6 +656,141 @@ impl K8sAnywhereTopology {
}
}
fn build_grafana_datasource(
&self,
name: &str,
ns: &str,
label_selector: &LabelSelector,
url: &str,
token: &str,
) -> GrafanaDatasource {
let mut json_data = BTreeMap::new();
json_data.insert("timeInterval".to_string(), "5s".to_string());
GrafanaDatasource {
metadata: ObjectMeta {
name: Some(name.to_string()),
namespace: Some(ns.to_string()),
..Default::default()
},
spec: GrafanaDatasourceSpec {
instance_selector: label_selector.clone(),
allow_cross_namespace_import: Some(true),
values_from: None,
datasource: GrafanaDatasourceConfig {
access: "proxy".to_string(),
name: name.to_string(),
r#type: "prometheus".to_string(),
url: url.to_string(),
database: None,
json_data: Some(GrafanaDatasourceJsonData {
time_interval: Some("60s".to_string()),
http_header_name1: Some("Authorization".to_string()),
tls_skip_verify: Some(true),
oauth_pass_thru: Some(true),
}),
secure_json_data: Some(GrafanaDatasourceSecureJsonData {
http_header_value1: Some(format!("Bearer {token}")),
}),
is_default: Some(false),
editable: Some(true),
},
},
}
}
fn build_grafana_dashboard(
&self,
ns: &str,
label_selector: &LabelSelector,
) -> GrafanaDashboard {
let graf_dashboard = GrafanaDashboard {
metadata: ObjectMeta {
name: Some(format!("grafana-dashboard-{}", ns)),
namespace: Some(ns.to_string()),
..Default::default()
},
spec: GrafanaDashboardSpec {
resync_period: Some("30s".to_string()),
instance_selector: label_selector.clone(),
datasources: Some(vec![GrafanaDashboardDatasource {
input_name: "DS_PROMETHEUS".to_string(),
datasource_name: "thanos-openshift-monitoring".to_string(),
}]),
json: None,
grafana_com: Some(GrafanaCom {
id: 17406,
revision: None,
}),
},
};
graf_dashboard
}
fn build_grafana(&self, ns: &str, labels: &BTreeMap<String, String>) -> GrafanaCRD {
let grafana = GrafanaCRD {
metadata: ObjectMeta {
name: Some(format!("grafana-{}", ns)),
namespace: Some(ns.to_string()),
labels: Some(labels.clone()),
..Default::default()
},
spec: GrafanaSpec {
config: None,
admin_user: None,
admin_password: None,
ingress: None,
persistence: None,
resources: None,
},
};
grafana
}
async fn build_grafana_ingress(&self, ns: &str) -> K8sIngressScore {
let domain = self.get_domain(&format!("grafana-{}", ns)).await.unwrap();
let name = format!("{}-grafana", ns);
let backend_service = format!("grafana-{}-service", ns);
K8sIngressScore {
name: fqdn::fqdn!(&name),
host: fqdn::fqdn!(&domain),
backend_service: fqdn::fqdn!(&backend_service),
port: 3000,
path: Some("/".to_string()),
path_type: Some(PathType::Prefix),
namespace: Some(fqdn::fqdn!(&ns)),
ingress_class_name: Some("openshift-default".to_string()),
}
}
async fn get_cluster_observability_operator_prometheus_application_score(
&self,
sender: RHOBObservability,
receivers: Option<Vec<Box<dyn AlertReceiver<RHOBObservability>>>>,
) -> RHOBAlertingScore {
RHOBAlertingScore {
sender,
receivers: receivers.unwrap_or_default(),
service_monitors: vec![],
prometheus_rules: vec![],
}
}
async fn get_k8s_prometheus_application_score(
&self,
sender: CRDPrometheus,
receivers: Option<Vec<Box<dyn AlertReceiver<CRDPrometheus>>>>,
service_monitors: Option<Vec<ServiceMonitor>>,
) -> K8sPrometheusCRDAlertingScore {
K8sPrometheusCRDAlertingScore {
sender,
receivers: receivers.unwrap_or_default(),
service_monitors: service_monitors.unwrap_or_default(),
prometheus_rules: vec![],
}
}
async fn openshift_ingress_operator_available(&self) -> Result<(), PreparationError> {
let client = self.k8s_client().await?;
let gvk = GroupVersionKind {
@@ -572,6 +956,137 @@ impl K8sAnywhereTopology {
)),
}
}
async fn ensure_cluster_observability_operator(
&self,
sender: &RHOBObservability,
) -> Result<PreparationOutcome, PreparationError> {
let status = Command::new("sh")
.args(["-c", "kubectl get crd -A | grep -i rhobs"])
.status()
.map_err(|e| PreparationError::new(format!("could not connect to cluster: {}", e)))?;
if !status.success() {
if let Some(Some(k8s_state)) = self.k8s_state.get() {
match k8s_state.source {
K8sSource::LocalK3d => {
warn!(
"Installing observability operator is not supported on LocalK3d source"
);
return Ok(PreparationOutcome::Noop);
}
K8sSource::Kubeconfig => {
debug!(
"unable to install cluster observability operator, contact cluster admin"
);
return Ok(PreparationOutcome::Noop);
}
}
} else {
warn!(
"Unable to detect k8s_state. Skipping Cluster Observability Operator install."
);
return Ok(PreparationOutcome::Noop);
}
}
debug!("Cluster Observability Operator is already present, skipping install");
Ok(PreparationOutcome::Success {
details: "cluster observability operator present in cluster".into(),
})
}
async fn ensure_prometheus_operator(
&self,
sender: &CRDPrometheus,
) -> Result<PreparationOutcome, PreparationError> {
let status = Command::new("sh")
.args(["-c", "kubectl get crd -A | grep -i prometheuses"])
.status()
.map_err(|e| PreparationError::new(format!("could not connect to cluster: {}", e)))?;
if !status.success() {
if let Some(Some(k8s_state)) = self.k8s_state.get() {
match k8s_state.source {
K8sSource::LocalK3d => {
debug!("installing prometheus operator");
let op_score =
prometheus_operator_helm_chart_score(sender.namespace.clone());
let result = op_score.interpret(&Inventory::empty(), self).await;
return match result {
Ok(outcome) => match outcome.status {
InterpretStatus::SUCCESS => Ok(PreparationOutcome::Success {
details: "installed prometheus operator".into(),
}),
InterpretStatus::NOOP => Ok(PreparationOutcome::Noop),
_ => Err(PreparationError::new(
"failed to install prometheus operator (unknown error)".into(),
)),
},
Err(err) => Err(PreparationError::new(err.to_string())),
};
}
K8sSource::Kubeconfig => {
debug!("unable to install prometheus operator, contact cluster admin");
return Ok(PreparationOutcome::Noop);
}
}
} else {
warn!("Unable to detect k8s_state. Skipping Prometheus Operator install.");
return Ok(PreparationOutcome::Noop);
}
}
debug!("Prometheus operator is already present, skipping install");
Ok(PreparationOutcome::Success {
details: "prometheus operator present in cluster".into(),
})
}
async fn install_grafana_operator(
&self,
inventory: &Inventory,
ns: Option<&str>,
) -> Result<PreparationOutcome, PreparationError> {
let namespace = ns.unwrap_or("grafana");
info!("installing grafana operator in ns {namespace}");
let tenant = self.get_k8s_tenant_manager()?.get_tenant_config().await;
let namespace_scope = tenant.is_some();
grafana_helm_chart_score(namespace, namespace_scope)
.interpret(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: format!("Successfully installed grafana operator in ns {namespace}"),
})
}
}
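The `install_grafana_operator` fix above hinges on a general Rust pitfall: binding a `Result` to `_` (or an unused variable) silently drops the error, while the `?` operator propagates it to the caller. A minimal std-only sketch of the difference, with hypothetical function names:

```rust
// Hypothetical fallible operation standing in for the helm-chart interpret call.
fn may_fail(fail: bool) -> Result<u32, String> {
    if fail { Err("boom".to_string()) } else { Ok(7) }
}

// Discards the error: the caller never learns the operation failed.
fn swallows(fail: bool) -> u32 {
    let _ = may_fail(fail); // Result dropped on the floor
    0
}

// Propagates the error: `?` returns early with the Err value.
fn propagates(fail: bool) -> Result<u32, String> {
    let v = may_fail(fail)?;
    Ok(v + 1)
}
```

This is why the rewritten function ends the `.map_err(...)` chain with `?` instead of binding the result to an unused variable.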
#[derive(Clone, Debug)]


@@ -1,5 +1,4 @@
mod k8s_anywhere;
pub mod nats;
pub mod observability;
mod postgres;
pub use k8s_anywhere::*;


@@ -1,147 +0,0 @@
use async_trait::async_trait;
use crate::{
inventory::Inventory,
modules::monitoring::grafana::{
grafana::Grafana,
k8s::{
score_ensure_grafana_ready::GrafanaK8sEnsureReadyScore,
score_grafana_alert_receiver::GrafanaK8sReceiverScore,
score_grafana_datasource::GrafanaK8sDatasourceScore,
score_grafana_rule::GrafanaK8sRuleScore, score_install_grafana::GrafanaK8sInstallScore,
},
},
score::Score,
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<Grafana> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &Grafana,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = GrafanaK8sInstallScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Grafana not installed {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed grafana alert sender".to_string(),
})
}
async fn install_receivers(
&self,
sender: &Grafana,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<Grafana>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
let score = GrafanaK8sReceiverScore {
receiver,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install receiver: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert receivers installed successfully".to_string(),
})
}
async fn install_rules(
&self,
sender: &Grafana,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<Grafana>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let rules = match rules {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for rule in rules {
let score = GrafanaK8sRuleScore {
sender: sender.clone(),
rule,
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install rule: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert rules installed successfully".to_string(),
})
}
async fn add_scrape_targets(
&self,
sender: &Grafana,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<Grafana>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let scrape_targets = match scrape_targets {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for scrape_target in scrape_targets {
let score = GrafanaK8sDatasourceScore {
scrape_target,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to add DataSource: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All datasources installed successfully".to_string(),
})
}
async fn ensure_monitoring_installed(
&self,
sender: &Grafana,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = GrafanaK8sEnsureReadyScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Grafana not ready {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Grafana Ready".to_string(),
})
}
}


@@ -1,142 +0,0 @@
use async_trait::async_trait;
use crate::{
inventory::Inventory,
modules::monitoring::kube_prometheus::{
KubePrometheus, helm::kube_prometheus_helm_chart::kube_prometheus_helm_chart_score,
score_kube_prometheus_alert_receivers::KubePrometheusReceiverScore,
score_kube_prometheus_ensure_ready::KubePrometheusEnsureReadyScore,
score_kube_prometheus_rule::KubePrometheusRuleScore,
score_kube_prometheus_scrape_target::KubePrometheusScrapeTargetScore,
},
score::Score,
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<KubePrometheus> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
kube_prometheus_helm_chart_score(sender.config.clone())
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed kubeprometheus alert sender".to_string(),
})
}
async fn install_receivers(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<KubePrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
let score = KubePrometheusReceiverScore {
receiver,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install receiver: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert receivers installed successfully".to_string(),
})
}
async fn install_rules(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<KubePrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let rules = match rules {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for rule in rules {
let score = KubePrometheusRuleScore {
sender: sender.clone(),
rule,
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install rule: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert rules installed successfully".to_string(),
})
}
async fn add_scrape_targets(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<KubePrometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let scrape_targets = match scrape_targets {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for scrape_target in scrape_targets {
let score = KubePrometheusScrapeTargetScore {
scrape_target,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install scrape target: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All scrape targets installed successfully".to_string(),
})
}
async fn ensure_monitoring_installed(
&self,
sender: &KubePrometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = KubePrometheusEnsureReadyScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("KubePrometheus not ready {}", e)))?;
Ok(PreparationOutcome::Success {
details: "KubePrometheus Ready".to_string(),
})
}
}


@@ -1,5 +0,0 @@
pub mod grafana;
pub mod kube_prometheus;
pub mod openshift_monitoring;
pub mod prometheus;
pub mod redhat_cluster_observability;


@@ -1,142 +0,0 @@
use async_trait::async_trait;
use log::info;
use crate::score::Score;
use crate::{
inventory::Inventory,
modules::monitoring::okd::{
OpenshiftClusterAlertSender,
score_enable_cluster_monitoring::OpenshiftEnableClusterMonitoringScore,
score_openshift_alert_rule::OpenshiftAlertRuleScore,
score_openshift_receiver::OpenshiftReceiverScore,
score_openshift_scrape_target::OpenshiftScrapeTargetScore,
score_user_workload::OpenshiftUserWorkloadMonitoring,
score_verify_user_workload_monitoring::VerifyUserWorkload,
},
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
info!("enabling cluster monitoring");
let cluster_monitoring_score = OpenshiftEnableClusterMonitoringScore {};
cluster_monitoring_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
info!("enabling user workload monitoring");
let user_workload_score = OpenshiftUserWorkloadMonitoring {};
user_workload_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
Ok(PreparationOutcome::Success {
details: "Successfully configured cluster monitoring".to_string(),
})
}
async fn install_receivers(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<OpenshiftClusterAlertSender>>>>,
) -> Result<PreparationOutcome, PreparationError> {
if let Some(receivers) = receivers {
for receiver in receivers {
info!("Installing receiver {}", receiver.name());
let receiver_score = OpenshiftReceiverScore { receiver };
receiver_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
}
Ok(PreparationOutcome::Success {
details: "Successfully installed receivers for OpenshiftClusterMonitoring"
.to_string(),
})
} else {
Ok(PreparationOutcome::Noop)
}
}
async fn install_rules(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<OpenshiftClusterAlertSender>>>>,
) -> Result<PreparationOutcome, PreparationError> {
if let Some(rules) = rules {
for rule in rules {
info!("Installing rule");
let rule_score = OpenshiftAlertRuleScore { rule };
rule_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
}
Ok(PreparationOutcome::Success {
details: "Successfully installed rules for OpenshiftClusterMonitoring".to_string(),
})
} else {
Ok(PreparationOutcome::Noop)
}
}
async fn add_scrape_targets(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<OpenshiftClusterAlertSender>>>>,
) -> Result<PreparationOutcome, PreparationError> {
if let Some(scrape_targets) = scrape_targets {
for scrape_target in scrape_targets {
info!("Installing scrape target");
let scrape_target_score = OpenshiftScrapeTargetScore { scrape_target };
scrape_target_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
}
Ok(PreparationOutcome::Success {
details: "Successfully added scrape targets for OpenshiftClusterMonitoring"
.to_string(),
})
} else {
Ok(PreparationOutcome::Noop)
}
}
async fn ensure_monitoring_installed(
&self,
_sender: &OpenshiftClusterAlertSender,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let verify_monitoring_score = VerifyUserWorkload {};
info!("Verifying user workload and cluster monitoring installed");
verify_monitoring_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError { msg: e.to_string() })?;
Ok(PreparationOutcome::Success {
details: "OpenshiftClusterMonitoring ready".to_string(),
})
}
}


@@ -1,147 +0,0 @@
use async_trait::async_trait;
use crate::{
inventory::Inventory,
modules::monitoring::prometheus::{
Prometheus, score_prometheus_alert_receivers::PrometheusReceiverScore,
score_prometheus_ensure_ready::PrometheusEnsureReadyScore,
score_prometheus_install::PrometheusInstallScore,
score_prometheus_rule::PrometheusRuleScore,
score_prometheus_scrape_target::PrometheusScrapeTargetScore,
},
score::Score,
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<Prometheus> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &Prometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = PrometheusInstallScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Prometheus not installed {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed prometheus alert sender".to_string(),
})
}
async fn install_receivers(
&self,
sender: &Prometheus,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<Prometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
let score = PrometheusReceiverScore {
receiver,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install receiver: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert receivers installed successfully".to_string(),
})
}
async fn install_rules(
&self,
sender: &Prometheus,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<Prometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let rules = match rules {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for rule in rules {
let score = PrometheusRuleScore {
sender: sender.clone(),
rule,
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install rule: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All alert rules installed successfully".to_string(),
})
}
async fn add_scrape_targets(
&self,
sender: &Prometheus,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<Prometheus>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let scrape_targets = match scrape_targets {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for scrape_target in scrape_targets {
let score = PrometheusScrapeTargetScore {
scrape_target,
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Failed to install scrape target: {}", e)))?;
}
Ok(PreparationOutcome::Success {
details: "All scrape targets installed successfully".to_string(),
})
}
async fn ensure_monitoring_installed(
&self,
sender: &Prometheus,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
let score = PrometheusEnsureReadyScore {
sender: sender.clone(),
};
score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(format!("Prometheus not ready {}", e)))?;
Ok(PreparationOutcome::Success {
details: "Prometheus Ready".to_string(),
})
}
}


@@ -1,116 +0,0 @@
use crate::{
modules::monitoring::red_hat_cluster_observability::{
score_alert_receiver::RedHatClusterObservabilityReceiverScore,
score_coo_monitoring_stack::RedHatClusterObservabilityMonitoringStackScore,
},
score::Score,
};
use async_trait::async_trait;
use log::info;
use crate::{
inventory::Inventory,
modules::monitoring::red_hat_cluster_observability::{
RedHatClusterObservability,
score_redhat_cluster_observability_operator::RedHatClusterObservabilityOperatorScore,
},
topology::{
K8sAnywhereTopology, PreparationError, PreparationOutcome,
monitoring::{AlertReceiver, AlertRule, Observability, ScrapeTarget},
},
};
#[async_trait]
impl Observability<RedHatClusterObservability> for K8sAnywhereTopology {
async fn install_alert_sender(
&self,
sender: &RedHatClusterObservability,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
info!("Verifying Redhat Cluster Observability Operator");
let coo_score = RedHatClusterObservabilityOperatorScore::default();
coo_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
info!(
"Installing Cluster Observability Operator Monitoring Stack in ns {}",
sender.namespace.clone()
);
let coo_monitoring_stack_score = RedHatClusterObservabilityMonitoringStackScore {
namespace: sender.namespace.clone(),
resource_selector: sender.resource_selector.clone(),
};
coo_monitoring_stack_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
Ok(PreparationOutcome::Success {
details: "Successfully installed RedHatClusterObservability Operator".to_string(),
})
}
async fn install_receivers(
&self,
sender: &RedHatClusterObservability,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<RedHatClusterObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
let receivers = match receivers {
Some(r) if !r.is_empty() => r,
_ => return Ok(PreparationOutcome::Noop),
};
for receiver in receivers {
info!("Installing receiver {}", receiver.name());
let receiver_score = RedHatClusterObservabilityReceiverScore {
receiver,
sender: sender.clone(),
};
receiver_score
.create_interpret()
.execute(inventory, self)
.await
.map_err(|e| PreparationError::new(e.to_string()))?;
}
Ok(PreparationOutcome::Success {
details: "Successfully installed receivers for RedHatClusterObservability".to_string(),
})
}
async fn install_rules(
&self,
_sender: &RedHatClusterObservability,
_inventory: &Inventory,
_rules: Option<Vec<Box<dyn AlertRule<RedHatClusterObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
async fn add_scrape_targets(
&self,
_sender: &RedHatClusterObservability,
_inventory: &Inventory,
_scrape_targets: Option<Vec<Box<dyn ScrapeTarget<RedHatClusterObservability>>>>,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
async fn ensure_monitoring_installed(
&self,
_sender: &RedHatClusterObservability,
_inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError> {
todo!()
}
}


@@ -106,6 +106,7 @@ pub enum SSL {
#[derive(Debug, Clone, PartialEq, Serialize)]
pub enum HealthCheck {
HTTP(String, HttpMethod, HttpStatusCode, SSL),
/// HTTP(None, "/healthz/ready", HttpMethod::GET, HttpStatusCode::Success2xx, SSL::Disabled)
HTTP(Option<u16>, String, HttpMethod, HttpStatusCode, SSL),
TCP(Option<u16>),
}
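The hunk above changes the `HTTP` variant to carry an optional check port ahead of the path, with `None` meaning "probe the backend's own port". A std-only sketch of that fallback semantics, using simplified stand-in variants (the real enum also carries `HttpMethod`, `HttpStatusCode`, and `SSL`):

```rust
// Simplified stand-in for the real HealthCheck enum: only the fields
// relevant to port resolution are kept.
#[derive(Debug, Clone, PartialEq)]
pub enum HealthCheck {
    /// Optional check port plus request path.
    Http(Option<u16>, String),
    Tcp(Option<u16>),
}

/// Resolve the port a health check should probe: an explicit check port
/// wins, otherwise fall back to the backend's service port.
pub fn effective_port(check: &HealthCheck, backend_port: u16) -> u16 {
    match check {
        HealthCheck::Http(port, _) | HealthCheck::Tcp(port) => port.unwrap_or(backend_port),
    }
}
```

Under this reading, `HTTP(Some(1936), "/healthz/ready", ...)` probes port 1936 regardless of where the service itself listens.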


@@ -2,7 +2,6 @@ pub mod decentralized;
mod failover;
mod ha_cluster;
pub mod ingress;
pub mod monitoring;
pub mod node_exporter;
pub mod opnsense;
pub use failover::*;
@@ -12,6 +11,7 @@ mod http;
pub mod installable;
mod k8s_anywhere;
mod localhost;
pub mod oberservability;
pub mod tenant;
use derive_new::new;
pub use k8s_anywhere::*;


@@ -1,256 +0,0 @@
use std::{
any::Any,
collections::{BTreeMap, HashMap},
net::IpAddr,
};
use async_trait::async_trait;
use kube::api::DynamicObject;
use log::{debug, info};
use serde::{Deserialize, Serialize};
use crate::{
data::Version,
interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome},
inventory::Inventory,
topology::{PreparationError, PreparationOutcome, Topology, installable::Installable},
};
use harmony_types::id::Id;
/// Defines the application that sends alerts to receivers,
/// for example Prometheus
#[async_trait]
pub trait AlertSender: Send + Sync + std::fmt::Debug {
fn name(&self) -> String;
}
/// Trait which defines how an alert sender is implemented for a specific topology
#[async_trait]
pub trait Observability<S: AlertSender> {
async fn install_alert_sender(
&self,
sender: &S,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError>;
async fn install_receivers(
&self,
sender: &S,
inventory: &Inventory,
receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>,
) -> Result<PreparationOutcome, PreparationError>;
async fn install_rules(
&self,
sender: &S,
inventory: &Inventory,
rules: Option<Vec<Box<dyn AlertRule<S>>>>,
) -> Result<PreparationOutcome, PreparationError>;
async fn add_scrape_targets(
&self,
sender: &S,
inventory: &Inventory,
scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>,
) -> Result<PreparationOutcome, PreparationError>;
async fn ensure_monitoring_installed(
&self,
sender: &S,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError>;
}
/// Defines the entity that receives the alerts from a sender. For example Discord, Slack, etc
///
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}
/// Defines a generic rule that can be applied to a sender, such as a Prometheus alert rule
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}
/// A generic scrape target that can be added to a sender to scrape metrics from, for example a
/// server outside of the cluster
pub trait ScrapeTarget<S: AlertSender>: std::fmt::Debug + Send + Sync {
fn build_scrape_target(&self) -> Result<ExternalScrapeTarget, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn ScrapeTarget<S>>;
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExternalScrapeTarget {
pub ip: IpAddr,
pub port: i32,
pub interval: Option<String>,
pub path: Option<String>,
pub labels: Option<BTreeMap<String, String>>,
}
/// Alerting interpret to install an alert sender on a given topology
#[derive(Debug)]
pub struct AlertingInterpret<S: AlertSender> {
pub sender: S,
pub receivers: Vec<Box<dyn AlertReceiver<S>>>,
pub rules: Vec<Box<dyn AlertRule<S>>>,
pub scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>,
}
#[async_trait]
impl<S: AlertSender, T: Topology + Observability<S>> Interpret<T> for AlertingInterpret<S> {
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
info!("Configuring alert sender {}", self.sender.name());
topology
.install_alert_sender(&self.sender, inventory)
.await?;
info!("Installing receivers");
topology
.install_receivers(&self.sender, inventory, Some(self.receivers.clone()))
.await?;
info!("Installing rules");
topology
.install_rules(&self.sender, inventory, Some(self.rules.clone()))
.await?;
info!("Adding extra scrape targets");
topology
.add_scrape_targets(&self.sender, inventory, self.scrape_targets.clone())
.await?;
info!("Ensuring alert sender {} is ready", self.sender.name());
topology
.ensure_monitoring_installed(&self.sender, inventory)
.await?;
Ok(Outcome::success(format!(
"successfully installed alert sender {}",
self.sender.name()
)))
}
fn get_name(&self) -> InterpretName {
InterpretName::Alerting
}
fn get_version(&self) -> Version {
todo!()
}
fn get_status(&self) -> InterpretStatus {
todo!()
}
fn get_children(&self) -> Vec<Id> {
todo!()
}
}
impl<S: AlertSender> Clone for Box<dyn AlertReceiver<S>> {
fn clone(&self) -> Self {
self.clone_box()
}
}
impl<S: AlertSender> Clone for Box<dyn AlertRule<S>> {
fn clone(&self) -> Self {
self.clone_box()
}
}
impl<S: AlertSender> Clone for Box<dyn ScrapeTarget<S>> {
fn clone(&self) -> Self {
self.clone_box()
}
}
#[derive(Default)]
pub struct ReceiverInstallPlan {
pub install_operation: Option<Vec<InstallOperation>>,
pub route: Option<AlertRoute>,
pub receiver: Option<serde_yaml::Value>,
}
pub enum InstallOperation {
CreateSecret {
name: String,
data: BTreeMap<String, String>,
},
}
/// Generic routing that can map to various alert sender backends
#[derive(Debug, Clone, Serialize)]
pub struct AlertRoute {
pub receiver: String,
#[serde(skip_serializing_if = "Vec::is_empty")]
pub matchers: Vec<AlertMatcher>,
#[serde(skip_serializing_if = "Vec::is_empty")]
pub group_by: Vec<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub repeat_interval: Option<String>,
#[serde(rename = "continue")]
pub continue_matching: bool,
#[serde(skip_serializing_if = "Vec::is_empty")]
pub children: Vec<AlertRoute>,
}
impl AlertRoute {
pub fn default(name: String) -> Self {
Self {
receiver: name,
matchers: vec![],
group_by: vec![],
repeat_interval: Some("30s".to_string()),
continue_matching: true,
children: vec![],
}
}
}
#[derive(Debug, Clone, Serialize)]
pub struct AlertMatcher {
pub label: String,
pub operator: MatchOp,
pub value: String,
}
#[derive(Debug, Clone)]
pub enum MatchOp {
Eq,
NotEq,
Regex,
}
impl Serialize for MatchOp {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
let op = match self {
MatchOp::Eq => "=",
MatchOp::NotEq => "!=",
MatchOp::Regex => "=~",
};
serializer.serialize_str(op)
}
}
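The custom `Serialize` impl above exists only to map each operator onto its Alertmanager matcher symbol. The mapping can be sketched without the serde machinery as a plain method, which is also how one might unit-test it (std-only sketch; `as_str` is a hypothetical helper, not part of the source):

```rust
#[derive(Debug, Clone)]
pub enum MatchOp {
    Eq,
    NotEq,
    Regex,
}

impl MatchOp {
    /// The Alertmanager matcher symbol each operator serializes to.
    pub fn as_str(&self) -> &'static str {
        match self {
            MatchOp::Eq => "=",
            MatchOp::NotEq => "!=",
            MatchOp::Regex => "=~",
        }
    }
}
```

A `serialize` impl can then delegate to `serializer.serialize_str(self.as_str())`, keeping the symbol table in one place.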


@@ -0,0 +1 @@
pub mod monitoring;


@@ -0,0 +1,101 @@
use std::{any::Any, collections::HashMap};
use async_trait::async_trait;
use kube::api::DynamicObject;
use log::debug;
use crate::{
data::Version,
interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome},
inventory::Inventory,
topology::{Topology, installable::Installable},
};
use harmony_types::id::Id;
#[async_trait]
pub trait AlertSender: Send + Sync + std::fmt::Debug {
fn name(&self) -> String;
}
#[derive(Debug)]
pub struct AlertingInterpret<S: AlertSender> {
pub sender: S,
pub receivers: Vec<Box<dyn AlertReceiver<S>>>,
pub rules: Vec<Box<dyn AlertRule<S>>>,
pub scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>,
}
#[async_trait]
impl<S: AlertSender + Installable<T>, T: Topology> Interpret<T> for AlertingInterpret<S> {
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
debug!("hit sender configure for AlertingInterpret");
self.sender.configure(inventory, topology).await?;
for receiver in self.receivers.iter() {
receiver.install(&self.sender).await?;
}
for rule in self.rules.iter() {
debug!("installing rule: {:#?}", rule);
rule.install(&self.sender).await?;
}
if let Some(targets) = &self.scrape_targets {
for target in targets.iter() {
debug!("installing scrape_target: {:#?}", target);
target.install(&self.sender).await?;
}
}
self.sender.ensure_installed(inventory, topology).await?;
Ok(Outcome::success(format!(
"successfully installed alert sender {}",
self.sender.name()
)))
}
fn get_name(&self) -> InterpretName {
InterpretName::Alerting
}
fn get_version(&self) -> Version {
todo!()
}
fn get_status(&self) -> InterpretStatus {
todo!()
}
fn get_children(&self) -> Vec<Id> {
todo!()
}
}
#[async_trait]
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
async fn install(&self, sender: &S) -> Result<Outcome, InterpretError>;
fn name(&self) -> String;
fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
fn as_any(&self) -> &dyn Any;
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String>;
}
#[derive(Debug)]
pub struct AlertManagerReceiver {
pub receiver_config: serde_json::Value,
// FIXME we should not leak k8s here. DynamicObject is k8s specific
pub additional_ressources: Vec<DynamicObject>,
pub route_config: serde_json::Value,
}
#[async_trait]
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
async fn install(&self, sender: &S) -> Result<Outcome, InterpretError>;
fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}
#[async_trait]
pub trait ScrapeTarget<S: AlertSender>: std::fmt::Debug + Send + Sync {
async fn install(&self, sender: &S) -> Result<Outcome, InterpretError>;
fn clone_box(&self) -> Box<dyn ScrapeTarget<S>>;
}

View File

@@ -216,7 +216,15 @@ pub(crate) fn get_health_check_for_backend(
SSL::Other(other.to_string())
}
};
Some(HealthCheck::HTTP(path, method, status_code, ssl))
let port = haproxy_health_check
.checkport
.content_string()
.parse::<u16>()
.ok();
debug!("Found haproxy healthcheck port {port:?}");
Some(HealthCheck::HTTP(port, path, method, status_code, ssl))
}
_ => panic!("Received unsupported health check type {}", uppercase),
}
@@ -251,7 +259,7 @@ pub(crate) fn harmony_load_balancer_service_to_haproxy_xml(
// frontend points to backend
let healthcheck = if let Some(health_check) = &service.health_check {
match health_check {
HealthCheck::HTTP(path, http_method, _http_status_code, ssl) => {
HealthCheck::HTTP(port, path, http_method, _http_status_code, ssl) => {
let ssl: MaybeString = match ssl {
SSL::SSL => "ssl".into(),
SSL::SNI => "sslni".into(),
@@ -267,6 +275,7 @@ pub(crate) fn harmony_load_balancer_service_to_haproxy_xml(
http_uri: path.clone().into(),
interval: "2s".to_string(),
ssl,
checkport: MaybeString::from(port.map(|p| p.to_string())),
..Default::default()
};

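The hunk above parses the HAProxy `checkport` leniently: a missing or non-numeric value becomes `None` rather than aborting the conversion, and the port is threaded through `HealthCheck::HTTP` as an `Option<u16>`. A minimal standalone sketch of that parsing behavior (the `parse_checkport` helper is hypothetical, for illustration only):

```rust
// Hypothetical standalone sketch of the lenient checkport parsing used in
// get_health_check_for_backend: `.parse::<u16>().ok()` turns any parse
// failure (empty string, garbage, out-of-range value) into None.
fn parse_checkport(raw: &str) -> Option<u16> {
    raw.parse::<u16>().ok()
}

fn main() {
    assert_eq!(parse_checkport("1936"), Some(1936)); // the OKD stats port
    assert_eq!(parse_checkport(""), None);           // checkport absent
    assert_eq!(parse_checkport("99999"), None);      // overflows u16
    println!("ok");
}
```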
View File

@@ -2,15 +2,13 @@ use crate::modules::application::{
Application, ApplicationFeature, InstallationError, InstallationOutcome,
};
use crate::modules::monitoring::application_monitoring::application_monitoring_score::ApplicationMonitoringScore;
use crate::modules::monitoring::grafana::grafana::Grafana;
use crate::modules::monitoring::kube_prometheus::crd::crd_alertmanager_config::CRDPrometheus;
use crate::modules::monitoring::kube_prometheus::crd::service_monitor::{
ServiceMonitor, ServiceMonitorSpec,
};
use crate::modules::monitoring::prometheus::Prometheus;
use crate::modules::monitoring::prometheus::helm::prometheus_config::PrometheusConfig;
use crate::topology::MultiTargetTopology;
use crate::topology::ingress::Ingress;
use crate::topology::monitoring::Observability;
use crate::topology::monitoring::{AlertReceiver, AlertRoute};
use crate::{
inventory::Inventory,
modules::monitoring::{
@@ -19,6 +17,10 @@ use crate::{
score::Score,
topology::{HelmCommand, K8sclient, Topology, tenant::TenantManager},
};
use crate::{
modules::prometheus::prometheus::PrometheusMonitoring,
topology::oberservability::monitoring::AlertReceiver,
};
use async_trait::async_trait;
use base64::{Engine as _, engine::general_purpose};
use harmony_secret::SecretManager;
@@ -28,13 +30,12 @@ use kube::api::ObjectMeta;
use log::{debug, info};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use std::sync::{Arc, Mutex};
use std::sync::Arc;
//TODO test this
#[derive(Debug, Clone)]
pub struct Monitoring {
pub application: Arc<dyn Application>,
pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
pub alert_receiver: Vec<Box<dyn AlertReceiver<CRDPrometheus>>>,
}
#[async_trait]
@@ -45,7 +46,8 @@ impl<
+ TenantManager
+ K8sclient
+ MultiTargetTopology
+ Observability<Prometheus>
+ PrometheusMonitoring<CRDPrometheus>
+ Grafana
+ Ingress
+ std::fmt::Debug,
> ApplicationFeature<T> for Monitoring
@@ -72,15 +74,17 @@ impl<
};
let mut alerting_score = ApplicationMonitoringScore {
sender: Prometheus {
config: Arc::new(Mutex::new(PrometheusConfig::new())),
sender: CRDPrometheus {
namespace: namespace.clone(),
client: topology.k8s_client().await.unwrap(),
service_monitor: vec![app_service_monitor],
},
application: self.application.clone(),
receivers: self.alert_receiver.clone(),
};
let ntfy = NtfyScore {
namespace: namespace.clone(),
host: domain.clone(),
host: domain,
};
ntfy.interpret(&Inventory::empty(), topology)
.await
@@ -101,28 +105,20 @@ impl<
debug!("ntfy_default_auth_param: {ntfy_default_auth_param}");
let ntfy_receiver = WebhookReceiver {
name: "ntfy-webhook".to_string(),
url: Url::Url(
url::Url::parse(
format!(
"http://{domain}/{}?auth={ntfy_default_auth_param}",
__self.application.name()
"http://ntfy.{}.svc.cluster.local/rust-web-app?auth={ntfy_default_auth_param}",
namespace.clone()
)
.as_str(),
)
.unwrap(),
),
route: AlertRoute {
..AlertRoute::default("ntfy-webhook".to_string())
},
};
debug!(
"ntfy webhook receiver \n{:#?}\nntfy topic: {}",
ntfy_receiver.clone(),
self.application.name()
);
alerting_score.receivers.push(Box::new(ntfy_receiver));
alerting_score
.interpret(&Inventory::empty(), topology)

View File

@@ -3,13 +3,11 @@ use std::sync::Arc;
use crate::modules::application::{
Application, ApplicationFeature, InstallationError, InstallationOutcome,
};
use crate::modules::monitoring::application_monitoring::rhobs_application_monitoring_score::ApplicationRHOBMonitoringScore;
use crate::modules::monitoring::red_hat_cluster_observability::RedHatClusterObservability;
use crate::modules::monitoring::red_hat_cluster_observability::redhat_cluster_observability::RedHatClusterObservabilityScore;
use crate::modules::monitoring::kube_prometheus::crd::rhob_alertmanager_config::RHOBObservability;
use crate::topology::MultiTargetTopology;
use crate::topology::ingress::Ingress;
use crate::topology::monitoring::Observability;
use crate::topology::monitoring::{AlertReceiver, AlertRoute};
use crate::{
inventory::Inventory,
modules::monitoring::{
@@ -18,6 +16,10 @@ use crate::{
score::Score,
topology::{HelmCommand, K8sclient, Topology, tenant::TenantManager},
};
use crate::{
modules::prometheus::prometheus::PrometheusMonitoring,
topology::oberservability::monitoring::AlertReceiver,
};
use async_trait::async_trait;
use base64::{Engine as _, engine::general_purpose};
use harmony_types::net::Url;
@@ -26,10 +28,9 @@ use log::{debug, info};
#[derive(Debug, Clone)]
pub struct Monitoring {
pub application: Arc<dyn Application>,
pub alert_receiver: Vec<Box<dyn AlertReceiver<RedHatClusterObservability>>>,
pub alert_receiver: Vec<Box<dyn AlertReceiver<RHOBObservability>>>,
}
///TODO TEST this
#[async_trait]
impl<
T: Topology
@@ -40,7 +41,7 @@ impl<
+ MultiTargetTopology
+ Ingress
+ std::fmt::Debug
+ Observability<RedHatClusterObservability>,
+ PrometheusMonitoring<RHOBObservability>,
> ApplicationFeature<T> for Monitoring
{
async fn ensure_installed(
@@ -54,14 +55,13 @@ impl<
.map(|ns| ns.name.clone())
.unwrap_or_else(|| self.application.name());
let mut alerting_score = RedHatClusterObservabilityScore {
sender: RedHatClusterObservability {
let mut alerting_score = ApplicationRHOBMonitoringScore {
sender: RHOBObservability {
namespace: namespace.clone(),
resource_selector: todo!(),
client: topology.k8s_client().await.unwrap(),
},
application: self.application.clone(),
receivers: self.alert_receiver.clone(),
rules: vec![],
scrape_targets: None,
};
let domain = topology
.get_domain("ntfy")
@@ -97,15 +97,12 @@ impl<
url::Url::parse(
format!(
"http://{domain}/{}?auth={ntfy_default_auth_param}",
__self.application.name()
self.application.name()
)
.as_str(),
)
.unwrap(),
),
route: AlertRoute {
..AlertRoute::default("ntfy-webhook".to_string())
},
};
debug!(
"ntfy webhook receiver \n{:#?}\nntfy topic: {}",

View File

@@ -1,4 +1,5 @@
use async_trait::async_trait;
use k8s_openapi::ResourceScope;
use kube::Resource;
use log::info;
use serde::{Serialize, de::DeserializeOwned};
@@ -28,7 +29,7 @@ impl<K: Resource + std::fmt::Debug> K8sResourceScore<K> {
}
impl<
K: Resource
K: Resource<Scope: ResourceScope>
+ std::fmt::Debug
+ Sync
+ DeserializeOwned
@@ -60,7 +61,7 @@ pub struct K8sResourceInterpret<K: Resource + std::fmt::Debug + Sync + Send> {
#[async_trait]
impl<
K: Resource
K: Resource<Scope: ResourceScope>
+ Clone
+ std::fmt::Debug
+ DeserializeOwned

View File

@@ -20,6 +20,7 @@ pub mod okd;
pub mod openbao;
pub mod opnsense;
pub mod postgresql;
pub mod prometheus;
pub mod storage;
pub mod tenant;
pub mod tftp;

View File

@@ -1,38 +1,99 @@
use crate::modules::monitoring::kube_prometheus::KubePrometheus;
use crate::modules::monitoring::okd::OpenshiftClusterAlertSender;
use crate::modules::monitoring::red_hat_cluster_observability::RedHatClusterObservability;
use crate::topology::monitoring::{AlertRoute, InstallOperation, ReceiverInstallPlan};
use crate::{interpret::InterpretError, topology::monitoring::AlertReceiver};
use harmony_types::net::Url;
use std::any::Any;
use std::collections::{BTreeMap, HashMap};
use async_trait::async_trait;
use harmony_types::k8s_name::K8sName;
use k8s_openapi::api::core::v1::Secret;
use kube::Resource;
use kube::api::{DynamicObject, ObjectMeta};
use log::{debug, trace};
use serde::Serialize;
use serde_json::json;
use std::collections::BTreeMap;
use serde_yaml::{Mapping, Value};
use crate::infra::kube::kube_resource_to_dynamic;
use crate::modules::monitoring::kube_prometheus::crd::crd_alertmanager_config::{
AlertmanagerConfig, AlertmanagerConfigSpec, CRDPrometheus,
};
use crate::modules::monitoring::kube_prometheus::crd::rhob_alertmanager_config::RHOBObservability;
use crate::modules::monitoring::okd::OpenshiftClusterAlertSender;
use crate::topology::oberservability::monitoring::AlertManagerReceiver;
use crate::{
interpret::{InterpretError, Outcome},
modules::monitoring::{
kube_prometheus::{
prometheus::{KubePrometheus, KubePrometheusReceiver},
types::{AlertChannelConfig, AlertManagerChannelConfig},
},
prometheus::prometheus::{Prometheus, PrometheusReceiver},
},
topology::oberservability::monitoring::AlertReceiver,
};
use harmony_types::net::Url;
#[derive(Debug, Clone, Serialize)]
pub struct DiscordReceiver {
pub name: String,
pub struct DiscordWebhook {
pub name: K8sName,
pub url: Url,
pub route: AlertRoute,
pub selectors: Vec<HashMap<String, String>>,
}
impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
let receiver_block = serde_yaml::to_value(json!({
"name": self.name,
"discord_configs": [{
"webhook_url": format!("{}", self.url),
"title": "{{ template \"discord.default.title\" . }}",
"message": "{{ template \"discord.default.message\" . }}"
}]
}))
.map_err(|e| InterpretError::new(e.to_string()))?;
impl DiscordWebhook {
fn get_receiver_config(&self) -> Result<AlertManagerReceiver, String> {
let secret_name = format!("{}-secret", self.name.clone());
let webhook_key = format!("{}", self.url.clone());
Ok(ReceiverInstallPlan {
install_operation: None,
route: Some(self.route.clone()),
receiver: Some(receiver_block),
let mut string_data = BTreeMap::new();
string_data.insert("webhook-url".to_string(), webhook_key.clone());
let secret = Secret {
metadata: kube::core::ObjectMeta {
name: Some(secret_name.clone()),
..Default::default()
},
string_data: Some(string_data),
type_: Some("Opaque".to_string()),
..Default::default()
};
let mut matchers: Vec<String> = Vec::new();
for selector in &self.selectors {
trace!("selector: {:#?}", selector);
for (k, v) in selector {
matchers.push(format!("{} = {}", k, v));
}
}
Ok(AlertManagerReceiver {
additional_ressources: vec![kube_resource_to_dynamic(&secret)?],
receiver_config: json!({
"name": self.name,
"discord_configs": [
{
"webhook_url": self.url.clone(),
"title": "{{ template \"discord.default.title\" . }}",
"message": "{{ template \"discord.default.message\" . }}"
}
]
}),
route_config: json!({
"receiver": self.name,
"matchers": matchers,
}),
})
}
}
#[async_trait]
impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordWebhook {
async fn install(
&self,
sender: &OpenshiftClusterAlertSender,
) -> Result<Outcome, InterpretError> {
todo!()
}
fn name(&self) -> String {
self.name.clone().to_string()
@@ -41,77 +102,309 @@ impl AlertReceiver<OpenshiftClusterAlertSender> for DiscordReceiver {
fn clone_box(&self) -> Box<dyn AlertReceiver<OpenshiftClusterAlertSender>> {
Box::new(self.clone())
}
fn as_any(&self) -> &dyn Any {
todo!()
}
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
self.get_receiver_config()
}
}
impl AlertReceiver<RedHatClusterObservability> for DiscordReceiver {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
#[async_trait]
impl AlertReceiver<RHOBObservability> for DiscordWebhook {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &RHOBObservability) -> Result<Outcome, InterpretError> {
let ns = sender.namespace.clone();
let config = self.get_receiver_config()?;
for resource in config.additional_ressources.iter() {
todo!("can I apply a dynamicresource");
// sender.client.apply(resource, Some(&ns)).await;
}
let spec = crate::modules::monitoring::kube_prometheus::crd::rhob_alertmanager_config::AlertmanagerConfigSpec {
data: json!({
"route": {
"receiver": self.name,
},
"receivers": [
config.receiver_config
]
}),
};
let alertmanager_configs = crate::modules::monitoring::kube_prometheus::crd::rhob_alertmanager_config::AlertmanagerConfig {
metadata: ObjectMeta {
name: Some(self.name.clone().to_string()),
labels: Some(std::collections::BTreeMap::from([(
"alertmanagerConfig".to_string(),
"enabled".to_string(),
)])),
namespace: Some(sender.namespace.clone()),
..Default::default()
},
spec,
};
debug!(
"alertmanager_configs yaml:\n{:#?}",
serde_yaml::to_string(&alertmanager_configs)
);
debug!(
"alert manager configs: \n{:#?}",
alertmanager_configs.clone()
);
sender
.client
.apply(&alertmanager_configs, Some(&sender.namespace))
.await?;
Ok(Outcome::success(format!(
"installed rhob-alertmanagerconfigs for {}",
self.name
)))
}
fn name(&self) -> String {
"webhook-receiver".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<RHOBObservability>> {
Box::new(self.clone())
}
fn as_any(&self) -> &dyn Any {
self
}
}
#[async_trait]
impl AlertReceiver<CRDPrometheus> for DiscordWebhook {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &CRDPrometheus) -> Result<Outcome, InterpretError> {
let ns = sender.namespace.clone();
let secret_name = format!("{}-secret", self.name.clone());
let webhook_key = format!("{}", self.url.clone());
let mut string_data = BTreeMap::new();
string_data.insert("webhook-url".to_string(), webhook_key.clone());
let receiver_config = json!({
"name": self.name,
"discordConfigs": [
{
"apiURL": {
"key": "webhook-url",
"name": format!("{}-secret", self.name)
},
"title": "{{ template \"discord.default.title\" . }}",
"message": "{{ template \"discord.default.message\" . }}"
}
]
});
let secret = Secret {
metadata: kube::core::ObjectMeta {
name: Some(secret_name.clone()),
..Default::default()
},
string_data: Some(string_data),
type_: Some("Opaque".to_string()),
..Default::default()
};
Ok(ReceiverInstallPlan {
install_operation: Some(vec![InstallOperation::CreateSecret {
name: secret_name,
data: string_data,
}]),
route: Some(self.route.clone()),
receiver: Some(
serde_yaml::to_value(receiver_config)
.map_err(|e| InterpretError::new(e.to_string()))
.expect("failed to build yaml value"),
),
})
sender.client.apply(&secret, Some(&ns)).await?;
let spec = AlertmanagerConfigSpec {
data: json!({
"route": {
"receiver": self.name,
},
"receivers": [
{
"name": self.name,
"discordConfigs": [
{
"apiURL": {
"name": secret_name,
"key": "webhook-url",
},
"title": "{{ template \"discord.default.title\" . }}",
"message": "{{ template \"discord.default.message\" . }}"
}
]
}
]
}),
};
let alertmanager_configs = AlertmanagerConfig {
metadata: ObjectMeta {
name: Some(self.name.clone().to_string()),
labels: Some(std::collections::BTreeMap::from([(
"alertmanagerConfig".to_string(),
"enabled".to_string(),
)])),
namespace: Some(ns),
..Default::default()
},
spec,
};
sender
.client
.apply(&alertmanager_configs, Some(&sender.namespace))
.await?;
Ok(Outcome::success(format!(
"installed crd-alertmanagerconfigs for {}",
self.name
)))
}
fn name(&self) -> String {
self.name.clone()
"discord-webhook".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<RedHatClusterObservability>> {
fn clone_box(&self) -> Box<dyn AlertReceiver<CRDPrometheus>> {
Box::new(self.clone())
}
fn as_any(&self) -> &dyn Any {
self
}
}
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
let receiver_block = serde_yaml::to_value(json!({
"name": self.name,
"discord_configs": [{
"webhook_url": format!("{}", self.url),
"title": "{{ template \"discord.default.title\" . }}",
"message": "{{ template \"discord.default.message\" . }}"
}]
}))
.map_err(|e| InterpretError::new(e.to_string()))?;
Ok(ReceiverInstallPlan {
install_operation: None,
route: Some(self.route.clone()),
receiver: Some(receiver_block),
})
#[async_trait]
impl AlertReceiver<Prometheus> for DiscordWebhook {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &Prometheus) -> Result<Outcome, InterpretError> {
sender.install_receiver(self).await
}
fn name(&self) -> String {
self.name.clone()
"discord-webhook".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
Box::new(self.clone())
}
fn as_any(&self) -> &dyn Any {
self
}
}
#[async_trait]
impl PrometheusReceiver for DiscordWebhook {
fn name(&self) -> String {
self.name.clone().to_string()
}
async fn configure_receiver(&self) -> AlertManagerChannelConfig {
self.get_config().await
}
}
#[async_trait]
impl AlertReceiver<KubePrometheus> for DiscordWebhook {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &KubePrometheus) -> Result<Outcome, InterpretError> {
sender.install_receiver(self).await
}
fn clone_box(&self) -> Box<dyn AlertReceiver<KubePrometheus>> {
Box::new(self.clone())
}
fn name(&self) -> String {
"discord-webhook".to_string()
}
fn as_any(&self) -> &dyn Any {
self
}
}
#[async_trait]
impl KubePrometheusReceiver for DiscordWebhook {
fn name(&self) -> String {
self.name.clone().to_string()
}
async fn configure_receiver(&self) -> AlertManagerChannelConfig {
self.get_config().await
}
}
#[async_trait]
impl AlertChannelConfig for DiscordWebhook {
async fn get_config(&self) -> AlertManagerChannelConfig {
let channel_global_config = None;
let channel_receiver = self.alert_channel_receiver().await;
let channel_route = self.alert_channel_route().await;
AlertManagerChannelConfig {
channel_global_config,
channel_receiver,
channel_route,
}
}
}
impl DiscordWebhook {
async fn alert_channel_route(&self) -> serde_yaml::Value {
let mut route = Mapping::new();
route.insert(
Value::String("receiver".to_string()),
Value::String(self.name.clone().to_string()),
);
route.insert(
Value::String("matchers".to_string()),
Value::Sequence(vec![Value::String("alertname!=Watchdog".to_string())]),
);
route.insert(Value::String("continue".to_string()), Value::Bool(true));
Value::Mapping(route)
}
async fn alert_channel_receiver(&self) -> serde_yaml::Value {
let mut receiver = Mapping::new();
receiver.insert(
Value::String("name".to_string()),
Value::String(self.name.clone().to_string()),
);
let mut discord_config = Mapping::new();
discord_config.insert(
Value::String("webhook_url".to_string()),
Value::String(self.url.to_string()),
);
receiver.insert(
Value::String("discord_configs".to_string()),
Value::Sequence(vec![Value::Mapping(discord_config)]),
);
Value::Mapping(receiver)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test]
async fn discord_serialize_should_match() {
let discord_receiver = DiscordWebhook {
name: K8sName("test-discord".to_string()),
url: Url::Url(url::Url::parse("https://discord.i.dont.exist.com").unwrap()),
selectors: vec![],
};
let discord_receiver_receiver =
serde_yaml::to_string(&discord_receiver.alert_channel_receiver().await).unwrap();
println!("receiver \n{:#}", discord_receiver_receiver);
let discord_receiver_receiver_yaml = r#"name: test-discord
discord_configs:
- webhook_url: https://discord.i.dont.exist.com/
"#
.to_string();
let discord_receiver_route =
serde_yaml::to_string(&discord_receiver.alert_channel_route().await).unwrap();
println!("route \n{:#}", discord_receiver_route);
let discord_receiver_route_yaml = r#"receiver: test-discord
matchers:
- alertname!=Watchdog
continue: true
"#
.to_string();
assert_eq!(discord_receiver_receiver, discord_receiver_receiver_yaml);
assert_eq!(discord_receiver_route, discord_receiver_route_yaml);
}
}
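The `get_receiver_config` method above flattens each selector map into Alertmanager matcher strings of the form `key = value`. That loop can be sketched in isolation like this (`build_matchers` is a hypothetical free function extracted for illustration, not part of the crate):

```rust
use std::collections::HashMap;

// Hypothetical standalone version of the matcher-building loop in
// DiscordWebhook::get_receiver_config: every key/value pair from every
// selector map becomes one Alertmanager route matcher string.
fn build_matchers(selectors: &[HashMap<String, String>]) -> Vec<String> {
    let mut matchers = Vec::new();
    for selector in selectors {
        for (k, v) in selector {
            matchers.push(format!("{} = {}", k, v));
        }
    }
    matchers
}

fn main() {
    let mut selector = HashMap::new();
    selector.insert("namespace".to_string(), "demo".to_string());
    println!("{:?}", build_matchers(&[selector])); // ["namespace = demo"]
}
```

Note that `HashMap` iteration order is unspecified, so with multiple keys per selector the matcher order is nondeterministic; a `BTreeMap` would make it stable if ordering ever matters for config diffing.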

View File

@@ -1,13 +1,25 @@
use std::any::Any;
use async_trait::async_trait;
use kube::api::ObjectMeta;
use log::debug;
use serde::Serialize;
use serde_json::json;
use serde_yaml::{Mapping, Value};
use crate::{
interpret::InterpretError,
interpret::{InterpretError, Outcome},
modules::monitoring::{
kube_prometheus::KubePrometheus, okd::OpenshiftClusterAlertSender, prometheus::Prometheus,
red_hat_cluster_observability::RedHatClusterObservability,
kube_prometheus::{
crd::{
crd_alertmanager_config::CRDPrometheus, rhob_alertmanager_config::RHOBObservability,
},
prometheus::{KubePrometheus, KubePrometheusReceiver},
types::{AlertChannelConfig, AlertManagerChannelConfig},
},
prometheus::prometheus::{Prometheus, PrometheusReceiver},
},
topology::monitoring::{AlertReceiver, AlertRoute, ReceiverInstallPlan},
topology::oberservability::monitoring::{AlertManagerReceiver, AlertReceiver},
};
use harmony_types::net::Url;
@@ -15,115 +27,281 @@ use harmony_types::net::Url;
pub struct WebhookReceiver {
pub name: String,
pub url: Url,
pub route: AlertRoute,
}
impl WebhookReceiver {
fn build_receiver(&self) -> serde_json::Value {
json!({
"name": self.name,
"webhookConfigs": [
{
"url": self.url,
"httpConfig": {
"tlsConfig": {
"insecureSkipVerify": true
#[async_trait]
impl AlertReceiver<RHOBObservability> for WebhookReceiver {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &RHOBObservability) -> Result<Outcome, InterpretError> {
let spec = crate::modules::monitoring::kube_prometheus::crd::rhob_alertmanager_config::AlertmanagerConfigSpec {
data: json!({
"route": {
"receiver": self.name,
},
"receivers": [
{
"name": self.name,
"webhookConfigs": [
{
"url": self.url,
"httpConfig": {
"tlsConfig": {
"insecureSkipVerify": true
}
}
}
]
}
}
}
]})
]
}),
};
let alertmanager_configs = crate::modules::monitoring::kube_prometheus::crd::rhob_alertmanager_config::AlertmanagerConfig {
metadata: ObjectMeta {
name: Some(self.name.clone()),
labels: Some(std::collections::BTreeMap::from([(
"alertmanagerConfig".to_string(),
"enabled".to_string(),
)])),
namespace: Some(sender.namespace.clone()),
..Default::default()
},
spec,
};
debug!(
"alert manager configs: \n{:#?}",
alertmanager_configs.clone()
);
sender
.client
.apply(&alertmanager_configs, Some(&sender.namespace))
.await?;
Ok(Outcome::success(format!(
"installed rhob-alertmanagerconfigs for {}",
self.name
)))
}
fn build_route(&self) -> serde_json::Value {
json!({
"name": self.name})
}
}
impl AlertReceiver<OpenshiftClusterAlertSender> for WebhookReceiver {
fn name(&self) -> String {
self.name.clone()
"webhook-receiver".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<OpenshiftClusterAlertSender>> {
fn clone_box(&self) -> Box<dyn AlertReceiver<RHOBObservability>> {
Box::new(self.clone())
}
fn build(&self) -> Result<crate::topology::monitoring::ReceiverInstallPlan, InterpretError> {
let receiver = self.build_receiver();
let receiver =
serde_yaml::to_value(receiver).map_err(|e| InterpretError::new(e.to_string()))?;
Ok(ReceiverInstallPlan {
install_operation: None,
route: Some(self.route.clone()),
receiver: Some(receiver),
})
fn as_any(&self) -> &dyn Any {
self
}
}
impl AlertReceiver<RedHatClusterObservability> for WebhookReceiver {
fn name(&self) -> String {
self.name.clone()
#[async_trait]
impl AlertReceiver<CRDPrometheus> for WebhookReceiver {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &CRDPrometheus) -> Result<Outcome, InterpretError> {
let spec = crate::modules::monitoring::kube_prometheus::crd::crd_alertmanager_config::AlertmanagerConfigSpec {
data: json!({
"route": {
"receiver": self.name,
},
"receivers": [
{
"name": self.name,
"webhookConfigs": [
{
"url": self.url,
}
]
}
]
}),
};
let alertmanager_configs = crate::modules::monitoring::kube_prometheus::crd::crd_alertmanager_config::AlertmanagerConfig {
metadata: ObjectMeta {
name: Some(self.name.clone()),
labels: Some(std::collections::BTreeMap::from([(
"alertmanagerConfig".to_string(),
"enabled".to_string(),
)])),
namespace: Some(sender.namespace.clone()),
..Default::default()
},
spec,
};
debug!(
"alert manager configs: \n{:#?}",
alertmanager_configs.clone()
);
sender
.client
.apply(&alertmanager_configs, Some(&sender.namespace))
.await?;
Ok(Outcome::success(format!(
"installed crd-alertmanagerconfigs for {}",
self.name
)))
}
fn clone_box(&self) -> Box<dyn AlertReceiver<RedHatClusterObservability>> {
fn name(&self) -> String {
"webhook-receiver".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<CRDPrometheus>> {
Box::new(self.clone())
}
fn build(&self) -> Result<crate::topology::monitoring::ReceiverInstallPlan, InterpretError> {
let receiver = self.build_receiver();
let receiver =
serde_yaml::to_value(receiver).map_err(|e| InterpretError::new(e.to_string()))?;
Ok(ReceiverInstallPlan {
install_operation: None,
route: Some(self.route.clone()),
receiver: Some(receiver),
})
}
}
impl AlertReceiver<KubePrometheus> for WebhookReceiver {
fn name(&self) -> String {
self.name.clone()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<KubePrometheus>> {
Box::new(self.clone())
}
fn build(&self) -> Result<crate::topology::monitoring::ReceiverInstallPlan, InterpretError> {
let receiver = self.build_receiver();
let receiver =
serde_yaml::to_value(receiver).map_err(|e| InterpretError::new(e.to_string()))?;
Ok(ReceiverInstallPlan {
install_operation: None,
route: Some(self.route.clone()),
receiver: Some(receiver),
})
fn as_any(&self) -> &dyn Any {
self
}
}
#[async_trait]
impl AlertReceiver<Prometheus> for WebhookReceiver {
fn name(&self) -> String {
self.name.clone()
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &Prometheus) -> Result<Outcome, InterpretError> {
sender.install_receiver(self).await
}
fn name(&self) -> String {
"webhook-receiver".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
Box::new(self.clone())
}
fn build(&self) -> Result<crate::topology::monitoring::ReceiverInstallPlan, InterpretError> {
let receiver = self.build_receiver();
let receiver =
serde_yaml::to_value(receiver).map_err(|e| InterpretError::new(e.to_string()))?;
Ok(ReceiverInstallPlan {
install_operation: None,
route: Some(self.route.clone()),
receiver: Some(receiver),
})
fn as_any(&self) -> &dyn Any {
self
}
}
#[async_trait]
impl PrometheusReceiver for WebhookReceiver {
fn name(&self) -> String {
self.name.clone()
}
async fn configure_receiver(&self) -> AlertManagerChannelConfig {
self.get_config().await
}
}
#[async_trait]
impl AlertReceiver<KubePrometheus> for WebhookReceiver {
fn as_alertmanager_receiver(&self) -> Result<AlertManagerReceiver, String> {
todo!()
}
async fn install(&self, sender: &KubePrometheus) -> Result<Outcome, InterpretError> {
sender.install_receiver(self).await
}
fn name(&self) -> String {
"webhook-receiver".to_string()
}
fn clone_box(&self) -> Box<dyn AlertReceiver<KubePrometheus>> {
Box::new(self.clone())
}
fn as_any(&self) -> &dyn Any {
self
}
}
#[async_trait]
impl KubePrometheusReceiver for WebhookReceiver {
fn name(&self) -> String {
self.name.clone()
}
async fn configure_receiver(&self) -> AlertManagerChannelConfig {
self.get_config().await
}
}
#[async_trait]
impl AlertChannelConfig for WebhookReceiver {
async fn get_config(&self) -> AlertManagerChannelConfig {
let channel_global_config = None;
let channel_receiver = self.alert_channel_receiver().await;
let channel_route = self.alert_channel_route().await;
AlertManagerChannelConfig {
channel_global_config,
channel_receiver,
channel_route,
}
}
}
impl WebhookReceiver {
async fn alert_channel_route(&self) -> serde_yaml::Value {
let mut route = Mapping::new();
route.insert(
Value::String("receiver".to_string()),
Value::String(self.name.clone()),
);
route.insert(
Value::String("matchers".to_string()),
Value::Sequence(vec![Value::String("alertname!=Watchdog".to_string())]),
);
route.insert(Value::String("continue".to_string()), Value::Bool(true));
Value::Mapping(route)
}
async fn alert_channel_receiver(&self) -> serde_yaml::Value {
let mut receiver = Mapping::new();
receiver.insert(
Value::String("name".to_string()),
Value::String(self.name.clone()),
);
let mut webhook_config = Mapping::new();
webhook_config.insert(
Value::String("url".to_string()),
Value::String(self.url.to_string()),
);
receiver.insert(
Value::String("webhook_configs".to_string()),
Value::Sequence(vec![Value::Mapping(webhook_config)]),
);
Value::Mapping(receiver)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test]
async fn webhook_serialize_should_match() {
let webhook_receiver = WebhookReceiver {
name: "test-webhook".to_string(),
url: Url::Url(url::Url::parse("https://webhook.i.dont.exist.com").unwrap()),
};
let webhook_receiver_receiver =
serde_yaml::to_string(&webhook_receiver.alert_channel_receiver().await).unwrap();
println!("receiver \n{:#}", webhook_receiver_receiver);
let webhook_receiver_receiver_yaml = r#"name: test-webhook
webhook_configs:
- url: https://webhook.i.dont.exist.com/
"#
.to_string();
let webhook_receiver_route =
serde_yaml::to_string(&webhook_receiver.alert_channel_route().await).unwrap();
println!("route \n{:#}", webhook_receiver_route);
let webhook_receiver_route_yaml = r#"receiver: test-webhook
matchers:
- alertname!=Watchdog
continue: true
"#
.to_string();
assert_eq!(webhook_receiver_receiver, webhook_receiver_receiver_yaml);
assert_eq!(webhook_receiver_route, webhook_receiver_route_yaml);
}
}

View File

@@ -1,15 +0,0 @@
use crate::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
pub fn high_http_error_rate() -> PrometheusAlertRule {
let expression = r#"(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, route, service)
/
sum(rate(http_requests_total[5m])) by (job, route, service)
) > 0.05 and sum(rate(http_requests_total[5m])) by (job, route, service) > 10"#;
PrometheusAlertRule::new("HighApplicationErrorRate", expression)
.for_duration("10m")
.label("severity", "warning")
.annotation("summary", "High HTTP error rate on {{ $labels.job }}")
.annotation("description", "Job {{ $labels.job }} (route {{ $labels.route }}) has an error rate > 5% over the last 10m.")
}

View File

@@ -1,2 +1 @@
pub mod alerts;
pub mod prometheus_alert_rule;

View File

@@ -1,13 +1,79 @@
use std::collections::HashMap;
use std::collections::{BTreeMap, HashMap};
use async_trait::async_trait;
use serde::Serialize;
use crate::{
interpret::InterpretError,
modules::monitoring::{kube_prometheus::KubePrometheus, okd::OpenshiftClusterAlertSender},
topology::monitoring::AlertRule,
interpret::{InterpretError, Outcome},
modules::monitoring::{
kube_prometheus::{
prometheus::{KubePrometheus, KubePrometheusRule},
types::{AlertGroup, AlertManagerAdditionalPromRules},
},
prometheus::prometheus::{Prometheus, PrometheusRule},
},
topology::oberservability::monitoring::AlertRule,
};
#[async_trait]
impl AlertRule<KubePrometheus> for AlertManagerRuleGroup {
async fn install(&self, sender: &KubePrometheus) -> Result<Outcome, InterpretError> {
sender.install_rule(self).await
}
fn clone_box(&self) -> Box<dyn AlertRule<KubePrometheus>> {
Box::new(self.clone())
}
}
#[async_trait]
impl AlertRule<Prometheus> for AlertManagerRuleGroup {
async fn install(&self, sender: &Prometheus) -> Result<Outcome, InterpretError> {
sender.install_rule(self).await
}
fn clone_box(&self) -> Box<dyn AlertRule<Prometheus>> {
Box::new(self.clone())
}
}
#[async_trait]
impl PrometheusRule for AlertManagerRuleGroup {
fn name(&self) -> String {
self.name.clone()
}
async fn configure_rule(&self) -> AlertManagerAdditionalPromRules {
let mut additional_prom_rules = BTreeMap::new();
additional_prom_rules.insert(
self.name.clone(),
AlertGroup {
groups: vec![self.clone()],
},
);
AlertManagerAdditionalPromRules {
rules: additional_prom_rules,
}
}
}
#[async_trait]
impl KubePrometheusRule for AlertManagerRuleGroup {
fn name(&self) -> String {
self.name.clone()
}
async fn configure_rule(&self) -> AlertManagerAdditionalPromRules {
let mut additional_prom_rules = BTreeMap::new();
additional_prom_rules.insert(
self.name.clone(),
AlertGroup {
groups: vec![self.clone()],
},
);
AlertManagerAdditionalPromRules {
rules: additional_prom_rules,
}
}
}
impl AlertManagerRuleGroup {
pub fn new(name: &str, rules: Vec<PrometheusAlertRule>) -> AlertManagerRuleGroup {
AlertManagerRuleGroup {
@@ -63,55 +129,3 @@ impl PrometheusAlertRule {
self
}
}
impl AlertRule<OpenshiftClusterAlertSender> for AlertManagerRuleGroup {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError> {
let name = self.name.clone();
let mut rules: Vec<crate::modules::monitoring::okd::crd::alerting_rules::Rule> = vec![];
for rule in self.rules.clone() {
rules.push(rule.into())
}
let rule_groups =
vec![crate::modules::monitoring::okd::crd::alerting_rules::RuleGroup { name, rules }];
Ok(serde_json::to_value(rule_groups).map_err(|e| InterpretError::new(e.to_string()))?)
}
fn name(&self) -> String {
self.name.clone()
}
fn clone_box(&self) -> Box<dyn AlertRule<OpenshiftClusterAlertSender>> {
Box::new(self.clone())
}
}
impl AlertRule<KubePrometheus> for AlertManagerRuleGroup {
fn build_rule(&self) -> Result<serde_json::Value, InterpretError> {
let name = self.name.clone();
let mut rules: Vec<
crate::modules::monitoring::kube_prometheus::crd::crd_prometheus_rules::Rule,
> = vec![];
for rule in self.rules.clone() {
rules.push(rule.into())
}
let rule_groups = vec![
crate::modules::monitoring::kube_prometheus::crd::crd_prometheus_rules::RuleGroup {
name,
rules,
},
];
Ok(serde_json::to_value(rule_groups).map_err(|e| InterpretError::new(e.to_string()))?)
}
fn name(&self) -> String {
self.name.clone()
}
fn clone_box(&self) -> Box<dyn AlertRule<KubePrometheus>> {
Box::new(self.clone())
}
}

@@ -5,26 +5,32 @@ use serde::Serialize;
use crate::{
interpret::Interpret,
modules::{application::Application, monitoring::prometheus::Prometheus},
modules::{
application::Application,
monitoring::{
grafana::grafana::Grafana, kube_prometheus::crd::crd_alertmanager_config::CRDPrometheus,
},
prometheus::prometheus::PrometheusMonitoring,
},
score::Score,
topology::{
K8sclient, Topology,
monitoring::{AlertReceiver, AlertingInterpret, Observability, ScrapeTarget},
oberservability::monitoring::{AlertReceiver, AlertingInterpret, ScrapeTarget},
},
};
#[derive(Debug, Clone, Serialize)]
pub struct ApplicationMonitoringScore {
pub sender: Prometheus,
pub sender: CRDPrometheus,
pub application: Arc<dyn Application>,
pub receivers: Vec<Box<dyn AlertReceiver<Prometheus>>>,
pub receivers: Vec<Box<dyn AlertReceiver<CRDPrometheus>>>,
}
impl<T: Topology + Observability<Prometheus> + K8sclient> Score<T> for ApplicationMonitoringScore {
impl<T: Topology + PrometheusMonitoring<CRDPrometheus> + K8sclient + Grafana> Score<T>
for ApplicationMonitoringScore
{
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
debug!("creating alerting interpret");
//TODO will need to use k8sclient to apply service monitors or find a way to pass
//them to the AlertingInterpret potentially via Sender Prometheus
Box::new(AlertingInterpret {
sender: self.sender.clone(),
receivers: self.receivers.clone(),

@@ -9,27 +9,28 @@ use crate::{
inventory::Inventory,
modules::{
application::Application,
monitoring::red_hat_cluster_observability::RedHatClusterObservability,
monitoring::kube_prometheus::crd::{
crd_alertmanager_config::CRDPrometheus, rhob_alertmanager_config::RHOBObservability,
},
prometheus::prometheus::PrometheusMonitoring,
},
score::Score,
topology::{
Topology,
monitoring::{AlertReceiver, AlertingInterpret, Observability},
},
topology::{PreparationOutcome, Topology, oberservability::monitoring::AlertReceiver},
};
use harmony_types::id::Id;
#[derive(Debug, Clone, Serialize)]
pub struct ApplicationRedHatClusterMonitoringScore {
pub sender: RedHatClusterObservability,
pub struct ApplicationRHOBMonitoringScore {
pub sender: RHOBObservability,
pub application: Arc<dyn Application>,
pub receivers: Vec<Box<dyn AlertReceiver<RedHatClusterObservability>>>,
pub receivers: Vec<Box<dyn AlertReceiver<RHOBObservability>>>,
}
impl<T: Topology + Observability<RedHatClusterObservability>> Score<T>
for ApplicationRedHatClusterMonitoringScore
impl<T: Topology + PrometheusMonitoring<RHOBObservability>> Score<T>
for ApplicationRHOBMonitoringScore
{
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(ApplicationRedHatClusterMonitoringInterpret {
Box::new(ApplicationRHOBMonitoringInterpret {
score: self.clone(),
})
}
@@ -43,28 +44,38 @@ impl<T: Topology + Observability<RedHatClusterObservability>> Score<T>
}
#[derive(Debug)]
pub struct ApplicationRedHatClusterMonitoringInterpret {
score: ApplicationRedHatClusterMonitoringScore,
pub struct ApplicationRHOBMonitoringInterpret {
score: ApplicationRHOBMonitoringScore,
}
#[async_trait]
impl<T: Topology + Observability<RedHatClusterObservability>> Interpret<T>
for ApplicationRedHatClusterMonitoringInterpret
impl<T: Topology + PrometheusMonitoring<RHOBObservability>> Interpret<T>
for ApplicationRHOBMonitoringInterpret
{
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
//TODO will need to use k8sclient to apply crd ServiceMonitor or find a way to pass
//them to the AlertingInterpret potentially via Sender RedHatClusterObservability
let alerting_interpret = AlertingInterpret {
sender: self.score.sender.clone(),
receivers: self.score.receivers.clone(),
rules: vec![],
scrape_targets: None,
};
alerting_interpret.execute(inventory, topology).await
let result = topology
.install_prometheus(
&self.score.sender,
inventory,
Some(self.score.receivers.clone()),
)
.await;
match result {
Ok(outcome) => match outcome {
PreparationOutcome::Success { details: _ } => {
Ok(Outcome::success("Prometheus installed".into()))
}
PreparationOutcome::Noop => {
Ok(Outcome::noop("Prometheus installation skipped".into()))
}
},
Err(err) => Err(InterpretError::from(err)),
}
}
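The `execute` body above folds the topology's preparation result into the interpreter's outcome. A stdlib-only sketch of that mapping, with stand-in enums in place of the crate's `PreparationOutcome`, `Outcome`, and `InterpretError`:

```rust
// Hedged sketch of the outcome mapping in `execute` above; all types here
// are illustrative stand-ins for the crate's own.
#[derive(Debug, PartialEq)]
enum PreparationOutcome {
    Success { details: String },
    Noop,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Success(String),
    Noop(String),
}

#[derive(Debug)]
struct InterpretError(String);

// Success and Noop map to distinct Outcome variants; errors pass through.
fn map_outcome(result: Result<PreparationOutcome, String>) -> Result<Outcome, InterpretError> {
    match result {
        Ok(PreparationOutcome::Success { .. }) => Ok(Outcome::Success("Prometheus installed".into())),
        Ok(PreparationOutcome::Noop) => Ok(Outcome::Noop("Prometheus installation skipped".into())),
        Err(e) => Err(InterpretError(e)),
    }
}

fn main() {
    let noop = map_outcome(Ok(PreparationOutcome::Noop)).unwrap();
    assert_eq!(noop, Outcome::Noop("Prometheus installation skipped".into()));
    assert!(map_outcome(Err("boom".into())).is_err());
}
```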
fn get_name(&self) -> InterpretName {

@@ -0,0 +1,6 @@
apiVersion: v1
kind: Namespace
metadata:
name: observability
labels:
openshift.io/cluster-monitoring: "true"

@@ -0,0 +1,43 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: cluster-grafana-sa
namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: grafana-prometheus-api-access
rules:
- apiGroups:
- monitoring.coreos.com
resources:
- prometheuses/api
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: grafana-prometheus-api-access-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: grafana-prometheus-api-access
subjects:
- kind: ServiceAccount
name: cluster-grafana-sa
namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: grafana-cluster-monitoring-view
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-monitoring-view
subjects:
- kind: ServiceAccount
name: cluster-grafana-sa
namespace: observability

@@ -0,0 +1,43 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
name: cluster-grafana
namespace: observability
labels:
dashboards: "grafana"
spec:
serviceAccountName: cluster-grafana-sa
automountServiceAccountToken: true
config:
log:
mode: console
security:
admin_user: admin
admin_password: paul
users:
viewers_can_edit: "false"
auth:
disable_login_form: "false"
auth.anonymous:
enabled: "true"
org_role: Viewer
deployment:
spec:
replicas: 1
template:
spec:
containers:
- name: grafana
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1
memory: 2Gi

@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
metadata:
name: grafana-prometheus-token
namespace: observability
annotations:
kubernetes.io/service-account.name: cluster-grafana-sa
type: kubernetes.io/service-account-token

@@ -0,0 +1,27 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
name: prometheus-cluster
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
valuesFrom:
- targetPath: "secureJsonData.httpHeaderValue1"
valueFrom:
secretKeyRef:
name: grafana-prometheus-token
key: token
datasource:
name: Prometheus-Cluster
type: prometheus
access: proxy
url: https://prometheus-k8s.openshift-monitoring.svc:9091
isDefault: true
jsonData:
httpHeaderName1: "Authorization"
tlsSkipVerify: true
timeInterval: "30s"
secureJsonData:
httpHeaderValue1: "Bearer ${token}"

@@ -0,0 +1,14 @@
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: grafana
namespace: observability
spec:
to:
kind: Service
name: cluster-grafana-service
port:
targetPort: 3000
tls:
termination: edge
insecureEdgeTerminationPolicy: Redirect

@@ -0,0 +1,97 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: cluster-overview
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Cluster Overview",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"type": "stat",
"title": "Ready Nodes",
"datasource": {
"type": "prometheus",
"uid": "Prometheus-Cluster"
},
"targets": [
{
"expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\"})",
"refId": "A"
}
],
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 }
},
{
"type": "stat",
"title": "Running Pods",
"datasource": {
"type": "prometheus",
"uid": "Prometheus-Cluster"
},
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Running\"})",
"refId": "A"
}
],
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 }
},
{
"type": "timeseries",
"title": "Cluster CPU Usage (%)",
"datasource": {
"type": "prometheus",
"uid": "Prometheus-Cluster"
},
"targets": [
{
"expr": "100 * (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 }
},
{
"type": "timeseries",
"title": "Cluster Memory Usage (%)",
"datasource": {
"type": "prometheus",
"uid": "Prometheus-Cluster"
},
"targets": [
{
"expr": "100 * (1 - (sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 }
}
]
}

@@ -0,0 +1,769 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-cluster-overview
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Cluster Overview",
"uid": "okd-cluster-overview",
"schemaVersion": 36,
"version": 2,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "cluster", "overview"],
"panels": [
{
"id": 1,
"type": "stat",
"title": "Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 1)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2,
"type": "stat",
"title": "Not Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"false\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3,
"type": "stat",
"title": "Running Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Running\"} == 1)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4,
"type": "stat",
"title": "Pending Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Pending\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5,
"type": "stat",
"title": "Failed Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Failed\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6,
"type": "stat",
"title": "CrashLoopBackOff",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7,
"type": "stat",
"title": "Critical Alerts",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\",severity=\"critical\"}) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8,
"type": "stat",
"title": "Warning Alerts",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\",severity=\"warning\"}) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 10 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9,
"type": "gauge",
"title": "CPU Usage",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "CPU"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"showThresholdLabels": false,
"showThresholdMarkers": true,
"orientation": "auto"
},
"gridPos": { "h": 6, "w": 5, "x": 0, "y": 4 }
},
{
"id": 10,
"type": "gauge",
"title": "Memory Usage",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)))",
"refId": "A",
"legendFormat": "Memory"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 75 },
{ "color": "red", "value": 90 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"showThresholdLabels": false,
"showThresholdMarkers": true,
"orientation": "auto"
},
"gridPos": { "h": 6, "w": 5, "x": 5, "y": 4 }
},
{
"id": 11,
"type": "gauge",
"title": "Root Disk Usage",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (sum(node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"})))",
"refId": "A",
"legendFormat": "Disk"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"showThresholdLabels": false,
"showThresholdMarkers": true,
"orientation": "auto"
},
"gridPos": { "h": 6, "w": 4, "x": 10, "y": 4 }
},
{
"id": 12,
"type": "stat",
"title": "etcd Has Leader",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "min(etcd_server_has_leader)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "NO LEADER", "color": "red" },
"1": { "text": "LEADER OK", "color": "green" }
}
}
],
"unit": "short",
"noValue": "?"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 14, "y": 4 }
},
{
"id": 13,
"type": "stat",
"title": "API Servers Up",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(up{job=\"apiserver\"})",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 2 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 19, "y": 4 }
},
{
"id": 14,
"type": "stat",
"title": "etcd Members Up",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(up{job=\"etcd\"})",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 2 },
{ "color": "green", "value": 3 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 14, "y": 7 }
},
{
"id": 15,
"type": "stat",
"title": "Operators Degraded",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(cluster_operator_conditions{condition=\"Degraded\",status=\"True\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 19, "y": 7 }
},
{
"id": 16,
"type": "timeseries",
"title": "CPU Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"spanNulls": false,
"showPoints": "never"
}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 10 }
},
{
"id": 17,
"type": "timeseries",
"title": "Memory Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"spanNulls": false,
"showPoints": "never"
}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 10 }
},
{
"id": 18,
"type": "timeseries",
"title": "Network Traffic — Cluster Total",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br-int|br-ex\"}[5m]))",
"refId": "A",
"legendFormat": "Receive"
},
{
"expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br-int|br-ex\"}[5m]))",
"refId": "B",
"legendFormat": "Transmit"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"spanNulls": false,
"showPoints": "never"
}
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Receive" },
"properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Transmit" },
"properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 18 }
},
{
"id": 19,
"type": "timeseries",
"title": "Pod Phases Over Time",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Running\"} == 1)",
"refId": "A",
"legendFormat": "Running"
},
{
"expr": "count(kube_pod_status_phase{phase=\"Pending\"} == 1) or vector(0)",
"refId": "B",
"legendFormat": "Pending"
},
{
"expr": "count(kube_pod_status_phase{phase=\"Failed\"} == 1) or vector(0)",
"refId": "C",
"legendFormat": "Failed"
},
{
"expr": "count(kube_pod_status_phase{phase=\"Unknown\"} == 1) or vector(0)",
"refId": "D",
"legendFormat": "Unknown"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"lineWidth": 2,
"fillOpacity": 15,
"spanNulls": false,
"showPoints": "never"
}
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Running" },
"properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Pending" },
"properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Failed" },
"properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Unknown" },
"properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["lastNotNull"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 18 }
}
]
}

@@ -0,0 +1,637 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-node-health
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Node Health",
"uid": "okd-node-health",
"schemaVersion": 36,
"version": 2,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "node", "health"],
"templating": {
"list": [
{
"name": "node",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(kube_node_info, node)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Node",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1,
"type": "stat",
"title": "Total Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_info{node=~\"$node\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2,
"type": "stat",
"title": "Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"$node\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3,
"type": "stat",
"title": "Not Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"false\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4,
"type": "stat",
"title": "Memory Pressure",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5,
"type": "stat",
"title": "Disk Pressure",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"DiskPressure\",status=\"true\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6,
"type": "stat",
"title": "PID Pressure",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"PIDPressure\",status=\"true\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7,
"type": "stat",
"title": "Unschedulable",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_spec_unschedulable{node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8,
"type": "stat",
"title": "Kubelet Up",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=\"kubelet\",metrics_path=\"/metrics\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9,
"type": "table",
"title": "Node Conditions",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"$node\"})",
"refId": "A",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\",node=~\"$node\"})",
"refId": "B",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"DiskPressure\",status=\"true\",node=~\"$node\"})",
"refId": "C",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"PIDPressure\",status=\"true\",node=~\"$node\"})",
"refId": "D",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_spec_unschedulable{node=~\"$node\"})",
"refId": "E",
"legendFormat": "{{node}}",
"instant": true
}
],
"transformations": [
{
"id": "labelsToFields",
"options": { "mode": "columns" }
},
{
"id": "joinByField",
"options": { "byField": "node", "mode": "outer" }
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"Time 2": true,
"Time 3": true,
"Time 4": true,
"Time 5": true
},
"renameByName": {
"node": "Node",
"Value #A": "Ready",
"Value #B": "Mem Pressure",
"Value #C": "Disk Pressure",
"Value #D": "PID Pressure",
"Value #E": "Unschedulable"
},
"indexByName": {
"node": 0,
"Value #A": 1,
"Value #B": 2,
"Value #C": 3,
"Value #D": 4,
"Value #E": 5
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "displayMode": "color-background", "align": "center" }
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Node" },
"properties": [
{ "id": "custom.displayMode", "value": "auto" },
{ "id": "custom.align", "value": "left" },
{ "id": "custom.width", "value": 200 }
]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] }
},
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "mappings",
"value": [
{
"type": "value",
"options": {
"0": { "text": "✗ Not Ready", "color": "red", "index": 0 },
"1": { "text": "✓ Ready", "color": "green", "index": 1 }
}
}
]
}
]
},
{
"matcher": { "id": "byRegexp", "options": ".*Pressure" },
"properties": [
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] }
},
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "mappings",
"value": [
{
"type": "value",
"options": {
"0": { "text": "✓ OK", "color": "green", "index": 0 },
"1": { "text": "⚠ Active", "color": "red", "index": 1 }
}
}
]
}
]
},
{
"matcher": { "id": "byName", "options": "Unschedulable" },
"properties": [
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 1 }] }
},
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "mappings",
"value": [
{
"type": "value",
"options": {
"0": { "text": "✓ Schedulable", "color": "green", "index": 0 },
"1": { "text": "⚠ Cordoned", "color": "yellow", "index": 1 }
}
}
]
}
]
}
]
},
"options": { "sortBy": [{ "displayName": "Node", "desc": false }] },
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10,
"type": "timeseries",
"title": "CPU Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 12 }
},
{
"id": 11,
"type": "bargauge",
"title": "CPU Usage \u2014 Current",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 12 }
},
{
"id": 12,
"type": "timeseries",
"title": "Memory Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 75 }, { "color": "red", "value": 90 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 20 }
},
{
"id": 13,
"type": "bargauge",
"title": "Memory Usage \u2014 Current",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 75 }, { "color": "red", "value": 90 }] }
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 20 }
},
{
"id": 14,
"type": "timeseries",
"title": "Root Disk Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"}))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 28 }
},
{
"id": 15,
"type": "bargauge",
"title": "Root Disk Usage \u2014 Current",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"}))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 28 }
},
{
"id": 16,
"type": "timeseries",
"title": "Network Traffic per Node",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(instance) (rate(node_network_receive_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br.*\"}[5m]))",
"refId": "A",
"legendFormat": "rx {{instance}}"
},
{
"expr": "sum by(instance) (rate(node_network_transmit_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br.*\"}[5m]))",
"refId": "B",
"legendFormat": "tx {{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 36 }
},
{
"id": 17,
"type": "bargauge",
"title": "Pods per Node",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count by(node) (kube_pod_info{node=~\"$node\"})",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"min": 0,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 100 },
{ "color": "red", "value": 200 }
]
}
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 36 }
},
{
"id": 18,
"type": "timeseries",
"title": "System Load Average (1m) per Node",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "node_load1",
"refId": "A",
"legendFormat": "1m \u2014 {{instance}}"
},
{
"expr": "node_load5",
"refId": "B",
"legendFormat": "5m \u2014 {{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 44 }
},
{
"id": 19,
"type": "bargauge",
"title": "Node Uptime",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "time() - node_boot_time_seconds",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"min": 0,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 300 },
{ "color": "green", "value": 3600 }
]
}
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": false,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 44 }
}
]
}
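Dashboards like the two in this change share a few structural invariants (unique panel `id`s, a `gridPos` per panel, Grafana's 24-column grid). A minimal offline sanity check is sketched below; the `check_dashboard` helper and the inline sample are illustrative only, not part of this repository:

```python
import json

def check_dashboard(dashboard: dict) -> list:
    """Return a list of problems found in a Grafana dashboard dict."""
    problems = []
    panels = dashboard.get("panels", [])
    # Panel ids should be unique; duplicates get silently renumbered by Grafana.
    ids = [p["id"] for p in panels if "id" in p]
    if len(ids) != len(set(ids)):
        problems.append("duplicate panel ids")
    for p in panels:
        pos = p.get("gridPos")
        if pos is None:
            problems.append("panel %s missing gridPos" % p.get("id"))
            continue
        # Grafana lays panels out on a 24-column grid.
        if pos["x"] + pos["w"] > 24:
            problems.append("panel %s overflows the 24-column grid" % p.get("id"))
    return problems

# Minimal inline sample mirroring the stat panels above.
sample = json.loads("""
{
  "title": "Workload Health",
  "panels": [
    { "id": 1, "type": "stat", "gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 } },
    { "id": 2, "type": "stat", "gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 } }
  ]
}
""")
print(check_dashboard(sample))  # -> []
```

Running this against the extracted `json: |` block of each CRD before committing catches layout overflows and duplicated ids without a round trip through the cluster.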


@@ -0,0 +1,783 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-workload-health
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Workload Health",
"uid": "okd-workload-health",
"schemaVersion": 36,
"version": 3,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "workload", "health"],
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(kube_pod_info, namespace)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Namespace",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Total Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_info{namespace=~\"$namespace\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Running Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_status_phase{phase=\"Running\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Pending Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_status_phase{phase=\"Pending\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "Failed Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_status_phase{phase=\"Failed\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "CrashLoopBackOff",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "OOMKilled",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "Deployments Available",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_deployment_status_condition{condition=\"Available\",status=\"true\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "Deployments Degraded",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_deployment_status_replicas_unavailable{namespace=~\"$namespace\"} > 0) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Deployments", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10,
"type": "table",
"title": "Deployment Status",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,deployment)(kube_deployment_spec_replicas{namespace=~\"$namespace\"})",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_ready{namespace=~\"$namespace\"})",
"refId": "B",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_available{namespace=~\"$namespace\"})",
"refId": "C",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_unavailable{namespace=~\"$namespace\"})",
"refId": "D",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_updated{namespace=~\"$namespace\"})",
"refId": "E",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["namespace", "deployment", "Value"]
}
}
},
{
"id": "joinByField",
"options": {
"byField": "deployment",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true,
"namespace 4": true
},
"renameByName": {
"namespace": "Namespace",
"deployment": "Deployment",
"Value": "Desired",
"Value 1": "Ready",
"Value 2": "Available",
"Value 3": "Unavailable",
"Value 4": "Up-to-date"
},
"indexByName": {
"namespace": 0,
"deployment": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5,
"Value 4": 6
}
}
},
{
"id": "sortBy",
"options": {
"fields": [{ "displayName": "Namespace", "desc": false }]
}
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 200 }]
},
{
"matcher": { "id": "byName", "options": "Deployment" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 220 }]
},
{
"matcher": { "id": "byName", "options": "Unavailable" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] }
}
]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] }
}
]
}
]
},
"options": { "sortBy": [{ "displayName": "Namespace", "desc": false }] },
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 5 }
},
{
"id": 11, "type": "row", "title": "StatefulSets & DaemonSets", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 12,
"type": "table",
"title": "StatefulSet Status",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_replicas{namespace=~\"$namespace\"})",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_status_replicas_ready{namespace=~\"$namespace\"})",
"refId": "B",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_status_replicas_current{namespace=~\"$namespace\"})",
"refId": "C",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_status_replicas_updated{namespace=~\"$namespace\"})",
"refId": "D",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["namespace", "statefulset", "Value"]
}
}
},
{
"id": "joinByField",
"options": {
"byField": "statefulset",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true
},
"renameByName": {
"namespace": "Namespace",
"statefulset": "StatefulSet",
"Value": "Desired",
"Value 1": "Ready",
"Value 2": "Current",
"Value 3": "Up-to-date"
},
"indexByName": {
"namespace": 0,
"statefulset": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Namespace", "desc": false }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "StatefulSet" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 200 }]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 14 }
},
{
"id": 13,
"type": "table",
"title": "DaemonSet Status",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_desired_number_scheduled{namespace=~\"$namespace\"})",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_number_ready{namespace=~\"$namespace\"})",
"refId": "B",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_number_unavailable{namespace=~\"$namespace\"})",
"refId": "C",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_number_misscheduled{namespace=~\"$namespace\"})",
"refId": "D",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["namespace", "daemonset", "Value"]
}
}
},
{
"id": "joinByField",
"options": {
"byField": "daemonset",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true
},
"renameByName": {
"namespace": "Namespace",
"daemonset": "DaemonSet",
"Value": "Desired",
"Value 1": "Ready",
"Value 2": "Unavailable",
"Value 3": "Misscheduled"
},
"indexByName": {
"namespace": 0,
"daemonset": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Namespace", "desc": false }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "DaemonSet" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 200 }]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] } }
]
},
{
"matcher": { "id": "byName", "options": "Unavailable" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] } }
]
},
{
"matcher": { "id": "byName", "options": "Misscheduled" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 14 }
},
{
"id": 14, "type": "row", "title": "Pods", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 22 }
},
{
"id": 15,
"type": "timeseries",
"title": "Pod Phase over Time",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(phase)(kube_pod_status_phase{namespace=~\"$namespace\"})",
"refId": "A", "legendFormat": "{{phase}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "Running" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Pending" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Failed" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Succeeded" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Unknown" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 23 }
},
{
"id": 16,
"type": "piechart",
"title": "Pod Phase — Now",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(phase)(kube_pod_status_phase{namespace=~\"$namespace\"})",
"refId": "A", "instant": true, "legendFormat": "{{phase}}"
}
],
"fieldConfig": {
"defaults": { "unit": "short", "color": { "mode": "palette-classic" } },
"overrides": [
{ "matcher": { "id": "byName", "options": "Running" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Pending" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Failed" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Succeeded" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Unknown" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"pieType": "donut",
"tooltip": { "mode": "single" },
"legend": { "displayMode": "table", "placement": "right", "values": ["value", "percent"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 23 }
},
{
"id": 17,
"type": "timeseries",
"title": "Container Restarts over Time (total counter, top 10)",
"description": "Absolute restart counter — each vertical step = a restart event. Flat line = healthy.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "topk(10,\n sum by(namespace, pod) (\n kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}\n ) > 0\n)",
"refId": "A",
"legendFormat": "{{namespace}} / {{pod}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 31 }
},
{
"id": 18,
"type": "table",
"title": "Container Total Restarts (non-zero)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace, pod, container) (kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}) > 0",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": { "names": ["namespace", "pod", "container", "Value"] }
}
},
{
"id": "organize",
"options": {
"excludeByName": {},
"renameByName": {
"namespace": "Namespace",
"pod": "Pod",
"container": "Container",
"Value": "Total Restarts"
},
"indexByName": { "namespace": 0, "pod": 1, "container": 2, "Value": 3 }
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Total Restarts", "desc": true }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{ "matcher": { "id": "byName", "options": "Namespace" }, "properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 160 }] },
{ "matcher": { "id": "byName", "options": "Pod" }, "properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 260 }] },
{ "matcher": { "id": "byName", "options": "Container" }, "properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 160 }] },
{
"matcher": { "id": "byName", "options": "Total Restarts" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "yellow", "value": null }, { "color": "orange", "value": 5 }, { "color": "red", "value": 20 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 31 }
},
{
"id": 19, "type": "row", "title": "Resource Usage", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 39 }
},
{
"id": 20,
"type": "timeseries",
"title": "CPU Usage by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "cores", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 40 }
},
{
"id": 21,
"type": "timeseries",
"title": "Memory Usage by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(container_memory_working_set_bytes{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"})",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 40 }
},
{
"id": 22,
"type": "bargauge",
"title": "CPU — Actual vs Requested (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"}[5m]))\n/\nsum by(namespace)(kube_pod_container_resource_requests{resource=\"cpu\",namespace=~\"$namespace\",container!=\"\"})\n* 100",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 150,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 80 }, { "color": "red", "value": 100 }] }
}
},
"options": {
"orientation": "horizontal", "displayMode": "gradient", "showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 48 }
},
{
"id": 23,
"type": "bargauge",
"title": "Memory — Actual vs Requested (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(container_memory_working_set_bytes{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"})\n/\nsum by(namespace)(kube_pod_container_resource_requests{resource=\"memory\",namespace=~\"$namespace\",container!=\"\"})\n* 100",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 150,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 80 }, { "color": "red", "value": 100 }] }
}
},
"options": {
"orientation": "horizontal", "displayMode": "gradient", "showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 48 }
}
]
}
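
Since each GrafanaDashboard CRD embeds its dashboard as a single JSON string, a small pre-apply check catches the usual authoring mistakes (malformed JSON, duplicate panel ids, panels overflowing Grafana's 24-column grid) before the operator ever sees them. A minimal sketch in Python — the field names mirror the dashboards above, but the `validate_dashboard` helper itself is hypothetical, not part of this repo:

```python
import json

def validate_dashboard(raw: str) -> list[str]:
    """Return a list of problems found in a Grafana dashboard JSON string.

    A minimal sketch: checks the handful of top-level fields these
    dashboards rely on (uid, schemaVersion, panels) plus two easy-to-break
    invariants -- duplicate panel ids and panels wider than the 24-column grid.
    """
    problems = []
    dash = json.loads(raw)  # raises ValueError on malformed JSON
    for key in ("uid", "schemaVersion", "panels"):
        if key not in dash:
            problems.append(f"missing top-level key: {key}")
    seen_ids = set()
    for panel in dash.get("panels", []):
        pid = panel.get("id")
        if pid in seen_ids:
            problems.append(f"duplicate panel id: {pid}")
        seen_ids.add(pid)
        pos = panel.get("gridPos", {})
        if pos.get("x", 0) + pos.get("w", 0) > 24:
            problems.append(f"panel {pid} overflows the 24-column grid")
    return problems

# Example: a two-panel fragment in the same shape as the dashboards above.
sample = json.dumps({
    "uid": "okd-networking",
    "schemaVersion": 36,
    "panels": [
        {"id": 1, "gridPos": {"h": 4, "w": 3, "x": 0, "y": 0}},
        {"id": 2, "gridPos": {"h": 4, "w": 3, "x": 3, "y": 0}},
    ],
})
print(validate_dashboard(sample))  # → []
```

Run against the `json: |` block of each CRD (e.g. in the check script CI step), this fails fast instead of leaving a silently broken dashboard in Grafana.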


@@ -0,0 +1,955 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-networking
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Networking",
"uid": "okd-networking",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "networking"],
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(kube_pod_info, namespace)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Namespace",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Network RX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "Bps", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Network TX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "Bps", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "RX Errors/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_receive_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "TX Errors/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_transmit_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "RX Drops/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_receive_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "TX Drops/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_transmit_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "DNS Queries/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(coredns_dns_requests_total[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "reqps", "noValue": "0", "decimals": 1
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "DNS Error %",
"description": "Percentage of DNS responses with non-NOERROR rcode over the last 5 minutes.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(coredns_dns_responses_total{rcode!=\"NOERROR\"}[5m])) / sum(rate(coredns_dns_responses_total[5m])) * 100",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]},
"unit": "percent", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Network I/O", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Receive Rate by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Transmit Rate by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 5 }
},
{
"id": 12, "type": "row", "title": "Top Pod Consumers", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 13, "type": "timeseries", "title": "Top 10 Pods — RX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(namespace,pod)(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m])))",
"refId": "A", "legendFormat": "{{namespace}} / {{pod}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 14 }
},
{
"id": 14, "type": "timeseries", "title": "Top 10 Pods — TX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(namespace,pod)(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m])))",
"refId": "A", "legendFormat": "{{namespace}} / {{pod}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 14 }
},
{
"id": 15,
"type": "table",
"title": "Pod Network I/O Summary",
"description": "Current RX/TX rates, errors and drops per pod. Sorted by RX rate descending.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,pod)(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "B", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_receive_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "C", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_transmit_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "D", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_receive_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "E", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_transmit_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "F", "instant": true, "format": "table", "legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": { "include": { "names": ["namespace", "pod", "Value"] } }
},
{
"id": "joinByField",
"options": { "byField": "pod", "mode": "outer" }
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true,
"namespace 4": true,
"namespace 5": true
},
"renameByName": {
"namespace": "Namespace",
"pod": "Pod",
"Value": "RX Rate",
"Value 1": "TX Rate",
"Value 2": "RX Errors/s",
"Value 3": "TX Errors/s",
"Value 4": "RX Drops/s",
"Value 5": "TX Drops/s"
},
"indexByName": {
"namespace": 0,
"pod": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5,
"Value 4": 6,
"Value 5": 7
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "RX Rate", "desc": true }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 160 }]
},
{
"matcher": { "id": "byName", "options": "Pod" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 260 }]
},
{
"matcher": { "id": "byRegexp", "options": "^RX Rate$|^TX Rate$" },
"properties": [
{ "id": "unit", "value": "Bps" },
{ "id": "custom.displayMode", "value": "color-background-solid" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10000000 },
{ "color": "orange", "value": 100000000 },
{ "color": "red", "value": 500000000 }
]}}
]
},
{
"matcher": { "id": "byRegexp", "options": "^RX Errors/s$|^TX Errors/s$" },
"properties": [
{ "id": "unit", "value": "pps" },
{ "id": "decimals", "value": 3 },
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 0.001 }
]}}
]
},
{
"matcher": { "id": "byRegexp", "options": "^RX Drops/s$|^TX Drops/s$" },
"properties": [
{ "id": "unit", "value": "pps" },
{ "id": "decimals", "value": 3 },
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 0.001 }
]}}
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 22 }
},
{
"id": 16, "type": "row", "title": "Errors & Packet Loss", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 30 }
},
{
"id": 17, "type": "timeseries", "title": "RX Errors by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_receive_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 31 }
},
{
"id": 18, "type": "timeseries", "title": "TX Errors by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_transmit_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 31 }
},
{
"id": 19, "type": "timeseries", "title": "RX Packet Drops by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_receive_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 38 }
},
{
"id": 20, "type": "timeseries", "title": "TX Packet Drops by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_transmit_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 38 }
},
{
"id": 21, "type": "row", "title": "DNS (CoreDNS)", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 45 }
},
{
"id": 22, "type": "timeseries", "title": "DNS Request Rate by Query Type",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(type)(rate(coredns_dns_requests_total[5m]))",
"refId": "A", "legendFormat": "{{type}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 46 }
},
{
"id": 23, "type": "timeseries", "title": "DNS Response Rate by Rcode",
"description": "NOERROR = healthy. NXDOMAIN = name not found. SERVFAIL = upstream error.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(rcode)(rate(coredns_dns_responses_total[5m]))",
"refId": "A", "legendFormat": "{{rcode}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "NOERROR" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "NXDOMAIN" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "SERVFAIL" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "REFUSED" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 46 }
},
{
"id": 24, "type": "timeseries", "title": "DNS Request Latency (p50 / p95 / p99)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))",
"refId": "B", "legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))",
"refId": "C", "legendFormat": "p99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 46 }
},
{
"id": 25, "type": "timeseries", "title": "DNS Cache Hit Ratio (%)",
"description": "High hit ratio = CoreDNS is serving responses from cache, reducing upstream load.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(coredns_cache_hits_total[5m])) / (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m]))) * 100",
"refId": "A", "legendFormat": "Cache Hit %"
}],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 50 },
{ "color": "green", "value": 80 }
]},
"custom": { "lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "single" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "lastNotNull"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 54 }
},
{
"id": 26, "type": "timeseries", "title": "DNS Forward Request Rate",
"description": "Queries CoreDNS is forwarding upstream. Spike here with cache miss spike = upstream DNS pressure.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(coredns_forward_requests_total[5m]))",
"refId": "A", "legendFormat": "Forward Requests/s"
},
{
"expr": "sum(rate(coredns_forward_responses_duration_seconds_count[5m]))",
"refId": "B", "legendFormat": "Forward Responses/s"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 54 }
},
{
"id": 27, "type": "row", "title": "Services & Endpoints", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 61 }
},
{
"id": 28, "type": "stat", "title": "Total Services",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "count(kube_service_info{namespace=~\"$namespace\"})",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 8, "x": 0, "y": 62 }
},
{
"id": 29, "type": "stat", "title": "Endpoint Addresses Available",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(kube_endpoint_address_available{namespace=~\"$namespace\"})",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 8, "x": 8, "y": 62 }
},
{
"id": 30, "type": "stat", "title": "Endpoint Addresses Not Ready",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(kube_endpoint_address_not_ready{namespace=~\"$namespace\"}) or vector(0)",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 8, "x": 16, "y": 62 }
},
{
"id": 31,
"type": "table",
"title": "Endpoint Availability",
"description": "Per-endpoint available vs not-ready address counts. Red Not Ready = pods backing this service are unhealthy.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,endpoint)(kube_endpoint_address_available{namespace=~\"$namespace\"})",
"refId": "A", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,endpoint)(kube_endpoint_address_not_ready{namespace=~\"$namespace\"})",
"refId": "B", "instant": true, "format": "table", "legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": { "include": { "names": ["namespace", "endpoint", "Value"] } }
},
{
"id": "joinByField",
"options": { "byField": "endpoint", "mode": "outer" }
},
{
"id": "organize",
"options": {
"excludeByName": { "namespace 1": true },
"renameByName": {
"namespace": "Namespace",
"endpoint": "Endpoint",
"Value": "Available",
"Value 1": "Not Ready"
},
"indexByName": {
"namespace": 0,
"endpoint": 1,
"Value": 2,
"Value 1": 3
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Not Ready", "desc": true }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "Endpoint" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 220 }]
},
{
"matcher": { "id": "byName", "options": "Available" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] } }
]
},
{
"matcher": { "id": "byName", "options": "Not Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 66 }
},
{
"id": 32, "type": "row", "title": "OKD Router / Ingress (HAProxy)", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 74 }
},
{
"id": 33, "type": "timeseries", "title": "Router HTTP Request Rate by Code",
"description": "Requires HAProxy router metrics to be scraped (port 1936). OKD exposes these via the openshift-ingress ServiceMonitor.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(code)(rate(haproxy_backend_http_responses_total[5m]))",
"refId": "A", "legendFormat": "HTTP {{code}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "HTTP 2xx" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "HTTP 4xx" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "HTTP 5xx" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 75 }
},
{
"id": 34, "type": "timeseries", "title": "Router 4xx + 5xx Error Rate (%)",
"description": "Client error (4xx) and server error (5xx) rates as a percentage of all requests.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(haproxy_backend_http_responses_total{code=\"4xx\"}[5m])) / sum(rate(haproxy_backend_http_responses_total[5m])) * 100",
"refId": "A", "legendFormat": "4xx %"
},
{
"expr": "sum(rate(haproxy_backend_http_responses_total{code=\"5xx\"}[5m])) / sum(rate(haproxy_backend_http_responses_total[5m])) * 100",
"refId": "B", "legendFormat": "5xx %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]}
},
"overrides": [
{ "matcher": { "id": "byName", "options": "4xx %" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "5xx %" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 75 }
},
{
"id": 35, "type": "timeseries", "title": "Router Bytes In / Out",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(haproxy_frontend_bytes_in_total[5m]))",
"refId": "A", "legendFormat": "Bytes In"
},
{
"expr": "sum(rate(haproxy_frontend_bytes_out_total[5m]))",
"refId": "B", "legendFormat": "Bytes Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "Bytes In" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Bytes Out" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 83 }
},
{
"id": 36,
"type": "table",
"title": "Router Backend Server Status",
"description": "HAProxy backend servers (routes). Value 0 = DOWN, 1 = UP.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "haproxy_server_up",
"refId": "A", "instant": true, "format": "table", "legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": { "include": { "names": ["proxy", "server", "Value"] } }
},
{
"id": "organize",
"options": {
"excludeByName": {},
"renameByName": {
"proxy": "Backend",
"server": "Server",
"Value": "Status"
},
"indexByName": { "proxy": 0, "server": 1, "Value": 2 }
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Status", "desc": false }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Backend" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 260 }]
},
{
"matcher": { "id": "byName", "options": "Server" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "Status" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "mappings", "value": [
{ "type": "value", "options": { "0": { "text": "DOWN", "color": "red" } } },
{ "type": "value", "options": { "1": { "text": "UP", "color": "green" } } }
]},
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]}}
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 83 }
}
]
}

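The "Router 4xx + 5xx Error Rate (%)" panel above divides the per-code response rate by the total response rate and multiplies by 100. A minimal Python sketch of the same arithmetic, using hypothetical per-code request rates (not values from a real cluster):

```python
def error_rate_pct(code_rates: dict[str, float]) -> dict[str, float]:
    """Given per-code request rates (req/s), return each code's share of total
    traffic as a percentage, mirroring the panel's PromQL:
      sum(rate(haproxy_backend_http_responses_total{code="5xx"}[5m]))
        / sum(rate(haproxy_backend_http_responses_total[5m])) * 100
    """
    total = sum(code_rates.values())
    if total == 0:
        # No traffic: report 0% rather than dividing by zero
        return {code: 0.0 for code in code_rates}
    return {code: rate / total * 100 for code, rate in code_rates.items()}

# Hypothetical traffic mix: 95 req/s 2xx, 4 req/s 4xx, 1 req/s 5xx
pcts = error_rate_pct({"2xx": 95.0, "4xx": 4.0, "5xx": 1.0})
```

With this mix, 5xx is 1% of traffic, exactly the yellow threshold configured in the panel; 5% would cross into red.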

@@ -0,0 +1,607 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: storage-health
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Storage Health",
"uid": "storage-health",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"panels": [
{
"type": "row",
"id": 1,
"title": "PVC / PV Status",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }
},
{
"type": "stat",
"id": 2,
"title": "Bound PVCs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Bound\"}) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "green", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 0, "y": 1 }
},
{
"type": "stat",
"id": 3,
"title": "Pending PVCs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Pending\"}) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 4, "y": 1 }
},
{
"type": "stat",
"id": 4,
"title": "Lost PVCs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Lost\"}) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 8, "y": 1 }
},
{
"type": "stat",
"id": 5,
"title": "Bound PVs / Available PVs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolume_status_phase{phase=\"Bound\"}) or vector(0)",
"refId": "A",
"legendFormat": "Bound"
},
{
"expr": "sum(kube_persistentvolume_status_phase{phase=\"Available\"}) or vector(0)",
"refId": "B",
"legendFormat": "Available"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 12, "y": 1 }
},
{
"type": "stat",
"id": 6,
"title": "Ceph Cluster Health",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "ceph_health_status",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 2 }
]
},
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "HEALTH_OK", "index": 0 },
"1": { "text": "HEALTH_WARN", "index": 1 },
"2": { "text": "HEALTH_ERR", "index": 2 }
}
}
]
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "value"
},
"gridPos": { "h": 5, "w": 4, "x": 16, "y": 1 }
},
{
"type": "stat",
"id": 7,
"title": "OSDs Up / Total",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(ceph_osd_up) or vector(0)",
"refId": "A",
"legendFormat": "Up"
},
{
"expr": "count(ceph_osd_metadata) or vector(0)",
"refId": "B",
"legendFormat": "Total"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "green", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 20, "y": 1 }
},
{
"type": "row",
"id": 8,
"title": "Cluster Capacity",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 6 }
},
{
"type": "gauge",
"id": 9,
"title": "Ceph Cluster Used (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (ceph_cluster_total_used_raw_bytes or ceph_cluster_total_used_bytes) / ceph_cluster_total_bytes",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"showThresholdLabels": true,
"showThresholdMarkers": true
},
"gridPos": { "h": 8, "w": 5, "x": 0, "y": 7 }
},
{
"type": "stat",
"id": 10,
"title": "Ceph Capacity — Total / Available",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "ceph_cluster_total_bytes",
"refId": "A",
"legendFormat": "Total"
},
{
"expr": "ceph_cluster_total_bytes - (ceph_cluster_total_used_raw_bytes or ceph_cluster_total_used_bytes)",
"refId": "B",
"legendFormat": "Available"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes",
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "value",
"graphMode": "none",
"textMode": "auto",
"orientation": "vertical"
},
"gridPos": { "h": 8, "w": 4, "x": 5, "y": 7 }
},
{
"type": "bargauge",
"id": 11,
"title": "PV Allocated Capacity by Storage Class (Bound)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by (storageclass) (\n kube_persistentvolume_capacity_bytes\n * on(persistentvolume) group_left(storageclass)\n kube_persistentvolume_status_phase{phase=\"Bound\"}\n)",
"refId": "A",
"legendFormat": "{{storageclass}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes",
"color": { "mode": "palette-classic" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 7, "x": 9, "y": 7 }
},
{
"type": "piechart",
"id": 12,
"title": "PVC Phase Distribution",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Bound\"}) or vector(0)",
"refId": "A",
"legendFormat": "Bound"
},
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Pending\"}) or vector(0)",
"refId": "B",
"legendFormat": "Pending"
},
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Lost\"}) or vector(0)",
"refId": "C",
"legendFormat": "Lost"
}
],
"fieldConfig": {
"defaults": { "color": { "mode": "palette-classic" } }
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"pieType": "pie",
"legend": {
"displayMode": "table",
"placement": "right",
"values": ["value", "percent"]
}
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 7 }
},
{
"type": "row",
"id": 13,
"title": "Ceph Performance",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 15 }
},
{
"type": "timeseries",
"id": 14,
"title": "Ceph Pool IOPS (Read / Write)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "rate(ceph_pool_rd[5m])",
"refId": "A",
"legendFormat": "Read — pool {{pool_id}}"
},
{
"expr": "rate(ceph_pool_wr[5m])",
"refId": "B",
"legendFormat": "Write — pool {{pool_id}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 8 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
},
{
"type": "timeseries",
"id": 15,
"title": "Ceph Pool Throughput (Read / Write)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "rate(ceph_pool_rd_bytes[5m])",
"refId": "A",
"legendFormat": "Read — pool {{pool_id}}"
},
{
"expr": "rate(ceph_pool_wr_bytes[5m])",
"refId": "B",
"legendFormat": "Write — pool {{pool_id}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 8 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
},
{
"type": "row",
"id": 16,
"title": "Ceph OSD & Pool Details",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 24 }
},
{
"type": "timeseries",
"id": 17,
"title": "Ceph Pool Space Used (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail)",
"refId": "A",
"legendFormat": "Pool {{pool_id}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
},
"custom": { "lineWidth": 2, "fillOpacity": 10 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 25 }
},
{
"type": "bargauge",
"id": 18,
"title": "OSD Status per Daemon (green = Up, red = Down)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "ceph_osd_up",
"refId": "A",
"legendFormat": "{{ceph_daemon}}"
}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 1,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "DOWN", "index": 0 },
"1": { "text": "UP", "index": 1 }
}
}
]
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "basic",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 25 }
},
{
"type": "row",
"id": 19,
"title": "Node Disk Usage",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 33 }
},
{
"type": "timeseries",
"id": 20,
"title": "Node Root Disk Usage Over Time (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} * 100)",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
},
"custom": { "lineWidth": 2, "fillOpacity": 10 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 34 }
},
{
"type": "bargauge",
"id": 21,
"title": "Current Disk Usage — All Nodes & Mountpoints",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 - (node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay|squashfs\"} / node_filesystem_size_bytes{fstype!~\"tmpfs|overlay|squashfs\"} * 100)",
"refId": "A",
"legendFormat": "{{instance}} — {{mountpoint}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 34 }
}
]
}

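The Storage Health dashboard uses two different denominators for "used %": the cluster gauge divides used raw bytes by total cluster bytes, while the pool panel must reconstruct the total as used + available, because Ceph pools report used and remaining space, not a total. A small Python sketch of both formulas (byte figures are made up):

```python
def cluster_used_pct(used_raw_bytes: float, total_bytes: float) -> float:
    """Mirrors the 'Ceph Cluster Used (%)' gauge:
      100 * ceph_cluster_total_used_raw_bytes / ceph_cluster_total_bytes
    """
    return 100 * used_raw_bytes / total_bytes

def pool_used_pct(bytes_used: float, max_avail: float) -> float:
    """Mirrors the 'Ceph Pool Space Used (%)' panel. Pools expose used and
    *remaining* bytes, so the denominator is their sum:
      100 * ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail)
    """
    return 100 * bytes_used / (bytes_used + max_avail)

# Hypothetical figures: 850 GB used of a 1 TB cluster -> 85%, the red threshold
cluster_pct = cluster_used_pct(850e9, 1000e9)
# A pool with 70 GB used and 30 GB still available -> 70%, the yellow threshold
pool_pct = pool_used_pct(70e9, 30e9)
```

The distinction matters: feeding `ceph_pool_max_avail` into the cluster-style formula would understate pool usage, since it is not a total-capacity metric.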

@@ -0,0 +1,744 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-etcd
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "etcd",
"uid": "okd-etcd",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "etcd"],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(etcd_server_has_leader, instance)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Instance",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Cluster Members",
"description": "Total number of etcd members currently reporting metrics.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(etcd_server_has_leader)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Has Leader",
"description": "min() across all members. 0 = at least one member currently sees no leader, meaning the cluster is degraded or mid-election.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "min(etcd_server_has_leader)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]},
"unit": "short", "noValue": "0",
"mappings": [
{ "type": "value", "options": {
"0": { "text": "NO LEADER", "color": "red" },
"1": { "text": "OK", "color": "green" }
}}
]
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Leader Changes (1h)",
"description": "Number of leader elections in the last hour. ≥3 indicates cluster instability.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(changes(etcd_server_leader_changes_seen_total[1h]))", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "DB Size (Max)",
"description": "Largest boltdb file size across all members. OKD configures an 8 GiB backend quota (upstream etcd defaults to 2 GiB).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "max(etcd_mvcc_db_total_size_in_bytes)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 2147483648 },
{ "color": "orange", "value": 5368709120 },
{ "color": "red", "value": 7516192768 }
]},
"unit": "bytes", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "DB Fragmentation (Max)",
"description": "% of DB space that is allocated but unused. >50% → run etcdctl defrag.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "max((etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) / etcd_mvcc_db_total_size_in_bytes * 100)",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 25 },
{ "color": "orange", "value": 50 },
{ "color": "red", "value": 75 }
]},
"unit": "percent", "noValue": "0", "decimals": 1
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "Failed Proposals/s",
"description": "Rate of rejected Raft proposals. Any sustained non-zero value = cluster health problem.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(rate(etcd_server_proposals_failed_total[5m]))", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 0.001 }
]},
"unit": "short", "noValue": "0", "decimals": 3
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "WAL Fsync p99",
"description": "99th percentile WAL flush-to-disk time. >10ms is concerning; >100ms = serious I/O bottleneck.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.01 },
{ "color": "orange", "value": 0.1 },
{ "color": "red", "value": 0.5 }
]},
"unit": "s", "noValue": "0", "decimals": 4
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "Backend Commit p99",
"description": "99th percentile boltdb commit time. >25ms = warning; >100ms = critical backend I/O pressure.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.025 },
{ "color": "orange", "value": 0.1 },
{ "color": "red", "value": 0.25 }
]},
"unit": "s", "noValue": "0", "decimals": 4
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Cluster Health", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Has Leader per Instance",
"description": "1 = member has a leader; 0 = member lost quorum. A dip to 0 marks the exact moment of a leader election.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_server_has_leader{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "max": 1.1,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false },
"mappings": [
{ "type": "value", "options": {
"0": { "text": "0 — no leader" },
"1": { "text": "1 — ok" }
}}
]
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": [] }
},
"gridPos": { "h": 6, "w": 8, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Leader Changes (cumulative)",
"description": "Monotonically increasing counter per member. A step jump = one leader election. Correlated jumps across members = cluster-wide event.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_server_leader_changes_seen_total{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull"] }
},
"gridPos": { "h": 6, "w": 8, "x": 8, "y": 5 }
},
{
"id": 12, "type": "timeseries", "title": "Slow Operations",
"description": "slow_apply: proposals applied slower than expected. slow_read_index: linearizable reads timing out. heartbeat_failures: Raft heartbeat send errors (network partition indicator).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "rate(etcd_server_slow_apply_total{instance=~\"$instance\"}[5m])", "refId": "A", "legendFormat": "Slow Apply — {{instance}}" },
{ "expr": "rate(etcd_server_slow_read_indexes_total{instance=~\"$instance\"}[5m])", "refId": "B", "legendFormat": "Slow Read Index — {{instance}}" },
{ "expr": "rate(etcd_server_heartbeat_send_failures_total{instance=~\"$instance\"}[5m])", "refId": "C", "legendFormat": "Heartbeat Failures — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 6, "w": 8, "x": 16, "y": 5 }
},
{
"id": 13, "type": "row", "title": "gRPC Traffic", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 11 }
},
{
"id": 14, "type": "timeseries", "title": "gRPC Request Rate by Method",
"description": "Unary calls/s per RPC method. High Put/Txn = heavy write load. High Range = heavy read load. High Watch = many controller watchers.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(grpc_method)(rate(grpc_server_started_total{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m]))",
"refId": "A", "legendFormat": "{{grpc_method}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 12 }
},
{
"id": 15, "type": "timeseries", "title": "gRPC Error Rate by Status Code",
"description": "Non-OK responses by gRPC status code. RESOURCE_EXHAUSTED = overloaded. UNAVAILABLE = leader election. DEADLINE_EXCEEDED = latency spike.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(grpc_code)(rate(grpc_server_handled_total{job=~\".*etcd.*\",grpc_code!=\"OK\"}[5m]))",
"refId": "A", "legendFormat": "{{grpc_code}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 12 }
},
{
"id": 16, "type": "timeseries", "title": "gRPC Request Latency (p50 / p95 / p99)",
"description": "Unary call handling duration. p99 > 100ms for Put/Txn indicates disk or CPU pressure. p99 > 500ms is likely to cause kube-apiserver timeouts.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m])) by (le))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m])) by (le))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m])) by (le))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 12 }
},
{
"id": 17, "type": "row", "title": "Raft Proposals", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 20 }
},
{
"id": 18, "type": "timeseries", "title": "Proposals Committed vs Applied",
"description": "Committed = agreed by Raft quorum. Applied = persisted to boltdb. A widening gap between the two = backend apply backlog (disk too slow to keep up).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "rate(etcd_server_proposals_committed_total{instance=~\"$instance\"}[5m])", "refId": "A", "legendFormat": "Committed — {{instance}}" },
{ "expr": "rate(etcd_server_proposals_applied_total{instance=~\"$instance\"}[5m])", "refId": "B", "legendFormat": "Applied — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 21 }
},
{
"id": 19, "type": "timeseries", "title": "Proposals Pending",
"description": "In-flight Raft proposals not yet committed. Consistently high (>5) = cluster cannot keep up with write throughput.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_server_proposals_pending{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line+area" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 5 },
{ "color": "red", "value": 10 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 21 }
},
{
"id": 20, "type": "timeseries", "title": "Failed Proposals Rate",
"description": "Raft proposals that were rejected. Root causes: quorum loss, leader timeout, network partition between members.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_server_proposals_failed_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 0.001 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 21 }
},
{
"id": 21, "type": "row", "title": "Disk I/O", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 28 }
},
{
"id": 22, "type": "timeseries", "title": "WAL Fsync Duration (p50 / p95 / p99) per Instance",
"description": "Time to flush the write-ahead log to disk. etcd is extremely sensitive to WAL latency. >10ms p99 = storage is the bottleneck. Correlates directly with Raft commit latency.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le,instance)(rate(etcd_disk_wal_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50 — {{instance}}" },
{ "expr": "histogram_quantile(0.95, sum by(le,instance)(rate(etcd_disk_wal_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95 — {{instance}}" },
{ "expr": "histogram_quantile(0.99, sum by(le,instance)(rate(etcd_disk_wal_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99 — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 29 }
},
{
"id": 23, "type": "timeseries", "title": "Backend Commit Duration (p50 / p95 / p99) per Instance",
"description": "Time for boltdb to commit a batch transaction. A spike here while WAL is healthy = backend I/O saturation or boltdb lock contention. Triggers apply backlog.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le,instance)(rate(etcd_disk_backend_commit_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50 — {{instance}}" },
{ "expr": "histogram_quantile(0.95, sum by(le,instance)(rate(etcd_disk_backend_commit_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95 — {{instance}}" },
{ "expr": "histogram_quantile(0.99, sum by(le,instance)(rate(etcd_disk_backend_commit_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99 — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 29 }
},
{
"id": 24, "type": "row", "title": "Network (Peer & Client)", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 37 }
},
{
"id": 25, "type": "timeseries", "title": "Peer RX Rate",
"description": "Bytes received from Raft peers (log replication + heartbeats). A burst during a quiet period = large snapshot being streamed to a recovering member.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_peer_received_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 0, "y": 38 }
},
{
"id": 26, "type": "timeseries", "title": "Peer TX Rate",
"description": "Bytes sent to Raft peers. Leader will have higher TX than followers (it replicates entries to all members).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_peer_sent_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 6, "y": 38 }
},
{
"id": 27, "type": "timeseries", "title": "Client gRPC Received",
"description": "Bytes received from API clients (kube-apiserver, operators). Spike = large write burst from controllers or kubectl apply.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_client_grpc_received_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 12, "y": 38 }
},
{
"id": 28, "type": "timeseries", "title": "Client gRPC Sent",
"description": "Bytes sent to API clients (responses + watch events). Persistently high = many active Watch streams or large objects being served.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_client_grpc_sent_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 18, "y": 38 }
},
{
"id": 29, "type": "row", "title": "DB Size & Process Resources", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 45 }
},
{
"id": 30, "type": "timeseries", "title": "DB Total vs In-Use Size per Instance",
"description": "Total = allocated boltdb file size. In Use = live key data. The gap between them = fragmentation. Steady growth of Total = compaction not keeping up with key churn.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "etcd_mvcc_db_total_size_in_bytes{instance=~\"$instance\"}", "refId": "A", "legendFormat": "Total — {{instance}}" },
{ "expr": "etcd_mvcc_db_total_size_in_use_in_bytes{instance=~\"$instance\"}", "refId": "B", "legendFormat": "In Use — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 46 }
},
{
"id": 31, "type": "timeseries", "title": "Process Resident Memory (RSS)",
"description": "Physical RAM consumed by the etcd process. Monotonically growing RSS = memory leak or oversized watch cache. Typical healthy range: 500 MiB2 GiB depending on cluster size.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_process_resident_memory_bytes{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 46 }
},
{
"id": 32, "type": "timeseries", "title": "Open File Descriptors vs Limit",
"description": "Open FD count (solid) and process FD limit (dashed). Approaching the limit will cause WAL file creation and new client connections to fail.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "etcd_process_open_fds{instance=~\"$instance\"}", "refId": "A", "legendFormat": "Open — {{instance}}" },
{ "expr": "etcd_process_max_fds{instance=~\"$instance\"}", "refId": "B", "legendFormat": "Limit — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{
"matcher": { "id": "byRegexp", "options": "^Limit.*" },
"properties": [
{ "id": "custom.lineWidth", "value": 1 },
{ "id": "custom.lineStyle", "value": { "fill": "dash", "dash": [6, 4] } },
{ "id": "custom.fillOpacity","value": 0 }
]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 46 }
},
{
"id": 33, "type": "row", "title": "Snapshots", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 54 }
},
{
"id": 34, "type": "timeseries", "title": "Snapshot Save Duration (p50 / p95 / p99)",
"description": "Time to write a full snapshot of the boltdb to disk. Slow saves delay Raft log compaction, causing the WAL to grow unboundedly and members to fall further behind.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le)(rate(etcd_debugging_snap_save_total_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum by(le)(rate(etcd_debugging_snap_save_total_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum by(le)(rate(etcd_debugging_snap_save_total_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 55 }
},
{
"id": 35, "type": "timeseries", "title": "Snapshot DB Fsync Duration (p50 / p95 / p99)",
"description": "Time to fsync the snapshot file itself. Distinct from WAL fsync: this is flushing the entire boltdb copy to disk after a snapshot is taken.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le)(rate(etcd_snap_db_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum by(le)(rate(etcd_snap_db_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum by(le)(rate(etcd_snap_db_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 55 }
}
]
}

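The dashboards in this change are plain Grafana JSON embedded in GrafanaDashboard CRs, so a malformed panel or overlapping `gridPos` only surfaces at render time. A quick pre-commit sanity check is to parse the JSON and verify that panels stay within the 24-column grid and never overlap. A minimal sketch (assumes the JSON has already been extracted from the CR's `json:` block; the helper names are illustrative, not part of this repo):

```python
import json

def rects_overlap(a, b):
    # Two gridPos rectangles overlap unless one is entirely to the
    # left/right of, or entirely above/below, the other.
    return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"]
                or a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

def check_dashboard(src: str) -> int:
    """Parse a dashboard JSON string and validate panel geometry.

    Raises on malformed JSON, panels past the 24-column grid edge,
    or any pair of overlapping panels. Returns the panel count.
    """
    dash = json.loads(src)  # raises ValueError if the embedded JSON is broken
    panels = dash.get("panels", [])
    for p in panels:
        g = p["gridPos"]
        assert g["x"] + g["w"] <= 24, f"panel {p['id']} exceeds the 24-column grid"
    for i, a in enumerate(panels):
        for b in panels[i + 1:]:
            assert not rects_overlap(a["gridPos"], b["gridPos"]), \
                f"panels {a['id']} and {b['id']} overlap"
    return len(panels)

# Tiny self-contained example: two side-by-side half-width panels.
demo = json.dumps({"panels": [
    {"id": 1, "gridPos": {"h": 4, "w": 12, "x": 0, "y": 0}},
    {"id": 2, "gridPos": {"h": 4, "w": 12, "x": 12, "y": 0}},
]})
print(check_dashboard(demo))  # → 2
```

Running this over each `json: |` block before pushing catches the most common copy-paste mistakes (duplicated `gridPos` coordinates, a panel widened past x+w=24) without needing a live Grafana instance.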

@@ -0,0 +1,752 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-control-plane-health
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Control Plane Health",
"uid": "okd-control-plane",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "control-plane"],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(apiserver_request_total, instance)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "API Server Instance",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "API Servers Up",
"description": "Number of kube-apiserver instances currently scraped and up. Healthy HA cluster = 3.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=~\".*apiserver.*\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Controller Managers Up",
"description": "kube-controller-manager instances up. In OKD only one holds the leader lease at a time; others are hot standbys.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=~\".*controller-manager.*\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Schedulers Up",
"description": "kube-scheduler instances up. One holds the leader lease; rest are standbys. 0 = no scheduling of new pods.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=~\".*scheduler.*\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "API 5xx Rate",
"description": "Server-side errors (5xx) across all apiserver instances per second. Any sustained non-zero value = apiserver internal fault.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.01 },
{ "color": "red", "value": 1 }
]},
"unit": "reqps", "noValue": "0", "decimals": 3
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "Inflight — Mutating",
"description": "Current in-flight mutating requests (POST/PUT/PATCH/DELETE). Default OKD limit is ~1000. Hitting the limit = 429 errors for writes.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(apiserver_current_inflight_requests{request_kind=\"mutating\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 500 },
{ "color": "orange", "value": 750 },
{ "color": "red", "value": 900 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "Inflight — Read-Only",
"description": "Current in-flight non-mutating requests (GET/LIST/WATCH). Default OKD limit is ~3000. Hitting it = 429 for reads, impacting controllers and kubectl.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(apiserver_current_inflight_requests{request_kind=\"readOnly\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1500 },
{ "color": "orange", "value": 2200 },
{ "color": "red", "value": 2700 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "API Request p99 (non-WATCH)",
"description": "Overall p99 latency for all non-streaming verbs. >1s = noticeable kubectl sluggishness. >10s = controllers timing out on LIST/GET.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"WATCH|CONNECT\"}[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.5 },
{ "color": "orange", "value": 1 },
{ "color": "red", "value": 5 }
]},
"unit": "s", "noValue": "0", "decimals": 3
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "APIServer → etcd p99",
"description": "p99 time apiserver spends waiting on etcd calls. Spike here while WAL fsync is healthy = serialization or large object overhead.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(apiserver_storage_request_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.05 },
{ "color": "orange", "value": 0.2 },
{ "color": "red", "value": 0.5 }
]},
"unit": "s", "noValue": "0", "decimals": 4
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "API Server — Request Rates & Errors", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Request Rate by Verb",
"description": "Non-streaming calls per second broken down by verb. GET/LIST = read load from controllers. POST/PUT/PATCH/DELETE = write throughput. A sudden LIST spike = controller cache resync storm.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(verb)(rate(apiserver_request_total{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m]))",
"refId": "A", "legendFormat": "{{verb}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Error Rate by HTTP Status Code",
"description": "4xx/5xx responses per second by code. 429 = inflight limit hit (throttling). 422 = admission rejection or invalid object. 500/503 = internal apiserver fault or etcd unavailability.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(code)(rate(apiserver_request_total{instance=~\"$instance\",code=~\"[45]..\"}[5m]))",
"refId": "A", "legendFormat": "HTTP {{code}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 5 }
},
{
"id": 12, "type": "timeseries", "title": "In-Flight Requests — Mutating vs Read-Only",
"description": "Instantaneous count of requests being actively handled. The two series correspond to the two inflight limit buckets enforced by the apiserver's Priority and Fairness (APF) or legacy inflight settings.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(request_kind)(apiserver_current_inflight_requests{instance=~\"$instance\"})", "refId": "A", "legendFormat": "{{request_kind}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 5 }
},
{
"id": 13, "type": "row", "title": "API Server — Latency", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 14, "type": "timeseries", "title": "Request Latency — p50 / p95 / p99 (non-WATCH)",
"description": "Aggregated end-to-end request duration across all verbs except WATCH/CONNECT (which are unbounded streaming). A rising p99 without a matching rise in etcd latency = CPU saturation, admission webhook slowness, or serialization overhead.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])) by (le))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])) by (le))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])) by (le))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 14 }
},
{
"id": 15, "type": "timeseries", "title": "Request p99 Latency by Verb",
"description": "p99 latency broken out per verb. LIST is inherently slower than GET due to serializing full collections. A POST/PUT spike = heavy admission webhook chain or large object writes. DELETE spikes are usually caused by cascading GC finalizer storms.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum by(verb,le)(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])))",
"refId": "A", "legendFormat": "{{verb}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 14 }
},
{
"id": 16, "type": "timeseries", "title": "APIServer → etcd Latency by Operation",
"description": "Time apiserver spends waiting on etcd, split by operation type (get, list, create, update, delete, watch). Elevated get/list = etcd read pressure. Elevated create/update = write bottleneck, likely correlated with WAL fsync latency.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(operation,le)(rate(apiserver_storage_request_duration_seconds_bucket[5m])))", "refId": "A", "legendFormat": "p50 — {{operation}}" },
{ "expr": "histogram_quantile(0.99, sum by(operation,le)(rate(apiserver_storage_request_duration_seconds_bucket[5m])))", "refId": "B", "legendFormat": "p99 — {{operation}}" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 14 }
},
{
"id": 17, "type": "row", "title": "API Server — Watches & Long-Running Requests", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 22 }
},
{
"id": 18, "type": "timeseries", "title": "Active Long-Running Requests (Watches) by Resource",
"description": "Instantaneous count of open WATCH streams grouped by resource. Each controller typically holds one WATCH per resource type per apiserver instance. A sudden drop = controller restart; a runaway climb = operator creating watches without cleanup.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(resource)(apiserver_longrunning_requests{instance=~\"$instance\",verb=\"WATCH\"})",
"refId": "A", "legendFormat": "{{resource}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 23 }
},
{
"id": 19, "type": "timeseries", "title": "Watch Events Dispatched Rate by Kind",
"description": "Watch events sent to all active watchers per second, by object kind. Persistent high rate for a specific kind = that resource type is churning heavily, increasing etcd load and controller reconcile frequency.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(kind)(rate(apiserver_watch_events_total{instance=~\"$instance\"}[5m]))",
"refId": "A", "legendFormat": "{{kind}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 23 }
},
{
"id": 20, "type": "timeseries", "title": "Watch Event Size — p50 / p95 / p99 by Kind",
"description": "Size of individual watch events dispatched to clients. Large events (MiB-scale) for Secrets or ConfigMaps = objects being stored with oversized data. Contributes to apiserver memory pressure and network saturation.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(kind,le)(rate(apiserver_watch_events_sizes_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50 — {{kind}}" },
{ "expr": "histogram_quantile(0.99, sum by(kind,le)(rate(apiserver_watch_events_sizes_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p99 — {{kind}}" }
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 23 }
},
{
"id": 21, "type": "row", "title": "Admission Webhooks", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 30 }
},
{
"id": 22, "type": "timeseries", "title": "Webhook Call Rate by Name",
"description": "Mutating and validating admission webhook invocations per second by webhook name. A webhook invoked on every write (e.g., a mutating webhook with no object selector) can be a major source of write latency amplification.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(name,type)(rate(apiserver_admission_webhook_request_total{instance=~\"$instance\"}[5m]))",
"refId": "A", "legendFormat": "{{type}} — {{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 31 }
},
{
"id": 23, "type": "timeseries", "title": "Webhook Latency p99 by Name",
"description": "p99 round-trip time per webhook call (network + webhook server processing). Default apiserver timeout is 10s; a webhook consistently near that limit causes cascading write latency for all resources it intercepts.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum by(name,le)(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{instance=~\"$instance\"}[5m])))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.5 },
{ "color": "red", "value": 2.0 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 31 }
},
{
"id": 24, "type": "timeseries", "title": "Webhook Rejection Rate by Name",
"description": "Rate of admission denials per webhook. A validating webhook rejecting requests is expected behaviour; a sudden surge indicates either a newly enforced policy or a misbehaving webhook rejecting valid objects.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(name,error_type)(rate(apiserver_admission_webhook_rejection_count{instance=~\"$instance\"}[5m]))",
"refId": "A", "legendFormat": "{{name}} ({{error_type}})"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 31 }
},
{
"id": 25, "type": "row", "title": "kube-controller-manager", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 38 }
},
{
"id": 26, "type": "timeseries", "title": "Work Queue Depth by Controller",
"description": "Items waiting to be reconciled in each controller's work queue. Persistent non-zero depth = controller cannot keep up with the event rate. Identifies which specific controller is the bottleneck during overload incidents.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(15, sum by(name)(workqueue_depth{job=~\".*controller-manager.*\"}))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 50 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 39 }
},
{
"id": 27, "type": "timeseries", "title": "Work Queue Item Processing Duration p99 by Controller",
"description": "p99 time a work item spends being actively reconciled (inside the reconcile loop, excludes queue wait time). A slow reconcile = either the controller is doing expensive API calls or the etcd write path is slow.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum by(name,le)(rate(workqueue_work_duration_seconds_bucket{job=~\".*controller-manager.*\"}[5m])))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 39 }
},
{
"id": 28, "type": "timeseries", "title": "Work Queue Retry Rate by Controller",
"description": "Rate of items being re-queued after a failed reconciliation. A persistently high retry rate for a controller = it is encountering recurring errors on the same objects (e.g., API permission errors, webhook rejections, or resource conflicts).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(15, sum by(name)(rate(workqueue_retries_total{job=~\".*controller-manager.*\"}[5m])))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 39 }
},
{
"id": 29, "type": "row", "title": "kube-scheduler", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 47 }
},
{
"id": 30, "type": "timeseries", "title": "Scheduling Attempt Rate by Result",
"description": "Outcomes of scheduling cycles per second. scheduled = pod successfully bound to a node. unschedulable = no node met the pod's constraints. error = scheduler internal failure (API error, timeout). Persistent unschedulable = cluster capacity or taints/affinity misconfiguration.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(result)(rate(scheduler_schedule_attempts_total[5m]))",
"refId": "A", "legendFormat": "{{result}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "scheduled" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "unschedulable" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "error" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 48 }
},
{
"id": 31, "type": "timeseries", "title": "Scheduling Latency — p50 / p95 / p99",
"description": "Time from when a pod enters the active queue to when a binding decision is made (does not include bind API call time). Includes filter, score, and reserve plugin execution time. Spike = expensive affinity rules, large number of nodes, or slow extender webhooks.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 48 }
},
{
"id": 32, "type": "timeseries", "title": "Pending Pods by Queue",
"description": "Pods waiting to be scheduled, split by internal queue. active = ready to be attempted now. backoff = recently failed, in exponential back-off. unschedulable = parked until cluster state changes. A growing unschedulable queue = systemic capacity or constraint problem.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(queue)(scheduler_pending_pods)",
"refId": "A", "legendFormat": "{{queue}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 50 }
]}
},
"overrides": [
{ "matcher": { "id": "byName", "options": "unschedulable" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "backoff" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "active" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 48 }
},
{
"id": 33, "type": "row", "title": "Process Resources — All Control Plane Components", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 55 }
},
{
"id": 34, "type": "timeseries", "title": "CPU Usage by Component",
"description": "Rate of CPU seconds consumed by each control plane process. apiserver CPU spike = surge in request volume or list serialization. controller-manager CPU spike = reconcile storm. scheduler CPU spike = large node count with complex affinity.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(job)(rate(process_cpu_seconds_total{job=~\".*apiserver.*\"}[5m]))", "refId": "A", "legendFormat": "apiserver — {{job}}" },
{ "expr": "sum by(job)(rate(process_cpu_seconds_total{job=~\".*controller-manager.*\"}[5m]))", "refId": "B", "legendFormat": "controller-manager — {{job}}" },
{ "expr": "sum by(job)(rate(process_cpu_seconds_total{job=~\".*scheduler.*\"}[5m]))", "refId": "C", "legendFormat": "scheduler — {{job}}" }
],
"fieldConfig": {
"defaults": {
"unit": "percentunit", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 56 }
},
{
"id": 35, "type": "timeseries", "title": "RSS Memory by Component",
"description": "Resident set size of each control plane process. apiserver memory is dominated by the watch cache size and serialisation buffers. controller-manager memory = informer caches. Monotonically growing RSS without restarts = memory leak or unbounded cache growth.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(job)(process_resident_memory_bytes{job=~\".*apiserver.*\"})", "refId": "A", "legendFormat": "apiserver — {{job}}" },
{ "expr": "sum by(job)(process_resident_memory_bytes{job=~\".*controller-manager.*\"})", "refId": "B", "legendFormat": "controller-manager — {{job}}" },
{ "expr": "sum by(job)(process_resident_memory_bytes{job=~\".*scheduler.*\"})", "refId": "C", "legendFormat": "scheduler — {{job}}" }
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 56 }
},
{
"id": 36, "type": "timeseries", "title": "Goroutines by Component",
"description": "Number of live goroutines in each control plane process. Gradual upward drift = goroutine leak (often tied to unclosed watch streams or context leaks). A step-down = process restart. apiserver typically runs 200600 goroutines; spikes above 1000 warrant investigation.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(job)(go_goroutines{job=~\".*apiserver.*\"})", "refId": "A", "legendFormat": "apiserver — {{job}}" },
{ "expr": "sum by(job)(go_goroutines{job=~\".*controller-manager.*\"})", "refId": "B", "legendFormat": "controller-manager — {{job}}" },
{ "expr": "sum by(job)(go_goroutines{job=~\".*scheduler.*\"})", "refId": "C", "legendFormat": "scheduler — {{job}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 56 }
}
]
}


@@ -0,0 +1,741 @@
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: okd-alerts-events
namespace: observability
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"title": "Alerts & Events — Active Problems",
"uid": "okd-alerts-events",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-3h", "to": "now" },
"tags": ["okd", "alerts", "events"],
"templating": {
"list": [
{
"name": "severity",
"type": "custom",
"label": "Severity Filter",
"query": "critical,warning,info",
"current": { "selected": true, "text": "All", "value": "$__all" },
"includeAll": true,
"allValue": "critical|warning|info",
"multi": false,
"options": [
{ "selected": true, "text": "All", "value": "$__all" },
{ "selected": false, "text": "Critical", "value": "critical" },
{ "selected": false, "text": "Warning", "value": "warning" },
{ "selected": false, "text": "Info", "value": "info" }
]
},
{
"name": "namespace",
"type": "query",
"label": "Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(ALERTS{alertstate=\"firing\"}, namespace)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"allValue": ".*",
"multi": true,
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Critical Alerts Firing",
"description": "Alerting rule instances currently in the firing state with severity=\"critical\". Any non-zero value represents a breached SLO or infrastructure condition requiring immediate on-call response. The ALERTS metric is generated by Prometheus directly from your alerting rules — it reflects what Prometheus knows, before Alertmanager routing or silencing.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(ALERTS{alertstate=\"firing\",severity=\"critical\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Warning Alerts Firing",
"description": "Firing alerts at severity=\"warning\". Warnings indicate a degraded or elevated-risk condition that has not yet crossed the critical threshold. A sustained or growing warning count often precedes a critical fire — treat them as early-warning signals, not background noise.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(ALERTS{alertstate=\"firing\",severity=\"warning\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "orange", "value": 5 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Info / Unclassified Alerts Firing",
"description": "Firing alerts with severity=\"info\" or no severity label. These are informational and do not normally require immediate action. A sudden large jump may reveal noisy alerting rules generating alert fatigue — rules worth reviewing for threshold tuning or adding inhibition rules.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(ALERTS{alertstate=\"firing\",severity!~\"critical|warning\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "blue", "value": 1 },
{ "color": "blue", "value": 25 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "Alerts Silenced (Suppressed)",
"description": "Alerts currently matched by an active Alertmanager silence rule and therefore not routed to receivers. Silences are intentional during maintenance windows, but a large suppressed count outside of planned maintenance = an overly broad silence masking real problems. Zero silences when a maintenance window is active = the silence has expired or was misconfigured.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(alertmanager_alerts{state=\"suppressed\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 20 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "CrashLoopBackOff Pods",
"description": "Container instances currently waiting in the CrashLoopBackOff state — the container crashed and Kubernetes is retrying with exponential back-off. Each instance is a pod that cannot stay running. Common root causes: OOM kill, bad entrypoint, missing Secret or ConfigMap, an unavailable init dependency, or a broken image layer.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "OOMKilled Containers",
"description": "Containers whose most recent termination reason was OOMKilled. This is a current-state snapshot: a container that was OOMKilled, restarted, and is now Running will still appear here until its next termination occurs for a different reason. Non-zero and stable = recurring OOM, likely a workload memory leak or under-provisioned memory limit.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 1 },
{ "color": "red", "value": 5 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "NotReady Nodes",
"description": "Nodes where the Ready condition is currently not True (False or Unknown). A NotReady node stops receiving new pod scheduling and, after the node eviction timeout (~5 min default), pods on it will be evicted. Control plane nodes going NotReady simultaneously = potential quorum loss. Any non-zero value is a tier-1 incident signal.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 0) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "Degraded Cluster Operators (OKD)",
"description": "OKD ClusterOperators currently reporting Degraded=True. Each ClusterOperator owns a core platform component — authentication, networking, image-registry, monitoring, ingress, storage, etc. A degraded operator means its managed component is impaired or unavailable. Zero is the only acceptable steady-state value outside of an active upgrade.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(cluster_operator_conditions{condition=\"Degraded\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Alert Overview", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Firing Alert Count by Severity Over Time",
"description": "Instantaneous count of firing ALERTS series grouped by severity over the selected window. A vertical rise = new alerting condition emerged. A horizontal plateau = a persistent, unresolved problem. A step-down = alert resolved or Prometheus rule evaluation stopped matching. Use the Severity Filter variable to narrow scope during triage.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "count by(severity)(ALERTS{alertstate=\"firing\",severity=~\"$severity\",namespace=~\"$namespace\"})",
"refId": "A",
"legendFormat": "{{severity}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "critical" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "warning" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "info" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max", "lastNotNull"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Alertmanager Notification Rate by Integration",
"description": "Rate of notification delivery attempts from Alertmanager per second, split by integration type (slack, pagerduty, email, webhook, etc.). Solid lines = successful deliveries; dashed red lines = failed deliveries. A drop to zero on all integrations = Alertmanager is not processing or the cluster is completely quiet. Persistent failures on one integration = check that receiver's credentials or endpoint availability.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(integration)(rate(alertmanager_notifications_total[5m]))", "refId": "A", "legendFormat": "✓ {{integration}}" },
{ "expr": "sum by(integration)(rate(alertmanager_notifications_failed_total[5m]))", "refId": "B", "legendFormat": "✗ {{integration}}" }
],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{
"matcher": { "id": "byFrameRefID", "options": "B" },
"properties": [
{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } },
{ "id": "custom.lineStyle", "value": { "dash": [6, 4], "fill": "dash" } },
{ "id": "custom.lineWidth", "value": 1 }
]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 5 }
},
{
"id": 12, "type": "bargauge", "title": "Longest-Firing Active Alerts",
"description": "Duration (now - ALERTS_FOR_STATE timestamp) for each currently firing alert, sorted descending. Alerts at the top have been firing longest and are the most likely candidates for known-but-unresolved issues, stale firing conditions, or alerts that should have a silence applied. Red bars (> 2 hours) strongly suggest a problem that has been acknowledged but not resolved.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sort_desc(time() - ALERTS_FOR_STATE{alertstate=\"firing\",severity=~\"$severity\",namespace=~\"$namespace\"})",
"refId": "A",
"legendFormat": "{{alertname}} · {{severity}} · {{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 300 },
{ "color": "orange", "value": 1800 },
{ "color": "red", "value": 7200 }
]}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true,
"valueMode": "color"
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 5 }
},
{
"id": 13, "type": "row", "title": "Active Firing Alerts — Full Detail", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 14, "type": "table", "title": "All Firing Alerts",
"description": "Instant-query table of every currently firing alert visible to Prometheus, filtered by the Namespace and Severity variables above. Each row is one alert instance (unique label combination). The value column is omitted — by definition every row here is firing. Use the built-in column filter (funnel icon) to further narrow to a specific alertname, pod, or node. Columns are sparse: labels not defined in a given alert rule will show '—'.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "ALERTS{alertstate=\"firing\",severity=~\"$severity\",namespace=~\"$namespace\"}",
"refId": "A",
"instant": true,
"legendFormat": ""
}],
"transformations": [
{ "id": "labelsToFields", "options": { "mode": "columns" } },
{
"id": "organize",
"options": {
"excludeByName": {
"alertstate": true,
"__name__": true,
"Value": true,
"Time": true
},
"renameByName": {
"alertname": "Alert Name",
"severity": "Severity",
"namespace": "Namespace",
"pod": "Pod",
"node": "Node",
"container": "Container",
"job": "Job",
"service": "Service",
"reason": "Reason",
"instance": "Instance"
},
"indexByName": {
"severity": 0,
"alertname": 1,
"namespace": 2,
"pod": 3,
"node": 4,
"container": 5,
"job": 6,
"service": 7,
"reason": 8,
"instance": 9
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "align": "left", "filterable": true },
"noValue": "—"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Severity" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "custom.width", "value": 110 },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"critical": { "text": "CRITICAL", "color": "dark-red", "index": 0 },
"warning": { "text": "WARNING", "color": "dark-yellow", "index": 1 },
"info": { "text": "INFO", "color": "dark-blue", "index": 2 }
}
}]
}
]
},
{ "matcher": { "id": "byName", "options": "Alert Name" }, "properties": [{ "id": "custom.width", "value": 300 }] },
{ "matcher": { "id": "byName", "options": "Namespace" }, "properties": [{ "id": "custom.width", "value": 180 }] },
{ "matcher": { "id": "byName", "options": "Pod" }, "properties": [{ "id": "custom.width", "value": 200 }] },
{ "matcher": { "id": "byName", "options": "Node" }, "properties": [{ "id": "custom.width", "value": 200 }] }
]
},
"options": {
"sortBy": [{ "desc": false, "displayName": "Severity" }],
"footer": { "show": false }
},
"gridPos": { "h": 12, "w": 24, "x": 0, "y": 14 }
},
{
"id": 15, "type": "row", "title": "Kubernetes Warning Events", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 26 }
},
{
"id": 16, "type": "timeseries", "title": "Warning Event Rate by Reason",
"description": "Rate of Kubernetes Warning-type events per second grouped by reason code. BackOff = container is CrashLooping. FailedScheduling = no node satisfies pod constraints. FailedMount = volume attachment or CSI failure. Evicted = kubelet evicted a pod due to memory or disk pressure. NodeNotReady = node lost contact. A spike in a single reason narrows the incident root-cause immediately without needing to read raw event logs. Requires kube-state-metrics with --resources=events.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(reason)(rate(kube_event_count{type=\"Warning\",namespace=~\"$namespace\"}[5m])))",
"refId": "A",
"legendFormat": "{{reason}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 27 }
},
{
"id": 17, "type": "bargauge", "title": "Warning Events — Top Namespaces (Accumulated Count)",
"description": "Total accumulated Warning event count (the count field on the Kubernetes Event object) per namespace, showing the top 15 most active. A namespace dominating this chart is generating significantly more abnormal conditions than its peers, useful for identifying noisy tenants, misconfigured deployments, or namespaces experiencing a persistent infrastructure problem. Note this is the raw Event.count field — it resets if the event object is deleted and recreated.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(15, sum by(namespace)(kube_event_count{type=\"Warning\"}))",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "orange", "value": 50 },
{ "color": "red", "value": 200 }
]}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 27 }
},
{
"id": 18, "type": "timeseries", "title": "Warning Events — Accumulated Count by Reason Over Time",
"description": "Raw accumulated event count gauge over time, split by reason. Unlike the rate panel this shows total volume and slope simultaneously. A line that climbs steeply = events are occurring frequently right now. A line that plateaus = the condition causing that reason has stopped. A line that drops to zero = the event object was deleted and recreated or the condition fully resolved.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(reason)(kube_event_count{type=\"Warning\",namespace=~\"$namespace\"}))",
"refId": "A",
"legendFormat": "{{reason}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 8, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 27 }
},
{
"id": 19, "type": "row", "title": "Pod Problems", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 35 }
},
{
"id": 20, "type": "timeseries", "title": "CrashLoopBackOff Pods by Namespace",
"description": "Count of container instances in CrashLoopBackOff waiting state over time, broken down by namespace. A sudden rise in one namespace = a workload deployment is failing. A persistent baseline across many namespaces = a shared dependency (Secret, ConfigMap, network policy, or an upstream service) has become unavailable. Unlike restart rate, this panel shows the steady-state count of pods currently stuck — not flapping.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\",namespace=~\"$namespace\"} == 1)",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 36 }
},
{
"id": 21, "type": "timeseries", "title": "Container Restart Rate by Namespace",
"description": "Rate of container restarts per second across all reasons (OOMKill, liveness probe failure, process exit) grouped by namespace. A namespace with a rising restart rate that has not yet entered CrashLoopBackOff is in the early failure window before the exponential back-off penalty kicks in. Cross-reference with the OOMKilled stat tile and the last-terminated-reason to separate crash types.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(namespace)(rate(kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}[5m])))",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 36 }
},
{
"id": 22, "type": "timeseries", "title": "Pods by Problem Phase (Failed / Pending / Unknown)",
"description": "Count of pods in Failed, Pending, or Unknown phase over time. Failed = container terminated with a non-zero exit code or was evicted and not rescheduled. Pending for more than a few minutes = scheduler unable to bind the pod (check FailedScheduling events, node capacity, and taint/toleration mismatches). Unknown = kubelet is not reporting to the apiserver, typically indicating a node network partition or kubelet crash.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(phase)(kube_pod_status_phase{phase=~\"Failed|Unknown\",namespace=~\"$namespace\"} == 1)", "refId": "A", "legendFormat": "{{phase}}" },
{ "expr": "sum(kube_pod_status_phase{phase=\"Pending\",namespace=~\"$namespace\"} == 1)", "refId": "B", "legendFormat": "Pending" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]}
},
"overrides": [
{ "matcher": { "id": "byName", "options": "Failed" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Pending" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Unknown" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 36 }
},
{
"id": 23, "type": "row", "title": "Node & Cluster Operator Conditions", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 43 }
},
{
"id": 24, "type": "table", "title": "Node Condition Status Matrix",
"description": "Instant snapshot of every active node condition across all nodes. Each row is one (node, condition, status) triple where value=1, meaning that combination is currently true. Ready=true is the normal healthy state; MemoryPressure=true, DiskPressure=true, PIDPressure=true, and NetworkUnavailable=true all indicate problem states that will affect pod scheduling on that node. Use the column filter to show only conditions where status=\"true\" and condition != \"Ready\" to isolate problems quickly.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "kube_node_status_condition == 1",
"refId": "A",
"instant": true,
"legendFormat": ""
}],
"transformations": [
{ "id": "labelsToFields", "options": { "mode": "columns" } },
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Value": true,
"__name__": true,
"endpoint": true,
"job": true,
"service": true,
"instance": true
},
"renameByName": {
"node": "Node",
"condition": "Condition",
"status": "Status"
},
"indexByName": { "node": 0, "condition": 1, "status": 2 }
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "align": "left", "filterable": true },
"noValue": "—"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Status" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "custom.width", "value": 90 },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"true": { "text": "true", "color": "green", "index": 0 },
"false": { "text": "false", "color": "dark-red", "index": 1 },
"unknown": { "text": "unknown", "color": "dark-orange", "index": 2 }
}
}]
}
]
},
{
"matcher": { "id": "byName", "options": "Condition" },
"properties": [
{ "id": "custom.width", "value": 190 },
{ "id": "custom.displayMode", "value": "color-text" },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"Ready": { "color": "green", "index": 0 },
"MemoryPressure": { "color": "red", "index": 1 },
"DiskPressure": { "color": "red", "index": 2 },
"PIDPressure": { "color": "red", "index": 3 },
"NetworkUnavailable": { "color": "red", "index": 4 }
}
}]
}
]
},
{ "matcher": { "id": "byName", "options": "Node" }, "properties": [{ "id": "custom.width", "value": 230 }] }
]
},
"options": {
"sortBy": [{ "desc": false, "displayName": "Node" }],
"footer": { "show": false }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 44 }
},
{
"id": 25, "type": "table", "title": "Cluster Operator Conditions — Degraded & Progressing (OKD)",
"description": "Shows only ClusterOperator conditions that indicate a problem state: Degraded=True (operator has failed to achieve its desired state) or Progressing=True (operator is actively reconciling — normal during upgrades but alarming in steady state). Operators not appearing in this table are healthy. The reason column gives the operator's own explanation for the condition, which maps directly to the relevant operator log stream and OpenShift runbook.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "cluster_operator_conditions{condition=\"Degraded\"} == 1",
"refId": "A",
"instant": true,
"legendFormat": ""
},
{
"expr": "cluster_operator_conditions{condition=\"Progressing\"} == 1",
"refId": "B",
"instant": true,
"legendFormat": ""
}
],
"transformations": [
{ "id": "labelsToFields", "options": { "mode": "columns" } },
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Value": true,
"__name__": true,
"endpoint": true,
"job": true,
"service": true,
"instance": true,
"namespace": true
},
"renameByName": {
"name": "Operator",
"condition": "Condition",
"reason": "Reason"
},
"indexByName": { "name": 0, "condition": 1, "reason": 2 }
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "align": "left", "filterable": true },
"noValue": "—"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Condition" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "custom.width", "value": 140 },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"Degraded": { "text": "Degraded", "color": "dark-red", "index": 0 },
"Progressing": { "text": "Progressing", "color": "dark-yellow", "index": 1 }
}
}]
}
]
},
{ "matcher": { "id": "byName", "options": "Operator" }, "properties": [{ "id": "custom.width", "value": 240 }] },
{ "matcher": { "id": "byName", "options": "Reason" }, "properties": [{ "id": "custom.width", "value": 220 }] }
]
},
"options": {
"sortBy": [{ "desc": false, "displayName": "Condition" }],
"footer": { "show": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 44 }
}
]
}
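
The `gridPos` fields in the dashboard panels above drive Grafana's 24-column grid layout (`x`/`w` are columns, `y`/`h` are rows). As a quick sanity check before importing hand-written dashboard JSON, a minimal sketch (not part of this repo) can verify that no two panels occupy the same grid cells:

```python
import json

def rects_overlap(a, b):
    """True if two gridPos dicts occupy any common grid cell."""
    return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"] or
                a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

def check_layout(panels):
    """Return (id, id) pairs whose gridPos rectangles collide."""
    collisions = []
    for i, p in enumerate(panels):
        for q in panels[i + 1:]:
            if rects_overlap(p["gridPos"], q["gridPos"]):
                collisions.append((p["id"], q["id"]))
    return collisions

# Example mirroring the panels above: the full-width table at y=14 (h=12)
# ends exactly where the row header at y=26 begins, so nothing collides.
panels = json.loads("""[
  {"id": 14, "gridPos": {"h": 12, "w": 24, "x": 0, "y": 14}},
  {"id": 15, "gridPos": {"h": 1,  "w": 24, "x": 0, "y": 26}},
  {"id": 16, "gridPos": {"h": 8,  "w": 8,  "x": 0, "y": 27}}
]""")
print(check_layout(panels))  # → []
```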


@@ -0,0 +1,49 @@
# These are probably already created by the rook-ceph operator; confirmed for
# the second one (rook-ceph-exporter), still needs validation for the first.
# The first one (rook-ceph-mgr) was overwritten with what is here, and it was
# probably already working. All that was missing was a label on the rook-ceph
# namespace telling Prometheus to look for monitors in this namespace.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rook-ceph-mgr
namespace: rook-ceph
labels:
# This specific label is what tells OKD's Prometheus to pick this up
openshift.io/cluster-monitoring: "true"
spec:
namespaceSelector:
matchNames:
- rook-ceph
selector:
matchLabels:
# This matches your 'rook-ceph-mgr' service
app: rook-ceph-mgr
endpoints:
- port: ""
# The port in your service has no name (only an integer), so targetPort is used
targetPort: 9283
path: /metrics
interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rook-ceph-exporter
namespace: rook-ceph
labels:
# This label is required for OKD cluster-wide monitoring to pick it up
openshift.io/cluster-monitoring: "true"
team: rook
spec:
endpoints:
- honorLabels: true
interval: 10s
path: /metrics
port: ceph-exporter-http-metrics
namespaceSelector:
matchNames:
- rook-ceph
selector:
matchLabels:
app: rook-ceph-exporter
rook_cluster: rook-ceph
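
A ServiceMonitor like the ones above only scrapes a Service when `spec.selector.matchLabels` is a subset of the Service's labels and the Service's namespace appears in `spec.namespaceSelector.matchNames`. A minimal sketch of that selection logic (a simplified model of the Prometheus Operator's behavior, ignoring matchExpressions) makes the matching rule concrete:

```python
def selects(service_monitor, service):
    """Simplified ServiceMonitor-to-Service matching: every matchLabels
    entry must equal the Service's label, and the namespace must be listed."""
    sel = service_monitor["spec"]["selector"]["matchLabels"]
    namespaces = service_monitor["spec"]["namespaceSelector"]["matchNames"]
    return (service["namespace"] in namespaces and
            all(service["labels"].get(k) == v for k, v in sel.items()))

# Hypothetical objects mirroring the rook-ceph-mgr manifest above.
monitor = {"spec": {
    "selector": {"matchLabels": {"app": "rook-ceph-mgr"}},
    "namespaceSelector": {"matchNames": ["rook-ceph"]},
}}
svc = {"namespace": "rook-ceph", "labels": {"app": "rook-ceph-mgr"}}
print(selects(monitor, svc))  # → True
```

This is why the missing namespace label mattered: even a correct selector does nothing if OKD's Prometheus never discovers ServiceMonitors in the namespace at all.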


@@ -0,0 +1,23 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: rook-ceph-metrics-viewer
namespace: rook-ceph
rules:
- apiGroups: [""]
resources: ["services", "endpoints", "pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: rook-ceph-metrics-viewer
namespace: rook-ceph
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: rook-ceph-metrics-viewer
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: openshift-monitoring
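
The Role/RoleBinding pair above grants the `prometheus-k8s` ServiceAccount read access to services, endpoints, and pods in `rook-ceph`, which is what service discovery needs. A minimal sketch of RBAC rule evaluation (an assumption-laden simplification that ignores wildcards and `resourceNames`) shows which requests the Role permits:

```python
def allowed(rules, api_group, resource, verb):
    """Return True if any rule grants (api_group, resource, verb).
    Simplified: no wildcard ('*') or resourceNames handling."""
    return any(api_group in r["apiGroups"] and
               resource in r["resources"] and
               verb in r["verbs"]
               for r in rules)

# The single rule from the rook-ceph-metrics-viewer Role above.
rules = [{"apiGroups": [""],
          "resources": ["services", "endpoints", "pods"],
          "verbs": ["get", "list", "watch"]}]

print(allowed(rules, "", "endpoints", "watch"))  # → True
print(allowed(rules, "", "secrets", "get"))      # → False
```

Note the rule deliberately excludes secrets: discovery needs to list scrape targets, not read credentials.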


@@ -0,0 +1,7 @@
apiVersion: v1
kind: Namespace
metadata:
name: rook-ceph
labels:
# This is the critical label that allows OKD Prometheus to see the namespace
openshift.io/cluster-monitoring: "true"


@@ -0,0 +1,731 @@
{
"title": "Alerts & Events — Active Problems",
"uid": "okd-alerts-events",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-3h", "to": "now" },
"tags": ["okd", "alerts", "events"],
"templating": {
"list": [
{
"name": "severity",
"type": "custom",
"label": "Severity Filter",
"query": "critical,warning,info",
"current": { "selected": true, "text": "All", "value": "$__all" },
"includeAll": true,
"allValue": "critical|warning|info",
"multi": false,
"options": [
{ "selected": true, "text": "All", "value": "$__all" },
{ "selected": false, "text": "Critical", "value": "critical" },
{ "selected": false, "text": "Warning", "value": "warning" },
{ "selected": false, "text": "Info", "value": "info" }
]
},
{
"name": "namespace",
"type": "query",
"label": "Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(ALERTS{alertstate=\"firing\"}, namespace)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"allValue": ".*",
"multi": true,
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Critical Alerts Firing",
"description": "Alerting rule instances currently in the firing state with severity=\"critical\". Any non-zero value represents a breached SLO or infrastructure condition requiring immediate on-call response. The ALERTS metric is generated by Prometheus directly from your alerting rules — it reflects what Prometheus knows, before Alertmanager routing or silencing.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(ALERTS{alertstate=\"firing\",severity=\"critical\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Warning Alerts Firing",
"description": "Firing alerts at severity=\"warning\". Warnings indicate a degraded or elevated-risk condition that has not yet crossed the critical threshold. A sustained or growing warning count often precedes a critical fire — treat them as early-warning signals, not background noise.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(ALERTS{alertstate=\"firing\",severity=\"warning\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "orange", "value": 5 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Info / Unclassified Alerts Firing",
"description": "Firing alerts with severity=\"info\" or no severity label. These are informational and do not normally require immediate action. A sudden large jump may reveal noisy alerting rules generating alert fatigue — rules worth reviewing for threshold tuning or adding inhibition rules.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(ALERTS{alertstate=\"firing\",severity!~\"critical|warning\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "blue", "value": 1 },
{ "color": "blue", "value": 25 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "Alerts Silenced (Suppressed)",
"description": "Alerts currently matched by an active Alertmanager silence rule and therefore not routed to receivers. Silences are intentional during maintenance windows, but a large suppressed count outside of planned maintenance = an overly broad silence masking real problems. Zero silences when a maintenance window is active = the silence has expired or was misconfigured.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(alertmanager_alerts{state=\"suppressed\"}) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 20 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "CrashLoopBackOff Pods",
"description": "Container instances currently waiting in the CrashLoopBackOff state — the container crashed and Kubernetes is retrying with exponential back-off. Each instance is a pod that cannot stay running. Common root causes: OOM kill, bad entrypoint, missing Secret or ConfigMap, an unavailable init dependency, or a broken image layer.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "OOMKilled Containers",
"description": "Containers whose most recent termination reason was OOMKilled. This is a current-state snapshot: a container that was OOMKilled, restarted, and is now Running will still appear here until its next termination occurs for a different reason. Non-zero and stable = recurring OOM, likely a workload memory leak or under-provisioned memory limit.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 1 },
{ "color": "red", "value": 5 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "NotReady Nodes",
"description": "Nodes where the Ready condition is currently not True (False or Unknown). A NotReady node stops receiving new pod scheduling and, after the node eviction timeout (~5 min default), pods on it will be evicted. Control plane nodes going NotReady simultaneously = potential quorum loss. Any non-zero value is a tier-1 incident signal.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 0) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "Degraded Cluster Operators (OKD)",
"description": "OKD ClusterOperators currently reporting Degraded=True. Each ClusterOperator owns a core platform component — authentication, networking, image-registry, monitoring, ingress, storage, etc. A degraded operator means its managed component is impaired or unavailable. Zero is the only acceptable steady-state value outside of an active upgrade.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(cluster_operator_conditions{condition=\"Degraded\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Alert Overview", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Firing Alert Count by Severity Over Time",
"description": "Instantaneous count of firing ALERTS series grouped by severity over the selected window. A vertical rise = new alerting condition emerged. A horizontal plateau = a persistent, unresolved problem. A step-down = alert resolved or Prometheus rule evaluation stopped matching. Use the Severity Filter variable to narrow scope during triage.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "count by(severity)(ALERTS{alertstate=\"firing\",severity=~\"$severity\",namespace=~\"$namespace\"})",
"refId": "A",
"legendFormat": "{{severity}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "critical" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "warning" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "info" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max", "lastNotNull"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Alertmanager Notification Rate by Integration",
"description": "Rate of notification delivery attempts from Alertmanager per second, split by integration type (slack, pagerduty, email, webhook, etc.). Solid lines = successful deliveries; dashed red lines = failed deliveries. A drop to zero on all integrations = Alertmanager is not processing or the cluster is completely quiet. Persistent failures on one integration = check that receiver's credentials or endpoint availability.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(integration)(rate(alertmanager_notifications_total[5m]))", "refId": "A", "legendFormat": "✓ {{integration}}" },
{ "expr": "sum by(integration)(rate(alertmanager_notifications_failed_total[5m]))", "refId": "B", "legendFormat": "✗ {{integration}}" }
],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{
"matcher": { "id": "byFrameRefID", "options": "B" },
"properties": [
{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } },
{ "id": "custom.lineStyle", "value": { "dash": [6, 4], "fill": "dash" } },
{ "id": "custom.lineWidth", "value": 1 }
]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 5 }
},
{
"id": 12, "type": "bargauge", "title": "Longest-Firing Active Alerts",
"description": "Duration (now - ALERTS_FOR_STATE timestamp) for each currently firing alert, sorted descending. Alerts at the top have been firing longest and are the most likely candidates for known-but-unresolved issues, stale firing conditions, or alerts that should have a silence applied. Red bars (> 2 hours) strongly suggest a problem that has been acknowledged but not resolved.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sort_desc(time() - ALERTS_FOR_STATE{alertstate=\"firing\",severity=~\"$severity\",namespace=~\"$namespace\"})",
"refId": "A",
"legendFormat": "{{alertname}} · {{severity}} · {{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 300 },
{ "color": "orange", "value": 1800 },
{ "color": "red", "value": 7200 }
]}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true,
"valueMode": "color"
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 5 }
},
{
"id": 13, "type": "row", "title": "Active Firing Alerts — Full Detail", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 14, "type": "table", "title": "All Firing Alerts",
"description": "Instant-query table of every currently firing alert visible to Prometheus, filtered by the Namespace and Severity variables above. Each row is one alert instance (unique label combination). The value column is omitted — by definition every row here is firing. Use the built-in column filter (funnel icon) to further narrow to a specific alertname, pod, or node. Columns are sparse: labels not defined in a given alert rule will show '—'.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "ALERTS{alertstate=\"firing\",severity=~\"$severity\",namespace=~\"$namespace\"}",
"refId": "A",
"instant": true,
"legendFormat": ""
}],
"transformations": [
{ "id": "labelsToFields", "options": { "mode": "columns" } },
{
"id": "organize",
"options": {
"excludeByName": {
"alertstate": true,
"__name__": true,
"Value": true,
"Time": true
},
"renameByName": {
"alertname": "Alert Name",
"severity": "Severity",
"namespace": "Namespace",
"pod": "Pod",
"node": "Node",
"container": "Container",
"job": "Job",
"service": "Service",
"reason": "Reason",
"instance": "Instance"
},
"indexByName": {
"severity": 0,
"alertname": 1,
"namespace": 2,
"pod": 3,
"node": 4,
"container": 5,
"job": 6,
"service": 7,
"reason": 8,
"instance": 9
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "align": "left", "filterable": true },
"noValue": "—"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Severity" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "custom.width", "value": 110 },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"critical": { "text": "CRITICAL", "color": "dark-red", "index": 0 },
"warning": { "text": "WARNING", "color": "dark-yellow", "index": 1 },
"info": { "text": "INFO", "color": "dark-blue", "index": 2 }
}
}]
}
]
},
{ "matcher": { "id": "byName", "options": "Alert Name" }, "properties": [{ "id": "custom.width", "value": 300 }] },
{ "matcher": { "id": "byName", "options": "Namespace" }, "properties": [{ "id": "custom.width", "value": 180 }] },
{ "matcher": { "id": "byName", "options": "Pod" }, "properties": [{ "id": "custom.width", "value": 200 }] },
{ "matcher": { "id": "byName", "options": "Node" }, "properties": [{ "id": "custom.width", "value": 200 }] }
]
},
"options": {
"sortBy": [{ "desc": false, "displayName": "Severity" }],
"footer": { "show": false }
},
"gridPos": { "h": 12, "w": 24, "x": 0, "y": 14 }
},
{
"id": 15, "type": "row", "title": "Kubernetes Warning Events", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 26 }
},
{
"id": 16, "type": "timeseries", "title": "Warning Event Rate by Reason",
"description": "Rate of Kubernetes Warning-type events per second grouped by reason code. BackOff = container is CrashLooping. FailedScheduling = no node satisfies pod constraints. FailedMount = volume attachment or CSI failure. Evicted = kubelet evicted a pod due to memory or disk pressure. NodeNotReady = node lost contact. A spike in a single reason narrows the incident root-cause immediately without needing to read raw event logs. Requires kube-state-metrics with --resources=events.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(reason)(rate(kube_event_count{type=\"Warning\",namespace=~\"$namespace\"}[5m])))",
"refId": "A",
"legendFormat": "{{reason}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 27 }
},
{
"id": 17, "type": "bargauge", "title": "Warning Events — Top Namespaces (Accumulated Count)",
"description": "Total accumulated Warning event count (the count field on the Kubernetes Event object) per namespace, showing the top 15 most active. A namespace dominating this chart is generating significantly more abnormal conditions than its peers, useful for identifying noisy tenants, misconfigured deployments, or namespaces experiencing a persistent infrastructure problem. Note this is the raw Event.count field — it resets if the event object is deleted and recreated.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(15, sum by(namespace)(kube_event_count{type=\"Warning\"}))",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "orange", "value": 50 },
{ "color": "red", "value": 200 }
]}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 27 }
},
{
"id": 18, "type": "timeseries", "title": "Warning Events — Accumulated Count by Reason Over Time",
"description": "Raw accumulated event count gauge over time, split by reason. Unlike the rate panel this shows total volume and slope simultaneously. A line that climbs steeply = events are occurring frequently right now. A line that plateaus = the condition causing that reason has stopped. A line that drops to zero = the event object was deleted and recreated or the condition fully resolved.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(reason)(kube_event_count{type=\"Warning\",namespace=~\"$namespace\"}))",
"refId": "A",
"legendFormat": "{{reason}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 8, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 27 }
},
{
"id": 19, "type": "row", "title": "Pod Problems", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 35 }
},
{
"id": 20, "type": "timeseries", "title": "CrashLoopBackOff Pods by Namespace",
"description": "Count of container instances in CrashLoopBackOff waiting state over time, broken down by namespace. A sudden rise in one namespace = a workload deployment is failing. A persistent baseline across many namespaces = a shared dependency (Secret, ConfigMap, network policy, or an upstream service) has become unavailable. Unlike restart rate, this panel shows the steady-state count of pods currently stuck — not flapping.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\",namespace=~\"$namespace\"} == 1)",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 36 }
},
{
"id": 21, "type": "timeseries", "title": "Container Restart Rate by Namespace",
"description": "Rate of container restarts per second across all reasons (OOMKill, liveness probe failure, process exit) grouped by namespace. A namespace with a rising restart rate that has not yet entered CrashLoopBackOff is in the early failure window before the exponential back-off penalty kicks in. Cross-reference with the OOMKilled stat tile and the last-terminated-reason to separate crash types.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(namespace)(rate(kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}[5m])))",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 36 }
},
{
"id": 22, "type": "timeseries", "title": "Pods by Problem Phase (Failed / Pending / Unknown)",
"description": "Count of pods in Failed, Pending, or Unknown phase over time. Failed = container terminated with a non-zero exit code or was evicted and not rescheduled. Pending for more than a few minutes = scheduler unable to bind the pod (check FailedScheduling events, node capacity, and taint/toleration mismatches). Unknown = kubelet is not reporting to the apiserver, typically indicating a node network partition or kubelet crash.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(phase)(kube_pod_status_phase{phase=~\"Failed|Unknown\",namespace=~\"$namespace\"} == 1)", "refId": "A", "legendFormat": "{{phase}}" },
{ "expr": "sum(kube_pod_status_phase{phase=\"Pending\",namespace=~\"$namespace\"} == 1)", "refId": "B", "legendFormat": "Pending" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]}
},
"overrides": [
{ "matcher": { "id": "byName", "options": "Failed" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Pending" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Unknown" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 36 }
},
{
"id": 23, "type": "row", "title": "Node & Cluster Operator Conditions", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 43 }
},
{
"id": 24, "type": "table", "title": "Node Condition Status Matrix",
"description": "Instant snapshot of every active node condition across all nodes. Each row is one (node, condition, status) triple where value=1, meaning that combination is currently true. Ready=true is the normal healthy state; MemoryPressure=true, DiskPressure=true, PIDPressure=true, and NetworkUnavailable=true all indicate problem states that will affect pod scheduling on that node. Use the column filter to show only conditions where status=\"true\" and condition != \"Ready\" to isolate problems quickly.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "kube_node_status_condition == 1",
"refId": "A",
"instant": true,
"legendFormat": ""
}],
"transformations": [
{ "id": "labelsToFields", "options": { "mode": "columns" } },
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Value": true,
"__name__": true,
"endpoint": true,
"job": true,
"service": true,
"instance": true
},
"renameByName": {
"node": "Node",
"condition": "Condition",
"status": "Status"
},
"indexByName": { "node": 0, "condition": 1, "status": 2 }
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "align": "left", "filterable": true },
"noValue": "—"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Status" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "custom.width", "value": 90 },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"true": { "text": "true", "color": "green", "index": 0 },
"false": { "text": "false", "color": "dark-red", "index": 1 },
"unknown": { "text": "unknown", "color": "dark-orange", "index": 2 }
}
}]
}
]
},
{
"matcher": { "id": "byName", "options": "Condition" },
"properties": [
{ "id": "custom.width", "value": 190 },
{ "id": "custom.displayMode", "value": "color-text" },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"Ready": { "color": "green", "index": 0 },
"MemoryPressure": { "color": "red", "index": 1 },
"DiskPressure": { "color": "red", "index": 2 },
"PIDPressure": { "color": "red", "index": 3 },
"NetworkUnavailable": { "color": "red", "index": 4 }
}
}]
}
]
},
{ "matcher": { "id": "byName", "options": "Node" }, "properties": [{ "id": "custom.width", "value": 230 }] }
]
},
"options": {
"sortBy": [{ "desc": false, "displayName": "Node" }],
"footer": { "show": false }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 44 }
},
{
"id": 25, "type": "table", "title": "Cluster Operator Conditions — Degraded & Progressing (OKD)",
"description": "Shows only ClusterOperator conditions that indicate a problem state: Degraded=True (operator has failed to achieve its desired state) or Progressing=True (operator is actively reconciling — normal during upgrades but alarming in steady state). Operators not appearing in this table are healthy. The reason column gives the operator's own explanation for the condition, which maps directly to the relevant operator log stream and OpenShift runbook.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "cluster_operator_conditions{condition=\"Degraded\"} == 1",
"refId": "A",
"instant": true,
"legendFormat": ""
},
{
"expr": "cluster_operator_conditions{condition=\"Progressing\"} == 1",
"refId": "B",
"instant": true,
"legendFormat": ""
}
],
"transformations": [
{ "id": "labelsToFields", "options": { "mode": "columns" } },
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Value": true,
"__name__": true,
"endpoint": true,
"job": true,
"service": true,
"instance": true,
"namespace": true
},
"renameByName": {
"name": "Operator",
"condition": "Condition",
"reason": "Reason"
},
"indexByName": { "name": 0, "condition": 1, "reason": 2 }
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "align": "left", "filterable": true },
"noValue": "—"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Condition" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "custom.width", "value": 140 },
{
"id": "mappings",
"value": [{
"type": "value",
"options": {
"Degraded": { "text": "Degraded", "color": "dark-red", "index": 0 },
"Progressing": { "text": "Progressing", "color": "dark-yellow", "index": 1 }
}
}]
}
]
},
{ "matcher": { "id": "byName", "options": "Operator" }, "properties": [{ "id": "custom.width", "value": 240 }] },
{ "matcher": { "id": "byName", "options": "Reason" }, "properties": [{ "id": "custom.width", "value": 220 }] }
]
},
"options": {
"sortBy": [{ "desc": false, "displayName": "Condition" }],
"footer": { "show": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 44 }
}
]
}
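Before a dashboard JSON like the one above is handed to Grafana, it is cheap to catch structural mistakes (duplicate panel ids, a panel missing its datasource) that Grafana only reports at import time. The sketch below is a minimal, hypothetical validator, not part of this repository's tooling; the `validate_dashboard` helper and the inline sample document are illustrative assumptions.

```python
import json

def validate_dashboard(raw: str) -> list[str]:
    """Return a list of structural problems found in a Grafana dashboard JSON string."""
    problems = []
    dash = json.loads(raw)  # raises ValueError on malformed JSON
    for key in ("title", "uid", "panels"):
        if key not in dash:
            problems.append(f"missing top-level key: {key}")
    seen_ids = set()
    for panel in dash.get("panels", []):
        pid = panel.get("id")
        if pid in seen_ids:
            problems.append(f"duplicate panel id: {pid}")
        seen_ids.add(pid)
        # Row panels carry no queries, so they are exempt from the datasource check.
        if panel.get("type") != "row" and "datasource" not in panel:
            problems.append(f"panel {pid} has no datasource")
    return problems

# Hypothetical broken document: two panels share id 1 and the stat has no datasource.
sample = '{"title": "t", "uid": "u", "panels": [{"id": 1, "type": "row"}, {"id": 1, "type": "stat"}]}'
print(validate_dashboard(sample))
```

Running such a check in CI (the "Run Check Script" workflow in this repo, for instance) would flag these problems before a dashboard ever reaches the Grafana operator.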


@@ -0,0 +1,739 @@
{
"title": "Cluster Overview",
"uid": "okd-cluster-overview",
"schemaVersion": 36,
"version": 2,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "cluster", "overview"],
"panels": [
{
"id": 1,
"type": "stat",
"title": "Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 1)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2,
"type": "stat",
"title": "Not Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"false\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3,
"type": "stat",
"title": "Running Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Running\"} == 1)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4,
"type": "stat",
"title": "Pending Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Pending\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5,
"type": "stat",
"title": "Failed Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Failed\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6,
"type": "stat",
"title": "CrashLoopBackOff",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7,
"type": "stat",
"title": "Critical Alerts",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\",severity=\"critical\"}) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8,
"type": "stat",
"title": "Warning Alerts",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(ALERTS{alertstate=\"firing\",severity=\"warning\"}) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 10 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9,
"type": "gauge",
"title": "CPU Usage",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "CPU"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"showThresholdLabels": false,
"showThresholdMarkers": true,
"orientation": "auto"
},
"gridPos": { "h": 6, "w": 5, "x": 0, "y": 4 }
},
{
"id": 10,
"type": "gauge",
"title": "Memory Usage",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)))",
"refId": "A",
"legendFormat": "Memory"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 75 },
{ "color": "red", "value": 90 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"showThresholdLabels": false,
"showThresholdMarkers": true,
"orientation": "auto"
},
"gridPos": { "h": 6, "w": 5, "x": 5, "y": 4 }
},
{
"id": 11,
"type": "gauge",
"title": "Root Disk Usage",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (sum(node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"}) / sum(node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"})))",
"refId": "A",
"legendFormat": "Disk"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"showThresholdLabels": false,
"showThresholdMarkers": true,
"orientation": "auto"
},
"gridPos": { "h": 6, "w": 4, "x": 10, "y": 4 }
},
{
"id": 12,
"type": "stat",
"title": "etcd Has Leader",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "min(etcd_server_has_leader)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "NO LEADER", "color": "red" },
"1": { "text": "LEADER OK", "color": "green" }
}
}
],
"unit": "short",
"noValue": "?"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 14, "y": 4 }
},
{
"id": 13,
"type": "stat",
"title": "API Servers Up",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(up{job=\"apiserver\"})",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 2 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 19, "y": 4 }
},
{
"id": 14,
"type": "stat",
"title": "etcd Members Up",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(up{job=\"etcd\"})",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 2 },
{ "color": "green", "value": 3 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 14, "y": 7 }
},
{
"id": 15,
"type": "stat",
"title": "Operators Degraded",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(cluster_operator_conditions{condition=\"Degraded\",status=\"True\"} == 1) or vector(0)",
"refId": "A",
"legendFormat": ""
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short",
"noValue": "0"
}
},
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"gridPos": { "h": 3, "w": 5, "x": 19, "y": 7 }
},
{
"id": 16,
"type": "timeseries",
"title": "CPU Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"spanNulls": false,
"showPoints": "never"
}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 10 }
},
{
"id": 17,
"type": "timeseries",
"title": "Memory Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"spanNulls": false,
"showPoints": "never"
}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 10 }
},
{
"id": 18,
"type": "timeseries",
"title": "Network Traffic — Cluster Total",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br-int|br-ex\"}[5m]))",
"refId": "A",
"legendFormat": "Receive"
},
{
"expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br-int|br-ex\"}[5m]))",
"refId": "B",
"legendFormat": "Transmit"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10,
"spanNulls": false,
"showPoints": "never"
}
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Receive" },
"properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Transmit" },
"properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["mean", "max"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 18 }
},
{
"id": 19,
"type": "timeseries",
"title": "Pod Phases Over Time",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Running\"} == 1)",
"refId": "A",
"legendFormat": "Running"
},
{
"expr": "count(kube_pod_status_phase{phase=\"Pending\"} == 1) or vector(0)",
"refId": "B",
"legendFormat": "Pending"
},
{
"expr": "count(kube_pod_status_phase{phase=\"Failed\"} == 1) or vector(0)",
"refId": "C",
"legendFormat": "Failed"
},
{
"expr": "count(kube_pod_status_phase{phase=\"Unknown\"} == 1) or vector(0)",
"refId": "D",
"legendFormat": "Unknown"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"lineWidth": 2,
"fillOpacity": 15,
"spanNulls": false,
"showPoints": "never"
}
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Running" },
"properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Pending" },
"properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Failed" },
"properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }]
},
{
"matcher": { "id": "byName", "options": "Unknown" },
"properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": {
"displayMode": "list",
"placement": "bottom",
"calcs": ["lastNotNull"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 18 }
}
]
}
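Every panel in the Cluster Overview file above is placed on Grafana's 24-column grid via its `gridPos` rectangle (`x`, `y`, `w`, `h`); overlapping rectangles are silently re-flowed on import, which scrambles a carefully arranged layout. A small overlap check can be sketched as follows; it is a standalone illustration, with the final panel (`id` 99) a deliberately clashing hypothetical, not taken from the dashboard.

```python
def overlaps(a: dict, b: dict) -> bool:
    """True if two Grafana gridPos rectangles intersect on the grid."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])

def find_overlaps(panels: list[dict]) -> list[tuple[int, int]]:
    """Return id pairs of panels whose gridPos rectangles collide."""
    hits = []
    for i, p in enumerate(panels):
        for q in panels[i + 1:]:
            if overlaps(p["gridPos"], q["gridPos"]):
                hits.append((p["id"], q["id"]))
    return hits

panels = [
    {"id": 1, "gridPos": {"h": 4, "w": 3, "x": 0, "y": 0}},   # Ready Nodes
    {"id": 2, "gridPos": {"h": 4, "w": 3, "x": 3, "y": 0}},   # Not Ready Nodes (adjacent, no overlap)
    {"id": 99, "gridPos": {"h": 4, "w": 6, "x": 2, "y": 0}},  # hypothetical panel that collides with both
]
print(find_overlaps(panels))
```

Note that panels 1 and 2 touch at x=3 without overlapping: the strict inequalities treat shared edges as legal, matching how adjacent tiles sit flush on the grid.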


@@ -0,0 +1,742 @@
{
"title": "Control Plane Health",
"uid": "okd-control-plane",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "control-plane"],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(apiserver_request_total, instance)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "API Server Instance",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "API Servers Up",
"description": "Number of kube-apiserver instances currently scraped and up. Healthy HA cluster = 3.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=~\".*apiserver.*\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Controller Managers Up",
"description": "kube-controller-manager instances up. In OKD only one holds the leader lease at a time; others are hot standbys.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=~\".*controller-manager.*\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Schedulers Up",
"description": "kube-scheduler instances up. One holds the leader lease; rest are standbys. 0 = no scheduling of new pods.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=~\".*scheduler.*\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "API 5xx Rate",
"description": "Server-side errors (5xx) across all apiserver instances per second. Any sustained non-zero value = apiserver internal fault.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m]))", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.01 },
{ "color": "red", "value": 1 }
]},
"unit": "reqps", "noValue": "0", "decimals": 3
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "Inflight — Mutating",
"description": "Current in-flight mutating requests (POST/PUT/PATCH/DELETE). Default OKD limit is ~1000. Hitting the limit = 429 errors for writes.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(apiserver_current_inflight_requests{request_kind=\"mutating\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 500 },
{ "color": "orange", "value": 750 },
{ "color": "red", "value": 900 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "Inflight — Read-Only",
"description": "Current in-flight non-mutating requests (GET/LIST/WATCH). Default OKD limit is ~3000. Hitting it = 429 for reads, impacting controllers and kubectl.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(apiserver_current_inflight_requests{request_kind=\"readOnly\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1500 },
{ "color": "orange", "value": 2200 },
{ "color": "red", "value": 2700 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "API Request p99 (non-WATCH)",
"description": "Overall p99 latency for all non-streaming verbs. >1s = noticeable kubectl sluggishness. >10s = controllers timing out on LIST/GET.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~\"WATCH|CONNECT\"}[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.5 },
{ "color": "orange", "value": 1 },
{ "color": "red", "value": 5 }
]},
"unit": "s", "noValue": "0", "decimals": 3
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "APIServer → etcd p99",
"description": "p99 time apiserver spends waiting on etcd calls. Spike here while WAL fsync is healthy = serialization or large object overhead.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(apiserver_storage_request_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.05 },
{ "color": "orange", "value": 0.2 },
{ "color": "red", "value": 0.5 }
]},
"unit": "s", "noValue": "0", "decimals": 4
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "API Server — Request Rates & Errors", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Request Rate by Verb",
"description": "Non-streaming calls per second broken down by verb. GET/LIST = read load from controllers. POST/PUT/PATCH/DELETE = write throughput. A sudden LIST spike = controller cache resync storm.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(verb)(rate(apiserver_request_total{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m]))",
"refId": "A", "legendFormat": "{{verb}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Error Rate by HTTP Status Code",
"description": "4xx/5xx responses per second by code. 429 = inflight limit hit (throttling). 422 = admission rejection or invalid object. 500/503 = internal apiserver fault or etcd unavailability.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(code)(rate(apiserver_request_total{instance=~\"$instance\",code=~\"[45]..\"}[5m]))",
"refId": "A", "legendFormat": "HTTP {{code}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 5 }
},
{
"id": 12, "type": "timeseries", "title": "In-Flight Requests — Mutating vs Read-Only",
"description": "Instantaneous count of requests being actively handled. The two series correspond to the two inflight limit buckets enforced by the apiserver's Priority and Fairness (APF) or legacy inflight settings.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(request_kind)(apiserver_current_inflight_requests{instance=~\"$instance\"})", "refId": "A", "legendFormat": "{{request_kind}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 5 }
},
{
"id": 13, "type": "row", "title": "API Server — Latency", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 14, "type": "timeseries", "title": "Request Latency — p50 / p95 / p99 (non-WATCH)",
"description": "Aggregated end-to-end request duration across all verbs except WATCH/CONNECT (which are unbounded streaming). A rising p99 without a matching rise in etcd latency = CPU saturation, admission webhook slowness, or serialization overhead.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])) by (le))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])) by (le))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])) by (le))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 14 }
},
{
"id": 15, "type": "timeseries", "title": "Request p99 Latency by Verb",
"description": "p99 latency broken out per verb. LIST is inherently slower than GET due to serializing full collections. A POST/PUT spike = heavy admission webhook chain or large object writes. DELETE spikes are usually caused by cascading GC finalizer storms.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum by(verb,le)(rate(apiserver_request_duration_seconds_bucket{instance=~\"$instance\",verb!~\"WATCH|CONNECT\"}[5m])))",
"refId": "A", "legendFormat": "{{verb}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 14 }
},
{
"id": 16, "type": "timeseries", "title": "APIServer → etcd Latency by Operation",
"description": "Time apiserver spends waiting on etcd, split by operation type (get, list, create, update, delete, watch). Elevated get/list = etcd read pressure. Elevated create/update = write bottleneck, likely correlated with WAL fsync latency.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(operation,le)(rate(apiserver_storage_request_duration_seconds_bucket[5m])))", "refId": "A", "legendFormat": "p50 — {{operation}}" },
{ "expr": "histogram_quantile(0.99, sum by(operation,le)(rate(apiserver_storage_request_duration_seconds_bucket[5m])))", "refId": "B", "legendFormat": "p99 — {{operation}}" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 14 }
},
{
"id": 17, "type": "row", "title": "API Server — Watches & Long-Running Requests", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 22 }
},
{
"id": 18, "type": "timeseries", "title": "Active Long-Running Requests (Watches) by Resource",
"description": "Instantaneous count of open WATCH streams grouped by resource. Each controller typically holds one WATCH per resource type per apiserver instance. A sudden drop = controller restart; a runaway climb = operator creating watches without cleanup.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(resource)(apiserver_longrunning_requests{instance=~\"$instance\",verb=\"WATCH\"})",
"refId": "A", "legendFormat": "{{resource}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 23 }
},
{
"id": 19, "type": "timeseries", "title": "Watch Events Dispatched Rate by Kind",
"description": "Watch events sent to all active watchers per second, by object kind. Persistent high rate for a specific kind = that resource type is churning heavily, increasing etcd load and controller reconcile frequency.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(kind)(rate(apiserver_watch_events_total{instance=~\"$instance\"}[5m]))",
"refId": "A", "legendFormat": "{{kind}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 23 }
},
{
"id": 20, "type": "timeseries", "title": "Watch Event Size — p50 / p95 / p99 by Kind",
"description": "Size of individual watch events dispatched to clients. Large events (MiB-scale) for Secrets or ConfigMaps = objects being stored with oversized data. Contributes to apiserver memory pressure and network saturation.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(kind,le)(rate(apiserver_watch_events_sizes_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50 — {{kind}}" },
{ "expr": "histogram_quantile(0.99, sum by(kind,le)(rate(apiserver_watch_events_sizes_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p99 — {{kind}}" }
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 23 }
},
{
"id": 21, "type": "row", "title": "Admission Webhooks", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 30 }
},
{
"id": 22, "type": "timeseries", "title": "Webhook Call Rate by Name",
"description": "Mutating and validating admission webhook invocations per second by webhook name. A webhook invoked on every write (e.g., a mutating webhook with no object selector) can be a major source of write latency amplification.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(name,type)(rate(apiserver_admission_webhook_request_total{instance=~\"$instance\"}[5m]))",
"refId": "A", "legendFormat": "{{type}} — {{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 31 }
},
{
"id": 23, "type": "timeseries", "title": "Webhook Latency p99 by Name",
"description": "p99 round-trip time per webhook call (network + webhook server processing). Default apiserver timeout is 10s; a webhook consistently near that limit causes cascading write latency for all resources it intercepts.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum by(name,le)(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{instance=~\"$instance\"}[5m])))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.5 },
{ "color": "red", "value": 2.0 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 31 }
},
{
"id": 24, "type": "timeseries", "title": "Webhook Rejection Rate by Name",
"description": "Rate of admission denials per webhook. A validating webhook rejecting requests is expected behavior; a sudden surge indicates either a newly enforced policy or a misbehaving webhook rejecting valid objects.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(name,error_type)(rate(apiserver_admission_webhook_rejection_count{instance=~\"$instance\"}[5m]))",
"refId": "A", "legendFormat": "{{name}} ({{error_type}})"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 31 }
},
{
"id": 25, "type": "row", "title": "kube-controller-manager", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 38 }
},
{
"id": 26, "type": "timeseries", "title": "Work Queue Depth by Controller",
"description": "Items waiting to be reconciled in each controller's work queue. Persistent non-zero depth = controller cannot keep up with the event rate. Identifies which specific controller is the bottleneck during overload incidents.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(15, sum by(name)(workqueue_depth{job=~\".*controller-manager.*\"}))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 50 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 39 }
},
{
"id": 27, "type": "timeseries", "title": "Work Queue Item Processing Duration p99 by Controller",
"description": "p99 time a work item spends being actively reconciled (inside the reconcile loop, excludes queue wait time). A slow reconcile = either the controller is doing expensive API calls or the etcd write path is slow.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum by(name,le)(rate(workqueue_work_duration_seconds_bucket{job=~\".*controller-manager.*\"}[5m])))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 39 }
},
{
"id": 28, "type": "timeseries", "title": "Work Queue Retry Rate by Controller",
"description": "Rate of items being re-queued after a failed reconciliation. A persistently high retry rate for a controller = it is encountering recurring errors on the same objects (e.g., API permission errors, webhook rejections, or resource conflicts).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(15, sum by(name)(rate(workqueue_retries_total{job=~\".*controller-manager.*\"}[5m])))",
"refId": "A", "legendFormat": "{{name}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 39 }
},
{
"id": 29, "type": "row", "title": "kube-scheduler", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 47 }
},
{
"id": 30, "type": "timeseries", "title": "Scheduling Attempt Rate by Result",
"description": "Outcomes of scheduling cycles per second. scheduled = pod successfully bound to a node. unschedulable = no node met the pod's constraints. error = scheduler internal failure (API error, timeout). Persistent unschedulable = cluster capacity or taints/affinity misconfiguration.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(result)(rate(scheduler_schedule_attempts_total[5m]))",
"refId": "A", "legendFormat": "{{result}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "scheduled" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "unschedulable" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "error" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 48 }
},
{
"id": 31, "type": "timeseries", "title": "Scheduling Latency — p50 / p95 / p99",
"description": "Time from when a pod enters the active queue to when a binding decision is made (does not include bind API call time). Includes filter, score, and reserve plugin execution time. Spike = expensive affinity rules, large number of nodes, or slow extender webhooks.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 48 }
},
{
"id": 32, "type": "timeseries", "title": "Pending Pods by Queue",
"description": "Pods waiting to be scheduled, split by internal queue. active = ready to be attempted now. backoff = recently failed, in exponential back-off. unschedulable = parked until cluster state changes. A growing unschedulable queue = systemic capacity or constraint problem.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(queue)(scheduler_pending_pods)",
"refId": "A", "legendFormat": "{{queue}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 50 }
]}
},
"overrides": [
{ "matcher": { "id": "byName", "options": "unschedulable" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "backoff" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "active" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 48 }
},
{
"id": 33, "type": "row", "title": "Process Resources — All Control Plane Components", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 55 }
},
{
"id": 34, "type": "timeseries", "title": "CPU Usage by Component",
"description": "Rate of CPU seconds consumed by each control plane process. apiserver CPU spike = surge in request volume or list serialization. controller-manager CPU spike = reconcile storm. scheduler CPU spike = large node count with complex affinity.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(job)(rate(process_cpu_seconds_total{job=~\".*apiserver.*\"}[5m]))", "refId": "A", "legendFormat": "apiserver — {{job}}" },
{ "expr": "sum by(job)(rate(process_cpu_seconds_total{job=~\".*controller-manager.*\"}[5m]))", "refId": "B", "legendFormat": "controller-manager — {{job}}" },
{ "expr": "sum by(job)(rate(process_cpu_seconds_total{job=~\".*scheduler.*\"}[5m]))", "refId": "C", "legendFormat": "scheduler — {{job}}" }
],
"fieldConfig": {
"defaults": {
"unit": "percentunit", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 56 }
},
{
"id": 35, "type": "timeseries", "title": "RSS Memory by Component",
"description": "Resident set size of each control plane process. apiserver memory is dominated by the watch cache size and serialization buffers. controller-manager memory = informer caches. Monotonically growing RSS without restarts = memory leak or unbounded cache growth.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(job)(process_resident_memory_bytes{job=~\".*apiserver.*\"})", "refId": "A", "legendFormat": "apiserver — {{job}}" },
{ "expr": "sum by(job)(process_resident_memory_bytes{job=~\".*controller-manager.*\"})", "refId": "B", "legendFormat": "controller-manager — {{job}}" },
{ "expr": "sum by(job)(process_resident_memory_bytes{job=~\".*scheduler.*\"})", "refId": "C", "legendFormat": "scheduler — {{job}}" }
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 56 }
},
{
"id": 36, "type": "timeseries", "title": "Goroutines by Component",
"description": "Number of live goroutines in each control plane process. Gradual upward drift = goroutine leak (often tied to unclosed watch streams or context leaks). A step-down = process restart. apiserver typically runs 200–600 goroutines; spikes above 1000 warrant investigation.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "sum by(job)(go_goroutines{job=~\".*apiserver.*\"})", "refId": "A", "legendFormat": "apiserver — {{job}}" },
{ "expr": "sum by(job)(go_goroutines{job=~\".*controller-manager.*\"})", "refId": "B", "legendFormat": "controller-manager — {{job}}" },
{ "expr": "sum by(job)(go_goroutines{job=~\".*scheduler.*\"})", "refId": "C", "legendFormat": "scheduler — {{job}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 56 }
}
]
}


@@ -0,0 +1,734 @@
{
"title": "etcd",
"uid": "okd-etcd",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "etcd"],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(etcd_server_has_leader, instance)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Instance",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Cluster Members",
"description": "Total number of etcd members currently reporting metrics.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(etcd_server_has_leader)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "green", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Has Leader",
"description": "min() of etcd_server_has_leader across all members. 0 = at least one member sees no leader — the cluster is degraded and may have lost quorum.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "min(etcd_server_has_leader)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]},
"unit": "short", "noValue": "0",
"mappings": [
{ "type": "value", "options": {
"0": { "text": "NO LEADER", "color": "red" },
"1": { "text": "OK", "color": "green" }
}}
]
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Leader Changes (1h)",
"description": "Number of leader elections in the last hour. ≥3 indicates cluster instability.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(changes(etcd_server_leader_changes_seen_total[1h]))", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 3 }
]},
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "DB Size (Max)",
"description": "Largest boltdb file size across all members. Default etcd quota is 8 GiB.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "max(etcd_mvcc_db_total_size_in_bytes)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 2147483648 },
{ "color": "orange", "value": 5368709120 },
{ "color": "red", "value": 7516192768 }
]},
"unit": "bytes", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "DB Fragmentation (Max)",
"description": "% of DB space that is allocated but unused. >50% → run etcdctl defrag.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "max((etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) / etcd_mvcc_db_total_size_in_bytes * 100)",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 25 },
{ "color": "orange", "value": 50 },
{ "color": "red", "value": 75 }
]},
"unit": "percent", "noValue": "0", "decimals": 1
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "Failed Proposals/s",
"description": "Rate of rejected Raft proposals. Any sustained non-zero value = cluster health problem.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "sum(rate(etcd_server_proposals_failed_total[5m]))", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 0.001 }
]},
"unit": "short", "noValue": "0", "decimals": 3
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "WAL Fsync p99",
"description": "99th percentile WAL flush-to-disk time. >10ms is concerning; >100ms = serious I/O bottleneck.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.01 },
{ "color": "orange", "value": 0.1 },
{ "color": "red", "value": 0.5 }
]},
"unit": "s", "noValue": "0", "decimals": 4
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "Backend Commit p99",
"description": "99th percentile boltdb commit time. >25ms = warning; >100ms = critical backend I/O pressure.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 0.025 },
{ "color": "orange", "value": 0.1 },
{ "color": "red", "value": 0.25 }
]},
"unit": "s", "noValue": "0", "decimals": 4
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Cluster Health", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Has Leader per Instance",
"description": "1 = member has a leader; 0 = member lost quorum. A dip to 0 marks the exact moment of a leader election.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_server_has_leader{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "max": 1.1,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false },
"mappings": [
{ "type": "value", "options": {
"0": { "text": "0 — no leader" },
"1": { "text": "1 — ok" }
}}
]
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": [] }
},
"gridPos": { "h": 6, "w": 8, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Leader Changes (cumulative)",
"description": "Monotonically increasing counter per member. A step jump = one leader election. Correlated jumps across members = cluster-wide event.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_server_leader_changes_seen_total{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "none" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull"] }
},
"gridPos": { "h": 6, "w": 8, "x": 8, "y": 5 }
},
{
"id": 12, "type": "timeseries", "title": "Slow Operations",
"description": "slow_apply: proposals applied slower than expected. slow_read_index: linearizable reads timing out. heartbeat_failures: Raft heartbeat send errors (network partition indicator).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "rate(etcd_server_slow_apply_total{instance=~\"$instance\"}[5m])", "refId": "A", "legendFormat": "Slow Apply — {{instance}}" },
{ "expr": "rate(etcd_server_slow_read_indexes_total{instance=~\"$instance\"}[5m])", "refId": "B", "legendFormat": "Slow Read Index — {{instance}}" },
{ "expr": "rate(etcd_server_heartbeat_send_failures_total{instance=~\"$instance\"}[5m])", "refId": "C", "legendFormat": "Heartbeat Failures — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 6, "w": 8, "x": 16, "y": 5 }
},
{
"id": 13, "type": "row", "title": "gRPC Traffic", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 11 }
},
{
"id": 14, "type": "timeseries", "title": "gRPC Request Rate by Method",
"description": "Unary calls/s per RPC method. High Put/Txn = heavy write load. High Range = heavy read load. High Watch = many controller watchers.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(grpc_method)(rate(grpc_server_started_total{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m]))",
"refId": "A", "legendFormat": "{{grpc_method}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 12 }
},
{
"id": 15, "type": "timeseries", "title": "gRPC Error Rate by Status Code",
"description": "Non-OK responses by gRPC status code. RESOURCE_EXHAUSTED = overloaded. UNAVAILABLE = leader election. DEADLINE_EXCEEDED = latency spike.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(grpc_code)(rate(grpc_server_handled_total{job=~\".*etcd.*\",grpc_code!=\"OK\"}[5m]))",
"refId": "A", "legendFormat": "{{grpc_code}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 12 }
},
{
"id": 16, "type": "timeseries", "title": "gRPC Request Latency (p50 / p95 / p99)",
"description": "Unary call handling duration. p99 > 100ms for Put/Txn indicates disk or CPU pressure. p99 > 500ms will cause kube-apiserver timeouts.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m])) by (le))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m])) by (le))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\",grpc_type=\"unary\"}[5m])) by (le))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 12 }
},
{
"id": 17, "type": "row", "title": "Raft Proposals", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 20 }
},
{
"id": 18, "type": "timeseries", "title": "Proposals Committed vs Applied",
"description": "Committed = agreed by Raft quorum. Applied = persisted to boltdb. A widening gap between the two = backend apply backlog (disk too slow to keep up).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "rate(etcd_server_proposals_committed_total{instance=~\"$instance\"}[5m])", "refId": "A", "legendFormat": "Committed — {{instance}}" },
{ "expr": "rate(etcd_server_proposals_applied_total{instance=~\"$instance\"}[5m])", "refId": "B", "legendFormat": "Applied — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 0, "y": 21 }
},
{
"id": 19, "type": "timeseries", "title": "Proposals Pending",
"description": "In-flight Raft proposals not yet committed. Consistently high (>5) = cluster cannot keep up with write throughput.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_server_proposals_pending{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line+area" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 5 },
{ "color": "red", "value": 10 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 8, "y": 21 }
},
{
"id": 20, "type": "timeseries", "title": "Failed Proposals Rate",
"description": "Raft proposals that were rejected. Root causes: quorum loss, leader timeout, network partition between members.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_server_proposals_failed_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false,
"thresholdsStyle": { "mode": "line" }
},
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 0.001 }
]}
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 8, "x": 16, "y": 21 }
},
{
"id": 21, "type": "row", "title": "Disk I/O", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 28 }
},
{
"id": 22, "type": "timeseries", "title": "WAL Fsync Duration (p50 / p95 / p99) per Instance",
"description": "Time to flush the write-ahead log to disk. etcd is extremely sensitive to WAL latency. >10ms p99 = storage is the bottleneck. Correlates directly with Raft commit latency.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le,instance)(rate(etcd_disk_wal_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50 — {{instance}}" },
{ "expr": "histogram_quantile(0.95, sum by(le,instance)(rate(etcd_disk_wal_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95 — {{instance}}" },
{ "expr": "histogram_quantile(0.99, sum by(le,instance)(rate(etcd_disk_wal_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99 — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 29 }
},
{
"id": 23, "type": "timeseries", "title": "Backend Commit Duration (p50 / p95 / p99) per Instance",
"description": "Time for boltdb to commit a batch transaction. A spike here while WAL is healthy = backend I/O saturation or boltdb lock contention. Triggers apply backlog.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le,instance)(rate(etcd_disk_backend_commit_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50 — {{instance}}" },
{ "expr": "histogram_quantile(0.95, sum by(le,instance)(rate(etcd_disk_backend_commit_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95 — {{instance}}" },
{ "expr": "histogram_quantile(0.99, sum by(le,instance)(rate(etcd_disk_backend_commit_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99 — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 29 }
},
{
"id": 24, "type": "row", "title": "Network (Peer & Client)", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 37 }
},
{
"id": 25, "type": "timeseries", "title": "Peer RX Rate",
"description": "Bytes received from Raft peers (log replication + heartbeats). A burst during a quiet period = large snapshot being streamed to a recovering member.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_peer_received_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 0, "y": 38 }
},
{
"id": 26, "type": "timeseries", "title": "Peer TX Rate",
"description": "Bytes sent to Raft peers. Leader will have higher TX than followers (it replicates entries to all members).",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_peer_sent_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 6, "y": 38 }
},
{
"id": 27, "type": "timeseries", "title": "Client gRPC Received",
"description": "Bytes received from API clients (kube-apiserver, operators). Spike = large write burst from controllers or kubectl apply.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_client_grpc_received_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 12, "y": 38 }
},
{
"id": 28, "type": "timeseries", "title": "Client gRPC Sent",
"description": "Bytes sent to API clients (responses + watch events). Persistently high = many active Watch streams or large objects being served.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "rate(etcd_network_client_grpc_sent_bytes_total{instance=~\"$instance\"}[5m])",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 6, "x": 18, "y": 38 }
},
{
"id": 29, "type": "row", "title": "DB Size & Process Resources", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 45 }
},
{
"id": 30, "type": "timeseries", "title": "DB Total vs In-Use Size per Instance",
"description": "Total = allocated boltdb file size. In Use = live key data. The gap between them = fragmentation. Steady growth of Total = compaction not keeping up with key churn.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "etcd_mvcc_db_total_size_in_bytes{instance=~\"$instance\"}", "refId": "A", "legendFormat": "Total — {{instance}}" },
{ "expr": "etcd_mvcc_db_total_size_in_use_in_bytes{instance=~\"$instance\"}", "refId": "B", "legendFormat": "In Use — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 46 }
},
{
"id": 31, "type": "timeseries", "title": "Process Resident Memory (RSS)",
"description": "Physical RAM consumed by the etcd process. Monotonically growing RSS = memory leak or oversized watch cache. Typical healthy range: 500 MiB2 GiB depending on cluster size.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "etcd_process_resident_memory_bytes{instance=~\"$instance\"}",
"refId": "A", "legendFormat": "{{instance}}"
}],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 46 }
},
{
"id": 32, "type": "timeseries", "title": "Open File Descriptors vs Limit",
"description": "Open FD count (solid) and process FD limit (dashed). Approaching the limit will cause WAL file creation and new client connections to fail.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "etcd_process_open_fds{instance=~\"$instance\"}", "refId": "A", "legendFormat": "Open — {{instance}}" },
{ "expr": "etcd_process_max_fds{instance=~\"$instance\"}", "refId": "B", "legendFormat": "Limit — {{instance}}" }
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{
"matcher": { "id": "byRegexp", "options": "^Limit.*" },
"properties": [
{ "id": "custom.lineWidth", "value": 1 },
{ "id": "custom.lineStyle", "value": { "fill": "dash", "dash": [6, 4] } },
{ "id": "custom.fillOpacity","value": 0 }
]
}
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 46 }
},
{
"id": 33, "type": "row", "title": "Snapshots", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 54 }
},
{
"id": 34, "type": "timeseries", "title": "Snapshot Save Duration (p50 / p95 / p99)",
"description": "Time to write a full snapshot of the boltdb to disk. Slow saves delay Raft log compaction, causing the WAL to grow unboundedly and members to fall further behind.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le)(rate(etcd_debugging_snap_save_total_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum by(le)(rate(etcd_debugging_snap_save_total_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum by(le)(rate(etcd_debugging_snap_save_total_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 55 }
},
{
"id": 35, "type": "timeseries", "title": "Snapshot DB Fsync Duration (p50 / p95 / p99)",
"description": "Time to fsync the snapshot file itself. Distinct from WAL fsync: this is flushing the entire boltdb copy to disk after a snapshot is taken.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{ "expr": "histogram_quantile(0.50, sum by(le)(rate(etcd_snap_db_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "A", "legendFormat": "p50" },
{ "expr": "histogram_quantile(0.95, sum by(le)(rate(etcd_snap_db_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "B", "legendFormat": "p95" },
{ "expr": "histogram_quantile(0.99, sum by(le)(rate(etcd_snap_db_fsync_duration_seconds_bucket{instance=~\"$instance\"}[5m])))", "refId": "C", "legendFormat": "p99" }
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 55 }
}
]
}
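The "DB Fragmentation (Max)" stat above derives fragmentation as `(total - in_use) / total * 100`, computed per member and reduced with `max()`. As a hedged sketch of that arithmetic and of the panel's threshold mapping (the function names and sample byte counts here are mine, not real cluster data):

```python
# Sketch: the fragmentation arithmetic behind the "DB Fragmentation (Max)"
# stat, computed per member and reduced with max() as the PromQL does.

def fragmentation_pct(total_bytes, in_use_bytes):
    """(allocated - live) / allocated, as a percentage."""
    return (total_bytes - in_use_bytes) / total_bytes * 100

def severity(pct):
    """Map a percentage onto the panel's threshold steps (75/50/25)."""
    for bound, color in [(75, "red"), (50, "orange"), (25, "yellow")]:
        if pct >= bound:
            return color
    return "green"

# Illustrative per-member (total, in_use) byte counts.
members = {
    "etcd-0": (1_600_000_000, 1_280_000_000),  # 20% fragmented
    "etcd-1": (1_600_000_000, 640_000_000),    # 60% fragmented
}
worst = max(fragmentation_pct(t, u) for t, u in members.values())
print(round(worst, 1), severity(worst))  # → 60.0 orange
```

When the worst member crosses the orange/red steps, the panel description's remedy applies: compact revisions and run `etcdctl defrag` to return the allocated space to the OS.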


@@ -0,0 +1,945 @@
{
"title": "Networking",
"uid": "okd-networking",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "networking"],
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(kube_pod_info, namespace)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Namespace",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Network RX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "Bps", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Network TX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "Bps", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "RX Errors/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_receive_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "TX Errors/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_transmit_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "RX Drops/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_receive_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "TX Drops/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(container_network_transmit_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "pps", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "DNS Queries/s",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(coredns_dns_requests_total[5m]))",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "reqps", "noValue": "0", "decimals": 1
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "DNS Error %",
"description": "Percentage of DNS responses with non-NOERROR rcode over the last 5 minutes.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(coredns_dns_responses_total{rcode!=\"NOERROR\"}[5m])) / sum(rate(coredns_dns_responses_total[5m])) * 100",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]},
"unit": "percent", "noValue": "0", "decimals": 2
}
},
"options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Network I/O", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10, "type": "timeseries", "title": "Receive Rate by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 5 }
},
{
"id": 11, "type": "timeseries", "title": "Transmit Rate by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 5 }
},
{
"id": 12, "type": "row", "title": "Top Pod Consumers", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 13, "type": "timeseries", "title": "Top 10 Pods — RX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(namespace,pod)(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m])))",
"refId": "A", "legendFormat": "{{namespace}} / {{pod}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 14 }
},
{
"id": 14, "type": "timeseries", "title": "Top 10 Pods — TX Rate",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "topk(10, sum by(namespace,pod)(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m])))",
"refId": "A", "legendFormat": "{{namespace}} / {{pod}}"
}],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 14 }
},
{
"id": 15,
"type": "table",
"title": "Pod Network I/O Summary",
"description": "Current RX/TX rates, errors and drops per pod. Sorted by RX rate descending.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,pod)(rate(container_network_receive_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "B", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_receive_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "C", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_transmit_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "D", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_receive_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "E", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,pod)(rate(container_network_transmit_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "F", "instant": true, "format": "table", "legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": { "include": { "names": ["namespace", "pod", "Value"] } }
},
{
"id": "joinByField",
"options": { "byField": "pod", "mode": "outer" }
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true,
"namespace 4": true,
"namespace 5": true
},
"renameByName": {
"namespace": "Namespace",
"pod": "Pod",
"Value": "RX Rate",
"Value 1": "TX Rate",
"Value 2": "RX Errors/s",
"Value 3": "TX Errors/s",
"Value 4": "RX Drops/s",
"Value 5": "TX Drops/s"
},
"indexByName": {
"namespace": 0,
"pod": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5,
"Value 4": 6,
"Value 5": 7
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "RX Rate", "desc": true }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 160 }]
},
{
"matcher": { "id": "byName", "options": "Pod" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 260 }]
},
{
"matcher": { "id": "byRegexp", "options": "^RX Rate$|^TX Rate$" },
"properties": [
{ "id": "unit", "value": "Bps" },
{ "id": "custom.displayMode", "value": "color-background-solid" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10000000 },
{ "color": "orange", "value": 100000000 },
{ "color": "red", "value": 500000000 }
]}}
]
},
{
"matcher": { "id": "byRegexp", "options": "^RX Errors/s$|^TX Errors/s$" },
"properties": [
{ "id": "unit", "value": "pps" },
{ "id": "decimals", "value": 3 },
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 0.001 }
]}}
]
},
{
"matcher": { "id": "byRegexp", "options": "^RX Drops/s$|^TX Drops/s$" },
"properties": [
{ "id": "unit", "value": "pps" },
{ "id": "decimals", "value": 3 },
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "orange", "value": 0.001 }
]}}
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 22 }
},
{
"id": 16, "type": "row", "title": "Errors & Packet Loss", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 30 }
},
{
"id": 17, "type": "timeseries", "title": "RX Errors by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_receive_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 31 }
},
{
"id": 18, "type": "timeseries", "title": "TX Errors by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_transmit_errors_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 31 }
},
{
"id": 19, "type": "timeseries", "title": "RX Packet Drops by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_receive_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 38 }
},
{
"id": 20, "type": "timeseries", "title": "TX Packet Drops by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(namespace)(rate(container_network_transmit_packets_dropped_total{namespace=~\"$namespace\",pod!=\"\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}],
"fieldConfig": {
"defaults": {
"unit": "pps", "min": 0, "decimals": 3,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 38 }
},
{
"id": 21, "type": "row", "title": "DNS (CoreDNS)", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 45 }
},
{
"id": 22, "type": "timeseries", "title": "DNS Request Rate by Query Type",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(type)(rate(coredns_dns_requests_total[5m]))",
"refId": "A", "legendFormat": "{{type}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 46 }
},
{
"id": 23, "type": "timeseries", "title": "DNS Response Rate by Rcode",
"description": "NOERROR = healthy. NXDOMAIN = name not found. SERVFAIL = upstream error.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum by(rcode)(rate(coredns_dns_responses_total[5m]))",
"refId": "A", "legendFormat": "{{rcode}}"
}],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "NOERROR" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "NXDOMAIN" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "SERVFAIL" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "REFUSED" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 46 }
},
{
"id": 24, "type": "timeseries", "title": "DNS Request Latency (p50 / p95 / p99)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))",
"refId": "A", "legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))",
"refId": "B", "legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))",
"refId": "C", "legendFormat": "p99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s", "min": 0, "decimals": 4,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 46 }
},
{
"id": 25, "type": "timeseries", "title": "DNS Cache Hit Ratio (%)",
"description": "High hit ratio = CoreDNS is serving responses from cache, reducing upstream load.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(rate(coredns_cache_hits_total[5m])) / (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m]))) * 100",
"refId": "A", "legendFormat": "Cache Hit %"
}],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 50 },
{ "color": "green", "value": 80 }
]},
"custom": { "lineWidth": 2, "fillOpacity": 20, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "single" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "lastNotNull"] }
},
"gridPos": { "h": 7, "w": 12, "x": 0, "y": 54 }
},
{
"id": 26, "type": "timeseries", "title": "DNS Forward Request Rate",
"description": "Queries CoreDNS is forwarding upstream. Spike here with cache miss spike = upstream DNS pressure.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(coredns_forward_requests_total[5m]))",
"refId": "A", "legendFormat": "Forward Requests/s"
},
{
"expr": "sum(rate(coredns_forward_responses_duration_seconds_count[5m]))",
"refId": "B", "legendFormat": "Forward Responses/s"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 7, "w": 12, "x": 12, "y": 54 }
},
{
"id": 27, "type": "row", "title": "Services & Endpoints", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 61 }
},
{
"id": 28, "type": "stat", "title": "Total Services",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "count(kube_service_info{namespace=~\"$namespace\"})",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 8, "x": 0, "y": 62 }
},
{
"id": 29, "type": "stat", "title": "Endpoint Addresses Available",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(kube_endpoint_address_available{namespace=~\"$namespace\"})",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 8, "x": 8, "y": 62 }
},
{
"id": 30, "type": "stat", "title": "Endpoint Addresses Not Ready",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{
"expr": "sum(kube_endpoint_address_not_ready{namespace=~\"$namespace\"}) or vector(0)",
"refId": "A", "legendFormat": ""
}],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 8, "x": 16, "y": 62 }
},
{
"id": 31,
"type": "table",
"title": "Endpoint Availability",
"description": "Per-endpoint available vs not-ready address counts. Red Not Ready = pods backing this service are unhealthy.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,endpoint)(kube_endpoint_address_available{namespace=~\"$namespace\"})",
"refId": "A", "instant": true, "format": "table", "legendFormat": ""
},
{
"expr": "sum by(namespace,endpoint)(kube_endpoint_address_not_ready{namespace=~\"$namespace\"})",
"refId": "B", "instant": true, "format": "table", "legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": { "include": { "names": ["namespace", "endpoint", "Value"] } }
},
{
"id": "joinByField",
"options": { "byField": "endpoint", "mode": "outer" }
},
{
"id": "organize",
"options": {
"excludeByName": { "namespace 1": true },
"renameByName": {
"namespace": "Namespace",
"endpoint": "Endpoint",
"Value": "Available",
"Value 1": "Not Ready"
},
"indexByName": {
"namespace": 0,
"endpoint": 1,
"Value": 2,
"Value 1": 3
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Not Ready", "desc": true }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "Endpoint" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 220 }]
},
{
"matcher": { "id": "byName", "options": "Available" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] } }
]
},
{
"matcher": { "id": "byName", "options": "Not Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 66 }
},
{
"id": 32, "type": "row", "title": "OKD Router / Ingress (HAProxy)", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 74 }
},
{
"id": 33, "type": "timeseries", "title": "Router HTTP Request Rate by Code",
"description": "Requires HAProxy router metrics to be scraped (port 1936). OKD exposes these via the openshift-ingress ServiceMonitor.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(code)(rate(haproxy_backend_http_responses_total[5m]))",
"refId": "A", "legendFormat": "HTTP {{code}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "HTTP 2xx" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "HTTP 4xx" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "HTTP 5xx" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 75 }
},
{
"id": 34, "type": "timeseries", "title": "Router 4xx + 5xx Error Rate (%)",
"description": "Client error (4xx) and server error (5xx) rates as a percentage of all requests.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(haproxy_backend_http_responses_total{code=\"4xx\"}[5m])) / sum(rate(haproxy_backend_http_responses_total[5m])) * 100",
"refId": "A", "legendFormat": "4xx %"
},
{
"expr": "sum(rate(haproxy_backend_http_responses_total{code=\"5xx\"}[5m])) / sum(rate(haproxy_backend_http_responses_total[5m])) * 100",
"refId": "B", "legendFormat": "5xx %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 15, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]}
},
"overrides": [
{ "matcher": { "id": "byName", "options": "4xx %" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "5xx %" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 75 }
},
{
"id": 35, "type": "timeseries", "title": "Router Bytes In / Out",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(rate(haproxy_frontend_bytes_in_total[5m]))",
"refId": "A", "legendFormat": "Bytes In"
},
{
"expr": "sum(rate(haproxy_frontend_bytes_out_total[5m]))",
"refId": "B", "legendFormat": "Bytes Out"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "Bytes In" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Bytes Out" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 83 }
},
{
"id": 36,
"type": "table",
"title": "Router Backend Server Status",
"description": "HAProxy backend servers (routes). Value 0 = DOWN, 1 = UP.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "haproxy_server_up",
"refId": "A", "instant": true, "format": "table", "legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": { "include": { "names": ["proxy", "server", "Value"] } }
},
{
"id": "organize",
"options": {
"excludeByName": {},
"renameByName": {
"proxy": "Backend",
"server": "Server",
"Value": "Status"
},
"indexByName": { "proxy": 0, "server": 1, "Value": 2 }
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Status", "desc": false }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Backend" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 260 }]
},
{
"matcher": { "id": "byName", "options": "Server" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "Status" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "mappings", "value": [
{ "type": "value", "options": { "0": { "text": "DOWN", "color": "red" } } },
{ "type": "value", "options": { "1": { "text": "UP", "color": "green" } } }
]},
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]}}
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 83 }
}
]
}

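Dashboard JSON like the above is easy to break silently when hand-editing (duplicate panel ids, panels overflowing Grafana's 24-column grid, a panel pointing at the wrong datasource uid). A minimal sanity-check sketch, assuming only the fields visible in these files (`id`, `gridPos`, `datasource.uid`); the validator function and its name are illustrative, not part of this repo:

```python
# Illustrative validator for Grafana dashboard JSON of the shape shown
# above. Assumption: every panel should use the "Prometheus-Cluster"
# datasource uid, as all panels in these dashboards do.
EXPECTED_UID = "Prometheus-Cluster"

def validate_dashboard(dash: dict) -> list[str]:
    """Return a list of human-readable problems found in a dashboard dict."""
    problems = []
    seen_ids = set()
    for panel in dash.get("panels", []):
        pid = panel.get("id")
        if pid in seen_ids:
            problems.append(f"duplicate panel id {pid}")
        seen_ids.add(pid)
        # Grafana lays panels out on a 24-column grid; x + w must fit.
        gp = panel.get("gridPos", {})
        if gp.get("x", 0) + gp.get("w", 0) > 24:
            problems.append(f"panel {pid} overflows the 24-column grid")
        ds = panel.get("datasource") or {}
        if ds and ds.get("uid") != EXPECTED_UID:
            problems.append(f"panel {pid} uses unexpected datasource {ds.get('uid')}")
    return problems

# Tiny self-contained example: second panel reuses id 1 and overflows the grid.
sample = {
    "panels": [
        {"id": 1, "gridPos": {"h": 4, "w": 3, "x": 0, "y": 0},
         "datasource": {"type": "prometheus", "uid": "Prometheus-Cluster"}},
        {"id": 1, "gridPos": {"h": 4, "w": 3, "x": 22, "y": 0},
         "datasource": {"type": "prometheus", "uid": "Prometheus-Cluster"}},
    ]
}
print(validate_dashboard(sample))
```

Running a check like this in the repo's CI (the "Run Check Script" workflow in the commits above) would catch layout and datasource drift before the dashboards reach the Grafana operator.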
@@ -0,0 +1,627 @@
{
"title": "Node Health",
"uid": "okd-node-health",
"schemaVersion": 36,
"version": 2,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "node", "health"],
"templating": {
"list": [
{
"name": "node",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(kube_node_info, node)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Node",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1,
"type": "stat",
"title": "Total Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_info{node=~\"$node\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2,
"type": "stat",
"title": "Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"$node\"} == 1)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3,
"type": "stat",
"title": "Not Ready Nodes",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"Ready\",status=\"false\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4,
"type": "stat",
"title": "Memory Pressure",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5,
"type": "stat",
"title": "Disk Pressure",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"DiskPressure\",status=\"true\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6,
"type": "stat",
"title": "PID Pressure",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_status_condition{condition=\"PIDPressure\",status=\"true\",node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7,
"type": "stat",
"title": "Unschedulable",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_node_spec_unschedulable{node=~\"$node\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8,
"type": "stat",
"title": "Kubelet Up",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(up{job=\"kubelet\",metrics_path=\"/metrics\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9,
"type": "table",
"title": "Node Conditions",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"Ready\",status=\"true\",node=~\"$node\"})",
"refId": "A",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\",node=~\"$node\"})",
"refId": "B",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"DiskPressure\",status=\"true\",node=~\"$node\"})",
"refId": "C",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_status_condition{condition=\"PIDPressure\",status=\"true\",node=~\"$node\"})",
"refId": "D",
"legendFormat": "{{node}}",
"instant": true
},
{
"expr": "sum by(node) (kube_node_spec_unschedulable{node=~\"$node\"})",
"refId": "E",
"legendFormat": "{{node}}",
"instant": true
}
],
"transformations": [
{
"id": "labelsToFields",
"options": { "mode": "columns" }
},
{
"id": "joinByField",
"options": { "byField": "node", "mode": "outer" }
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"Time 1": true,
"Time 2": true,
"Time 3": true,
"Time 4": true,
"Time 5": true
},
"renameByName": {
"node": "Node",
"Value #A": "Ready",
"Value #B": "Mem Pressure",
"Value #C": "Disk Pressure",
"Value #D": "PID Pressure",
"Value #E": "Unschedulable"
},
"indexByName": {
"node": 0,
"Value #A": 1,
"Value #B": 2,
"Value #C": 3,
"Value #D": 4,
"Value #E": 5
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": { "displayMode": "color-background", "align": "center" }
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Node" },
"properties": [
{ "id": "custom.displayMode", "value": "auto" },
{ "id": "custom.align", "value": "left" },
{ "id": "custom.width", "value": 200 }
]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] }
},
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "mappings",
"value": [
{
"type": "value",
"options": {
"0": { "text": "✗ Not Ready", "color": "red", "index": 0 },
"1": { "text": "✓ Ready", "color": "green", "index": 1 }
}
}
]
}
]
},
{
"matcher": { "id": "byRegexp", "options": ".*Pressure" },
"properties": [
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] }
},
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "mappings",
"value": [
{
"type": "value",
"options": {
"0": { "text": "✓ OK", "color": "green", "index": 0 },
"1": { "text": "⚠ Active", "color": "red", "index": 1 }
}
}
]
}
]
},
{
"matcher": { "id": "byName", "options": "Unschedulable" },
"properties": [
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 1 }] }
},
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "mappings",
"value": [
{
"type": "value",
"options": {
"0": { "text": "✓ Schedulable", "color": "green", "index": 0 },
"1": { "text": "⚠ Cordoned", "color": "yellow", "index": 1 }
}
}
]
}
]
}
]
},
"options": { "sortBy": [{ "displayName": "Node", "desc": false }] },
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10,
"type": "timeseries",
"title": "CPU Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 12 }
},
{
"id": 11,
"type": "bargauge",
"title": "CPU Usage \u2014 Current",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 12 }
},
{
"id": 12,
"type": "timeseries",
"title": "Memory Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 75 }, { "color": "red", "value": 90 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 20 }
},
{
"id": 13,
"type": "bargauge",
"title": "Memory Usage \u2014 Current",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 75 }, { "color": "red", "value": 90 }] }
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 20 }
},
{
"id": 14,
"type": "timeseries",
"title": "Root Disk Usage per Node (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"}))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 28 }
},
{
"id": 15,
"type": "bargauge",
"title": "Root Disk Usage \u2014 Current",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (1 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"}))",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 100,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 70 }, { "color": "red", "value": 85 }] }
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 28 }
},
{
"id": 16,
"type": "timeseries",
"title": "Network Traffic per Node",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(instance) (rate(node_network_receive_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br.*\"}[5m]))",
"refId": "A",
"legendFormat": "rx {{instance}}"
},
{
"expr": "sum by(instance) (rate(node_network_transmit_bytes_total{device!~\"lo|veth.*|tun.*|ovn.*|br.*\"}[5m]))",
"refId": "B",
"legendFormat": "tx {{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 36 }
},
{
"id": 17,
"type": "bargauge",
"title": "Pods per Node",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "count by(node) (kube_pod_info{node=~\"$node\"})",
"refId": "A",
"legendFormat": "{{node}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"min": 0,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 100 },
{ "color": "red", "value": 200 }
]
}
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 36 }
},
{
"id": 18,
"type": "timeseries",
"title": "System Load Average (1m) per Node",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "node_load1",
"refId": "A",
"legendFormat": "1m \u2014 {{instance}}"
},
{
"expr": "node_load5",
"refId": "B",
"legendFormat": "5m \u2014 {{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 44 }
},
{
"id": 19,
"type": "bargauge",
"title": "Node Uptime",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "time() - node_boot_time_seconds",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"min": 0,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 300 },
{ "color": "green", "value": 3600 }
]
}
}
},
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": false,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 44 }
}
]
}


@@ -0,0 +1,596 @@
{
"title": "Storage Health",
"uid": "storage-health",
"schemaVersion": 36,
"version": 1,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"panels": [
{
"type": "row",
"id": 1,
"title": "PVC / PV Status",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }
},
{
"type": "stat",
"id": 2,
"title": "Bound PVCs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Bound\"}) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "green", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 0, "y": 1 }
},
{
"type": "stat",
"id": 3,
"title": "Pending PVCs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Pending\"}) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 4, "y": 1 }
},
{
"type": "stat",
"id": 4,
"title": "Lost PVCs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Lost\"}) or vector(0)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 8, "y": 1 }
},
{
"type": "stat",
"id": 5,
"title": "Bound PVs / Available PVs",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolume_status_phase{phase=\"Bound\"}) or vector(0)",
"refId": "A",
"legendFormat": "Bound"
},
{
"expr": "sum(kube_persistentvolume_status_phase{phase=\"Available\"}) or vector(0)",
"refId": "B",
"legendFormat": "Available"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 12, "y": 1 }
},
{
"type": "stat",
"id": 6,
"title": "Ceph Cluster Health",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "ceph_health_status",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 2 }
]
},
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "HEALTH_OK", "index": 0 },
"1": { "text": "HEALTH_WARN", "index": 1 },
"2": { "text": "HEALTH_ERR", "index": 2 }
}
}
]
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "value"
},
"gridPos": { "h": 5, "w": 4, "x": 16, "y": 1 }
},
{
"type": "stat",
"id": 7,
"title": "OSDs Up / Total",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(ceph_osd_up) or vector(0)",
"refId": "A",
"legendFormat": "Up"
},
{
"expr": "count(ceph_osd_metadata) or vector(0)",
"refId": "B",
"legendFormat": "Total"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "green", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "background",
"graphMode": "none",
"textMode": "auto"
},
"gridPos": { "h": 5, "w": 4, "x": 20, "y": 1 }
},
{
"type": "row",
"id": 8,
"title": "Cluster Capacity",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 6 }
},
{
"type": "gauge",
"id": 9,
"title": "Ceph Cluster Used (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * (ceph_cluster_total_used_raw_bytes or ceph_cluster_total_used_bytes) / ceph_cluster_total_bytes",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"showThresholdLabels": true,
"showThresholdMarkers": true
},
"gridPos": { "h": 8, "w": 5, "x": 0, "y": 7 }
},
{
"type": "stat",
"id": 10,
"title": "Ceph Capacity — Total / Available",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "ceph_cluster_total_bytes",
"refId": "A",
"legendFormat": "Total"
},
{
"expr": "ceph_cluster_total_bytes - (ceph_cluster_total_used_raw_bytes or ceph_cluster_total_used_bytes)",
"refId": "B",
"legendFormat": "Available"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes",
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
}
}
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"colorMode": "value",
"graphMode": "none",
"textMode": "auto",
"orientation": "vertical"
},
"gridPos": { "h": 8, "w": 4, "x": 5, "y": 7 }
},
{
"type": "bargauge",
"id": 11,
"title": "PV Allocated Capacity by Storage Class (Bound)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by (storageclass) (\n kube_persistentvolume_capacity_bytes\n * on(persistentvolume) group_left(storageclass)\n kube_persistentvolume_status_phase{phase=\"Bound\"}\n)",
"refId": "A",
"legendFormat": "{{storageclass}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes",
"color": { "mode": "palette-classic" },
"thresholds": {
"mode": "absolute",
"steps": [{ "color": "blue", "value": null }]
}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 7, "x": 9, "y": 7 }
},
{
"type": "piechart",
"id": 12,
"title": "PVC Phase Distribution",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Bound\"}) or vector(0)",
"refId": "A",
"legendFormat": "Bound"
},
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Pending\"}) or vector(0)",
"refId": "B",
"legendFormat": "Pending"
},
{
"expr": "sum(kube_persistentvolumeclaim_status_phase{phase=\"Lost\"}) or vector(0)",
"refId": "C",
"legendFormat": "Lost"
}
],
"fieldConfig": {
"defaults": { "color": { "mode": "palette-classic" } }
},
"options": {
"reduceOptions": { "calcs": ["lastNotNull"] },
"pieType": "pie",
"legend": {
"displayMode": "table",
"placement": "right",
"values": ["value", "percent"]
}
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 7 }
},
{
"type": "row",
"id": 13,
"title": "Ceph Performance",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 15 }
},
{
"type": "timeseries",
"id": 14,
"title": "Ceph Pool IOPS (Read / Write)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "rate(ceph_pool_rd[5m])",
"refId": "A",
"legendFormat": "Read — pool {{pool_id}}"
},
{
"expr": "rate(ceph_pool_wr[5m])",
"refId": "B",
"legendFormat": "Write — pool {{pool_id}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 8 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
},
{
"type": "timeseries",
"id": 15,
"title": "Ceph Pool Throughput (Read / Write)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "rate(ceph_pool_rd_bytes[5m])",
"refId": "A",
"legendFormat": "Read — pool {{pool_id}}"
},
{
"expr": "rate(ceph_pool_wr_bytes[5m])",
"refId": "B",
"legendFormat": "Write — pool {{pool_id}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 8 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
},
{
"type": "row",
"id": 16,
"title": "Ceph OSD & Pool Details",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 24 }
},
{
"type": "timeseries",
"id": 17,
"title": "Ceph Pool Space Used (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 * ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail)",
"refId": "A",
"legendFormat": "Pool {{pool_id}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
},
"custom": { "lineWidth": 2, "fillOpacity": 10 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 25 }
},
{
"type": "bargauge",
"id": 18,
"title": "OSD Status per Daemon (green = Up, red = Down)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "ceph_osd_up",
"refId": "A",
"legendFormat": "{{ceph_daemon}}"
}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 1,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"mappings": [
{
"type": "value",
"options": {
"0": { "text": "DOWN", "index": 0 },
"1": { "text": "UP", "index": 1 }
}
}
]
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "basic",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 25 }
},
{
"type": "row",
"id": 19,
"title": "Node Disk Usage",
"collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 33 }
},
{
"type": "timeseries",
"id": 20,
"title": "Node Root Disk Usage Over Time (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"tmpfs\"} * 100)",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "palette-classic" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
},
"custom": { "lineWidth": 2, "fillOpacity": 10 }
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 34 }
},
{
"type": "bargauge",
"id": 21,
"title": "Current Disk Usage — All Nodes & Mountpoints",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "100 - (node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay|squashfs\"} / node_filesystem_size_bytes{fstype!~\"tmpfs|overlay|squashfs\"} * 100)",
"refId": "A",
"legendFormat": "{{instance}} — {{mountpoint}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
}
},
"options": {
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"] },
"displayMode": "gradient",
"showUnfilled": true
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 34 }
}
]
}


@@ -0,0 +1,773 @@
{
"title": "Workload Health",
"uid": "okd-workload-health",
"schemaVersion": 36,
"version": 3,
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"tags": ["okd", "workload", "health"],
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"query": { "query": "label_values(kube_pod_info, namespace)", "refId": "A" },
"refresh": 2,
"includeAll": true,
"multi": true,
"allValue": ".*",
"label": "Namespace",
"sort": 1,
"current": {},
"options": []
}
]
},
"panels": [
{
"id": 1, "type": "stat", "title": "Total Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_info{namespace=~\"$namespace\"})", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "blue", "value": null }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 0, "y": 0 }
},
{
"id": 2, "type": "stat", "title": "Running Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_status_phase{phase=\"Running\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 3, "y": 0 }
},
{
"id": 3, "type": "stat", "title": "Pending Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_status_phase{phase=\"Pending\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 6, "y": 0 }
},
{
"id": 4, "type": "stat", "title": "Failed Pods",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_status_phase{phase=\"Failed\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 9, "y": 0 }
},
{
"id": 5, "type": "stat", "title": "CrashLoopBackOff",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 12, "y": 0 }
},
{
"id": 6, "type": "stat", "title": "OOMKilled",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 15, "y": 0 }
},
{
"id": 7, "type": "stat", "title": "Deployments Available",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_deployment_status_condition{condition=\"Available\",status=\"true\",namespace=~\"$namespace\"} == 1) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 18, "y": 0 }
},
{
"id": 8, "type": "stat", "title": "Deployments Degraded",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [{ "expr": "count(kube_deployment_status_replicas_unavailable{namespace=~\"$namespace\"} > 0) or vector(0)", "refId": "A", "legendFormat": "" }],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] },
"unit": "short", "noValue": "0"
}
},
"options": { "colorMode": "background", "graphMode": "none", "justifyMode": "center", "reduceOptions": { "calcs": ["lastNotNull"] }, "textMode": "auto" },
"gridPos": { "h": 4, "w": 3, "x": 21, "y": 0 }
},
{
"id": 9, "type": "row", "title": "Deployments", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 4 }
},
{
"id": 10,
"type": "table",
"title": "Deployment Status",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,deployment)(kube_deployment_spec_replicas{namespace=~\"$namespace\"})",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_ready{namespace=~\"$namespace\"})",
"refId": "B",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_available{namespace=~\"$namespace\"})",
"refId": "C",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_unavailable{namespace=~\"$namespace\"})",
"refId": "D",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,deployment)(kube_deployment_status_replicas_updated{namespace=~\"$namespace\"})",
"refId": "E",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["namespace", "deployment", "Value"]
}
}
},
{
"id": "joinByField",
"options": {
"byField": "deployment",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true,
"namespace 4": true
},
"renameByName": {
"namespace": "Namespace",
"deployment": "Deployment",
"Value": "Desired",
"Value 1": "Ready",
"Value 2": "Available",
"Value 3": "Unavailable",
"Value 4": "Up-to-date"
},
"indexByName": {
"namespace": 0,
"deployment": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5,
"Value 4": 6
}
}
},
{
"id": "sortBy",
"options": {
"fields": [{ "displayName": "Namespace", "desc": false }]
}
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 200 }]
},
{
"matcher": { "id": "byName", "options": "Deployment" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 220 }]
},
{
"matcher": { "id": "byName", "options": "Unavailable" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] }
}
]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{
"id": "thresholds",
"value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] }
}
]
}
]
},
"options": { "sortBy": [{ "displayName": "Namespace", "desc": false }] },
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 5 }
},
{
"id": 11, "type": "row", "title": "StatefulSets & DaemonSets", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 13 }
},
{
"id": 12,
"type": "table",
"title": "StatefulSet Status",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_replicas{namespace=~\"$namespace\"})",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_status_replicas_ready{namespace=~\"$namespace\"})",
"refId": "B",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_status_replicas_current{namespace=~\"$namespace\"})",
"refId": "C",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,statefulset)(kube_statefulset_status_replicas_updated{namespace=~\"$namespace\"})",
"refId": "D",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["namespace", "statefulset", "Value"]
}
}
},
{
"id": "joinByField",
"options": {
"byField": "statefulset",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true
},
"renameByName": {
"namespace": "Namespace",
"statefulset": "StatefulSet",
"Value": "Desired",
"Value 1": "Ready",
"Value 2": "Current",
"Value 3": "Up-to-date"
},
"indexByName": {
"namespace": 0,
"statefulset": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Namespace", "desc": false }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "StatefulSet" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 200 }]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 14 }
},
{
"id": 13,
"type": "table",
"title": "DaemonSet Status",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_desired_number_scheduled{namespace=~\"$namespace\"})",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_number_ready{namespace=~\"$namespace\"})",
"refId": "B",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_number_unavailable{namespace=~\"$namespace\"})",
"refId": "C",
"instant": true,
"format": "table",
"legendFormat": ""
},
{
"expr": "sum by(namespace,daemonset)(kube_daemonset_status_number_misscheduled{namespace=~\"$namespace\"})",
"refId": "D",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["namespace", "daemonset", "Value"]
}
}
},
{
"id": "joinByField",
"options": {
"byField": "daemonset",
"mode": "outer"
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"namespace 1": true,
"namespace 2": true,
"namespace 3": true
},
"renameByName": {
"namespace": "Namespace",
"daemonset": "DaemonSet",
"Value": "Desired",
"Value 1": "Ready",
"Value 2": "Unavailable",
"Value 3": "Misscheduled"
},
"indexByName": {
"namespace": 0,
"daemonset": 1,
"Value": 2,
"Value 1": 3,
"Value 2": 4,
"Value 3": 5
}
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Namespace", "desc": false }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{
"matcher": { "id": "byName", "options": "Namespace" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 180 }]
},
{
"matcher": { "id": "byName", "options": "DaemonSet" },
"properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 200 }]
},
{
"matcher": { "id": "byName", "options": "Ready" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "red", "value": null }, { "color": "green", "value": 1 }] } }
]
},
{
"matcher": { "id": "byName", "options": "Unavailable" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "red", "value": 1 }] } }
]
},
{
"matcher": { "id": "byName", "options": "Misscheduled" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "orange", "value": 1 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 14 }
},
{
"id": 14, "type": "row", "title": "Pods", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 22 }
},
{
"id": 15,
"type": "timeseries",
"title": "Pod Phase over Time",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(phase)(kube_pod_status_phase{namespace=~\"$namespace\"})",
"refId": "A", "legendFormat": "{{phase}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
},
"overrides": [
{ "matcher": { "id": "byName", "options": "Running" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Pending" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Failed" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Succeeded" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Unknown" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["lastNotNull"] }
},
"gridPos": { "h": 8, "w": 16, "x": 0, "y": 23 }
},
{
"id": 16,
"type": "piechart",
"title": "Pod Phase — Now",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(phase)(kube_pod_status_phase{namespace=~\"$namespace\"})",
"refId": "A", "instant": true, "legendFormat": "{{phase}}"
}
],
"fieldConfig": {
"defaults": { "unit": "short", "color": { "mode": "palette-classic" } },
"overrides": [
{ "matcher": { "id": "byName", "options": "Running" }, "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Pending" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Failed" }, "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Succeeded" }, "properties": [{ "id": "color", "value": { "fixedColor": "blue", "mode": "fixed" } }] },
{ "matcher": { "id": "byName", "options": "Unknown" }, "properties": [{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }] }
]
},
"options": {
"pieType": "donut",
"tooltip": { "mode": "single" },
"legend": { "displayMode": "table", "placement": "right", "values": ["value", "percent"] }
},
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 23 }
},
{
"id": 17,
"type": "timeseries",
"title": "Container Restarts over Time (total counter, top 10)",
"description": "Absolute restart counter — each vertical step = a restart event. Flat line = healthy.",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "topk(10,\n sum by(namespace, pod) (\n kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}\n ) > 0\n)",
"refId": "A",
"legendFormat": "{{namespace}} / {{pod}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 5, "showPoints": "auto", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 31 }
},
{
"id": 18,
"type": "table",
"title": "Container Total Restarts (non-zero)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace, pod, container) (kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}) > 0",
"refId": "A",
"instant": true,
"format": "table",
"legendFormat": ""
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": { "names": ["namespace", "pod", "container", "Value"] }
}
},
{
"id": "organize",
"options": {
"excludeByName": {},
"renameByName": {
"namespace": "Namespace",
"pod": "Pod",
"container": "Container",
"Value": "Total Restarts"
},
"indexByName": { "namespace": 0, "pod": 1, "container": 2, "Value": 3 }
}
},
{
"id": "sortBy",
"options": { "fields": [{ "displayName": "Total Restarts", "desc": true }] }
}
],
"fieldConfig": {
"defaults": { "custom": { "align": "center", "displayMode": "auto" } },
"overrides": [
{ "matcher": { "id": "byName", "options": "Namespace" }, "properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 160 }] },
{ "matcher": { "id": "byName", "options": "Pod" }, "properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 260 }] },
{ "matcher": { "id": "byName", "options": "Container" }, "properties": [{ "id": "custom.align", "value": "left" }, { "id": "custom.width", "value": 160 }] },
{
"matcher": { "id": "byName", "options": "Total Restarts" },
"properties": [
{ "id": "custom.displayMode", "value": "color-background" },
{ "id": "thresholds", "value": { "mode": "absolute", "steps": [{ "color": "yellow", "value": null }, { "color": "orange", "value": 5 }, { "color": "red", "value": 20 }] } }
]
}
]
},
"options": {},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 31 }
},
{
"id": 19, "type": "row", "title": "Resource Usage", "collapsed": false,
"gridPos": { "h": 1, "w": 24, "x": 0, "y": 39 }
},
{
"id": 20,
"type": "timeseries",
"title": "CPU Usage by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"}[5m]))",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "cores", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 40 }
},
{
"id": 21,
"type": "timeseries",
"title": "Memory Usage by Namespace",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(container_memory_working_set_bytes{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"})",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes", "min": 0,
"color": { "mode": "palette-classic" },
"custom": { "lineWidth": 2, "fillOpacity": 10, "showPoints": "never", "spanNulls": false }
}
},
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"legend": { "displayMode": "list", "placement": "bottom", "calcs": ["mean", "max"] }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 40 }
},
{
"id": 22,
"type": "bargauge",
"title": "CPU — Actual vs Requested (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(rate(container_cpu_usage_seconds_total{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"}[5m]))\n/\nsum by(namespace)(kube_pod_container_resource_requests{resource=\"cpu\",namespace=~\"$namespace\",container!=\"\"})\n* 100",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 150,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 80 }, { "color": "red", "value": 100 }] }
}
},
"options": {
"orientation": "horizontal", "displayMode": "gradient", "showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 48 }
},
{
"id": 23,
"type": "bargauge",
"title": "Memory — Actual vs Requested (%)",
"datasource": { "type": "prometheus", "uid": "Prometheus-Cluster" },
"targets": [
{
"expr": "sum by(namespace)(container_memory_working_set_bytes{namespace=~\"$namespace\",container!=\"\",container!=\"POD\"})\n/\nsum by(namespace)(kube_pod_container_resource_requests{resource=\"memory\",namespace=~\"$namespace\",container!=\"\"})\n* 100",
"refId": "A", "legendFormat": "{{namespace}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent", "min": 0, "max": 150,
"color": { "mode": "thresholds" },
"thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }, { "color": "yellow", "value": 80 }, { "color": "red", "value": 100 }] }
}
},
"options": {
"orientation": "horizontal", "displayMode": "gradient", "showUnfilled": true,
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 48 }
}
]
}

@@ -0,0 +1,2 @@
mod score;
pub use score::ClusterDashboardsScore;

@@ -0,0 +1,507 @@
use async_trait::async_trait;
use harmony_types::id::Id;
use k8s_openapi::api::core::v1::{Namespace, Secret};
use kube::{api::ObjectMeta, api::DynamicObject};
use serde::Serialize;
use std::collections::BTreeMap;
use crate::{
data::Version,
interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome},
inventory::Inventory,
modules::k8s::resource::K8sResourceScore,
modules::okd::crd::route::Route,
score::Score,
topology::{K8sclient, Topology},
};
#[derive(Clone, Debug, Serialize)]
pub struct ClusterDashboardsScore {
pub namespace: String,
pub grafana_admin_user: String,
pub grafana_admin_password: String,
}
impl Default for ClusterDashboardsScore {
fn default() -> Self {
Self {
namespace: "harmony-observability".to_string(),
grafana_admin_user: "admin".to_string(),
grafana_admin_password: "password".to_string(),
}
}
}
impl ClusterDashboardsScore {
pub fn new(namespace: &str) -> Self {
Self {
namespace: namespace.to_string(),
grafana_admin_user: "admin".to_string(),
grafana_admin_password: "password".to_string(),
}
}
pub fn with_credentials(namespace: &str, admin_user: &str, admin_password: &str) -> Self {
Self {
namespace: namespace.to_string(),
grafana_admin_user: admin_user.to_string(),
grafana_admin_password: admin_password.to_string(),
}
}
}
impl<T: Topology + K8sclient> Score<T> for ClusterDashboardsScore {
fn name(&self) -> String {
format!("ClusterDashboardsScore({})", self.namespace)
}
#[doc(hidden)]
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(ClusterDashboardsInterpret {
namespace: self.namespace.clone(),
grafana_admin_user: self.grafana_admin_user.clone(),
grafana_admin_password: self.grafana_admin_password.clone(),
})
}
}
#[derive(Debug, Clone)]
pub struct ClusterDashboardsInterpret {
namespace: String,
grafana_admin_user: String,
grafana_admin_password: String,
}
#[async_trait]
impl<T: Topology + K8sclient> Interpret<T> for ClusterDashboardsInterpret {
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
self.create_namespace(inventory, topology).await?;
self.create_rbac_resources(inventory, topology).await?;
self.create_secret(inventory, topology).await?;
self.create_grafana(inventory, topology).await?;
self.create_datasource(inventory, topology).await?;
self.create_dashboards(inventory, topology).await?;
self.create_route(inventory, topology).await?;
Ok(Outcome::success(format!(
"Cluster dashboards resources in namespace '{}' with {} dashboards successfully created",
self.namespace,
8
)))
}
fn get_name(&self) -> InterpretName {
InterpretName::Custom("ClusterDashboards")
}
fn get_version(&self) -> Version {
todo!()
}
fn get_status(&self) -> InterpretStatus {
todo!()
}
fn get_children(&self) -> Vec<Id> {
todo!()
}
}
impl ClusterDashboardsInterpret {
async fn create_namespace(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let mut labels = BTreeMap::new();
labels.insert(
"openshift.io/cluster-monitoring".to_string(),
"true".to_string(),
);
let namespace = Namespace {
metadata: ObjectMeta {
name: Some(self.namespace.clone()),
labels: Some(labels),
..ObjectMeta::default()
},
..Namespace::default()
};
K8sResourceScore::single(namespace, None)
.interpret(inventory, topology)
.await?;
Ok(())
}
async fn create_rbac_resources(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let service_account_name = "cluster-grafana-sa".to_string();
let rbac_namespace = self.namespace.clone();
let service_account = {
use k8s_openapi::api::core::v1::ServiceAccount;
ServiceAccount {
metadata: ObjectMeta {
name: Some(service_account_name.clone()),
namespace: Some(rbac_namespace.clone()),
..ObjectMeta::default()
},
..ServiceAccount::default()
}
};
let cluster_role = {
use k8s_openapi::api::rbac::v1::{ClusterRole, PolicyRule};
ClusterRole {
metadata: ObjectMeta {
name: Some("grafana-prometheus-api-access".to_string()),
..ObjectMeta::default()
},
rules: Some(vec![PolicyRule {
api_groups: Some(vec!["monitoring.coreos.com".to_string()]),
resources: Some(vec!["prometheuses/api".to_string()]),
verbs: vec!["get".to_string()],
..PolicyRule::default()
}]),
..ClusterRole::default()
}
};
let cluster_role_binding = {
use k8s_openapi::api::rbac::v1::{ClusterRoleBinding, RoleRef, Subject};
ClusterRoleBinding {
metadata: ObjectMeta {
name: Some("grafana-prometheus-api-access-binding".to_string()),
..ObjectMeta::default()
},
subjects: Some(vec![Subject {
kind: "ServiceAccount".to_string(),
name: service_account_name.clone(),
namespace: Some(rbac_namespace.clone()),
..Subject::default()
}]),
role_ref: RoleRef {
api_group: "rbac.authorization.k8s.io".to_string(),
kind: "ClusterRole".to_string(),
name: "grafana-prometheus-api-access".to_string(),
},
}
};
let cluster_role_binding_cluster_monitoring = {
use k8s_openapi::api::rbac::v1::{ClusterRoleBinding, RoleRef, Subject};
ClusterRoleBinding {
metadata: ObjectMeta {
name: Some("grafana-cluster-monitoring-view".to_string()),
..ObjectMeta::default()
},
subjects: Some(vec![Subject {
kind: "ServiceAccount".to_string(),
name: service_account_name.clone(),
namespace: Some(rbac_namespace.clone()),
..Subject::default()
}]),
role_ref: RoleRef {
api_group: "rbac.authorization.k8s.io".to_string(),
kind: "ClusterRole".to_string(),
name: "cluster-monitoring-view".to_string(),
},
}
};
K8sResourceScore::single(service_account, Some(rbac_namespace.clone()))
.interpret(inventory, topology)
.await?;
K8sResourceScore::single(cluster_role, None)
.interpret(inventory, topology)
.await?;
K8sResourceScore::single(cluster_role_binding, None)
.interpret(inventory, topology)
.await?;
K8sResourceScore::single(cluster_role_binding_cluster_monitoring, None)
.interpret(inventory, topology)
.await?;
Ok(())
}
async fn create_secret(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let service_account_name = "cluster-grafana-sa".to_string();
let secret_name = "grafana-prometheus-token".to_string();
let secret_namespace = self.namespace.clone();
let secret = Secret {
metadata: ObjectMeta {
name: Some(secret_name),
namespace: Some(secret_namespace),
annotations: Some({
let mut ann = BTreeMap::new();
ann.insert(
"kubernetes.io/service-account.name".to_string(),
service_account_name,
);
ann
}),
..ObjectMeta::default()
},
type_: Some("kubernetes.io/service-account-token".to_string()),
..Secret::default()
};
K8sResourceScore::single(secret, Some(self.namespace.clone()))
.interpret(inventory, topology)
.await?;
Ok(())
}
async fn create_grafana(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let client = topology
.k8s_client()
.await
.map_err(|e| InterpretError::new(format!("Failed to get k8s client: {e}")))?;
let grafana_yaml = format!(
r#"apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: cluster-grafana
  namespace: {}
  labels:
    dashboards: "grafana"
spec:
  config:
    log:
      mode: console
    security:
      admin_user: {}
      admin_password: {}
    users:
      viewers_can_edit: "false"
    auth:
      disable_login_form: "false"
    "auth.anonymous":
      enabled: "true"
      org_role: "Viewer"
  deployment:
    spec:
      replicas: 1
      template:
        spec:
          containers:
            - name: grafana
              resources:
                requests:
                  cpu: 500m
                  memory: 1Gi
                limits:
                  cpu: "1"
                  memory: 2Gi
"#, self.namespace, self.grafana_admin_user, self.grafana_admin_password);
let grafana_value: serde_json::Value = serde_yaml::from_str(grafana_yaml.as_str())
.map_err(|e| InterpretError::new(format!("Failed to parse Grafana YAML: {e}")))?;
let grafana: DynamicObject = serde_json::from_value(grafana_value)
.map_err(|e| InterpretError::new(format!("Failed to create DynamicObject: {e}")))?;
client.apply_dynamic(&grafana, Some(&self.namespace), false).await
.map_err(|e| InterpretError::new(format!("Failed to apply Grafana: {e}")))?;
Ok(())
}
async fn create_datasource(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let client = topology
.k8s_client()
.await
.map_err(|e| InterpretError::new(format!("Failed to get k8s client: {e}")))?;
let secure_json_data_value = "Bearer ${token}";
let datasource_yaml = format!(
r#"apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus-cluster
  namespace: {}
  labels:
    datasource: "prometheus"
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  valuesFrom:
    - targetPath: "secureJsonData.httpHeaderValue1"
      valueFrom:
        secretKeyRef:
          name: grafana-prometheus-token
          key: token
  datasource:
    name: Prometheus-Cluster
    type: prometheus
    access: proxy
    url: https://prometheus-k8s.openshift-monitoring.svc:9091
    isDefault: true
    jsonData:
      httpHeaderName1: "Authorization"
      tlsSkipVerify: true
      timeInterval: "30s"
    secureJsonData:
      httpHeaderValue1: "{}"
"#, self.namespace, secure_json_data_value);
let datasource_value: serde_json::Value = serde_yaml::from_str(datasource_yaml.as_str())
.map_err(|e| InterpretError::new(format!("Failed to parse Datasource YAML: {e}")))?;
let datasource: DynamicObject = serde_json::from_value(datasource_value)
.map_err(|e| InterpretError::new(format!("Failed to create DynamicObject: {e}")))?;
client.apply_dynamic(&datasource, Some(&self.namespace), false).await
.map_err(|e| InterpretError::new(format!("Failed to apply Datasource: {e}")))?;
Ok(())
}
async fn create_dashboards(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let client = topology
.k8s_client()
.await
.map_err(|e| InterpretError::new(format!("Failed to get k8s client: {e}")))?;
let dashboards: &[(&str, &str)] = &[
("okd-cluster-overview", include_str!("dashboards/cluster-overview.json")),
("okd-node-health", include_str!("dashboards/nodes-health.json")),
("okd-workload-health", include_str!("dashboards/workloads-health.json")),
("okd-networking", include_str!("dashboards/networking.json")),
("storage-health", include_str!("dashboards/storage.json")),
("okd-etcd", include_str!("dashboards/etcd.json")),
("okd-control-plane", include_str!("dashboards/control-plane.json")),
("okd-alerts-events", include_str!("dashboards/alerts-events-problems.json")),
];
for (dashboard_name, json_content) in dashboards {
let dashboard: DynamicObject = serde_json::from_value(serde_json::json!({
"apiVersion": "grafana.integreatly.org/v1beta1",
"kind": "GrafanaDashboard",
"metadata": {
"name": dashboard_name,
"namespace": self.namespace,
"labels": {
"dashboard": dashboard_name
}
},
"spec": {
"instanceSelector": {
"matchLabels": {
"dashboards": "grafana"
}
},
"json": json_content
}
})).map_err(|e| InterpretError::new(format!("Failed to create Dashboard {} DynamicObject: {e}", dashboard_name)))?;
client.apply_dynamic(&dashboard, Some(&self.namespace), false).await
.map_err(|e| InterpretError::new(format!("Failed to apply Dashboard {}: {e}", dashboard_name)))?;
}
Ok(())
}
async fn create_route(
&self,
inventory: &Inventory,
topology: &(impl Topology + K8sclient),
) -> Result<(), InterpretError> {
let route = Route {
metadata: ObjectMeta {
name: Some("grafana".to_string()),
namespace: Some(self.namespace.clone()),
..ObjectMeta::default()
},
spec: crate::modules::okd::crd::route::RouteSpec {
to: crate::modules::okd::crd::route::RouteTargetReference {
kind: "Service".to_string(),
name: "cluster-grafana-service".to_string(),
weight: None,
},
port: Some(crate::modules::okd::crd::route::RoutePort {
target_port: 3000,
}),
tls: Some(crate::modules::okd::crd::route::TLSConfig {
termination: "edge".to_string(),
insecure_edge_termination_policy: Some("Redirect".to_string()),
..crate::modules::okd::crd::route::TLSConfig::default()
}),
..crate::modules::okd::crd::route::RouteSpec::default()
},
..crate::modules::okd::crd::route::Route::default()
};
K8sResourceScore::single(route, Some(self.namespace.clone()))
.interpret(inventory, topology)
.await?;
Ok(())
}
}
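The interpret above templates its Grafana custom resources with `format!` over a raw string, so the YAML's two-space indentation is load-bearing. A self-contained sketch of that approach (the helper name `render_grafana_yaml` and the trimmed field set are illustrative, not part of the crate):

```rust
// Sketch of the format!-over-raw-string templating used by create_grafana.
// The raw string keeps its two-space indentation, which is what makes
// `security` a child of `config` rather than a top-level key.
fn render_grafana_yaml(namespace: &str, user: &str, password: &str) -> String {
    format!(
        r#"apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: cluster-grafana
  namespace: {namespace}
spec:
  config:
    security:
      admin_user: {user}
      admin_password: {password}
"#
    )
}

fn main() {
    let yaml = render_grafana_yaml("harmony-observability", "admin", "secret");
    // Flattening this indentation would change the document's structure.
    assert!(yaml.contains("    security:\n      admin_user: admin"));
    println!("ok");
}
```

Because the string is parsed back through `serde_yaml` before being applied as a `DynamicObject`, any lost indentation fails fast at parse time rather than producing a malformed custom resource on the cluster.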

@@ -1,41 +1,17 @@
use serde::Serialize;
use async_trait::async_trait;
use k8s_openapi::Resource;
use crate::topology::monitoring::{AlertReceiver, AlertRule, AlertSender, ScrapeTarget};
use crate::{
inventory::Inventory,
topology::{PreparationError, PreparationOutcome},
};
#[derive(Debug, Clone, Serialize)]
pub struct Grafana {
pub namespace: String,
}
impl AlertSender for Grafana {
fn name(&self) -> String {
"grafana".to_string()
}
}
impl Serialize for Box<dyn AlertReceiver<Grafana>> {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
todo!()
}
}
impl Serialize for Box<dyn AlertRule<Grafana>> {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
todo!()
}
}
impl Serialize for Box<dyn ScrapeTarget<Grafana>> {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
todo!()
}
}
#[async_trait]
pub trait Grafana {
async fn ensure_grafana_operator(
&self,
inventory: &Inventory,
) -> Result<PreparationOutcome, PreparationError>;
async fn install_grafana(&self) -> Result<PreparationOutcome, PreparationError>;
}

@@ -1,32 +0,0 @@
use serde::Serialize;
use crate::{
modules::monitoring::grafana::grafana::Grafana,
score::Score,
topology::{
Topology,
monitoring::{AlertReceiver, AlertRule, AlertingInterpret, Observability, ScrapeTarget},
},
};
#[derive(Clone, Debug, Serialize)]
pub struct GrafanaAlertingScore {
pub receivers: Vec<Box<dyn AlertReceiver<Grafana>>>,
pub rules: Vec<Box<dyn AlertRule<Grafana>>>,
pub scrape_targets: Option<Vec<Box<dyn ScrapeTarget<Grafana>>>>,
pub sender: Grafana,
}
impl<T: Topology + Observability<Grafana>> Score<T> for GrafanaAlertingScore {
fn create_interpret(&self) -> Box<dyn crate::interpret::Interpret<T>> {
Box::new(AlertingInterpret {
sender: self.sender.clone(),
receivers: self.receivers.clone(),
rules: self.rules.clone(),
scrape_targets: self.scrape_targets.clone(),
})
}
fn name(&self) -> String {
"HelmPrometheusAlertingScore".to_string()
}
}

@@ -0,0 +1,28 @@
use harmony_macros::hurl;
use non_blank_string_rs::NonBlankString;
use std::{collections::HashMap, str::FromStr};
use crate::modules::helm::chart::{HelmChartScore, HelmRepository};
pub fn grafana_helm_chart_score(ns: &str, namespace_scope: bool) -> HelmChartScore {
let mut values_overrides = HashMap::new();
values_overrides.insert(
NonBlankString::from_str("namespaceScope").unwrap(),
namespace_scope.to_string(),
);
HelmChartScore {
namespace: Some(NonBlankString::from_str(ns).unwrap()),
release_name: NonBlankString::from_str("grafana-operator").unwrap(),
chart_name: NonBlankString::from_str("grafana/grafana-operator").unwrap(),
chart_version: None,
values_overrides: Some(values_overrides),
values_yaml: None,
create_namespace: true,
install_only: true,
repository: Some(HelmRepository::new(
"grafana".to_string(),
hurl!("https://grafana.github.io/helm-charts"),
true,
)),
}
}
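`grafana_helm_chart_score` renders the typed `namespace_scope` flag into a plain-string Helm values override, since Helm `--set` values are text. A standalone sketch of that mapping, with hypothetical helper names and no crate-internal types:

```rust
use std::collections::HashMap;

// Helm `--set` overrides are plain strings, so typed values such as
// booleans must be rendered to text before being handed to the chart.
fn bool_override(key: &str, value: bool) -> (String, String) {
    (key.to_string(), value.to_string())
}

// Mirrors the values_overrides construction in grafana_helm_chart_score.
fn build_overrides(namespace_scope: bool) -> HashMap<String, String> {
    let mut overrides = HashMap::new();
    let (k, v) = bool_override("namespaceScope", namespace_scope);
    overrides.insert(k, v);
    overrides
}

fn main() {
    println!("{:?}", build_overrides(true)); // {"namespaceScope": "true"}
}
```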


@@ -0,0 +1 @@
pub mod helm_grafana;


@@ -1,3 +0,0 @@
pub mod crd_grafana;
pub mod grafana_default_dashboard;
pub mod rhob_grafana;


@@ -1 +0,0 @@
pub mod grafana_operator;


@@ -1,7 +0,0 @@
pub mod crd;
pub mod helm;
pub mod score_ensure_grafana_ready;
pub mod score_grafana_alert_receiver;
pub mod score_grafana_datasource;
pub mod score_grafana_rule;
pub mod score_install_grafana;


@@ -1,54 +0,0 @@
use serde::Serialize;
use crate::{
interpret::Interpret,
modules::monitoring::grafana::grafana::Grafana,
score::Score,
topology::{K8sclient, Topology},
};
#[derive(Debug, Clone, Serialize)]
pub struct GrafanaK8sEnsureReadyScore {
pub sender: Grafana,
}
impl<T: Topology + K8sclient> Score<T> for GrafanaK8sEnsureReadyScore {
fn name(&self) -> String {
"GrafanaK8sEnsureReadyScore".to_string()
}
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
todo!()
}
}
// async fn ensure_ready(
// &self,
// inventory: &Inventory,
// ) -> Result<PreparationOutcome, PreparationError> {
// debug!("ensure grafana operator");
// let client = self.k8s_client().await.unwrap();
// let grafana_gvk = GroupVersionKind {
// group: "grafana.integreatly.org".to_string(),
// version: "v1beta1".to_string(),
// kind: "Grafana".to_string(),
// };
// let name = "grafanas.grafana.integreatly.org";
// let ns = "grafana";
//
// let grafana_crd = client
// .get_resource_json_value(name, Some(ns), &grafana_gvk)
// .await;
// match grafana_crd {
// Ok(_) => {
// return Ok(PreparationOutcome::Success {
// details: "Found grafana CRDs in cluster".to_string(),
// });
// }
//
// Err(_) => {
// return self
// .install_grafana_operator(inventory, Some("grafana"))
// .await;
// }
// };
// }


@@ -1,24 +0,0 @@
use serde::Serialize;
use crate::{
interpret::Interpret,
modules::monitoring::grafana::grafana::Grafana,
score::Score,
topology::{K8sclient, Topology, monitoring::AlertReceiver},
};
#[derive(Debug, Clone, Serialize)]
pub struct GrafanaK8sReceiverScore {
pub sender: Grafana,
pub receiver: Box<dyn AlertReceiver<Grafana>>,
}
impl<T: Topology + K8sclient> Score<T> for GrafanaK8sReceiverScore {
fn name(&self) -> String {
"GrafanaK8sReceiverScore".to_string()
}
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
todo!()
}
}


@@ -1,83 +0,0 @@
use serde::Serialize;
use crate::{
interpret::Interpret,
modules::monitoring::grafana::grafana::Grafana,
score::Score,
topology::{K8sclient, Topology, monitoring::ScrapeTarget},
};
#[derive(Debug, Clone, Serialize)]
pub struct GrafanaK8sDatasourceScore {
pub sender: Grafana,
pub scrape_target: Box<dyn ScrapeTarget<Grafana>>,
}
impl<T: Topology + K8sclient> Score<T> for GrafanaK8sDatasourceScore {
fn name(&self) -> String {
"GrafanaK8sDatasourceScore".to_string()
}
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
todo!()
}
}
// fn extract_and_normalize_token(&self, secret: &DynamicObject) -> Option<String> {
// let token_b64 = secret
// .data
// .get("token")
// .or_else(|| secret.data.get("data").and_then(|d| d.get("token")))
// .and_then(|v| v.as_str())?;
//
// let bytes = general_purpose::STANDARD.decode(token_b64).ok()?;
//
// let s = String::from_utf8(bytes).ok()?;
//
// let cleaned = s
// .trim_matches(|c: char| c.is_whitespace() || c == '\0')
// .to_string();
// Some(cleaned)
// }
// fn build_grafana_datasource(
// &self,
// name: &str,
// ns: &str,
// label_selector: &LabelSelector,
// url: &str,
// token: &str,
// ) -> GrafanaDatasource {
// let mut json_data = BTreeMap::new();
// json_data.insert("timeInterval".to_string(), "5s".to_string());
//
// GrafanaDatasource {
// metadata: ObjectMeta {
// name: Some(name.to_string()),
// namespace: Some(ns.to_string()),
// ..Default::default()
// },
// spec: GrafanaDatasourceSpec {
// instance_selector: label_selector.clone(),
// allow_cross_namespace_import: Some(true),
// values_from: None,
// datasource: GrafanaDatasourceConfig {
// access: "proxy".to_string(),
// name: name.to_string(),
// r#type: "prometheus".to_string(),
// url: url.to_string(),
// database: None,
// json_data: Some(GrafanaDatasourceJsonData {
// time_interval: Some("60s".to_string()),
// http_header_name1: Some("Authorization".to_string()),
// tls_skip_verify: Some(true),
// oauth_pass_thru: Some(true),
// }),
// secure_json_data: Some(GrafanaDatasourceSecureJsonData {
// http_header_value1: Some(format!("Bearer {token}")),
// }),
// is_default: Some(false),
// editable: Some(true),
// },
// },
// }
// }


@@ -1,67 +0,0 @@
use serde::Serialize;
use crate::{
interpret::Interpret,
modules::monitoring::grafana::grafana::Grafana,
score::Score,
topology::{K8sclient, Topology, monitoring::AlertRule},
};
#[derive(Debug, Clone, Serialize)]
pub struct GrafanaK8sRuleScore {
pub sender: Grafana,
pub rule: Box<dyn AlertRule<Grafana>>,
}
impl<T: Topology + K8sclient> Score<T> for GrafanaK8sRuleScore {
fn name(&self) -> String {
"GrafanaK8sRuleScore".to_string()
}
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
todo!()
}
}
// kind: Secret
// apiVersion: v1
// metadata:
// name: credentials
// namespace: grafana
// stringData:
// PROMETHEUS_USERNAME: root
// PROMETHEUS_PASSWORD: secret
// type: Opaque
// ---
// apiVersion: grafana.integreatly.org/v1beta1
// kind: GrafanaDatasource
// metadata:
// name: grafanadatasource-sample
// spec:
// valuesFrom:
// - targetPath: "basicAuthUser"
// valueFrom:
// secretKeyRef:
// name: "credentials"
// key: "PROMETHEUS_USERNAME"
// - targetPath: "secureJsonData.basicAuthPassword"
// valueFrom:
// secretKeyRef:
// name: "credentials"
// key: "PROMETHEUS_PASSWORD"
// instanceSelector:
// matchLabels:
// dashboards: "grafana"
// datasource:
// name: prometheus
// type: prometheus
// access: proxy
// basicAuth: true
// url: http://prometheus-service:9090
// isDefault: true
// basicAuthUser: ${PROMETHEUS_USERNAME}
// jsonData:
// "tlsSkipVerify": true
// "timeInterval": "5s"
// secureJsonData:
// "basicAuthPassword": ${PROMETHEUS_PASSWORD} #


@@ -1,223 +0,0 @@
use async_trait::async_trait;
use harmony_types::id::Id;
use serde::Serialize;
use crate::{
data::Version,
interpret::{Interpret, InterpretError, InterpretName, InterpretStatus, Outcome},
inventory::Inventory,
modules::monitoring::grafana::grafana::Grafana,
score::Score,
topology::{HelmCommand, K8sclient, Topology},
};
#[derive(Debug, Clone, Serialize)]
pub struct GrafanaK8sInstallScore {
pub sender: Grafana,
}
impl<T: Topology + K8sclient + HelmCommand> Score<T> for GrafanaK8sInstallScore {
fn name(&self) -> String {
"GrafanaK8sEnsureReadyScore".to_string()
}
fn create_interpret(&self) -> Box<dyn Interpret<T>> {
Box::new(GrafanaK8sInstallInterpret {})
}
}
#[derive(Debug, Clone, Serialize)]
pub struct GrafanaK8sInstallInterpret {}
#[async_trait]
impl<T: Topology + K8sclient + HelmCommand> Interpret<T> for GrafanaK8sInstallInterpret {
async fn execute(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<Outcome, InterpretError> {
todo!()
}
fn get_name(&self) -> InterpretName {
InterpretName::Custom("GrafanaK8sInstallInterpret")
}
fn get_version(&self) -> Version {
todo!()
}
fn get_status(&self) -> InterpretStatus {
todo!()
}
fn get_children(&self) -> Vec<Id> {
todo!()
}
}
// let score = grafana_operator_helm_chart_score(sender.namespace.clone());
//
// score
// .create_interpret()
// .execute(inventory, self)
// .await
// .map_err(|e| PreparationError::new(e.to_string()))?;
//
//
// fn build_grafana_dashboard(
// &self,
// ns: &str,
// label_selector: &LabelSelector,
// ) -> GrafanaDashboard {
// let graf_dashboard = GrafanaDashboard {
// metadata: ObjectMeta {
// name: Some(format!("grafana-dashboard-{}", ns)),
// namespace: Some(ns.to_string()),
// ..Default::default()
// },
// spec: GrafanaDashboardSpec {
// resync_period: Some("30s".to_string()),
// instance_selector: label_selector.clone(),
// datasources: Some(vec![GrafanaDashboardDatasource {
// input_name: "DS_PROMETHEUS".to_string(),
// datasource_name: "thanos-openshift-monitoring".to_string(),
// }]),
// json: None,
// grafana_com: Some(GrafanaCom {
// id: 17406,
// revision: None,
// }),
// },
// };
// graf_dashboard
// }
//
// fn build_grafana(&self, ns: &str, labels: &BTreeMap<String, String>) -> GrafanaCRD {
// let grafana = GrafanaCRD {
// metadata: ObjectMeta {
// name: Some(format!("grafana-{}", ns)),
// namespace: Some(ns.to_string()),
// labels: Some(labels.clone()),
// ..Default::default()
// },
// spec: GrafanaSpec {
// config: None,
// admin_user: None,
// admin_password: None,
// ingress: None,
// persistence: None,
// resources: None,
// },
// };
// grafana
// }
//
// async fn build_grafana_ingress(&self, ns: &str) -> K8sIngressScore {
// let domain = self.get_domain(&format!("grafana-{}", ns)).await.unwrap();
// let name = format!("{}-grafana", ns);
// let backend_service = format!("grafana-{}-service", ns);
//
// K8sIngressScore {
// name: fqdn::fqdn!(&name),
// host: fqdn::fqdn!(&domain),
// backend_service: fqdn::fqdn!(&backend_service),
// port: 3000,
// path: Some("/".to_string()),
// path_type: Some(PathType::Prefix),
// namespace: Some(fqdn::fqdn!(&ns)),
// ingress_class_name: Some("openshift-default".to_string()),
// }
// }
// #[async_trait]
// impl Grafana for K8sAnywhereTopology {
// async fn install_grafana(&self) -> Result<PreparationOutcome, PreparationError> {
// let ns = "grafana";
//
// let mut label = BTreeMap::new();
//
// label.insert("dashboards".to_string(), "grafana".to_string());
//
// let label_selector = LabelSelector {
// match_labels: label.clone(),
// match_expressions: vec![],
// };
//
// let client = self.k8s_client().await?;
//
// let grafana = self.build_grafana(ns, &label);
//
// client.apply(&grafana, Some(ns)).await?;
// //TODO change this to a ensure ready or something better than just a timeout
// client
// .wait_until_deployment_ready(
// "grafana-grafana-deployment",
// Some("grafana"),
// Some(Duration::from_secs(30)),
// )
// .await?;
//
// let sa_name = "grafana-grafana-sa";
// let token_secret_name = "grafana-sa-token-secret";
//
// let sa_token_secret = self.build_sa_token_secret(token_secret_name, sa_name, ns);
//
// client.apply(&sa_token_secret, Some(ns)).await?;
// let secret_gvk = GroupVersionKind {
// group: "".to_string(),
// version: "v1".to_string(),
// kind: "Secret".to_string(),
// };
//
// let secret = client
// .get_resource_json_value(token_secret_name, Some(ns), &secret_gvk)
// .await?;
//
// let token = format!(
// "Bearer {}",
// self.extract_and_normalize_token(&secret).unwrap()
// );
//
// debug!("creating grafana clusterrole binding");
//
// let clusterrolebinding =
// self.build_cluster_rolebinding(sa_name, "cluster-monitoring-view", ns);
//
// client.apply(&clusterrolebinding, Some(ns)).await?;
//
// debug!("creating grafana datasource crd");
//
// let thanos_url = format!(
// "https://{}",
// self.get_domain("thanos-querier-openshift-monitoring")
// .await
// .unwrap()
// );
//
// let thanos_openshift_datasource = self.build_grafana_datasource(
// "thanos-openshift-monitoring",
// ns,
// &label_selector,
// &thanos_url,
// &token,
// );
//
// client.apply(&thanos_openshift_datasource, Some(ns)).await?;
//
// debug!("creating grafana dashboard crd");
// let dashboard = self.build_grafana_dashboard(ns, &label_selector);
//
// client.apply(&dashboard, Some(ns)).await?;
// debug!("creating grafana ingress");
// let grafana_ingress = self.build_grafana_ingress(ns).await;
//
// grafana_ingress
// .interpret(&Inventory::empty(), self)
// .await
// .map_err(|e| PreparationError::new(e.to_string()))?;
//
// Ok(PreparationOutcome::Success {
// details: "Installed grafana composants".to_string(),
// })
// }
// }


@@ -1,3 +1,2 @@
pub mod grafana;
pub mod grafana_alerting_score;
pub mod k8s;
pub mod helm;


@@ -1,17 +1,91 @@
use std::sync::Arc;
use async_trait::async_trait;
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
#[derive(CustomResource, Serialize, Deserialize, Default, Debug, Clone, JsonSchema)]
use crate::{
interpret::InterpretError,
inventory::Inventory,
modules::{
monitoring::{
grafana::grafana::Grafana, kube_prometheus::crd::service_monitor::ServiceMonitor,
},
prometheus::prometheus::PrometheusMonitoring,
},
topology::{
K8sclient, Topology,
installable::Installable,
oberservability::monitoring::{AlertReceiver, AlertSender, ScrapeTarget},
},
};
use harmony_k8s::K8sClient;
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[kube(
group = "monitoring.coreos.com",
version = "v1",
version = "v1alpha1",
kind = "AlertmanagerConfig",
plural = "alertmanagerconfigs",
namespaced,
derive = "Default"
namespaced
)]
pub struct AlertmanagerConfigSpec {
#[serde(flatten)]
pub data: serde_json::Value,
}
#[derive(Debug, Clone, Serialize)]
pub struct CRDPrometheus {
pub namespace: String,
pub client: Arc<K8sClient>,
pub service_monitor: Vec<ServiceMonitor>,
}
impl AlertSender for CRDPrometheus {
fn name(&self) -> String {
"CRDAlertManager".to_string()
}
}
impl Clone for Box<dyn AlertReceiver<CRDPrometheus>> {
fn clone(&self) -> Self {
self.clone_box()
}
}
impl Clone for Box<dyn ScrapeTarget<CRDPrometheus>> {
fn clone(&self) -> Self {
self.clone_box()
}
}
impl Serialize for Box<dyn AlertReceiver<CRDPrometheus>> {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
todo!()
}
}
#[async_trait]
impl<T: Topology + K8sclient + PrometheusMonitoring<CRDPrometheus> + Grafana> Installable<T>
for CRDPrometheus
{
async fn configure(&self, inventory: &Inventory, topology: &T) -> Result<(), InterpretError> {
topology.ensure_grafana_operator(inventory).await?;
topology.ensure_prometheus_operator(self, inventory).await?;
Ok(())
}
async fn ensure_installed(
&self,
inventory: &Inventory,
topology: &T,
) -> Result<(), InterpretError> {
topology.install_grafana().await?;
topology.install_prometheus(&self, inventory, None).await?;
Ok(())
}
}
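The `Installable` impl above splits setup into two ordered phases: `configure` ensures the Grafana and Prometheus operators exist before `ensure_installed` deploys the instances themselves. A minimal standalone sketch of that ordering (struct and method names hypothetical, no async or topology types):

```rust
// Records the call order of the two-phase install flow:
// operators are ensured before the instances are installed.
#[derive(Default)]
struct Phases {
    log: Vec<&'static str>,
}

impl Phases {
    // Phase 1: make the operators available (mirrors `configure`).
    fn configure(&mut self) {
        self.log.push("ensure_grafana_operator");
        self.log.push("ensure_prometheus_operator");
    }
    // Phase 2: deploy the actual resources (mirrors `ensure_installed`).
    fn ensure_installed(&mut self) {
        self.log.push("install_grafana");
        self.log.push("install_prometheus");
    }
}

fn run() -> Vec<&'static str> {
    let mut p = Phases::default();
    p.configure();
    p.ensure_installed();
    p.log
}

fn main() {
    println!("{:?}", run());
}
```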


@@ -1,4 +1,4 @@
use crate::modules::monitoring::alert_rule::alerts::k8s::{
use crate::modules::prometheus::alerts::k8s::{
deployment::alert_deployment_unavailable,
pod::{alert_container_restarting, alert_pod_not_ready, pod_failed},
pvc::high_pvc_fill_rate_over_two_days,


@@ -4,7 +4,7 @@ use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::modules::monitoring::kube_prometheus::crd::crd_prometheuses::LabelSelector;
use super::crd_prometheuses::LabelSelector;
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[kube(


@@ -6,14 +6,13 @@ use serde::{Deserialize, Serialize};
use crate::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
#[derive(CustomResource, Default, Debug, Serialize, Deserialize, Clone, JsonSchema)]
#[derive(CustomResource, Debug, Serialize, Deserialize, Clone, JsonSchema)]
#[kube(
group = "monitoring.coreos.com",
version = "v1",
kind = "PrometheusRule",
plural = "prometheusrules",
namespaced,
derive = "Default"
namespaced
)]
#[serde(rename_all = "camelCase")]
pub struct PrometheusRuleSpec {


@@ -1,18 +1,23 @@
use std::collections::BTreeMap;
use std::net::IpAddr;
use async_trait::async_trait;
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::modules::monitoring::kube_prometheus::crd::crd_prometheuses::LabelSelector;
use crate::{
modules::monitoring::kube_prometheus::crd::{
crd_alertmanager_config::CRDPrometheus, crd_prometheuses::LabelSelector,
},
topology::oberservability::monitoring::ScrapeTarget,
};
#[derive(CustomResource, Default, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[kube(
group = "monitoring.coreos.com",
version = "v1alpha1",
kind = "ScrapeConfig",
plural = "scrapeconfigs",
derive = "Default",
namespaced
)]
#[serde(rename_all = "camelCase")]
@@ -65,8 +70,8 @@ pub struct ScrapeConfigSpec {
#[serde(rename_all = "camelCase")]
pub struct StaticConfig {
pub targets: Vec<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub labels: Option<BTreeMap<String, String>>,
pub labels: Option<LabelSelector>,
}
/// Relabeling configuration for target or metric relabeling.


@@ -1,9 +1,22 @@
pub mod crd_alertmanager_config;
pub mod crd_alertmanagers;
pub mod crd_default_rules;
pub mod crd_grafana;
pub mod crd_prometheus_rules;
pub mod crd_prometheuses;
pub mod crd_scrape_config;
pub mod grafana_default_dashboard;
pub mod grafana_operator;
pub mod prometheus_operator;
pub mod rhob_alertmanager_config;
pub mod rhob_alertmanagers;
pub mod rhob_cluster_observability_operator;
pub mod rhob_default_rules;
pub mod rhob_grafana;
pub mod rhob_monitoring_stack;
pub mod rhob_prometheus_rules;
pub mod rhob_prometheuses;
pub mod rhob_role;
pub mod rhob_service_monitor;
pub mod role;
pub mod service_monitor;


@@ -0,0 +1,48 @@
use std::sync::Arc;
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::topology::oberservability::monitoring::{AlertReceiver, AlertSender};
use harmony_k8s::K8sClient;
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[kube(
group = "monitoring.rhobs",
version = "v1alpha1",
kind = "AlertmanagerConfig",
plural = "alertmanagerconfigs",
namespaced
)]
pub struct AlertmanagerConfigSpec {
#[serde(flatten)]
pub data: serde_json::Value,
}
#[derive(Debug, Clone, Serialize)]
pub struct RHOBObservability {
pub namespace: String,
pub client: Arc<K8sClient>,
}
impl AlertSender for RHOBObservability {
fn name(&self) -> String {
"RHOBAlertManager".to_string()
}
}
impl Clone for Box<dyn AlertReceiver<RHOBObservability>> {
fn clone(&self) -> Self {
self.clone_box()
}
}
impl Serialize for Box<dyn AlertReceiver<RHOBObservability>> {
fn serialize<S>(&self, _serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
todo!()
}
}


@@ -2,7 +2,7 @@ use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::modules::monitoring::red_hat_cluster_observability::crd::rhob_prometheuses::LabelSelector;
use super::crd_prometheuses::LabelSelector;
/// Rust CRD for `Alertmanager` from Prometheus Operator
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]


@@ -0,0 +1,22 @@
use std::str::FromStr;
use non_blank_string_rs::NonBlankString;
use crate::modules::helm::chart::HelmChartScore;
//TODO package chart or something for COO okd
pub fn rhob_cluster_observability_operator() -> HelmChartScore {
HelmChartScore {
namespace: None,
release_name: NonBlankString::from_str("").unwrap(),
chart_name: NonBlankString::from_str(
"oci://hub.nationtech.io/harmony/nt-prometheus-operator",
)
.unwrap(),
chart_version: None,
values_overrides: None,
values_yaml: None,
create_namespace: true,
install_only: true,
repository: None,
}
}


@@ -1,11 +1,11 @@
use crate::modules::monitoring::{
alert_rule::alerts::k8s::{
use crate::modules::{
monitoring::kube_prometheus::crd::rhob_prometheus_rules::Rule,
prometheus::alerts::k8s::{
deployment::alert_deployment_unavailable,
pod::{alert_container_restarting, alert_pod_not_ready, pod_failed},
pvc::high_pvc_fill_rate_over_two_days,
service::alert_service_down,
},
red_hat_cluster_observability::crd::rhob_prometheus_rules::Rule,
};
pub fn build_default_application_rules() -> Vec<Rule> {


@@ -4,7 +4,7 @@ use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::modules::monitoring::red_hat_cluster_observability::crd::rhob_prometheuses::LabelSelector;
use crate::modules::monitoring::kube_prometheus::crd::rhob_prometheuses::LabelSelector;
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[kube(


@@ -2,7 +2,7 @@ use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::modules::monitoring::red_hat_cluster_observability::crd::rhob_prometheuses::LabelSelector;
use crate::modules::monitoring::kube_prometheus::crd::rhob_prometheuses::LabelSelector;
/// MonitoringStack CRD for monitoring.rhobs/v1alpha1
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
@@ -11,8 +11,7 @@ use crate::modules::monitoring::red_hat_cluster_observability::crd::rhob_prometh
version = "v1alpha1",
kind = "MonitoringStack",
plural = "monitoringstacks",
namespaced,
derive = "Default"
namespaced
)]
#[serde(rename_all = "camelCase")]
pub struct MonitoringStackSpec {


@@ -4,6 +4,8 @@ use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::modules::monitoring::kube_prometheus::types::Operator;
#[derive(CustomResource, Serialize, Deserialize, Debug, Clone, JsonSchema)]
#[kube(
group = "monitoring.rhobs",
@@ -91,14 +93,6 @@ pub struct LabelSelectorRequirement {
pub values: Vec<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub enum Operator {
In,
NotIn,
Exists,
DoesNotExist,
}
impl Default for PrometheusSpec {
fn default() -> Self {
PrometheusSpec {

Some files were not shown because too many files have changed in this diff.