# Monitoring and Alerting in Harmony

Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.

## Overview

Harmony's monitoring module supports three distinct use cases:

| Level | Who Uses It | What It Provides |
|-------|-------------|------------------|
| **Cluster** | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| **Tenant** | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| **Application** | Application developers | Zero-config monitoring that "just works" |

Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.

## Core Concepts

### AlertSender

An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:

| Sender | Description | Use When |
|--------|-------------|----------|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |

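
The trait itself is small. A minimal sketch, inferred from the `MySender` example in "Adding a New Monitoring Stack" below (the real definition may carry additional methods):

```rust
// Marker trait for a monitoring stack that evaluates rules and sends alerts.
pub trait AlertSender {
    fn name(&self) -> String;
}
```
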
### AlertReceiver

An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has different configuration formats.

```rust
pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
}
```

Built-in receivers:

- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks

### AlertRule

An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.

```rust
pub trait AlertRule<S: AlertSender> {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
}
```

### Observability Capability

Topologies implement `Observability<S>` to indicate they support a specific alert sender:

```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```

This provides **compile-time verification**: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.

---

## Level 1: Cluster Monitoring

Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:

- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)

### Example: OKD Cluster Alerts

```rust
use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{
            alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
            prometheus_alert_rule::AlertManagerRuleGroup,
        },
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{K8sAnywhereTopology, monitoring::{AlertMatcher, AlertRoute, MatchOp}},
};

// Route only alerts labeled severity=critical to this receiver
let severity_matcher = AlertMatcher {
    label: "severity".to_string(),
    operator: MatchOp::Eq,
    value: "critical".to_string(),
};

let rule_group = AlertManagerRuleGroup::new(
    "cluster-rules",
    vec![high_pvc_fill_rate_over_two_days()],
);

// External (non-Kubernetes) scrape target, e.g. a firewall running node_exporter
let external_exporter = PrometheusNodeExporter {
    job_name: "firewall".to_string(),
    metrics_path: "/metrics".to_string(),
    listen_address: ip!("192.168.1.1"),
    port: 9100,
    ..Default::default()
};

harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(OpenshiftClusterAlertScore {
        sender: OpenshiftClusterAlertSender,
        receivers: vec![Box::new(DiscordReceiver {
            name: "critical-alerts".to_string(),
            url: hurl!("https://discord.com/api/webhooks/..."),
            route: AlertRoute {
                matchers: vec![severity_matcher],
                ..AlertRoute::default("critical-alerts".to_string())
            },
        })],
        rules: vec![Box::new(rule_group)],
        scrape_targets: Some(vec![Box::new(external_exporter)]),
    })],
    None,
).await?;
```

### What This Does

1. **Enables cluster monitoring** - Activates OKD's built-in Prometheus
2. **Enables user workload monitoring** - Allows namespace-scoped rules
3. **Configures Alertmanager** - Adds Discord receiver with route matching
4. **Deploys alert rules** - Creates `AlertingRule` CRD with PVC fill rate alert
5. **Adds external scrape target** - Configures Prometheus to scrape the firewall

### Compile-Time Safety

The `OpenshiftClusterAlertScore` requires:

```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore
```

If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.

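
For illustration, using the score with a topology that lacks the capability fails with a standard trait-bound error. `UnsupportedTopology` here is a hypothetical type, and the diagnostic is abbreviated:

```rust
// Hypothetical topology with no Observability<OpenshiftClusterAlertSender> impl:
struct UnsupportedTopology;

// Using OpenshiftClusterAlertScore as a Score<UnsupportedTopology> is
// rejected at compile time with something like:
//
//   error[E0277]: the trait bound `UnsupportedTopology:
//   Observability<OpenshiftClusterAlertSender>` is not satisfied
```
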
---
## Level 2: Tenant Monitoring

In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:

- Resources are deployed in the tenant's namespace
- Tenants cannot modify cluster-level monitoring configuration
- The topology determines the namespace context at runtime

### How It Works

The topology's `Observability` implementation handles tenant scoping:

```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender, inventory, rules) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "monitoring".to_string());

        // Rules are installed in the appropriate namespace
        for rule in rules.unwrap_or_default() {
            let score = KubePrometheusRuleScore {
                sender: sender.clone(),
                rule,
                namespace: namespace.clone(), // Tenant namespace
            };
            score.create_interpret().execute(inventory, self).await?;
        }
    }
}
```

### Tenant vs Cluster Resources

| Resource | Cluster-Level | Tenant-Level |
|----------|---------------|--------------|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |

### Runtime Validation

Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.

This check cannot be fully performed at compile time because the tenant context depends on who is running the code and what permissions they have; that information is only available at runtime.

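
For example, a tenant-scoped deployment that tries to create a cluster-scoped rule object is rejected by the API server. The service account and namespace below are hypothetical, but the message follows the usual shape of a Kubernetes RBAC diagnostic:

```
Error from server (Forbidden): prometheusrules.monitoring.coreos.com is forbidden:
User "system:serviceaccount:tenant-a:harmony" cannot create resource "prometheusrules"
in API group "monitoring.coreos.com" at the cluster scope
```
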
---
## Level 3: Application Monitoring

Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.

### Example

```rust
use std::sync::Arc;

use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};

// Define your application
let my_app = Arc::new(MyApplication::new());

// Add monitoring as a feature
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};

// Install with the application
my_app.add_feature(monitoring);
```

### What Application Monitoring Provides

1. **Automatic ServiceMonitor** - Creates a ServiceMonitor for your application's pods
2. **Ntfy Notification Channel** - Auto-installs and configures Ntfy for push notifications
3. **Tenant Awareness** - Automatically scopes to the correct namespace
4. **Sensible Defaults** - Pre-configured alert routes and receivers

### Under the Hood

```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // 1. Get tenant namespace (or use app name)
        let namespace = topology.get_tenant_config().await
            .map(|ns| ns.name.clone())
            .unwrap_or_else(|| self.application.name());

        // 2. Create ServiceMonitor for the app
        let app_service_monitor = ServiceMonitor {
            metadata: ObjectMeta {
                name: Some(self.application.name()),
                namespace: Some(namespace.clone()),
                ..Default::default()
            },
            spec: ServiceMonitorSpec::default(),
        };

        // 3. Install Ntfy for notifications
        let ntfy = NtfyScore { namespace, host };
        ntfy.interpret(&Inventory::empty(), topology).await?;

        // 4. Wire up webhook receiver to Ntfy
        let ntfy_receiver = WebhookReceiver { ... };

        // 5. Execute monitoring score
        alerting_score.interpret(&Inventory::empty(), topology).await?;
    }
}
```

---

## Pre-Built Alert Rules

Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:

### Kubernetes Alerts (`alerts/k8s/`)

```rust
use harmony::modules::monitoring::alert_rule::alerts::k8s::{
    pod::pod_failed,
    pvc::high_pvc_fill_rate_over_two_days,
    memory_usage::alert_high_memory_usage,
};

let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
    pod_failed(),
    high_pvc_fill_rate_over_two_days(),
    alert_high_memory_usage(),
]);
```

Available rules:

- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold

### Infrastructure Alerts (`alerts/infra/`)

```rust
use harmony::modules::monitoring::alert_rule::alerts::infra::opnsense::high_http_error_rate;

let rules = AlertManagerRuleGroup::new("infra-rules", vec![
    high_http_error_rate(),
]);
```

### Creating Custom Rules

```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "My service is down")
        .annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
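
Custom rules drop into the same group mechanism as the pre-built ones, so the two can be mixed freely:

```rust
// Group the custom rule alongside a built-in one
let rules = AlertManagerRuleGroup::new("custom-rules", vec![
    my_custom_alert(),
    pod_failed(),
]);
```
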
---

## Alert Receivers

### Discord Webhook

```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertRoute, AlertMatcher, MatchOp};

let discord = DiscordReceiver {
    name: "ops-alerts".to_string(),
    url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
    route: AlertRoute {
        receiver: "ops-alerts".to_string(),
        matchers: vec![AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec!["alertname".to_string()],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    },
};
```

### Generic Webhook

```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;

let webhook = WebhookReceiver {
    name: "custom-webhook".to_string(),
    url: hurl!("https://api.example.com/alerts"),
    route: AlertRoute::default("custom-webhook".to_string()),
};
```

---

## Adding a New Monitoring Stack

To add support for a new monitoring stack:

1. **Create the sender type** in `modules/monitoring/my_sender/mod.rs`:

   ```rust
   #[derive(Debug, Clone)]
   pub struct MySender;

   impl AlertSender for MySender {
       fn name(&self) -> String { "MySender".to_string() }
   }
   ```

2. **Define CRD types** in `modules/monitoring/my_sender/crd/`:

   ```rust
   #[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
   #[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
   pub struct MyAlertRuleSpec { ... }
   ```

3. **Implement Observability** in `domain/topology/k8s_anywhere/observability/my_sender.rs`:

   ```rust
   impl Observability<MySender> for K8sAnywhereTopology {
       async fn install_receivers(&self, sender, inventory, receivers) { ... }
       async fn install_rules(&self, sender, inventory, rules) { ... }
       // ...
   }
   ```

4. **Implement receiver conversions** for existing receivers:

   ```rust
   impl AlertReceiver<MySender> for DiscordReceiver {
       fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
           // Convert DiscordReceiver to MySender's format
       }
   }
   ```

5. **Create score types**:

   ```rust
   pub struct MySenderAlertScore {
       pub sender: MySender,
       pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
       pub rules: Vec<Box<dyn AlertRule<MySender>>>,
   }
   ```
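
With those pieces in place, the new stack is driven exactly like the OKD example earlier in this guide. A sketch, assuming `MySenderAlertScore` also gets a `Score<T>` impl bounded on `Observability<MySender>` (the receiver values below are hypothetical):

```rust
harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(MySenderAlertScore {
        sender: MySender,
        receivers: vec![Box::new(WebhookReceiver {
            name: "my-webhook".to_string(),
            url: hurl!("https://api.example.com/alerts"),
            route: AlertRoute::default("my-webhook".to_string()),
        })],
        rules: vec![],
    })],
    None,
).await?;
```
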
---

## Architecture Principles

### Type Safety Over Flexibility

Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified "MonitoringStack" type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.

### Compile-Time Capability Verification

The `Observability<S>` bound ensures you can't deploy OKD alerts to a KubePrometheus cluster. The compiler catches platform mismatches before deployment.

### Explicit Over Implicit

Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.

### Three Levels, One Foundation

Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.

---

## Related Documentation

- [ADR-020: Monitoring and Alerting Architecture](../adr/020-monitoring-alerting-architecture.md)
- [ADR-013: Monitoring Notifications (ntfy)](../adr/013-monitoring-notifications.md)
- [ADR-011: Multi-Tenant Cluster Architecture](../adr/011-multi-tenant-cluster.md)
- [Coding Guide](coding-guide.md)
- [Core Concepts](concepts.md)