# Monitoring and Alerting in Harmony
Harmony provides a unified, type-safe approach to monitoring and alerting across Kubernetes, OpenShift, and bare-metal infrastructure. This guide explains the architecture and how to use it at different levels of abstraction.
## Overview
Harmony's monitoring module supports three distinct use cases:
| Level | Who Uses It | What It Provides |
|---|---|---|
| Cluster | Cluster administrators | Full control over monitoring stack, cluster-wide alerts, external scrape targets |
| Tenant | Platform teams | Namespace-scoped monitoring in multi-tenant environments |
| Application | Application developers | Zero-config monitoring that "just works" |
Each level builds on the same underlying abstractions, ensuring consistency while providing appropriate complexity for each audience.
## Core Concepts

### AlertSender
An `AlertSender` represents the system that evaluates alert rules and sends notifications. Harmony supports multiple monitoring stacks:
| Sender | Description | Use When |
|---|---|---|
| `OpenshiftClusterAlertSender` | OKD/OpenShift built-in monitoring | Running on OKD/OpenShift |
| `KubePrometheus` | kube-prometheus-stack via Helm | Standard Kubernetes, need full stack |
| `Prometheus` | Standalone Prometheus | Custom Prometheus deployment |
| `RedHatClusterObservability` | RHOB operator | Red Hat managed clusters |
| `Grafana` | Grafana-managed alerting | Grafana as primary alerting layer |
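The trait behind these senders is minimal: each sender is essentially a marker type naming a monitoring stack. Based on the `MySender` example later in this guide, it can be sketched roughly as follows (a simplification, not Harmony's exact definition):

```rust
// Sketch of the AlertSender trait: each sender is a marker type that
// identifies a monitoring stack. (Simplified; the real trait may carry
// more methods.)
pub trait AlertSender {
    fn name(&self) -> String;
}

#[derive(Debug, Clone)]
pub struct KubePrometheus;

impl AlertSender for KubePrometheus {
    fn name(&self) -> String {
        "KubePrometheus".to_string()
    }
}

fn main() {
    // The sender value is mostly a type-level tag; the name is useful
    // for logging and diagnostics.
    println!("{}", KubePrometheus.name());
}
```

Because the sender carries no configuration of its own, receivers and rules can be parameterized over it purely at the type level, which is what enables the compile-time checks described below.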
### AlertReceiver

An `AlertReceiver` defines where alerts are sent (Discord, Slack, email, webhook, etc.). Receivers are parameterized by sender type because each monitoring stack has different configuration formats.
```rust
pub trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
}
```
Built-in receivers:

- `DiscordReceiver` - Discord webhooks
- `WebhookReceiver` - Generic HTTP webhooks
### AlertRule

An `AlertRule` defines a Prometheus alert expression. Rules are also parameterized by sender to handle different CRD formats.
```rust
pub trait AlertRule<S: AlertSender> {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
}
```
### Observability Capability

Topologies implement `Observability<S>` to indicate they support a specific alert sender:
```rust
impl Observability<OpenshiftClusterAlertSender> for K8sAnywhereTopology {
    async fn install_receivers(&self, sender, inventory, receivers) { ... }
    async fn install_rules(&self, sender, inventory, rules) { ... }
    // ...
}
```
This provides compile-time verification: if you try to use `OpenshiftClusterAlertScore` with a topology that doesn't implement `Observability<OpenshiftClusterAlertSender>`, the code won't compile.
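The mechanism can be demonstrated in isolation. The following is a self-contained sketch with stub types (not Harmony's real API): a deploy function bounded by the capability trait accepts only topologies that implement it.

```rust
// Stub traits illustrating the capability pattern (not Harmony's real API).
trait AlertSender {}
trait Observability<S: AlertSender> {}

struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {}

// A topology that supports the OpenShift sender.
struct OkdTopology;
impl Observability<OpenshiftClusterAlertSender> for OkdTopology {}

// A topology with no Observability impls at all.
struct BareTopology;

// Deploying requires the topology to prove the capability at compile time.
fn deploy_okd_alerts<T: Observability<OpenshiftClusterAlertSender>>(_topology: &T) -> &'static str {
    "deployed"
}

fn main() {
    // Compiles: OkdTopology declares the capability.
    println!("{}", deploy_okd_alerts(&OkdTopology));

    // deploy_okd_alerts(&BareTopology) would be rejected by the compiler
    // with a missing-trait-bound error, not a runtime failure.
}
```

The mismatch is caught before anything is deployed, which is the whole point of parameterizing `Observability` by sender.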
## Level 1: Cluster Monitoring
Cluster monitoring is for administrators who need full control over the monitoring infrastructure. This includes:
- Installing/managing the monitoring stack
- Configuring cluster-wide alert receivers
- Defining cluster-level alert rules
- Adding external scrape targets (e.g., bare-metal servers, firewalls)
### Example: OKD Cluster Alerts
```rust
use harmony::{
    modules::monitoring::{
        alert_channel::discord_alert_channel::DiscordReceiver,
        alert_rule::{
            alerts::k8s::pvc::high_pvc_fill_rate_over_two_days,
            prometheus_alert_rule::AlertManagerRuleGroup,
        },
        okd::openshift_cluster_alerting_score::OpenshiftClusterAlertScore,
        scrape_target::prometheus_node_exporter::PrometheusNodeExporter,
    },
    topology::{
        K8sAnywhereTopology,
        monitoring::{AlertMatcher, AlertRoute, MatchOp},
    },
};

// Route only critical-severity alerts to this receiver.
let severity_matcher = AlertMatcher {
    label: "severity".to_string(),
    operator: MatchOp::Eq,
    value: "critical".to_string(),
};

let rule_group = AlertManagerRuleGroup::new(
    "cluster-rules",
    vec![high_pvc_fill_rate_over_two_days()],
);

// External (non-cluster) scrape target, e.g. a firewall running node_exporter.
let external_exporter = PrometheusNodeExporter {
    job_name: "firewall".to_string(),
    metrics_path: "/metrics".to_string(),
    listen_address: ip!("192.168.1.1"),
    port: 9100,
    ..Default::default()
};

harmony_cli::run(
    Inventory::autoload(),
    K8sAnywhereTopology::from_env(),
    vec![Box::new(OpenshiftClusterAlertScore {
        sender: OpenshiftClusterAlertSender,
        receivers: vec![Box::new(DiscordReceiver {
            name: "critical-alerts".to_string(),
            url: hurl!("https://discord.com/api/webhooks/..."),
            route: AlertRoute {
                matchers: vec![severity_matcher],
                ..AlertRoute::default("critical-alerts".to_string())
            },
        })],
        rules: vec![Box::new(rule_group)],
        scrape_targets: Some(vec![Box::new(external_exporter)]),
    })],
    None,
)
.await?;
```
### What This Does
- Enables cluster monitoring - Activates OKD's built-in Prometheus
- Enables user workload monitoring - Allows namespace-scoped rules
- Configures Alertmanager - Adds Discord receiver with route matching
- Deploys alert rules - Creates an `AlertingRule` CRD with the PVC fill rate alert
- Adds external scrape target - Configures Prometheus to scrape the firewall
### Compile-Time Safety

The `OpenshiftClusterAlertScore` requires:

```rust
impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T>
    for OpenshiftClusterAlertScore
```
If `K8sAnywhereTopology` didn't implement `Observability<OpenshiftClusterAlertSender>`, this code would fail to compile. You cannot accidentally deploy OKD alerts to a cluster that doesn't support them.
## Level 2: Tenant Monitoring
In multi-tenant clusters, teams are often confined to specific namespaces. Tenant monitoring adapts to this constraint:
- Resources are deployed in the tenant's namespace
- Cluster-level monitoring configuration cannot be modified
- The topology determines the namespace context at runtime
### How It Works

The topology's `Observability` implementation handles tenant scoping:
```rust
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender, inventory, rules) {
        // The topology knows whether it is tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "monitoring".to_string());

        // Rules are installed in the appropriate namespace
        for rule in rules.unwrap_or_default() {
            let score = KubePrometheusRuleScore {
                sender: sender.clone(),
                rule,
                namespace: namespace.clone(), // Tenant namespace
            };
            score.create_interpret().execute(inventory, self).await?;
        }
    }
}
```
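The namespace fallback is the crux of tenant scoping. Isolated from the surrounding topology code (with a hypothetical `TenantConfig` stand-in), the logic reduces to a small `Option` chain:

```rust
// Hypothetical stand-in for Harmony's tenant configuration.
struct TenantConfig {
    name: String,
}

// Resolve the target namespace: the tenant's namespace when tenant-scoped,
// otherwise the cluster-wide "monitoring" namespace.
fn resolve_namespace(tenant: Option<TenantConfig>) -> String {
    tenant
        .map(|t| t.name)
        .unwrap_or_else(|| "monitoring".to_string())
}

fn main() {
    let tenant = Some(TenantConfig { name: "team-a".to_string() });
    println!("{}", resolve_namespace(tenant)); // tenant-scoped
    println!("{}", resolve_namespace(None));   // cluster default
}
```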
### Tenant vs Cluster Resources
| Resource | Cluster-Level | Tenant-Level |
|---|---|---|
| Alertmanager config | Global receivers | Namespaced receivers (where supported) |
| PrometheusRules | Cluster-wide alerts | Namespace alerts only |
| ServiceMonitors | Any namespace | Own namespace only |
| External scrape targets | Can add | Cannot add (cluster config) |
### Runtime Validation
Tenant constraints are validated at runtime via Kubernetes RBAC. If a tenant-scoped deployment attempts cluster-level operations, it fails with a clear permission error from the Kubernetes API.
This cannot be fully compile-time because tenant context is determined by who's running the code and what permissions they have—information only available at runtime.
## Level 3: Application Monitoring

Application monitoring provides zero-config, opinionated monitoring for developers. Just add the `Monitoring` feature to your application and it works.
### Example
```rust
use std::sync::Arc;

use harmony::modules::{
    application::{Application, ApplicationFeature},
    monitoring::alert_channel::webhook_receiver::WebhookReceiver,
};

// Define your application
let my_app = Arc::new(MyApplication::new());

// Add monitoring as a feature (the Arc is cloned so the application
// handle remains usable afterwards)
let monitoring = Monitoring {
    application: my_app.clone(),
    alert_receiver: vec![], // Uses defaults
};

// Install with the application
my_app.add_feature(monitoring);
```
### What Application Monitoring Provides
- Automatic ServiceMonitor - Creates a ServiceMonitor for your application's pods
- Ntfy Notification Channel - Auto-installs and configures Ntfy for push notifications
- Tenant Awareness - Automatically scopes to the correct namespace
- Sensible Defaults - Pre-configured alert routes and receivers
### Under the Hood
```rust
impl<T: Topology + Observability<Prometheus> + TenantManager>
    ApplicationFeature<T> for Monitoring
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // 1. Get tenant namespace (or use app name)
        let namespace = topology.get_tenant_config().await
            .map(|ns| ns.name.clone())
            .unwrap_or_else(|| self.application.name());

        // 2. Create ServiceMonitor for the app
        let app_service_monitor = ServiceMonitor {
            metadata: ObjectMeta {
                name: Some(self.application.name()),
                namespace: Some(namespace.clone()),
                ..Default::default()
            },
            spec: ServiceMonitorSpec::default(),
        };

        // 3. Install Ntfy for notifications
        let ntfy = NtfyScore { namespace, host };
        ntfy.interpret(&Inventory::empty(), topology).await?;

        // 4. Wire up webhook receiver to Ntfy
        let ntfy_receiver = WebhookReceiver { ... };

        // 5. Execute monitoring score
        alerting_score.interpret(&Inventory::empty(), topology).await?;
    }
}
```
## Pre-Built Alert Rules

Harmony provides a library of common alert rules in `modules/monitoring/alert_rule/alerts/`:

### Kubernetes Alerts (`alerts/k8s/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::k8s::{
    pod::pod_failed,
    pvc::high_pvc_fill_rate_over_two_days,
    memory_usage::alert_high_memory_usage,
};

let rules = AlertManagerRuleGroup::new("k8s-rules", vec![
    pod_failed(),
    high_pvc_fill_rate_over_two_days(),
    alert_high_memory_usage(),
]);
```
Available rules:

- `pod_failed()` - Pod in failed state
- `alert_container_restarting()` - Container restart loop
- `alert_pod_not_ready()` - Pod not ready for an extended period
- `high_pvc_fill_rate_over_two_days()` - PVC will fill within 2 days
- `alert_high_memory_usage()` - Memory usage above threshold
- `alert_high_cpu_usage()` - CPU usage above threshold
### Infrastructure Alerts (`alerts/infra/`)
```rust
use harmony::modules::monitoring::alert_rule::alerts::infra::opnsense::high_http_error_rate;

let rules = AlertManagerRuleGroup::new("infra-rules", vec![
    high_http_error_rate(),
]);
```
## Creating Custom Rules
```rust
use harmony::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "My service is down")
        .annotation("description", "The my-service job has been down for more than 5 minutes")
}
```
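The chained calls above follow a consuming-builder pattern. As an illustration of how such a builder can be structured, here is a self-contained sketch with assumed field names; it is not Harmony's actual `PrometheusAlertRule` implementation:

```rust
use std::collections::BTreeMap;

// Sketch of a Prometheus-style alert rule builder (field names assumed).
#[derive(Debug, Clone)]
pub struct PrometheusAlertRule {
    pub alert: String,
    pub expr: String,
    pub r#for: Option<String>,
    pub labels: BTreeMap<String, String>,
    pub annotations: BTreeMap<String, String>,
}

impl PrometheusAlertRule {
    pub fn new(alert: &str, expr: &str) -> Self {
        Self {
            alert: alert.to_string(),
            expr: expr.to_string(),
            r#for: None,
            labels: BTreeMap::new(),
            annotations: BTreeMap::new(),
        }
    }

    // Each method consumes and returns self, allowing chained construction.
    pub fn for_duration(mut self, d: &str) -> Self {
        self.r#for = Some(d.to_string());
        self
    }

    pub fn label(mut self, k: &str, v: &str) -> Self {
        self.labels.insert(k.to_string(), v.to_string());
        self
    }

    pub fn annotation(mut self, k: &str, v: &str) -> Self {
        self.annotations.insert(k.to_string(), v.to_string());
        self
    }
}

fn main() {
    let rule = PrometheusAlertRule::new("MyServiceDown", "up{job=\"my-service\"} == 0")
        .for_duration("5m")
        .label("severity", "critical");
    println!("{} fires after {:?}", rule.alert, rule.r#for);
}
```

The consuming style (`mut self` -> `Self`) keeps the rule immutable once bound to a variable, which suits declarative score definitions.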
## Alert Receivers

### Discord Webhook
```rust
use harmony::modules::monitoring::alert_channel::discord_alert_channel::DiscordReceiver;
use harmony::topology::monitoring::{AlertMatcher, AlertRoute, MatchOp};

let discord = DiscordReceiver {
    name: "ops-alerts".to_string(),
    url: hurl!("https://discord.com/api/webhooks/123456/abcdef"),
    route: AlertRoute {
        receiver: "ops-alerts".to_string(),
        matchers: vec![AlertMatcher {
            label: "severity".to_string(),
            operator: MatchOp::Eq,
            value: "critical".to_string(),
        }],
        group_by: vec!["alertname".to_string()],
        repeat_interval: Some("30m".to_string()),
        continue_matching: false,
        children: vec![],
    },
};
```
### Generic Webhook
```rust
use harmony::modules::monitoring::alert_channel::webhook_receiver::WebhookReceiver;

let webhook = WebhookReceiver {
    name: "custom-webhook".to_string(),
    url: hurl!("https://api.example.com/alerts"),
    route: AlertRoute::default("custom-webhook".to_string()),
};
```
## Adding a New Monitoring Stack
To add support for a new monitoring stack:
1. Create the sender type in `modules/monitoring/my_sender/mod.rs`:

    ```rust
    #[derive(Debug, Clone)]
    pub struct MySender;

    impl AlertSender for MySender {
        fn name(&self) -> String {
            "MySender".to_string()
        }
    }
    ```

2. Define CRD types in `modules/monitoring/my_sender/crd/`:

    ```rust
    #[derive(CustomResource, Debug, Serialize, Deserialize, Clone)]
    #[kube(group = "monitoring.example.com", version = "v1", kind = "MyAlertRule")]
    pub struct MyAlertRuleSpec { ... }
    ```

3. Implement `Observability` in `domain/topology/k8s_anywhere/observability/my_sender.rs`:

    ```rust
    impl Observability<MySender> for K8sAnywhereTopology {
        async fn install_receivers(&self, sender, inventory, receivers) { ... }
        async fn install_rules(&self, sender, inventory, rules) { ... }
        // ...
    }
    ```

4. Implement receiver conversions for existing receivers:

    ```rust
    impl AlertReceiver<MySender> for DiscordReceiver {
        fn build(&self) -> Result<ReceiverInstallPlan, InterpretError> {
            // Convert DiscordReceiver to MySender's format
        }
    }
    ```

5. Create score types:

    ```rust
    pub struct MySenderAlertScore {
        pub sender: MySender,
        pub receivers: Vec<Box<dyn AlertReceiver<MySender>>>,
        pub rules: Vec<Box<dyn AlertRule<MySender>>>,
    }
    ```
## Architecture Principles

### Type Safety Over Flexibility
Each monitoring stack has distinct CRDs and configuration formats. Rather than a unified "MonitoringStack" type that loses stack-specific features, we use generic traits that provide type safety while allowing each stack to express its unique configuration.
### Compile-Time Capability Verification

The `Observability<S>` bound ensures you can't deploy OKD alerts to a `KubePrometheus` cluster. The compiler catches platform mismatches before deployment.
### Explicit Over Implicit

Monitoring stacks are chosen explicitly (`OpenshiftClusterAlertSender` vs `KubePrometheus`). There's no "auto-detection" that could lead to surprising behavior.
### Three Levels, One Foundation

Cluster, tenant, and application monitoring all use the same traits (`AlertSender`, `AlertReceiver`, `AlertRule`). The difference is in how scores are constructed and how topologies interpret them.
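That sharing can be illustrated with stub traits: code written against `AlertRule<S>` works unchanged at every level, because only the score construction differs. (A self-contained sketch, not the real Harmony API:)

```rust
// Stub traits mirroring the shared foundation (not Harmony's real API).
trait AlertSender {}
trait AlertRule<S: AlertSender> {
    fn name(&self) -> String;
}

struct Prometheus;
impl AlertSender for Prometheus {}

struct PodFailed;
impl AlertRule<Prometheus> for PodFailed {
    fn name(&self) -> String {
        "PodFailed".to_string()
    }
}

// Generic over the sender: cluster, tenant, and application scores could all
// reuse a helper like this, differing only in how the rule list is assembled.
fn rule_names<S: AlertSender>(rules: &[Box<dyn AlertRule<S>>]) -> Vec<String> {
    rules.iter().map(|r| r.name()).collect()
}

fn main() {
    let rules: Vec<Box<dyn AlertRule<Prometheus>>> = vec![Box::new(PodFailed)];
    println!("{:?}", rule_names(&rules));
}
```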