
Architecture Decision Record: Monitoring and Alerting Architecture

Initial Authors: Willem Rolleman, Jean-Gabriel Carrier

Initial Date: March 9, 2026

Last Updated Date: March 9, 2026

Status

Accepted

Supersedes: ADR-010

Context

Harmony needs a unified approach to monitoring and alerting across different infrastructure targets:

  1. Cluster-level monitoring: Administrators managing entire Kubernetes/OKD clusters need to define cluster-wide alerts, receivers, and scrape targets.

  2. Tenant-level monitoring: Multi-tenant clusters where teams are confined to namespaces need monitoring scoped to their resources.

  3. Application-level monitoring: Developers deploying applications want zero-config monitoring that "just works" for their services.

The monitoring landscape is fragmented:

  • OKD/OpenShift: Built-in Prometheus with AlertmanagerConfig CRDs
  • KubePrometheus: Helm-based stack with PrometheusRule CRDs
  • RHOB (Red Hat Observability): Operator-based with MonitoringStack CRDs
  • Standalone Prometheus: Raw Prometheus deployments

Each system has different CRDs, different installation methods, and different configuration APIs.

Decision

We implement a trait-based architecture with compile-time capability verification that provides:

  1. Type-safe abstractions via parameterized traits: AlertReceiver<S>, AlertRule<S>, ScrapeTarget<S>
  2. Compile-time topology compatibility via the Observability<S> capability bound
  3. Three levels of abstraction: Cluster, Tenant, and Application monitoring
  4. Pre-built alert rules as functions that return typed structs

Core Traits

// domain/topology/monitoring.rs

/// Marker trait for systems that send alerts (Prometheus, etc.)
pub trait AlertSender: Send + Sync + std::fmt::Debug {
    fn name(&self) -> String;
}

/// Defines how a receiver (Discord, Slack, etc.) builds its configuration
/// for a specific sender type
pub trait AlertReceiver<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build(&self) -> Result<ReceiverInstallPlan, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

/// Defines how an alert rule builds its PrometheusRule configuration
pub trait AlertRule<S: AlertSender>: std::fmt::Debug + Send + Sync {
    fn build_rule(&self) -> Result<serde_json::Value, InterpretError>;
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertRule<S>>;
}

/// Capability that topologies implement to support monitoring
pub trait Observability<S: AlertSender> {
    async fn install_alert_sender(&self, sender: &S, inventory: &Inventory) 
        -> Result<PreparationOutcome, PreparationError>;
    async fn install_receivers(&self, sender: &S, inventory: &Inventory, 
        receivers: Option<Vec<Box<dyn AlertReceiver<S>>>>) -> Result<...>;
    async fn install_rules(&self, sender: &S, inventory: &Inventory,
        rules: Option<Vec<Box<dyn AlertRule<S>>>>) -> Result<...>;
    async fn add_scrape_targets(&self, sender: &S, inventory: &Inventory,
        scrape_targets: Option<Vec<Box<dyn ScrapeTarget<S>>>>) -> Result<...>;
    async fn ensure_monitoring_installed(&self, sender: &S, inventory: &Inventory) 
        -> Result<...>;
}

Alert Sender Types

Each monitoring stack is a distinct AlertSender:

| Sender | Module | Use Case |
| --- | --- | --- |
| OpenshiftClusterAlertSender | monitoring/okd/ | OKD/OpenShift built-in monitoring |
| KubePrometheus | monitoring/kube_prometheus/ | Helm-deployed kube-prometheus-stack |
| Prometheus | monitoring/prometheus/ | Standalone Prometheus via Helm |
| RedHatClusterObservability | monitoring/red_hat_cluster_observability/ | RHOB operator |
| Grafana | monitoring/grafana/ | Grafana-managed alerting |

Three Levels of Monitoring

1. Cluster-Level Monitoring

For cluster administrators. Full control over monitoring infrastructure.

// examples/okd_cluster_alerts/src/main.rs
OpenshiftClusterAlertScore {
    sender: OpenshiftClusterAlertSender,
    receivers: vec![Box::new(DiscordReceiver { ... })],
    rules: vec![Box::new(alert_rules)],
    scrape_targets: Some(vec![Box::new(external_exporters)]),
}

Characteristics:

  • Cluster-scoped CRDs and resources
  • Can add external scrape targets (outside cluster)
  • Manages Alertmanager configuration
  • Requires cluster-admin privileges

2. Tenant-Level Monitoring

For teams confined to namespaces. The topology determines tenant context.

// The topology's Observability impl handles namespace scoping
impl Observability<KubePrometheus> for K8sAnywhereTopology {
    async fn install_rules(&self, sender: &KubePrometheus, ...) {
        // Topology knows if it's tenant-scoped
        let namespace = self.get_tenant_config().await
            .map(|t| t.name)
            .unwrap_or_else(|| "default".to_string());
        // Install rules in tenant namespace
    }
}

Characteristics:

  • Namespace-scoped resources
  • Cannot modify cluster-level monitoring config
  • May have restricted receiver types
  • Runtime validation of permissions (cannot be fully compile-time)
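The namespace-resolution fallback shown in the tenant example above can be sketched as a standalone, synchronous function. The types here are hypothetical simplifications; Harmony's real topology API is async and returns richer types.

```rust
// Hypothetical, simplified stand-ins for the tenant-aware topology.
struct TenantConfig {
    name: String,
}

struct ScopedTopology {
    tenant: Option<TenantConfig>,
}

impl ScopedTopology {
    /// Resolve the namespace rules should be installed in: the tenant's
    /// namespace when confined, otherwise a default.
    fn target_namespace(&self) -> String {
        self.tenant
            .as_ref()
            .map(|t| t.name.clone())
            .unwrap_or_else(|| "default".to_string())
    }
}
```

The actual permission check still happens at the Kubernetes API: if the caller lacks access to the resolved namespace, the install fails with a clear RBAC error.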

3. Application-Level Monitoring

For developers. Zero-config, opinionated monitoring.

// modules/application/features/monitoring.rs
pub struct Monitoring {
    pub application: Arc<dyn Application>,
    pub alert_receiver: Vec<Box<dyn AlertReceiver<Prometheus>>>,
}

impl<T: Topology + Observability<Prometheus> + TenantManager + ...> 
    ApplicationFeature<T> for Monitoring 
{
    async fn ensure_installed(&self, topology: &T) -> Result<...> {
        // Auto-creates ServiceMonitor
        // Auto-installs Ntfy for notifications
        // Handles tenant namespace automatically
        // Wires up sensible defaults
    }
}

Characteristics:

  • Automatic ServiceMonitor creation
  • Opinionated notification channel (Ntfy)
  • Tenant-aware via topology
  • Minimal configuration required

Rationale

Why Generic Traits Instead of Unified Types?

Each monitoring stack (OKD, KubePrometheus, RHOB) has fundamentally different CRDs:

// OKD uses AlertmanagerConfig with different structure
AlertmanagerConfig { spec: { receivers: [...] } }

// RHOB uses secret references for webhook URLs
MonitoringStack { spec: { alertmanagerConfig: { discordConfigs: [{ apiURL: { key: "..." } }] } } }

// KubePrometheus uses Alertmanager CRD with different field names
Alertmanager { spec: { config: { receivers: [...] } } }

A unified type would either:

  1. Be a lowest-common-denominator (loses stack-specific features)
  2. Be a complex union type (hard to use, easy to misconfigure)

Generic traits let each stack express its configuration naturally while providing a consistent interface.
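As a sketch of how one receiver type expresses two stack-specific configurations: the field names and JSON layouts below are illustrative only, not the real DiscordReceiver or CRD types.

```rust
// Illustrative sketch: one receiver, two sender-specific outputs.
trait AlertSender {}

#[derive(Debug)]
struct KubePrometheus;
#[derive(Debug)]
struct RedHatClusterObservability;
impl AlertSender for KubePrometheus {}
impl AlertSender for RedHatClusterObservability {}

trait AlertReceiver<S: AlertSender> {
    fn build(&self) -> String;
}

struct DiscordReceiver {
    webhook_url: String,
}

// kube-prometheus style: webhook URL inlined into the Alertmanager config.
impl AlertReceiver<KubePrometheus> for DiscordReceiver {
    fn build(&self) -> String {
        format!(r#"{{"discord_configs":[{{"webhook_url":"{}"}}]}}"#, self.webhook_url)
    }
}

// RHOB style: webhook URL referenced indirectly through a secret key.
impl AlertReceiver<RedHatClusterObservability> for DiscordReceiver {
    fn build(&self) -> String {
        r#"{"discordConfigs":[{"apiURL":{"key":"webhook-url"}}]}"#.to_string()
    }
}
```

Each `impl` owns its stack's quirks; callers only see the `AlertReceiver<S>` interface.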

Why Compile-Time Capability Bounds?

impl<T: Topology + Observability<OpenshiftClusterAlertSender>> Score<T> 
    for OpenshiftClusterAlertScore { ... }

This fails at compile time if you try to use OpenshiftClusterAlertScore with a topology that doesn't support OKD monitoring. This prevents the "config-is-valid-but-platform-is-wrong" errors that Harmony was designed to eliminate.
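A self-contained sketch of this compile-time gate, with simplified synchronous signatures and a hypothetical BareMetalTopology that lacks the capability:

```rust
// Simplified stand-ins for Harmony's traits, to show the bound in isolation.
trait AlertSender {
    fn name(&self) -> String;
}

#[derive(Debug)]
struct OpenshiftClusterAlertSender;
impl AlertSender for OpenshiftClusterAlertSender {
    fn name(&self) -> String {
        "openshift-cluster".to_string()
    }
}

trait Observability<S: AlertSender> {
    fn ensure_monitoring_installed(&self, sender: &S) -> Result<String, String>;
}

struct OkdTopology;
impl Observability<OpenshiftClusterAlertSender> for OkdTopology {
    fn ensure_monitoring_installed(
        &self,
        sender: &OpenshiftClusterAlertSender,
    ) -> Result<String, String> {
        Ok(format!("installed {}", sender.name()))
    }
}

struct BareMetalTopology; // deliberately has no Observability impl

// The capability bound: only topologies implementing the trait compile here.
fn apply_score<T: Observability<OpenshiftClusterAlertSender>>(t: &T) -> Result<String, String> {
    t.ensure_monitoring_installed(&OpenshiftClusterAlertSender)
}

// apply_score(&BareMetalTopology); // ← rejected at compile time: bound not satisfied
```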

Why Not a MonitoringStack Abstraction (V2 Approach)?

The V2 approach proposed a unified MonitoringStack that hides sender selection:

// V2 approach - rejected
MonitoringStack::new(MonitoringApiVersion::V2CRD)
    .add_alert_channel(discord)

Problems:

  1. Hides which sender you're using, losing compile-time guarantees
  2. "Version selection" actually chooses between fundamentally different systems
  3. Would need to handle all stack-specific features through a generic interface

The current approach is explicit: you choose OpenshiftClusterAlertSender and the compiler verifies compatibility.

Why Runtime Validation for Tenants?

Tenant confinement is determined at runtime by the topology and K8s RBAC. We cannot know at compile time whether a user has cluster-admin or namespace-only access.

Options considered:

  1. Compile-time tenant markers - Would require modeling entire RBAC hierarchy in types. Over-engineering.
  2. Runtime validation - Current approach. Fails with clear K8s permission errors if insufficient access.
  3. No tenant support - Would exclude a major use case.

Runtime validation is the pragmatic choice. The failure mode is clear (K8s API error) and occurs early in execution.

Note: we will eventually have compile-time validation for such things. Rust macros are powerful, and we could discover the actual capabilities we are dealing with, similar to sqlx's approach in its query! macros.

Consequences

Pros

  1. Type Safety: Invalid configurations are caught at compile time
  2. Extensibility: Adding a new monitoring stack requires implementing traits, not modifying core code
  3. Clear Separation: Cluster/Tenant/Application levels have distinct entry points
  4. Reusable Rules: Pre-built alert rules as functions (high_pvc_fill_rate_over_two_days())
  5. CRD Accuracy: Type definitions match actual Kubernetes CRDs exactly

Cons

  1. Implementation Explosion: DiscordReceiver implements AlertReceiver<S> for each sender type (3+ implementations)
  2. Learning Curve: Understanding the trait hierarchy takes time
  3. clone_box Boilerplate: Required for trait object cloning (3 lines per impl)

Mitigations

  • Implementation explosion is contained: each receiver type has O(senders) implementations, but receivers are rare compared to rules
  • Learning curve is documented with examples at each level
  • clone_box boilerplate is minimal and copy-paste
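The clone_box pattern mentioned above, sketched with simplified traits: one blanket impl makes every boxed receiver cloneable, at the cost of three boilerplate lines per receiver impl.

```rust
// Simplified marker types for illustration.
trait AlertSender {}

#[derive(Debug)]
struct Prometheus;
impl AlertSender for Prometheus {}

trait AlertReceiver<S: AlertSender> {
    fn name(&self) -> String;
    fn clone_box(&self) -> Box<dyn AlertReceiver<S>>;
}

// One blanket impl: boxed trait objects become Clone via clone_box.
impl<S: AlertSender> Clone for Box<dyn AlertReceiver<S>> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

#[derive(Clone)]
struct WebhookReceiver;

impl AlertReceiver<Prometheus> for WebhookReceiver {
    fn name(&self) -> String {
        "webhook".to_string()
    }
    // The three lines of per-impl boilerplate:
    fn clone_box(&self) -> Box<dyn AlertReceiver<Prometheus>> {
        Box::new(self.clone())
    }
}
```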

Alternatives Considered

Unified MonitoringStack Type

See "Why Not a MonitoringStack Abstraction" above. Rejected for losing compile-time safety.

Helm-Only Approach

Use HelmScore directly for each monitoring deployment. Rejected because:

  • No type safety for alert rules
  • Cannot compose with application features
  • No tenant awareness

Separate Modules Per Use Case

Have cluster_monitoring/, tenant_monitoring/, app_monitoring/ as separate modules. Rejected because:

  • Massive code duplication
  • No shared abstraction for receivers/rules
  • Adding a feature requires three implementations

Implementation Notes

Module Structure

modules/monitoring/
├── mod.rs                     # Public exports
├── alert_channel/             # Receivers (Discord, Webhook)
├── alert_rule/                # Rules and pre-built alerts
│   ├── prometheus_alert_rule.rs
│   └── alerts/                # Library of pre-built rules
│       ├── k8s/               # K8s-specific (pvc, pod, memory)
│       └── infra/             # Infrastructure (opnsense, dell)
├── okd/                       # OpenshiftClusterAlertSender
├── kube_prometheus/           # KubePrometheus
├── prometheus/                # Prometheus
├── red_hat_cluster_observability/  # RHOB
├── grafana/                   # Grafana
├── application_monitoring/    # Application-level scores
└── scrape_target/             # External scrape targets

Adding a New Alert Sender

  1. Create sender type: pub struct MySender; impl AlertSender for MySender { ... }
  2. Implement Observability<MySender> for topologies that support it
  3. Create CRD types in crd/ subdirectory
  4. Implement AlertReceiver<MySender> for existing receivers
  5. Implement AlertRule<MySender> for AlertManagerRuleGroup
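Step 1 might look like the following sketch. The trait signature is simplified, and MySender is just the placeholder name from the list above.

```rust
use std::fmt::Debug;

// Simplified version of the AlertSender marker trait.
pub trait AlertSender: Send + Sync + Debug {
    fn name(&self) -> String;
}

// A new sender is a plain marker type implementing the trait.
#[derive(Debug, Clone)]
pub struct MySender;

impl AlertSender for MySender {
    fn name(&self) -> String {
        "my-sender".to_string()
    }
}
```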

Adding a New Alert Rule

pub fn my_custom_alert() -> PrometheusAlertRule {
    PrometheusAlertRule::new("MyAlert", "up == 0")
        .for_duration("5m")
        .label("severity", "critical")
        .annotation("summary", "Service is down")
}

No trait implementation is needed: AlertManagerRuleGroup already handles the conversion.
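A minimal builder with the call shape used above might look like this sketch. The real PrometheusAlertRule in alert_rule/ is richer; any field or method not shown in the example is an assumption here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone)]
pub struct PrometheusAlertRule {
    pub alert: String,
    pub expr: String,
    pub for_: Option<String>, // `for` is a keyword, hence the underscore
    pub labels: BTreeMap<String, String>,
    pub annotations: BTreeMap<String, String>,
}

impl PrometheusAlertRule {
    pub fn new(alert: &str, expr: &str) -> Self {
        Self {
            alert: alert.to_string(),
            expr: expr.to_string(),
            for_: None,
            labels: BTreeMap::new(),
            annotations: BTreeMap::new(),
        }
    }

    pub fn for_duration(mut self, d: &str) -> Self {
        self.for_ = Some(d.to_string());
        self
    }

    pub fn label(mut self, k: &str, v: &str) -> Self {
        self.labels.insert(k.to_string(), v.to_string());
        self
    }

    pub fn annotation(mut self, k: &str, v: &str) -> Self {
        self.annotations.insert(k.to_string(), v.to_string());
        self
    }
}
```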

Related Decisions

  • ADR-013: Notification channel selection (ntfy)
  • ADR-011: Multi-tenant cluster architecture