feat: added alert rule and impl for prometheus as well as a few preconfigured bmc alerts for dell server that are used in the monitoring example #67

Merged
wjro merged 2 commits from feat/alert_rules into master 2025-06-26 13:16:41 +00:00
Owner
No description provided.
wjro added 1 commit 2025-06-25 19:11:37 +00:00
johnride requested changes 2025-06-25 19:31:58 +00:00
johnride left a comment
Owner

Looks pretty good overall. Some minor refactoring comments but the rest is great!
@ -0,0 +1,40 @@
use crate::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
pub fn global_storage_status_degraded_non_critical() -> PrometheusAlertRule {
Owner

Is this really Dell specific?
Author
Owner

These are from the Dell SNMP walk:
dell:
  walk:
  - 1.3.6.1.4.1.674.10892.5.2
  - 1.3.6.1.4.1.674.10892.5.4
  - 1.3.6.1.4.1.674.10892.5.5
  metrics:
  - name: globalSystemStatus
    oid: 1.3.6.1.4.1.674.10892.5.2.1
    type: gauge
    help: This attribute defines the overall rollup status of all components in
      the system being monitored by the remote access card - 1.3.6.1.4.1.674.10892.5.2.1
    enum_values:
      1: other
      2: unknown
      3: ok
      4: nonCritical
      5: critical
      6: nonRecoverable

There are a bunch of others as well, but I only included a few for the example. Each server type has a different SNMP walk, which translates to different metric names from the appropriate MIB file. I believe the Dell MIB that this is from is DELL-MM-MIB.
https://github.com/librenms/librenms/tree/master/mibs/dell
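
For illustration only, here is a minimal sketch of how the `enum_values` listed above could be turned into a PromQL comparison for an alert. The metric name `globalSystemStatus` and the numeric values come from the walk above; `DellStatus`, `from_value`, and `alert_expr` are hypothetical names that are not taken from this PR, and the actual rules may build their expressions differently.

```rust
/// Rollup status values, copied from the `enum_values` map of the SNMP walk.
/// `DellStatus` is a hypothetical type used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DellStatus {
    Other = 1,
    Unknown = 2,
    Ok = 3,
    NonCritical = 4,
    Critical = 5,
    NonRecoverable = 6,
}

impl DellStatus {
    /// Maps a raw gauge value to a status, or `None` if it is out of range.
    pub fn from_value(value: u8) -> Option<Self> {
        match value {
            1 => Some(Self::Other),
            2 => Some(Self::Unknown),
            3 => Some(Self::Ok),
            4 => Some(Self::NonCritical),
            5 => Some(Self::Critical),
            6 => Some(Self::NonRecoverable),
            _ => None,
        }
    }
}

/// Builds a PromQL expression that is true when `metric` equals `status`.
/// Hypothetical helper: the real alert rules may be written by hand instead.
pub fn alert_expr(metric: &str, status: DellStatus) -> String {
    format!("{} == {}", metric, status as u8)
}
```

With this sketch, `alert_expr("globalSystemStatus", DellStatus::NonCritical)` yields `globalSystemStatus == 4`. Because the numeric values are vendor-defined, a different server type would need its own enum.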

@ -0,0 +37,4 @@
"description",
"- **System**: {{ $labels.instance }}\n- **Status**: nonRecoverable\n- **Value**: {{ $value }}\n- **Job**: {{ $labels.job }}",
)
}
Owner

Nothing in this file seems Dell specific?
@ -0,0 +1,11 @@
use crate::modules::monitoring::alert_rule::prometheus_alert_rule::PrometheusAlertRule;
Owner

For clarity: this is fine-ish in the kube_prometheus mod because it is specifically for PVC alerts, but I think alert definitions should live in another module called just prometheus. kube_prometheus is for things specific to deploying Prometheus on k8s. It is entirely possible to have a Prometheus deployed somewhere else (AWS managed or Grafana Cloud, maybe) that scrapes k8s targets and will want this alert.

So the logic here would call for modules/prometheus/alerts/k8s, as this is a k8s-specific alert for any Prometheus deployment.
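
As a rough sketch of the split described above, assuming inline modules in place of real files: alert definitions sit under `prometheus::alerts::k8s` and do not depend on how Prometheus is deployed, while `kube_prometheus` only consumes them. `AlertRule` is a simplified stand-in for the crate's `PrometheusAlertRule`, and `pvc_almost_full`, `render_rule`, and the 0.95 threshold are hypothetical, not taken from this PR.

```rust
pub mod prometheus {
    pub mod alerts {
        /// Simplified stand-in for `PrometheusAlertRule`; the fields are assumed.
        #[derive(Debug, Clone, PartialEq)]
        pub struct AlertRule {
            pub name: String,
            pub expr: String,
        }

        /// Alerts about k8s targets, usable with any Prometheus deployment.
        pub mod k8s {
            use super::AlertRule;

            /// Hypothetical PVC alert; the threshold is illustrative only.
            pub fn pvc_almost_full() -> AlertRule {
                AlertRule {
                    name: "PvcAlmostFull".to_string(),
                    expr: "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.95"
                        .to_string(),
                }
            }
        }
    }
}

/// Code specific to deploying Prometheus on k8s; it only consumes alert rules.
pub mod kube_prometheus {
    use super::prometheus::alerts::AlertRule;

    /// Hypothetical helper that renders a rule as two `key: value` lines.
    pub fn render_rule(rule: &AlertRule) -> String {
        format!("alert: {}\nexpr: {}", rule.name, rule.expr)
    }
}
```

Under this layout, a Prometheus running outside the cluster could reuse `prometheus::alerts::k8s::pvc_almost_full()` without pulling in anything from `kube_prometheus`.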

Author
Owner

that makes sense
wjro added 1 commit 2025-06-25 20:10:55 +00:00
fix: modified directory names to be in line with alert functions and deployment environments
All checks were successful
Run Check Script / check (pull_request) Successful in 1m43s
e16f8fa82e
wjro merged commit 29e74a2712 into master 2025-06-26 13:16:41 +00:00
wjro deleted branch feat/alert_rules 2025-06-26 13:16:43 +00:00
Reference: NationTech/harmony#67