Prometheus Alerting Rules That Actually Matter in Production
Most Prometheus alerting setups I've inherited look the same: 200+ rules, 90% of which are either permanently firing, permanently silenced, or so vague that on-call engineers ignore them. The problem isn't Prometheus—it's that teams copy-paste community rules without understanding what actually matters for their systems.
After years of managing alerting for production systems handling millions of requests, I've converged on a specific set of rules that consistently catch real problems while keeping noise under control. Here's what actually works.
The SLO-First Approach: Stop Alerting on Symptoms
Before writing a single alert, answer this question: "What promises have we made to users?" If you can't answer that, you're not ready to alert on anything.
The fundamental shift is moving from symptom-based alerting (CPU is high!) to SLO-based alerting (we're burning through our error budget too fast). Here's a practical error budget burn rate alert:
groups:
  - name: slo-alerts
    rules:
      # Fast burn - 2% of monthly budget in 1 hour
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Burning error budget 14.4x faster than sustainable"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook_url: "https://runbooks.internal/slo-breach"

      # Slow burn - 5% of monthly budget in 6 hours
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget burn at 6x rate"
The magic number 14.4 falls out of the budget math: a 99.9% SLO gives you a 0.1% monthly error budget, a 30-day month has 720 hours, and spending 2% of that budget in a single hour means burning 0.02 × 720 = 14.4 hours' worth of budget per hour. That's a page-worthy event. The slow burn at 6x catches sustained degradation that might fly under the radar.
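The same arithmetic generalizes to any SLO and window, so you can derive your own multipliers instead of copying mine (a 30-day, 720-hour SLO period is assumed here):
# burn_rate = (fraction of budget spent) * (SLO period) / (alert window)
# fast burn:  0.02 * 720h / 1h = 14.4  ->  alert at 14.4 * (1 - 0.999) = 1.44% error rate
# slow burn:  0.05 * 720h / 6h = 6.0   ->  alert at  6.0 * (1 - 0.999) = 0.60% error rate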
Saturation Alerts That Predict Outages
The USE method (Utilization, Saturation, Errors) is well-known, but most teams get saturation wrong. They alert when something hits 80% and wonder why they still get surprised by outages.
The trick is predicting exhaustion, not reacting to thresholds:
groups:
  - name: saturation-predictive
    rules:
      # Disk will fill in 4 hours at current rate
      - alert: DiskWillFillIn4Hours
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} < 100e9
          and
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.mountpoint }} will fill within 4 hours"
          current_free: "{{ $value | humanize1024 }}B"

      # Memory pressure - actual swapping, not just low free memory
      - alert: MemoryPressureSwapping
        expr: rate(node_vmstat_pgmajfault[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High major page fault rate indicates memory pressure"
          description: "{{ $value }} major faults/sec on {{ $labels.instance }}"
The disk alert combines predict_linear (will the volume hit zero free bytes within 4 hours at the current trend?) with an absolute threshold (< 100 GB free). Without the absolute check, a short burst of writes on a petabyte-scale volume can extrapolate to zero and page you while terabytes are still free.
For memory, forget "free memory < 10%"—Linux aggressively uses memory for caching, and that's fine. Alert on pgmajfault (major page faults), which means the system is actually paging to disk and users are feeling it.
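If the major-fault rule turns out to be noisy on read-heavy workloads (which legitimately fault pages in from disk), one option is to also require that available memory is genuinely low. A sketch, assuming node_exporter's standard memory metrics and thresholds you'd tune for your own fleet:
# fire only when major faults are high AND available memory is actually scarce
rate(node_vmstat_pgmajfault[5m]) > 100
and on(instance)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10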
Kubernetes-Specific Rules Worth Keeping
If you're running Kubernetes, you've probably encountered the kube-prometheus stack's default rules. Here's what to keep, what to tune, and what to delete:
Keep and tune:
groups:
  - name: kubernetes-apps
    rules:
      # Pod stuck in non-running state
      - alert: KubePodNotReady
        expr: |
          sum by (namespace, pod) (
            max by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown"})
            * on(namespace, pod) group_left(owner_kind)
            topk by (namespace, pod) (1, max by (namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))
          ) > 0
        for: 15m  # Not 5m - deployments need time
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 15m"

      # Container restarts - but not during deployments
      - alert: KubePodCrashLooping
        expr: |
          max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
          unless on(namespace, pod)
          (time() - kube_pod_created < 300)
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} crash looping"
The key modifications: exclude Jobs from pod readiness checks (they're supposed to terminate), add a grace period for newly created pods, and extend the for duration because 5 minutes of Pending during a cluster scale-up is normal.
Delete these default rules:
- KubeMemoryOvercommit - Overcommit is a feature, not a bug
- KubeCPUOvercommit - Same reasoning
- NodeClockSkew - Unless you're running bare metal without NTP
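Whichever rules you keep, tune, or delete, run the file through promtool before deploying it; promtool ships with Prometheus and catches YAML and PromQL mistakes up front. The file name below is just an example:
# validate rule file syntax and PromQL expressions
promtool check rules kubernetes-apps-rules.yml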
The Alerts You're Missing
After auditing dozens of production setups, these are consistently absent and consistently useful:
groups:
  - name: often-missing
    rules:
      # Certificate expiry - 14 days warning, 7 days critical
      - alert: CertificateExpiringCritical
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        labels:
          severity: critical
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in {{ $value | humanize }} days"

      # Prometheus itself falling behind
      - alert: PrometheusTargetScrapesSlow
        expr: |
          prometheus_target_interval_length_seconds{quantile="0.99"}
          / ignoring(quantile)
          prometheus_target_interval_length_seconds{quantile="0.5"}
          > 1.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus scrape intervals degraded - p99 is 50% higher than median"

      # Actual blackbox connectivity, not just port open
      - alert: ServiceEndpointDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe failing for {{ $labels.instance }}"
The Prometheus self-monitoring rule catches a common failure mode: Prometheus getting overloaded and falling behind on scrapes, which means your dashboards show stale data and alerts fire late.
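In the same spirit, and using only Prometheus's own self-metrics: if rule groups start missing evaluation iterations, your alerts aren't being evaluated on schedule at all. A sketch that slots into the same group (the threshold and duration are starting points, not gospel):
# rule groups missing iterations means alerting itself is running behind
- alert: PrometheusRuleEvaluationsMissed
  expr: increase(prometheus_rule_group_iterations_missed_total[10m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus is skipping rule evaluations - alerting is degraded"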
Practical Implementation: The Alert Audit
Run these queries against the Prometheus API to find your noisiest alerts over the past week and the ones that never resolve:
# Top 10 alerts by time spent firing over the past week
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))' \
  | jq '.data.result[] | {alert: .metric.alertname, count: .value[1]}'

# Alerts firing right now - long-lived entries here are stuck or permanently firing
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertstate="firing"}' \
  | jq '.data.result | length'
If any alert appears more than 10 times per week, either the threshold is wrong, the underlying issue needs fixing, or the alert should be deleted.
For every alert you have, ask: "What action does on-call take when this fires?" If the answer is "acknowledge and wait" or "check the dashboard," delete it. Alerts are for immediate human action, not monitoring.
Start by deleting 50% of your current alerts. I'm serious. Then add back only the ones where someone says "I really needed that."
Written by GeekOnCloud
DevOps & Infrastructure engineer at geekoncloud.com