Prometheus Monitoring and Alerting: SLO-Based Alert Design


Introduction

In traditional monitoring, we tend to set a separate threshold for every metric: CPU usage > 80%, memory > 90%, latency > 500 ms. This metric-driven style of alerting has some obvious problems:

❌ Pain points of traditional monitoring:
  • Alert storms: many metrics fire at once, making it hard to prioritize
  • Disconnected from the business: technical metrics look fine while users are still affected
  • No context: it is unclear what impact an alert has on the business
  • Hard to act on: when an alert arrives, it is unclear how to respond

✅ Advantages of SLO-based alerting:
  • Business-oriented: directly tied to user experience
  • Clear priorities: based on the error budget burn rate
  • Fewer alerts: only the problems that matter
  • Predictable: long-term trends are caught early

SLO, SLI, and SLA concepts:

| Term | Full name | Meaning |
|------|-----------|---------|
| SLI | Service Level Indicator | A measurable indicator of service level (e.g., request success rate) |
| SLO | Service Level Objective | A target for an SLI (e.g., success rate > 99.9%) |
| SLA | Service Level Agreement | The level of service contractually promised to users |

Typical use cases:

  • 🌐 Web API availability monitoring
  • 📊 Database performance monitoring
  • 💾 Cache service monitoring
  • 🔔 Taming alert storms
  • 📈 Assessing business stability

Intended audience: operations engineers, SREs, and backend developers

SLO Basics

1. Defining SLIs

An SLI is a metric that measures the health of a service. Common SLI types:

```promql
# Availability SLI (request success rate)
requests_success / requests_total

# Latency SLI (request latency)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Throughput SLI (requests per second)
rate(http_requests_total[5m])

# Completeness SLI (data integrity)
valid_records / total_records
```
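
Percentile-style latency SLIs such as the `histogram_quantile` expression above are awkward to feed into an error budget. A common alternative, sketched below on the same standard histogram series and assuming a 500 ms bucket (`le="0.5"`) exists, is to express latency as the ratio of "fast enough" requests to all requests:

```promql
# Latency SLI as a good/total ratio:
# share of requests completing within 500 ms over the last 5 minutes
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```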


2. SLO Design Principles

Error budget:

Error budget = 100% - SLO

Example: a 99.9% availability SLO

  • Error budget = 0.1%
  • Allowed error time per month = 30 days × 24 hours × 60 minutes × 0.1%
  • = 43.2 minutes per month
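
To make the arithmetic concrete in PromQL terms, the consumed share of a 30-day error budget can be computed directly from the request counters. A minimal sketch, assuming the same `http_requests_total` counter with a `status` label that is used throughout this article:

```promql
# Fraction of the 30-day error budget already consumed:
# (observed error ratio over 30d) / (allowed error ratio, 0.1%)
(
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)
/ (1 - 0.999)
```

A value of 0.5 means half of the month's budget is gone; anything above 1 means the SLO has already been missed for that window.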

Error budget burn rate:

```promql
# Error budget burn rate (current error ratio / allowed error ratio)
(
  rate(http_requests_total{status=~"5.."}[5m])
  /
  rate(http_requests_total[5m])
)
/ 0.001  # 0.1% allowed error ratio
```


3. SLO Targets

```yaml
# Example: SLO for a web API
api_slo:
  name: "API availability"
  objective: 0.999   # 99.9% availability
  period: 30d        # evaluated over a 30-day window
  slis:
    - name: "Request success rate"
      metric: http_requests_total
      success_condition: status!~"5.."
```


---

Prometheus Configuration

1. Recording Rules

Recording rules pre-compute complex queries ahead of time to improve query performance.

```yaml
# prometheus-recording-rules.yml
groups:
  - name: slo-recording-rules
    interval: 30s
    rules:
      # Per-second request rate
      - record: http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, path)

      # Error rate
      - record: http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, method, path)

      # Success rate
      - record: http_success_rate:rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, method, path)
            /
            sum(rate(http_requests_total[5m])) by (service, method, path)
          )

      # Error budget burn rate
      - record: http_error_budget_burn_rate:5m
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          )
          / (1 - 0.999)  # 0.1% allowed error ratio
```


2. Alerting Rules

```yaml
# prometheus-alerts-slo.yml
groups:
  - name: slo-alerts
    interval: 1m
    rules:
      # Fast error budget burn
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          http_error_budget_burn_rate:5m{service=~".+"} > 2
        for: 5m
        labels:
          severity: critical
          team: '{{ $labels.team }}'
        annotations:
          summary: |
            [{{ $labels.service }}] Error budget burning too fast
          description: |
            Service {{ $labels.service }} is consuming its error budget rapidly.
            Current burn rate: {{ $value | humanize }}x the allowed rate.
            At this pace the 30-day budget will be exhausted well before the end of the window.
          runbook_url: https://wiki.internal/runbooks/slo-burn-rate

      # Long-term trend
      - alert: ErrorBudgetDepleting
        expr: |
          rate(http_requests_total{status=~"5.."}[1h]) /
          rate(http_requests_total[1h]) > 0.001
        for: 4h
        labels:
          severity: warning
          team: '{{ $labels.team }}'
        annotations:
          summary: |
            [{{ $labels.service }}] Error budget steadily depleting
          description: |
            Service {{ $labels.service }} has been consuming its error budget continuously over the last hour.
            Current error ratio: {{ $value | humanizePercentage }}
            The error budget is expected to be exhausted within 30 days.
          runbook_url: https://wiki.internal/runbooks/slo-depleting

      # Sudden error increase
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) >
          rate(http_requests_total{status=~"5.."}[1h]) * 3
        for: 5m
        labels:
          severity: critical
          team: '{{ $labels.team }}'
        annotations:
          summary: |
            [{{ $labels.service }}] Error rate spike
          description: |
            The error rate of service {{ $labels.service }} has increased sharply.
            Current 5-minute error request rate: {{ $value | humanize }} req/s,
            more than 3x the 1-hour average.
          runbook_url: https://wiki.internal/runbooks/high-error-rate

      # Error budget nearly exhausted
      - alert: ErrorBudgetNearExhaustion
        expr: |
          (
            sum(increase(http_requests_total{status=~"5.."}[30d])) by (service)
            /
            sum(increase(http_requests_total[30d])) by (service)
          ) > 0.0005
        for: 1d
        labels:
          severity: warning
          team: '{{ $labels.team }}'
        annotations:
          summary: |
            [{{ $labels.service }}] Error budget nearly exhausted
          description: |
            Service {{ $labels.service }} has used more than 50% of its error budget over the last 30 days.
            Current error ratio: {{ $value | humanizePercentage }}
            Little error budget remains.
          runbook_url: https://wiki.internal/runbooks/slo-exhaustion
```


---

Hands-On Examples

1. SLO Monitoring for a Web API

```yaml
# SLO definition for the web API
groups:
  - name: api-slo
    rules:
      # API request success rate
      - record: api:success_rate:5m
        expr: |
          1 - (
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api"}[5m]))
          )

      # Remaining API error budget
      - record: api:error_budget_remaining:30d
        expr: |
          (
            1 - 0.999  # SLO target
          ) - (
            sum(increase(http_requests_total{job="api",status=~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="api"}[30d]))
          )

      # API error budget burn alert
      - alert: APISLOErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api"}[5m]))
          )
          / 0.001 > 2
        for: 5m
        annotations:
          summary: "API error budget burning too fast"
          description: "The API service's error budget burn rate has exceeded the threshold"
```


2. SLO Monitoring for a Database Service

```yaml
# SLO definition for the database
groups:
  - name: database-slo
    rules:
      # Database connection success rate
      - record: db:connection_success_rate:5m
        expr: |
          sum(increase(mysql_global_status_connections{status="accepted"}[5m]))
          /
          sum(increase(mysql_global_status_connections[5m]))

      # Slow query ratio
      - record: db:slow_query_rate:5m
        expr: |
          sum(rate(mysql_global_status_slow_queries[5m]))
          /
          sum(rate(mysql_global_status_questions[5m]))

      # Remaining database error budget
      - record: db:error_budget_remaining:30d
        expr: |
          (
            1 - 0.9999  # 99.99% availability SLO
          ) - (
            sum(increase(mysql_global_status_errors_total[30d]))
            /
            sum(increase(mysql_global_status_commands_total[30d]))
          )

      # Database connection error rate alert
      - alert: DatabaseConnectionErrorRate
        expr: |
          (
            sum(rate(mysql_global_status_errors_total[5m]))
            /
            sum(rate(mysql_global_status_connections[5m]))
          )
          / 0.0001 > 3
        for: 5m
        annotations:
          summary: "Abnormal database connection error rate"
          description: "The database connection error rate exceeds the SLO threshold"
```


3. SLO Monitoring for a Cache Service

```yaml
# SLO definition for the cache service
groups:
  - name: cache-slo
    rules:
      # Cache hit rate
      - record: cache:hit_rate:5m
        expr: |
          sum(rate(redis_hits_total[5m]))
          /
          (
            sum(rate(redis_hits_total[5m]))
            +
            sum(rate(redis_misses_total[5m]))
          )

      # Cache latency (p99)
      - record: cache:latency:p99:5m
        expr: |
          histogram_quantile(0.99,
            rate(redis_command_duration_seconds_bucket[5m])
          )

      # Remaining cache error budget
      - record: cache:error_budget_remaining:30d
        expr: |
          (
            1 - 0.995  # 99.5% hit rate SLO
          ) - (
            1 - (
              sum(rate(redis_hits_total[30d]))
              /
              (
                sum(rate(redis_hits_total[30d]))
                +
                sum(rate(redis_misses_total[30d]))
              )
            )
          )

      # Cache hit rate drop alert
      - alert: CacheHitRateLow
        expr: |
          (
            sum(rate(redis_hits_total[5m]))
            /
            (
              sum(rate(redis_hits_total[5m]))
              +
              sum(rate(redis_misses_total[5m]))
            )
          ) < 0.99
        for: 10m
        annotations:
          summary: "Cache hit rate below SLO"
          description: "Cache hit rate is below 99% and may affect performance"
```


---

Alerting Strategy

1. Error Budget Burn Alerts

```yaml
groups:
  - name: error-budget-burn
    rules:
      # Fast burn (2x rate, sustained for 5 minutes)
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m])
          / (1 - 0.999) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "[{{ $labels.service }}] Error budget burning fast"
          description: "The error budget is being consumed at {{ $value | humanize }}x the allowed rate"

      # Medium burn (1.5x rate, sustained for 30 minutes)
      - alert: ErrorBudgetBurnRateMedium
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m])
          / (1 - 0.999) > 1.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "[{{ $labels.service }}] Error budget burning at a moderate rate"
          description: "The error budget is being consumed at {{ $value | humanize }}x the allowed rate"
```
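
Both rules above use a single lookback window, which trades detection speed against noise. A common refinement, popularized by the Google SRE Workbook, is the multiwindow burn-rate alert: the same burn-rate threshold must be exceeded over both a short and a long window before the alert fires. A minimal sketch reusing this article's metric names; the 14.4x threshold (which burns roughly 2% of a 30-day budget per hour) and the window pair are illustrative, not prescriptive:

```yaml
groups:
  - name: error-budget-burn-multiwindow
    rules:
      # Fires only when both the 5m and 1h windows burn the budget at more than
      # 14.4x the allowed rate, so short blips do not page anyone.
      - alert: ErrorBudgetBurnRateMultiwindow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
            / (1 - 0.999) > 14.4
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
            / sum(rate(http_requests_total[1h])) by (service)
            / (1 - 0.999) > 14.4
          )
        labels:
          severity: critical
        annotations:
          summary: "[{{ $labels.service }}] Error budget burning fast (multiwindow)"
```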


2. Long-Term Trend Alerts

```yaml
groups:
  - name: error-budget-trend
    rules:
      # Error ratio over 30 days exceeds the threshold
      - alert: ErrorBudgetExhaustionWarning
        expr: |
          (
            sum(increase(http_requests_total{status=~"5.."}[30d])) by (service)
            /
            sum(increase(http_requests_total[30d])) by (service)
          ) > 0.0005  # 50% of the error budget used
        for: 1d
        labels:
          severity: warning
        annotations:
          summary: "[{{ $labels.service }}] More than half of the error budget used"
          description: "50% of the 30-day error budget has already been consumed; this service needs attention"

      # Roughly one week of error budget left
      - alert: ErrorBudgetCritical
        expr: |
          (
            sum(increase(http_requests_total{status=~"5.."}[7d])) by (service)
            /
            sum(increase(http_requests_total[7d])) by (service)
          ) > 0.00014  # 1/7 * 0.001
        for: 3d
        labels:
          severity: critical
        annotations:
          summary: "[{{ $labels.service }}] Error budget about to run out"
          description: "At the current trend, the error budget will be exhausted within a week"
```


3. Error Spike Alerts

```yaml
groups:
  - name: spike-alerts
    rules:
      # Sudden error-rate spike: 1-minute error rate far above its 1-hour average
      - alert: ErrorRateSpike
        expr: |
          rate(http_requests_total{status=~"5.."}[1m])
          >
          5 * rate(http_requests_total{status=~"5.."}[1h])
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "[{{ $labels.service }}] Error rate spike"
          description: "The error rate over the last minute is more than 5x its 1-hour average"
```


---

Alertmanager Configuration

1. Alert Severity and Routing

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'webhook'
  routes:
    # SLO alerts - special handling; matched first with continue: true
    # so the severity-based routes below also fire
    - match:
        alertname: ErrorBudgetBurnRateHigh
      receiver: 'sre-oncall'
      continue: true

    # Critical alerts - notify immediately
    - match:
        severity: critical
      receiver: 'pagerduty'
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts - notify on a slower cadence
    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 12h

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://alertmanager-webhook:5001/'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: $PAGERDUTY_KEY
        severity: critical

  - name: 'slack'
    slack_configs:
      - channel: '#alerting'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'sre-oncall'
    slack_configs:
      - channel: '#sre-oncall'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] SLO Error Budget Burn'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```


---

Best Practices

1. SLO Review Process

```markdown
SLO review checklist:

□ Is the business goal clearly defined?
□ Is the SLI measurable?
□ Is the SLO set at a reasonable level (neither too strict nor too loose)?
□ Is the error budget acceptable?
□ Are the alert thresholds appropriate?
□ Is the corresponding monitoring in place?
□ Is a response process defined?
```


2. Alert Tuning

Ways to keep alert noise down (items 3 and 5 are sketched after this list):

1. Use recording rules to reduce query complexity

2. Set a sensible `for` duration on each alert

3. Use `group_wait` and `group_interval` to batch related alerts

4. Add sustained-condition checks

5. Send resolved notifications
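
A minimal sketch of items 3 and 5 on the Alertmanager side, with comments on what each grouping knob does. The values are illustrative starting points, and the snippet assumes `slack_api_url` is set globally as in the full configuration above:

```yaml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s       # wait this long before sending the first notification for a new group
  group_interval: 5m    # wait this long before notifying about new alerts added to an existing group
  repeat_interval: 4h   # wait this long before re-notifying about still-firing alerts
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerting'
        send_resolved: true   # item 5: also notify when the alert clears
```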


3. Dashboards

```yaml
# Suggested Grafana panels
panels:
  - title: "SLO error budget remaining"
    type: graph
    targets:
      - expr: "api:error_budget_remaining:30d"

  - title: "Error budget burn rate"
    type: graph
    targets:
      - expr: "http_error_budget_burn_rate:5m"

  - title: "SLI trend"
    type: timeseries
    targets:
      - expr: "api:success_rate:5m"

  - title: "Alert status"
    type: stat
    targets:
      - expr: "count(ALERTS{alertstate='firing'})"
```

---

Summary

Key Takeaways

✅ SLO/SLI/SLA concepts: the business-oriented foundation of monitoring
✅ Error budget: the core measure of service health
✅ Prometheus configuration: recording rules + alerting rules
✅ Alerting strategy: error budget burn + long-term trends + error spikes
✅ Alertmanager: alert severity levels and routing
✅ Best practices: SLO reviews, alert tuning, dashboards

Performance Optimization Tips

✅ Recording rule optimization:

  • Pre-compute frequently used expressions
  • Reduce the complexity of ad-hoc queries
  • Improve query performance

✅ Alert configuration optimization:

  • Set sensible `for` durations
  • Group related alerts
  • Avoid duplicate notifications

✅ Monitoring configuration optimization (see the sketch below):

  • Set a reasonable scrape_interval
  • Choose an appropriate retention period
  • Pick a sensible evaluation interval for recording rules
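
A minimal sketch of the last three points. The intervals shown are illustrative defaults, not recommendations for any particular workload; note that retention is set with a command-line flag rather than in the configuration file:

```yaml
# prometheus.yml (excerpt)
global:
  scrape_interval: 30s        # how often targets are scraped
  evaluation_interval: 30s    # how often recording and alerting rules are evaluated

rule_files:
  - prometheus-recording-rules.yml
  - prometheus-alerts-slo.yml

# Retention is configured when starting Prometheus, for example:
#   prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d
```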

*Last updated: April 28, 2026*
*Author: creator | Applies to Prometheus 2.x + Alertmanager 0.x*

