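This article uses the official RocketMQ Exporter to monitor a RocketMQ cluster. It covers the following:
- Broker TPS/QPS monitoring
- Message backlog monitoring
- Consumer group consumption latency monitoring
Deploy RocketMQ Exporter
Create the following resource manifest, rocketmq-exporter.yaml, to deploy the RocketMQ Exporter service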
apiVersion: v1
kind: Service
metadata:
  labels:
    app: rocketmq-exporter
  name: rocketmq-exporter
  namespace: publics
spec:
  ports:
  - port: 5557
    name: http
    protocol: TCP
    targetPort: 5557
  selector:
    app: rocketmq-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: rocketmq-exporter
  name: rocketmq-exporter
  namespace: publics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rocketmq-exporter
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: rocketmq-exporter
    spec:
      containers:
      - image: sawyerlan/rocketmq-exporter:latest
        imagePullPolicy: IfNotPresent
        name: rocketmq-dashboard
        args: ["--rocketmq.config.namesrvAddr=192.160.0.51:9876;192.160.0.52:9876;192.160.0.53:9876"]
        ports:
        - containerPort: 5557
          name: http
          protocol: TCP
      restartPolicy: Always
      tolerations:
      - effect: PreferNoSchedule
        key: role
        operator: Equal
        value: master
Deploy the RocketMQ Exporter
kubectl create -f rocketmq-exporter.yaml
Check that the service has been deployed
# kubectl get pods,svc -n publics -l app=rocketmq-exporter
NAME                                     READY   STATUS    RESTARTS   AGE
pod/rocketmq-exporter-6fb8749cf6-vn48f   1/1     Running   0          103m

NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/rocketmq-exporter   ClusterIP   10.96.80.123   <none>        5557/TCP   106m
Test whether metrics can be scraped
# curl -s http://10.96.80.123:5557/metrics | tail -n 5
rocketmq_brokeruntime_commitlog_maxoffset{cluster="rocketmq-cluster",brokerIP="192.160.0.53:11011",brokerHost="",des="V4_9_3",boottime="1685186373909",broker_version="399",} 37656.0
rocketmq_brokeruntime_commitlog_minoffset{cluster="rocketmq-cluster",brokerIP="192.160.0.52:11011",brokerHost="",des="V4_9_3",boottime="1685186371897",broker_version="399",} 0.0
rocketmq_brokeruntime_commitlog_minoffset{cluster="rocketmq-cluster",brokerIP="192.160.0.53:11011",brokerHost="",des="V4_9_3",boottime="1685186373909",broker_version="399",} 0.0
rocketmq_brokeruntime_remain_howmanydata_toflush{cluster="rocketmq-cluster",brokerIP="192.160.0.52:11011",brokerHost="",des="V4_9_3",boottime="1685186371897",broker_version="399",} 0.0
rocketmq_brokeruntime_remain_howmanydata_toflush{cluster="rocketmq-cluster",brokerIP="192.160.0.53:11011",brokerHost="",des="V4_9_3",boottime="1685186373909",broker_version="399",} 0.0
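Rather than reading the scrape output by eye, the check can be scripted. The following is a minimal sketch, not part of the exporter: the function name missing_metrics is hypothetical, and the metric names in the usage comment are the ones the alert rules later in this article rely on. It reads the exposition text from stdin and prints every required metric family that has no sample line.

```shell
#!/bin/sh
# missing_metrics (hypothetical helper): print each metric family named in
# the arguments that does not appear in the exposition text read from stdin.
missing_metrics() {
    scraped=$(cat)
    for name in "$@"; do
        # A sample line starts with the metric name followed by '{' or a space.
        printf '%s\n' "$scraped" | grep -Eq "^${name}(\{| )" || printf '%s\n' "$name"
    done
}

# Usage (the address is the ClusterIP shown above; adjust to your cluster):
#   curl -s http://10.96.80.123:5557/metrics | missing_metrics \
#       rocketmq_broker_tps rocketmq_group_diff \
#       rocketmq_brokeruntime_commitlog_disk_ratio
```

Empty output means all listed families were found. Some families, such as rocketmq_group_diff, may only appear once consumer groups exist, so a missing name is a hint to investigate rather than proof of a broken deployment.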
Configure Prometheus to monitor RocketMQ
Create a prometheus-additional.yaml file with the following content
- job_name: 'rocketmq-exporter'
  static_configs:
  - targets: ['rocketmq-exporter.publics:5557']
    labels:
      Env: '生产'  # "生产" means "production"; the alert rules match on this exact value
      Cluster: 'rocketmq-cluster'
Create a Secret from the prometheus-additional.yaml file
kubectl create secret generic additional-configs \
  --from-file=prometheus-additional.yaml \
  -n monitoring
Edit the prometheus-prometheus.yaml file and add the configuration that references the Secret, as follows
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.22.1
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.22.1
  # Add the additionalScrapeConfigs settings
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true
Update the resource
kubectl replace -f prometheus-prometheus.yaml
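Once the operator has reloaded Prometheus, one way to confirm that the new job is being scraped is to query the up series through the Prometheus HTTP API. The sketch below makes several assumptions: the Service name prometheus-k8s and port 9090 are the kube-prometheus defaults and may differ in your cluster, and up_values is a hypothetical helper that pulls the sample values out of the JSON with grep and sed (jq would be more robust if it is available).

```shell
#!/bin/sh
# up_values (hypothetical helper): read an instant-query JSON response from
# stdin and print the sample value of every returned series, one per line.
# For the "up" metric, 1 means the target was scraped successfully, 0 means not.
up_values() {
    grep -o '"value":\[[^]]*]' | sed 's/.*"\([^"]*\)"]$/\1/'
}

# Usage (assumes kube-prometheus defaults: Service prometheus-k8s, port 9090):
#   kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
#   curl -sG http://127.0.0.1:9090/api/v1/query \
#       --data-urlencode 'query=up{job="rocketmq-exporter"}' | up_values
```

No output at all suggests the job has not been picked up yet, for example because the Secret name or key does not match the additionalScrapeConfigs entry.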
Configure the Grafana dashboard
Download the Grafana JSON file
git clone https://github.com/xxd763795151/rocketmq-monitor.git
Note:
- rocketmq-grafana.json: the Grafana dashboard file
- rocketmq_alert.yml: the alert rules file; replace the values of its two labels, Env and Cluster, with your own
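Before editing the label values, it helps to see which ones the file currently uses. The following sketch assumes the rules select series with matchers of the form Env="..." and Cluster="...", as the rule expressions in this article do; list_label_values is a hypothetical helper name.

```shell
#!/bin/sh
# list_label_values (hypothetical helper): read a rules file from stdin and
# print each distinct Env="..." or Cluster="..." matcher it contains.
list_label_values() {
    grep -oE '(Env|Cluster)="[^"]*"' | sort -u
}

# Usage, from the cloned repository (locate the file first if it is not in
# the repository root):
#   list_label_values < rocketmq_alert.yml
```

Each value printed must be changed to the value set under labels in your scrape configuration, otherwise the rule expressions select no series and the alerts never fire.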
Configure alerting rules for Alertmanager
Create a prometheus-rocketmqRule.yaml file with the following content
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.26.0
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rocketmq-rules
  namespace: monitoring
spec:
  groups:
  - name: RocketMQ alerts
    rules:
    # Env="生产" ("production") must match the Env label set in prometheus-additional.yaml
    - alert: "RocketMQ cluster: disk space running low"
      expr: rocketmq_brokeruntime_commitlog_disk_ratio{Cluster="rocketmq-cluster",Env="生产"} * 100 > 70
      for: 1m
      labels:
        severity: warning
      annotations:
        description: 'Environment {{$labels.Env}}, cluster {{$labels.Cluster}}, broker {{$labels.brokerIP}}: disk space is running low, current usage is {{$value}}%'
        summary: 'Disk space is running low'
    - alert: "RocketMQ cluster: broker busy"
      expr: rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills{Cluster="rocketmq-cluster",Env="生产"} > 200
      for: 0m
      labels:
        severity: warning
      annotations:
        description: 'Environment {{$labels.Env}}, cluster {{$labels.Cluster}}, broker {{$labels.brokerIP}}: messages have been waiting more than 200ms to be processed'
        summary: 'Broker is under heavy load'
    - alert: "RocketMQ cluster: message backlog"
      expr: sum(rocketmq_group_diff{Cluster="rocketmq-cluster",Env="生产"}) by (Env, Cluster, group,topic) > 1000
      for: 0m
      labels:
        severity: warning
      annotations:
        description: 'Environment {{$labels.Env}}, cluster {{$labels.Cluster}}: consumer group {{$labels.group}} has a backlog on topic {{$labels.topic}}, backlog size is {{$value}}'
        summary: 'RocketMQ production: message backlog in consumer group {{$labels.group}}'
    - alert: "RocketMQ cluster: broker node down"
      expr: count(rocketmq_broker_tps{Cluster="rocketmq-cluster",Env="生产"}) by (Env, Cluster) < 2
      for: 0m
      labels:
        severity: warning
      annotations:
        description: 'Environment {{$labels.Env}}, cluster {{$labels.Cluster}}: too few live brokers, current active broker count: {{$value}}'
        summary: 'Broker node is down'
    - alert: "RocketMQ cluster: message commit taking too long"
      expr: rocketmq_brokeruntime_pmdt_1to2s{Cluster="rocketmq-cluster",Env="生产"} + on(Env, Cluster, brokerIP) rocketmq_brokeruntime_pmdt_2to3s{Cluster="rocketmq-cluster",Env="生产"} + on(Env, Cluster, brokerIP) rocketmq_brokeruntime_pmdt_3to4s{Cluster="rocketmq-cluster",Env="生产"} + on(Env, Cluster, brokerIP) rocketmq_brokeruntime_pmdt_4to5s{Cluster="rocketmq-cluster",Env="生产"} + on (Env, Cluster, brokerIP) rocketmq_brokeruntime_pmdt_5to10s{Cluster="rocketmq-cluster",Env="生产"} + on(Env, Cluster, brokerIP) rocketmq_brokeruntime_pmdt_10stomore{Cluster="rocketmq-cluster",Env="生产"} > 0
      for: 0m
      labels:
        severity: warning
      annotations:
        description: 'Environment {{$labels.Env}}, cluster {{$labels.Cluster}}, broker {{$labels.brokerIP}}: {{$value}} messages took longer than 1s to commit in the last minute'
        summary: 'Slow message commits in the last minute'
    - alert: "RocketMQ cluster: send TPS spike"
      expr: sum(rocketmq_broker_tps{Cluster="rocketmq-cluster",Env="生产"} - rocketmq_broker_tps{Cluster="rocketmq-cluster",Env="生产"} offset 30s) by (Env, Cluster) > 1000
      for: 0m
      labels:
        severity: warning
      annotations:
        description: 'Environment {{$labels.Env}}, cluster {{$labels.Cluster}}: send TPS spiked over the past 30s, current increase: {{$value}}'
        summary: 'Cluster send TPS spike'
Create the Rule resource
kubectl create -f prometheus-rocketmqRule.yaml
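To get a feel for what the backlog rule will fire on, its aggregation can be reproduced locally against a raw scrape. The sketch below is an approximation under stated assumptions, not how Prometheus evaluates the rule: backlog_over is a hypothetical helper, it assumes each rocketmq_group_diff sample carries group and topic labels (the labels the rule groups by) and ends with the sample value, and it ignores the Env and Cluster labels because those are attached by Prometheus at scrape time.

```shell
#!/bin/sh
# backlog_over (hypothetical helper): read exposition text from stdin, sum
# rocketmq_group_diff per (group, topic), and print "group topic total" for
# every pair whose total exceeds the threshold given as the first argument.
backlog_over() {
    awk -v limit="$1" '
        /^rocketmq_group_diff[{]/ {
            group = ""; topic = ""
            # group=" and topic=" are both 7 characters long; drop them and
            # the closing quote to keep only the label value.
            if (match($0, /group="[^"]*"/)) group = substr($0, RSTART + 7, RLENGTH - 8)
            if (match($0, /topic="[^"]*"/)) topic = substr($0, RSTART + 7, RLENGTH - 8)
            total[group " " topic] += $NF
        }
        END {
            for (key in total)
                if (total[key] > limit + 0) print key, total[key]
        }
    ' | sort
}

# Usage (1000 is the threshold used by the backlog rule):
#   curl -s http://10.96.80.123:5557/metrics | backlog_over 1000
```

If this prints pairs during normal operation, the threshold of 1000 is probably too low for that consumer group and the alert will be noisy.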
Check the rule configuration
# kubectl get prometheusrules.monitoring.coreos.com -n monitoring
NAME                              AGE
alertmanager-main-rules           2d6h
kube-prometheus-rules             2d6h
kube-state-metrics-rules          2d6h
kubernetes-monitoring-rules       2d6h
node-exporter-rules               2d6h
prometheus-k8s-prometheus-rules   2d6h
prometheus-k8s-rocketmq-rules     70m
prometheus-operator-rules         2d6h
Log in to the Prometheus Web UI and open the Alerts page to check whether the RocketMQ rules have been loaded
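The same check can be done from the command line through the Prometheus HTTP API endpoint /api/v1/rules. The sketch below rests on assumptions: the Service name prometheus-k8s and port 9090 are kube-prometheus defaults, count_rule_files is a hypothetical helper, and the operator is assumed to mount each PrometheusRule as a file whose name contains the namespace and the resource name, so searching for prometheus-k8s-rocketmq-rules should find it.

```shell
#!/bin/sh
# count_rule_files (hypothetical helper): read the /api/v1/rules JSON response
# from stdin and print how many distinct rule files have a path containing
# the string given as the first argument.
count_rule_files() {
    grep -o '"file":"[^"]*"' | sort -u | grep -c -- "$1"
}

# Usage (assumes kube-prometheus defaults: Service prometheus-k8s, port 9090):
#   kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
#   curl -s http://127.0.0.1:9090/api/v1/rules | count_rule_files prometheus-k8s-rocketmq-rules
```

A count of 0 usually means the labels on the PrometheusRule do not match the ruleSelector of the Prometheus resource (prometheus: k8s and role: alert-rules in this article).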