千家信息网

DingTalk Alerting with Prometheus + Grafana + Alertmanager on Kubernetes

Published: 2025-02-08 Author: 千家信息网 editor
Last updated: 2025-02-08

Related reading

1. Monitoring a Kubernetes cluster with Prometheus Operator

2. Custom application monitoring with Prometheus Operator

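I. Overview

Alertmanager and Prometheus are two separate components. The Prometheus server evaluates its alerting rules and sends the resulting alerts to Alertmanager. Alertmanager then manages those alerts (silencing, inhibition, aggregation) and sends notifications through channels such as email, DingTalk, and HipChat.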

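Alertmanager handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, then routes them to the correct receiver, for example email, Slack, or DingTalk. Alertmanager also supports grouping, silencing, and inhibition of alerts.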
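
To make deduplication and grouping more concrete, here is a minimal Python sketch. It is not Alertmanager's actual implementation: the function name `group_alerts` and the sample alerts are hypothetical, and real Alertmanager also applies timing rules (such as `group_wait`) that are left out here. Duplicates are detected by comparing full label sets, and groups are keyed by the labels listed in `group_by`.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with an already-seen label set, then bucket the rest
    by the values of the labels named in group_by (hypothetical helper)."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:  # same label set already received: duplicate
            continue
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical sample alerts, for illustration only.
alerts = [
    {"labels": {"alertname": "HighCPU", "job": "node", "instance": "a"}},
    {"labels": {"alertname": "HighCPU", "job": "node", "instance": "a"}},  # duplicate
    {"labels": {"alertname": "HighCPU", "job": "node", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "job": "db", "instance": "c"}},
]

grouped = group_alerts(alerts, group_by=["job"])
for key, members in grouped.items():
    print(key, len(members))
```

With `group_by=["job"]`, the two distinct `node` alerts end up in one group and the `db` alert in another, so each group can be delivered as a single notification.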

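DingTalk is our internal messaging tool. Almost everyone has it on both desktop and phone, so messages are seen right away. Because alerts need to reach people promptly, DingTalk is a good fit for alert notifications.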

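II. Adding a DingTalk robot

See the official DingTalk documentation: Custom Robots (自定义机器人).

After adding the robot, copy its hook URL (it appears that robots can only be added inside a DingTalk group chat). The hook is needed in the deployment step later.

Robot hook: https://oapi.dingtalk.com/robot/send?access_token=xxxxxx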
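
As a rough illustration of what gets posted to that hook, the sketch below builds a markdown-type message body, using the same `msgtype`/`markdown`/`title`/`text` fields that the conversion script later in this article sends. The helper name `build_markdown_message` is hypothetical, and the block only builds the body. Actually sending it would be an HTTP POST to the hook URL with `Content-Type: application/json`.

```python
import json


def build_markdown_message(title, text):
    """Return the UTF-8 encoded JSON body for a markdown robot message
    (hypothetical helper; field names follow the script in section IV)."""
    payload = {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": text},
    }
    # json.dumps escapes quotes and newlines, so arbitrary text stays valid JSON.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


body = build_markdown_message("Test", '### Hello\n\n > sent from a "custom robot"')
print(body.decode("utf-8"))
```

Building the body with `json.dumps` rather than string formatting keeps the request valid even when the text contains quotes or line breaks.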

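III. Configuring Alertmanager

Alertmanager official documentation: https://github.com/prometheus/docs/blob/db2a09a8a7e193d6e474f37055908a6d432b88b5/content/docs/alerting/configuration.md#webhook_config

Edit the Alertmanager alerting configuration. The official documentation linked above describes every parameter in detail, so they are not explained one by one here.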

[root@node-01 prometheus]# vim prometheus-operator/values.yaml
  config:
    global:
      resolve_timeout: 2m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 2m
      repeat_interval: 12h
      receiver: 'webhook'
      routes:
      - match:
          alertname: DeadMansSwitch
        receiver: 'webhook'
    receivers:
    - name: 'webhook'
      webhook_configs:
      - url: http://webhook-dingtalk/dingtalk/send/
        send_resolved: true

Upgrade the prometheus-operator release:

[root@node-01 prometheus]# helm upgrade p ./prometheus-operator

Once the change has been applied, the new configuration is visible on Alertmanager's Status page.
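
The `route` section of the configuration above is a small tree: an alert is compared against each child route's `match` labels, and falls back to the top-level `receiver` when nothing matches. The Python sketch below imitates only that part. It is a simplification under stated assumptions: the function `pick_receiver` is hypothetical, and features such as `match_re` and `continue` are deliberately ignored.

```python
def pick_receiver(route, labels):
    """Return the receiver name for an alert's labels (simplified sketch:
    exact-match `match` only, first matching child route wins)."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(labels.get(name) == value for name, value in match.items()):
            # A child route without its own receiver inherits the parent's.
            child_with_default = {"receiver": route["receiver"], **child}
            return pick_receiver(child_with_default, labels)
    return route["receiver"]


# The route tree from the values.yaml fragment above, as a Python dict.
route = {
    "receiver": "webhook",
    "routes": [
        {"match": {"alertname": "DeadMansSwitch"}, "receiver": "webhook"},
    ],
}

print(pick_receiver(route, {"alertname": "DeadMansSwitch"}))
print(pick_receiver(route, {"alertname": "SomethingElse", "job": "node"}))
```

In this article's configuration both paths lead to the `webhook` receiver, so every alert is forwarded to the DingTalk conversion service.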

IV. Converting the Alertmanager data format

Alertmanager sends an HTTP POST request to the configured endpoint with a JSON body in the following format:

{
  "version": "4",
  "groupKey": <string>,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}

Here is an example of the data received for a test alert:

b'{"receiver":"webhook","status":"firing","alerts":[{
    "status":"firing",
    "labels":{
        "alertname":"DeadMansSwitch",
        "prometheus":"monitoring/p-prometheus",
        "severity":"none"
    },
    "annotations":{
        "message":"This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional."
    },
    "startsAt":"2019-03-08T10:02:28.680317737Z",
    "endsAt":"0001-01-01T00:00:00Z",
    "generatorURL":"http://prom.cnlinux.club/graph?g0.expr=vector%281%29\\u0026g0.tab=1"
}],"groupLabels":{},"commonLabels":{
    "alertname":"DeadMansSwitch",
    "prometheus":"monitoring/p-prometheus",
    "severity":"none"
},"commonAnnotations":{"message":"This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional."},"externalURL":"http://alert.cnlinux.club","version":"4","groupKey":"{}/{alertname=\\"DeadMansSwitch\\"}:{}"}\n'
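
A short Python sketch of reading such a request body may help. The bytes below are a trimmed copy of the sample above (fewer labels and a shortened message, both my own simplifications), and the helper `summarize` is hypothetical. It only shows how the `alerts` array can be walked after decoding.

```python
import json

# Trimmed version of the sample body above; the message text is shortened.
raw = (
    b'{"receiver":"webhook","status":"firing","alerts":[{'
    b'"status":"firing",'
    b'"labels":{"alertname":"DeadMansSwitch","severity":"none"},'
    b'"annotations":{"message":"This is a DeadMansSwitch."},'
    b'"startsAt":"2019-03-08T10:02:28.680317737Z",'
    b'"endsAt":"0001-01-01T00:00:00Z"}],'
    b'"groupLabels":{},"version":"4"}\n'
)


def summarize(payload):
    """Return (alertname, status, startsAt) for every alert in the payload."""
    return [
        (
            alert["labels"].get("alertname", "?"),
            # Fall back to the group status if an alert carries no status of its own.
            alert.get("status", payload["status"]),
            alert["startsAt"],
        )
        for alert in payload["alerts"]
    ]


payload = json.loads(raw.decode("utf-8"))
for name, status, started in summarize(payload):
    print(name, status, started)
```

Note that `alerts` is a list: one request can carry several alerts from the same group.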

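DingTalk has its own requirements for the message format (see the official DingTalk documentation referenced above), so the data sent by Alertmanager has to be converted first.

Below, a small Python script of our own does the conversion.

Notes on the script: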

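  • In the data sent by Alertmanager, the important part is the labels{} object. It contains a lot of entries, however, and many of them are not needed in the alert message, so the script defines an EXCLUDE_LIST of label names to leave out.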
[root@node-01 prometheus]# cat app.py
#!/usr/bin/env python
import io, sys
# Re-wrap stdout/stderr so that non-ASCII log output is always written as UTF-8.
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding='utf-8')
from flask import Flask, Response
from flask import request
import requests
import logging
import json
import locale
#locale.setlocale(locale.LC_ALL,"en_US.UTF-8")

app = Flask(__name__)

console = logging.StreamHandler()
fmt = '%(asctime)s - %(filename)s:%(lineno)s - %(name)s - %(message)s'
formatter = logging.Formatter(fmt)
console.setFormatter(formatter)
log = logging.getLogger("flask_webhook_dingtalk")
log.addHandler(console)
log.setLevel(logging.DEBUG)

# Label names that should not appear in the DingTalk message.
EXCLUDE_LIST = ['prometheus', 'endpoint']

@app.route('/')
def index():
    return 'Webhook Dingtalk by Billy https://blog.51cto.com/billy98'

@app.route('/dingtalk/send/',methods=['POST'])
def hander_session():
    # The DingTalk robot hook URL is passed as the first command-line argument.
    profile_url = sys.argv[1]
    post_data = request.get_data()
    post_data = json.loads(post_data.decode("utf-8"))['alerts']
    # Only the first alert of the group is forwarded.
    post_data = post_data[0]
    messa_list = []
    # "报警类型" means "alert type"; the value is FIRING or RESOLVED.
    messa_list.append('### 报警类型: %s' % post_data['status'].upper())
    messa_list.append('**startsAt:** %s' % post_data['startsAt'])
    for i in post_data['labels'].keys():
        if i in EXCLUDE_LIST:
            continue
        else:
            messa_list.append("**%s:** %s" % (i, post_data['labels'][i]))
    messa_list.append('**Describe:** %s' % post_data['annotations']['message'])
    # Join the parts as markdown quote lines separated by blank lines.
    messa = (' \n\n > '.join(messa_list))
    status = alert_data(messa, post_data['labels']['alertname'], profile_url )
    log.info(status)
    return status

def alert_data(data,title,profile_url):
    headers = {'Content-Type':'application/json'}
    # Build the body with json.dumps so quotes or newlines in the text cannot break the JSON.
    send_data = json.dumps({"msgtype": "markdown", "markdown": {"title": title, "text": data}}, ensure_ascii=False)
    send_data = send_data.encode('utf-8')
    reps = requests.post(url=profile_url, data=send_data, headers=headers)
    return reps.text

if __name__ == '__main__':
    app.debug = False
    app.run(host='0.0.0.0', port=8080)
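
The script forwards only the first alert in each request and assumes that every alert has a `message` annotation. As a possible variation (not the author's code), the sketch below formats every alert in the payload and tolerates a missing annotation. The names `format_alert` and `format_all`, the English heading text, and the sample payload are all assumptions made for illustration.

```python
EXCLUDE_LIST = ["prometheus", "endpoint"]


def format_alert(alert, exclude=EXCLUDE_LIST):
    """Render one alert as DingTalk markdown text, skipping excluded labels."""
    lines = [
        "### Alert status: %s" % alert["status"].upper(),
        "**startsAt:** %s" % alert["startsAt"],
    ]
    for name, value in alert["labels"].items():
        if name not in exclude:
            lines.append("**%s:** %s" % (name, value))
    # Not every alerting rule defines a `message` annotation.
    message = alert.get("annotations", {}).get("message", "(no message)")
    lines.append("**Describe:** %s" % message)
    return " \n\n > ".join(lines)


def format_all(payload):
    """Return one markdown text per alert in the webhook payload."""
    return [format_alert(alert) for alert in payload.get("alerts", [])]


# Hypothetical payload with two alerts, for illustration only.
sample = {
    "alerts": [
        {
            "status": "firing",
            "startsAt": "2019-03-08T10:02:28Z",
            "labels": {"alertname": "DeadMansSwitch", "prometheus": "monitoring/p-prometheus"},
            "annotations": {"message": "pipeline check"},
        },
        {
            "status": "resolved",
            "startsAt": "2019-03-08T10:05:00Z",
            "labels": {"alertname": "HighCPU", "endpoint": "metrics", "severity": "warning"},
            "annotations": {},
        },
    ]
}

texts = format_all(sample)
for text in texts:
    print(text)
    print("-" * 20)
```

Each returned text could then be sent as its own DingTalk message, or the texts could be joined into one message per group. Which is better depends on how noisy the alerts are.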

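V. Building the Docker image

Package the Python script above into an image, then run it as a service inside the Kubernetes cluster so that it stays available.

You can also use the image I have already built and simply pull it: docker pull billy98/webhook-dingtalk:latest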

[root@node-01 prometheus]# cat Dockerfile
FROM centos:7 as build
MAINTAINER billy98 5884625@qq.com
RUN curl -o /etc/yum.repos.d/epel.repo http://mirrors.aliyun.com/repo/epel-7.repo && yum install -y python36 python36-pip && pip3.6 install flask requests werkzeug
ADD app.py /usr/local/alert-dingtalk.py

FROM gcr.io/distroless/python3
COPY --from=build /usr/local/alert-dingtalk.py /usr/local/alert-dingtalk.py
COPY --from=build usr/local/lib64/python3.6/site-packages usr/local/lib64/python3.6/site-packages
COPY --from=build usr/local/lib/python3.6/site-packages usr/local/lib/python3.6/site-packages
ENV PYTHONPATH=usr/local/lib/python3.6/site-packages:usr/local/lib64/python3.6/site-packages
EXPOSE 8080
ENTRYPOINT ["python","/usr/local/alert-dingtalk.py"]
[root@node-01 prometheus]# docker build -t billy98/webhook-dingtalk:latest .

Built this way, the image is only a little over 50 MB. For details on using distroless images, see:

distroless: https://github.com/GoogleContainerTools/distroless

VI. Deploying webhook-dingtalk

[root@node-01 prometheus]# cat webhook-dingtalk.yaml
apiVersion: apps/v1  # originally apps/v1beta2, which was removed in Kubernetes 1.16
kind: Deployment
metadata:
  labels:
    app: webhook-dingtalk
  name: webhook-dingtalk
  namespace: monitoring  # must be the same namespace as Alertmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webhook-dingtalk
  template:
    metadata:
      labels:
        app: webhook-dingtalk
    spec:
      containers:
      - image: billy98/webhook-dingtalk:latest
        name: webhook-dingtalk
        # originally written as an imagePullSecrets entry named IfNotPresent,
        # which looks like a slip for the pull policy
        imagePullPolicy: IfNotPresent
        args:
        - "https://oapi.dingtalk.com/robot/send?access_token=xxxxxx"
        # the DingTalk robot hook created above
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 500Mi
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
          tcpSocket:
            port: 8080
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
          httpGet:
            port: 8080
            path: /
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: webhook-dingtalk
  name: webhook-dingtalk
  namespace: monitoring  # must be the same namespace as Alertmanager
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: webhook-dingtalk
  type: ClusterIP
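
Once the Service is running, it can be handy to post a hand-made alert to it instead of waiting for a real one. The Python sketch below prepares such a request with the standard library only. The payload contents (`WebhookSmokeTest` and the other field values) and the helper names are hypothetical. The URL is the one configured in section III, which resolves only from inside the `monitoring` namespace. The final send call is left commented out, since it will trigger a real DingTalk message.

```python
import json
import urllib.request


def build_test_payload(alertname="WebhookSmokeTest"):
    """Minimal Alertmanager-style body containing the fields app.py reads."""
    return {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": alertname, "severity": "none"},
                "annotations": {"message": "Manual test of the DingTalk webhook."},
                "startsAt": "2019-03-08T10:02:28Z",
            }
        ],
    }


def build_request(url, payload):
    """Prepare (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("http://webhook-dingtalk/dingtalk/send/", build_test_payload())
print(req.get_method(), req.full_url)

# To actually send it, run from a pod in the same namespace:
# print(urllib.request.urlopen(req, timeout=5).read().decode("utf-8"))
```

If the pipeline works, the response body is whatever the DingTalk API returned, because the script passes it straight through.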

[Figure: the alert message as shown in DingTalk]

[Figure: the alert-recovery (resolved) message]

That completes the setup.

