[Prometheus Series] Monitoring and Alerting with Prometheus + Alertmanager + a Custom Webhook

Software versions used

  • Prometheus: 2.45
  • Alertmanager: 0.25.0
  • Webhook: custom Java implementation

Installing and configuring Prometheus

Installing Prometheus

Download the matching binary release archive into /data/prometheus and extract it there.

Configuring prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.1.7:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/data/prometheus/alert.rules"
  # - "second_rules.yml"
# Scrape configurations: Prometheus itself, the Linux and Windows exporters,
# and the Java services registered in Eureka.
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.1.7:9090"]

  - job_name: "gm_172_linuxserver"
    static_configs:
      - targets: ["192.168.1.9:9100","192.168.1.8:9100","192.168.1.7:9100","192.168.1.6:9100","192.168.1.5:9100","192.168.1.4:9100","192.168.1.3:9100","192.168.1.2:9100","192.168.1.1:9100","192.168.1.10:9100","192.168.1.11:9100","192.168.1.12:9100","192.168.1.13:9100","192.168.1.14:9100"]
        labels:
          project: linux_node_exporter

  - job_name: "gm_172_windows_node"
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.1.201:9182','192.168.1.202:9182','192.168.1.203:9182']
        labels:
          project: windows_node_exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: "sm_172_windows_node"
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.1.204:9182','192.168.1.205:9182','192.168.1.206:9182']
        labels:
          project: windows_node_exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

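  - job_name: 'java-eureka-service'
    # Note: labels under static_configs apply only to targets listed in that
    # static_config, not to targets discovered through eureka_sd_configs.
    static_configs:
      - labels:
          project: linux_node_exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
    eureka_sd_configs:
      - server: http://192.168.1.4:8761/eureka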

Configuring the alert.rules file

groups:
  - name: java_service_health
    rules:
      - alert: Java service down
        expr: up{job=~"java-eureka-service"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "The service on {{$labels.instance}} is down, please handle it as soon as possible!"
          description: "The service on {{$labels.instance}} has been unreachable for more than 3 minutes, current state {{ $value }}."

  - name: linux_node_health
    rules:
      - alert: Linux disk usage too high
        expr: (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}) *100/(node_filesystem_avail_bytes {fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}+(node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}))>=80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage on Linux server {{ $labels.instance }} exceeds 80%"
          description: "Disk partition usage on {{$labels.instance}} is at or above 80%, current usage {{ $value }}%."
      - alert: Linux server down
        expr: up{project=~"linux_node_exporter"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Linux server {{$labels.instance}} is down, please handle it as soon as possible!"
          description: "Linux server {{$labels.instance}} has been unreachable for more than 3 minutes, current state {{ $value }}."
      - alert: Linux TCP connection count
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "TCP_ESTABLISHED is too high!"
          description: "TCP_ESTABLISHED on {{$labels.instance}} is above 1000, current value {{ $value }}."
      - alert: Memory usage too high
        expr: (1- (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100 > 80
        for: 5m # how long the condition must hold before the alert is sent to Alertmanager
        labels:
          severity: critical
        annotations:
          summary: "Memory usage on {{ $labels.instance }} is too high, please handle it as soon as possible!"
          description: "Memory usage on {{ $labels.instance }} exceeds 80%, current usage {{ $value }}%."

  - name: Windows_node_health
    rules:
      - alert: Windows CPU load high
        expr: 100 - (avg by (instance,job) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 70
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage on {{$labels.instance}} is too high, please handle it as soon as possible!"
          description: "CPU usage on {{$labels.instance}} is above 70%, current usage {{ $value }}%."
      - alert: Windows memory usage too high
        expr: 100 - 100 * windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes > 80
        for: 3m # how long the condition must hold before the alert is sent to Alertmanager
        labels:
          severity: critical
        annotations:
          summary: "Memory usage on {{ $labels.instance }} is too high, please handle it as soon as possible!"
          description: "Memory usage on {{ $labels.instance }} exceeds 80%, current usage {{ $value }}%."
      - alert: Windows server down
        expr: up{project=~"windows_node_exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Windows server {{$labels.instance}} is down, please handle it as soon as possible!"
          description: "Windows server {{$labels.instance}} has been unreachable for more than 1 minute, current state {{ $value }}."
      - alert: Windows disk usage
        expr: 100 - 100 * (windows_logical_disk_free_bytes {volume=~"C:"} / windows_logical_disk_size_bytes {volume=~"C:"}) > 65
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage on {{$labels.instance}} {{$labels.volume}} is too high, please handle it as soon as possible!"
          description: "Disk partition {{$labels.volume}} on {{$labels.instance}} is more than 65% full, current usage {{ $value }}%."
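
The Linux disk rule in linux_node_health divides used space (size minus free) by used space plus node_filesystem_avail_bytes. Because avail excludes blocks reserved for root while free includes them, this gives a percentage close to the Use% column of df rather than a plain used/size ratio. The arithmetic can be sketched in Java as below; the class name and the byte counts are made up for illustration.

```java
// Illustrative only: mirrors the arithmetic of the Linux disk-usage rule.
final class DiskUsageExample {
    /**
     * used   = size - free
     * usable = avail + used
     * Returns used space as a percentage of the space usable by non-root users.
     */
    static double usedPercent(long sizeBytes, long freeBytes, long availBytes) {
        long used = sizeBytes - freeBytes;
        return used * 100.0 / (availBytes + used);
    }

    public static void main(String[] args) {
        // Hypothetical filesystem: 100 GiB total, 20 GiB free,
        // of which 15 GiB is available to non-root users.
        long gib = 1024L * 1024 * 1024;
        double pct = usedPercent(100 * gib, 20 * gib, 15 * gib);
        // The rule fires when the percentage is >= 80.
        System.out.println("used percent: " + pct + ", alert condition met: " + (pct >= 80));
    }
}
```

With these numbers a plain used/size ratio would be exactly 80%, while the rule's formula yields a higher value because the reserved blocks are left out of the denominator.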

Starting the Prometheus service

The contents of /usr/lib/systemd/system/prometheus.service are as follows:


[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/data/prometheus/prometheus --config.file /data/prometheus/prometheus.yml --storage.tsdb.path /data/prometheus/data --web.console.templates=/data/prometheus/consoles --web.console.libraries=/data/prometheus/console_libraries --web.enable-lifecycle
[Install]
WantedBy=multi-user.target
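
The ExecStart line passes --web.enable-lifecycle, which enables Prometheus's HTTP lifecycle endpoints. After editing prometheus.yml or alert.rules, a POST to /-/reload asks Prometheus to re-read its configuration without a restart. A minimal Java sketch follows; the class and method names are hypothetical, and the base URL is the Prometheus address used in the scrape configuration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: triggers a configuration reload through the lifecycle endpoint.
final class PrometheusReloadSketch {
    /** Builds the reload URL from a base URL such as http://192.168.1.7:9090. */
    static String reloadUrl(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return trimmed + "/-/reload";
    }

    /** Sends an empty POST and returns the HTTP status code. */
    static int reload(String baseUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(reloadUrl(baseUrl)).openConnection();
        conn.setRequestMethod("POST");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(30000);
        conn.setDoOutput(true);
        conn.setFixedLengthStreamingMode(0); // explicit empty request body
        try {
            conn.getOutputStream().close();
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String base = args.length > 0 ? args[0] : "http://192.168.1.7:9090";
        System.out.println("reload returned HTTP " + reload(base));
    }
}
```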

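Installing and configuring Alertmanager

Installing Alertmanager

Download the matching binary release archive into /data/alertmanager and extract it there.

Configuring Alertmanager

The contents of alertmanager.yml are as follows: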

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
  receiver: 'webhook'
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://<externally exposed address of the webhook server>'
        #message: '{{ template "wechat.default.message" . }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
templates:
  - '<install directory>/wechat.tmpl'
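
The inhibit_rules entry mutes a warning alert while a critical alert is firing, provided both carry the same values for every label in equal. Alertmanager treats a missing label like an empty one, so the dev label, which none of the rules in this article set, does not block a match. The matching logic can be sketched in Java as follows. This is a simplified model and not Alertmanager's code: it ignores whether the source alert is currently firing, and the alert name DiskUsage is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the inhibit rule in alertmanager.yml.
final class InhibitRuleSketch {
    // Label names taken from the `equal` list of the rule.
    static final List<String> EQUAL = Arrays.asList("alertname", "dev", "instance");

    /** A missing label is treated the same as an empty value. */
    static String label(Map<String, String> alert, String name) {
        String v = alert.get(name);
        return v == null ? "" : v;
    }

    /** True if `source` would mute `target` under the rule. */
    static boolean inhibits(Map<String, String> source, Map<String, String> target) {
        if (!"critical".equals(label(source, "severity"))) return false; // source_match
        if (!"warning".equals(label(target, "severity"))) return false;  // target_match
        for (String name : EQUAL) {
            if (!label(source, name).equals(label(target, name))) return false;
        }
        return true;
    }

    /** Convenience constructor for a label set (hypothetical helper). */
    static Map<String, String> alert(String name, String instance, String severity) {
        Map<String, String> m = new HashMap<>();
        m.put("alertname", name);
        m.put("instance", instance);
        m.put("severity", severity);
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> critical = alert("DiskUsage", "192.168.1.9:9100", "critical");
        Map<String, String> sameHost = alert("DiskUsage", "192.168.1.9:9100", "warning");
        Map<String, String> otherHost = alert("DiskUsage", "192.168.1.8:9100", "warning");
        System.out.println(inhibits(critical, sameHost));  // true
        System.out.println(inhibits(critical, otherHost)); // false
    }
}
```

Because alertname is part of equal, the rule only takes effect when a warning and a critical variant of the same alert name exist. The rules in alert.rules give each alert name a single severity, so the inhibition would only start to matter once such paired alerts are added.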

Starting Alertmanager

Start the service with systemctl.

The contents of /usr/lib/systemd/system/alertmanager.service are as follows:

[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
ExecStart=/data/alertmanager/alertmanager --config.file=/data/alertmanager/alertmanager.yml --storage.path=/data/alertmanager/data --log.level=info --log.format=json
WorkingDirectory=/data/alertmanager/
Restart=on-failure
[Install]
WantedBy=multi-user.target

JSON payload Alertmanager sends to the webhook

{
  "receiver": "webhook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "CPU usage",
        "instance": "server1",
        "job": "node_exporter"
      },
      "annotations": {
        "summary": "CPU usage exceeds threshold"
      },
      "startsAt": "2021-01-01T00:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost:9090/graph?g0.expr=100%20-%20(avg%20by%20(instance)%20(irate(node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D))%20*%20100)&g0.tab=1"
    }
  ],
  "groupLabels": {
    "alertname": "CPU usage"
  },
  "commonLabels": {
    "alertname": "CPU usage",
    "instance": "server1",
    "job": "node_exporter"
  },
  "commonAnnotations": {
    "summary": "CPU usage exceeds threshold"
  },
  "externalURL": "http://localhost:9093",
  "version": "4"
}
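
Alertmanager delivers this payload as the body of an HTTP POST to the url configured under webhook_configs. The sketch below shows one way to accept that request using only the HTTP server bundled with the JDK. It is not the article's actual service: the class name, port 8080, and the /alert path are assumptions, and the handler only prints the body instead of parsing it.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch only: a minimal endpoint that accepts Alertmanager webhook POSTs.
final class WebhookReceiverSketch {
    /** Reads the whole request body as UTF-8 text (works on Java 8). */
    static String readBody(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Starts an HTTP server that accepts POST requests on the given path. */
    static HttpServer start(int port, String path) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext(path, exchange -> {
            try {
                if (!"POST".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                    return;
                }
                String json = readBody(exchange.getRequestBody());
                System.out.println("received alert payload: " + json);
                // A real service would parse the JSON here and forward a message.
                exchange.sendResponseHeaders(200, -1); // -1 means no response body
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080, "/alert"); // hypothetical port and path
    }
}
```

With this sketch running, the url in alertmanager.yml would point at http://<host>:8080/alert.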

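Implementing the webhook service in a programming language of your choice

Here the webhook service is implemented in Java. The Java web service has two main jobs:

  • Parse the JSON payload.
  • Send the parsed content to a WeCom (Enterprise WeChat) group robot, which amounts to implementing an HTTP client. Many examples of this are available online.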
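
The second job can be sketched as follows. The code assumes the alert fields have already been extracted from the payload with a JSON library of your choice, builds a WeCom group robot text message, and posts it with HttpURLConnection. The class and method names are hypothetical, and the robot URL (of the form https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<your key>) has to be supplied by the reader. This is an illustration, not the article's actual implementation.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch only: formats an alert and posts it to a WeCom group robot.
final class WeComNotifierSketch {
    /** Escapes a string so it can be embedded in a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the request body of a WeCom robot text message. */
    static String buildTextMessage(String status, String alertName, String instance, String summary) {
        String content = "[" + status + "] " + alertName + "\n"
                + "instance: " + instance + "\n"
                + "summary: " + summary;
        return "{\"msgtype\":\"text\",\"text\":{\"content\":\"" + jsonEscape(content) + "\"}}";
    }

    /** POSTs the JSON body to the robot URL and returns the HTTP status code. */
    static int send(String robotUrl, String jsonBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(robotUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        conn.setRequestProperty("Content-Type", "application/json; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(jsonBody.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Values taken from the sample payload.
        String body = buildTextMessage("firing", "CPU usage", "server1", "CPU usage exceeds threshold");
        System.out.println(body);
        // Only send when a robot URL is passed as the first argument.
        if (args.length > 0) {
            System.out.println("HTTP " + send(args[0], body));
        }
    }
}
```

Escaping the content by hand keeps the sketch free of dependencies. A service that already uses a JSON library for parsing would normally use the same library to serialize the message.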

Installing the WeCom webhook service

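The WeCom webhook service is installed under /data/webhook-qywx, the path used in the wrapper configuration below, which refers to app, lib, conf, logs, and lang directories.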

Wrapper startup configuration for the WeCom webhook service

#encoding=UTF-8
# Configuration files must begin with a line specifying the encoding
# of the file.
#
# NOTE - Please use src/conf/wrapper.conf.in as a template for your
# own application rather than the values used for the
# TestWrapper sample.

#********************************************************************
# Wrapper License Properties (Ignored by Community Edition)
#********************************************************************
# Professional and Standard Editions of the Wrapper require a valid
# License Key to start. Licenses can be purchased or a trial license
# requested on the following pages:
# http://wrapper.tanukisoftware.com/purchase
# http://wrapper.tanukisoftware.com/trial

# Include file problems can be debugged by leaving only one '#'
# at the beginning of the following line:
##include.debug

# The Wrapper will look for either of the following optional files for a
# valid License Key. License Key properties can optionally be included
# directly in this configuration file.
#include ../conf/wrapper-license.conf
#include ../conf/wrapper-license-%WRAPPER_HOST_NAME%.conf

# The following property will output information about which License Key(s)
# are being found, and can aid in resolving any licensing problems.
#wrapper.license.debug=TRUE

#********************************************************************
# Wrapper Localization
#********************************************************************
# Specify the language and locale which the Wrapper should use.
#wrapper.lang=en_US # en_US or ja_JP

# Specify the location of the language resource files (*.mo).
wrapper.lang.folder=../lang

#********************************************************************
# Wrapper Java Properties
#********************************************************************
# Java Application
# Locate the java binary on the system PATH:
wrapper.java.command=/usr/lib/jdk1.8.0_161/bin/java
# Specify a specific java binary:
#set.JAVA_HOME=/java/path
#wrapper.java.command=%JAVA_HOME%/bin/java

# Tell the Wrapper to log the full generated Java command line.
#wrapper.java.command.loglevel=INFO

# Java Main class. This class must implement the WrapperListener interface
# or guarantee that the WrapperManager class is initialized. Helper
# classes are provided to do this for you.
# See the following page for details:
# http://wrapper.tanukisoftware.com/doc/english/integrate.html
#wrapper.java.mainclass=org.tanukisoftware.wrapper.test.Main
wrapper.java.mainclass=org.tanukisoftware.wrapper.WrapperJarApp
# Log level for notices about missing Java Classpath entries.
wrapper.java.classpath.missing.loglevel=WARN

# Java Classpath (include wrapper.jar) Add class path elements as
# needed starting from 1
#wrapper.java.classpath.1=../lib/wrappertest.jar
#wrapper.java.classpath.2=../lib/wrapper.jar

wrapper.java.classpath.1=../app/*.jar
wrapper.java.classpath.2=../lib/*.jar
wrapper.java.classpath.3=../conf/

# Java Library Path (location of Wrapper.DLL or libwrapper.so)
wrapper.java.library.path.1=../lib

# Java Bits. On applicable platforms, tells the JVM to run in 32 or 64-bit mode.
wrapper.java.additional.auto_bits=TRUE

# Java Additional Parameters
wrapper.java.additional.1=

# Initial Java Heap Size (in MB)
#wrapper.java.initmemory=3

# Maximum Java Heap Size (in MB)
#wrapper.java.maxmemory=64

# Application parameters. Add parameters as needed starting from 1
#wrapper.app.parameter.1=
#wrapper.app.parameter.1=../app/promethues-webhook-qywx-0.0.1-SNAPSHOT.jar
wrapper.app.parameter.1=/data/webhook-qywx/app/promethues-webhook-qywx-0.0.1-SNAPSHOT.jar

#********************************************************************
# Wrapper Logging Properties
#********************************************************************
# Enables Debug output from the Wrapper.
# wrapper.debug=TRUE

# Format of output for the console. (See docs for formats)
wrapper.console.format=PM

# Log Level for console output. (See docs for log levels)
wrapper.console.loglevel=INFO

# Log file to use for wrapper output logging.
wrapper.logfile=../logs/wrapper.log

# Format of output for the log file. (See docs for formats)
wrapper.logfile.format=LPTM

# Log Level for log file output. (See docs for log levels)
wrapper.logfile.loglevel=INFO

# Roll mode of the log file.
# SIZE_OR_WRAPPER causes the file to be rolled whenever its size exceeds the
# value defined by wrapper.logfile.maxsize, or whenever the Wrapper is
# launched.
wrapper.logfile.rollmode=SIZE_OR_WRAPPER

# Maximum size that the log file will be allowed to grow to before
# the log is rolled. Size is specified in bytes. The default value
# of 0, disables log rolling. May abbreviate with the 'k' (kb) or
# 'm' (mb) suffix. For example: 10m = 10 megabytes.
wrapper.logfile.maxsize=10m

# Maximum number of rolled log files which will be allowed before old
# files are deleted. The default value of 0 implies no limit.
wrapper.logfile.maxfiles=9

# Log Level for sys/event log output. (See docs for log levels)
wrapper.syslog.loglevel=NONE

#********************************************************************
# Wrapper General Properties
#********************************************************************
# Allow for the use of non-contiguous numbered properties
wrapper.ignore_sequence_gaps=TRUE

# Do not start if the pid file already exists.
wrapper.pidfile.strict=TRUE

# Title to use when running as a console
wrapper.console.title=qywx-webhook

#********************************************************************
# Wrapper JVM Checks
#********************************************************************
# Detect DeadLocked Threads in the JVM. (Requires Standard Edition)
wrapper.check.deadlock=TRUE
wrapper.check.deadlock.interval=10
wrapper.check.deadlock.action=RESTART
wrapper.check.deadlock.output=FULL

# Out Of Memory detection.
# (Ignore output from dumping the configuration to the console. This is only needed by the TestWrapper sample application.)
wrapper.filter.trigger.999=wrapper.filter.trigger.*java.lang.OutOfMemoryError
wrapper.filter.allow_wildcards.999=TRUE
wrapper.filter.action.999=NONE
# Ignore -verbose:class output to avoid false positives.
wrapper.filter.trigger.1000=[Loaded java.lang.OutOfMemoryError
wrapper.filter.action.1000=NONE
# (Simple match)
wrapper.filter.trigger.1001=java.lang.OutOfMemoryError
# (Only match text in stack traces if -XX:+PrintClassHistogram is being used.)
#wrapper.filter.trigger.1001=Exception in thread "*" java.lang.OutOfMemoryError
#wrapper.filter.allow_wildcards.1001=TRUE
wrapper.filter.action.1001=RESTART
wrapper.filter.message.1001=The JVM has run out of memory.

#********************************************************************
# Wrapper Email Notifications. (Requires Professional Edition)
#********************************************************************
# Common Event Email settings.
#wrapper.event.default.email.debug=TRUE
#wrapper.event.default.email.smtp.host=<SMTP_Host>
#wrapper.event.default.email.smtp.port=25
#wrapper.event.default.email.subject=[%WRAPPER_HOSTNAME%:%WRAPPER_NAME%:%WRAPPER_EVENT_NAME%] Event Notification
#wrapper.event.default.email.sender=<Sender email>
#wrapper.event.default.email.recipient=<Recipient email>

# Configure the log attached to event emails.
#wrapper.event.default.email.maillog=ATTACHMENT
#wrapper.event.default.email.maillog.lines=50
#wrapper.event.default.email.maillog.format=LPTM
#wrapper.event.default.email.maillog.loglevel=INFO

# Enable specific event emails.
#wrapper.event.wrapper_start.email=TRUE
#wrapper.event.jvm_prelaunch.email=TRUE
#wrapper.event.jvm_start.email=TRUE
#wrapper.event.jvm_started.email=TRUE
#wrapper.event.jvm_deadlock.email=TRUE
#wrapper.event.jvm_stop.email=TRUE
#wrapper.event.jvm_stopped.email=TRUE
#wrapper.event.jvm_restart.email=TRUE
#wrapper.event.jvm_failed_invocation.email=TRUE
#wrapper.event.jvm_max_failed_invocations.email=TRUE
#wrapper.event.jvm_kill.email=TRUE
#wrapper.event.jvm_killed.email=TRUE
#wrapper.event.jvm_unexpected_exit.email=TRUE
#wrapper.event.wrapper_stop.email=TRUE

# Specify custom mail content
wrapper.event.jvm_restart.email.body=The JVM was restarted.\n\nPlease check on its status.\n

#********************************************************************
# Wrapper Windows Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
# using this configuration file has been installed as a service.
# Please uninstall the service before modifying this section. The
# service can then be reinstalled.

# Name of the service
wrapper.name=testwrapper

# Display name of the service
wrapper.displayname=qywx-webhook

# Description of the service
wrapper.description=qywx-webhook

# Service dependencies. Add dependencies as needed starting from 1
wrapper.ntservice.dependency.1=

# Mode in which the service is installed.
# AUTO_START, DELAY_START (Requires Standard Edition) or DEMAND_START
wrapper.ntservice.starttype=AUTO_START

# Allow the service to interact with the desktop (Windows NT/2000/XP only).
wrapper.ntservice.interactive=FALSE

# Allow the current user to perform certain actions without being prompted for
# administrator credentials. (Requires Professional Edition)
#wrapper.ntservice.permissions.1.account=CURRENT_USER
#wrapper.ntservice.permissions.1.allow=START, STOP, PAUSE_RESUME
