#!/bin/bash

# Configuration
CRITICAL_THRESHOLD=80    # critical alert threshold (%)
WARNING_THRESHOLD=50     # warning threshold (%)
CHECK_INTERVAL=60        # check interval (seconds)
MAIL="example@mail.com"
LOG_FILE="/var/log/cpu_monitor.log"
ALERT_HISTORY="/var/log/cpu_alert_history.log"

# Initialize log files
touch "$LOG_FILE" "$ALERT_HISTORY"
chmod 644 "$LOG_FILE" "$ALERT_HISTORY"

# Logging helper: append a timestamped message to the log file
log() {
    echo "[$(date '+%F %T')] $1" >> "$LOG_FILE"
}
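
# Collect system info (works across distributions)
get_system_info() {
    DATE=$(date '+%F %T')
    # Compatible with CentOS 6/7/8 and Ubuntu
    IP=$(hostname -I 2>/dev/null | awk '{print $1}')
    if [ -z "$IP" ]; then
        # Older net-tools prints "inet addr:1.2.3.4", so strip the "addr:" prefix
        IP=$(ifconfig | grep 'inet ' | grep -v '127.0.0.1' | head -1 | awk '{sub(/^addr:/, "", $2); print $2}')
    fi
    HOSTNAME=$(hostname)
}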

# Check required commands
check_dependencies() {
    if ! command -v vmstat &>/dev/null; then
        log "Error: vmstat command not found; please install the procps package"
        echo "Error: vmstat command not found. Please install procps package." >&2
        exit 1
    fi
}

# Collect detailed CPU metrics
get_cpu_metrics() {
    # Use the second sample of "vmstat 1 2"; the first report shows
    # averages since boot rather than current load
    VMSTAT_OUTPUT=$(vmstat 1 2 | tail -1)
    US=$(echo "$VMSTAT_OUTPUT" | awk '{print $13}')    # user CPU
    SY=$(echo "$VMSTAT_OUTPUT" | awk '{print $14}')    # system CPU
    IDLE=$(echo "$VMSTAT_OUTPUT" | awk '{print $15}')  # idle CPU
    WAIT=$(echo "$VMSTAT_OUTPUT" | awk '{print $16}')  # IO wait CPU
    USE=$((US + SY))                                   # total usage
}
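
# Check whether alerts are still in the cooldown period
# $1: current CPU usage (%); returns 0 if in cooldown, 1 otherwise
is_in_cooldown() {
    local current_time last_alert time_diff cooldown
    current_time=$(date +%s)
    last_alert=$(tail -1 "$ALERT_HISTORY" 2>/dev/null | awk '{print $1}')
    if [ -n "$last_alert" ]; then
        time_diff=$((current_time - last_alert))
        # Critical alerts cool down for 30 minutes, warnings for 2 hours.
        # The cooldown is chosen first so that a critical alert is not
        # caught by the 2-hour warning window.
        if [ "$1" -ge "$CRITICAL_THRESHOLD" ]; then
            cooldown=1800
        else
            cooldown=7200
        fi
        if [ "$time_diff" -lt "$cooldown" ]; then
            return 0
        fi
    fi
    return 1
}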

# Record an alert (epoch timestamp followed by CPU usage)
record_alert() {
    echo "$(date +%s) $1" >> "$ALERT_HISTORY"
}

# Send an alert email
send_alert() {
    local level=$1
    local subject="CPU monitor alert [$level]: $HOSTNAME($IP)"
    local body="
Abnormal CPU usage on server, please investigate!
Time: $DATE
Host: $HOSTNAME($IP)
CPU usage: $USE% (user: $US% system: $SY%)
Idle CPU: $IDLE%
IO wait: $WAIT%
Thresholds: warning=$WARNING_THRESHOLD% critical=$CRITICAL_THRESHOLD%
"
    if ! is_in_cooldown "$USE"; then
        echo "$body" | mail -s "$subject" "$MAIL"
        record_alert "$USE"
        log "Sent $level alert email to $MAIL"
    else
        log "$level alert is in cooldown; email not sent"
    fi
}

# Main monitoring logic
main() {
    check_dependencies
    get_system_info
    get_cpu_metrics
    # Log routine status
    log "CPU status - total usage: $USE% (user: $US% system: $SY%) idle: $IDLE% IO wait: $WAIT%"
    # Determine alert level
    if [ "$USE" -ge "$CRITICAL_THRESHOLD" ]; then
        log "CPU critically overloaded: $USE% >= $CRITICAL_THRESHOLD%"
        send_alert "CRITICAL"
    elif [ "$USE" -ge "$WARNING_THRESHOLD" ]; then
        log "CPU warning: $USE% >= $WARNING_THRESHOLD%"
        send_alert "WARNING"
    fi
}

# Run mode: single execution for cron jobs, or continuous loop
if [ "$1" = "cron" ]; then
    main
else
    while true; do
        main
        sleep "$CHECK_INTERVAL"
    done
fi
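
# Example crontab entry for cron mode. This is only a sketch: the install
# path /usr/local/bin/cpu_monitor.sh is a hypothetical location, not taken
# from the script itself. Running once per minute matches CHECK_INTERVAL=60,
# and the entry likely belongs in root's crontab because the script writes
# under /var/log.
#   * * * * * /usr/local/bin/cpu_monitor.sh cron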