Last updated: 2024-04-17
Table of Contents
- Start Options
- Command
- Config file
- Service Poll Time (Poll cycle)
- My Settings
- Web-Interface
- Monit Action
- Available checks (CHECK X)
  System resource (CPU, Memory)
  Disk Usage & I/O (filesystem)
  Monitoring a directory
  Monitoring files
  Repetition counts (WITHIN, FOR)
- Monit's Limits
- Cheat Sheet
- Doc
Introduction
Program:
LSB executable
Monit can
- start a process if it does not run
- restart a process if it does not respond
- stop a process if it uses too many resources
- execute meaningful causal actions in error situations
Monit can also monitor
- files, directories and filesystems for changes (timestamp, checksum, permission and size changes)
- TCP/IP network services
Version:
monit -V
This is Monit version 5.3.2
Start Options
- -d n Run Monit as a daemon once per n seconds. (poll cycle) [Setting: set daemon n]
- -s statefile Write state information to this file. (A service's monitoring state persists across Monit restarts)
- -l logfile Write log information to this file.
Command
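- monitor <name | all>
Enable monitoring of the named service, or of all services listed in the control file.
- unmonitor <name | all>
Disable monitoring of the named service, or of all services listed in the control file.
- status
Shows the same data as the web panel (so it only works when the web interface is enabled!)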
monit status
The Monit daemon 5.3.2 uptime: 0m

System 'system_status'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.51] [0.77] [0.81]
  cpu                               16.6%us 8.6%sy 1.0%wa
  memory usage                      4662296 kB [57.6%]
  swap usage                        577540 kB [4.7%]
  data collected                    Wed, 15 Jan 2014 16:04:15

Process 'ssh'
  status                            Running
  monitoring status                 Monitored
  pid                               20952
  parent pid                        1
  uptime                            6d 23h 54m
  children                          3
  memory kilobytes                  2424
  memory kilobytes total            8824
  memory percent                    0.0%
  memory percent total              0.1%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 15 Jan 2014 16:04:15
...
- summary
# Less detailed than status
monit summary
Monit 5.25.1 uptime: 26m
┌─────────────────────────────────┬────────────────────────────┬───────────────┐
│ Service Name                    │ Status                     │ Type          │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ myserver                        │ OK                         │ System        │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ public                          │ OK                         │ Network       │
└─────────────────────────────────┴────────────────────────────┴───────────────┘
- reload # The daemon will reread its configuration, close and reopen log files.
- quit # Kill the Monit daemon process
- validate # Check all services listed in the control file (equivalent to running one cycle by hand)
# Default behavior when Monit runs in daemon mode
... info : Awakened by User defined signal 1
... info : Monit daemon with PID 536802 awakened
- start / stop / restart
start <all | name>
stop <all | name>
restart <all | name>
e.g.
# Start all services listed in the control file and enable monitoring for them.
monit start all
Config File
Use an alternative config file:
monit -c /var/monit/monitrc
Default config file:
# If this file does not exist, Monit will try /etc/monitrc
~/.monitrc
Config file used by the service (service monit start):
/etc/monitrc
... include /etc/monit.d/*.cfg
Test Config File Syntax
monit -t
output:
Control file syntax OK
Global Setting
# Monit's poll cycle
# Monit detaches from the console, goes to sleep for the given poll interval,
# wakes up and starts monitoring again in an endless cycle
set daemon 20
    start delay 120

# Log setting
set logfile /var/log/monit.log

# Unique id for the Monit instance
set idfile /var/lib/monit/id

# Save monitoring states on each cycle
set statefile /var/lib/monit/state

# Multiple mail servers may be specified using a comma separator
# (if the first mail server fails, Monit will use the second one in the list)
set mailserver smtpgw1,
               smtpgw2 port 1025,
               localhost              # fallback relay

# By default, the event queue is disabled and if the alert handler fails,
# Monit will simply drop the alert message.
set eventqueue
    basedir /var/lib/monit/events     # set the base directory where events will be stored
    slots 100                         # optionally limit the queue size

# Mail format
set mail-format { from: [email protected] }

# Several people can receive alerts, one recipient per line
set alert [email protected]
set alert [email protected]

# Built-in mini-httpd server
set httpd port 2812 and
    use address localhost
    allow localhost
    allow admin:ClearTextPW

include /etc/monit.d/*.conf
Config file permission:
# These permissions are required (Monit refuses a control file with permissions looser than 0700)
chown root:root /etc/monit.conf
chmod 600 /etc/monit.conf
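The permission rule can be checked before starting Monit. Below is a minimal Python sketch. It assumes the rule is "no permission bits for group or others", which is what rejecting anything looser than 0700 amounts to. The helper name `control_file_mode_ok` is hypothetical, and the demo uses a throwaway temporary file rather than /etc/monit.conf:

```python
import os
import stat
import tempfile


def control_file_mode_ok(path: str) -> bool:
    """Hypothetical helper: True if `path` grants nothing to group/others.

    Any bit in 0o077 makes the mode looser than 0700, which Monit
    rejects for its control file.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0


if __name__ == "__main__":
    # Demo on a throwaway file, not the real /etc/monit.conf.
    with tempfile.NamedTemporaryFile() as tmp:
        for mode in (0o600, 0o644):
            os.chmod(tmp.name, mode)
            print(oct(mode), control_file_mode_ok(tmp.name))
```

The demo prints `0o600 True` followed by `0o644 False`.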
with start delay
How long (in seconds) to wait before monitoring starts.
During this period, monit status does not work:
Cannot create socket to [localhost]:2812 -- Connection refused
Log
log to file
set log /var/log/monit.log
log to syslog
set log syslog
Service Poll Time (Poll cycle)
There are two ways:
- EVERY [number] CYCLES
- EVERY [cron]
Example:
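# cycles
# With "set daemon N", this check runs once every 2N seconds
check process nginx with pidfile /var/run/nginx.pid
    every 2 cycles

# cron (use one form or the other; service names must be unique)
# cron job format: every minute from 8 a.m. to 7:59 p.m., Monday to Friday
check process nginx with pidfile /var/run/nginx.pid
    every "* 8-19 * * 1-5"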
P.S.
It is strongly recommended to use an asterisk in the minute field, or at minimum a range.
* Never use a specific minute, as Monit may not run on that minute.
Because Monit's scheduler polls services serially, the exact execution time is not guaranteed.
e.g.
# a range
0-15
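To see why a fixed minute is risky, here is a small Python sketch that evaluates a 5-field cron spec against the moments Monit actually wakes up. It is a simplified matcher, not Monit's scheduler: it has no `*/n` steps and no names, and it ANDs day-of-month with day-of-week. The 70-second cycle in the demo is a hypothetical value, as are the helper names:

```python
from datetime import datetime, timedelta


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field: '*', 'N', 'A-B', or a comma list of those."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(spec: str, when: datetime) -> bool:
    """Hypothetical helper: True if `when` falls inside the 5-field `spec`.

    Fields: minute hour day-of-month month day-of-week (0 = Sunday).
    Simplified: classic cron ORs day-of-month and day-of-week when both
    are restricted; this sketch ANDs them.
    """
    minute, hour, dom, month, dow = spec.split()
    cron_dow = (when.weekday() + 1) % 7  # Python Monday=0 -> cron Monday=1
    return (_field_matches(minute, when.minute)
            and _field_matches(hour, when.hour)
            and _field_matches(dom, when.day)
            and _field_matches(month, when.month)
            and _field_matches(dow, cron_dow))


if __name__ == "__main__":
    # Hypothetical: each cycle takes 70 s (poll interval plus check time),
    # starting just before 08:30 on Wednesday 2024-04-17.
    start = datetime(2024, 4, 17, 8, 29, 55)
    wakeups = [start + timedelta(seconds=70 * i) for i in range(3)]
    print([t.strftime("%H:%M:%S") for t in wakeups])
    print("fixed minute:", any(cron_matches("30 8 * * *", t) for t in wakeups))
    print("minute range:", any(cron_matches("29-31 8 * * *", t) for t in wakeups))
```

With wakeups at 08:29:55, 08:31:05 and 08:32:15, no poll lands in minute 30. So `"30 8 * * *"` never fires, while the range `29-31` does.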
My Settings
CentOS 6
/etc/logrotate.d/monit
/var/log/monit.log {
    missingok
    notifempty
    size 100k
    create 0644 root root
    postrotate
        # /bin/systemctl reload monit.service > /dev/null 2>&1 || :
        /sbin/service monit condrestart > /dev/null 2>&1 || :
    endscript
}
Run the following commands first:
mkdir /var/lib/monit
rm /etc/monit.d/logging
/etc/monit.conf
# My Basic Config
set daemon 10
start delay 60
set logfile /var/log/monit.log
set idfile /var/lib/monit/id
set statefile /var/lib/monit/state
set mailserver localhost
set alert xxx@xxx
set httpd port 2812 and
use address localhost
allow localhost
include /etc/monit.d/*.conf
RHEL 8
set daemon 10
    start delay 60
set idfile /var/lib/monit/id
set statefile /var/lib/monit/state
set mailserver localhost
set alert [email protected]
set httpd unixsocket /var/run/monit.sock
    permission 600
    allow localhost

check system $HOST
    if loadavg (1min) per core > 2 for 5 cycles then alert
    if loadavg (5min) per core > 1.5 for 10 cycles then alert
    if cpu usage > 95% for 10 cycles then alert
    if memory usage > 80% then alert
    if swap usage > 80% then alert

include /etc/monit.d/*
Notes
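* No need to add "set log syslog", because /etc/monit.d/logging already does this
* Create the directory /var/lib/monit to hold the idfile and statefile; otherwise they end up in /root/{.monit.id,.monit.state}
* "set httpd unixsocket" only works when "allow localhost" is also added
* The include uses * rather than *.conf, because /etc/monit.d/logging is recreated after every upgrade anyway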
Web-Interface
Default: disabled.
Default TCP port 127.0.0.1:2812
* If security is a concern, bind the HTTP interface to localhost / Unix Socket
Monit HTTPD Authentication:
set httpd port 2812
    allow localhost
    allow 10.1.1.1
    allow 192.168.1.0/255.255.255.0
    allow 10.0.0.0/8
    allow myuser:mypassword
    allow md5 /etc/httpd/htpasswd john paul ringo george
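* The allow entries combine with "AND": a client must match one of the host/network entries and, if user entries exist, must also authenticate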
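Here is a simplified Python model of how the allow entries above might combine. It assumes the client address must match at least one host/network entry and, when user entries are configured, the client must also present valid credentials. `request_allowed`, `ALLOWED_NETWORKS` and `USERS` are hypothetical names. The htpasswd entry is not modeled:

```python
import ipaddress

# Hypothetical model of the allow entries above (htpasswd entry omitted).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("127.0.0.1/32"),               # allow localhost
    ipaddress.ip_network("10.1.1.1/32"),                # allow 10.1.1.1
    ipaddress.ip_network("192.168.1.0/255.255.255.0"),  # netmask notation
    ipaddress.ip_network("10.0.0.0/8"),                 # CIDR notation
]
USERS = {"myuser": "mypassword"}                        # allow myuser:mypassword


def request_allowed(client_ip, user=None, password=None):
    """Hypothetical check: any host/network entry may match, then auth."""
    addr = ipaddress.ip_address(client_ip)
    if not any(addr in net for net in ALLOWED_NETWORKS):
        return False                      # address matches no allow entry
    if USERS:                             # credentials configured: required
        return user in USERS and USERS[user] == password
    return True


if __name__ == "__main__":
    print(request_allowed("192.168.1.77", "myuser", "mypassword"))  # True
    print(request_allowed("192.168.1.77"))                          # False
    print(request_allowed("172.16.0.5", "myuser", "mypassword"))    # False
```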
UNIX SOCKET
SET HTTPD UNIXSOCKET <path> [UID <uid | username>] [GID <gid | groupname>] [PERMISSION <octal number>] ...
UID, GID
optional, defaults to the user who executes Monit
PERMISSION
optional, absolute octal mode
e.g.
set httpd unixsocket /var/run/monit.sock permission 600 allow localhost
signature
hide Monit version
set httpd port 2812 signature disable
Monit Action
Available actions
IF <TEST> THEN ACTION
ACTION:
- alert
- restart # restarts the service and sends an alert
- start
- stop
- exec # EXEC can be used to execute an arbitrary program and send an alert.
- unmonitor
EXEC (key points: repeat, as)
repeat
The program is executed only once if the test fails.
You can enable execute repetition if the error persists for a given number of cycles:
# When each cycle is 30 seconds (set daemon 30), the setting below repeats every 5 minutes.
if failed <test> then exec "/usr/local/bin/sms.sh"
as uid "nobody" and gid "nobody"
repeat every 10 cycles
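The cycle-to-time conversion in these comments is plain multiplication. A tiny Python helper with a hypothetical name makes the arithmetic explicit. The result is a lower bound, because the time Monit spends running the checks is added to each sleep interval:

```python
def cycles_to_seconds(cycles: int, poll_interval: int) -> int:
    """Hypothetical helper: minimum wall-clock time of `cycles` poll cycles.

    `poll_interval` is the N in "set daemon N". Real cycles are longer,
    because time spent running the checks is added to each sleep.
    """
    return cycles * poll_interval


if __name__ == "__main__":
    print(cycles_to_seconds(10, 30) / 60, "minutes")  # repeat every 10 cycles, set daemon 30
    print(cycles_to_seconds(2, 10), "seconds")        # every 2 cycles, set daemon 10
```

This prints `5.0 minutes` and `20 seconds`.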
as
# You may optionally specify the uid and gid
exec "/root/scripts/fixit.sh" as uid nobody and gid nobody
* If Monit is run by root, then all programs executed by Monit will be started with superuser privileges
Note: if it is a shell script, the script must start with a shebang line such as "#!/bin/bash"
Notes
ALERT uses "WITH REMINDER ON N CYCLES"; EXEC uses "REPEAT EVERY N CYCLES"
e.g.
CHECK HOST MyVPN ADDRESS 192.168.88.20
    ALERT [email protected] WITH REMINDER ON 2 CYCLES
    IF FAILED PING COUNT 5 THEN EXEC /home/fortivpn/vpn/start-vpn.sh
        AS uid fortivpn AND gid fortivpn
        REPEAT EVERY 2 CYCLES
Service Monitoring Mode
MODE < ACTIVE | PASSIVE >
- ACTIVE: raise alerts and restart the service # DEFAULT
- PASSIVE: raise alerts only
e.g.
# Monit will not try to (re)start this service if it is not running:
check process sybase with pidfile /var/run/sybase.pid
mode passive
start = "/etc/init.d/sybase start"
stop = "/etc/init.d/sybase stop"
Available checks (CHECK X)
- CHECK SYSTEM <UniqueName> # System resources
- CHECK FILESYSTEM <UniqueName> PATH <path> # Disk I/O and space usage
- CHECK PROCESS <UniqueName> <PIDFILE path | MATCHING regex>
- CHECK FILE <UniqueName> PATH <path>
- CHECK FIFO <UniqueName> PATH <path>
- CHECK DIRECTORY <UniqueName> PATH <path>
- CHECK HOST <UniqueName> ADDRESS <Address>
- CHECK PROGRAM <UniqueName> PATH <executable file> [TIMEOUT <number> SECONDS]
CHECK HOST <UniqueName> ADDRESS <Address>
UniqueName is a label; it is displayed when you run monit status
e.g.
Settings:
CHECK HOST localhost ADDRESS 127.0.0.1
Output:
Remote Host 'localhost'
status OK
monitoring status Monitored
...
The Address is used by the subsequent "IF" tests. It can be a domain name or an IP address.
e.g.
CHECK HOST MyVPN ADDRESS 192.168.88.20 IF FAILED PING COUNT 5 THEN EXEC /home/fortivpn/vpn/start-vpn.sh
For CHECK PROGRAM: if the program does not finish executing within <number> seconds,
Monit will terminate it. The default program timeout is 300 seconds.
The "status test" allows one to check the program's exit status.
IF STATUS operator value THEN action
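Below is a rough Python model of CHECK PROGRAM with a TIMEOUT and an IF STATUS test, built on `subprocess.run`. The helper name and the operator subset are assumptions of this sketch. So is treating a timeout as a failed test; none of these are documented Monit behavior:

```python
import subprocess
import sys

# Hypothetical subset of comparison operators for the status test.
OPERATORS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "=": lambda a, b: a == b,
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
}


def run_program_check(argv, op, value, timeout=300):
    """Hypothetical helper: run `argv`, evaluate IF STATUS <op> <value>.

    Returns (exit_status, test_fired). subprocess.run kills the child once
    `timeout` seconds pass; this sketch reports that as (None, True),
    i.e. it treats a timeout as a failed test (an assumption).
    """
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None, True
    return proc.returncode, OPERATORS[op](proc.returncode, value)


if __name__ == "__main__":
    exits_3 = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_program_check(exits_3, "!=", 0))  # (3, True)
```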
System resource (CPU, Memory)
To monitor general system resources such as CPU usage, total memory usage or load average.
If you use the variable $HOST as the name, it will expand to the hostname.
CPU
check system $HOST
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 75% for 6 cycles then alert
    if swap usage > 25% then alert
    if cpu usage > 95% for 10 cycles then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert
Memory usage
What does monit consider to be memory usage?
# On recent Monit versions (e.g. 5.25.x) the memory usage value accounts for the ZFS ARC cache
Code
si->memory.usage.bytes = systeminfo.memory.size - zfsarcsize - (uint64_t)(mem_free + buffers + cached + slabreclaimable) * 1024;
# Comparison
monit status | grep 'memory usage'
memory usage 2.5 GB [32.4%]
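grep -w -e MemTotal -e Buffers -e Cached -e MemFree -e Slab /proc/meminfo
MemTotal:        8060728 kB
MemFree:          375184 kB
Buffers:          652612 kB
Cached:          3048636 kB
Slab:            1472140 kB
Usage = MemTotal - MemFree - Buffers - Cached - Slab = 2512156 kB ≈ 2.40 GiB (31.2%)
Monit's code above subtracts only the reclaimable part of Slab (SReclaimable), so its figure can be somewhat higher.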
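The comparison above can be scripted. This Python sketch applies the quoted formula, minus the ZFS ARC term, to /proc/meminfo text. It uses SReclaimable when present, as Monit's code does. Otherwise it falls back to the whole Slab value, which is what the manual calculation above does. The function name is hypothetical:

```python
def memory_usage_kb(meminfo: str) -> int:
    """Hypothetical helper: used memory (kB), formula above without ZFS ARC.

    used = MemTotal - (MemFree + Buffers + Cached + SReclaimable)
    Falls back to the whole Slab value if SReclaimable is absent.
    """
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    reclaimable = fields.get("SReclaimable", fields.get("Slab", 0))
    return fields["MemTotal"] - (
        fields["MemFree"] + fields["Buffers"] + fields["Cached"] + reclaimable
    )


SAMPLE = """MemTotal:        8060728 kB
MemFree:          375184 kB
Buffers:          652612 kB
Cached:          3048636 kB
Slab:            1472140 kB"""

if __name__ == "__main__":
    used = memory_usage_kb(SAMPLE)
    print(used, "kB", f"{used / 1024 ** 2:.2f} GiB", f"{used / 8060728:.1%}")
    # On a live Linux box: memory_usage_kb(open("/proc/meminfo").read())
```

For the sample this prints `2512156 kB 2.40 GiB 31.2%`.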
Network
* Unit: bytes
# Download/upload rate above 5 MB/s for 20 consecutive cycles (10 minutes when each cycle is 30 s)
# More than 4 GB uploaded within the last hour
check network public with interface eth0
    if failed link then alert
    if changed link then alert
    if download > 5 MB/s for 20 cycles then alert
    if upload > 5 MB/s for 20 cycles then alert
    if total uploaded > 4 GB in last hour then alert
Disk I/O (filesystem)
Monit will normally need to run as the root user to access these metrics.
# Unit: "B","KB","MB","GB"
e.g.
check filesystem datafs with path /dev/sda1
    # Usage
    if space usage > 90% then alert
    if inode usage > 90% then alert
    # I/O
    if read rate > 10 MB/s for 5 cycles then alert
    if read rate > 500 operations/s for 5 cycles then alert
    if write rate > 10 MB/s for 5 cycles then alert
    if write rate > 500 operations/s for 5 cycles then alert
    if service time > 10 milliseconds for 3 times within 5 cycles then alert
* Per-process I/O activity statistics by platform: Byte
Service time per operation
Service Time is the time taken to complete a read or a write operation.
If it grows, it means that the disk is not able to handle the operations fast enough.
# Unit is "ms" (millisecond) or "s" (second)
if service time > 10 milliseconds for 3 times within 5 cycles then alert
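Service time is essentially elapsed I/O time divided by the number of completed operations. This Python sketch computes that average from two /proc/diskstats samples, which are Linux kernel counters. It illustrates the idea and does not claim to reproduce Monit's exact calculation. The function name and the sample lines are hypothetical:

```python
def avg_service_time_ms(before: str, after: str) -> dict:
    """Hypothetical helper: average ms per completed read and write.

    Takes two /proc/diskstats lines for the same device. 0-based columns
    of the split line: 0 major, 1 minor, 2 device name, 3 reads completed,
    6 ms spent reading, 7 writes completed, 10 ms spent writing.
    """
    b, a = before.split(), after.split()
    result = {}
    for kind, ops_col, ms_col in (("read", 3, 6), ("write", 7, 10)):
        ops = int(a[ops_col]) - int(b[ops_col])
        ms = int(a[ms_col]) - int(b[ms_col])
        result[kind] = ms / ops if ops else 0.0  # no I/O -> report 0
    return result


if __name__ == "__main__":
    # Hypothetical samples taken one poll cycle apart.
    t0 = "8 0 sda 100 0 800 50 200 0 1600 400 0 450 450"
    t1 = "8 0 sda 110 0 880 100 220 0 1760 600 0 700 700"
    print(avg_service_time_ms(t0, t1))  # {'read': 5.0, 'write': 10.0}
```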
Monitoring a directory
check directory bin with path /bin
    if failed permission 755 then unmonitor
    if failed uid 0 then unmonitor
    if failed gid 0 then unmonitor
Monitoring files
TEST:
- IF FAILED [MD5|SHA1] CHECKSUM [EXPECT checksum] THEN action
- IF CHANGED [MD5|SHA1] CHECKSUM THEN action
- IF TIMESTAMP [[operator] value [unit]] THEN action
- IF CHANGED TIMESTAMP THEN action
- IF [DOES] NOT EXIST THEN action
- IF SIZE [[operator] value [unit]] THEN action
- IF CHANGED SIZE THEN action
Changed checksum
if failed checksum expect 8f7f419955cefa0b33a2ba316cba3659 then alert
OR
if changed checksum then exec "/usr/bin/apachectl graceful"
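Both checksum tests come down to hashing the file on each cycle. Below is a minimal Python sketch using hashlib. `failed_checksum` mirrors IF FAILED MD5 CHECKSUM EXPECT, and `ChecksumWatcher` mirrors IF CHANGED CHECKSUM; both names are hypothetical. The demo file is a temporary stand-in:

```python
import hashlib
import os
import tempfile


def file_md5(path: str) -> str:
    """MD5 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_checksum(path: str, expected: str) -> bool:
    """Hypothetical: IF FAILED MD5 CHECKSUM EXPECT <expected>."""
    return file_md5(path) != expected.lower()


class ChecksumWatcher:
    """Hypothetical: IF CHANGED CHECKSUM, remembering the last hash."""

    def __init__(self, path: str):
        self.path = path
        self.last = file_md5(path)

    def changed(self) -> bool:
        current = file_md5(self.path)
        changed = current != self.last
        self.last = current  # the new hash becomes the baseline
        return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "httpd.conf")  # hypothetical watched file
        with open(path, "w") as f:
            f.write("Listen 80\n")
        watcher = ChecksumWatcher(path)
        print(watcher.changed())  # False: nothing edited yet
        with open(path, "a") as f:
            f.write("Listen 8080\n")
        print(watcher.changed())  # True: content changed
        print(watcher.changed())  # False: new hash remembered
```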
Changed timestamp (unit: "SECOND", "MINUTE", "HOUR" or "DAY")
# If the file is older than N minutes, then things are broken
IF TIMESTAMP > 1 MINUTE THEN alert
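Below is a minimal Python model of IF TIMESTAMP > value unit, using the unit names listed above. It compares only the file's modification time. That is an assumption, since Monit may consider other timestamps. The function name is hypothetical:

```python
import os
import tempfile
import time

UNIT_SECONDS = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600, "DAY": 86400}


def timestamp_older_than(path, value, unit="MINUTE", now=None):
    """Hypothetical: IF TIMESTAMP > <value> <unit>, True when file is older.

    Uses the modification time only (an assumption of this sketch).
    Plural unit names ("MINUTES") are accepted as well.
    """
    now = time.time() if now is None else now
    age = now - os.stat(path).st_mtime
    return age > value * UNIT_SECONDS[unit.upper().rstrip("S")]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        two_minutes_ago = time.time() - 120
        os.utime(tmp.name, (two_minutes_ago, two_minutes_ago))
        print(timestamp_older_than(tmp.name, 1, "MINUTE"))  # True
        print(timestamp_older_than(tmp.name, 5, "MINUTE"))  # False
```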
EXIST
IF [DOES] NOT EXIST THEN action
PERMISSION TESTING
- IF FAILED PERM(ISSION) octalnumber THEN action
- IF FAILED [E]UID user THEN action
- IF FAILED GID group THEN action
Example:
check file shadow with path /etc/shadow
    if failed permission 0640 then alert
    if failed uid root then alert
    if failed gid shadow then alert
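The three permission tests map onto a single stat() call. Below is a Python sketch with hypothetical helper names. It accepts either names or numeric ids for uid/gid, and the demo runs against a temporary file instead of /etc/shadow:

```python
import grp
import os
import pwd
import stat
import tempfile


def failed_permission(path: str, mode: int) -> bool:
    """Hypothetical: IF FAILED PERMISSION <octal>, exact mode comparison."""
    return stat.S_IMODE(os.stat(path).st_mode) != mode


def failed_uid(path: str, user) -> bool:
    """Hypothetical: IF FAILED UID <user>; name ("root") or number (0)."""
    uid = os.stat(path).st_uid
    if isinstance(user, int):
        return uid != user
    return uid != pwd.getpwnam(user).pw_uid


def failed_gid(path: str, group) -> bool:
    """Hypothetical: IF FAILED GID <group>; name ("shadow") or number."""
    gid = os.stat(path).st_gid
    if isinstance(group, int):
        return gid != group
    return gid != grp.getgrnam(group).gr_gid


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        os.chmod(tmp.name, 0o640)
        print(failed_permission(tmp.name, 0o640))    # False
        print(failed_permission(tmp.name, 0o600))    # True
        print(failed_uid(tmp.name, os.geteuid()))    # False: we own the file
```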
Repetition counts (WITHIN, FOR)
WITHIN
[[<X>] [TIMES WITHIN] <Y> CYCLES]
IF CHANGED <TEST> [[<X>] [TIMES WITHIN] <Y> CYCLES] THEN ACTION
e.g.
# An alert is delivered each time the condition becomes true.
# Send a reminder alert on every tenth cycle while the service remains in a failed state
alert foo@bar with reminder on 10 cycles
# Alert when TCP port 80 fails to respond 3 times within 5 cycles
if failed port 80 for 3 times within 5 cycles then alert
# Sustained over a consecutive period:
if cpu is greater than 50% for 5 cycles then restart
# Limit the number of restarts (IF N RESTART WITHIN M CYCLES THEN <action>)
# restarted 2 times within 3 cycles
if 2 restarts within 3 cycles then unmonitor
FOR
Syntax: ... FOR N CYCLES ...
Requires N consecutive events before switching the state
# If the test alternates between failure and success (1-0-1-0-1-0-...), a "for 2 cycles" condition will never match
IF FAILED PORT 80 FOR 2 CYCLES THEN alert
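The difference between FOR and TIMES WITHIN is easiest to see on a sequence of per-cycle results. For each cycle, this Python sketch reports whether each condition holds, where True means the test failed that cycle. It models only the counting rules described above, not when Monit actually sends alerts. The function names are hypothetical:

```python
from collections import deque


def for_n_cycles(results, n):
    """Hypothetical: FOR N CYCLES, True once N failures occur in a row."""
    streak, out = 0, []
    for failed in results:
        streak = streak + 1 if failed else 0  # any success resets the run
        out.append(streak >= n)
    return out


def times_within(results, x, y):
    """Hypothetical: X TIMES WITHIN Y CYCLES, >= X failures in the last Y."""
    window, out = deque(maxlen=y), []
    for failed in results:
        window.append(failed)                 # oldest cycle drops out
        out.append(sum(window) >= x)
    return out


if __name__ == "__main__":
    flapping = [True, False] * 4              # 1-0-1-0-... (True = failed)
    print(for_n_cycles(flapping, 2))          # all False
    print(times_within(flapping, 3, 5))
    # [False, False, False, False, True, False, True, False]
```

A flapping service never satisfies `for 2 cycles`, while `3 times within 5 cycles` catches it from the fifth cycle on.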
Monit's Limits
Monit's own resource limits
Default values:
set limits {
    programOutput:     512 B,      # check program's output truncate limit
    sendExpectBuffer:  256 B,      # limit for send/expect protocol test
    fileContentBuffer: 512 B,      # limit for file content test
    httpContentBuffer: 1 MB,       # limit for HTTP content test
    networkTimeout:    5 seconds   # timeout for network I/O
    programTimeout:    300 seconds # timeout for check program
    stopTimeout:       30 seconds  # timeout for service stop
    startTimeout:      30 seconds  # timeout for service start
    restartTimeout:    30 seconds  # timeout for service restart
}
Cheat Sheet
monit -t # syntax checking
monit reload # Must reload after editing the config file; otherwise "monitor <name>" will not pick up the new settings
monit summary # Status only (OK|Initializing)
monit status # Includes process info (uptime, threads, children, cpu, memory, io)
Doc
# Very complete documentation