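Last updated: 2019-05-07
Preface
If you are not yet familiar with watchdog, never enable it on a production server,
because you will accidentally reboot it many times!! This is true: I have rebooted mine at least 5 times ...
On Linux 2.6, the watchdog system consists of the following two parts: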
- watchdog daemon
- softdog kernel module
softdog
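A hardware watchdog is more reliable than softdog, because softdog is implemented by a kernel module (softdog.ko) on top of the kernel timer mechanism.
When the kernel is completely dead, softdog naturally cannot work ...
Load the kernel module:
# The device node is created automatically, so these are normally not needed
# mknod /dev/watchdog c 10 130
# chmod 600 /dev/watchdog
modprobe softdog # Ships with Linux itself, no extra installation needed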
lsmod | grep softdog
softdog 2108 0
P.S.
* Loading the module does not mean it is enabled!! To use softdog, it must be woken up once
Enable softdog:
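echo a > /dev/watchdog # Any character will wake it up
After that, if it is not written to again within the timeout (default 60 seconds), the system is rebooted!! (module setting: soft_margin=60)
[823573.854213] SoftDog: Unexpected close, not stopping watchdog! <-- woken up (armed)
30-second timeout & ONLY_TESTING mode: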
modprobe softdog soft_margin=30 soft_noboot=1
[823603.936044] SoftDog: Triggered - Reboot ignored. <-- ONLY_TESTING
Other Opts:
soft_panic
Softdog action, set to 1 to panic, 0 to reboot (default=0)
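To keep these module options across reboots, they can be placed in a modprobe.d file instead of being typed on the modprobe command line. A minimal sketch, assuming the usual /etc/modprobe.d directory; the helper name write_softdog_options and the file name softdog.conf are made up for illustration:

```shell
# Hypothetical helper: write softdog module options to a modprobe.d file.
# $1 = target directory (assumed default: /etc/modprobe.d)
# $2 = soft_margin in seconds
# $3 = soft_noboot (1 = testing only, the reboot is skipped)
write_softdog_options() {
    conf_dir=${1:-/etc/modprobe.d}
    margin=${2:-30}
    noboot=${3:-1}
    printf 'options softdog soft_margin=%s soft_noboot=%s\n' \
        "$margin" "$noboot" > "$conf_dir/softdog.conf"
}
```

Usage would be something like write_softdog_options /etc/modprobe.d 30 1, followed by reloading the module so the new options are picked up.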
Disable softdog:
echo V > /dev/watchdog
Check whether it has really stopped:
lsmod | grep softdog
softdog 13510 0
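The enable / keepalive / disable steps above can be sketched as one shell function that holds the device open, writes to it periodically, and writes V before closing. This is only an illustration: the function name feed_watchdog and its arguments are hypothetical, and the interval must stay below soft_margin or the machine will reboot.

```shell
# Hypothetical helper: arm the watchdog, feed it, then disarm cleanly.
# $1 = device (assumed default: /dev/watchdog)
# $2 = seconds between writes (must stay below soft_margin)
# $3 = number of keepalive writes before disarming
feed_watchdog() {
    dev=${1:-/dev/watchdog}
    interval=${2:-10}
    count=${3:-6}

    # Refuse to continue if the device cannot be written to.
    [ -w "$dev" ] || return 1

    # Keep the device open on fd 3 for the whole feeding period.
    exec 3> "$dev"

    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'k' >&3        # any character resets the countdown
        sleep "$interval"
        i=$((i + 1))
    done

    printf 'V' >&3            # magic character: ask for a clean stop
    exec 3>&-                 # close the device
}
```

With soft_noboot=1 loaded, something like feed_watchdog /dev/watchdog 10 6 is a safe way to try this. If the module was loaded with nowayout set, writing V is not expected to stop the timer.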
watchdog daemon
watchdog is a daemon; it is responsible for writing to /dev/watchdog at regular intervals
Install:
apt-get install watchdog
Configuration files:
/etc/watchdog.conf
/etc/default/watchdog
run_watchdog=1
Options:
-q, --no-action # Do not reboot or halt the machine. This is for testing purposes
-s, --sync # Try to synchronize the filesystem every time the process is awake.
# Note that the system is rebooted if for any reason the synchronizing lasts longer than a minute.
-b, --softboot # Soft-boot the system if an error occurs during the main loop. A soft boot kills all processes with SIGTERM and, after a short pause, kills all remaining processes with SIGKILL.
# Note that this does not apply to the opening of /dev/watchdog
Settings:
# Run in realtime mode (SCHED_RR) and never get swapped out
realtime = yes
# Set the schedule priority for realtime mode
priority = 1
watchdog-device = /dev/watchdog
# By default watchdog triggers /dev/watchdog once per second
interval = 10
# /proc/loadavg check
# 0 = disabled
max-load-1 = 0
max-load-5 = 18
max-load-15 = 12
# File check
# stat the file & exit()
file = /var/log/messages
# A stricter check than "file" alone: look for changes
# rsyslog needs "$ModLoad immark" & "$MarkMessagePeriod 180"
change = 600
# watchdog writes a message to the logfile (/var/log/watchdog) every interval x logtick seconds
logtick = 10
# Ping check:
# send out three ping packets
ping = 192.168.123.1
interface = vzbr0
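To get a feel for what the max-load-* settings above mean, the comparison can be approximated in shell. This is a sketch of the idea, not the daemon's actual code; the function name load_ok is hypothetical, and it assumes a limit is treated as exceeded only when the load average is strictly greater than it.

```shell
# Hypothetical helper mirroring the max-load-* checks.
# $1 = contents of /proc/loadavg
# $2 $3 $4 = limits for the 1, 5 and 15 minute averages (0 = disabled)
# Returns 0 if all enabled limits are respected, 1 otherwise.
load_ok() {
    echo "$1" | awk -v m1="$2" -v m5="$3" -v m15="$4" '{
        if (m1  > 0 && $1 > m1)  exit 1
        if (m5  > 0 && $2 > m5)  exit 1
        if (m15 > 0 && $3 > m15) exit 1
        exit 0
    }'
}
```

For example, load_ok "$(cat /proc/loadavg)" 0 18 12 || echo "load too high" applies the same limits as the settings above to the current machine.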
The watchdog daemon can perform the following checks:
- Is the process table full?
# watchdog will try periodically to fork itself to see whether the process table is full.
- Is there enough free memory?
- Are some files accessible?
- Have some files changed within a given interval?
- Is the average work load too high?
- Has a file table overflow occurred?
- Is a process still running? The process is specified by a pid file.
- Do some IP addresses answer to ping?
- Do network interfaces receive traffic?
- Is the temperature too high? (Temperature data not always available.)
- Execute a user defined command to do arbitrary tests.
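Two of the checks listed above (is a file accessible, has it changed within a given interval) can be approximated in a few lines of shell. This is a rough sketch under stated assumptions: the function name file_fresh is hypothetical, and it relies on GNU stat (stat -c %Y) as found on typical Linux systems.

```shell
# Hypothetical helper mirroring the "file" + "change" checks.
# $1 = file to inspect
# $2 = maximum allowed age of the last modification, in seconds
# Returns 0 if the file exists and was modified within that window.
file_fresh() {
    [ -e "$1" ] || return 1                 # "file" check: must be accessible
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1    # GNU stat: mtime as epoch seconds
    [ $((now - mtime)) -le "$2" ]
}
```

For example, file_fresh /var/log/messages 600 || echo "log is stale" matches the file and change values used in the settings earlier.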
CentOS 7 - watchdog
Install
yum install watchdog
Configuration
- /etc/sysconfig/watchdog (no changes needed)
- /etc/watchdog.conf
Service
- watchdog.service
- watchdog-ping.service
Load kernel module on boot (CentOS 7)
echo softdog > /etc/modules-load.d/softdog.conf
sbin
/usr/sbin/watchdog
Main Program
/usr/sbin/wd_identify
This utility opens /dev/watchdog and gets the identification string from the watchdog which then is printed.
/usr/sbin/wd_keepalive
This is a simplified version of the watchdog daemon.
It keeps writing to /dev/watchdog often enough to keep the kernel from resetting the system.
The wd_keepalive daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly.
Other