watchdog

Last updated: 2019-05-07

Introduction

 

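Until you are familiar with watchdog, never enable it on a production server.

You will end up rebooting it by accident many times!! This is real: I have rebooted mine at least 5 times ...

On Linux 2.6 the watchdog system consists of the following two parts: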

  • watchdog daemon
  • softdog kernel module

 


softdog

 

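A hardware watchdog is more reliable than softdog, because softdog is implemented by a kernel module (softdog.ko) on top of the kernel timer mechanism.

When the kernel itself is completely dead, softdog is naturally useless ...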

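Load the kernel module:

# The device node is created automatically, so these are normally not needed:

# mknod /dev/watchdog c 10 130

# chmod 600 /dev/watchdog

modprobe softdog                             # ships with Linux, nothing extra to install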

lsmod | grep softdog

softdog                 2108  0

P.S.

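 * Loading the module does not mean it is enabled!! To use softdog you have to wake it up once.

Enable softdog:

echo a > /dev/watchdog             # any character wakes it (V is special, see "Disable softdog")

After that, if nothing writes to it again within 60 seconds (the default; module parameter soft_margin=60), the system is rebooted!!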

[823573.854213] SoftDog: Unexpected close, not stopping watchdog!     <-- woken up (closed without V, so the timer keeps running)
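
The keepalive step above can be sketched as a small loop. Each "echo a > /dev/watchdog" opens and closes the device, which is what produces the "Unexpected close" message. The sketch below instead holds the device open on one file descriptor, writes a byte per interval, and writes V before closing so the timer is stopped cleanly. The function name pet_watchdog and its three arguments are hypothetical, and it assumes the interval is kept well below soft_margin. It takes the device path as an argument so it can be tried against an ordinary file first.

```shell
# pet_watchdog DEVICE COUNT INTERVAL   (hypothetical helper, not part of any package)
#   DEVICE   - path to write to; use a plain file for a dry run
#   COUNT    - number of keepalive writes
#   INTERVAL - seconds to sleep between writes (assumed to be below soft_margin)
pet_watchdog() {
    dev=$1; count=$2; interval=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3          # any byte resets the countdown
            i=$((i + 1))
            sleep "$interval"
        done
        printf 'V' >&3              # magic character: request a clean stop on close
    } 3>"$dev"                      # fd 3 stays open for the whole loop, then closes
}

# Dry run against a temporary file:
#   pet_watchdog /tmp/fake-watchdog 3 1
# On a real system (only once you understand the consequences):
#   pet_watchdog /dev/watchdog 6 10
```

Without the final V, closing the descriptor would leave the timer armed, and the machine would reboot once soft_margin expires.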

30-second timeout & test only (no real reboot):

modprobe softdog soft_margin=30 soft_noboot=1

[823603.936044] SoftDog: Triggered - Reboot ignored.                  <-- ONLY_TESTING

Other options:

soft_panic

Softdog action, set to 1 to panic, 0 to reboot (default=0)

Disable softdog:

echo V > /dev/watchdog

Check whether it has really stopped:

lsmod | grep softdog

softdog                13510  0
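
For scripting this check, a minimal sketch that pulls the use count (the last number in the output above) out of lsmod-style text. The helper name module_use_count is hypothetical. Reading a use count of 0 as "stopped" is my assumption rather than something the softdog documentation promises, so treat the result as a hint only.

```shell
# module_use_count NAME   (hypothetical helper)
# Reads lsmod-style output ("Module  Size  Used by") on stdin and prints the
# use count (third column) of module NAME; prints nothing if it is not listed.
module_use_count() {
    awk -v name="$1" '$1 == name { print $3 }'
}

# Usage on a live system:
#   lsmod | module_use_count softdog
```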

 


watchdog daemon

 

watchdog is a daemon; its job is to write to /dev/watchdog at regular intervals.

Install:

apt-get install watchdog

Config files:

/etc/watchdog.conf

/etc/default/watchdog

run_watchdog=1

Options:

-q, --no-action             # Do not reboot or halt the machine. This is for testing purposes.

-s, --sync                    # Try to synchronize the filesystem every time the process is awake.

                                  # Note that the system is rebooted if for any reason the synchronizing lasts longer than a minute.

-b, --softboot               # Soft-boot the system if an error occurs during the main loop (e.g. a file cannot be stat()ed).

                                  # A soft boot kills all processes with SIGTERM, then after a short pause kills the remaining ones with SIGKILL.

                                  # This does not apply to the opening of /dev/watchdog.

 

Settings:

# run in realtime mode (SCHED_RR) and lock the daemon in memory so it is never swapped out
realtime = yes
# Set the schedule priority for realtime mode
priority = 1


watchdog-device = /dev/watchdog

# By default watchdog triggers /dev/watchdog once per second; here it is set to every 10 seconds
interval = 10

# /proc/loadavg Check
# 0 = disabled
max-load-1             = 0
max-load-5             = 18
max-load-15            = 12


# file check
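# "file" alone: only stat() the file to check that it is accessible
file = /var/log/messages
# "change" goes one step further than "file": the file must have changed within this many seconds
# rsyslog needs "$ModLoad immark" & "$MarkMessagePeriod 180" so that the log is written to regularly
change = 600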



# watchdog writes a message to the logfile (/var/log/watchdog) every interval x logtick seconds (here 10 x 10 = 100 s)
logtick = 10


# ping check:
# send out three ping packets
ping                    = 192.168.123.1
interface               = vzbr0
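
To make the file / change pair above concrete, here is a rough sketch of the kind of test it describes: is the file reachable, and was it modified within the last N seconds? This is not the daemon's code. The helper name file_is_stale is hypothetical, using the file's mtime is my assumption about how "changed" is judged, and stat -c %Y assumes GNU coreutils.

```shell
# file_is_stale FILE MAX_AGE [NOW]   (hypothetical helper)
# Returns 0 when FILE was last modified more than MAX_AGE seconds before NOW
# (epoch seconds, defaults to the current time), 1 when it is fresh enough,
# and 2 when the file cannot be stat()ed at all.
file_is_stale() {
    file=$1; max_age=$2; now=${3:-$(date +%s)}
    mtime=$(stat -c %Y "$file") || return 2    # GNU stat: mtime in epoch seconds
    [ $((now - mtime)) -gt "$max_age" ]
}

# Usage with the values from the config above:
#   file_is_stale /var/log/messages 600 && echo "no change for over 600 s"
```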

Watchdog can perform the following checks:

  • Is the process table full?
    (watchdog periodically tries to fork itself to see whether the process table is full)
  • Is there enough free memory?
  • Are some files accessible?
  • Have some files changed within a given interval?
  • Is the average work load too high?
  • Has a file table overflow occurred?
  • Is a process still running? The process is specified by a pid file.
  • Do some IP addresses answer to ping?
  • Do network interfaces receive traffic?
  • Is the temperature too high? (Temperature data not always available.)
  • Execute a user defined command to do arbitrary tests.
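
As an illustration of the "average work load" check in the list above, a minimal sketch that compares a /proc/loadavg line against three limits, where 0 disables a limit as in the max-load-* settings. The helper name load_too_high is hypothetical, and the comparison is only my approximation of what the daemon does.

```shell
# load_too_high "LOADAVG_LINE" MAX1 MAX5 MAX15   (hypothetical helper)
# LOADAVG_LINE is the content of /proc/loadavg; a limit of 0 disables that check.
# Returns 0 if any enabled limit is exceeded, 1 otherwise.
load_too_high() {
    echo "$1" | awk -v m1="$2" -v m5="$3" -v m15="$4" '{
        high = (m1 > 0 && $1 > m1) || (m5 > 0 && $2 > m5) || (m15 > 0 && $3 > m15)
        exit (high ? 0 : 1)
    }'
}

# Usage with the limits from the config above:
#   load_too_high "$(cat /proc/loadavg)" 0 18 12 && echo "load too high"
```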

 


CentOS 7 - watchdog

 

Install

yum install watchdog

Configuration

  • /etc/sysconfig/watchdog (no changes needed)
  • /etc/watchdog.conf

Service

  • watchdog.service
  • watchdog-ping.service

Load kernel module on boot (CentOS 7)

echo softdog > /etc/modules-load.d/softdog.conf

sbin

/usr/sbin/watchdog

The main program (the watchdog daemon).

/usr/sbin/wd_identify

This utility opens /dev/watchdog, gets the identification string from the watchdog and prints it.

/usr/sbin/wd_keepalive

This is a simplified version of the watchdog daemon.

It keeps writing to /dev/watchdog often enough to keep the kernel from resetting the system.

The wd_keepalive daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly.

 


Other