Linux Watchdog Daemon - Testing

Testing of the watchdog before you go live (i.e. have it configured from boot time) is essential, otherwise you risk having a machine in an unusable situation of booting, triggering the watchdog, rebooting, etc.

There are several stages to consider when testing the watchdog:

Check the watchdog runs with no test options and successfully opens and refreshes the watchdog device (i.e. like wd_keepalive).
Check that each option you enable in the configuration file works as expected. Do so one at a time.
Check that each test/repair script works as expected when you run it from the command line.
Check that each test/repair script works when the watchdog runs it (probably with a different working directory & PATH).

Some of this is part of the normal installation, some of it customisation.
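
As a rough illustration of the last check in the list above, a script can be run by hand with a stripped-down environment before handing it to the daemon. This is only a sketch: run_like_daemon is a hypothetical helper, and both the PATH value and the use of "/" as the working directory are assumptions about what the daemon provides, not something taken from the watchdog package. The script must be given as an absolute path because the helper changes directory first.

```shell
#!/bin/sh
# Hypothetical helper: run a test/repair script with an empty environment
# (apart from an assumed PATH) and with "/" as the working directory.
run_like_daemon() {
    script=$1
    shift
    (
        # The working directory is an assumption; adjust to suit.
        cd / || exit 1
        # env -i clears every inherited variable before setting PATH.
        env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin "$script" "$@"
    )
}
```

For example, "run_like_daemon /path/to/your-test.sh; echo $?" shows the exit status the script would report. A script that works from your shell but fails here is probably relying on your log-in environment.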

Precautions

Before you start installing and testing the watchdog you have to consider the potential consequences. These include:

The machine gets in to an endless reboot loop.
The file system(s) get corrupted by a hard reset.
The machine runs, but is trigger-happy and reboots when not expected.

Reboot Loop

In the first case, if you get in to such a state you may need to boot in to safe mode, or use a "live CD" (or USB stick) to boot up and edit the machine's settings to disable the watchdog until you figure out what went wrong. To make this easier, you may want to have the grub boot loader show you the options before booting the normal system.

To do this on a typical Ubuntu 12.04 machine modify the /etc/default/grub file to change the following:

#GRUB_HIDDEN_TIMEOUT=0
#GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=5

Finally, run 'update-grub' (as root or with sudo) to apply the changes. On reboot you should now get a 5 second countdown and a choice of kernel, safe mode or memtest.

File System Corruption

The second risk, that of an unexpected reset not unmounting the file systems cleanly and causing corruption, is quite small with modern journalling file systems (e.g. ext3/4) but should not be ignored. This applies in particular if you need to use something fragile for one or more mounted file systems (like FAT32 for compatibility reasons).

Hence there are a few precautions you should consider:

Use a test computer, and not a live or important computer, for practising your installation, configuration and test.
You can test most things, except for hardware drivers of course, in a virtual machine (VMware, Xen, etc) with little risk, as they often allow snapshots of the file system, which provide a pain-free way of rolling back changes.
Have an up-to-date backup if you must use something important for testing. You do have a backup?
If fragile file systems are in use, can you unmount them first for testing? If only certain tests need those file systems, then try to run as much of the testing as possible without them mounted and test those last.
Try to use the sync command just before you run any test that might provoke an event.
Initially try testing at times of low disk activity; less I/O means less risk of significant trouble.
Make sure no other users are logged on and working on the test computer; they will not appreciate such a rude interruption!
Consider editing /etc/default/rcS to enable automatic fixing of file system problems at boot time (or for systemd adding the "fsck.repair=yes" option to the kernel command line and updating the grub boot loader, etc).

However, a spurious reboot (kernel panic, hardware fault, power outage, etc) is always a possibility, and as far as possible you should configure the operating system and software in such a way that file systems are checked & repaired automatically, and that data & processes have integrity tests and locks to allow a clean re-start/roll-back from any critical phases of operations.

And test them! You should be able to reboot at any time and recover a functional system, but only with testing will you find out if this really is the case.

Trigger Happy

It can be difficult to configure some of the watchdog's tests in such a manner that they will rescue a hung computer, but are not triggered by normal activity. In particular, the load averages and the memory limits need quite a lot of insight into the machine's operational behaviour. Of course, such behaviour can also change without warning if the users alter what they do, when they are all logged in, etc.

The best advice here is to monitor the machine for a while before configuring the tests, and if that is not possible, to choose limits that are far from the normal use-case so only extreme loads will risk a reboot.
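
One possible way to gather the monitoring data mentioned above is to log the load averages at intervals and look at the peak afterwards. The sketch below assumes a Linux /proc/loadavg; log_load and peak_load are hypothetical helper names, and the log format (epoch time followed by the three load averages) is simply a choice made for this example.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped sample of the 1, 5 and
# 15 minute load averages to the log file named in $1.
log_load() {
    printf '%s %s\n' "$(date +%s)" "$(cut -d' ' -f1-3 /proc/loadavg)" >> "$1"
}

# Hypothetical helper: print the highest 1-minute load average
# (second column) recorded in the log file named in $1.
peak_load() {
    awk '$2 > max { max = $2 } END { print max + 0 }' "$1"
}
```

Calling log_load from cron every minute for a week or so, then running peak_load on the result, gives a baseline from which a limit well above normal use can be chosen.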


Basic Watchdog Operation

The first test you need to perform is with the "basic installation" of the watchdog daemon (no config file tests enabled, no auto-load scripts) to establish that it can open the watchdog device, and that said device is capable of resetting the PC.

Warning: Triggering the reset action is a risk to your machine's file system and application's data integrity! Hence you should make sure as little as possible is running (e.g. email client closed, no one else logged in, etc) and run the 'sync' command just before you execute any test. Also you should really check your machine is not rebuilding or scrubbing any software RAID when doing the tests with:

# more /proc/mdstat

Where '#' is the root command prompt (this check also works as a normal user, typically shown with '$' as the command prompt). If there are no MD devices, or they are all showing OK, then proceed.
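
The same check can be scripted so it is not forgotten before each reset test. This is a minimal sketch: md_busy is a hypothetical helper, and the pattern assumes the usual /proc/mdstat progress lines, where an operation in progress (or queued) appears as "resync", "recovery", "reshape" or "check" followed by "=".

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the mdstat file shows a resync,
# recovery, reshape or check that is running or pending.
# The file defaults to /proc/mdstat; -s hides the error if it is absent.
md_busy() {
    grep -Eqs '(resync|recovery|reshape|check) *=' "${1:-/proc/mdstat}"
}

if md_busy; then
    echo "software RAID is busy, postpone the reset test"
else
    echo "no RAID rebuild or scrub in progress"
fi
```

A degraded array that is not currently rebuilding is not detected by this pattern, so it is still worth reading the output of /proc/mdstat yourself.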

Checking for the Watchdog Hardware

If you have successfully loaded the watchdog hardware's driver module (or the 'softdog' emulator) then you should see the entry in /dev corresponding to this. For example:

# ls -l /dev/watch*
crw------- 1 root root 10, 130 May 13 16:27 /dev/watchdog

In this case you edit the test copy of your watchdog.conf file, say ./test.conf in the working directory where you are testing the system and add/modify the line to match. For example:

watchdog-device = /dev/watchdog
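
If you want a script to confirm the device node before going further, a small sketch along these lines may help. The helper name is_char_device is hypothetical; the device path defaults to /dev/watchdog as in the example above and can be overridden by the first argument.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 exists and is a character device.
is_char_device() {
    [ -c "$1" ]
}

dev=${1:-/dev/watchdog}
if is_char_device "$dev"; then
    echo "$dev is present"
else
    echo "$dev not found: is the driver module (or softdog) loaded?"
fi
```

This only tests that the node exists; it deliberately does not open the device, since opening it may start the timer.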

You can check the device using the wd_identify utility, or look in syslog after starting the watchdog to see if it agrees with your expectations:

# wd_identify --config-file ./test.conf
W83627HF WDT

In this case we have an Itox EL620 motherboard and it is using the w83627hf_wdt watchdog driver module for the Winbond W83627DHG-P chip. This chip provides system monitoring (temperature, supply voltage, etc) as well as the watchdog timer. So here the test looks good.

Testing the Watchdog Hardware

Next we need to check that the hardware will run and trigger a reboot if the daemon fails. The simplest option here is to run the watchdog daemon first, and check that it is happy with the driver:

# watchdog --config-file ./test.conf

Then check the results in syslog:

# grep watchdog /var/log/syslog
May 13 17:47:03 test0 watchdog[12089]: starting daemon (6.00):
May 13 17:47:03 test0 watchdog[12089]: int=1s realtime=yes sync=no load=0,0,0
May 13 17:47:03 test0 watchdog[12089]: memory not checked
May 13 17:47:03 test0 watchdog[12089]: ping: no machine to check
May 13 17:47:03 test0 watchdog[12089]: file: no file to check
May 13 17:47:03 test0 watchdog[12089]: pidfile: no server process to check
May 13 17:47:03 test0 watchdog[12089]: interface: no interface to check
May 13 17:47:03 test0 watchdog[12089]: temperature: no sensors to check
May 13 17:47:03 test0 watchdog[12089]: no test binary files
May 13 17:47:03 test0 watchdog[12089]: no repair binary files
May 13 17:47:03 test0 watchdog[12089]: error retry time-out = 60 seconds
May 13 17:47:03 test0 watchdog[12089]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
May 13 17:47:03 test0 watchdog[12089]: watchdog now set to 60 seconds
May 13 17:47:03 test0 watchdog[12089]: hardware watchdog identity: W83627HF WDT

Again we see the same hardware identity as "W83627HF WDT" and no error messages, so we should have the daemon running OK. A quick check of that should confirm it:

# ps -Af | grep watch
root         7     2  0 16:27 ?        00:00:00 [watchdog/0]
root        12     2  0 16:27 ?        00:00:00 [watchdog/1]
root        16     2  0 16:27 ?        00:00:00 [watchdog/2]
root        20     2  0 16:27 ?        00:00:00 [watchdog/3]
root     12089     1  0 17:47 ?        00:00:00 watchdog --config-file ./test.conf
root     12128 11598  0 17:51 pts/0    00:00:00 grep --color=auto watch

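Here we can see all processes that have 'watch' in their names, and the watchdog daemon is the 5th entry. The first four entries, shown in square brackets, are kernel threads (per-CPU lockup detectors) and are not part of the watchdog daemon. The daemon can be seen running as such due to the parent ID being 1, as 'init' has taken over. The process ID, 12089 in this example, is also shown in the syslog entry above.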

To stop the watchdog cleanly we could use 'pkill watchdog' to send SIGTERM, however, in this case we want to kill the process without closing the watchdog device driver, so instead we execute the following commands:

# touch /forcefsck
# sync
# pkill -9 watchdog
# for n in $(seq 1 60); do echo $n; sleep 1; sync; done

Then we wait. In approximately 60 seconds (the figure reported in syslog here as "watchdog now set to 60 seconds") the machine should reboot as the hardware timer expires.

The command 'touch /forcefsck' tells the machine to check its file systems on reboot even if it thinks they are OK (they won't be, but with a journalling file system they should be recovered automatically and so clean enough). The sync commands are intended to make sure the file system remains as clean as possible when the reset kicks in.

You can get a slightly fancier version of this test in the form of the test-watchdog-reset.sh script in the example scripts download.

Once your machine has rebooted and completed any file system recovery, you should check syslog or boot.log to see it went OK and no real problems were encountered.

If you don't get a reboot, then you need to check:

That the driver saw the watchdog exit badly. In syslog you should see something like: "w83627hf/thf/hg/dhg WDT: Unexpected close, not stopping watchdog!".
That the watchdog module was the correct one for your hardware.
Whether there are any BIOS or IPMI settings to enable/disable the watchdog hardware.
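
The first check in the list above can be scripted as a simple search of the log. This is a sketch under assumptions: saw_unexpected_close is a hypothetical helper, /var/log/syslog is the log location used in the examples on this page, and the exact message wording varies between driver modules, so the pattern only looks for the "unexpected close" part, ignoring case.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the log file (default /var/log/syslog)
# contains a driver message about an unexpected close of the device.
saw_unexpected_close() {
    grep -iqs 'unexpected close' "${1:-/var/log/syslog}"
}

if saw_unexpected_close; then
    echo "driver noticed the unclean exit and left the timer running"
else
    echo "no 'unexpected close' message found in the log"
fi
```

If the message is missing, the driver may word it differently, so it is worth searching the log for the module name as well.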

If the hardware works OK, then you can concentrate on configuring and testing the health monitoring options.

Testing Accounts

Testing as Root

Normally the watchdog daemon runs as root and so it has the authority to perform all tests (e.g. ping) and if a fault is detected it will bring down all processes and reboot the machine. During testing this can be a bit tiresome, and for a lot of the tests you can run them with a different user account to your normal one and save the risk and wasted time of the forced reboots.

If you need to test as root, for example, to check the network ping test is working as planned, you should consider using the '--no-action' command line option so detected faults do not bring the machine down.

Even so, take considerable care when doing anything as root, because a fault in a test script, etc, could cause serious damage to the machine if run as root. When possible, start your testing as a normal user (see below).

Testing as Normal User

The advantage of testing as a normal user is you can't take the machine down (assuming the watchdog hardware is not in use). However, you can and will kill off all of your own processes if the daemon attempts a shut-down, leading to a fairly brutal logging off!

So when testing as a user-privilege process you should use a separate dummy test account. For example, log in to a terminal window (e.g. open a window and use "su test") before you start the watchdog, and then you can monitor it and trigger test events (e.g. by the example wd_test_action.sh script) from your own log-in to test how it responds.

Even though it will kill off the test-user's log-in if triggered, you will still have the information in syslog and anything you can still see in the terminal window. Again, the '--no-action' command line option can be used to stop it going that far.


Foreground vs Daemon Operation

Normally when you start the watchdog daemon it reads the config file, sets up certain actions (e.g. opening sockets to 'ping' if specified) and then becomes a daemon by forking itself and re-opening the stdin|out|err paths to /dev/null so you see nothing more from it.

To deal with any child processes ("test binary" and "repair binary" actions) it re-directs their stdout & stderr to files in the log directory (/var/log/watchdog/ by default), again so you see nothing coming from them even if they output messages.

The command line option '--foreground' skips the daemonization step, so the watchdog continues to run as a normal program. In addition, it continues to send all status messages to stderr (as well as to syslog) so the operation is visible in real-time. Since it has not closed the normal outputs, any child processes' messages are interleaved with any watchdog messages.

When testing in the foreground the natural thing to do when stopping the program is to use the Ctrl+C key stroke. Unfortunately this will not stop the watchdog module (if used) so it could lead to an unexpected reboot! If you are doing foreground testing then the better option is to send SIGTERM to the process from another terminal window.

However, you usually have some grace period after Ctrl+C, so you could run the wd_identify program, as that will open and then properly close the configured module, thus stopping any reboot. The exception is a module configured with "no way out", in which case testing is tricky and you have to keep starting wd_keepalive to prevent a reboot (just as the normal service watchdog start|stop command does).

[to be continued...]


Last Updated on 26-Aug-2019 by Paul Crawford
Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.