Linux
Watchdog Daemon - Testing
Testing the watchdog before you go live (i.e. before it is
configured to start at boot time) is essential, otherwise you risk
leaving the machine stuck in an unusable loop of booting, triggering
the watchdog, rebooting, and so on.
There are several stages to consider when testing the watchdog:
- Check the watchdog runs with no test options and successfully
opens and refreshes the watchdog device (i.e. like
wd_keepalive).
- Check that each option you enable in the configuration file
works as expected. Do so one at a time.
- Check that each test/repair script works as expected when you
run it from the command line.
- Check that each test/repair script works when the watchdog runs
it (probably with a different working directory & PATH).
Some of this is part of the normal installation, some of it
customisation.
Precautions
Before you start installing and testing the watchdog you have
to consider the potential consequences. These include:
- The machine gets in to an endless reboot loop.
- The file system(s) get corrupted by a hard reset.
- The machine runs, but is trigger-happy and reboots when not
expected.
Reboot Loop
In the first case, if you get in to such a state you may need to
boot in to safe mode, or use a "live CD" (or USB stick) to boot up
and edit the machine's settings to disable the watchdog until you
figure out what went wrong. To make this easier, you may want to
have the grub boot loader show you the options before booting the
normal system.
To do this on a typical Ubuntu 12.04 machine modify the
/etc/default/grub file to change the following:
#GRUB_HIDDEN_TIMEOUT=0
#GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=5
Finally, run 'update-grub' (as root or with sudo) to apply the
changes. On reboot you should now get a 5 second countdown and a menu
offering the choice of kernel, safe mode, or memtest.
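If you need to make the same change on several test machines, the edit can be scripted. The following is only a minimal sketch, assuming GNU sed and the Ubuntu 12.04 style variable names shown above. The function name show_grub_menu is made up for this example, and it edits whatever file you pass to it so you can try it on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: comment out the hidden-timeout settings and set a
# 5 second menu timeout in the grub defaults file given as $1.
# Assumes GNU sed (for -i) and that a GRUB_TIMEOUT= line already exists.
show_grub_menu() {
    file=$1
    cp -p "$file" "$file.bak" || return 1   # keep a copy of the original
    sed -i \
        -e 's/^GRUB_HIDDEN_TIMEOUT=/#&/' \
        -e 's/^GRUB_HIDDEN_TIMEOUT_QUIET=/#&/' \
        -e 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=5/' \
        "$file"
}
```

Try it on a copy of /etc/default/grub and compare the result with diff before running it on the real file as root. You still need to run 'update-grub' afterwards.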
File System Corruption
The second risk, that of an unexpected reset not unmounting the file
systems cleanly and causing corruption, is quite small with modern
journalling file systems (e.g. ext3/4) but should not be ignored,
particularly if you need to use something more fragile for one or
more mounted file systems (like FAT32 for compatibility reasons).
Hence there are a few precautions you should consider:
- Use a test computer, and not a live or important computer, for
practising your installation, configuration and test.
- You can test most things, except for hardware drivers of
course, in a virtual machine (VMware, xen, etc) with little
risk, as often they allow snapshots of the file system to
provide a pain-free way of rolling back changes.
- Have an up-to-date backup if you must use something important
for testing. You do have a backup?
- If fragile file systems are in use, can you unmount them first
for testing? If only certain tests need those file systems, then
try to run as much of the testing as possible without them
mounted and test those last.
- Run the sync command just before you run any test that might
provoke an event.
- Initially try testing at times of low disk activity, as less I/O
means less risk of significant trouble.
- Make sure no other users are logged on and working on the test
computer. They will not appreciate such a rude interruption!
- Consider editing /etc/default/rcS (setting FSCKFIX=yes) to enable
automatic fixing of file system problems at boot time, or on systemd
machines adding the "fsck.repair=yes" option to the kernel command
line and updating the grub boot loader, etc.
However, a spurious reboot (kernel panic, hardware fault, power
outage, etc) is always a possibility, so as far as possible you
should configure the operating system and software in such a way
that file systems are checked & repaired automatically, and that
data & processes have integrity tests and locks to allow a clean
re-start/roll-back from any critical phases of operations.
And test them! You should be able to reboot at any time and recover
a functional system, but only with testing will you find out if this
really is the case.
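Some of the precautions above are easy to forget in the heat of testing, so it can help to script them. As a small illustration, here is a sketch of the "no other users logged on" check. The function name other_users is invented for this example, and it simply filters the output of the standard 'who' command.

```shell
#!/bin/sh
# Read `who` output on stdin and print each logged-in user name, once,
# other than the name given as $1 (normally your own).
other_users() {
    awk -v me="$1" '$1 != me && !seen[$1]++ { print $1 }'
}
```

For example, running who | other_users "$(id -un)" just before a test should print nothing if you have the machine to yourself. Follow it with 'sync' and you are ready to go.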
Trigger Happy
It can be difficult to configure some of the watchdog's tests
in such a manner that they will rescue a hung computer, but are not
triggered by normal activity. In particular, the load averages and
the memory limits need quite a lot of insight into the machine's
operational behaviour. Of course, such behaviour can also change
without warning if the users alter what they do, when they are all
logged in, etc.
The best advice here is to monitor the machine for a while before
configuring the tests, and if that is not possible, to choose limits
that are far from the normal use-case so only extreme loads will
risk a reboot.
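One way to "monitor the machine for a while" is to log the load averages and free memory at regular intervals and look at the extremes afterwards. The sketch below assumes a Linux /proc file system. The function name sample_load is made up for this example, and the two optional arguments exist only so it can be tried against saved copies of the /proc files.

```shell
#!/bin/sh
# Print one timestamped sample of the 1, 5 and 15 minute load averages
# and MemFree (in kB). $1 and $2 default to the real /proc files.
sample_load() {
    loadavg=${1:-/proc/loadavg}
    meminfo=${2:-/proc/meminfo}
    # /proc/loadavg starts with the three load averages
    read -r l1 l5 l15 rest < "$loadavg"
    memfree=$(awk '/^MemFree:/ { print $2 }' "$meminfo")
    echo "$(date '+%F %T') load=$l1,$l5,$l15 memfree_kB=$memfree"
}
```

Calling this once a minute for a few typical working days (for example from a loop such as while true; do sample_load; sleep 60; done >> load.log) gives you figures to compare against before you choose values for options like max-load-1 or min-memory.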
Basic Watchdog Operation
The first test you need to perform is with the "basic installation"
of the watchdog daemon (no config file tests enabled, no auto-load
scripts) to establish that it can open the watchdog device, and that
said device is capable of resetting the PC.
Warning: Triggering the reset
action is a risk to your machine's file system and application's
data integrity! Hence you should make sure as little as
possible is running (e.g. email client closed, no one else logged
in, etc) and run the 'sync' command just before you execute any
test. Also you should really check your machine is not rebuilding or
scrubbing any software RAID when doing the tests with:
# more /proc/mdstat
Where '#' is the root command prompt (this check also works as a
normal user, typically shown with '$' as the command prompt). If
there are no MD devices, or they are all showing OK, then proceed.
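If you run the reset tests repeatedly, the /proc/mdstat check can be automated. This is a rough sketch only: the function names md_busy and md_degraded are invented here, and the patterns assume the usual mdstat layout, where a progress line contains something like "resync =" or "recovery =" and a degraded array shows an underscore in its "[UU]" status field.

```shell
#!/bin/sh
# Succeed if the mdstat-format file ($1, default /proc/mdstat) shows a
# resync, recovery, reshape or check that is running, delayed or pending.
md_busy() {
    grep -Eq '(resync|recovery|reshape|check) *=' "${1:-/proc/mdstat}" 2>/dev/null
}

# Succeed if any array has a missing member, e.g. [U_] instead of [UU].
md_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]' "${1:-/proc/mdstat}" 2>/dev/null
}
```

A test script could then refuse to continue with something like: if md_busy || md_degraded; then echo "RAID not idle, aborting"; exit 1; fi. A machine with no MD devices (or no /proc/mdstat at all) passes both checks.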
Checking for the Watchdog Hardware
If you have successfully loaded the watchdog hardware's driver
module (or the 'softdog' emulator) then you should see the entry in
/dev corresponding to this. For example:
# ls -l /dev/watch*
crw------- 1 root root 10, 130 May 13 16:27 /dev/watchdog
In this case you edit the test copy of your watchdog.conf file (say
./test.conf in the working directory where you are testing the
system) and add/modify the device line to match. For example:
watchdog-device = /dev/watchdog
You can check the device using the wd_identify utility, or look in
syslog after starting the watchdog to see if it agrees with your
expectations:
# wd_identify --config-file ./test.conf
W83627HF WDT
In this case we have an Itox EL620 motherboard and it is using the
w83627hf_wdt watchdog driver module for the Winbond W83627DHG-P
chip, which provides system monitoring (temperature, supply voltage,
etc) as well as the watchdog timer. So here the test looks good.
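The check for the /dev entry at the start of this section is worth putting at the top of any test script, so that a missing driver is reported clearly rather than showing up later as a confusing failure. A minimal sketch, with the function name have_watchdog_dev invented for this example:

```shell
#!/bin/sh
# Succeed only if $1 (default /dev/watchdog) exists and is a character
# device, which is what a loaded watchdog driver should provide.
have_watchdog_dev() {
    dev=${1:-/dev/watchdog}
    if [ -c "$dev" ]; then
        echo "found character device $dev"
    else
        echo "no character device at $dev: is the driver module (or softdog) loaded?" >&2
        return 1
    fi
}
```

If the check fails and you have no hardware driver available, loading the software emulator as root with 'modprobe softdog' should create the device so the rest of the testing can proceed.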
Testing the Watchdog Hardware
Next we need to check that the hardware will run and trigger a
reboot if the daemon fails. The simplest option here is to run the
watchdog daemon first, and check that it is happy with the driver:
# watchdog --config-file ./test.conf
Then check the results in syslog:
# grep watchdog /var/log/syslog
May 13 17:47:03 test0 watchdog[12089]: starting daemon (6.00):
May 13 17:47:03 test0 watchdog[12089]: int=1s realtime=yes sync=no load=0,0,0
May 13 17:47:03 test0 watchdog[12089]: memory not checked
May 13 17:47:03 test0 watchdog[12089]: ping: no machine to check
May 13 17:47:03 test0 watchdog[12089]: file: no file to check
May 13 17:47:03 test0 watchdog[12089]: pidfile: no server process to check
May 13 17:47:03 test0 watchdog[12089]: interface: no interface to check
May 13 17:47:03 test0 watchdog[12089]: temperature: no sensors to check
May 13 17:47:03 test0 watchdog[12089]: no test binary files
May 13 17:47:03 test0 watchdog[12089]: no repair binary files
May 13 17:47:03 test0 watchdog[12089]: error retry time-out = 60 seconds
May 13 17:47:03 test0 watchdog[12089]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
May 13 17:47:03 test0 watchdog[12089]: watchdog now set to 60 seconds
May 13 17:47:03 test0 watchdog[12089]: hardware watchdog identity: W83627HF WDT
Again we see the same hardware identity as "W83627HF WDT" and no
error messages, so we should have the daemon running OK. A quick
check of that should confirm it:
# ps -Af | grep watch
root         7     2  0 16:27 ?        00:00:00 [watchdog/0]
root        12     2  0 16:27 ?        00:00:00 [watchdog/1]
root        16     2  0 16:27 ?        00:00:00 [watchdog/2]
root        20     2  0 16:27 ?        00:00:00 [watchdog/3]
root     12089     1  0 17:47 ?        00:00:00 watchdog --config-file ./test.conf
root     12128 11598  0 17:51 pts/0    00:00:00 grep --color=auto watch
Here we can see all processes that have 'watch' in their names, and
the watchdog daemon is the 5th entry. It can be seen to be running as
a daemon because its parent ID is 1, as 'init' has taken over. The
process ID, 12089 in this example, is also shown in the syslog
entries above.
To stop the watchdog cleanly we could use 'pkill watchdog' to send
SIGTERM, however, in this case we want to kill the process without
closing the watchdog device driver, so instead we execute the
following commands:
# touch /forcefsck
# sync
# pkill -9 watchdog
# for n in $(seq 1 60); do echo $n; sleep 1; sync; done
Then we wait...in approximately 60 seconds (the figure reported in
syslog here as "watchdog now set to 60 seconds") the machine should
reboot as the hardware timer expires.
The command 'touch /forcefsck' tells the machine to check its file
systems on reboot even if it thinks they are OK (they won't be, but
with a journalling file system they should be recovered
automatically and so clean enough). The sync commands are intended
to make sure the file system remains as clean as possible when the
reset kicks in.
You can get a slightly fancier version of this test in the form of
the test-watchdog-reset.sh script in the example scripts download.
Once your machine has rebooted and completed any file system
recovery, you should check syslog or boot.log to see it went OK and
no real problems were encountered.
If you don't get a reboot, then you need to check:
- That the driver saw the watchdog exit badly. In syslog you should
see something like: "w83627hf/thf/hg/dhg WDT: Unexpected close,
not stopping watchdog!".
- That the watchdog module was the correct one for your hardware.
- Whether there are any BIOS or IPMI settings that enable/disable
the watchdog hardware.
If the hardware works OK, then you can concentrate on configuring
and testing the health monitoring options.
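The first item in the "no reboot" check list can be looked for with grep. The sketch below is an assumption-laden example: the exact kernel message varies between drivers, so the pattern covers the wording quoted above plus the "watchdog did not stop" wording that I believe newer kernels use, and the function name saw_unexpected_close is invented. Adjust the pattern to whatever your own driver logs.

```shell
#!/bin/sh
# Succeed if the log file ($1, default /var/log/syslog) contains a kernel
# message saying the watchdog device was closed without being stopped.
# The two message patterns are assumptions; drivers differ in wording.
saw_unexpected_close() {
    grep -Eiq 'unexpected close.*not stopping watchdog|watchdog did not stop' \
        "${1:-/var/log/syslog}" 2>/dev/null
}
```

If this finds nothing after a 'pkill -9 watchdog' test, the driver probably did not notice the daemon dying, or the message was logged somewhere other than syslog on your distribution.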
Testing Accounts
Testing as Root
Normally the watchdog daemon runs as root and so it has the
authority to perform all tests (e.g. ping) and if a fault is
detected it will bring down all processes and reboot the machine.
During testing this can be a bit tiresome, and many of the tests
can instead be run under a different user account from your normal
one, saving the risk and wasted time of the forced reboots.
If you need to test as root, for example, to check the network ping
test is working as planned, you should consider using the
'--no-action' command line
option so detected faults do not bring the machine down.
Even so, take considerable care when doing anything as
root, because a fault in a test script, etc, could cause serious
damage to the machine if run as root. When possible, start your
testing as a normal user (see below).
Testing as Normal User
The advantage of testing as a normal user is you can't take the
machine down (assuming the watchdog hardware is not in use).
However, you can and will kill off all of your own processes if the daemon
attempts a shut-down, leading to a fairly brutal logging off!
So when testing as a user-privilege process you should use a
separate dummy test account. For example, log in to a terminal
window (e.g. open a window and use "su test") before you start the
watchdog, and then you can monitor it and trigger test events (e.g.
by the example wd_test_action.sh script) from your own log-in to
test how it responds.
Even though it will kill off the test-user's log-in if triggered,
you will still have the information in syslog and anything you can
still see in the terminal window. Again, the '--no-action' command
line option can be used to stop it going that far.
Foreground vs Daemon Operation
Normally when you start the watchdog daemon it reads the config
file, sets up certain actions (e.g. opening sockets to 'ping' if
specified) and then becomes a daemon by forking itself and
re-opening the stdin|out|err paths to /dev/null so you see nothing
more from it.
To deal with any child processes ("test binary" and "repair binary"
actions) it re-directs their stdout & stderr to files in the log
directory (/var/log/watchdog/ by default), again so you see nothing
coming from them even if they output messages.
The command line option '--foreground' skips the daemonization step,
so the watchdog continues to run as a normal program. In addition,
it continues to send all status messages to stderr (as well as to
syslog) so the operation is visible in real-time. Since it has not
closed the normal outputs, any child processes' messages are
interleaved with any watchdog messages.
When testing in the foreground the natural thing to do when stopping
the program is to use the Ctrl+C key stroke. Unfortunately this will
not stop the watchdog module (if used) so it could lead to an
unexpected reboot! If you are doing foreground testing then the
better option is to send SIGTERM to the process from another
terminal window.
However, you usually have some grace period after Ctrl+C, so you
could run the wd_identify program, as that will open and then
properly close the configured module, thus stopping any reboot.
The exception is a module configured with "no way out", in which
case testing is tricky and you have to keep starting wd_keepalive to
prevent a reboot (just as the normal 'service watchdog start|stop'
command does).
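To make the "send SIGTERM from another terminal window" step less error-prone, you could wrap it in a small helper that also waits to confirm the process has really gone. This is only a sketch: the function name stop_cleanly and its 10 second default are choices made for this example, not anything provided by the watchdog package.

```shell
#!/bin/sh
# Send SIGTERM to process ID $1 and wait up to $2 seconds (default 10)
# for it to exit. Returns 0 once the process has gone, or 1 if it is
# still present when the time runs out.
stop_cleanly() {
    pid=$1
    tries=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 0   # nothing to stop
    while [ "$tries" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # process has exited
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

With the daemon running in one window (for example 'watchdog --foreground --no-action --config-file ./test.conf'), you would run stop_cleanly "$(pgrep -o -x watchdog)" in the other. The -x option makes pgrep match the name exactly, so the kernel's own [watchdog/N] threads are not picked up.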
[to be continued...]
Last Updated on 26-Aug-2019 by
Paul Crawford
Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.