Linux Watchdog - General Tests
    
    
As installed, the configuration file /etc/watchdog.conf leaves the
    daemon doing nothing. By default the daemon will not try opening the
    watchdog module via /dev/watchdog, nor will it run any specific
    tests. This is safe behaviour, but fairly useless.
    
The first step in protecting the host against a fault is to edit the
    configuration file so it actually uses the watchdog module (assuming
    you have that loaded). The daemon will then periodically 'feed' the
    watchdog, and should that fail, the system will perform a hard
    reboot. This provides protection against kernel faults (assuming it
    is a hardware watchdog and not the 'softdog' module that runs in the
    kernel, of course!).
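As a sketch, the relevant lines in /etc/watchdog.conf might look like
    this (the option names are those documented in watchdog.conf(5); the
    interval value is illustrative and must be comfortably less than the
    hardware time-out, which is often 60 seconds):

```
# /etc/watchdog.conf -- arm the watchdog by naming the device node:
watchdog-device = /dev/watchdog

# How often (in seconds) the daemon refreshes the timer.
interval = 1
```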
    
However, this still provides only limited protection, since a
    computer can get into a very unusable state while still running the
    already-loaded daemon, and unless the daemon stops the periodic
    refresh it is unlikely that any automated recovery will begin.
    
    Therefore to get the maximum protective value from the watchdog
    system, the daemon should be configured to run some additional tests
    that hopefully reveal the usability of the machine. Deciding on what
    to test is not so simple, because machines can break in more ways
    than easily imagined, and it is also important not to make the
    watchdog too sensitive and thus suffer unnecessary reboots.
    
The general form of the tests is:
    
      - Are fundamental resources available?
 
      - Are the expected processes or signs of activity seen?
 
      - Is it possible to run stuff?
 
    
    Some of the tests that may be useful are fairly generic to a Linux
    computer, indeed to any computer. One built-in test of the watchdog
    is "are file handles still available?"; similarly, there is the
    option (though not as simple as it sounds) to test for insufficient memory.
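The memory test, for example, can be enabled with a line such as the
    following (the option is documented in watchdog.conf(5); note the
    value is in memory pages, not bytes, and the figure here is purely
    illustrative):

```
# Reboot if fewer than this many pages of memory appear to be
# allocatable. The value is in pages (typically 4 kB each).
min-memory = 2048
```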
    
    In the next category, one of the "expected
      processes" you might choose to test (and the example in the
    configuration file) is the rsyslog daemon. It should be running, and
    normally init will re-spawn it if it fails, so the loss of that
    process is a clear sign that something is going very wrong with the
    machine. For custom computers there may be other
    application-specific processes to monitor.
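A process is monitored by naming its PID file, as in the commented-out
    example shipped in the configuration file (the exact path may differ
    between distributions):

```
# Reboot if the process owning this PID file dies and stays dead.
pidfile = /var/run/rsyslogd.pid
```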
    
    However, do not use the watchdog as the primary method of restarting
    failed daemons; that is something init should be configured to
    do. The watchdog's job in this case is the second phase, where a
    persistently failing daemon is no longer re-spawned by init, and the
    conclusion is that something else is wrong and that a reboot may be
    the best way to recover the system's operational state.
    
    The third category is the "can I run things?" tests. Of course,
    this requires various fundamental computer resources and ultimately
    is the most important question of all. If you can't run a new
    process, then the machine is very sick. One obvious test here is
    the system's load
      averages, an indicator of the queue of processes for
    execution. While a stupidly high load average tells you little about
    what is wrong, it is a clear indicator that something
    is wrong (maybe even malicious like a fork-bomb).
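The load-average limits are set per averaging window; the thresholds
    below are the ones suggested in the example configuration shipped
    with the daemon (values under about 2 risk spurious reboots):

```
# Reboot when the 1/5/15-minute load averages exceed these limits.
max-load-1  = 24
max-load-5  = 18
max-load-15 = 12
```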
    
    Another test option is to use the watchdog's ability to run an external test
      program/script. Even if such a script does nothing but 'exit
    0', the fact that it could be run tells the watchdog that some
    resources are still available, and perhaps that the command
    interpreter is still healthy.
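Such a script might look like the following minimal sketch (the script
    name and the specific check are purely illustrative, not part of the
    watchdog package; exit status 0 means healthy, and any non-zero
    status is reported to the daemon as a fault):

```shell
#!/bin/sh
# Hypothetical test script for the watchdog's "test-binary" option.

wd_probe() {
    # Example site-specific check: can we still create (and remove)
    # a scratch file in /tmp?
    probe="/tmp/.wd-probe.$$"
    touch "$probe" 2>/dev/null || return 1
    rm -f "$probe"
    return 0
}

wd_probe    # the script's exit status is the probe's result
```

It would then be wired in with a configuration line such as
`test-binary = /usr/local/sbin/wd-probe.sh` (path illustrative).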
    
As well as testing the computer's internal
    operation, the watchdog can also probe network connections. This
    sort of test needs to be used with care, since external faults
    (such as a network switch being rebooted or disconnected) can then
    trigger a reboot that will do nothing to fix the fault. However, for
    some cases (e.g. when a network interface is prone to locking up due
    to bad hardware and/or driver code) this may be a useful approach.
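The network checks use configuration lines such as these (both options
    are documented in watchdog.conf(5); the address and interface name
    here are illustrative):

```
# Reboot if this address stops answering pings (use with care!).
ping = 192.168.1.1

# Reboot if this interface stops receiving traffic.
interface = eth0
```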
    
    The same sort of network-link approach is sometimes used in
    clustered systems (e.g. SANlock)
    so that a machine that has problems will disconnect itself from the
    cluster by rebooting rather than causing further problems.
    
    
Last Updated on 26-Aug-2019 by
    Paul Crawford
    Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
    Email psc(at)sat(dot)dundee(dot)ac(dot)uk
    Absolutely no warranty, use this information at your own risk.