Linux Watchdog - General Tests
The default state of the configuration file
/etc/watchdog.conf after installing
the daemon is to do nothing. By default the daemon will not
try opening the watchdog module via /dev/watchdog, nor will it run
any specific tests. This is safe behaviour, but fairly useless.
The first step in protecting the host against a fault is to edit the
configuration file so it actually uses the watchdog module (assuming
you have that loaded). Once that is done, the daemon periodically
'feeds' the watchdog, and should that feeding stop, the
system will perform a hard reboot. This now provides protection
against kernel faults (assuming it is a hardware watchdog and not
the 'softdog' module that runs in the kernel, of course!).
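
As a minimal sketch of that first step, the fragment below enables the watchdog device in /etc/watchdog.conf. The interval value is an assumption chosen for illustration; check the watchdog.conf man page for the defaults of your version, and make sure the interval is shorter than the timeout of the hardware or module in use.

```conf
# /etc/watchdog.conf (illustrative fragment, not a complete file)

# Open the watchdog device so that a stalled daemon leads to a hard reboot.
watchdog-device = /dev/watchdog

# Seconds between feeds. This must be comfortably shorter than the
# device's own timeout. The value 1 is an assumed example.
interval = 1
```
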
However, this still provides only limited protection, since a
computer can get into a very unusable state while still running the
already-loaded daemon, and unless the daemon stops the periodic
refresh it is unlikely that any automated recovery will begin.
Therefore to get the maximum protective value from the watchdog
system, the daemon should be configured to run some additional tests
that hopefully reveal the usability of the machine. Deciding on what
to test is not so simple, because machines can break in more ways
than easily imagined, and it is also important not to make the
watchdog too sensitive and thus suffer unnecessary reboots.
The tests take three general forms:
- Are fundamental resources available?
- Are the expected processes or signs of activity seen?
- Is it possible to run stuff?
Some of the tests that may be useful are fairly generic to a Linux
computer, indeed to any computer. One built-in test of the watchdog
is "are file handles still available?". Similarly, there is the
option (though not as simple as it sounds) to test for insufficient memory.
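
A hedged sketch of the memory test: the min-memory option sets a floor on free virtual memory, expressed in memory pages rather than bytes. The value below is an arbitrary placeholder, not a recommendation; a sensible threshold depends on the machine's RAM, swap and workload. The file-handle test is built in, so no option is shown for it here.

```conf
# Trigger recovery if free virtual memory falls below this many pages.
# Units are pages, not bytes. The value 1 is a placeholder only.
min-memory = 1
```
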
In the next category, one of the "expected
processes" you might choose to test (and the example in the
configuration file) is the rsyslog daemon. It should be running, and
normally init will re-spawn it if it fails, so the loss of that
process is a clear sign that something is going very wrong with the
machine. For custom computers there may be other
application-specific processes to monitor.
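
The process check can be sketched with the pidfile option, which names a file holding the PID of a process that must exist. The path below follows the rsyslog example mentioned above, but it is an assumption; the actual location of the PID file varies between distributions.

```conf
# Trigger recovery if the process recorded in this PID file is gone.
# The path is distribution-dependent (assumed here).
pidfile = /var/run/rsyslogd.pid
```

Several pidfile lines can be given, one per process, for any application-specific daemons.
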
However, do not use the watchdog as the primary method of restarting
failed daemons; that is something that init should be configured to
do. The watchdog's job in this case is the second phase, where a
persistently failing daemon is no longer re-spawned by init, and the
conclusion is that something else is wrong and that a reboot may be the
best way to recover the system's operational state.
The third category covers the "can I run things?" tests. Of course,
this requires various fundamental computer resources and ultimately
is the most important question of all. If you can't run a new
process, then the machine is very sick. One obvious test here is
the system's load averages, an indicator of the queue of processes waiting for
execution. While a stupidly high load average tells you little about
what is wrong, it is a clear indicator that something
is wrong (maybe even malicious like a fork-bomb).
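
The load-average test might be configured along the following lines. The thresholds are illustrative values only, not recommendations: a limit that is absurd on a single-core machine may be routine on a many-core server, so they should be scaled to the hardware and set well above any load seen in normal operation.

```conf
# Reboot if the 1, 5 or 15 minute load average exceeds these limits.
# Illustrative values only; scale them to the number of CPU cores.
max-load-1 = 24
max-load-5 = 18
max-load-15 = 12
```
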
Another test option is to use the watchdog's ability to run an external test
program/script. Even if such a script does nothing but 'exit
0' the fact it could be run tells the watchdog that some resources
are still available, and perhaps that the bash command interpreter
is still healthy.
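
A sketch of the external test option follows. The script path and name are hypothetical, and the script is assumed to be executable; its whole content is shown in the comments. The timeout value is likewise an example.

```conf
# Hypothetical script; its entire content could be:
#   #!/bin/bash
#   exit 0
test-binary = /usr/local/sbin/wd-test.sh

# Seconds the test may run before it is treated as failed.
test-timeout = 60
```

A non-zero exit code from the script is treated as a failed test, so a more elaborate script can report application-specific faults the same way.
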
As well as testing the computer's internal
operation, the watchdog can also probe network connections. This
sort of test needs to be used with care, because external faults
(such as a network switch being rebooted or disconnected) can
trigger a reboot that will do nothing to fix the fault. However, for
some cases (e.g. when a network interface is prone to locking up due
to bad hardware and/or driver code) this may be a useful approach.
The same sort of network-link approach is sometimes used in
clustered systems (e.g. SANlock)
so that a machine that has problems will disconnect itself from the
cluster by rebooting rather than causing further problems.
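
For the network probing described above, a minimal sketch might look like this. The address is a documentation placeholder and the interface name is an assumption; both must be replaced with values for the actual network, ideally a target whose absence really does mean this machine is at fault.

```conf
# Ping this address each cycle; no reply counts as a failure.
# 192.0.2.1 is a placeholder, not a real target.
ping = 192.0.2.1

# Check that this interface is still receiving traffic.
# The interface name is an assumption.
interface = eth0
```
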
Last Updated on 26-Aug-2019 by
Paul Crawford
Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.