Linux Watchdog Daemon - Test/Repair Scripts



Functionality beyond the watchdog's built-in tests can easily be added by means of an external program (or script). Before you implement an extension to the watchdog think very hard about why you are doing it, and be sure it is not something already covered by the watchdog daemon's internal tests. Those are listed in the configuration page.

Even more so than the built-in tests, these must be designed for safe operation. Test them very thoroughly from the command prompt first, then with the daemon, and only after both sorts of test should you consider adding them to the system configuration!

As pointed out by Zygo Blaxell, the simple fact that you can run a script, any script, is a very good indicator of machine health.

Safety & Security

Extensions to the watchdog run as root, and so must be designed to be safe and not to be easily hijacked to provide either a back-door to the system, or to damage its operation (e.g. delete files, behave so badly it acts as "denial of service"). Some simple starting points:
Don't abuse the watchdog: stick to simple "is the system broken?" stuff. It is not intended to run complex monitoring actions or do general periodic stuff; use Nagios and cron for that sort of thing. Most of the above is common sense for normal administrative work, but it pays to double-check anything that has such privileges as the watchdog daemon.

Program Complexity

Don't try to be too clever. Keep it simple and testable, as a broken watchdog daemon can be far more trouble than an unsupervised system! Here are some starting points:

Modes of Operation

There are two versions, or more precisely two modes of operation; see the watchdog test/repair script section of the configuration page.

Version 0 Operation

The original watchdog supported the option for a single test binary (i.e. program or script) to implement any extra tests, and another independent option to run a repair binary to handle all error actions.

With the updates to V6.0 it then became possible to have multiple V0 test binaries, but still only one "general" repair binary. In addition, certain error cases are now treated as unrepairable and the repair binary is not called for those cases.

The V0 test binary is configured by explicitly listing its path & name in the configuration file, and it is called without any command line options. Basically, it is called as:

/somepath/testprog

The V0 tests are expected to return an error code that is zero for all OK, and non-zero for any errors. However, this error code is not a completely free choice, as some values are treated specially. If the test program has multiple tests then it makes some sense to return specific error codes to indicate the cause of the problem.

NOTE: The V0 test binary should be considered 'deprecated' and used for backward compatibility only, and the V1 test/repair script mode of operation used whenever possible. By doing so the V0 repair binary (see below) only has to support the watchdog built-in tests (ping, file status, etc) and not any test binary.

The V0 repair binary is used to handle all errors except for those from V1 test binaries (see below), not just ones originating with the V0 test binary. Thus it has to be written to deal with the whole range of possible problems. Basically, if you implement a V0 repair binary then you must test the command line arguments and only attempt a repair for the conditions you understand.

The V0 repair script is called with the error code as argv[1] and the name of the test (if any) as argv[2]. For example an "operation not permitted" error (errno = 1) on something without an "object" name:

/somepath/repairprog 1

An "access denied" error (errno = 13) for reading file /var/run/somefile.pid would result in this call:

/somepath/repairprog 13 /var/run/somefile.pid

So you should return zero only if:
If the repair script returns non-zero then the machine is rebooted (or shut down, depending on the value). In general, any failure that has no object name is probably unrecoverable anyway.

If you do not know what to do, the best option is to exit with the input error code value (i.e. just pass it through) and allow a reboot to fix things.
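The pass-through advice can be sketched as a small bash script. This is only an illustration under stated assumptions: the one "known" condition (errno 13 on /var/run/somefile.pid) is borrowed from the example call above, the fix applied to it (restoring owner read permission) is purely hypothetical, and the names repair and KNOWN_OBJECT are invented for this sketch. Everything that is not recognised is passed straight back to the daemon.

```bash
#!/bin/bash
# Sketch of a V0 repair binary. The daemon calls it as:
#   repairprog <error code> [<object name>]

# The single condition this sketch claims to understand (an assumption):
# EACCES (13) on one specific, fixed file. Never act on arbitrary paths.
KNOWN_OBJECT="${KNOWN_OBJECT:-/var/run/somefile.pid}"
EACCES=13

repair() {
    local code="${1:-}" object="${2:-}"

    # A missing or non-numeric code means we were not called as expected.
    case "$code" in
        ''|*[!0-9]*) return 1 ;;
    esac

    if [ "$code" -eq "$EACCES" ] && [ "$object" = "$KNOWN_OBJECT" ]; then
        # Hypothetical fix: restore owner read permission on the known file.
        if chmod u+r -- "$object"; then
            return 0
        fi
    fi

    # Unknown condition, or the fix failed: pass the code through
    # and let the daemon deal with it.
    return "$code"
}

# Dispatch only when called with arguments, so the function can also be
# sourced and exercised by hand from the command prompt.
if [ "$#" -ge 1 ]; then
    repair "$@"
    exit $?
fi
```

Comparing the object against a fixed, known path (rather than acting on whatever argv[2] contains) is a deliberate choice here, since the script runs as root.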

Version 1 Operation

Later versions of the watchdog added a new way of running test and repair scripts, intended to make it simpler and easier to repair only the things you know about. In this mode of operation there is a test directory (default is /etc/watchdog.d/) and any executable file in it is automatically added to the daemon's list at start-up.

Each program is assumed to be both the test and repair action rolled into one, and it is called with the command line option 'test' for the test action and 'repair' for the matching repair action (if possible). For example:

/etc/watchdog.d/example.sh test

/etc/watchdog.d/example.sh repair 13 /etc/watchdog.d/example.sh

In this case, the argv[3] value is the full path & name of the test that was executed (the "object" that caused the error) and of course it is only called in response to its own error return.

So when writing a V1 test binary/script you can normally assume the errors will be relevant, but of course you probably need to know what generated the given code in order to repair it. Again, if no repair is sensible then on the "repair" action simply return a non-zero value (e.g. the original error code) and the machine will be rebooted.
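A minimal V1 script might look something like the sketch below. The thing being checked (a spool directory at /var/spool/example), the SPOOL_DIR variable and the function names are all assumptions made up for illustration; the only parts taken from the description above are the 'test' and 'repair' arguments and the idea of returning the original error code when no repair is sensible.

```bash
#!/bin/bash
# Sketch of a V1 test/repair script, e.g. /etc/watchdog.d/example.sh
# Called as:  example.sh test
#        or:  example.sh repair <error code> <path of this script>

# Hypothetical object to check: a directory some application needs.
SPOOL_DIR="${SPOOL_DIR:-/var/spool/example}"
ENOENT=2    # "No such file or directory", not one of the special codes

run_test() {
    [ -d "$SPOOL_DIR" ] && return 0
    return "$ENOENT"
}

run_repair() {
    local code="${1:-1}"

    # Treat a missing or non-numeric code as a plain error.
    case "$code" in
        ''|*[!0-9]*) code=1 ;;
    esac

    # Only repair the one failure this script itself can report.
    if [ "$code" -eq "$ENOENT" ] && mkdir -p -- "$SPOOL_DIR"; then
        return 0
    fi

    # No sensible repair: hand the original code back to the daemon.
    return "$code"
}

case "${1:-}" in
    test)   run_test; exit $? ;;
    repair) run_repair "${2:-}"; exit $? ;;
esac
```

Because the directory can be overridden through the environment, the script can be tried from the command prompt against a scratch location before it goes anywhere near /etc/watchdog.d/.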

You can mix V0 and V1 binary operations, since the V1 operation is only for the auto-loaded executable files, and any V0 repair binary will then be used for all errors other than your V1 tests. However, you should be aware that all test scripts are executed essentially "in parallel" so you must not assume unique/protected access to anything.
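One common way to avoid clashing with other scripts running at the same time is to never use a fixed temporary file name. The fragment below is a sketch of that idea only; the wd_example prefix and the new_scratch helper are invented names, and the date command simply stands in for whatever output a real test would capture.

```bash
#!/bin/bash
# Each invocation gets its own scratch file, so two scripts (or two
# overlapping runs of the same script) never share a fixed temp path.
new_scratch() {
    mktemp "${TMPDIR:-/tmp}/wd_example.XXXXXX"
}

scratch=$(new_scratch) || exit 1

# Remove the scratch file however the script ends.
trap 'rm -f -- "$scratch"' EXIT

# Placeholder for real work: capture some output privately.
date > "$scratch"
```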

Return Codes

A return value (i.e. exit code) of zero is considered "OK" and no further action is needed.

A process can only return an 8-bit number, which is normally treated as unsigned, so the original negative watchdog values need special treatment. The codes have therefore been changed to the equivalent unsigned 8-bit values: old code will still work, but new bash scripts can use the positive values and not have problems with function returns, etc.
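The relationship between the old negative values and the unsigned ones is just arithmetic modulo 256. As a quick illustration (to_exit_code is a made-up helper name, not part of the watchdog):

```bash
#!/bin/bash
# Map a (possibly negative) code to the unsigned 8-bit value that a
# process actually returns to its parent.
to_exit_code() {
    echo $(( ($1 % 256 + 256) % 256 ))
}

to_exit_code -1     # prints 255
to_exit_code -2     # prints 254
to_exit_code -11    # prints 245
to_exit_code 13     # prints 13
```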

Any non-zero exit value is considered as an error. However, not all codes are treated the same so it is very important to consider what you return to the watchdog daemon on an error condition in order to have it handled in the way you want or expect.

The following watchdog-specific codes (from include/watch_err.h) and Linux system error codes (from /usr/include/asm-generic/errno-base.h) are treated specially, with the actions shown:

Mnemonic    Value       Action      Description
EREBOOT     255 (-1)    Reboot      Unconditional reboot requested.
ERESET      254 (-2)    Reset       Unconditional reset requested. In this case it sends SIGSTOP to everything, sync's the file system and uses the watchdog hardware to reset the machine (then attempts the Linux reboot call if the hardware fails to do this).
EMAXLOAD    253 (-3)    Reboot      Load averages are too high.
ETOOHOT     252 (-4)    Power off   Too hot, power off (or halt) the computer in an orderly manner.
EDONTKNOW   245 (-11)   Ignore      State unknown, so don't treat this as an error, but also don't reset the retry counter.
ENOMEM      12          Reboot      Out of memory.
ENFILE      23          Reboot      File table overflow.
EMFILE      24          Reboot      Too many open files.

NOTE: The watchdog treats any internal failure to fork the process as EREBOOT since something is seriously wrong!

The remaining codes are all treated as 'normal' errors and the retry timer permits them to occur occasionally. If there is a time-out and there is no successful repair action, then the machine performs an orderly reboot.

If you have some 3rd party program that returns error values that are unknown or not suitable (e.g. a minor failure is returning 255 = -1 which would be treated as an immediate reboot by the watchdog), then you can wrap them in a simple bash script to check the 3rd party return code and return either 0 or 1 accordingly.
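Such a wrapper could be as small as the sketch below, written here as a V1 script. The path in CHECK_CMD is a placeholder for the 3rd party program, not a real tool, and run_wrapped is an invented name. Any non-zero result from the wrapped program, including 255, is collapsed to 1 so that it is handled as a 'normal' error.

```bash
#!/bin/bash
# Sketch of a wrapper around a 3rd party checker whose exit codes are
# not suitable for the watchdog. CHECK_CMD is a hypothetical path.
CHECK_CMD="${CHECK_CMD:-/usr/local/bin/thirdparty-check}"

run_wrapped() {
    # Discard the program's output; only its success or failure matters.
    if "$CHECK_CMD" >/dev/null 2>&1; then
        return 0
    fi
    # Collapse every failure (255 included) to a plain error.
    return 1
}

case "${1:-}" in
    test)   run_wrapped; exit $? ;;
    # No repair is known for this check: pass the code straight back.
    repair) exit "${2:-1}" ;;
esac
```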


Examples

There are some example scripts to download associated with this page, and you should also refer to other general information on bash programming.

When you are writing a test script you should consider the following points:
When testing from the bash command prompt typically used with most Linux distributions you can print the exit value using the '$?' variable, for example:

./wd_sensors.sh ; echo $?

This will attempt to run the script 'wd_sensors.sh' in the current directory, and then print its exit value.

[to be done...]


Last Updated on 26-Aug-2019 by Paul Crawford
Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.