Linux Watchdog Daemon - Test/Repair Scripts

Functionality beyond the watchdog's built-in tests can easily be added by means of an external program (or script). Before you implement an extension to the watchdog think very hard about why you are doing it, and be sure it is not something already covered by the watchdog daemon's internal tests. Those are listed in the configuration page.

Even more so than the built-in tests, these extensions must be designed for safe operation. Test them very thoroughly from the command prompt first, then test them with the daemon, and only after both sorts of test should you consider adding them to the system configuration!

As pointed out by Zygo Blaxell, the simple fact that you can run a script, any script, is a very good indicator of machine health.

Safety & Security

Extensions to the watchdog run as root, and so must be designed to be safe and not easily hijacked, either to provide a back-door to the system or to damage its operation (e.g. delete files, or behave so badly that it acts as a "denial of service"). Some simple starting points:

All scripts, programs & configuration data must be only writeable by root so ordinary accounts cannot modify them.
Every directory in the tree that holds those scripts, programs & configuration data must also be writeable only by root (e.g. "chown root:root somedir && chmod go-w somedir"). Otherwise the files can be renamed or deleted by an ordinary user account.
Any script or file that has sensitive data (e.g. hard-coded password) must only be readable by root.
Any programs called from inside a script must be treated the same way (as they will generally run as root as well). Standard system programs are normally safe, as root-only write access is their default permission, but if you use a custom program then after testing make sure you change its permissions as above.
Don't assume where the current working directory is (it is normally '/' but maybe not for testing), and do not assume the $PATH variable will have anything but the minimal settings. Ideally use the absolute path to any program that could cause problems.
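
The ownership and permission rules above can be checked mechanically. The following is a minimal sketch and not part of the watchdog package: the function names are hypothetical, and it assumes the GNU coreutils programs stat, realpath and dirname can be found on the minimal PATH that it sets for itself.

```shell
#!/bin/bash
# Sketch: check that a file, and every directory above it, is owned by
# root and is not writeable by group or others.
# Assumes GNU coreutils (stat -c, realpath, dirname).

# Do not rely on the inherited PATH.
PATH=/usr/sbin:/usr/bin:/sbin:/bin

# Return 0 if the octal mode (e.g. 755) has no group/other write bit.
mode_is_safe() {
    local mode="$1"
    (( (8#$mode & 8#022) == 0 ))
}

# Return 0 if the path is owned by uid 0 and its mode is safe.
path_is_safe() {
    local path="$1" owner mode
    owner=$(stat -c '%u' -- "$path") || return 1
    mode=$(stat -c '%a' -- "$path") || return 1
    [ "$owner" -eq 0 ] && mode_is_safe "$mode"
}

# Walk from the file up to '/', reporting the first unsafe component.
tree_is_safe() {
    local path
    path=$(realpath -- "$1") || return 1
    while :; do
        if ! path_is_safe "$path"; then
            echo "unsafe: $path" >&2
            return 1
        fi
        [ "$path" = "/" ] && return 0
        path=$(dirname -- "$path")
    done
}
```

For example, 'tree_is_safe /etc/watchdog.d/example.sh' would print the first path component that an ordinary account could modify, and return non-zero.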

Don't abuse the watchdog: stick to simple "is the system broken?" checks. It is not intended to run complex monitoring actions or general periodic jobs; use Nagios & cron for that sort of thing. Most of the above is common sense for normal administrative work, but it pays to double-check anything that has such privileges as the watchdog daemon.

Program Complexity

Don't try to be too clever. Keep it simple and testable, as a broken watchdog daemon can be far more trouble than an unsupervised system! Here are some starting points:

A set of smaller and more easily tested scripts/programs is preferable to a large complex one. The watchdog daemon can run several in parallel, so you should normally use that ability.
Don't assume only one copy will run at a time! Implement locks and unique temporary file names (e.g. using the PID in the name) if necessary, as there could be more than one copy of a program/script running at once. The watchdog tries not to do this, but never assume - check.
Don't start asynchronous (i.e. background) child processes which could persist after the parent has exited, and don't use setsid to detach a process from its parent - both will interfere with the watchdog's ability to deal with hung/timed-out processes.
Test for normal operation, and try to simulate and test all the failure case(s) as well as you can before moving on.
Test the program thoroughly from the command line before you even consider running it with a daemon under test, and do both before you configure the live daemon (starting from reboot) to use it.
Document it! In a few years' time you, or someone else, will be faced with modifying or fixing something, and then it really pays to have notes about what it was intended to do, how it should go about that job, and any "gotcha!" aspects that are not obvious from the code and purpose.
In general, don't duplicate stuff that already exists (e.g. watchdog internal tests, existing Linux programs), try to use them to achieve the desired check.
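
The advice on locks and unique temporary file names can be sketched as follows. This is only an illustration under stated assumptions: LOCK_DIR, the "wd_example" names and the helper functions are hypothetical, and treating "another copy holds the lock" as OK is a design choice you may not want.

```shell
#!/bin/bash
# Sketch: a lock plus a unique temporary file for a test script.
# LOCK_DIR and the "wd_example" names are hypothetical.

LOCK_DIR="${LOCK_DIR:-/run/wd_example.lock}"

# mkdir either creates the directory or fails, atomically, so only one
# caller can hold the lock at a time.
acquire_lock() {
    mkdir -- "$LOCK_DIR" 2>/dev/null
}

release_lock() {
    rmdir -- "$LOCK_DIR" 2>/dev/null
}

# mktemp guarantees a unique name; the PID ($$) in the template just
# makes it easy to see which process owns the file.
make_temp() {
    mktemp "${TMPDIR:-/tmp}/wd_example.$$.XXXXXX"
}

# Run a command in the foreground while holding the lock.
# Returns the command's status, or 0 if another copy holds the lock.
with_lock() {
    acquire_lock || return 0
    local rc=0
    "$@" || rc=$?
    release_lock
    return "$rc"
}
```

Note that the command is run in the foreground, with no background children. A lock left behind by a killed process would make with_lock skip the check every time, so a real script needs some policy for stale locks.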

Modes of Operation

There are two versions, or more precisely two modes of operation; see the watchdog test/repair script section of the configuration page.

Version 0 Operation

The original watchdog supported the option for a single test binary (i.e. program or script) to implement any extra tests, and another independent option to run a repair binary to handle all error actions.

With the updates to V6.0 it then became possible to have multiple V0 test binaries, but still only one "general" repair binary. In addition, certain error cases are now treated as unrepairable and the repair binary is not called for those cases.

The V0 test binary is configured by explicitly listing its path & name in the configuration file, and it is called without any command line options. Basically they are called as:

/somepath/testprog

The V0 tests are expected to return an error code that is zero for all OK, and non-zero for any errors. However, this error code is not a completely free choice, as some values are treated specially. If the test program has multiple tests then it makes some sense to return specific error codes to indicate the cause of the problem.
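
As an illustration, a V0 test program with two checks might look like the sketch below. The data directory, the function name and the codes 100 and 101 are hypothetical; the codes were simply picked to avoid the values that the watchdog treats specially.

```shell
#!/bin/bash
# Sketch of a V0 test program with two checks and distinct error codes.
# DATA_DIR and the codes 100/101 are hypothetical choices.

DATA_DIR="${DATA_DIR:-/var/lib/myapp}"
ERR_NO_DIR=100
ERR_NOT_WRITEABLE=101

run_tests() {
    local dir="$1" probe

    # Check 1: the directory must exist.
    [ -d "$dir" ] || return "$ERR_NO_DIR"

    # Check 2: we must be able to create (and remove) a file in it.
    probe=$(mktemp "$dir/.wd_probe.XXXXXX" 2>/dev/null) \
        || return "$ERR_NOT_WRITEABLE"
    rm -f -- "$probe"

    return 0
}

# When installed as a stand-alone program the file would end with:
#   run_tests "$DATA_DIR"
#   exit $?
```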

NOTE: The V0 test binary should be considered 'deprecated' and used for backward compatibility only, and the V1 test/repair script mode of operation used whenever possible. By doing so the V0 repair binary (see below) only has to support the watchdog built-in tests (ping, file status, etc) and not any test binary.

The V0 repair binary is used to handle all errors except for V1 test binaries (see below), not just ones originating with the V0 test binary. Thus it has to be written to deal with the range of possible problems. Basically, if you implement a V0 repair binary then you must test the command line arguments and only attempt a repair for the conditions you understand.

The V0 repair script is called with the error code as argv[1] and the name of the test (if any) as argv[2]. For example an "operation not permitted" error (errno = 1) on something without an "object" name:

/somepath/repairprog 1

A "permission denied" error (errno = 13) for reading file /var/run/somefile.pid would result in this call:

/somepath/repairprog 13 /var/run/somefile.pid

So you should return zero only if:

(A) the error code and object are things you know about and can fix, and;
(B) the fix was successful.

If the repair script returns non-zero then the machine is rebooted (or shut down, depending on the value). In general, any failure that has no object name is probably unrecoverable anyway.

If you do not know what to do, the best option is to exit with the input error code value (i.e. just pass it through) and allow a reboot to fix things.
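
The rules above (repair only what you recognise, otherwise pass the code through) can be sketched as follows. This is not the watchdog's own repair program: the function name, KNOWN_PIDFILE and the chmod "fix" are hypothetical stand-ins for whatever repair makes sense on your system.

```shell
#!/bin/bash
# Sketch of a V0 repair program. The watchdog passes the error code as
# the first argument and the object name (if any) as the second.
# KNOWN_PIDFILE and the chmod "fix" are hypothetical examples.

KNOWN_PIDFILE="${KNOWN_PIDFILE:-/var/run/somefile.pid}"

repair() {
    local err="${1:-}" obj="${2:-}"

    # Refuse anything that is not a plain non-zero number.
    case "$err" in
        ''|*[!0-9]*|0) return 1 ;;
    esac

    case "$err:$obj" in
        "13:$KNOWN_PIDFILE")
            # Permission denied on a file we know about: try to fix it.
            chmod 644 -- "$obj" && return 0
            return "$err"
            ;;
        *)
            # Unknown code or object: pass the code through unchanged.
            return "$err"
            ;;
    esac
}

# When installed as a stand-alone program the file would end with:
#   repair "$@"
#   exit $?
```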

Version 1 Operation

Later versions of the watchdog added a new way of running test and repair scripts, intended to make it simpler and easier to repair only the things you know about. In this mode of operation there is a test directory (default is /etc/watchdog.d/) and any executable file in it is automatically loaded to the daemon's list at start-up.
Each program is assumed to be both the test and repair action rolled into one, and it is called with the command line option 'test' for the test action and 'repair' for the matching repair action (if possible). For example:

/etc/watchdog.d/example.sh test

/etc/watchdog.d/example.sh repair 13 /etc/watchdog.d/example.sh

In this case, the argv[3] value is the full path & name of the test that was executed (the "object" that caused the error) and of course it is only called in response to its own error return.

So when writing a V1 test binary/script you can normally assume the errors will be relevant, but of course you probably need to know what generated the given code in order to repair it. Again, if no repair is sensible then on the "repair" action simply return a non-zero value (e.g. the original error code) and the machine will be rebooted.
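
A V1 script therefore tends to be a small dispatcher on its first argument. The sketch below shows one possible shape; the heartbeat file, the error code 100 and the function names are hypothetical, and the 'touch' is only a stand-in for a real repair such as restarting a service.

```shell
#!/bin/bash
# Sketch of a V1 script: one file handles both 'test' and 'repair'.
# HEARTBEAT and error code 100 are hypothetical choices.

HEARTBEAT="${HEARTBEAT:-/run/myapp/heartbeat}"
ERR_NO_HEARTBEAT=100

do_test() {
    [ -e "$HEARTBEAT" ] || return "$ERR_NO_HEARTBEAT"
}

do_repair() {
    local err="$1"
    # Only attempt a repair for the code that do_test generates.
    if [ "$err" = "$ERR_NO_HEARTBEAT" ] \
        && touch -- "$HEARTBEAT" 2>/dev/null; then
        return 0
    fi
    # Anything else: pass the code through (bad input becomes 1).
    case "$err" in
        ''|*[!0-9]*|0) return 1 ;;
        *) return "$err" ;;
    esac
}

main() {
    case "${1:-}" in
        test)   do_test ;;
        repair) do_repair "${2:-}" ;;
        *)      echo "usage: $0 test | repair <code> <object>" >&2
                return 1 ;;
    esac
}

# When installed in the test directory the file would end with:
#   main "$@"
#   exit $?
```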

You can mix V0 and V1 binary operations, since the V1 operation is only for the auto-loaded executable files, and any V0 repair binary will then be used for all errors other than your V1 tests. However, you should be aware that all test scripts are executed essentially "in parallel" so you must not assume unique/protected access to anything.

Return Codes

A return value (i.e. exit code) of zero is considered "OK" and no further action is needed.

A process can only return an 8-bit number, which is normally treated as unsigned, so the original negative watchdog values need special treatment. The watchdog now uses the equivalent unsigned 8-bit values (e.g. -1 becomes 255), so old code will still work, while new bash scripts can use the positive values and avoid problems with function returns, etc.
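
The mapping from a negative value to its unsigned 8-bit equivalent is just a mask with 255. A minimal bash sketch (the function name to_exit_code is hypothetical):

```shell
#!/bin/bash
# Sketch: convert a signed value to the unsigned 8-bit exit code that
# a parent process would actually see.
to_exit_code() {
    echo $(( $1 & 255 ))
}

to_exit_code -1     # prints 255
to_exit_code -11    # prints 245
to_exit_code 13     # prints 13
```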

Any non-zero exit value is considered as an error. However, not all codes are treated the same so it is very important to consider what you return to the watchdog daemon on an error condition in order to have it handled in the way you want or expect.

The following watchdog-specific codes (from include/watch_err.h), and Linux system error codes (from /usr/include/asm-generic/errno-base.h), are treated as special actions:

Mnemonic	Value	Description	Action
EREBOOT	255 (-1)	Unconditional reboot requested.	Reboot
ERESET	254 (-2)	Unconditional reset requested. In this case it sends SIGSTOP to everything, syncs the file system and uses the watchdog hardware to reset the machine (then attempts the Linux reboot call if the hardware fails to do this).	Reset
EMAXLOAD	253 (-3)	Load averages are too high.	Reboot
ETOOHOT	252 (-4)	Too hot, power off (or halt) the computer in an orderly manner.	Power off
EDONTKNOW	245 (-11)	State unknown, so don't treat this as an error, but also don't reset the retry counter.	Ignore
ENOMEM	12	Out of memory.	Reboot
ENFILE	23	File table overflow.	Reboot
EMFILE	24	Too many open files.	Reboot

NOTE: The watchdog treats any internal failure to fork the process as EREBOOT since something is seriously wrong!

The remaining codes are all treated as 'normal' errors and the retry timer permits them to occur occasionally. If there is a time-out and there is no successful repair action, then the machine performs an orderly reboot.

If you have some 3rd party program that returns error values that are unknown or not suitable (e.g. a minor failure returns 255 = -1, which would be treated as an immediate reboot by the watchdog), then you can wrap it in a simple bash script that checks the 3rd party return code and returns either 0 or 1 accordingly.
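
Such a wrapper can be very small. The following sketch uses a hypothetical function name, wrap_status, and collapses every non-zero result into 1 so that no special code can reach the watchdog by accident:

```shell
#!/bin/bash
# Sketch: run a 3rd party program and reduce its exit code to 0 or 1.
wrap_status() {
    local rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        return 0
    fi
    # Log the original code, since it is lost to the caller.
    echo "wrapped program '$1' returned $rc" >&2
    return 1
}

# When installed as a stand-alone script the file would end with, for
# a hypothetical program /somepath/thirdparty:
#   wrap_status /somepath/thirdparty
#   exit $?
```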

Examples

There are some example scripts to download associated with this page, and you should also refer to other general information on bash programming.

When you are writing a test script you should consider the following points:

Keep it as simple as possible, but no simpler.
You can freely output diagnostic messages to stdout/stderr (for example the 'echo' command in bash) as they will be hidden when running as a daemon (they are redirected to files in /var/log/watchdog/ normally).
Try to be robust by checking that any planned programs are there & executable before attempting to use them.
If non-critical packages are not installed, for example lm-sensors, you should print a warning message but (probably) 'exit 0' so a reboot is not called.
Remember to explicitly end the script with an 'exit 0' statement if you have no errors to guarantee the correct return value.
Test it from the command prompt first.
Devise a way of testing it for the fault conditions you expect.
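
Two of the points above, checking that a program exists before using it and not failing when a non-critical package is missing, can be sketched as follows. The function names are hypothetical, and this sketch only reports whether the program ran successfully; it does not interpret its output.

```shell
#!/bin/bash
# Sketch: run a check only if the program it needs is installed.

# Return 0 if the named program can be found.
have_program() {
    command -v "$1" >/dev/null 2>&1
}

# Warn and return 0 if the program is missing; otherwise run it and
# return 0 on success or 1 on any failure.
run_if_installed() {
    local prog="$1" rc=0
    if ! have_program "$prog"; then
        echo "warning: '$prog' not installed, skipping this check"
        return 0
    fi
    "$@" >/dev/null || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "error: '$prog' failed with code $rc" >&2
        return 1
    fi
    return 0
}

# When installed as a stand-alone script the file would end with, for
# example:
#   run_if_installed sensors
#   exit $?
```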

When testing from the bash command prompt typically used with most Linux distributions you can print the exit value using the '$?' variable, for example:

./wd_sensors.sh ; echo $?

This will attempt to run the script 'wd_sensors.sh' in the current directory, then print its exit value.

[to be done...]

Last Updated on 26-Aug-2019 by Paul Crawford
Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.