Linux Watchdog Daemon - Overview

Introduction

A watchdog, in computing terms, is something, usually hardware-based, that monitors a complex system for "normal" behaviour and, if that behaviour stops, performs a system reset in the hope of recovering normal operation.

You can read more on this at the Wikipedia entry on WDT.

It is intended as a last resort for maintaining a system's availability and, at the very least, to ensure that the administrator can log in remotely to diagnose and fix non-persistent faults. Obviously it won't stop a hardware fault from breaking a system, nor is it any good against a persistent software problem, but for a system that is generally well behaved (and particularly if it is located at a remote site and/or is otherwise essential for operations) it serves to improve the overall availability of the system.

If your application cannot tolerate a short outage, then a watchdog alone is not going to solve it. You need to look at other high-availability solutions for hardware (e.g. RAID for disk error protection) and software (clustering and application mirroring) that will provide an acceptable degree of overall system availability.

With the Linux operating system there are two parts to the watchdog:

The actual hardware timer and kernel driver module that can force a hard reset, and;
The user-space background daemon that refreshes the timer and provides a wider range of health monitoring and recovery options.

Both can function independently, but clearly they are designed to operate together for maximum protection.

The Watchdog Module

Normally the hardware support for a watchdog is simply a timer that is set to some reasonable time-out and then periodically refreshed by the running software. If for any reason the software stops refreshing the hardware (and has not explicitly shut it down) then the timer times out and performs a hardware reset of the computer. In this way even faults such as a kernel panic can usually be recovered. Often the chip sets that provide system monitoring (temperature, supply voltages, fan speeds, etc.) have a watchdog timer, though one can never be sure whether the motherboard manufacturer will have used it!
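The refresh-or-reset behaviour described above can be illustrated with a toy model. This is only a sketch: the class name SimulatedWatchdogTimer and its methods are invented for illustration, and a real timer lives in hardware and pulls the reset line rather than returning a value.

```python
import time


class SimulatedWatchdogTimer:
    """Toy model of a hardware watchdog: a countdown that must be refreshed."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the model can be tested
        self._deadline = None    # None means the timer is stopped

    def start(self):
        self._deadline = self._clock() + self.timeout_s

    def refresh(self):
        # Each refresh pushes the deadline out by a full time-out period.
        if self._deadline is not None:
            self._deadline = self._clock() + self.timeout_s

    def stop(self):
        # Explicit shut-down: the timer will never fire.
        self._deadline = None

    def would_reset(self):
        # Real hardware would reset the machine here; the model just reports it.
        return self._deadline is not None and self._clock() >= self._deadline
```

As long as refresh() is called more often than the time-out, would_reset() stays false. If the refreshing software hangs, the deadline passes and the reset follows.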

In the context of the Linux operating system, the corresponding kernel device driver (module) provides a standard interface to the watchdog hardware as /dev/watchdog (checking that this exists is a simple test of whether the module is loaded). However, such a driver is not usually loaded by default, so you may have to configure your system manually to load it. Typically this is done by adding the module name to /etc/modules or (better still, so that it is loaded on demand) by editing /etc/default/watchdog and changing watchdog_module="none" to name the module.
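The "is the module loaded?" test can be scripted. The sketch below uses a hypothetical helper name, watchdog_device_present, and deliberately only stats the device node rather than opening it, on the assumption that opening the device is what arms the timer on most drivers.

```python
import os
import stat


def watchdog_device_present(path="/dev/watchdog"):
    """Return True if `path` exists and is a character device node."""
    try:
        mode = os.stat(path).st_mode  # stat only; the device is never opened
    except OSError:
        return False
    return stat.S_ISCHR(mode)
```

A False result suggests that no watchdog driver is loaded, or that the device node lives under a different name on your system.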

Linux also provides a software watchdog by means of the 'softdog' module. While this is better than nothing, it is far less effective than hardware! Basically, if the kernel fails, so does your means of recovery in this case.

The watchdog hardware + driver module provides the most basic of protection. It is started by anything that can periodically write to /dev/watchdog, and if that writing stops for any reason the watchdog hardware times out and the machine is rebooted by means of a hard reset.
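A minimal sketch of such a periodic writer follows. The function name feed_watchdog and its parameters are assumptions made for illustration, and the device path is passed in so the loop can be tried against an ordinary file. What happens when the device is closed depends on the driver and its configuration, so read the driver's documentation before pointing this at real hardware.

```python
import time


def feed_watchdog(device_path, interval_s, should_continue, sleep=time.sleep):
    """Write to the device every `interval_s` seconds while should_continue() is true.

    Returns the number of refreshes written.
    """
    count = 0
    # buffering=0 so that each write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        while should_continue():
            dev.write(b"\0")  # the write itself is the refresh
            count += 1
            sleep(interval_s)
    return count
```

The interval must be comfortably shorter than the hardware time-out, otherwise normal scheduling delays could trigger a reset on a healthy machine.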

However, a hard reset is something that is normally undesirable as it risks file system corruption, so it is much better if you can perform a clean reboot instead.

The Watchdog Daemon

To operate the watchdog device, there is normally a background daemon that can open the device and provide the periodic refresh activity. However, a machine can also get into a very unusable state without actually terminating the background daemon's operation. Therefore the watchdog daemon for Linux can be configured to periodically run a number of basic tests to verify that the machine looks OK.

On failing such tests (possibly with a certain amount of re-try behaviour to avoid being too "trigger happy") the daemon can reboot the machine in a moderately orderly manner in order to keep a log of why it happened, and hopefully avoid file system problems, etc. While doing so, it also has the "insurance" of the hardware timer so if it fails to reboot nicely, there is a hardware reset to follow that up.
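The re-try idea can be sketched as a counter of consecutive failures per test. This is not the daemon's actual implementation: run_health_checks, its arguments, and the "more than max_retries in a row" rule are assumptions chosen to illustrate the principle.

```python
def run_health_checks(checks, failures, max_retries):
    """Run each check once; return the names that have failed too many times in a row.

    `checks` maps a name to a zero-argument callable returning True when healthy.
    `failures` is a dict of consecutive-failure counts, updated in place.
    """
    tripped = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False  # a crashing test is treated as a failed test
        if healthy:
            failures[name] = 0  # any success clears the history
        else:
            failures[name] = failures.get(name, 0) + 1
            if failures[name] > max_retries:
                tripped.append(name)
    return tripped
```

Calling this once per polling interval means a single transient failure is forgiven, while a test that keeps failing eventually appears in the returned list and would justify a reboot.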

This "moderately orderly" shut down is not the normal init-based shut down approach where the proper sequence of shut-down scripts is executed, as that is very likely to fail in a number of the conditions for which watchdog action is needed (e.g. system out of memory, out of process table space, etc.).

So instead it takes the "blunderbuss approach" to stopping all processes: it signals everything with SIGTERM and then, after 5 seconds, with the non-ignorable SIGKILL. It then tries to update wtmp (so the shut down is recorded), update the random seed (to preserve entropy), sync the CMOS clock to system time (to help ensure the system time is reasonable on reboot), and finally sync and un-mount the file systems before it attempts a reset by means of the hardware timer (if that is possible).
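The SIGTERM-then-SIGKILL pattern can be tried safely on a single throwaway child process instead of the whole system. The sketch below assumes a POSIX system; terminate_then_kill and spawn_child are hypothetical helpers, and the daemon itself signals every process rather than one.

```python
import signal
import subprocess
import sys


def terminate_then_kill(proc, grace_s=5.0):
    """Send SIGTERM, wait up to `grace_s` seconds, then SIGKILL if still running.

    Returns the name of the signal that was needed to stop the process.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; it cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


def spawn_child(ignore_sigterm=False):
    """Start a sleeping child process to practise on."""
    code = "import signal, time\n"
    if ignore_sigterm:
        code += "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    code += "print('ready', flush=True)\ntime.sleep(60)\n"
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has set up its signal handling
    return proc
```

A well-behaved child exits on the first signal, while one that ignores SIGTERM is only stopped by the follow-up SIGKILL after the grace period.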

The hardware reset approach is preferred over the kernel's reboot API as the kernel stops the watchdog hardware on a normal shut-down or reboot, and thus could hang just after that point without any means of automatic recovery (e.g. a hung RAID card or similar).

There are in fact two daemons used for the watchdog hardware support:

'wd_keepalive' provides only the hardware driver open/refresh/close actions.
'watchdog' provides the driver open/refresh/close actions along with various other system checks.

When the system boots, it starts wd_keepalive as early as possible to protect against serious faults during booting; then, once other services are up, it switches to the full watchdog. The normal watchdog cannot be started early because some of the tests it could perform might depend on resources that start later in the chain (e.g. network file system, other daemons to monitor, etc.). Similarly, on shut-down the main watchdog is stopped early and wd_keepalive is started in its place to deal gracefully with the stopping of services that might be monitored.

Do I need a Watchdog?

From the introduction it can be seen that most systems that are used "interactively", like a home PC, don't really need it. Basically if it crashes while you are using it then you typically try Ctrl+Alt+Del (maybe also Ctrl+Alt+F1 to try text-mode login) and, if that fails, then simply push the reset button (or hold down the power button for 5 sec) to recover the machine.

Where the watchdog is most useful is situations like ours where you have hardware control computers running continuously or, more commonly, servers operating at remote sites. Both are situations where you may be sleeping or on holiday when it goes wrong and/or recovery involves a tiresome trip to the site. In such cases the last resort of an automated reboot is quite valuable.

Last Updated on 26-Aug-2019 by Paul Crawford
Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.