Linux Watchdog - General Tests
    
    
As installed, the configuration file /etc/watchdog.conf leaves the
    daemon doing nothing. By default the daemon will not try opening the
    watchdog module via /dev/watchdog, nor will it run any specific
    tests. This is safe behaviour, but fairly useless.
    
The first step in protecting the host against a fault is to edit the
    configuration file so it actually uses the watchdog module (assuming
    you have that loaded). The daemon will then periodically 'feed' the
    watchdog, and should that fail, the system will perform a hard
    reboot. This provides protection against kernel faults (assuming it
    is a hardware watchdog and not the 'softdog' module that runs in the
    kernel, of course!).
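As a sketch, the relevant lines in /etc/watchdog.conf might look like
    this (the option names are those documented in watchdog.conf(5); the
    interval value is illustrative and must be comfortably less than the
    hardware time-out, which is often 60 seconds):

```
# /etc/watchdog.conf -- arm the watchdog by naming the device node:
watchdog-device = /dev/watchdog

# How often (in seconds) the daemon refreshes the timer.
interval = 1
```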
    
However, this still provides only limited protection, since a
    computer can get into a very unusable state while still running the
    already-loaded daemon, and unless the daemon stops the periodic
    refresh it is unlikely that any automated recovery will begin.
    
    Therefore to get the maximum protective value from the watchdog
    system, the daemon should be configured to run some additional tests
    that hopefully reveal the usability of the machine. Deciding on what
    to test is not so simple, because machines can break in more ways
    than easily imagined, and it is also important not to make the
    watchdog too sensitive and thus suffer unnecessary reboots.
    
The general form of the tests is:
    
      - Are fundamental resources available?
 
      - Are the expected processes or signs of activity seen?
 
      - Is it possible to run stuff?
 
    
    Some of the tests that may be useful are fairly generic to a Linux
    computer, indeed to any computer. One built-in test of the watchdog
    is "are file handles still available?"; similarly, there is the
    option (though not as simple as it sounds) to test for insufficient memory.
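The memory test, for example, can be enabled with a line such as the
    following (the option is documented in watchdog.conf(5); note the
    value is in memory pages, not bytes, and the figure here is purely
    illustrative):

```
# Reboot if fewer than this many pages of memory appear to be
# allocatable. The value is in pages (typically 4 kB each).
min-memory = 2048
```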
    
    In the next category, one of the "expected
      processes" you might choose to test (and the example in the
    configuration file) is the rsyslog daemon. It should be running, and
    normally init will re-spawn it if it fails, so the loss of that
    process is a clear sign that something is going very wrong with the
    machine. For custom computers there may be other
    application-specific processes to monitor.
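A process is monitored by naming its PID file, as in the commented-out
    example shipped in the configuration file (the exact path may differ
    between distributions):

```
# Reboot if the process owning this PID file dies and stays dead.
pidfile = /var/run/rsyslogd.pid
```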
    
    However, do not use the watchdog as the primary method of restarting
    failed daemons; that is something init should be configured to
    do. The watchdog's job in this case is the second phase, where a
    persistently failing daemon is no longer re-spawned by init, and the
    conclusion is that something else is wrong and that a reboot may be
    the best way to recover the system's operational state.
    
    The third category is the "can I run things?" tests. Of course,
    this requires various fundamental computer resources and ultimately
    is the most important question of all. If you can't run a new
    process, then the machine is very sick. One obvious test here is
    the system's load
      averages, an indicator of the queue of processes for
    execution. While a stupidly high load average tells you little about
    what is wrong, it is a clear indicator that something
    is wrong (maybe even malicious like a fork-bomb).
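The load-average limits are set per averaging window; the thresholds
    below are the ones suggested in the example configuration shipped
    with the daemon (values under about 2 risk spurious reboots):

```
# Reboot when the 1/5/15-minute load averages exceed these limits.
max-load-1  = 24
max-load-5  = 18
max-load-15 = 12
```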
    
    Another test option is to use the watchdog's ability to run an external test
      program/script. Even if such a script does nothing but 'exit
    0', the fact that it could be run tells the watchdog that some
    resources are still available, and perhaps that the command
    interpreter is still healthy.
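Such a script might look like the following minimal sketch (the script
    name and the specific check are purely illustrative, not part of the
    watchdog package; exit status 0 means healthy, and any non-zero
    status is reported to the daemon as a fault):

```shell
#!/bin/sh
# Hypothetical test script for the watchdog's "test-binary" option.

wd_probe() {
    # Example site-specific check: can we still create (and remove)
    # a scratch file in /tmp?
    probe="/tmp/.wd-probe.$$"
    touch "$probe" 2>/dev/null || return 1
    rm -f "$probe"
    return 0
}

wd_probe    # the script's exit status is the probe's result
```

It would then be wired in with a configuration line such as
`test-binary = /usr/local/sbin/wd-probe.sh` (path illustrative).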
    
As well as testing the computer's internal
    operation, the watchdog can also probe network connections. This
    sort of test needs to be used with care, since external faults
    (such as a network switch being rebooted or disconnected) can then
    trigger a reboot that will do nothing to fix the fault. However, for
    some cases (e.g. when a network interface is prone to locking up due
    to bad hardware and/or driver code) this may be a useful approach.
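The network checks use configuration lines such as these (both options
    are documented in watchdog.conf(5); the address and interface name
    here are illustrative):

```
# Reboot if this address stops answering pings (use with care!).
ping = 192.168.1.1

# Reboot if this interface stops receiving traffic.
interface = eth0
```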
    
    The same sort of network-link approach is sometimes used in
    clustered systems (e.g. SANlock)
    so that a machine that has problems will disconnect itself from the
    cluster by rebooting rather than causing further problems.
    
    
Last Updated on 26-Aug-2019 by
    Paul Crawford
    Copyright (c) 2014-19 by Paul S. Crawford. All rights reserved.
    Email psc(at)sat(dot)dundee(dot)ac(dot)uk
    Absolutely no warranty, use this information at your own risk.