Variable Name |
Variable Type |
Function / Description |
admin |
string |
This is the email user name
of the person to be notified when the system is rebooting.
The default is "root". Assumes the sendmail program is
installed and configured correctly. |
allocatable-memory |
integer |
This is similar to the older min-memory configuration, but actively tests for a given number of allocatable memory pages (typically 4kB/page on x86 hardware). Zero to disable test. |
change |
integer |
Time limit (in seconds) for a
specified file time-stamp to age. Must come after the
corresponding 'file' entry. |
file |
string/R |
The path/name of a file to be
checked for existence and (if 'change' is given) for age. |
heartbeat-file |
string |
Name of the file for
diagnostic heartbeat writes. A time_t value (in ASCII) is
recorded on each write to the watchdog device. |
heartbeat-stamps |
integer |
Number of entries in debug
heartbeat file. |
interface |
string/R |
Name of interface (such as
eth0) in /proc/net/dev to check for incoming (RX) bytes. |
interval |
integer |
Time interval (seconds)
between polling for system health. Default is 1, but should
not be more than [watchdog timeout]-2 seconds. |
log-dir |
string |
Path for watchdog log
directory where the heartbeat file is usually kept, and
where the files for re-directing test/repair scripts are
kept. Default is /var/log/watchdog |
logtick |
integer |
Number of polling intervals
between periodic "verbose" status messages. Default is 1
(i.e. every poll event). |
max-load-1 |
integer |
Limit on the 1-minute
load-average before a reboot is triggered. Set to zero to
ignore this test. |
max-load-5 |
integer |
Limit on the 5-minute load-average before a reboot is triggered. Set to zero to ignore this test. |
max-load-15 |
integer |
Limit on the 15-minute load-average before a reboot is triggered. Set to zero to ignore this test. |
max-temperature |
integer |
Temperature limit (in Celsius)
before shut-down. |
min-memory |
integer |
Minimum number of memory
pages (typically 4kB/page on x86 hardware). Zero to disable
test. |
pidfile |
string/R |
Path/name of a PID file
related to a daemon to be monitored. |
ping |
string/R |
The IP address of a target
for ICMP "ping" test. Must be in numeric IPv4 format such as
192.168.1.1 |
ping-count |
integer |
Number of ping attempts per
polling interval. Must be >= 1; the default is 3 (hence,
with a 1 second polling interval, the ping delay must be less
than 333ms). |
priority |
integer |
The scheduling priority used
with a call to the sched_setscheduler()
function to configure the round-robin (SCHED_RR) priority
for real-time use (only applicable if 'realtime' is true). |
realtime |
yes/no |
This flag is used to tell the
watchdog daemon to lock its memory against paging out, and
also to permit real-time scheduling. It is strongly recommended to
do this! |
repair-binary |
string |
The path/name of a program
(or bash script, etc) that is used to make a repair on
failed tests (other than auto-loaded V1 test scripts). |
repair-maximum |
integer |
Number of repair attempts on
one "object" without success before giving up and rebooting.
Default is 1, and setting this to zero will allow any number
of repair attempts. |
repair-timeout |
integer |
Time limit (seconds) for the
repair action. Default is 60 and beyond this a reboot is
initiated. |
retry-timeout |
integer |
Time limit (seconds) from the
first failure on a given "object" until it is deemed bad and
a repair attempted (if possible, otherwise a reboot is the
action). Default is 60 seconds. |
sigterm-delay |
integer |
Time between the SIGTERM signal being sent to
all processes and the following SIGKILL signal. Default is 5
seconds, range 2-300. |
temperature-device |
string |
(deprecated) This was used
in V5.13 and below for the old /dev/temperature style of
device. From V5.15, temperature-sensor
is used instead and the old style is no longer supported. |
temperature-poweroff |
yes/no |
This flag decides if the
system should power-off on overheating (default = yes), or
perform a system halt and wait for Ctrl-Alt-Del reactivation
(the "no" case). |
temperature-sensor |
string/R |
Name of the file-like device that holds temperature as an ASCII string in milli-Celsius, typically generated by the lm-sensors package. |
test-binary |
string/R |
The path/name of a V0 test
program (or bash script, etc) used to extend the watchdog's
range of health tests. NOTE: The V0 test binary should be considered 'deprecated' and used for backward compatibility only; the V1 test/repair script mode of operation should be used whenever possible. |
test-directory |
string |
The path name of the
directory for auto-loaded V1 test/repair scripts. The default
is test-directory=/etc/watchdog.d; this ability can be disabled completely by setting it to an empty string (test-directory=). If the directory is not present, it is ignored in any case. |
test-timeout |
integer |
Time limit (seconds) for any
test scripts. Default is 60. This can be set to zero to disable the time-out; in that case, however, a hung program will never be acted upon, though all other tests will continue normally. |
verbose |
yes/no |
Provides basic control of the
verbosity of the status messages. Previously this was only
possible via the -v / --verbose command line options. |
watchdog-device |
string |
The name of the device for
the watchdog hardware. Default is /dev/watchdog. If this is not given (or is disabled by setting it to an empty string), the watchdog daemon can still function, but it will not be effective, as any internal watchdog fault or kernel panic will be unrecoverable. |
watchdog-timeout |
integer |
The timeout to set the
watchdog device to. Default is 60 seconds and it is not
recommended to change this without good reason. Not all
watchdog hardware supports configuration, or configuration
to second resolution, etc. |
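The timing limits quoted in the table (interval against watchdog-timeout, and ping-count against interval) are plain arithmetic. The Python sketch below only illustrates that arithmetic using the defaults listed above; it is not the daemon's own code, and both function names are hypothetical.

```python
def max_safe_interval(watchdog_timeout: int) -> int:
    """Largest polling interval (seconds) the table recommends: timeout - 2."""
    return watchdog_timeout - 2


def ping_budget_ms(interval: int, ping_count: int) -> float:
    """Time available per ping attempt (ms) if every attempt must fit in one poll."""
    if ping_count < 1:
        raise ValueError("ping-count must be >= 1")
    return interval * 1000.0 / ping_count


if __name__ == "__main__":
    # Defaults from the table: watchdog-timeout = 60, interval = 1, ping-count = 3.
    print(max_safe_interval(60))        # 58
    print(round(ping_budget_ms(1, 3)))  # 333
```

Raising interval or lowering ping-count widens the per-ping budget, at the cost of slower fault detection.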
Since the new lm-sensors style of monitoring provides files in milli-Celsius, the watchdog now always works in Celsius, and the maximum temperature is set using the max-temperature configuration option.
Warning: There is currently a bug/feature whereby the order of loading the temperature sensor modules determines the abstracted names (e.g. the first module loaded becomes /sys/class/hwmon/hwmon0 and the second /sys/class/hwmon/hwmon1, etc.). If using the abstracted paths (e.g. /sys/class/hwmon/hwmon0) rather than the device paths (e.g. /sys/devices/platform/w83627ehf.2576), make sure you blacklist any modules that are automatically loaded by adding a suitable entry to one of the files in /etc/modprobe.d/, and then add all modules for temperature sensing to /etc/modules, as that appears to force deterministic enumeration.
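To make the milli-Celsius convention concrete, the following Python sketch converts the ASCII contents of a sensor file to Celsius and compares the result with a limit. It is a hypothetical helper rather than the daemon's implementation: the function names, the sample reading, the limit of 90 and the strict "greater than" comparison are all assumptions made for illustration.

```python
def millicelsius_to_celsius(raw: str) -> float:
    """Convert the ASCII contents of a sensor file (milli-Celsius) to Celsius."""
    return int(raw.strip()) / 1000.0


def is_overheated(raw: str, max_temperature: int) -> bool:
    """True when the reading is above the limit (the '>' test is an assumption)."""
    return millicelsius_to_celsius(raw) > max_temperature


if __name__ == "__main__":
    sample = "45000\n"  # made-up reading, in the form such a file might hold
    print(millicelsius_to_celsius(sample))  # 45.0
    print(is_overheated(sample, 90))        # False
```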
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
In simple terms, a load average above 1 per CPU indicates that tasks are being held up due to a lack of resources, either CPU time or I/O delays. This is not a problem if it only happens at peak times of the day and/or only by a modest amount (say 1-2 times the number of CPUs).
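The per-CPU rule of thumb can be checked with a few lines of Python. This is a rough sketch for exploring sensible max-load-* values, not part of the watchdog itself; the helper names are hypothetical, and os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Normalise a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be >= 1")
    return load_average / cpu_count


def is_held_up(load_average: float, cpu_count: int) -> bool:
    """True when the load exceeds 1 per CPU, i.e. tasks are waiting on resources."""
    return load_per_cpu(load_average, cpu_count) > 1.0


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1
    for label, value in (("1-min", one_min), ("5-min", five_min), ("15-min", fifteen_min)):
        print(f"{label}: {load_per_cpu(value, cpus):.2f} per CPU")
```

A reboot threshold would normally sit well above the point where is_held_up() first returns True, since modest or short-lived overload is described above as harmless.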