Top Five Open Source Packages for System Administrators
by Æleen Frisch, author of Essential System Administration, 3rd Edition
12/05/2002
This is the fourth installment of a five-part series in which I introduce my current list of the most useful and widely applicable open source administrative tools. In general, these tools can make your job easier no matter what Unix operating system your computers run.
The second place in my top five tools list goes to Nagios, written by Ethan Galstad. Nagios is a feature-rich network monitoring package. Its displays provide current information about system or resource status across an entire network. It can also be configured to send alerts and perform other actions when problems are detected. This week, we'll look at the sort of monitoring that Nagios provides and also briefly discuss configuring the package.
Note: Nagios was formerly known as Netsaint. Netsaint configuration files are compatible with Nagios, although Nagios has adopted a new, simpler syntax. You can also convert Netsaint configuration files with the included convertcfg utility.
In This Series
Number Five: Amanda
Number Four: LDAP
Number Three: GRUB
Nagios monitors a wide variety of system properties, including system performance metrics such as load average and free disk space; the presence of important services like HTTP and SMTP; and per-host network availability and reachability. It also allows the system administrator to define what constitutes a significant event on each host--for example, how high a load average is "too high"--and what to do when such conditions are detected.
In addition to detecting problems with hosts and their important services, Nagios also allows the system administrator to specify what should be done as a result. A problem can trigger an alert to be sent to a designated recipient via various communication mechanisms (such as email, Unix message, pager). It is also possible to define an event handler: a program that is run when a problem is detected. Such programs can attempt to solve the problem encountered, and they can also proactively prevent some serious problems when they get triggered by warning conditions.
The information that Nagios collects is displayed in a series of automatically generated Web pages. This format is quite convenient in that it allows a system administrator to view network status information from various points throughout the network.
Figure 1 illustrates the top-level Nagios display, known as the "Tactical Overview."

Figure 1. Nagios Tactical Overview display
The narrow column on the left of the display lists links to all of the possible Nagios displays (the one for the current display has been highlighted in the illustration). The Tactical Overview shows very general statistics about the overall network status. In this case, 20 hosts are being monitored, and 16 are currently up. Three hosts are down, and one is unreachable from the monitoring system, presumably because the gateway to it is down. Of the problems on the three hosts that are down, one has been acknowledged by a system administrator. The display also indicates that there are three services that have "critical" status (probably indicating a failure), and two others are in a "warning" state.
Each of the problem indicator displays also functions as a link to another Web page giving details about that particular item.
Figure 2 illustrates a Nagios Status Overview display. The three sections display summary status information about the hosts being monitored (upper left), services being monitored (upper right), and a further status breakdown by host group (lower portion of the boxed section of the figure). Once again, each item contains links to more detailed views of its current information. In this case, the hosts that are being monitored have been configured into four groups for Nagios reporting purposes. Three of the groups contain hosts in the same physical location within the company, and the final group, Printers, contains network printers that are being monitored. The system administrator is free to group hosts and devices in ways that make sense for her needs.

Figure 2. Nagios Status Overview and details for the Printers group
The display at the bottom of Figure 2 shows the most important
part of the detailed display that results when one clicks on the Printers link
in the upper display. It lists each printer separately, along with its device
status and services status. In this example, at the moment, one of the four
printers is down (the printer named ingres).
Figure 3 illustrates the detailed display that can be obtained for
an individual host (or device). Here we see some detailed information about a
host named leah. Once again, there are several sections to the display.
The host name and IP address appear in the upper left of the display, along with
an icon that the system administrator has assigned to this host. Here, the icon
suggests that the system's operating system is some version of Windows;
conventionally, icons are keyed to the operating system type. The table in the
upper right gives some overall uptime and reachability statistics about the host
over the period that the current monitoring session has been running.

Figure 3. Detailed Host Status information about host leah
The table below the operating system icon, titled "Host State Information," provides information about the current status of the host, including whether or not it is up, how long it has been in that state, when it was last checked, the command used to perform the check, and the settings of various configuration parameters (such as host notifications and event handler).
The box titled "Host Commands" contains a series of links, which allow the system administrator to perform many different monitoring-related actions on this host. The various items are described in Table 1. Examining the list will give you further details about Nagios' capabilities.
Table 1. Available actions in the Nagios Host Information display
| Item | Meaning |
| Disable checks of this host | Stop monitoring this host for availability. |
| Acknowledge this host problem | Respond to a current problem (discussed below). |
| Disable notifications for this host | Don't send alerts if this host is unavailable. |
| Delay next host notification | Delay the next alert for host unavailability. |
| Schedule downtime for this host. Cancel scheduled downtime for this host. | Define or cancel scheduled downtime. During downtime, host unavailability is not considered a problem. |
| Disable notifications for all services on this host. Enable notifications for all services on this host. | Don't/do send alerts if a service on this host fails. |
| Schedule an immediate check of all services on this host | Check all services as soon as possible (rather than waiting for their next scheduled time). |
| Disable checks of all services on this host. Enable checks of all services on this host. | Disable or enable checking service health on this host. |
| Disable event handler for this host | Prevent the event handler from running when a problem is detected on this host. |
| Disable flap detection for this host | Don't try to detect flaps (rapid up-down or on-off oscillations) on this host or its services. |
The second menu item allows you to acknowledge any current problem. Acknowledging simply means "I know about the problem, and it is being handled." Nagios marks the corresponding event as such, and future alerts are suppressed until the item returns to its normal state. This process also allows you to enter a comment explaining the situation, an action that is helpful when more than one administrator regularly examines the monitoring data.
If you don't like all of these table-oriented status displays, Nagios also has the capability to use graphical ones. For example, Figure 4 illustrates a map created for the small network being monitored here. The map is laid out to indicate three separate groups of hosts, with host taurus serving as a gateway between the group at the upper left and the ones at the bottom of the window.

Figure 4. A Nagios map
Much more complex network topologies can be represented in an analogous way. See the Nagios Web site for example screen shots.
Initially, configuring Nagios can seem daunting, and there is a fair amount of startup overhead to getting things going. But keep in mind that:
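Nagios uses the following configuration files:
nagios.cfg: The main configuration file, holding directives that apply to the entire monitoring system
Object configuration files: Definitions of the hosts, services, contacts, commands, and other items to be monitored or used
cgi.cfg: Settings for the Web displays, including authentication
resource.cfg: Site-specific macro definitions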
The package provides sample starter versions of all of these files. We will consider some aspects of these file types in the remainder of this article.
Nagios configuration files are generally stored in /usr/local/nagios/etc.
The main configuration file, nagios.cfg, contains directives that apply to the entire Nagios monitoring system. Here is an annotated sample version illustrating some of its most important features:
The first part of the configuration file specifies various file locations, including the general log file, files holding service check command and notification and event handler command definitions (
These directives specify logging settings, including how often logs are rotated (here, daily), the archive directory for old files, whether to log significant problems to syslog as well, and whether to log individual event types.
# Global settings
nagios_user=nagios
nagios_group=nagios
date_format=us
admin_email=nagadmin
admin_pager=19995551212
These lines specify various global settings, including the user/group as which the nagios daemon runs, the output format for dates (here, US style), and the administrator's email address. The final item sets the value of the
Settings related to event handlers. You can optionally define a single event handler for all host failures and service failures in this file if appropriate. Commands are defined in an object configuration file.
These directives control the maximum number of checks that can be made at the same time (0 means an unlimited number), as well as time-outs for various types of commands (values in seconds).
These lines tell Nagios to retain information about host and service status between sessions, saving the values every 60 seconds, and reloading them when the facility starts up.
These directives enable "passive checks": status data produced by external commands which Nagios imports periodically.
These directives allow you to save Nagios data externally for long term analysis or other purposes. The commands specified here must be defined in some object configuration file. The simplest such command simply writes the command's output to an external file: e.g., echo
Note that the directives appear in a slightly different order in the sample nagios.cfg file provided with the package.
The bulk of Nagios configuration occurs in the object configuration files. These files define hosts and services to be monitored, how various status conditions should be interpreted, and what actions should be taken when they occur. These files are used to define the following items:
Hosts: Computers and other network devices
Host Groups: Named groups of hosts
Services: Important daemons providing specific network services
Contacts: Users to be contacted in the event of a problem
Contact Groups: Named groups of contacts
Time Periods: Day and/or time ranges within a week, used to specify when checks are to be performed, notifications are to be sent, and the like
Commands: Commands to be run for all purposes (host/service checking, notifications, event handling, and so on). Nagios provides two files containing many predefined commands: checkcommands.cfg and misccommands.cfg.
Host Dependencies: Specifications of host reachability dependencies. When an intermediate host is down, checks are skipped for all hosts that are dependent on that one.
Service Dependencies: Specifications of service dependency requirements. When a service is down, checks are skipped for all other services that are dependent on it.
Host Escalations: Definitions of optional escalation levels for host problems
Host Group Escalations: Definitions of optional escalation levels for host groups
Service Escalations: Definitions of optional escalation levels for failed services
The first seven items (hosts through commands) will need to be defined for virtually every Nagios installation; the dependency and escalation items are optional. In the sample Nagios configuration provided with the package, each type of object is defined in a separate configuration file (named after the object type, excluding any spaces). However, you can arrange your definitions in any form that makes sense to you.
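To illustrate the idea behind dependencies, here is a minimal Python sketch that works out which hosts would have their checks skipped when something they depend on is down. It is not Nagios code: the is_blocked function, the depends_on dictionary, and the particular dependency relationships among the hosts are all assumptions made up for the example.

```python
# Illustrative only: models "skip checks for hosts that depend on a down host".
# is_blocked() and the depends_on layout are hypothetical, not part of Nagios.

def is_blocked(host, depends_on, down, seen=frozenset()):
    """True if any host that `host` depends on, directly or indirectly, is down."""
    for upstream in depends_on.get(host, []):
        if upstream in seen:
            continue  # guard against circular definitions
        if upstream in down or is_blocked(upstream, depends_on, down, seen | {host}):
            return True
    return False

# Hypothetical layout: leah and beulah are reached through the gateway taurus.
depends_on = {"leah": ["taurus"], "beulah": ["taurus"], "taurus": []}
down = {"taurus"}

skipped = sorted(h for h in depends_on if is_blocked(h, depends_on, down))
print(skipped)  # hosts whose checks would be skipped while taurus is down
```

The gateway itself is not in the result: it is down, not blocked, so it is still checked and reported as a problem in its own right.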
All of these items are defined via templates: named sets of attributes and settings that can be easily applied to any number of actual objects. For example, here is a template definition for hosts:
define host{
; Template name
name normal
; This is only a template (not a real host)
register 0
; Host notifications are enabled
notifications_enabled 1
; Command to check if host is available
check_command check-host-alive
; Recheck failures this many times
max_check_attempts
; Repeat failure notifications every 2 hours
notification_interval 120
; When to send notifications (time period name)
notification_period 24x7
; Notify when down, unreachable and on recovery
notification_options d,u,r
; Host event handler is enabled
event_handler_enabled 1
; Event handler command (defined elsewhere)
event_handler host-eh
; Flap detection is disabled
flap_detection_enabled 0
; Save performance data
process_perf_data 1
; Save status information across restarts
retain_status_information 1
}
This template defines a variety of host-monitoring settings (which are explained in the comments following the semicolons). Here is a host definition that uses this template:
define host{
; Template on which to base host
use normal
; Note the attribute is not "name" as above
host_name beulah
; Longer description
alias beulah: SuSE 8.1
; IP address
address 192.168.1.44
; Overrides template value
max_check_attempts 8
}
Other hosts may be defined in a similar way. Host definitions themselves can
also be used as templates, provided that a name attribute is included.
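To make the template mechanism concrete, here is a minimal Python sketch of how a definition that names a template with use picks up that template's settings and then overrides some of them. The resolve function and the dictionary layout are illustrative assumptions, not Nagios code, and the template's max_check_attempts value of 3 is a placeholder chosen only for this example.

```python
# Illustrative only: models "use" inheritance with plain dictionaries.
# resolve() is a hypothetical helper, not part of Nagios.

def resolve(definition, templates):
    """Return the effective settings for a definition, following 'use' links."""
    parent_name = definition.get("use")
    if parent_name is None:
        effective = {}
    else:
        # Settings from the template (and any template it uses) come first...
        effective = resolve(templates[parent_name], templates)
    # ...then the definition's own attributes override them.
    for key, value in definition.items():
        if key not in ("use", "name", "register"):
            effective[key] = value
    return effective

templates = {
    "normal": {
        "name": "normal",
        "register": "0",
        "check_command": "check-host-alive",
        "max_check_attempts": "3",  # placeholder value for this sketch
        "notification_interval": "120",
    },
}
beulah = {
    "use": "normal",
    "host_name": "beulah",
    "address": "192.168.1.44",
    "max_check_attempts": "8",
}

effective = resolve(beulah, templates)
print(effective["max_check_attempts"])  # the host's own value wins
print(effective["check_command"])       # inherited from the template
```

The sketch deliberately does not pass name and register down to the host, on the assumption that these attributes describe the template itself rather than the objects built from it.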
Once hosts have been defined, they may be placed into host groups via directives like this one:
define hostgroup{
hostgroup_name bldg2
alias Building 2
contact_groups admins1
members beulah,callisto,ariadne,leah,lovelace,valley
}
This definition creates the host group named bldg2, consisting of six
hosts (all previously defined via define host directives). The
contact_groups attribute specifies who to send notifications to, and it
is defined elsewhere (as we'll see).
You can use as many host groups as you want to. Hosts can be part of multiple host groups, and host groups themselves may be nested.
Here are two service templates and a service definition:
define service{ ; Define defaults for all services
name generic
register 0
; Check service every 30 minutes
normal_check_interval 30
; Retry failing checks every 3 minutes, up to 5 times
retry_check_interval 3
max_check_attempts 5
event_handler_enabled 1
check_period 24x7
; Repeat notifications for failures every 2 hours
notification_interval 120
notification_period 6to22
; Notify contacts about critical failures/recoveries
notification_options c,r
notifications_enabled 1
contact_groups admins
}
define service{ ; Define the SMTP service
use generic
name generic-SMTP
register 0
service_description Check SMTP
check_command check_smtp
event_handler eh_smtp
contact_groups mailadmins
}
define service{ ; Define services to be monitored
use generic-SMTP
; Monitor SMTP for all hosts in this host group
host_groups mailhosts
}
The first template (generic) defines some settings that can be applied to a variety of service types. The second template, generic-SMTP, uses the first template as a starting point and adds to it in order to create a generic SMTP monitoring service. Specifically, it defines a check command, an event handler, and a contact group that are appropriate for the SMTP service. The final define service stanza sets up SMTP monitoring for all of the hosts in the mailhosts host group.
Here are two stanzas defining a contact and a contact group:
define contact{
contact_name nagadmin
alias Nagios Admin
; When to notify about service problems
service_notification_period 6to22
; When to notify about host problems
host_notification_period 24x7
; Notify on critical problems and recoveries
service_notification_options c,r
; Notify on host down and recoveries
host_notification_options d,r
service_notification_commands notify-by-email
host_notification_commands host-notify-by-epager
email nagios-admins@ahania.com
pager $ADMINPAGER$
}
define contactgroup{
contactgroup_name mailadmins
alias Mail Admins
members mailadm,chavez,catfemme
}
The first stanza defines a contact named nagadmin. It also defines
what events to notify this contact about and the time periods during which
notifications should be sent. The commands to use to generate the alerts are
also specified, along with arguments to them (see below).
Time period definitions are quite simple. Here are the definitions of the two time periods we have used so far:
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
define timeperiod{
timeperiod_name 6to22
alias Weekdays, 6 AM to 10 PM
monday 06:00-22:00
tuesday 06:00-22:00
wednesday 06:00-22:00
thursday 06:00-22:00
friday 06:00-22:00
}
Note that only the applicable days need be included in the definition.
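As a rough illustration of how such a definition can be interpreted, the following Python sketch tests whether a given moment falls inside a time period. It is a simplification under stated assumptions: the in_period and to_minutes helpers are hypothetical, and each day is assumed to carry a single start-end range in the HH:MM-HH:MM form used above.

```python
from datetime import datetime

# Illustrative only: in_period() and to_minutes() are hypothetical helpers.
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

def to_minutes(hhmm):
    """Convert 'HH:MM' (including '24:00') to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def in_period(period, moment):
    """True if `moment` falls inside the range listed for its weekday."""
    day = DAYS[moment.weekday()]  # datetime.weekday(): Monday is 0
    time_range = period.get(day)
    if time_range is None:
        return False  # day not listed, so it is outside the period
    start, end = time_range.split("-")
    now = moment.hour * 60 + moment.minute
    return to_minutes(start) <= now < to_minutes(end)

# The 6to22 period: weekdays only, 6 AM to 10 PM.
six_to_22 = {day: "06:00-22:00" for day in DAYS[:5]}

print(in_period(six_to_22, datetime(2002, 12, 5, 9, 30)))  # a Thursday morning
print(in_period(six_to_22, datetime(2002, 12, 7, 9, 30)))  # a Saturday morning
```

Because Saturday and Sunday are simply absent from the 6to22 definition, the lookup finds no range for them and the moment is treated as outside the period.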
The commands referred to in many of the preceding object definitions also must be defined. For example, here is the SMTP service check command definition:
define command{
command_name check_smtp
command_line $USER1$/check_smtp -H $HOSTADDRESS$
}
This command runs the check_smtp script stored in the directory
defined in the macro $USER1$ (defined in the resource.cfg file--see
below); this macro conventionally holds the path to the Nagios plug-ins
directory. The command is passed the option -H, followed by the IP address
of the host to be checked (the latter is expanded from the built-in
$HOSTADDRESS$ macro).
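The macro substitution described here can be sketched in a few lines of Python. This only illustrates the idea and is not how Nagios itself is implemented: the expand_macros function is hypothetical, and the macro values are taken from examples elsewhere in this article (the plug-ins path from resource.cfg and the address of the host beulah).

```python
import re

# Illustrative only: expand_macros() is a hypothetical helper, not Nagios code.

def expand_macros(command_line, macros):
    """Replace each $NAME$ in command_line with its value from macros."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown macros untouched so they are easy to spot.
        return macros.get(name, match.group(0))
    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)

macros = {
    "USER1": "/usr/lib/nagiosplugins",  # plug-ins directory, as in resource.cfg
    "HOSTADDRESS": "192.168.1.44",      # address of the host being checked
}

expanded = expand_macros("$USER1$/check_smtp -H $HOSTADDRESS$", macros)
print(expanded)
```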
You can determine the syntax for any plug-in by running it with the
--help option. You can also extend Nagios by adding custom plug-ins of
your own. See the documentation for details on how to accomplish this.
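As a starting point, here is a minimal sketch of what a custom plug-in might look like in Python. It relies on the usual plug-in convention of reporting status through the exit code (0 for OK, 1 for warning, 2 for critical, 3 for unknown) and printing a one-line status message. The plug-in itself, a load average check taking warning and critical thresholds as its two arguments, is a hypothetical example and not one of the plug-ins shipped with Nagios.

```python
import os
import sys

# Conventional plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def evaluate(load, warn, crit):
    """Map a load average onto an exit code and a one-line status message."""
    if load >= crit:
        return CRITICAL, f"LOAD CRITICAL - load average {load:.2f}"
    if load >= warn:
        return WARNING, f"LOAD WARNING - load average {load:.2f}"
    return OK, f"LOAD OK - load average {load:.2f}"

def main(argv):
    """Hypothetical usage: check_load_sketch WARN CRIT"""
    try:
        warn, crit = float(argv[1]), float(argv[2])
        load = os.getloadavg()[0]  # 1-minute load average
    except (IndexError, ValueError, OSError):
        print("LOAD UNKNOWN - usage: check_load_sketch WARN CRIT")
        return UNKNOWN
    code, message = evaluate(load, warn, crit)
    print(message)
    return code

# A real plug-in script would finish with: sys.exit(main(sys.argv))
print(evaluate(0.42, 2.0, 5.0))
```

Keeping the threshold logic in evaluate, separate from the code that gathers the measurement, makes the plug-in easy to test without touching the system being monitored.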
Event handlers are defined in the same way, as in this example:
define command{
command_name eh_smtp
command_line /usr/local/nagios/eh/fix_mail $HOSTADDRESS$ $STATETYPE$
}
Here, we define the command named eh_smtp. It specifies the full path
to a program to run, passing two arguments: the host's IP address and the value
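of the $STATETYPE$ macro. This item is set to SOFT while Nagios is still
rechecking a problem (fewer than max_check_attempts failures so far) and to
HARD once the retries are exhausted and the problem is confirmed.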
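The fix_mail program itself is site-specific and not shown here. Purely as an illustration of how such a handler might use its two arguments, the following Python sketch decides what to do based on the state type. The plan_action function and the actions it describes are assumptions invented for this example; the sketch only reports what it would do and does not touch any mail service.

```python
import sys

# Illustrative only: a stand-in for a handler such as fix_mail.
# plan_action() and the actions it names are hypothetical.

def plan_action(host_address, state_type):
    """Describe what the handler would do for this host and state type."""
    if state_type == "HARD":
        return f"restart mail service on {host_address}"
    if state_type == "SOFT":
        return f"log only: SOFT problem on {host_address}"
    return f"ignore unexpected state type {state_type!r}"

def main(argv):
    """Arguments arrive in the order given on command_line above."""
    if len(argv) != 3:
        print("usage: fix_mail_sketch HOSTADDRESS STATETYPE")
        return 1
    print(plan_action(argv[1], argv[2]))
    return 0

example = plan_action("192.168.1.44", "HARD")
print(example)
```

Acting only on HARD states is a conservative choice for a handler that restarts something; a handler meant to head off trouble early could instead act on SOFT states as well.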
Here are the definitions of commands used for notifications (we've wrapped
the command_line setting for clarity):
define command{
command_name notify-by-email
command_line /usr/bin/printf "%b" "***** Nagios 1.0 *****\n\n
Notification Type: $NOTIFICATIONTYPE$\n\n
Service: $SERVICEDESC$\n
Host: $HOSTALIAS$\n
Address: $HOSTADDRESS$\n
State: $SERVICESTATE$\n\n
Date/Time: $DATETIME$\n\n
Additional Info:\n\n$OUTPUT$" |
/usr/bin/mail -s "** $NOTIFICATIONTYPE$
alert - $HOSTALIAS$/$SERVICEDESC$
is $SERVICESTATE$ **" $CONTACTEMAIL$
}
This command constructs a simple email message using the printf
command and many built-in Nagios macros. It then sends the message using the
mail command, specifying the recipient as the $CONTACTEMAIL$
macro. The latter contains the value of the email attribute of the
contact being notified.
The cgi.cfg configuration file has several different functions within the Nagios system. Among the most important is authentication, allowing access to Nagios and its data to be restricted to appropriate people. Here are some sample directives related to authorization:
use_authentication=1
authorized_for_configuration_information=netsaintadmin,root,chavez
authorized_for_all_services=netsaintadmin,root,chavez,maresca
The first entry enables the access control mechanism. The next two entries
specify users who are allowed to view Nagios configuration information and
services status information (respectively). Note that all users also must be
authenticated to the Web server using the usual Apache htpasswd
mechanism.
This same configuration file is also used to store settings for icon-based status displays, as in these examples:
hostextinfo[janine]=;redhat.gif;;redhat.gd2;;168,36;,,;
hostextinfo[ishtar]=;apple.gif;;apple.gd2;;125,36;,,;
These entries specify extended attributes for the hosts defined in the entries labeled janine and ishtar. The filenames in this example specify image files for the host in status tables (GIF format--see Figure 3) and in the status map (GD2 format), and the two numeric values specify the device's location--that is, its x and y coordinates--within the 2D status map. (Figure 4 provides an example status map display.)
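To show how the semicolon-separated fields line up, here is a small Python sketch that pulls out the three pieces of information described above. The parse_hostextinfo function and the dictionary keys it returns are hypothetical names chosen for this example, and the field positions are inferred from the two sample entries rather than taken from a format specification; the remaining fields, which are empty in these samples, are ignored.

```python
# Illustrative only: parse_hostextinfo() and its key names are hypothetical.
# Field positions are inferred from the sample entries in this article.

def parse_hostextinfo(line):
    """Extract the host label, image files, and 2D map coordinates."""
    key, _, value = line.partition("=")
    host = key[len("hostextinfo["):-1]  # text between the brackets
    fields = value.split(";")
    x, y = fields[5].split(",")
    return {
        "host": host,
        "status_table_image": fields[1],  # GIF shown in status tables
        "status_map_image": fields[3],    # GD2 image used in the status map
        "map_x": int(x),
        "map_y": int(y),
    }

info = parse_hostextinfo("hostextinfo[janine]=;redhat.gif;;redhat.gd2;;168,36;,,;")
print(info)
```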
The final configuration file we will consider is the resource.cfg
file. It is used to define site-specific macros, conventionally named
$USER1$ through $USER32$:
# $USER1$ = path to plugins directory
$USER1$=/usr/lib/nagiosplugins
...
# Store a username and password (hidden)
$USER3$=administrator
$USER4$=somepassword
The first macro defines the path to the Nagios plug-ins directory; this usage is assumed by the supplied sample configuration files.
The other two macros are used in this case to store a username and password. These items can be used in command definitions for added security. The resource.cfg file itself can be protected against all non-root access without compromising the ability of CGI programs to run successfully.
Since Nagios configuration is somewhat involved, the package provides a command that can be used to verify it prior to running the program. Here is an example of its use:
# cd /usr/local/nagios/etc
# /usr/local/nagios/bin/nagios -v nagios.cfg
This will check the Nagios configuration, which uses nagios.cfg as its main configuration file.
For more information about Nagios, including installation instructions, and how to initiate and manage monitoring, consult the following sources:
Æleen Frisch has been a system administrator for over 20 years, tending a plethora of VMS, Unix, Macintosh, and Windows systems. If you liked this article and would like to receive the free ESA3 newsletter, you can sign up at http://www.aeleen.com/esa3_news.htm.
O'Reilly & Associates recently released (August 2002) Essential System Administration, 3rd Edition.
Copyright © 2009 O'Reilly Media, Inc.