Eucalyptus and Nagios

by dnurmi

Production deployments of Eucalyptus, like production deployments of any infrastructure software running in a data center, need health and status monitoring, both so that the Eucalyptus/data-center administrator can stay on top of evolving resource situations and so that invaluable diagnostic information is on hand when something goes sideways within the resource pool.  Fortunately for all of us, there is a wide variety of health/status monitoring systems out there, and several of them are of extremely high quality, tried and tested, and available as part of major Linux distributions as pre-packaged open-source solutions.  One such system that I’m a personal fan of is Nagios.

To quote from their website:

“Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes.”

Indeed it is!  I first used Nagios in 2000/2001 to watch over a number of Linux servers, and have been extremely pleased to watch it evolve from a very useful tool at the outset into the fully featured, extensible IT infrastructure monitoring system it is today.  We use Nagios all over the place internally at Eucalyptus, and have recommended it to a number of Eucalyptus users as the monitoring tool to use for their Eucalyptus deployments.

Recently, I sat down with a fresh CentOS 6 based Eucalyptus installation and installed into the pool a basic Nagios system, plus a few Eucalyptus hooks.  In this posting, I’ll go through the process, which turned out to be extremely straight-forward and resulted in a powerful addition to any Eucalyptus deployment.

Step 1: Install Eucalyptus

I’ll omit a full description of how to install Eucalyptus, but a complete guide can be found here.

Step 2: Install Nagios

If you’ve installed Eucalyptus from packages, you have already added the repos that contain the Nagios packages.  On every server running a Eucalyptus component, run the following to install the Nagios remote test agent (NRPE) and the service check plugins:

# yum install nrpe nagios-plugins-all nagios-plugins-nrpe

Then, choose one server (I chose the machine running my Eucalyptus Cloud Controller), and install the Nagios package:

# yum install nagios

Nagios is now installed.

Step 3: Configure Nagios for Basic System Monitoring

There are a few steps required to get basic system monitoring going with Nagios, but they are very straight-forward.  A really nice property is that all ‘unique’ settings for a distributed monitoring installation are confined to the single Nagios server (i.e. the remote host configuration is identical everywhere, which makes it easy to get right and push out without having to maintain a unique config for each host).

On all hosts, push out the configuration file that allows the Nagios server to interact with the NRPE daemon.  This is done with the following settings:

edit /etc/nagios/nrpe.cfg
change 'allowed_hosts=' to 'allowed_hosts=<ip address of your nagios server>'
change the checks at the end to be a little more in line with the built in local check definitions
  command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
  command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
  command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
  command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 250 -c 400 -s RSZDT
  command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 20% -c 10%
push the file out to all hosts to /etc/nagios/nrpe.cfg
start the NRPE daemon with the command 'service nrpe start' on all hosts
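
The edit-and-push steps above can be sketched as a pair of small shell functions.  This is only a sketch under stated assumptions: the function names are hypothetical, root SSH access to each host is assumed, and push_nrpe_cfg only prints the scp/ssh commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helpers; names and the root@host login are assumptions.

set_allowed_hosts() {
    # $1 = path to an nrpe.cfg, $2 = IP address of the Nagios server.
    # Rewrites the allowed_hosts line via a temp file (portable, no sed -i).
    sed "s/^allowed_hosts=.*/allowed_hosts=$2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

push_nrpe_cfg() {
    # $1 = path to the edited nrpe.cfg, remaining args = hosts to receive it.
    # Only prints the commands; pipe the output to sh to actually run them.
    cfg=$1
    shift
    for host in "$@"; do
        echo "scp $cfg root@$host:/etc/nagios/nrpe.cfg"
        echo "ssh root@$host service nrpe start"
    done
}
```

For example, after editing a local copy with set_allowed_hosts, running push_nrpe_cfg with that copy and a list of host names prints two commands per host; the check definitions themselves still need to be edited by hand as described above.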

Next, on the Nagios server, we modify the configuration to allow the use of NRPE, and to read remote host config files from a local directory where we’ll store each Eucalyptus host’s unique configuration.

edit /etc/nagios/nagios.cfg
uncomment the line 'cfg_dir=/etc/nagios/servers' and save the file
create the /etc/nagios/servers directory
edit /etc/nagios/objects/commands.cfg
add the following to the end of the file, and save
  define command{
     command_name check_nrpe
     command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
  }
set the nagios admin password to 'nagios' by running 'htpasswd -bc /etc/nagios/passwd nagiosadmin nagios'
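
If you script this step, it helps to make the commands.cfg append safe to re-run.  A minimal sketch, assuming a hypothetical helper name (add_check_nrpe) and the same plugin path used above:

```shell
#!/bin/sh
# Hypothetical helper: append the check_nrpe command definition to a
# commands.cfg only if it is not already present.

add_check_nrpe() {
    # $1 = path to commands.cfg
    if grep -q 'command_name[[:space:]]*check_nrpe' "$1"; then
        echo "check_nrpe already defined in $1"
        return 0
    fi
    # Quoted 'EOF' keeps $HOSTADDRESS$ and $ARG1$ literal for Nagios.
    cat >> "$1" <<'EOF'

define command{
     command_name check_nrpe
     command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
EOF
}
```

After calling add_check_nrpe on /etc/nagios/objects/commands.cfg, running 'nagios -v /etc/nagios/nagios.cfg' performs Nagios’s own configuration check and is a cheap way to catch a missing brace or a typo before starting the service.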

Next, on the Nagios server, we set up one configuration file, per host, that defines the host and which checks to run on that host.  The files live in /etc/nagios/servers, and I’ve included an example as it is rather long.  The only modification one would need to make to use this is to modify the IP address defined in the ‘host’ section.  Finally, when all of your hosts have such a file in place, start up Nagios on the front-end with:

service httpd start
service nagios start
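
For readers without the linked example to hand, here is a rough sketch of what a per-host file in /etc/nagios/servers might contain, written as a small generator function.  It is not the file I used: the function name is hypothetical, and it assumes the stock ‘linux-server’ and ‘generic-service’ templates from the sample templates.cfg are loaded.

```shell
#!/bin/sh
# Hypothetical generator: prints one host definition plus one NRPE-backed
# service per basic check defined in nrpe.cfg.
# Assumes the stock 'linux-server' and 'generic-service' templates exist.

make_host_cfg() {
    # $1 = host name, $2 = IP address
    cat <<EOF
define host{
     use        linux-server
     host_name  $1
     alias      $1
     address    $2
}
EOF
    for check in check_users check_load check_disk check_procs check_swap; do
        cat <<EOF

define service{
     use                  generic-service
     host_name            $1
     service_description  $check
     check_command        check_nrpe!$check
}
EOF
    done
}
```

Redirecting the output of make_host_cfg for each host into its own .cfg file under /etc/nagios/servers gives one file per host, with only the name and address differing between them.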

Nagios should now be up and monitoring your environment with the basic checks that we’ve enabled for each host.  To use the Nagios UI, point a browser at your Nagios server (http://your.nagios.server.ip/nagios) and log in with user ‘nagiosadmin’ and whatever password you set above (in my example, it was ‘nagios’).  I found myself looking, at first, at the ‘hosts’ and ‘services’ displays, which show the status of all of the hosts/services that we just defined.  It takes a few minutes for the polling to get started, but you should see services moving from ‘PENDING’ to ‘OK’ (or ‘WARNING’ or ‘CRITICAL’) pretty quickly.

Step 4: Configure Nagios for Eucalyptus

If we stopped right here, we already have a simple to set up, invaluable tool for managing and maintaining a Eucalyptus deployment.  Knowing that networks are up/down, disks are free/full, load is low/high is all extremely useful and in most cases necessary information to have in hand when approaching any Eucalyptus problem.  However, since it was so simple to get this far, I decided to just go one step further and see if I could add a few Eucalyptus specific checks to my installation without writing any scripts or code.  I decided to use the built in logfile checker that comes with Nagios as a basic Eucalyptus service health/status monitor.  Here is how to do it.

edit /etc/nagios/nrpe.cfg
add the following check definitions
  # Eucalyptus checks
  command[check_cclog]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cc.log -O /tmp/nagioscc.log -q "ERROR|WARN"
  command[check_ccfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cc-fault.log -O /dev/null -q "ERR-"
  command[check_nclog]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/nc.log -O /tmp/nagiosnc.log -q "ERROR|WARN"
  command[check_ncfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/nc-fault.log -O /dev/null -q "ERR-"
  command[check_cloudlog]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cloud-output.log -O /tmp/nagioscloud.log -q "ERROR|WARN"
  command[check_cloudfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cloud-fault.log -O /dev/null -q "ERR-"
  command[check_walrusfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/walrus-fault.log -O /dev/null -q "ERR-"
  command[check_scfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/sc-fault.log -O /dev/null -q "ERR-"
save the file, and push it out to /etc/nagios/nrpe.cfg on all eucalyptus hosts

Next, in each server config in /etc/nagios/servers on the Nagios server machine, add the service definitions appropriate to that host (for example, add the cloud/walrus checkers to the machine running the Cloud Controller and/or Walrus, add the cluster controller (CC) checkers to the machine running the cluster controller, etc.).  Since they are long, I’ve linked to a CloudWalrusSC config, a CC config, and an NC config.  You’ll see that, in essence, we’ve just added to the regular resource checks a couple of extra Eucalyptus log file and Eucalyptus fault file checkers which run on the appropriate systems.  Finally, restart Nagios on the Nagios server and the NRPE daemon on all hosts, and check out the UI.
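
As a rough illustration of the extra service definitions (not the linked configs themselves), the sketch below prints the Eucalyptus checks for a host according to its role.  The function name and the role labels (clc, cc, nc) are hypothetical, and the stock ‘generic-service’ template is assumed; the check names match the nrpe.cfg entries above.

```shell
#!/bin/sh
# Hypothetical generator for the Eucalyptus log/fault service definitions.
# Role labels are an assumption: clc = Cloud Controller/Walrus/SC host,
# cc = Cluster Controller host, nc = Node Controller host.

euca_service_defs() {
    # $1 = host name, $2 = role (clc, cc or nc)
    case $2 in
        clc) checks="check_cloudlog check_cloudfaults check_walrusfaults check_scfaults" ;;
        cc)  checks="check_cclog check_ccfaults" ;;
        nc)  checks="check_nclog check_ncfaults" ;;
        *)   echo "unknown role: $2" >&2; return 1 ;;
    esac
    for check in $checks; do
        cat <<EOF
define service{
     use                  generic-service
     host_name            $1
     service_description  $check
     check_command        check_nrpe!$check
}

EOF
    done
}
```

Appending the output for the right role to each host’s file in /etc/nagios/servers keeps the Eucalyptus checks next to the basic resource checks for that host.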

Step 5: Do Some Monitoring!

Here is a screen shot of the resulting services UI, where I’ve induced an ERROR condition on the CLC by throwing invalid requests at it.


To Summarize

Of course, this is a very basic integration, but I feel that slightly more sophisticated logfile checking, maybe a custom check that runs ‘euca-describe-services’ to look for non-ENABLED Eucalyptus services, and even simple functional test checkers (run an instance, terminate it) would be very straightforward to add to Nagios.  All told, even this basic procedure, plus configuring Nagios to email the administrator when events occur, yields a functional tool that took almost no time to get going, which really underscored, to me, the power of the Nagios software.  Next, I’ll be using the existing Nagios resource categorization options (service groups, host groups, etc.) to map on to Eucalyptus categories (front-end, middle-tier, back-end, clusters) to make navigating the UI even easier, and exploring some of the additional official and contributed checkers that are available on the Nagios Exchange.
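
To give a feel for the ‘euca-describe-services’ idea, here is a minimal, untested-against-a-real-cloud sketch of such a check.  It reads the command’s output on stdin and follows the standard Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN).  The output layout is an assumption: lines beginning with SERVICE, with the service name in field 4 and the state in field 5.  Check this against your own Eucalyptus version and adjust the field numbers if it differs.

```shell
#!/bin/sh
# Hypothetical Nagios-style check for non-ENABLED Eucalyptus services.
# ASSUMPTION: input lines look like "SERVICE <type> <partition> <name> <state> ...".
# STATE_FIELD overrides the assumed state column.

check_euca_services() {
    # Reads euca-describe-services output on stdin, prints one status
    # line, and returns the matching Nagios exit code.
    awk -v col="${STATE_FIELD:-5}" '
        $1 == "SERVICE" {
            total++
            if ($col != "ENABLED") bad = bad " " $4 "(" $col ")"
        }
        END {
            if (total == 0) { print "UNKNOWN: no SERVICE lines seen"; exit 3 }
            if (bad != "")  { print "CRITICAL: not ENABLED:" bad; exit 2 }
            print "OK: " total " services ENABLED"
            exit 0
        }'
}
```

Saved as a script that ends with ‘euca-describe-services | check_euca_services’, this could be registered in nrpe.cfg in the same way as the log checks; the NRPE user would also need Eucalyptus admin credentials in its environment, which I have not covered here.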