TechOpsGuys.com Diggin' technology every day

5Jan/10Off

Acknowledge Nagios Alerts Via Email Replies

TechOps Guy:

Monitoring should be annoying by design - when something is broken, we need to fix it and we need to be reminded it needs fixing until it gets fixed. That's why we monitor in the first place. In that vein, I've configured our Nagios server to notify every hour for most alerts. However, there are times when a certain alert can be ignored for a while and I might not have a computer nearby to acknowledge it.

The solution: acknowledge Nagios alerts via email. A quick reply on my smartphone and I'm done.

Setting it up is fairly simple and involves a few components: an MTA (Postfix in my case), procmail (might need to install it), a perl script and the nagios.cmd file. I used the info in this post to get me started. My instructions below were done on two different CentOS 5.4 installs running Nagios 3.0.6 and Nagios 3.2.0.

Procmail
Make a /home/nagios/.procmailrc file (either su to the nagios user or chown to nagios:nagios afterwards) and paste in the following:

LOGFILE=$HOME/.procmailrc.log
MAILDIR=$HOME/Mail
VERBOSE=yes
PATH=/usr/bin
:0
* ^Subject:[    ]*\/[^  ].*
| /usr/lib/nagios/eventhandlers/processmail "${MATCH}"

Postfix
Tell Postfix to use procmail by adding the the following line to /etc/postfix/main.cf (restart Postfix when finished):
mailbox_command = /usr/bin/procmail
You might want to search your main.cf file for mailbox_command to make sure procmail isn't already configured/turned on. You also might want to do whereis procmail to make sure it's in the /usr/bin folder. If your Nagios server hasn't previously been configured to receive email, you've got some configuration to do - that's outside of the scope of this article, but I would suggest getting that up and running first.

Perl Script
Next up is the perl script that procmail references. Create a /usr/lib/nagios/eventhandlers/processmail file and chmod 755 it - paste in the code below:

#!/usr/bin/perl

$correctpassword = 'whatever';   # more of a sanity check than a password and can be anything
$subject = "$ARGV[0]";
$now = `/bin/date +%s`;
chomp $now;
$commandfile = '/usr/local/nagios/var/rw/nagios.cmd';

if ($subject =~ /Host/ ){      # this parses the subject of your email
        ($password, $what, $junk, $junk, $junk, $junk, $junk, $host) = split(/ /, $subject);
        ($host) = ($host) =~ /(.*)\!/;
} else {
        ($foo, $bar) = split(/\//, $subject);
        ($password, $what, $junk, $junk, $junk, $junk, $junk, $host) = split(/\ /, $foo);
        ($service) = $bar =~ /^(.*) is.*$/;
}

$password =~ s/^\s+//;
$password =~ s/\s+$//;

print "$password\t$what\t$host\t$service\n";

unless ($password =~ /$correctpassword/i) {
        print "exiting...wrong password\n";
        exit 1;
}

# ack - this is where the acknowledgement happens
# you could get creative with this and pass all kinds of things via email
# a list of external commands here: http://www.nagios.org/development/apis/externalcommands/
if ($subject =~ /Host/ ) {
        $ack =
"ACKNOWLEDGE_HOST_PROBLEM;$host;1;1;1;email;email;acknowledged through email";
} else {
        $ack = "ACKNOWLEDGE_SVC_PROBLEM;$host;$service;1;1;1;email;acknowledged through email";
}

if ($what =~ /ack/i) {
        sub_print("$ack");
} else {
        print "no valid commands...exiting\n";
        exit 1;
}


sub sub_print {
        $narf=shift;
        open(F, ">$commandfile") or die "cant";
        print F "[$now] $narf\n";
        close F;
}

The script above assumes certain things about how your email subject line is formatted and you might have to tweak it if you've done much/any customization to the Notification commands in the default commands.cfg file. One thing you will need to change is the "Host" variable. The default is to put Host: $HOSTALIAS$ in the subject - you'll need to replace that with $HOSTNAME$ as that is what the nagios.cmd file expects. If you don't change that, the perl script above will pass the $HOSTALIAS$ to the nagios.cmd file and it won't know what to do with it. Below is a sample of my notify-service-by-email command:

define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nComment: $SERVICEACKCOMMENT$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n\nMonitoring Page: http://nagios1/nagios\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$

Example
So, when i get an alert that has a subject something like this:
** PROBLEM alert - server1/CPU Load is WARNING **
I can just reply and add "whatever ack" to the beginning of the subject line:
whatever ack RE: ** PROBLEM alert - server1/CPU Load is WARNING **
and the alert will be acknowledged.

Troubleshooting
As I said earlier, you will want to make sure Postfix is configured correctly for receiving email for the Nagios user - this might be an area where you'll have issues if it's not set up correctly. The other thing that fouled me up a few times was the Notification command section I mentioned above. By passing commands directly to the nagios.cmd file and by watching the log files, you should be able to spot any misconfigs.

Filed under: Monitoring 7 Comments
4Jan/10Off

Uptime of various web properties

TechOps Guy: Nate

Came across a post on Techcrunch, which then lead me to Alertsite, which seems to maintain a list of various web sites in various industries and their average uptime and response time. I thought it was interesting at least that Amazon is up only 97% of the time, and LinkedIn up only 95% of the time for examples. Kind of puts things in perspective I think, an increasing number of people and organizations are "demanding" higher levels of uptime, while it's certainly achievable it seems in many cases the costs are just not worth it.  Taking it to an extreme level, this topic reminds me of this article written several years ago by our best friends at The Register.

When Microsoft goofed the DNS settings on its microsoft.com servers recently, he figured the site would have to be up for the next two hundred years to achieve five-nines uptime.

Don't know why I remember things like that but can't remember other things like birthdays.

Filed under: Monitoring No Comments