Thursday, 31 March 2011

nrpe check_smtp & check_mysql notes

4.12pm. Today I have been playing with nagios plugins that allow me to monitor every service on my development box (Sherpa).

NB: These are local NRPE based checks (the daemon only runs on localhost).

* 'check_smtp' is set up by installing 'nagios-plugins-smtp.x86_64' and is then monitored by nrpe by adding the following command to the '/etc/nagios/nrpe.cfg' file:
command[check_smtp]=/usr/lib64/nagios/plugins/check_smtp -H localhost
I can then check this from my NetworkMonitor box via:
# /usr/lib64/nagios/plugins/check_nrpe -H sherpa -c check_smtp
SMTP OK - 0.013 sec. response time|time=0.013064s;;;0.000000
* 'check_mysql' is a little more difficult as it requires a new MySQL user to be created, but it is easily done. Again we install 'nagios-plugins-mysql.x86_64', but have to perform the following before editing the '/etc/nagios/nrpe.cfg' file.
# mysql -u xxxxx -p
mysql> CREATE USER 'nrpe'@'localhost' IDENTIFIED BY '*****';
mysql> exit
Now we can test the command locally via:
# /usr/lib64/nagios/plugins/check_mysql -H localhost -u nrpe -p *****
or just add the following to the nrpe.cfg file.
command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nrpe -p *****
Don't forget to restart the local nrpe service(!).
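
The check_smtp output shown above follows the usual plugin layout of status text, a '|', then performance data. A minimal Python sketch of pulling that apart, assuming simple unquoted labels and the value[unit];warn;crit;min;max field order (the function name parse_plugin_output is made up for illustration):

```python
import re


def parse_plugin_output(line):
    """Split one plugin output line into (status text, perfdata dict)."""
    text, _, perf = line.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        # Pad to five fields: value[unit];warn;crit;min;max
        fields = (rest.split(";") + [""] * 5)[:5]
        match = re.match(r"^(-?[0-9.]+)(.*)$", fields[0])
        if not match:
            continue  # skip anything that does not look like a number
        metrics[label] = {
            "value": float(match.group(1)),
            "unit": match.group(2),
            "warn": fields[1],
            "crit": fields[2],
            "min": fields[3],
            "max": fields[4],
        }
    return text.strip(), metrics


text, metrics = parse_plugin_output(
    "SMTP OK - 0.013 sec. response time|time=0.013064s;;;0.000000"
)
print(text)
print(metrics["time"])
```

Labels containing spaces (which plugins quote) are not handled here.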

Now onto adding the service templates on our Nagios box:
define service{
use generic-service
host_name Sherpa.xxxxxxxxxxxx
service_description postfix SMTP Service Check
check_command check_nrpe!check_smtp
}
define service{
use generic-service
host_name Sherpa.xxxxxxxxxxxx
service_description MySQLd Service Check
check_command check_nrpe!check_mysql
}
Reload Nagios, job done.
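
Since the two service definitions only differ in description and NRPE command, they could be generated rather than typed. A small Python sketch under that assumption; render_nrpe_service is a hypothetical helper and sherpa.example is a placeholder host name, not the real one:

```python
def render_nrpe_service(host_name, description, nrpe_command,
                        template="generic-service"):
    """Return one Nagios service definition that runs an NRPE command."""
    return "\n".join([
        "define service{",
        f"use {template}",
        f"host_name {host_name}",
        f"service_description {description}",
        f"check_command check_nrpe!{nrpe_command}",
        "}",
    ])


# (description, NRPE command name) pairs taken from the definitions above.
checks = [
    ("postfix SMTP Service Check", "check_smtp"),
    ("MySQLd Service Check", "check_mysql"),
]
config = "\n".join(
    render_nrpe_service("sherpa.example", desc, cmd) for desc, cmd in checks
)
print(config)
```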

Wednesday, 30 March 2011

Nagios NRPE: Unable to read output

3.25pm. Today I started looking into rolling out nagios monitoring over our *nix servers after the success of recently adding Windows Servers (via NSClient++) and our Core Switches (via SNMP).

Most of my servers are based on CentOS 5.5 so a quick search for nrpe (the linux alternative of nsclient++) on yum provided the required results:
nagios-plugins-nrpe.x86_64 : Provides nrpe plugin for Nagios
nrpe.x86_64 : Host/service/network monitoring agent for Nagios
Installing both packages went pretty well, adding the required nrpe user and service (just needed to 'chkconfig nrpe on' to ensure it started on boot). I then added the NetworkMonitor host IP to the nrpe.cfg file.
allowed_hosts=127.0.0.1,192.168.200.46
Annoyingly, this is where I hit a roadblock. I was able to test communication from the Nagios/NetworkMonitor box to the NRPE client:
# /usr/lib64/nagios/plugins/check_nrpe -H Sherpa
NRPE v2.12
However when it came to retrieving any useful data such as check_users or check_load I received the following error:
# /usr/lib64/nagios/plugins/check_nrpe -H sherpa -c check_users
NRPE: Unable to read output
# /usr/lib64/nagios/plugins/check_nrpe -H sherpa -c check_load
NRPE: Unable to read output
A quick look in the NRPE plugins folder (/usr/lib64/nagios/plugins) provided the answer; it only contained the check_nrpe plugin (that I had installed earlier) and not a lot else. Sure enough, I had only installed the nagios nrpe plugin (and not all the nrpe plugins, as I first thought). Back to yum:
# yum search nagios-plugins-load
nagios-plugins-load.x86_64 : Nagios Plugin - check_load
So it looks like there is a plugin for each check, that's worth knowing! I can apply this logic to every item I now want to check on this host; or if I am feeling lazy install:
nagios-plugins-all.x86_64 : Nagios Plugins - All plugins
Which does the lot! Now to retest....

# /usr/lib64/nagios/plugins/check_nrpe -H sherpa -c check_users
USERS OK - 1 users currently logged in |users=1;5;10;0

# /usr/lib64/nagios/plugins/check_nrpe -H sherpa -c check_load
OK - load average: 0.00, 0.01, 0.00|load1=0.000;15.000;30.000;0; load5=0.010;10.000;25.000;0; load15=0.000;5.000;20.000;0;

Success!
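
The root cause here was nrpe.cfg pointing at plugin binaries that were not installed. A rough Python sketch of catching that up front, assuming the usual command[name]=/path/to/plugin args layout; the sample config lines and the missing_plugins helper are illustrative only:

```python
import os
import re

COMMAND_RE = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<line>.+)$")


def missing_plugins(cfg_text, exists=os.path.exists):
    """Return (command name, binary path) pairs whose binary is absent."""
    missing = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        match = COMMAND_RE.match(line)
        if not match:
            continue  # not a command definition
        binary = match.group("line").split()[0]
        if not exists(binary):
            missing.append((match.group("name"), binary))
    return missing


# Illustrative config text; on a real host read /etc/nagios/nrpe.cfg instead.
sample_cfg = """
# sample command definitions
allowed_hosts=127.0.0.1,192.168.200.46
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
"""

for name, path in missing_plugins(sample_cfg):
    print(f"{name}: {path} is not installed")
```

The exists parameter is only there so the check can be tried without touching the filesystem.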

Thursday, 24 March 2011

Monitoring a HP Procurve 5412zl switch interface with Nagios (SNMP)

8.50am. Yesterday I had a couple of minutes spare to build on my Nagios setup. Since setting up Nagios I have been looking into monitoring the fibre interfaces that connect my core and satellite switches. Unfortunately I was not able to find any good examples of this done before so I set out to try and set it up myself.

To complete this task I used the check_snmp plugin as a base and built up my own command to check the interface status. After running through an snmpwalk on the switch (snmpwalk -v1 -c public IPHERE >~/Procurve_snmpwalk) I discovered the following OIDs that I can use:

ifSpeed.X (speed of link)
ifOperStatus.X (link up or down)
ifOutOctets.X & ifInOctets.X
ifOutUcastPkts.X & ifInUcastPkts.X
ifOutNUcastPkts.X & ifInNUcastPkts.X

The X indicates the port ID; however, to use this integer we need to check which ID maps to which interface. This can be achieved by querying the ID using the ifDescr.X OID. My snmpwalk output already gave me the mappings I required, but they can also be checked manually on the command line via:
/usr/lib64/nagios/plugins/check_snmp -H %HostnameHere% -C public -o ifDescr.X
By changing the value of X, I am presented with the interface name. For example in my case ID 2 = Port A2 on the switch.
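
Rather than trying each X by hand, the ID-to-port mapping can be pulled out of the saved snmpwalk file. A minimal Python sketch, assuming the walk was taken with MIB names resolved so lines look like "IF-MIB::ifDescr.2 = STRING: A2"; the sample lines below are invented to match that format:

```python
import re

IFDESCR_RE = re.compile(r"ifDescr\.(\d+) = STRING: (.*)$")


def interface_map(walk_text):
    """Map interface index -> description from snmpwalk output text."""
    mapping = {}
    for line in walk_text.splitlines():
        match = IFDESCR_RE.search(line)
        if match:
            mapping[int(match.group(1))] = match.group(2).strip()
    return mapping


# Invented sample; on a real box read ~/Procurve_snmpwalk instead.
sample_walk = """IF-MIB::ifDescr.1 = STRING: A1
IF-MIB::ifDescr.2 = STRING: A2
IF-MIB::ifOperStatus.2 = INTEGER: up(1)
"""

ports = interface_map(sample_walk)
print(ports)
```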

Now that I know the interface ID for the port, I can start to build up commands based on the OIDs I discovered above. To start with I wanted a simple interface check (up/down), so I wrote the following to the /etc/nagios/objects/commands.cfg file:

define command{
     command_name check_interface_up
     command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -w 0:1 -c 0:1 -l 'Link Status'
}

I have set the warning & critical ranges to 0:1. I am unsure if this is the best way to do it, but it does work. The switch returns 1 for interface up and 2 for interface down, so a down link falls outside the 0:1 range and raises an alert.
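
To see why 0:1 works, here is a small Python sketch of the standard Nagios plugin threshold convention (alert when the value falls outside start:end, with '~' for negative infinity and a leading '@' to invert). It assumes check_snmp follows that convention for simple ranges; the function names are made up:

```python
def alerts(value, spec):
    """Return True if value should raise an alert for a threshold range."""
    inside_alert = spec.startswith("@")  # '@' means alert when inside
    if inside_alert:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec  # bare "10" means 0..10
    start = float("-inf") if start_s == "~" else float(start_s or 0)
    end = float("inf") if end_s == "" else float(end_s)
    in_range = start <= value <= end
    return in_range if inside_alert else not in_range


def status(value, warn, crit):
    """Critical is evaluated before warning, as plugins normally do."""
    if alerts(value, crit):
        return "CRITICAL"
    if alerts(value, warn):
        return "WARNING"
    return "OK"


print(status(1, "0:1", "0:1"))  # interface up
print(status(2, "0:1", "0:1"))  # interface down
```

Under the same assumption a tighter range such as 1:1 would also flag any value other than 1, which may or may not be what is wanted.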

I then added the following to my switch template:

define service{
     use generic-service
     host_name Block-B-5412zl
     service_description Block-A-3500yl uplink 2 status
     check_command check_interface_up!public!ifOperStatus.2
}

Using the ! symbol as a separator I am able to pass different variables to the command; $ARG1$ and $ARG2$ .

$ARG1$ = the SNMP community string to use
$ARG2$ = This is where I define the SNMP OID ifOperStatus.2; however, I can replace this with any of the other OIDs that I mentioned earlier.
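
To make the substitution concrete, here is a rough Python imitation of how the '!'-separated values end up in the command line. It is only an illustration of the idea, not how Nagios is implemented, and the host address 192.0.2.10 is a placeholder:

```python
def expand_check_command(check_command, command_line, macros):
    """Split on '!' and substitute $ARGn$ plus any other given macros."""
    name, *args = check_command.split("!")
    expanded = command_line
    for index, arg in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", arg)
    for key, value in macros.items():
        expanded = expanded.replace(f"${key}$", value)
    return name, expanded


command_line = ("$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ "
                "-w 0:1 -c 0:1 -l 'Link Status'")
name, expanded = expand_check_command(
    "check_interface_up!public!ifOperStatus.2",
    command_line,
    {"USER1": "/usr/lib64/nagios/plugins", "HOSTADDRESS": "192.0.2.10"},
)
print(name)
print(expanded)
```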

That's pretty much it: reload Nagios and I am now able to monitor switch interfaces. The next task is monitoring the traffic flowing through the interfaces by graphing ifOutUcastPkts.X & ifInUcastPkts.X in cacti.

Friday, 18 March 2011

4.50pm. I was recently told about the idea of publishing nagios alerts to a private twitter account.

As it was a nice quiet day I had a look at implementing the scripts today. I had a few problems getting the tweets through our http proxy, but managed to solve this problem by adding:

$ENV{http_proxy}='http://proxy:8080';

to the top of the nitter(v2) script. I then moved the script to /etc/nagios/ and I was away.


Friday's goal: complete.

10.50am. So I have had to look this command up a couple of times now as I keep forgetting. The command for installing CPAN modules into Perl is as follows:

perl -MCPAN -e shell
cpan> install Type::ModuleName

Today's reason for its use? Installing a Perl script to process twitter OAUTH authentication.
Why? Sending alerts from Nagios to twitter.

Wednesday, 16 March 2011

3.50pm. This week I have mostly been installing and configuring up: Nagios.

Cutting straight to the verdict: awesome.

So it takes a little while to set up, but once you have learnt the slightly odd templates/services/commands/hosts system it's just a case of replication.

So far I have configured it to monitor:
* the majority of network printers,
* our Windows servers,
* our Linux servers,
* ....and a handful of procurve switches.

I did have to disable the alerts on the Network Printers as they became incredibly annoying when office staff were saving the planet by turning them off at night.

Next step is to gather more statistics from the linux servers, and monitor key fibre interfaces on the switches.

Wednesday, 9 March 2011

6.45pm. It's done. The data mining/extraction script I have been working on in Perl is finally complete. Last week I set myself a goal of automating the search process that we as website users perform daily. I went to work, set myself up a dummy search site and had a play. A couple of days later (with a few beers along the way) I have accomplished the following:

* The ability to log in to access controlled sites that use session tracking.
* Download a copy of every searchable record, starting with id 1 and ending with 10000.
* Extract the data from the local files, and build a CSV file.
* Clean up the CSV file, removing all null entries, and blank lines.

The end result?

A 700 line CSV containing information I require from an external searchable system.

Now where's that beer....

Friday, 4 March 2011

10.00pm. Okay, so after having quite a productive day in the office, I decided to continue my brain workout by picking up some more Perl code.

I wrote a couple of scripts that are able to download data from search pages, including a version that is able to pass login information. This made the whole challenge slightly more difficult, but in the end I was able to utilise the --keep-session-cookies & --save-cookies/--load-cookies arguments of wget to achieve my goals.

The next step was to read in an array of search variables to pass to wget, which was easily done. I then started looking at automating the whole process by looking at search forms that used unique IDs in their POST data; with some friendly $int++ I was able to generate a whole bunch of searches with success.
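
The scripts themselves were Perl, but the shape of the wget calls can be sketched in Python. The URLs, the form field name id, the file names and both helper functions are placeholders invented for the example; the commands are only built, not run:

```python
def login_command(login_url, post_data, cookie_file="cookies.txt"):
    """wget call that logs in and saves the session cookies to a file."""
    return ["wget", "--keep-session-cookies", "--save-cookies", cookie_file,
            "--post-data", post_data, "-O", "/dev/null", login_url]


def search_commands(search_url, first_id, last_id,
                    cookie_file="cookies.txt", field="id"):
    """Yield one wget call per record ID, reusing the saved cookies."""
    for record_id in range(first_id, last_id + 1):
        yield ["wget", "--load-cookies", cookie_file,
               "--post-data", f"{field}={record_id}",
               "-O", f"record_{record_id}.html", search_url]


login = login_command("https://example.invalid/login", "user=me&pass=secret")
searches = list(search_commands("https://example.invalid/search", 1, 3))
print(" ".join(login))
for command in searches:
    print(" ".join(command))
```

Each list could be handed to subprocess.run once the placeholders are replaced with real values.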

Next step: Trawling through the pages to extract the data I require.

8.00pm. So over the last 2 days at work I have been setting up and configuring nagios. It's been on my to-do list for a while now (years) and I have finally found a couple of days to crack on with it. It's pretty easy to set up, and once you have worked out the difference between templates, hostgroups, and servicegroups you're all set.

I guess the only frustrating part is the constant personal email bombardment (every hour) if you leave something misconfigured in the office, can't really blame the product for that though.

Thursday, 3 March 2011

11.40am. Well that was easy. I upgraded our virtualisation infrastructure to Xenserver 5.6 a couple of weeks ago but at the time I forgot to upgrade the XenTools packages. With 5 minutes spare today I was able to complete this task by performing the following:

* Attach the XenTools ISO by selecting VM->Install XenServer Tools.
* Mount the DVD:
mount /dev/xvdd /mnt
* Run the script on the mounted DVD:
/mnt/Linux/install.sh
* Unmount the DVD
umount /mnt
* Restart the VM.
* Check under Virtualisation State (found under the general tab for that VM) it says: Optimized

Fin.

Tuesday, 1 March 2011

7.50pm. So I watched 'The Social Network' over the weekend, and after watching the 'hacking' scene at the beginning of the film, I can confirm that downloading the jpeg pictures via an apache index is indeed kids' stuff.

In case you were wondering, I accomplished it via:
wget -r -A .jpeg APACHE_INDEX_PAGE_URL_HERE

4.00pm. The synchronisation finally finished, only took 14hrs! Now I am struggling with the telephony box again that we had to restart yesterday, it appears to be running on IIS 5.1 Personal Web Server with its oh so joyful 10 simultaneous user limit. It appears to be registering each transaction I complete as a new connection under Opera and then drops me to a 403 after 10 clicks. Joy.

I can imagine the response from their tech support now; "Have you tried restarting the server?"