02 Oct

The humble beginnings of a Monitoring System

Late yesterday afternoon the website here was updated from our Subversion repository, and included in that update was the start of our new Network Status/Monitoring system. It is nowhere near complete, but it has taken shape enough that we can start tinkering with it on the live site and collecting data.

What we have so far is the beginnings of the “master” for the monitoring system, integrated into our Drupal setup here. What is done so far in the code includes:

  • Network Status Block – Will reside in the basic layout of our site (most likely on the right-hand side near the “Proudly Utilizing” block) and provide a quick “at a glance” way of seeing if there are any problems with our network.
  • Network Status Page – Will provide a detailed list of everything in the network, along with a detail page for each device showing network health measurements and system health measurements such as system load, memory usage, and drive space.
  • Basic Network Health Monitoring – Network health and system health information is collected on a regular basis by the master and stored in the database.
  • Communications support for the Monitoring “Agents” – The plan is to have multiple “agents” running on machines in multiple physical locations to monitor specific services (web server, mail server, etc.) on each machine. These “agents” will need a central place to report their findings and to retrieve information about exactly what they should be monitoring. The master system as implemented so far has the beginnings of that communications stack already in place.
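
To make the agent-to-master idea a little more concrete, here is a minimal sketch in Python of what one report might look like. This is an illustration only and not our actual module code: the `CheckResult` fields, the JSON layout, and the function names are all assumptions made up for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CheckResult:
    """One finding from an agent (every field name here is hypothetical)."""
    agent: str    # which agent ran the check
    host: str     # machine being monitored
    service: str  # e.g. "web" or "mail"
    ok: bool      # True if the service responded as expected
    detail: str   # short human-readable note


def encode_report(results):
    """Agent side: serialize a list of CheckResult objects to a JSON string."""
    return json.dumps({"results": [asdict(r) for r in results]})


def decode_report(payload):
    """Master side: parse a JSON report back into CheckResult objects."""
    data = json.loads(payload)
    return [CheckResult(**item) for item in data["results"]]


def failing_services(results):
    """Return (host, service) pairs for every check that reported a problem."""
    return [(r.host, r.service) for r in results if not r.ok]
```

An agent would build a list of `CheckResult` objects, call `encode_report`, and send the string to the master, which would call `decode_report` and store the results. How the report actually travels over the network is deliberately left out of this sketch.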

There is still an enormous amount of work to do before we consider the project “finished”, however. The “agents” that will poll specific services still need to be written and deployed, and nothing yet alerts on or reacts to the data collected. We have a way to go, but we wanted to get what we have into place now so that we can begin recording real data from the servers for the pieces that already work (network health, load averages, etc.). That lets us verify the system is working as intended, and it gives us an idea of what is “normal”, so that we can set appropriate levels for the alerts to kick in once the alerting portion is finished.
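
As a rough idea of how recorded data could turn into alert levels, the sketch below takes a set of baseline samples (load averages, say) and places the threshold a few standard deviations above the mean. We have not settled on how the alert levels will actually be chosen; this is just one simple approach, and the function names and the default multiplier `k` are assumptions for the example.

```python
import statistics


def alert_threshold(samples, k=3.0):
    """Return mean + k sample standard deviations of the baseline samples.

    k is a hypothetical tuning knob: larger values mean fewer alerts.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return statistics.mean(samples) + k * statistics.stdev(samples)


def should_alert(value, threshold):
    """True when a new measurement is above the baseline-derived threshold."""
    return value > threshold
```

The longer the system records before alerting is switched on, the better the baseline reflects what “normal” really looks like, which is the reason for collecting data this early.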

Last night, after updating the code on the site, I spent a few moments tinkering with the module before I realized that I had not set the module's permissions to “Staff Only”, so everyone who visited the site was seeing the “Network Status” block and its data. There is nothing wrong with that, of course, as the entire goal is to get more information out to everyone. However, the system is currently collecting only about 10% of the data it should, does not have all the correct devices in it yet, and the actual “status page” is not ready for public use (it has a lot of debugging messages sprinkled throughout), so I pulled the block from circulation once I realized what I had done.

This was the cause of the “mysterious disappearing block” that a couple of people reported last night. It is coming; hopefully within the next week or two there will be something firm to share with everyone… let's just consider last night's limited appearance a “sneak preview”, if you will. 🙂