05 Sep

Aftermath of a Bad Chip, Part 2

A continuation from yesterday’s post.

These strange issues with Srv3 started on August 7th, when Srv3 stopped responding to all requests just after midnight. Complicating matters, our monitoring system failed to dispatch the alerts it should have; later investigation revealed this was because the InnoDB file used to log the status of the checks had filled up (a programming flaw in our monitoring system that has since been corrected), so we remained unaware there was even a problem until the first client trouble tickets came in a little over 3 hours later. When we did realize there was a problem, we found ourselves unable to gain access to the machine remotely, so we contacted the datacenter around 4am. They investigated, and the report came back “Machine will not POST, Investigating further”, which is something nobody ever wants to hear regarding a server (basically, the machine had power and would turn “on”, but nothing ever came up on the console, nor would it actually boot; it just sat with a blank screen). It was almost 11am before the server came back online; the outcome report from the datacenter was “Motherboard was dead. Replaced from Stock, Server now back online.”
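(As an aside, for anyone who enjoys a bit of “who watches the watchers”: the fix amounts to having something outside the monitoring system verify that the monitoring system can still write to its own check log. The sketch below is just the general idea, not our actual code; the table name, columns, and credentials are all invented for illustration.)

```python
#!/usr/bin/env python
# Hypothetical "watch the watcher" cron job.  It tries to write (and then
# remove) a heartbeat row in the table the monitoring system logs its check
# results to; if the write fails -- say, because the InnoDB tablespace is
# full -- the script exits non-zero and cron mails its output to us out of
# band.  Table, column, and credential names are made up for illustration.

import sys
import MySQLdb

DB = dict(host="localhost", user="monitor", passwd="xxxxxx", db="monitoring")
TABLE = "check_status_log"   # hypothetical name of the check-history table

def main():
    try:
        conn = MySQLdb.connect(**DB)
        cur = conn.cursor()
        # Both statements will fail if there is no room left for new rows.
        cur.execute("INSERT INTO " + TABLE +
                    " (host, service, status) VALUES ('watchdog', 'heartbeat', 0)")
        cur.execute("DELETE FROM " + TABLE +
                    " WHERE host = 'watchdog' AND service = 'heartbeat'")
        conn.commit()
        conn.close()
    except MySQLdb.Error as e:
        print("MONITORING DB WRITE FAILED: %s" % e)
        sys.exit(2)   # non-zero exit: cron mails this output to MAILTO
    print("monitoring db writable")

if __name__ == "__main__":
    main()
```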

To be honest, we will probably never know exactly what caused the motherboard to die: a power spike of some type, a loose connection in the machine, or even just a board simply dying for no reason. (I used to refer to this as “the magic escaped”; with some components, power supplies for instance, if you watch closely you can actually see the magic escape from the device when it goes bad. Some people would say the magic looks like smoke and smells like smoke, but I insist it’s actually magic, because while I know how a power supply works and what it does, I couldn’t very well build one or do its job myself, so it must be magic.) We’ve had components go bad before, and motherboards are no exception, although our rate of hardware failures has dropped considerably since we switched datacenter providers in May 2007. Our old datacenter had a practice of using the “most economical hardware available on any given day”, which sometimes led to questions about the quality of the hardware, while the new datacenter provider uses boards such as the one in Srv3, a SuperMicro X7DBR-E, which, while not the cheapest motherboard on the market, is a fairly respected server-quality piece of hardware. Anyway, before I digress further than I already have: for whatever reason the motherboard had died, the datacenter replaced it, and we figured that was our one severe system failure for the season and we could be happy and carefree for months to come.

Unfortunately this was not the case, and we found ourselves faced with crashes on August 25th, 29th, and the last on September 3rd. In all three cases the situation played out in similar fashion: the alert system would go off, and we would find ourselves unable to access the server in any fashion remotely. (This reminds me, future blog post: remote access, how we do it, why, and what features it gives us in a disaster.) Our policy in these situations has always been to try and determine the cause of the problem, even if it means a slightly longer downtime (without knowing for sure what caused a problem, it becomes much harder to protect against it in the future), so we would contact the datacenter to have a technician investigate at the machine. (We actually have the ability to do a hard power reset via the power strip remotely without involving anyone, but doing so would wipe out any way of knowing exactly what was going on in the machine at the time.)

In all three instances the technician would report that even at the local console they were unable to gain access to the machine, and ultimately the machine would be power cycled by the technician. The server would come back online, and we would begin investigating to see if we could determine the cause of the problem from the server’s log files. In all three instances we found that the machine was apparently under a higher than normal load average at the time of the crash, and twice we found possible causes that could explain the high load (in one instance a client’s site was placing an unusually high demand on the MySQL server; in another there was what appeared to be a Distributed Denial of Service attack taking place against one client’s site, though even that was strange in that it was “large enough to cause an issue [ed: for Srv3], but too small/varied to trigger the datacenters protection systems”).

The most recent crash on September 3rd had one crucial difference, the details of which I covered pretty well yesterday, but to put it as simply as possible: Wednesday’s crash had one benefit to us, in that it shined a light on an issue that I’m hopeful will turn out to be the real cause of the problems we’ve been having with Srv3. Only time will tell for sure, but I’m trying to be optimistic. (Optimistic, yes, but I’m still checking on Srv3 every time I sit down at a machine.) Ultimately there are still some questions. Was the bad RAM the cause of the problems on 8/25, 8/29, and 9/3? (If the machine continues to be stable from here on out, as it has so far, my guess is “yes”.) Was the problem with the RAM a result of, or related to, the motherboard failure? (Possible. A power spike within the machine could very well have affected the RAM chips as well. Another possibility is that the two chips we deemed “questionable” and had replaced yesterday were not the original chips that were in the machine when the board died; while diagnosing and fixing the problem on 8/7, the datacenter may have swapped the RAM out at that time too. We have asked the datacenter to provide more information on exactly what steps were taken that morning, but unfortunately we have not gotten much beyond the initial “Motherboard replaced”. Given that the datacenter has thousands of machines under their care and a fleet of spare parts on hand for swapping out/repairing, anything is possible as to the exact course of action the technician took that morning before bringing the machine back online.)

What I do know for sure, though, is that this whole situation revealed something to me. We’ve failed at something important, and it’s what is so prominently displayed on the front page of this very site: “Communication and honesty will be our game”. I can’t fault the honesty part, because going back through the tickets logged during these events (and boy, are there tickets; looking at every ticket in the new system since it came online in June, over half of them are the result of these recent issues with Srv3... I’m not sure if that is a testimony to how annoying these four outages have been for everyone, or to how quiet things are normally. Not complaining, mind you, just a weird thing I noticed going through them all), I can’t see where anyone was anything less than forthright and honest in dealing with inquiries from clients.

The “communication” part, however, is where I think we can do a little better. There’s a ticket from the first incident on 8/7 that was opened by the client just before 10am, when they noticed their site was down. It was replied to promptly and efficiently, stating what had happened to Srv3, where we were with the repairs, and that we would update as soon as more was known. Shortly thereafter the server came back online, and the ticket was updated again and closed out. Sound good? Not entirely. My question is this: why, almost 7 hours after we knew there was a problem, did the client have to open a ticket in order to get any information about it? Why wasn’t there something right on the front page of www.purenrg.com that said “Srv3 Outage” or similar, where they could find everything we knew about the problem and where we stood in getting it fixed?

I’m not saying there needs to be a big blinking red banner across the top of our homepage whenever we reboot a server, and I’m not saying the admin team needs to twitter every little configuration change they make to the servers. (Although... :gets that evil look in the eye:... no, no, they’d string me up from a tree.) But something like an outage of this magnitude, affecting a large swath of our clients, should result in easy-to-find information and status updates during the outage, available on demand for clients, and not only by having to log in and file trouble tickets.

When I log into our homepage, I have a little link in the upper right corner that says “Network Status”. I click it, and I see a pretty little page that looks a lot like the Service Detail screen from Nagios (it should look a lot like it; our current monitoring system is heavily based on Nagios). At a single glance I can see if anything is offline, slow, etc. It’s a simple little thing, but it’s also informative and helpful, so why can’t we have something like that for clients as well?
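(For the technically curious: pulling that kind of at-a-glance summary out of Nagios is not complicated, because Nagios keeps its state in a flat status file. The sketch below is just the general idea, not our actual code; the path is the common default and will differ from install to install.)

```python
#!/usr/bin/env python
# Rough sketch: summarize host and service states from Nagios' status file.
# Adjust the path to match the status_file setting in nagios.cfg.

STATUS_FILE = "/usr/local/nagios/var/status.dat"

SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}

def parse_status(path):
    """Return a list of (block_type, fields) for each { ... } block in status.dat."""
    blocks = []
    current_type, fields = None, {}
    for line in open(path):
        line = line.strip()
        if line.endswith("{"):                   # e.g. "servicestatus {"
            current_type, fields = line[:-1].strip(), {}
        elif line == "}" and current_type:
            blocks.append((current_type, fields))
            current_type = None
        elif "=" in line and current_type:
            key, value = line.split("=", 1)
            fields[key] = value
    return blocks

def summarize(path):
    """Print one line per host and per service, with its current state."""
    for block_type, f in parse_status(path):
        if block_type == "hoststatus":
            state = HOST_STATES.get(int(f.get("current_state", 3)), "?")
            print("%-25s %s" % (f.get("host_name"), state))
        elif block_type == "servicestatus":
            state = SERVICE_STATES.get(int(f.get("current_state", 3)), "?")
            print("%-25s %-30s %s" % (f.get("host_name"),
                                      f.get("service_description"), state))

if __name__ == "__main__":
    summarize(STATUS_FILE)
```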

Now, right now I know there are a couple of people (who know our system) going “You can’t give everyone access to our monitoring system! There’s a lot of stuff in there we can’t let just everybody see whenever they want!” Some of those same people are going “Wait... did he just say ‘current monitoring system’?” To those people I respond: I know we can’t... and yes, I said *current*. I also know there’s at least one marketing-degree-holding friend of mine who will say “You want to put news about outages and failures on your front page? Are you insane? That’s a sales nightmare; you’d chase away every would-be client who lands on your doorstep. Your ‘System News & Announcements’ should all be good news and improvements, and your ‘Blog’ should be happy, cheery things like baby puppies and birthdays! Where exactly are you planning on putting all this doom and gloom? Can you at least hide it from people who are not already clients?” To that person I say: you take the good, you take the bad, you take them both, and there you have the facts of life. (See, I did learn something growing up in the 80s... from TV, but still.) My point being, we can’t praise the good (our normal, highly reliable service) without also acknowledging the bad. I know that’s not what they teach at marketing school, but I never went to marketing school; I did, however, go to a Catholic elementary school, where nice ladies taught me that speaking only half the truth is, well... a lie. So it all goes in: the good, the bad, all of it.

I know there are a few reasons we can’t just take that pretty screen I have from our current system and display it for everybody. For one, there are some entries in there that we, or our clients, wouldn’t want shared with the world at large. Imagine for a moment that the entire world learned we run outsourced mail servers for both US presidential candidates, because they could see that our monitoring system is watching both “mail.barackobama.com” and “mail.johnmccain.com” for us. Now, first of all, we don’t provide service for either of those domains, so our monitoring system doesn’t include those entries, but it could, in theory. Or it could contain entries for some other machines where the clients wouldn’t want the world to know they’ve outsourced *X*... and *X* could be anything. The point is, we take the privacy of our clients seriously, and I won’t jeopardize that, even with silly things like machine names. Secondly, because I know everyone is touchy about politics this season: I picked those two domains only because they’re something everyone can easily identify with, since both are highly public figures, and the order I placed them in is *strictly* because that’s how the domains fall alphabetically. I’m not promoting, endorsing, or making a political statement of any kind here; I just needed an example, and this was one I figured would get the point across.

So no, a simple export of our current status screen won’t work from a privacy standpoint. The only servers I want shown on this page are our standard shared Linux web hosting servers: the ones that many folks are on, the ones that people would most likely be looking for, and the ones that don’t run the risk of exposing any information we don’t consider pretty much public anyway. Furthermore, there’s some more information I’d like to work into the page that isn’t currently contained in our monitoring system. Along with the current network status, I’d like to see a list of “Current Alerts” with details on any active network issues (something we could update with detailed information as the situation changes), “Past Alerts” with the last few reported problems, and even a “Scheduled Maintenance Windows” section where we could put details of any upcoming scheduled downtimes. It is, I think, entirely doable, extremely informative for everyone involved, and it could even be done in such a way as to alleviate the fears of even my marketing-degree-carrying friend. (Truthfully, I believe that given our normal track record for quality uptime and service, this page would, in the long run, be a plus for us on the marketing side, not a negative.)
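(To make that a little more concrete, here’s a back-of-the-napkin sketch of what the data behind such a page could look like: a whitelist of the shared servers we’re willing to show publicly, plus hand-written alert and maintenance entries. Every name and field below is invented purely for illustration; none of this exists yet.)

```python
#!/usr/bin/env python
# Back-of-the-napkin sketch of the data behind a public status page.
# Every server name, field, and entry here is invented for illustration.

# Only the standard shared Linux hosting servers ever appear on the public page.
PUBLIC_SERVERS = set(["srv1", "srv2", "srv3", "srv4"])

def public_view(host_statuses):
    """Filter (host_name, state) pairs down to the public whitelist."""
    return [(host, state) for host, state in host_statuses
            if host in PUBLIC_SERVERS]

# Alerts and maintenance windows would be written by hand (not pulled straight
# from the monitoring system) so we can add detail as a situation develops.
current_alerts = [
    {"server": "srv3",
     "opened": "<timestamp>",
     "summary": "<what happened, in plain language>",
     "updates": ["<timestamped update as the situation changes>"]},
]

past_alerts = []        # the last few resolved alerts would live here

scheduled_maintenance = [
    {"server": "<server>",
     "window": "<date and time range>",
     "reason": "<what is being done and why>"},
]

if __name__ == "__main__":
    # Example: a host that is not on the whitelist simply never shows up.
    statuses = [("srv1", "UP"), ("srv3", "UP"), ("private-box-1", "UP")]
    for host, state in public_view(statuses):
        print("%-10s %s" % (host, state))
```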

I’m sure there are other things we can do to improve communication with clients, and other parts of our service, and if anyone has any suggestions, comments, or even straight-out complaints and grievances, I’d very much like it if they sent them my way (comments here on the blog, the contact-us page, a ticket, or even just an email). I always keep an eye out for feedback from clients (and even non-clients), and am always grateful for any ideas or feedback that come my way. I’m sure in time there will be other changes coming in the way things run here on the site; I just really got focused on the whole “Why wasn’t this information on the site? Why only tickets? Didn’t we spend all that time developing this new site so that it could be more dynamic and responsive to the needs of everyone? So why didn’t we use it?!” problem, and spent all of last night fleshing this idea out. (And now, apparently, half of the day typing this entry, which again has gone on far longer than I planned... oh well.)

I’m off to see if I can shake up some coders and give them a project to think about over the weekend.