Maintenance Archives | Pure Energy Systems
24 Nov 2009

Planned Service Outage on November 30th

We've been made aware of an emergency maintenance outage scheduled for Monday, November 30th, between 1am and 2am EST. (That would be "late Sunday night, early Monday morning.") A recently discovered security vulnerability affects certain Cisco router software versions, and sure enough, our data center has a number of affected routers in its network that need to be upgraded.

The outage will affect the entire "public" network at our data center, basically cutting our equipment off from the rest of the internet in the process. As a result, all client account services, as well as our own websites and customer portal, will be unavailable while the data center network is offline for the upgrade. While the maintenance window is scheduled for 1 hour, we have been told to expect only 15 to 20 minutes of actual downtime while the routers reload and fully converge.

We apologize in advance for any inconvenience this planned outage may cause anyone, but maintaining the security and integrity of the network does require occasional upgrades and patches.

17 Oct 2009

All Servers Upgraded…

This evening all client servers for our Shared Linux Hosting users were upgraded to Apache 2.2.14 and PHP 5.2.11. These were fairly minor upgrades (from previous 2.2.x and 5.2.x releases), so no configuration changes or major modifications were required. Downtime for each server was measured in seconds while the Apache services were restarted.
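
(For the curious, the quickest way to confirm what a server is reporting after an upgrade like this is to look at its response headers. The snippet below is only an illustration, not part of our tooling; it assumes the server advertises its Apache version in the Server header and, if expose_php is enabled, its PHP version in X-Powered-By, and it uses srv3.purenrg.com purely as an example hostname.)

    # Hypothetical post-upgrade check: print the version headers a server advertises.
    # Assumes the host exposes its Apache version in "Server" and, if expose_php is
    # on, its PHP version in "X-Powered-By". The hostname below is just an example.
    import urllib.request

    def report_versions(host):
        request = urllib.request.Request("http://%s/" % host, method="HEAD")
        with urllib.request.urlopen(request, timeout=10) as response:
            server = response.headers.get("Server", "(not advertised)")
            powered_by = response.headers.get("X-Powered-By", "(not advertised)")
        print("%s -> Server: %s | X-Powered-By: %s" % (host, server, powered_by))

    if __name__ == "__main__":
        report_versions("srv3.purenrg.com")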

17 Mar 2009

Backend Network Maintenance

We have been advised by our data center that their engineers will be upgrading the router operating systems and rebooting the back end network routers across the Dallas network on March 28th 2009 starting at 3:00am CDT.

Impact to our clients: little to none. All servers and accounts should remain up and accessible to the outside world during any actual back end outages. It will, however, affect some back end operations on our end, such as the management interfaces between our website and our hosting servers and the internal DNS propagation system. For instance, new account provisioning, adding a new domain to an existing account, or resetting your cPanel password via the "My Services" section of our website will not function during the actual back end network outage.

Time frame: While the maintenance window is set for 2 hours, we expect no longer than 15 – 20 minutes of downtime.

04 Dec 2008

Upgrade to Apache 2.2 Complete

Apache/2.2.10 (Unix) Server at srv3.purenrg.com Port 80

Sweeter words have never been spoken by our servers. Okay, so maybe I'm being just a little overly dramatic, but the simple fact is, the jump from Apache 1.3.x to 2.2.x is something we'd been putting off for various reasons for so long that it was beginning to feel like it would never happen. This morning that situation was resolved, and we are happy to report that everything appears to have gone smoothly.

We'll be keeping a close eye on the systems over the next few days, watching for any unexpected issues and seeing how the new version stacks up performance-wise.

15 Oct 2008

Unscheduled downtime for our own site.

We had a brief problem this morning with one of our internal Xen VPS boxes, which resulted in our own website being unavailable for approximately 20 minutes. Ultimately we determined that the Xen DomU that runs our site was unresponsive and was not coming back online properly after being shut down on its own, so we rebooted the entire VPS machine after applying some Xen updates to be safe. No shared Linux web hosting clients should have been affected by this, aside from a moment or two where ns3 was offline while the VPS rebooted; ns3 going offline momentarily isn't a problem, as ns4 is there to pick up the workload.

We are currently running a series of Xen updates across all DomUs on the affected machine, which may result in an additional moment's downtime while each one reboots into the newer kernel. Again, no client sites should be affected, just our own.
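
(For anyone curious how we keep an eye on those reboots, a check like the one below is roughly what it amounts to. It's only an illustrative sketch, assuming the classic xm toolstack from this era of Xen, and the domain name pes-web is made up for the example.)

    # Hypothetical check: poll "xm list" until a given DomU shows up again.
    # Assumes the classic xm toolstack; the domain name here is just an example.
    import subprocess
    import time

    def domu_is_listed(name):
        output = subprocess.run(["xm", "list"], capture_output=True, text=True).stdout
        # "xm list" prints one row per domain: Name  ID  Mem  VCPUs  State  Time(s)
        for line in output.splitlines():
            fields = line.split()
            if fields and fields[0] == name:
                return True
        return False

    def wait_for_domu(name, attempts=30, delay=10):
        for _ in range(attempts):
            if domu_is_listed(name):
                print("%s is back in the domain list." % name)
                return True
            time.sleep(delay)
        print("%s did not reappear; time to get a console on the Dom0." % name)
        return False

    if __name__ == "__main__":
        wait_for_domu("pes-web")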

05 Sep 2008

Aftermath of a Bad Chip, Part 2

A continuation from yesterday’s post.

These strange issues with Srv3 all started on August 7th, when Srv3 stopped responding to all requests just after midnight. Complicating matters, our monitoring system failed to dispatch the alerts that there was a problem as it should have. Later investigation would reveal this was due to the InnoDB file used to log the status of the checks being full (this was a programming flaw in our monitoring system and has since been corrected), and we ended up unaware there was even a problem until the first client trouble tickets came in a little over 3 hours later. When we did realize there was a problem, we found ourselves unable to gain access to the machine remotely, and contacted the datacenter around 4am. They investigated, and the report came back "Machine will not POST, investigating further", which is something nobody ever wants to hear regarding a server. (Basically, the machine had power and would turn "on", but nothing ever came up on the console, nor would it ever actually boot; it just sat with a blank screen.) It was almost 11am before the server came back online; the outcome report from the datacenter was "Motherboard was dead. Replaced from stock, server now back online."
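
(A quick aside on that monitoring flaw, the obvious lesson being that a monitoring system needs to monitor itself: if it can't record the results of its checks, that by itself should wake somebody up. The snippet below is just an illustrative sketch, not our actual code; it writes a heartbeat row and verifies the write took, escalating out-of-band if it doesn't. It uses SQLite purely so the example is self-contained, where the real system logs to MySQL/InnoDB, and every name in it is made up.)

    # Hypothetical monitoring self-check: write a heartbeat row and verify it landed.
    # If the write or read-back fails (e.g. the status store is full or wedged),
    # escalate out-of-band instead of failing silently.
    # Uses SQLite so the sketch is self-contained; a real deployment would point at
    # the monitoring system's own MySQL/InnoDB status tables.
    import sqlite3
    import time

    DB_PATH = "monitor_heartbeat.db"  # made-up path for the example

    def escalate_out_of_band(message):
        # Stand-in for paging/SMS/email that bypasses the monitoring database.
        print("PAGE THE ON-CALL ADMIN:", message)

    def record_and_verify_heartbeat():
        now = int(time.time())
        try:
            conn = sqlite3.connect(DB_PATH)
            conn.execute("CREATE TABLE IF NOT EXISTS heartbeat (ts INTEGER)")
            conn.execute("INSERT INTO heartbeat (ts) VALUES (?)", (now,))
            conn.commit()
            latest = conn.execute("SELECT MAX(ts) FROM heartbeat").fetchone()[0]
            conn.close()
            if latest is None or now - latest > 300:
                raise RuntimeError("heartbeat not visible after write")
        except Exception as exc:
            # The whole point: do NOT rely on the normal alert path here.
            escalate_out_of_band("monitoring self-check failed: %s" % exc)

    if __name__ == "__main__":
        record_and_verify_heartbeat()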

To be honest, we will probably never know exactly what caused the motherboard to die; a power spike of some type, a loose connection in the machine, or even just a board simply dying for no reason. (I used to refer to this as "the magic escaped"; if you watch closely, with some components (power supplies, for instance) you can actually see the magic escape from the device when it goes bad. Some people would say that the magic looks like smoke and smells like smoke, but I insist it's actually magic, because while I know how a power supply works and what it does, I very well couldn't build one or do the job myself, so it must be magic.) We've had components go bad before, and motherboards are no exception, although the incident rate of such hardware failures for us has dropped considerably since we switched datacenter providers in May 2007. Our old datacenter had a practice of using the "most economical hardware available on any given day", which sometimes led to questions about the quality of the hardware, while the new datacenter provider uses boards such as the one in Srv3, a SuperMicro X7DBR-E, which while not the cheapest motherboard on the market is a fairly respected server-quality piece of hardware. Anyway, before I digress further than I already have: for whatever reason, the motherboard had died, the datacenter replaced it, and we figured this was our one severe system failure for the season, and we should be happy and carefree for months to come.

Unfortunately this was not the case, and we found ourselves faced with crashes on August 25th, 29th, and, most recently, September 3rd. In all three of these cases the situation played out in a similar fashion: the alert system would go off, and we would find ourselves unable to access the server in any fashion remotely. (This reminds me, future blog post: remote access, how we do it, why, and what features it gives us in a disaster.) Our policy in these situations has always been to try and determine the cause of the problem, even if it means a slightly longer downtime (without knowing for sure what caused the problem, it becomes much harder to protect against that problem in the future), so we would contact the datacenter to have a technician investigate at the machine. (We actually have the ability to do a hard power reset via the power strip remotely without involving anyone, but doing so would wipe out any way of knowing exactly what was going on in the machine at the time.)

In all three instances the technician would report that even at the local console they were unable to gain access to the machine, and ultimately the machine would be power cycled by the technician. The server would come back online, and we would begin investigating to see if we could determine the cause of the problem from the server's log files. In all three instances we found that the machine was apparently under a higher than normal load average at the time of the crash, and twice we found possible causes that could explain the high load. (In one instance a client's site was placing an unusually high demand on the MySQL server; in another there was what appeared to be a Distributed Denial of Service attack taking place against one client's site, but even that was strange in that it was "large enough to cause an issue [ed: for Srv3], but too small/varied to trigger the datacenter's protection systems".)

The most recent crash on September 3rd had one crucial difference, the details of which I covered pretty well yesterday, but to put it as simply as possible: Wednesday's crash had one benefit to us, in that it shined a light on an issue that I'm hopeful will turn out to be the real cause of the problems we've been having with Srv3; only time will tell for sure, but I'm trying to be optimistic. (Optimistic, yes, but I'm still checking on Srv3 every time I sit down at a machine.) Ultimately there are still some questions: Was the bad RAM the cause of the problems on 8/25, 8/29, and 9/3? (If the machine continues to be stable from here on out, as it has so far, my guess is "yes".) Was the problem with the RAM a result of, or related to, the motherboard failure? (Possible. A power spike within the machine could have very well affected the RAM chips as well. Another possibility is that the two chips we deemed "questionable" and had replaced yesterday may not have been the original chips that were in the machine when the board died. In the course of diagnosing and fixing the problem on 8/7, it's entirely possible that they swapped the RAM out at that time as well. We have already asked the datacenter to provide more information on exactly what steps were taken that morning, but unfortunately we have not gotten much beyond the initial "Motherboard replaced". As the datacenter has thousands of machines under their care and a fleet of spare parts on hand for swapping out and repairing, anything is possible as to exactly what the technician did that morning before bringing the machine back online.)

What I do know for sure, though, is that this whole situation revealed something to me. We've failed at something important, and that is what is so prominently displayed on the front page of this very site: "Communication and honesty will be our game." I can't fault the honesty part, because going back through the tickets logged during these events (and boy, are there tickets. Looking at every ticket in the new system since it came online in June, over half of them are the result of these recent issues with Srv3... not sure if that is a testament to how annoying these four outages have been for everyone, or to how quiet things are normally... not complaining, mind you, just a weird thing I noticed going through them all), I can't see where anyone was anything less than forthright and honest in dealing with inquiries from clients.

The "communication" part, however, is where I think we can do a little better. There's a ticket from the first incident on 8/7 that was opened by the client just before 10am, when they noticed their site was down. It was replied to promptly and efficiently, stating what had happened to Srv3, where we were with the repairs, and that we would update as soon as more was known. Shortly thereafter the server came back online, and the ticket was updated again and closed out. Sound good? Not entirely. My question is this: why, almost 7 hours after we knew there was a problem, did the client have to open a ticket in order to get any information on it? Why wasn't there something right on the front page of www.purenrg.com that said "Srv3 Outage" or similar, where they could find everything we knew about the problem and where we stood in getting it fixed?

I’m not saying there needs to be a big blinking red banner across the top of our homepage whenever we reboot a server, I’m not saying the admin team needs to twitter every little configuration change they make to the servers. (Although….. :gets that evil look in the eye:.. No, No, They’d string me up from a tree.) But something such as an outage of this magnitude affecting a large swath of our clients should result in easy to find information and status updates during the outage, available on demand for clients, and not by having to log in and file trouble tickets.

When I log into our homepage, I have a little link in the upper right corner that says "Network Status". I click it, and I see a pretty little page that looks a lot like the Service Detail screen from Nagios (it should look a lot like it; our current monitoring system is heavily based on Nagios). At a single glance I can see if anything is offline, slow, etc. It's a simple little thing, but it's also informative and helpful, so why can't we have something like that for clients as well?
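
(To give a sense of how little magic is behind that internal screen, here is roughly what it boils down to. The snippet is only an illustration, not our code; it assumes a stock Nagios-style status.dat file in its usual block format, and the file path is just an example that will differ per installation.)

    # Illustrative only: summarize host states from a Nagios-style status.dat file.
    # Assumes the stock block format ("hoststatus { key=value ... }"); the path
    # below is just an example and will differ per installation.
    STATUS_FILE = "/usr/local/nagios/var/status.dat"

    def parse_blocks(path, block_type):
        blocks, current = [], None
        with open(path) as handle:
            for raw in handle:
                line = raw.strip()
                if line == block_type + " {":
                    current = {}
                elif line == "}" and current is not None:
                    blocks.append(current)
                    current = None
                elif current is not None and "=" in line:
                    key, _, value = line.partition("=")
                    current[key] = value
        return blocks

    if __name__ == "__main__":
        for host in parse_blocks(STATUS_FILE, "hoststatus"):
            state = "UP" if host.get("current_state") == "0" else "DOWN/UNREACHABLE"
            print("%-20s %s" % (host.get("host_name", "?"), state))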

Now, right now I know there are a couple of people (who know our system) going "You can't give everyone access to our monitoring system! There's a lot of stuff in there we can't let just everybody see whenever they want!"... Some of those same people are going "Wait.. did he just say '...current monitoring system'?" To those people I respond: "I know we can't... and yes, I said -current-." I know there's at least one marketing-degree-holding friend of mine who will say "You want to put news about outages and failures on your front page? Are you insane? That's a sales nightmare; you'd chase away every would-be client who lands on your doorstep. Your 'System News & Announcements' should all be good news and improvements, and your 'Blog' should be happy cheery things like baby puppies and birthdays! Where exactly are you planning on putting all this doom and gloom? Can you at least hide it from people who are not already clients?" To that person I say "You take the good, you take the bad, you take them both and there you have the facts of life." (See, I did learn something growing up in the 80s... from TV, but still.) My point being, we can't praise the good (our normal, highly reliable service) without also acknowledging the bad. I know that's not what they teach at marketing school, but I never went to marketing school; I did, however, go to a Catholic elementary school, and there nice ladies taught me that speaking only half the truth is, well... a lie. So it all goes in: the good, the bad, all of it.

I know there are a few reasons we can't just take that pretty screen I have from our current system and display it for everybody. For one, there are some entries in there that we, or our clients, wouldn't want shared with the world at large. Imagine for a moment that the entire world learned we run outsourced mail servers for both US presidential candidates, because they could see that our monitoring system is watching both "mail.barackobama.com" and "mail.johnmccain.com" for us. Now, first of all, we don't provide service for either of these domains, so our monitoring system doesn't include those entries, but it could, in theory. Or it could contain entries for some other machines where the clients wouldn't want the world to know they've outsourced *X*... *X* could be anything. The point is, we take the privacy of our clients seriously, and I won't jeopardize that, even with silly things like machine names. Secondly, because I know everyone is touchy about politics this season: I picked those two domains just because they're something everyone could easily identify with, since both are highly public figures, and the order I placed them in is *strictly* because that's how the domains fall alphabetically... I'm not promoting, endorsing, or making a political statement of any kind here; I just needed an example, and this was one I figured would get the point across.

So no, a simple export of our current status screen won't work from a privacy standpoint. The only servers I want shown on this page are our standard shared Linux web hosting servers: the ones that many folks are on, the ones that people would most likely be looking for, and the ones that don't run the risk of exposing any information we don't consider pretty much public anyway. Furthermore, there's some more information I'd like to work into the page that's not currently contained in our monitoring system. Along with the current network status, I'd like to see a list of "Current Alerts" with details on any active network issues (something we could update with detailed information as the situation changes), "Past Alerts" with the last few reported problems, and even a "Scheduled Maintenance Windows" section where we could put details of any upcoming scheduled downtimes. It is, I think, entirely doable, extremely informative for everyone involved, and could even be done in such a way as to alleviate the fears of my marketing-degree-carrying friend. (Truthfully, I believe that given our normal track record for quality uptime and service, this page would, in the long run, be a plus for us on the marketing side, not a negative.)
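
(For anyone curious what I'm picturing, here is a very rough sketch of how the public page could be fed: a hard whitelist of the shared hosting servers, plus separate lists for current alerts, past alerts, and scheduled maintenance windows. This is purely hypothetical, nothing our coders have seen yet, and every name and value in it is invented for the example.)

    # Purely hypothetical sketch of a public status page feed: only whitelisted
    # shared-hosting servers are ever exposed, alongside alert and maintenance lists.
    # Server names, alert text, and dates below are invented for the example.
    PUBLIC_SERVERS = ["srv1", "srv2", "srv3", "srv4"]  # shared Linux hosting only

    def build_public_status(all_checks, current_alerts, past_alerts, maintenance):
        """Filter internal monitoring results down to what clients should see."""
        servers = [
            {"name": name, "state": all_checks.get(name, "unknown")}
            for name in PUBLIC_SERVERS
        ]
        return {
            "servers": servers,
            "current_alerts": current_alerts,      # updated as a situation develops
            "past_alerts": past_alerts[-5:],       # only the last few incidents
            "scheduled_maintenance": maintenance,  # upcoming windows, with details
        }

    if __name__ == "__main__":
        internal = {"srv3": "up", "srv4": "up", "mail.secret-client.example": "up"}
        page = build_public_status(
            internal,
            current_alerts=[],
            past_alerts=[{"date": "2008-09-04", "summary": "Srv3 RAM replacement"}],
            maintenance=[{"window": "TBD", "summary": "None scheduled"}],
        )
        print(page)  # the non-whitelisted internal entry never appears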

I'm sure there are some other things we can do to improve communications with clients, and other parts of our service, and if anyone has any suggestions, comments, or even straight-out complaints and grievances, I'd very much like it if they sent them my way (comments here on the blog, the contact us page, a ticket, or even just an email). I always keep an eye out for feedback from clients (and even non-clients), and am always grateful for any ideas or feedback that come my way. I'm sure in time there will be some other changes coming in the way things run here on the site; I just really got focused on the whole "Why wasn't this information on the site? Why only tickets? Didn't we spend all that time developing this new site so that it could be more dynamic and responsive to the needs of everyone? So why didn't we use it?!" problem and spent all of last night fleshing this idea out. (And now, apparently, half of the day typing this entry, which again has gone on far longer than I planned... oh well.)

I’m off to see if I can shake up some coders and give them a project to think about over the weekend.

04 Sep 2008

Aftermath of a Bad Chip, Part 1

As I write this blog entry, I'm sitting here waiting for Srv3 to come back online after having its RAM replaced as part of this afternoon's emergency maintenance.

Over the past couple of weeks we have experienced a number of instances where Srv3 would go offline: responding to pings, but appearing to be swamped and unresponsive to actual web/database/email connections or any type of access at the console. At first we believed these to be related to everything from DDoS attacks and excessive client resource usage to a faulty motherboard. It seemed that every crash came at a time when the machine was under a heavier than normal workload due to these types of things. We would see the crash, recover the system, see how much workload it was under, dig into the workload and logs, see some suspicious traffic or a script running out of control, and figure that was the cause of the crash. After all, it wouldn't be the first time a DDoS attack or a runaway client program had managed to take one of our systems offline, and while we've gotten pretty good at making our servers resilient under normal circumstances, occasionally we still see one get brought to its knees in this manner. (It's rare, but it happens. When it does, we learn from the experience and re-tool our protections to prevent the same thing from doing it a second time.)
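
(That "answers pings but serves nothing" state is exactly why service checks have to go beyond ICMP. As a rough illustration only, and not our monitoring code, the snippet below connects to the actual service ports, so a machine that is up at the network layer but wedged at the application layer still shows as a problem; the hostname and port list are just examples.)

    # Illustration: a ping can succeed while the services are wedged, so test the
    # actual service ports with a TCP connect. Hostname and ports are examples only.
    import socket

    SERVICE_PORTS = {"http": 80, "smtp": 25, "mysql": 3306}

    def check_services(host, timeout=5):
        results = {}
        for name, port in SERVICE_PORTS.items():
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    results[name] = "ok"
            except OSError as exc:
                results[name] = "FAILED (%s)" % exc
        return results

    if __name__ == "__main__":
        for service, status in check_services("srv3.purenrg.com").items():
            print("%-6s %s" % (service, status))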

However, last night, when the server went offline in a similar fashion around 6pm, we rebooted it and found ourselves with a new, previously unseen symptom: only half of the RAM in the system was recognized and usable. We had to reboot the server a second time later in the night to get the entire amount of system RAM back online. This led us to question the stability of the RAM modules in the machine, as even if a chip is not dead but is acting "flaky", it could run fine for days or weeks, yet under load conditions it could very well fail and cause problems for the entire server.

So in light of this new, possibly connected symptom (and given that we're not huge believers in rare coincidence, especially when it comes to things of a technical nature), we ran some hardware tests on srv3 last night. This revealed that one of the two memory modules in the server was "questionable"; while the memory was not failing completely, it was generating some errors under heavy read/write operations, more than we are comfortable seeing for any machine in production use. Srv3 does use ECC memory, which is designed to catch and correct errors when reading from and writing to the chips, but if a chip is encountering errors on a constant basis under heavy load, it could very well be the cause of the problems we've been seeing, with the machine simply becoming unresponsive while under heavy load.
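
(Both of those symptoms, missing RAM after a boot and ECC errors under load, are things that can be watched for automatically. The sketch below is only an illustration, assuming a Linux box where /proc/meminfo reports total memory and where the kernel's EDAC interface, if the board and kernel expose it, publishes corrected-error counts under /sys/devices/system/edac/mc; the expected-memory figure is made up for the example.)

    # Illustration: flag a box that comes up with less RAM than expected, and report
    # ECC corrected-error counts if the kernel's EDAC sysfs interface is present.
    # The EXPECTED_KB value is invented for the example.
    import glob

    EXPECTED_KB = 8 * 1024 * 1024  # e.g. a box that should have 8 GB installed

    def installed_memory_kb():
        with open("/proc/meminfo") as handle:
            for line in handle:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1])  # value is reported in kB
        return 0

    def ecc_corrected_errors():
        counts = {}
        for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
            with open(path) as handle:
                counts[path] = int(handle.read().strip())
        return counts  # empty dict if EDAC isn't available on this kernel/board

    if __name__ == "__main__":
        total = installed_memory_kb()
        if total < EXPECTED_KB * 0.9:  # allow some slack for kernel-reserved memory
            print("WARNING: only %d kB visible, expected roughly %d kB" % (total, EXPECTED_KB))
        for path, count in ecc_corrected_errors().items():
            print("%s = %d corrected errors" % (path, count))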

The admin team now believes that these periods of abnormally high workload and stress on the machine were more the "trigger" for the crashes than the root cause, with the high workload causing the memory errors to surface and compound the load problem until the server simply crashed.

Early this morning we scheduled an emergency maintenance window with the datacenter for late this afternoon (4pm to 6pm, CDT) for them to take the server offline and replace the RAM modules. Normally we would schedule these windows during a more off-peak time period, but due to the nature of the situation, and the fact that in today’s 24x7x365 global web environment, there's really no “off-peak” time for our clients or us, we wanted to get this memory replaced as soon as possible. We're hoping that once the RAM modules are replaced, things can go back to a normal, everyday quiet routine for Srv3 and the clients housed on it.

A few moments ago I received word that Srv3 was back up and running with the new RAM modules, and so far preliminary testing shows no errors with the new modules. Hopefully this means our little period of strangeness is over and things can get back to normal around these parts.

So that's where we're at from a technical standpoint: we go forward and keep an eye out for any more strangeness. (I personally won't rest easy for at least two weeks; every time I sit down at a console I'll be doing a quick "Is Srv3 up and okay?" check for myself, because even though I know the monitoring system should alert us quickly if it's not, I will simply feel better checking for myself after the strange way things have gone down over the last few weeks.) From a more overall business standpoint, and from what we've learned here, we obviously have some work ahead of us, and some new goals set for ourselves as a result of this entire experience.

And those topics, what I had really wanted to discuss here, will, I fear, have to wait for a follow-up blog post, as this entry has already gotten far more long-winded than the quick note I originally planned, and there's quite a bit more that needs to be discussed. I'm going to check Srv3 one more time, and then brave the after-rush-hour traffic home. Hopefully a few hours of mindless relaxation (away from an SSH terminal session) will give me a little perspective and time to collect my thoughts on the rest of the issues; tomorrow I'll sit down and put together a better outline of exactly what happened, what went wrong, and how we're going to fix it, learn from it, and move forward from here.

04 Sep 2008

Unscheduled Maintenance Window for Srv3

We have a previously unscheduled (well, unscheduled prior to today) maintenance window for Srv3 set for Thursday, September 4th, from 4:00pm to 6:00pm CDT.

We anticipate that actual downtime for the machine should be less than 30 minutes, but the exact start time will be dependent upon the technicians in the Data Center.

This window is to replace the RAM modules in Srv3, which we hope will rectify the recent problems we have been experiencing with the server.

All clients on Srv3 should have received an email early this morning when this window was first set, but we wanted to get something out on our website just in case anyone missed the original email.

Update @ 5:33pm Eastern: Srv3 was taken offline for the RAM replacement at approximately 5:20pm Eastern; we are currently awaiting word from the Datacenter on its return.

Update @ 5:51pm Eastern: Srv3 is back up as of 5:45pm and servicing requests at this time. The Datacenter technician is performing some basic testing to ensure the new RAM is functioning properly, but so far so good; the server may be a touch slower than normal while the testing is in progress.

Update @ 6:02pm Eastern: Datacenter reports that basic testing on the new RAM is showing all clean. We will continue to monitor Srv3 over the next few hours, but hopefully the faulty RAM was the problem all along.

(c) 2020 Pure Energy Systems LLC - All rights reserved.
