09 Sep 2008

The Coming Death of PHP4

As we've previously announced and discussed, we will be dropping support for PHP4 entirely from our servers, as it has been discontinued by the PHP team and no further fixes or upgrades will be released for it. We currently estimate this process will be completed across all servers before the end of October, barring any severe security exposure in the current PHP4 release before then that would cause us to accelerate its removal.

We do not anticipate this being a widespread problem, as all client sites have been defaulting to PHP5 for well over seven months now. If by some chance you have a PHP script on your account that you are still specifically running in PHP4 mode, and you have been putting off upgrading it to work with PHP5, now is definitely the time to do so.
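
If you're not sure which PHP version is actually serving a given script, one quick way to check is to drop a small test file into the same directory and load it in a browser. This is only a minimal sketch (the file name is just an example); it does nothing but report the PHP version that handled the request:

    <?php
    // version-check.php (example name only)
    // Upload this next to the script in question and visit it in a browser;
    // it reports which PHP version served the request.
    echo 'This file was served by PHP ' . phpversion();
    ?>

If it reports a 4.x version, that directory or account is still running in PHP4 mode and will be affected when PHP4 is removed.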

05 Sep 2008

Aftermath of a Bad Chip, Part 2

A continuation from yesterday’s post.

These strange issues with Srv3 all started on August 7th, when Srv3 stopped responding to all requests just after midnight. Complicating matters, our monitoring system failed to dispatch the alerts it should have. Later investigation would reveal this was due to the InnoDB file used to log the status of the checks being full (a programming flaw in our monitoring system that has since been corrected), and we ended up unaware there was even a problem until the first client trouble tickets came in a little over three hours later. When we did realize there was a problem, we found ourselves unable to gain access to the machine remotely and contacted the datacenter around 4am. They investigated, and the report came back "Machine will not POST, Investigating further", which is something nobody ever wants to hear regarding a server. (Basically, the machine had power and would turn "on", but nothing ever came up on the console, nor would it ever actually boot; it just sat with a blank screen.) It was almost 11am before the server came back online; the outcome report from the datacenter was "Motherboard was dead. Replaced from Stock, Server now back online."

To be honest, we will probably never know exactly what caused the motherboard to die: a power spike of some type, a loose connection in the machine, or even just a board simply dying for no reason. (I used to refer to this as "the magic escaped"; if you watch closely, with some components (power supplies, for instance) you can actually see the magic escape from the device when it goes bad. Some people would say the magic looks like smoke and smells like smoke, but I insist it's actually magic, because while I know how a power supply works and what it does, I certainly couldn't build one or do the job myself, so it must be magic.) We've had components go bad before, and motherboards are no exception, although the incident rate of such hardware failures for us has dropped considerably since we switched datacenter providers in May 2007. Our old datacenter had a practice of using the "most economical hardware available on any given day", which sometimes led to questions about the quality of that hardware, while the new datacenter provider uses boards such as the one in Srv3, a SuperMicro X7DBR-E, which while not the cheapest motherboard on the market is a fairly well-respected piece of server-grade hardware. Anyway, before I digress further than I already have: for whatever reason, the motherboard had died, the datacenter replaced it, and we figured this was our one severe system failure for the season, and we could be happy and carefree for months to come.

Unfortunately this was not the case, and we found ourselves faced with crashes on August 25th, August 29th, and most recently September 3rd. All three cases played out in similar fashion: the alert system would go off, and we would find ourselves unable to access the server in any fashion remotely. (This reminds me, future blog post: remote access, how we do it, why, and what it gives us in a disaster.) Our policy in these situations has always been to try to determine the cause of the problem, even if it means a slightly longer downtime (without knowing for sure what caused a problem, it becomes much harder to protect against it in the future), so we would contact the datacenter to have a technician investigate at the machine. (We actually have the ability to do a hard power reset remotely via the power strip without involving anyone, but doing so would wipe out any way of knowing exactly what was going on in the machine at the time.)

In all three instances the technician reported that even at the local console they were unable to gain access to the machine, and ultimately the machine was power cycled by the technician. The server would come back online, and we would begin digging through the server's log files to see if we could determine the cause of the problem. In all three instances we found that the machine was apparently under a higher than normal load average at the time of the crash, and twice we found possible causes that could explain the high load. (In one instance a client's site was placing an unusually high demand on the MySQL server; in another there was what appeared to be a Distributed Denial of Service attack taking place against one client's site, though even that was strange in that it was "large enough to cause an issue [ed: for Srv3], but too small/varied to trigger the datacenter's protection systems".)

The most recent crash on September 3rd had one crucial difference, the details of which I covered pretty well yesterday, but to put it as simply as possible: Wednesday's crash had one benefit to us, in that it shined a light on an issue that I'm hopeful will turn out to be the real cause of the problems we've been having with Srv3. Only time will tell for sure, but I'm trying to be optimistic. (Optimistic, yes, but I'm still checking on Srv3 every time I sit down at a machine.) Ultimately there are still some questions. Was the bad RAM the cause of the problems on 8/25, 8/29, and 9/3? (If the machine continues to be stable from here on out, as it has so far, my guess is "yes".) Was the problem with the RAM a result of, or related to, the motherboard failure? (Possible. A power spike within the machine could very well have affected the RAM chips as well. Another possibility is that the two chips we deemed "questionable" and had replaced yesterday may not have been the original chips that were in the machine when the board died; in the course of diagnosing and fixing the problem on 8/7, it's entirely possible that the technician swapped the RAM out at that time as well. We have asked the datacenter to provide more information on exactly what steps were taken that morning, but unfortunately we have not gotten much beyond the initial "Motherboard replaced". As the datacenter has thousands of machines under their care and a fleet of spare parts on hand for swapping out and repairing, anything is possible as to the exact course the technician took that morning before bringing the machine back online.)

What I do know for sure, though, is that this whole situation revealed something to me. We've failed at something important, and that is what is so prominently displayed on the front page of this very site: "Communication and honesty will be our game." I can't fault the honesty part, because going back through the tickets logged during these events (and boy, are there tickets; looking at every ticket in the new system since it came online in June, over half of them are the result of these recent issues with Srv3... I'm not sure if that's a testament to how annoying these four outages have been for everyone, or to how quiet things are normally. Not complaining, mind you, just a weird thing I noticed going through them all), I can't see anywhere that anyone was anything less than forthright and honest in dealing with inquiries from clients.

The "communication" part, however, is where I think we can do a little better. There's a ticket from the first incident on 8/7 that was opened by a client just before 10am, when they noticed their site was down. It was replied to promptly and efficiently, stating what had happened to Srv3, where we were with the repairs, and that we would update as soon as more was known. Shortly thereafter the server came back online, and the ticket was updated again and closed out. Sound good? Not entirely. My question is this: why, almost seven hours after we knew there was a problem, did the client have to open a ticket in order to get any information on it? Why wasn't there something right on the front page of www.purenrg.com that said "Srv3 Outage" or similar, where they could find everything we knew about the problem and where we stood in getting it fixed?

I'm not saying there needs to be a big blinking red banner across the top of our homepage whenever we reboot a server, and I'm not saying the admin team needs to twitter every little configuration change they make to the servers. (Although... :gets that evil look in the eye:... no, no, they'd string me up from a tree.) But something such as an outage of this magnitude, affecting a large swath of our clients, should result in easy-to-find information and status updates during the outage, available on demand to clients, not only by logging in and filing trouble tickets.

When I log into our homepage, I have a little link in the upper right corner that says "Network Status". I click it, and I see a pretty little page that looks a lot like the Service Detail screen from Nagios (it should look a lot like it; our current monitoring system is heavily based on Nagios), and at a single glance I can see if anything is offline, slow, etc. It's a simple little thing, but it's also informative and helpful, so why can't we have something like that for clients as well?

Now, right now I know there are a couple people (who know our system) going "You can't give everyone access to our monitoring system! There's a lot of stuff in there we can't let just anybody see whenever they want!" Some of those same people are going "Wait... did he just say '...current monitoring system'?" To those people I respond: "I know we can't... and yes, I said -current-." I know there's at least one marketing-degree-holding friend of mine who will say "You want to put news about outages and failures on your front page? Are you insane? That's a sales nightmare; you'd chase away every would-be client who lands on your doorstep. Your 'System News & Announcements' should all be good news and improvements, and your 'Blog' should be happy, cheery things like baby puppies and birthdays! Where exactly are you planning on putting all this doom and gloom? Can you at least hide it from people who are not already clients?" To that person I say: "You take the good, you take the bad, you take them both and there you have the facts of life." (See, I did learn something growing up in the 80s... from TV, but still.) My point being, we can't praise the good (our normal, highly reliable service) without also acknowledging the bad. I know that's not what they teach at marketing school, but I never went to marketing school. I did, however, go to a Catholic elementary school, where nice ladies taught me that speaking only half the truth is, well... a lie; so it all goes in, the good, the bad, all of it.

I know there are a few reasons we can't just take that pretty screen I have from our current system and display it for everybody. For one, there are some entries in there that we, or our clients, wouldn't want shared with the world at large. Imagine for a moment that the entire world learned we run outsourced mail servers for both US presidential candidates, because they see that our monitoring system is watching both "mail.barackobama.com" and "mail.johnmccain.com" for us. Now, first of all, we don't provide service for either of those domains, so our monitoring system doesn't include those entries, but it could, in theory. Or it could contain entries for some other machines where the clients wouldn't want the world to know they've outsourced *X*... *X* could be anything. The point is, we take the privacy of our clients seriously, and I won't jeopardize that, even with silly things like machine names. Secondly, because I know this is the season for everyone to be touchy about politics: I picked those two domains only because they're something everyone could easily identify with, since both are highly public figures, and the order I placed them in is *strictly* how the domains fall alphabetically. I'm not promoting, endorsing, or making a political statement of any kind here; I just needed an example, and this was one I figured would get the point across.

So no, a simple export of our current status screen won't work from a privacy standpoint. The only servers I want shown on this page are our standard shared linux web hosting servers: the ones that many folks are on, the ones people would most likely be looking for, and the ones that don't run the risk of exposing any information we don't consider pretty much public anyway. Furthermore, there's some more information I'd like to work into the page that's not currently contained in our monitoring system. Along with the current network status, I'd like to see a list of "Current Alerts" with details on any active network issues (something where we could update each alert with detailed information as the situation changes), "Past Alerts" with the last few reported problems, and even a "Scheduled Maintenance Windows" section where we could post details of any upcoming scheduled downtimes. It is, I think, entirely doable, extremely informative for everyone involved, and could even be done in such a way as to alleviate the fears of even my marketing-degree-carrying friend. (Truthfully, I believe that given our normal track record for quality uptime and service, this page would, in the long run, be a plus for us on the marketing side, not a negative.)
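
To make the idea a little more concrete, here is a very rough sketch of how such a page could be driven. None of this reflects our actual monitoring system or its data; the field names and the sample entries are purely illustrative:

    <?php
    // Hypothetical data for a public "Network Status" page: a handful of
    // hand-maintained records, updated as a situation develops.
    $alerts = array(
        array(
            'title'   => 'Srv3 outage (sample entry)',
            'status'  => 'resolved',                    // 'active' or 'resolved'
            'updated' => '2008-09-03 (evening, CDT)',
            'details' => 'Server power cycled; RAM replaced during emergency maintenance.',
        ),
    );
    $maintenance = array(
        array('title' => 'RAM replacement on Srv3 (sample entry)', 'when' => '2008-09-04, 4pm to 6pm CDT'),
    );

    // Render the three sections described above.
    foreach ($alerts as $alert) {
        $section = ($alert['status'] == 'active') ? 'Current Alerts' : 'Past Alerts';
        echo $section . ': ' . $alert['title'] . ' (last update: ' . $alert['updated'] . ")\n";
        echo '  ' . $alert['details'] . "\n";
    }
    foreach ($maintenance as $window) {
        echo 'Scheduled Maintenance: ' . $window['title'] . ' (' . $window['when'] . ")\n";
    }
    ?>

The real version would obviously pull its records from wherever we end up keeping them, but the point is that even a handful of manually updated entries would give clients a useful at-a-glance status page.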

I'm sure there are other things we can do to improve communications with clients, and other parts of our service, and if anyone has any suggestions, comments, or even straight-out complaints and grievances, I'd very much like it if they sent them my way (comments here in the blog, the contact-us page, a ticket, or even just an email). I always keep an eye out for feedback from clients (and even non-clients), and am always grateful for any ideas or feedback that come my way. I'm sure in time there will be other changes in the way things run here on the site; I just really got focused on the whole "Why wasn't this information on the site? Why only tickets? Didn't we spend all that time developing this new site so that it could be more dynamic and responsive to the needs of everyone? So why didn't we use it?!" problem, and spent all of last night fleshing this idea out. (And now, apparently, half of the day typing this entry, which has again gone on far longer than I planned... oh well.)

I’m off to see if I can shake up some coders and give them a project to think about over the weekend.

04 Sep 2008

Aftermath of a Bad Chip, Part 1

As I write this blog entry, I'm sitting here waiting for Srv3 to come back online after having its RAM replaced as part of this afternoon's emergency maintenance.

Over the past couple of weeks we have experienced a number of instances where Srv3 would go offline: responding to pings, but appearing to be swamped and unresponsive to actual web/database/email connections or any type of access at the console. At first we believed these to be related to everything from DDoS attacks and excessive client resource usage to a faulty motherboard. It seemed that every crash came at a time when the machine was under a heavier than normal workload due to these kinds of things. We would see the crash, recover the system, see how much workload it was under, dig into the workload and logs, find some suspicious traffic or a script running out of control, and figure that was the cause of the crash. After all, it wouldn't be the first time a DDoS attack or a runaway client program had managed to take one of our systems offline, and while we've gotten pretty good at making our servers resilient under normal circumstances, occasionally we still see one get brought to its knees in this manner. (It's rare, but it happens. When it does, we learn from the experience and re-tool our protections to prevent the same thing from doing it a second time.)

However, last night, when the server went offline in a similar fashion around 6pm, we rebooted it and found ourselves with a new, previously unseen symptom: only half of the RAM in the system was recognized and usable. We had to reboot the server a second time later in the night to get the full amount of system RAM back online. This led us to question the stability of the RAM modules in the machine, as even if a chip is not dead but is acting "flaky", it can run fine for days or weeks and then fail under load conditions and cause problems for the entire server.

So in light of this new, possibly connected symptom (and given that we're not huge believers in rare coincidence, especially when it comes to things of a technical nature), we ran some hardware tests on Srv3 last night. This revealed that one of the two memory modules in the server was "questionable"; while the memory was not failing completely, it was generating some errors under heavy read/write operations, more than we are comfortable seeing on any machine in production use. Srv3 does use ECC memory, which is designed to catch and correct errors when reading from and writing to the chips, but if a chip is encountering errors constantly under heavy load, it could very well be the cause of the problems we've been seeing, with the machine simply becoming unresponsive under heavy load.

The admin team now believes that these periods of abnormally high workload and stress on the machine were more the "trigger" for the crashes than their root cause, with the high workload causing the memory errors to surface and compound the load problem until the server simply crashed.

Early this morning we scheduled an emergency maintenance window with the datacenter for late this afternoon (4pm to 6pm, CDT) for them to take the server offline and replace the RAM modules. Normally we would schedule these windows during a more off-peak time period, but due to the nature of the situation, and the fact that in today’s 24x7x365 global web environment, there's really no “off-peak” time for our clients or us, we wanted to get this memory replaced as soon as possible. We're hoping that once the RAM modules are replaced, things can go back to a normal, everyday quiet routine for Srv3 and the clients housed on it.

A few moments ago I received word that Srv3 was back up and running with the new RAM modules, and so far preliminary testing shows no errors with the new modules. Hopefully this means our little period of strangeness is over and things can get back to normal around these parts.

So that's where we're at from a technical standpoint: we go forward and keep an eye out for any more strangeness. (I personally won't rest easy for at least two weeks; every time I sit down at a console I'll be doing a quick "Is Srv3 up and okay?" check for myself, because even though I know the monitoring system should alert us quickly if it's not, I'll simply feel better checking for myself after the strange way things have gone down over the last few weeks.) From a broader business standpoint, and from what we've learned here, we obviously have some work ahead of us, and some new goals set for ourselves as a result of this entire experience.

And those topics, what I had really wanted to discuss here, will, I fear, have to wait for a follow-up blog post, as this entry has already gotten far more long-winded than the quick note I originally planned, and there's quite a bit more that needs to be discussed. I'm going to check Srv3 one more time, and then brave the after-rush-hour traffic home. Hopefully a few hours of mindless relaxation (away from an SSH terminal session) will give me a little perspective and time to collect my thoughts on the rest of the issues; tomorrow I'll sit down and put together a better outline of exactly what happened, what went wrong, and how we're going to fix it, learn from it, and move forward from here.

16 Jul 2008

MySQL upgraded to v5

MySQL has been upgraded across all servers to v5.0.51a. This upgrade finally moves us off the legacy v4 branch and brings with it many improvements and tweaks.

While there was a snag on one server that resulted in approximately 15 minutes of unscheduled problems on that particular machine (PHP content was not being served properly), overall the upgrade went smoothly and we are looking forward to continued improvements from the MySQL team.

22 May 2008

Migration of ns3.purenrg.com

Tomorrow we will begin the process of replacing ns3.purenrg.com, one of the nameservers used for client domains. The new ns3 has been online for the last week, quietly staying synced with ns4.purenrg.com and servicing the requests we have sent its way, while we tested to ensure everything was up to speed.

Tomorrow we will update the nameserver record for ns3.purenrg.com to point it to the IP address of the new machine ( 208.43.104.254 ), and over the weekend all DNS traffic for client domains should drift away from the old server and over to the new one.

** There should be -no change or action- required on your part for this **

If your domain name is pointed (as most are) to ns3.purenrg.com & ns4.purenrg.com, our updating of the record for ns3.purenrg.com with the registrar should be the only change needed at all. When you pointed your domain to our nameservers, you simply told your domain provider "ns3.purenrg.com" and "ns4.purenrg.com"; the record we update tomorrow will be reflected in your domain's records once the change makes its way out to all the root nameservers.

If your domain name is not pointed to our nameservers (i.e., you are using some other DNS service for your domain), then you are not using ns3.purenrg.com anyway and are unaffected by this change.

The only situation where anyone should need to change anything is the following:

1) You have a monitoring program or service that is hard-coded to watch the old ns3.purenrg.com machine by IP address ( 64.246.42.242 ). You will want to change said program or service to watch the new IP address ( 208.43.104.254 ) instead. (A quick check like the sketch below can confirm which address the name resolves to from your location.)
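
For anyone who wants to verify what their corner of the internet sees, here is a minimal sketch (the file name is just an example); it asks your local resolver where ns3.purenrg.com currently points and compares the answer against the old and new addresses:

    <?php
    // check-ns3.php (example name only)
    // Asks the local resolver for ns3.purenrg.com and compares the result
    // against the old and new IP addresses mentioned above.
    $old = '64.246.42.242';
    $new = '208.43.104.254';
    $resolved = gethostbyname('ns3.purenrg.com');

    if ($resolved == $new) {
        echo "ns3.purenrg.com resolves to the new machine ($resolved).\n";
    } elseif ($resolved == $old) {
        echo "ns3.purenrg.com still resolves to the old machine ($resolved); the update has not reached your resolver yet.\n";
    } else {
        echo "ns3.purenrg.com resolves to $resolved.\n";
    }
    ?>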

After updating the main record tomorrow, and giving it the weekend to filter out around the internet, we will monitor activity on the old ns3 machine next week to see if there are any requests still coming in to it. We do not intend to take the old server offline until at least the following week, once we are sure everything out there has updated properly.

Just wanted to keep everyone up to date on this change, even though we do not see it actually affecting anyone, aside from that one footnote. 🙂

28 Mar 2008

Bandwidth Warning Notifications Tweaked

The domain yourdomain.domain (account) is about to exceed their bandwidth limit (X/X Megs)

The above is probably a pretty familiar email subject to some of our clients. In the past, the cPanel servers would auto-notify both the client and our support team whenever an account exceeded 80% of its monthly bandwidth allowance. It's not uncommon for us to see these warning emails start rolling in towards the end of the month for a number of accounts.

The original idea was to provide adequate warning to clients so that if they anticipated going over their limit that month, arrangements could be made.

The problem is that as our plans have grown in size, we never adjusted the point at which these warnings were sent out; the warnings have always started going out when an account reached 80%. With the bandwidth growth of our accounts over the last year, 80% is not necessarily the best place for the warnings to start anymore. For instance, our smallest plan (Linux Bronze) now comes with 5 gigabytes of transfer each month, making the 80% mark 4 gigabytes, while our largest standard plan (Linux Gold) comes with 50 gigabytes, making the "warning" trigger 40 gigabytes.

Long story short, we've adjusted the “trigger” to be 95% of the monthly limit now. This should help cut down on the number of clients who regularly get the warning emails at the end of every month, yet never actually have to worry about running over their limits.
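
For the curious, the change is nothing more than moving the trigger point. A small sketch of the arithmetic, using the plan sizes mentioned above (the function name is just for illustration):

    <?php
    // Warning threshold = plan limit x trigger percentage.
    function warning_threshold($limit_gb, $trigger_percent) {
        return $limit_gb * ($trigger_percent / 100);
    }

    echo warning_threshold(5, 80) . " GB\n";   // Linux Bronze, old 80% trigger: 4 GB
    echo warning_threshold(5, 95) . " GB\n";   // Linux Bronze, new 95% trigger: 4.75 GB
    echo warning_threshold(50, 80) . " GB\n";  // Linux Gold, old 80% trigger: 40 GB
    echo warning_threshold(50, 95) . " GB\n";  // Linux Gold, new 95% trigger: 47.5 GB
    ?>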

12 Feb 2008

All Shared Linux Hosting Plans Upgraded!

We are happy to announce that, effective immediately, all of our standard shared hosting linux accounts (Linux Bronze through Linux Gold) are being upgraded with increased resource allocations. This upgrade applies to disk space and monthly bandwidth transfer, as well as amenities such as the limits on the number of email accounts, domains, and so on available to each account. On average, plans have received approximately double the disk space and a 60-70% increase in their bandwidth limits in this latest round of upgrades.

The new plan limits are now posted on our website, and current clients need do nothing to start utilizing the new limits; all client accounts have been upgraded automatically.

04 Feb 2008

PHP5 now the default for .php files

In the two and a half weeks since we deployed PHP5 as an option for all clients, we've received only a handful of tickets regarding script compatibility with PHP5, all of which have been resolved at this time. Most of the issues that turned up involved the need for a specific PHP5 module that we missed in the initial upgrade process.

As the upgrade appears to have gone smoothly thus far, the transition to using PHP5 as the default for all *.php files was performed this morning. At this time, all *.php scripts are running under the PHP5 system across all servers.

Any clients who have scripts that still require the older PHP4 system can still use it, either by naming their PHP scripts *.php4 or by changing the default PHP version for their account from within their cPanel screen. Please keep in mind that the PHP team has announced there will be no upgrades to the PHP4 line, including security fixes, past August 8th, 2008, which means we will most likely be removing PHP4 sometime shortly after that date.

17 Jan 2008

PHP5 available on all servers

We are happy to announce that effective immediately, PHP5 is now available on all shared linux hosting accounts here at Pure Energy.

Clients can utilize PHP5 by either changing their default PHP version inside cPanel (cPanel -> PHP Configuration), or by naming PHP files with the .php5 extension. On February 4th we will switch the default for all *.php files on our servers over to PHP5. If clients require the use of PHP4 beyond that date, they will need to use the PHP Configuration Screen in cPanel to select PHP4 for their account, or rename their PHP files to .php4.
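
For scripts that genuinely require PHP5, it can also be worth adding a small runtime guard so they fail with a clear message, rather than with confusing errors, if they somehow end up being executed under PHP4. A minimal sketch; the exact wording and behaviour here are just one way to do it:

    <?php
    // Stop with a clear message if this script is running under anything older than PHP 5.
    if (version_compare(PHP_VERSION, '5.0.0', '<')) {
        die('This script requires PHP5, but is currently running under PHP ' . PHP_VERSION . '. '
            . 'Please rename the file to .php5 or change the default PHP version in cPanel.');
    }

    echo 'Running under PHP ' . PHP_VERSION;
    ?>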

The PHP4 system will be kept in place as an option for those users who require it, as long as a secure version of the PHP4 branch is still available from the PHP team. The PHP Team has announced their plans to continue releasing security fixes through August 2008, so we anticipate keeping PHP4 available as an option until that time.

17 Jul 2007

Shared Hosting Plans Upgraded…

Today we are happy to announce that all of our shared linux web hosting plans have been upgraded in size, while retaining the same sensible rates. On average, each plan has been enlarged by 27% in terms of disk space and monthly bandwidth allowance. This enhancement comes courtesy of our continued pledge to offer sustainable, reasonably priced services for our clients. All clients with shared linux web hosting accounts gain access to the increased bandwidth and disk space effective immediately, with no action required on their part to take advantage of the change.
