
June 30, 2012

Amazon Cloud: Two power outages in two weeks

Filed under: Datacenter — Nate @ 11:54 am

By now you should know I’m no fan of Amazon’s cloud; using it makes me feel like I’m stuck in the 90s. I’ve been using it quite a bit for the past two years (with two different companies) but I’m finally about to get the hell out of there. The last set of systems is set to migrate before my trip to Seattle.

Last week they had an outage in one of their availability zones. It took them well over an hour to admit it was a power outage; at first they tried to say “oh, some volumes are experiencing increased latency”. What a load of crap. It should take all of 5 seconds to know there is a power outage. The stuff I manage had minor impact, fortunately, since we are down to just a few things left; we lost some stuff but none of it critical.

Then last night they had another one, which seems to have made some news too.

A slew of sites, including Netflix, Instagram and Pinterest, have gone down this evening, thanks to “power issues” at Amazon’s Elastic Compute Cloud data center in North Virginia. The websites rely on Amazon’s cloud services to power their services. Some pretty violent storms in the region are apparently causing the problems.

This had slightly more impact on the stuff I’m responsible for; one of my co-workers handled the issues and fortunately there wasn’t much to do. I can only imagine the havoc at a larger organization like one of the above that depends more heavily on their cloud.

What a lot of people don’t realize, though, is that these two outages aren’t really considered outages in Amazon’s mind, at least for that region, because only one data center (or part of one data center) went offline. Their SLA is worded so that they exempt themselves from the effects of such an outage and put the onus on the customer to deal with it. I suspect these facilities aren’t even Tier IV, because Tier IV is expensive and Amazon is about cheap. If they were Tier IV a simple storm wouldn’t have caused equipment to lose power.

I remember a couple years ago the company I was at had some gear co-located near Chicago at an Equinix site when, if I remember right, some big storms and flooding rolled through. We didn’t have redundant power of course (more on that below), but there was no impact to the equipment other than an email telling us the site had been on generator power for some time, and then another email saying the site was back on utility power.

There are exceptions of course, poor design being one. I think back to what was once Internap‘s premier data center in Seattle, Fisher Plaza, which was plagued by power issues and eventually suffered more than 24 hours of downtime due to a fire, knocking out many well known sites like Bing Travel as well as many others. It took them months to repair the facility; they had generator trucks sitting out front providing power 24/7 during the repairs. From a storage perspective I remember being told stories of at least one or two customers’ NetApp equipment taking more than 24 hours to come back online (file system checks), and I’m sure folks that had battery-backed cache were in sort of a panic not knowing when or if power would be restored to the facility. Some of my friends were hosted there at another company with a really small 3PAR array and were not worried, though, because 3PAR systems dump their cache to an internal disk on the controller when the power goes out, so batteries are not required past that point. Since cache is mirrored there are two copies of it stored on different disks. Some newer systems have fancy flash-backed cache that is even nicer.

Fisher Plaza for a while had about one power outage per year, every year, for at least 3 years in a row, including the somewhat famous EPO event where someone went out of their way to hit the Emergency Power Off switch (there was no emergency) and shut down the facility. After that all customers had to go through EPO training, which was humorous.

Being the good operations person that I am, shortly after I started work at a company back in 2006 that was hosted at Fisher Plaza, I started working on plans to move them out – the power issues were too much to bear. We still had about nine months left in our contract and my boss was unsure how we could leave before that was up given it would cost a lot. I had an awesome deal on the table from a local AT&T facility which I had good experiences with (though density-wise they are way outdated, and after an AT&T re-organization in around 2008 I wouldn’t even consider AT&T as a data center provider now). Anyways, I had this great deal and wanted to move, but we had a hard time getting past the fact that we still owed a ton on the Internap contract and couldn’t get out of it. Then Fisher Plaza had another power outage (this was in 2006, the fire was three years later). The VP said to us something along the lines of “I don’t care what it takes, I want to get out of there now.” Music to my ears; things got moving quickly and we moved out within a month or so. I was hosted at that AT&T data center for a good 5 years personally, and the companies I was at were hosted there for, I want to say, a good 8-9 years between the two without a single power event that I am aware of. I was there once when the facility lost power, but the data center floor was unaffected. I believe there were a few other power outages, but again nothing impacting customer equipment.

There are other bad designs out there too – personally I consider anything that relies on a flywheel UPS to be badly designed, because there isn’t enough time for on-site personnel to manually try to correct a situation before the UPS runs out of juice. I want at least 10-15 minutes at full load.
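To put rough numbers on that, here’s a back-of-the-envelope sketch in Python. The stored energy and load figures are made up for illustration (not from any vendor spec), but they show why a flywheel’s ride-through is measured in seconds while a battery string can be sized for the 10-15 minutes I want:

```python
# Rough ride-through comparison. Both energy figures and the load are
# assumed, illustrative numbers, not from any vendor spec.

def ride_through_seconds(stored_energy_joules, load_watts):
    """Seconds the stored energy lasts at a given load."""
    return stored_energy_joules / load_watts

load = 250000.0         # 250 kW of critical load (assumed)
flywheel_energy = 4e6   # roughly 4 MJ usable in a flywheel module (assumed)
battery_energy = 225e6  # battery string sized for 15 minutes (250 kW x 900 s)

print(ride_through_seconds(flywheel_energy, load))        # 16.0 seconds
print(ride_through_seconds(battery_energy, load) / 60.0)  # 15.0 minutes
```

Seconds is fine as long as the generators start on the first try, every time. It leaves basically zero margin for a human to step in when they don’t.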

Internap later opened a newer, fancier data center down in Tukwila in a facility owned by Sabey. That is a massive campus; they claim 1.2M square feet of data center space, and there is a large Microsoft presence there as well. On one of my tours of the facility I asked their technical people whether they use a real UPS or a flywheel, and they said a real UPS. They commented how Microsoft, literally next door, used flywheels, and how Microsoft is seemingly constantly running their generators (far more frequently than your typical routine load testing). They did not know specifically why, but speculated maybe Microsoft doesn’t trust the flywheels, and laughed with me. That same Internap facility had another power outage shortly after it opened, though that one was human error. Apparently there was some fault in a UPS, some person did something bad, and the only way to fix it was to shut everything down. Internap claimed they addressed that problem by having every on-site action double checked and signed off. I know people that are hosted there and have not heard of issues since the new policies were put in place.

Another reason is being a cheap bastard. I think Amazon falls into this area – they address it for their own applications with application-level availability, global load balancing and fancy Citrix load balancers. I was at another company a few years ago that fell into the cheap bastard area too, not wanting to invest in redundant power. People view power as a utility that won’t ever go down, especially in a data center – and this view is reinforced the longer you go without having a power outage. I remember a couple outages at a real cheap co-location facility the company was using in Seattle, where some other customer plugged a piece of fancy Cisco gear in and for some reason it tripped the UPS, which knocked out a half dozen of our racks because they didn’t have redundant power. So naturally we had an outage due to that. The same thing happened again a few weeks later, after the customer replaced the Cisco gear with a newer Cisco thing and the UPS tripped again. Don’t know why.

The back-end infrastructure was poorly designed as well: they had literally roughly two dozen racks all running off the same UPS, and none of them had redundant power (I thought, how hard can it be to alternate between UPSs every other rack? Apparently they didn’t think of that or didn’t want to spend for it). It was a disaster waiting to happen. They were lucky and did not have such a disaster while I was there. It was like pulling teeth to get them to commit to redundant power for the new 3PAR system, and even then they’d only agree to one UPS feed and one non-UPS feed. This had its own issues on occasion.

One of my former co-workers told me a story about a data center he used to work at – the worst of both worlds – bad design AND cheap bastard. They bought these generators and enclosed them somewhat in some sort of structure outside. Due to environmental regulations they could not test them very often, only a couple minutes a month or something like that. Maybe the generators were cheap crappy ones that belched out more pollution than others, I don’t know. But the point is they never could fully test them. They had a real power outage one day, and they went outside and watched as the generators kicked on, they were happy.

Then a few minutes later they shut down and the facility lost all power. WTF? They went and turned them on again, and a few minutes later they shut off again.  Apparently the structure they built around the generators did not leave enough space for cooling and the generators were overheating and shutting down.

Back to Amazon and their SLAs (or lack thereof). I’m torn between finding it funny and finding it sad when I see people attacking Amazon customers like Netflix or the other social sites on their cloud when they go down as a result of an Amazon outage. They rag on the customers for not making their software more resilient against such things. Amazon expects you to do this; they do it themselves, after all, and if Amazon can do it anyone can, right?

Yeah, reality is different. Most companies do not do that and probably never will. At a certain scale it makes sense, and for some applications it makes sense. For the vast majority it does not, and the proof is in the pudding – most companies don’t do it. I’ve worked at two different companies that built their apps from the ground up in Amazon and neither made any consideration for this aspect of availability. I know there are folks out there that DO do this, but they are a small minority who think they are hip because they can survive a data center going down without impacting things.
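For what it’s worth, here’s roughly what the simplest version of that resilience looks like: a sketch using the boto library of the time, assuming you keep a minimum amount of capacity in more than one availability zone and replace whatever a dead zone takes out. The zones, AMI ID and counts are placeholders, and a real setup would sit behind load balancing and automation rather than a script like this:

```python
# A minimal sketch, not production code: keep a floor of capacity in more
# than one availability zone and relaunch whatever a dead zone took with it.
# The region, zones, AMI ID and counts are placeholders I made up.

import boto.ec2

REGION = 'us-east-1'
ZONES = ['us-east-1a', 'us-east-1c']   # spread across at least two AZs
MIN_PER_ZONE = 2
AMI_ID = 'ami-00000000'                # placeholder image

conn = boto.ec2.connect_to_region(REGION)

def running_count(zone):
    """Count running instances in one availability zone."""
    reservations = conn.get_all_instances(filters={
        'availability-zone': zone,
        'instance-state-name': 'running',
    })
    return sum(len(r.instances) for r in reservations)

for zone in ZONES:
    missing = MIN_PER_ZONE - running_count(zone)
    if missing > 0:
        # If the whole zone is dark this call will fail too; the point is
        # that the other zone already has enough capacity to carry the load.
        conn.run_instances(AMI_ID, min_count=missing, max_count=missing,
                           instance_type='m1.small', placement=zone)
```

Even this toy version hints at why most shops don’t bother: your data, sessions and deployments all have to be just as happy in the surviving zone, and that’s where the real work is.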

It’s far simpler, and cheaper, to address the problem in a more traditional way with highly available infrastructure for the vast majority of applications anyway. Disasters do happen and you should still be prepared for them, but that’s far different from the Amazon model of “built to fail”. These aren’t the first power issues Amazon has had and they certainly won’t be the last.

The main point of this post is to illustrate the difference in how the SLAs are worded, how the particular service provider responds, and how customers respond to the event.

A counter example I have brought up many times: a combination of a power issue AND a fire at a Terremark facility a few years ago resulted in no customer impact. Good design and no cheap bastards there.

Some irony here is that Amazon tries to recruit me about once every six months. I politely tell them I’m not interested; if it’s a person I know, I tell them why I’m not interested, and believe me, I’m being incredibly polite here.

The current state of Infrastructure as a Service cloud offerings is just a disaster in general (there are some exceptions to parts of this here and there). Really, everything about it is flawed, from the costs to the availability to the basic way you allocate resources. For those of you out there that use cloud offerings and feel like you’ve traveled back in time, I feel your pain; it’s been the most frustrating two years of my career by far. Fortunately that era is coming to a close in a couple of weeks, and boy does it feel good.

This blog had a many-hour outage recently. Of course it’s not powered by redundant systems, though the server does have redundant power supplies (I suspect the rack doesn’t have true redundant power, but I don’t know; it’s managed co-location, though I own the server). A few nights ago there were some networking issues; I don’t know the details and haven’t tried to find out. But the provider who gets me the service (I think they have a cage in the facility; they are a computer reseller) had their website on the same subnet as mine, and I saw that it was unreachable as well.

Whatever it was, it was not a power issue, since the uptime of my systems was unchanged once things got fixed. My bridging OpenBSD VM running pf on my ESXi system did crash for some reason (internal VMware error – maybe too many network errors), so I had to manually fire the VM up again before my other VMs could get internet access. Not the end of the world; it’s just one small server running personal stuff. As you might know I ran my server in the Terremark cloud for about a year while I transitioned between physical server hosts (the last server was built in 2004, this one about a year ago). When I started thinking about off-site backups I very quickly determined that cloud wasn’t going to cut it on cost, and it was far cheaper to just buy a server with RAID and put it in a co-lo. With roughly 3.6TB of usable capacity protected by RAID-10 on enterprise nearline SAS drives and a hardware RAID controller with battery-backed cache, I’m happy.
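Just to illustrate the kind of gap I mean, here’s a rough sketch with assumed numbers; the per-GB price, server cost and co-lo fee are guesses for illustration, not quotes from any provider:

```python
# Back-of-the-envelope storage cost comparison. Every number here is an
# assumption for illustration only, not a real quote from any provider.

usable_tb = 3.6
cloud_price_per_gb_month = 0.10   # assumed cloud storage price per GB-month
server_cost = 3000.0              # assumed one-time cost of the colo server
colo_fee_per_month = 100.0        # assumed monthly co-location fee

cloud_per_year = usable_tb * 1024 * cloud_price_per_gb_month * 12
colo_year_one = server_cost + colo_fee_per_month * 12
colo_per_year_after = colo_fee_per_month * 12

print(round(cloud_per_year))       # ~4424 per year, every year, before bandwidth
print(round(colo_year_one))        # ~4200 in year one
print(round(colo_per_year_after))  # ~1200 per year after that
```

Spread over the several years I tend to keep a server (the last one was built in 2004), the gap only gets wider, and the co-lo box is doing a lot more than just holding backups.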
