TechOpsGuys.com Diggin' technology every day

April 18, 2013

Giant Explosion of companies moving off of AWS

Filed under: Datacenter — Tags: — Nate @ 10:24 am

Maybe I’m not alone in this world after all. I have ranted and raved for years now about how terrible Amazon’s cloud is. I used it at two different companies for around two years (it’s been almost a year since I last used it at this point), and it was, by far, the worst experience of my professional career. I could go on for an entire afternoon listing all the problems and the lack of abilities, features, etc. that I had experienced — not to mention the costs.

But anyway, on to the article, which I found on Slashdot. It made my day; well, the day is young still, so perhaps something better will come along.

A week ago I quoted Piston’s CTO saying that there was a “giant explosion” of companies moving off of Amazon Web Services (AWS). At the time, I noted that he had good reason to say that, since he started a company that builds software used by companies to build private clouds.

[..]

Enterprises that are moving to private clouds tend to be those that had developers start using the cloud without permission.

[..]

Other businesses are “trying to get out,” he said. AWS has made its compute services very sticky by making them difficult for users to remove workloads like databases to run them elsewhere.

I myself know of several folks who have stuff in Amazon, and for the most part their complaints are similar to mine. Very few that I have heard of are satisfied. The people that seem to be satisfied (in my experience) are those that don’t see the full picture, or don’t care. They may be satisfied because they don’t want to worry about infrastructure no matter the cost, or because they want to be able to point the finger at an external service provider when stuff breaks (Amazon’s support is widely regarded as worthless).

“We have thousands of AWS customers, and we have not found anyone who is happy with their tech support,” says Laderman.

I was at a company paying six figures a month in fees, and they refused to give us any worthwhile support. Any company in the enterprise space would have been more than happy to permanently station an employee on site to keep a customer making those kinds of payments happy. Literally everyone at the company who used the Amazon stuff hated it, and the company wanted Amazon to come help show us the way — and they said no.

I am absolutely convinced (as I’ve seen it first and second hand) that in many cases the investors in these startups have conflicts of interest and want their startups to use Amazon, because the investors benefit from Amazon growing as well. Amazon then uses that as marketing material to pimp to other customers. This of course happens all over the place with other companies, but relatively speaking there are a lot of folks invested in Amazon compared to most other companies.

There’s no need for me to go into specifics as to why Amazon sucks here – for those you can see some of the past posts. This is just a quickie.

Anyway, that’s it for now. I saw the article and it made me smile.

July 9, 2012

Amazon outages from a Datacenter Perspective

Filed under: Datacenter — Tags: — Nate @ 2:56 pm

I just came across this blog post (“Cloud Infrastructure Might be Boring, but Data Center Infrastructure Is Hard”), and the author spent a decent amount of time ripping into Amazon from a data center operations perspective:

But on the facilities front, it’s hard to see how the month of June was anything short of a disaster for Amazon on the data center operations side.

Past outages are also covered, and the author concludes that Amazon lacks discipline in operating their facilities, as the chain of outages over the past few years illustrates:

[..]since all of them can be traced back to a lack of discipline in the operation of the data centers in question.

[..]I wish they would just ditch the US East-1 data center that keeps giving them problems.  Of course the vast, vast majority of AWS instances are located there, so that may involve acquiring more floor space.

Sort of reminds me of when Internap had their massive outage and then followed up by offering basically free migration to their new data center for any customer that wanted it – so many opted for it that they ran out of space pretty quickly (though I’m sure they have since provisioned tons more, as the new facility had the physical capacity to handle everyone plus lots more once fully equipped).

This goes back to my post where I ripped into them from a customer perspective, the whole built-to-fail model. For Amazon it doesn’t matter if a data center goes offline; they have the capacity to take the hit elsewhere, and global DNS will move the load over in a matter of seconds. Most of their customers don’t do that (mainly because it’s really expensive and complex – did you happen to notice there’s really no help for customers that want to replicate data or configuration between EC2 regions?). As I tried to point out before, at anything other than massive scale it’s far more cost effective (and orders of magnitude simpler) for the vast majority of the applications and workloads out there to have the redundancy in the infrastructure (and, of course, the operational ability to run the facilities properly) to handle those sorts of events.
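To give a sense of what that DIY cross-region redundancy actually involves, here is a minimal sketch of the kind of glue you end up writing yourself: a health check that repoints a low-TTL DNS record at a standby region when the primary stops answering. The zone ID, hostnames, and IPs are all hypothetical, and it assumes Route 53 driven through boto3; the point is the shape of the work, not a recommended implementation.

```python
# Minimal DIY cross-region failover sketch (hypothetical names, IDs, and IPs).
# Polls the primary region's endpoint and, if it stops answering, repoints a
# low-TTL DNS record at a standby region via Route 53.
import time
import urllib.request

import boto3

HOSTED_ZONE_ID = "Z_EXAMPLE123"   # hypothetical Route 53 zone
RECORD_NAME = "app.example.com."
PRIMARY_IP = "198.51.100.10"      # app endpoint in the primary region
STANDBY_IP = "203.0.113.20"       # warm standby in another region

route53 = boto3.client("route53")


def healthy(ip: str) -> bool:
    """Very naive HTTP health check against a single endpoint."""
    try:
        urllib.request.urlopen(f"http://{ip}/healthz", timeout=5)
        return True
    except Exception:
        return False


def point_dns_at(ip: str) -> None:
    """UPSERT the A record so clients follow the surviving region."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,  # low TTL so the flip takes effect quickly
                    "ResourceRecords": [{"Value": ip}],
                },
            }]
        },
    )


while True:
    if not healthy(PRIMARY_IP):
        point_dns_at(STANDBY_IP)   # failing back is your problem too
    time.sleep(30)
```

And that is only the DNS piece; keeping the standby region’s data and configuration in sync is a whole separate project, which is exactly the point.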

Though I’d argue with the author on one point – cloud infrastructure is hard. (Updated: the author said it was boring rather than easy; for whatever reason my brain interpreted it as “if one is hard, the other must not be” 🙂 ) Utility infrastructure is easy, but true cloud infrastructure is hard. The main difference being the self-service aspect of things. There are a lot of different software offerings trying to provide some sort of self service or another, but for the most part they still seem pretty limited or immature (and in some cases really costly). It’s interesting to see the discussions about OpenStack, for example – not a product I’d encourage anyone to use in house just yet unless you have developer resources that can help keep it running.

December 7, 2011

Impending rolling outages in EC2

Filed under: Datacenter — Tags: — Nate @ 8:55 pm

I don’t write too much about EC2, despite how absolutely terrible it is; I will be writing about it in more depth soon (months from now most likely; it’s complicated). Nothing is more frustrating than working with stuff in EC2.

I have told some folks recently that my private rants about EC2 and associated services make me feel like those folks in 2005-7 screaming about the impending implosion of the housing market while, for the most part, nobody listened because that’s not what they wanted to hear.

Same goes for EC2.

Anyways, I wanted to mention this, which talks about impending rolling outages across the Amazon infrastructure (within the next week or two).

Oh wait these are not outages, these are “scheduled maintenance events”.

That you can’t opt out of. You can postpone them a bit, but you can’t avoid them entirely, short of getting the hell outta there (which is a project I am working on – finally! Going to Atlanta next week, more than 4 months later than I was originally expecting).
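For what it’s worth, about the only visibility you get into these “events” is by polling the API yourself. Here’s a rough sketch of what that looks like, using boto3 (a much newer client library than anything that existed when this was written) and assuming the events are surfaced through DescribeInstanceStatus, which is where Amazon reports them:

```python
# Sketch: list whatever maintenance "events" Amazon has queued up for your
# instances. You can't opt out of them, but at least you can see what's
# coming and plan the stop/start shuffle around it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instance_status(IncludeAllInstances=True)
for status in resp["InstanceStatuses"]:
    for event in status.get("Events", []):
        print(
            status["InstanceId"],
            event["Code"],            # e.g. instance-reboot, system-maintenance
            event.get("NotBefore"),   # start of the maintenance window
            event["Description"],
        )
```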

Yeah, good design there. Better design? Take a look at what the folks over at a UK provider called UltraSpeed do; it’s clear they are passionate about what they do, and things like a 15-minute SLA for restoring a failed server show they take pride in their work (look ma! No hard disks in the servers! Automated off-site backups to another country!). Or Terremark – fire in the data center? No problem.

I have little doubt this is in response to a critical security flaw that can only be addressed by rebooting the tens or hundreds of thousands of VMs across their infrastructure in a short time before it gets exploited, assuming it’s not being exploited already.

I fully expect that perhaps by the end of this month there will be some security group out there that discloses the vulnerability that Amazon is frantically trying to address now.

October 6, 2010

Amazon EC2: Not your father’s enterprise cloud

Filed under: Datacenter — Tags: — Nate @ 9:00 am

OK, so obviously I am old enough that my father did not have clouds back in his day (well, not the infrastructure clouds that are offered today). I was just trying to think of a somewhat zingy topic. And I understand “enterprise” can have many meanings depending on the situation; it could mean a bank that needs high uptime, for example. In this case I use the term enterprise to signify the need for 24×7 operation.

Here I am, once again working on stuff related to “the cloud”, and it seems like everything “cloud” revolves around EC2.

Even after all the work I have done recently and over the past year or two with regard to cloud proposals, I don’t know why it didn’t hit me until probably the past week or so, but it did (sorry if I’m late to the party).

There are a lot of problems with running traditional infrastructure in the Amazon cloud, as I’m sure many have experienced first hand. That wasn’t the realization that occurred to me, of course.

The realization was that there isn’t a problem with the Amazon cloud itself, but there is a problem with how it is:

  • Marketed
  • Targeted

Which leads to people using the cloud for things it was never intended to be used for. In the case of Amazon, one need look no further than their SLA on EC2 to immediately rule it out for any sort of “traditional” application, which includes:

  • Web servers
  • Database servers
  • Any sort of multi tier application
  • Anything that is latency sensitive
  • Anything that is sensitive to security
  • Really, anything that needs to be available 24×7

Did you know that if they lose power to a rack, or even a row of racks, that is not considered an outage? It’s not as if they provide you with the knowledge of where your infrastructure is in their facilities; they’d rather you just pay them more and put things in different zones and regions.

Their SLA says, in part, that they can in fact lose an entire data center (“availability zone”) and that’s not considered an outage. Amazon describes an availability zone this way:

Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.

And while I can’t find it on their site at the moment, I swear not too long ago their SLA included a provision that said even if they lost TWO data centers it’s still not an outage unless you can’t spin up new systems in a THIRD. Think of how many hundreds to thousands of servers are knocked offline when an Amazon data center becomes unavailable. I think they may have removed the two-availability-zone clause because not all of their regions have more than two zones (last I checked only us-east did, but maybe more have them now).

I was talking to someone who worked at Amazon not too long ago and had in fact visited the us-east facilities, and they said all of the availability zones were in the same office park, really quite close to each other. They may have had different power generators and such, but quite likely if a tornado or flooding hit, more than one zone would be impacted; likely the entire region would go out (that is Amazon’s code word for saying all availability zones are down). While I haven’t experienced it first hand, I know of several incidents that impacted more than one availability zone, indicating that more is shared between them than customers are led to believe.

Then there is the extremely variable performance and availability of the services as a whole. On more than one occasion I have seen Amazon reboot the underlying hardware without any notification (note they can’t migrate the workloads off the machine; anything on the machine at the time is killed!). I also love how unapologetic they are when it comes to things like data loss. Basically they say you didn’t replicate the data enough times, so it’s your fault. Now, I can certainly understand that bad things happen from time to time; that is expected. What is not expected, though, is how they handle it. I keep thinking back to this article I read on The Register a couple of years ago, a good read.

Once you’re past that, there’s the matter of reliability. In my experience with it, EC2 is fairly reliable, but you really need to be on your shit with data replication, because when it fails, it fails hard. My pager once went off in the middle of the night, bringing me out of an awesome dream about motorcycles, machine guns, and general ass-kickery, to tell me that one of the production machines stopped responding to ping. Seven or so hours later, I got an e-mail from Amazon that said something to the effect of:

There was a bad hardware failure. Hope you backed up your shit.

Look at it this way: at least you don’t have a tapeworm.

-The Amazon EC2 Team

I’m sure I have quoted it before in some posting somewhere, but it’s such an awesome and accurate description.

So go beyond the SLAs, go beyond the performance and availability issues.

Their infrastructure is “built to fail”, which is a good concept at very large scale; I’m sure every big web-type company does something similar. The concept really falls apart at small scale, though.

Everyone wants to get to the point where they have application-level high availability and can abstract away the underlying hardware from both a performance and a reliability standpoint. I know that, you know that. But what a lot of the less technical people don’t understand is that this is HARD TO DO. It takes significant investments of time and money to pull off. And at large scale those investments do pay back big. But at small scale they can really hurt you. You spend more time building your applications and tools to handle unreliable infrastructure when you could be spending time adding the features that will actually make your customers happy.

There is a balance there, as with anything. My point is that with the Amazon cloud those concepts are really forced upon you if you want to use their service in a more “traditional” hosting model. And the overhead associated with that is ENORMOUS.

So back to my point that the problem isn’t with Amazon itself; it’s with whom it is targeted at and the expectations around it. They provide a fine service, if you use it for what it was intended. EC2 stands for “elastic compute”; the first thing that comes to my mind when I hear a term like that is HPC-type applications, data processing, back-end stuff that isn’t latency sensitive and is better able to tolerate infrastructure failure.

But even then, that concept falls apart if you have a need for 24×7 operations. The cost model of even Amazon, the low-cost “leader” in cloud computing, doesn’t hold water versus doing it yourself.

Case in point: earlier in the year, at another company, I was directed to go on another pointless expedition comparing the Amazon cloud to doing it in house for a data-intensive 24×7 application. That’s not even taking into account the latency introduced by S3, the operational overhead of EC2, or the performance and availability problems. Assuming everything worked PERFECTLY, or at least as well as physical hardware, the ROI for keeping the project in house was less than 7 months (I re-checked the numbers and revised the ROI from the original 10 months down to 7 months; I was in a hurry writing this morning before work). And this was for good-quality hardware with 3 years of next-business-day on-site support, not scraping the bottom of the barrel. To give you an idea of the savings: after those 7 months it could more than pay for my yearly salary, benefits, and the other expenses a company has for an employee, each and every month after that.
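To make the shape of that math concrete: the break-even calculation is nothing fancy, just a recurring cloud bill stacked against a one-time hardware purchase plus ongoing colo costs. The numbers below are entirely made up for illustration; only the structure of the calculation matters.

```python
# Back-of-the-envelope break-even: recurring cloud bill vs. buying hardware.
# Every figure here is hypothetical, purely to show the shape of the math.
cloud_monthly = 40_000     # monthly AWS bill (hypothetical)
hardware_capex = 220_000   # servers, storage, 3 years of support, up front (hypothetical)
colo_monthly = 6_000       # power, space, bandwidth in a colo (hypothetical)

months = 0
cloud_total = 0
inhouse_total = hardware_capex
while cloud_total <= inhouse_total:
    months += 1
    cloud_total += cloud_monthly
    inhouse_total += colo_monthly

print(f"Break-even after roughly {months} months")
print(f"Savings every month after that: ${cloud_monthly - colo_monthly:,}")
```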

OK, so we’re past that point now. On to a couple of really cool slides I came up with for a pending presentation, which I really think illustrate the Amazon cloud quite well, another one of those “a picture is worth fifty words” kind of things. The key point here is capacity utilization.

What has using virtualization over the past half decade (give or take) taught us? What have the massive increases in server and storage capacity taught us? Well, they taught me that applications no longer have the ability to exploit the capacity of the underlying hardware. There are very rare exceptions to this, but in general, over what I would say is at least the past 15 years of my experience, applications have really never had the ability to exploit the underlying capacity of the hardware. How many systems do you see averaging under 5% CPU? Under 3%? Under 2%? How many systems do you see with disk drives that are 75% empty? 80%?
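Put some rough numbers on that (these are illustrative, not measurements from any particular environment) and the consolidation opportunity is obvious:

```python
# Rough consolidation arithmetic; utilization figures are illustrative only.
avg_cpu_utilization = 0.04       # a typical workload averaging ~4% CPU
target_host_utilization = 0.60   # leave headroom for spikes on a shared host

workloads_per_host = round(target_host_utilization / avg_cpu_utilization)
print(f"Roughly {workloads_per_host} such workloads can share one host")  # ~15
```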

What else has virtualization given us? It’s given us the opportunity to logically isolate workloads into different virtual machines, which can ease the operational overhead of managing those workloads, from both a configuration standpoint and a capacity-planning standpoint.

That’s my point. Virtualization has given us the ability to consolidate these workloads onto fewer resources. I know this is a point everyone understands (I’m not trying to make anyone look stupid), but my point here with regard to Amazon is that their model doesn’t take us forward — it takes us backward. Here are the two slides that illustrate this:

[Slide 1 image (click for full size)]

And the next slide:

[Slide 2 image (click for full size)]

Not all cloud providers are created equal, of course. The Terremark Enterprise cloud (not vCloud Express, mind you), for example, is resource-pool based. I have no personal experience with their enterprise cloud (I am a vCloud Express user for my personal stuff: 2 x 1 vCPU servers, including the server powering this blog!), though I did interact with them pretty heavily earlier in the year on a big proposal I was working on at the time. I’m not trying to tell you that Terremark is more or less cost effective, just that they don’t reverse several years of innovation and progress in the infrastructure area.

I’m sure Terremark is not the only provider that can provide resources based on resource pools instead of hard per-VM allocations. I just keep bringing them up because I’m more familiar with their stuff due to several engagements with them at my last company (none of which ever resulted in that company becoming a customer). I originally became interested in Terremark because I was referred to them by 3PAR, and I’m sure by now you know I’m a fan of 3PAR; Terremark is a very heavy 3PAR user. And they are a big VMware user, and you know I like VMware by now, right?

If Amazon were more, what is the right word, honest? Up front? Better at setting expectations? I think their customers would be better off; mainly, they would have fewer of them, because those customers would realize what that cloud is actually made for, rather than trying to fit a square peg into a round hole. If you whack it hard enough you can usually get it in, but, well, you know what I mean.

As this blog entry now exceeds 1,900 words, I feel I should close it off. If you read this far, hopefully I made some sense to you. I’d love to share more of my presentation, as I feel it’s quite good, but I don’t want to give all of my secrets away 🙂

Thanks for reading.
