TechOpsGuys.com Diggin' technology every day

December 7, 2011

Impending rolling outages in EC2

Filed under: Datacenter — Tags: — Nate @ 8:55 pm

I don’t write too much about EC2, despite how absolutely terrible it is. I will be writing about it in more depth soon (months most likely; it’s complicated). Nothing is more frustrating than working with stuff in EC2.

I have told some folks recently that my private rants about EC2 and associated services make me feel like those folks in 2005-7 screaming about the implosion of the housing market, yet for the most part nobody was listening because that’s not what they wanted to hear.

Same goes for EC2.

Anyways, I wanted to mention this, which talks about impending rolling outages across the Amazon infrastructure (within the next week or two).

Oh wait, these are not outages, these are “scheduled maintenance events”.

That you can’t opt out of. You can postpone them a bit, but you can’t avoid them entirely, short of getting the hell outta there (which is a project I am working on – finally! Going to Atlanta next week, more than 4 months later than I was originally expecting)
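If you are stuck riding these out, it helps to at least know which instances are on the list and when. Below is a minimal sketch, assuming the response shape of the EC2 DescribeInstanceStatus call (as returned by boto3's `describe_instance_status`, where each instance carries an `Events` list with `Code`, `Description` and `NotBefore`). The function name `pending_events`, the instance IDs and the sample data are all made up for illustration; this is not anything Amazon ships.

```python
from datetime import datetime, timezone


def pending_events(response):
    """Return (instance_id, event_code, not_before) tuples, soonest first.

    Assumes `response` has the DescribeInstanceStatus shape and that every
    event carries a NotBefore timestamp.
    """
    pending = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # Assumption: finished events have "[Completed]" prepended
            # to their description, so we skip those.
            if event.get("Description", "").startswith("[Completed]"):
                continue
            pending.append(
                (status["InstanceId"], event["Code"], event["NotBefore"])
            )
    # Sort by start of the maintenance window so the most urgent come first.
    return sorted(pending, key=lambda item: item[2])


# Hypothetical sample data standing in for a real API response.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0000aaaa",
            "Events": [
                {
                    "Code": "system-reboot",
                    "Description": "Scheduled reboot",
                    "NotBefore": datetime(2011, 12, 12, 6, 0, tzinfo=timezone.utc),
                }
            ],
        },
        {
            "InstanceId": "i-0000bbbb",
            "Events": [
                {
                    "Code": "instance-reboot",
                    "Description": "Scheduled reboot",
                    "NotBefore": datetime(2011, 12, 10, 6, 0, tzinfo=timezone.utc),
                }
            ],
        },
        {
            "InstanceId": "i-0000cccc",
            "Events": [
                {
                    "Code": "system-reboot",
                    "Description": "[Completed] Scheduled reboot",
                    "NotBefore": datetime(2011, 12, 8, 6, 0, tzinfo=timezone.utc),
                }
            ],
        },
    ]
}

for instance_id, code, not_before in pending_events(sample):
    print(f"{instance_id}: {code} no earlier than {not_before.isoformat()}")
```

With real credentials you would presumably feed it the result of `boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)` instead of the sample dictionary.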

Yeah, good design there. Better design? Take a look at what the folks over at a provider in the UK called UltraSpeed do. It’s clear they are passionate about what they do, and things like a 15 minute SLA for restoring a failed server show they take pride in their work (look ma! No hard disks in the servers! Automated off site backups to another country!). Or Terremark – fire in the data center? No problem.

I have little doubt this is in response to critical security flaws which can only be addressed by rebooting the tens or hundreds of thousands of VMs across their infrastructure in a short time, before the flaws get exploited, assuming they’re not being exploited already.

I fully expect that perhaps by the end of this month there will be some security group out there that discloses the vulnerability that Amazon is frantically trying to address now.

4 Comments

  1. You forgot to mention that some of those reboots (instance reboots vs. system reboots) will cause your instances’ public DNS name and IP to change (a stop/start reboot). Time to go update any external references to your instances (like monitoring, because their CloudWatch monitors can’t monitor anything in the OS). Hope you don’t have hundreds/thousands of instances.

    Comment by Tyson — December 8, 2011 @ 2:35 pm

  2. I was not aware of the system reboots! Though from what I see, I think the IPs will not change for system reboots.

    Comment by Nate — December 8, 2011 @ 3:15 pm

  3. The IPs changed on our instances that needed system (stop/start) reboots.

    Comment by Tyson — December 9, 2011 @ 12:44 pm

  4. […] Amazon has had far more downtime for companies that I have worked for (either before or since I was there) than any infrastructure related outages at companies I was at where they hosted their own stuff. I'd say it's safe to say an order of magnitude more outages. Of course not all of these are called outages by Amazon, they leave themselves enough wiggle room to drive an aircraft carrier through in their SLAs. My favorite one was probably the forced reboot of their entire infrastructure. […]

    Pingback by Top 10 outages of the year « TechOpsGuys.com — December 18, 2012 @ 11:10 am
