TechOpsGuys.com Diggin' technology every day

June 30, 2012

Synchronized Reboot of the Internet

Filed under: linux — Tags: — Nate @ 7:37 pm

[UPDATE – I’ve been noticing some people claim that kernels newer than 2.6.29 are not affected. Well, I’ve got news for you: I have 200+ VMs running 2.6.32 that say otherwise (one person in the comments mentions kernel 3.2 is impacted too!) 🙂 ]

[ UPDATE 2 – this is a less invasive fix that my co-worker has tested on our systems:

date -s "`date -u`"

]
Been fighting a little fire that I’m sure hundreds if not thousands of others are fighting as well. It happened just before midnight UTC, when a leap second was inserted into our systems, and that seemed to trip a race condition in Linux that I assume most people thought was fixed, but I guess people didn’t test it.

[3613992.610268] Clock: inserting leap second 23:59:60 UTC
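
If you want to confirm whether a particular box actually took the leap second, the message shows up in the kernel log. A quick check (log paths vary by distro, so treat these as examples):

# Look for the leap second insertion message
dmesg | grep -i "leap second"

# Or, if dmesg has already rotated it out:
grep -i "leap second" /var/log/kern.log /var/log/messages 2>/dev/null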

 

The behavior, as I’m sure you’re all aware of by now, is a spike in CPU usage. Normally our systems run on average under 8% CPU usage, and this pegged them up tenfold. Fortunately vSphere held up and we had the capacity to eat it, and the resource pools helped make sure production got its share of CPU power. Only minimal impact to the customers – our external alerting never even went off, which was a good sign.

CPU spike on a couple hundred VMs all at the same time (the above cluster has 441 GHz of CPU resources)

We were pretty lost at first. Fortunately my co-worker had a thought that maybe it was leap second related; we dug into things more and eventually came across this page (thanks Google for being up to date), which confirmed the theory and confirmed we weren’t the only ones impacted by it. Fortunately our systems were virtualized on a platform that was not impacted by the issue, so we did not experience any issues on the bare metal, only in the VMs. From the page:

Just today, Sat June 30th – starting soon after the start of the day GMT. We’ve had a handful of blades in different datacentres as managed by different teams all go dark – not responding to pings, screen blank.

They’re all running Debian Squeeze – with everything from stock kernel to custom 3.2.21 builds. Most are Dell M610 blades, but I’ve also just lost a Dell R510 and other departments have lost machines from other vendors too. There was also an older IBM x3550 which crashed and which I thought might be unrelated, but now I’m wondering.

It wasn’t long after that we started getting more confirmations of the issue from pretty much everyone out there. We haven’t dug into a root cause at this point; we’ve been busy rebooting Linux VMs, which seems to be a good workaround (we didn’t need the steps indicated on the page). Even our systems that are up to date with kernel patches as recently as a month ago were impacted. Red Hat is apparently issuing a new advisory for their systems since they were impacted as well.

Some systems behaved well under the high load; others were so unresponsive they had to be power cycled. There was usually one process chewing through an abnormal amount of CPU – for the systems I saw it was mostly Splunk and autofs. I think that was just coincidence though, probably whatever processes happened to be using CPU at the instant the leap second was inserted into the system.
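
If you want to check which process is spinning on a given box before deciding whether it needs a reboot, plain old procps does the job – a quick sketch, nothing specific to this bug:

# Show the top CPU consumers (highest %CPU first)
ps -eo pid,pcpu,etime,comm --sort=-pcpu | head -n 10

# Or watch it live for a few refreshes in batch mode
top -b -d 2 -n 3 | head -n 25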

The internet is in the midst of a massive reboot. I pity the foo who has a huge number of systems and has to co-ordinate some complex mass reboot (unless there is another way – for me rebooting was simplest and fastest).
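
For what it’s worth, if I had to hit a large fleet again I’d probably push the date fix from the update above over ssh instead of rebooting everything. A rough sketch, assuming passwordless ssh as a user that can set the clock and a hypothetical vm-list.txt with one hostname per line:

#!/bin/bash
# Apply the leap second workaround (reset the clock from the UTC time string)
# to every host in vm-list.txt; log any hosts that fail.
while read -r host; do
    echo "Fixing clock on $host"
    ssh -o ConnectTimeout=5 "$host" 'date -s "$(date -u)"' \
        || echo "FAILED: $host" >> failed-hosts.txt
done < vm-list.txt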

I for one was not aware that a leap second was coming, or of the potential implications – it’s obvious I’m not alone. I do recall leap seconds in the past not causing issues for any of the systems I managed. I logged into my personal systems, including the one that powers this blog, and there are no issues on them. My laptop runs Ubuntu 10.04 as well (same OS rev as the servers I’ve been rebooting for the past 2 hours) and no issues there either (I’ve been using it all afternoon).

Maybe someday someone will explain to me in a way that makes sense why we give a crap about adding a second. I really don’t care if the world is out of sync by a few seconds with the rest of the known universe; if it’s that important we should have a separate scientific time scale or something, and let the rest of the normal folks go about their way. Same goes for daylight saving time. Imagine the power bill as a result of this fiasco, with thousands to hundreds of thousands of servers spiking to 100% CPU usage all at the same time.

Microsoft will have a field day with this one I’m sure 🙂

 

Java and DNS caching

Filed under: General — Nate @ 12:20 pm

I wanted to write a brief note on this since it’s a fairly widespread problem that I’ve encountered when supporting Java-based applications (despite the problem, I much prefer supporting Java-based apps over any other language at this point, by leaps and bounds).

The problem is a really, really stupid default setting for DNS caching in the java.security file. It’s an old setting – I recall first coming across it in, I want to say, 2004 or even 2003. But it’s still the default even today, and some big names out there apparently are not aware or do not care, because I come across this issue from a client perspective on what feels like a semi-regular basis.

I’ll let the file speak for itself:

#
# The Java-level namelookup cache policy for successful lookups:
#
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
#
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set. When a security
# manager is not set, the default behavior in this implementation
# is to cache for 30 seconds.
#
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.
#
#networkaddress.cache.ttl=-1

If you’re experienced with DNS at all you can probably tell right away the above is a bad default to have. The idea that changing it opens you up to a DNS spoofing attack is just brain dead, I’m sorry – you may very well be opening yourself up to DNS spoofing attacks by caching those responses forever. I think back to a recent post of mine related to the Amazon cloud, specifically their Elastic Load Balancers – as terrible as they are, they also by design change IP addresses at random intervals, sometimes resulting in really bad things happening.

“Amazon Web Services’ Elastic Load Balancer is a dynamic load-balancer managed by Amazon. Load balancers are regularly swapped around with each other, which can lead to surprising results; like getting millions of requests meant for a different AWS customer.”

Swapping IPs at random is obviously heavily dependent upon all portions of DNS resolution operating perfectly. Ask anyone experienced with DNS what they do when they migrate from one IP to another and you’ll very likely hear them say they keep the old IP active for X number of DAYS or WEEKS regardless of their DNS TTL settings, because some clients or systems simply don’t obey them. This is pretty standard practice. When I moved that one company out of Fisher Plaza (previous post) to the new facility, I stuck a basic Apache proxy server in the old data center for a month forwarding all requests to the new site (other things like SMTP/DNS were handled by other means).

Java takes it to a new level though, I’ll admit that. Why that is the default is just... well, I really don’t have words for it.
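
For the record, the fix itself is simple – either set a sane TTL in the java.security file, or override it per JVM. A rough sketch; the 30-second value and the app.jar name here are just examples, not anything from a specific deployment:

# Option 1: set a sane cache TTL in the JRE's java.security file by
# uncommenting/overriding the line shown above:
#   networkaddress.cache.ttl=30

# Option 2: per-JVM override via the Sun-specific system property,
# without touching java.security (applications can also call
# java.security.Security.setProperty("networkaddress.cache.ttl", "30")
# early in startup):
java -Dsun.net.inetaddr.ttl=30 -jar app.jar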

Fortunately, considering how terrible ELB is, Amazon EC2 customers have another solution: they can use Zeus (oh sorry, I meant Stingray). I’ve been using it for the past 7-8 months and it’s quite nice – easy to use, very flexible and powerful, much like a typical F5 or Citrix load balancer (and much easier to manage than Citrix). It even integrates with the EC2 APIs – it can use an elastic IP to provide automatic failover (it fails over in 10-20 seconds if I recall right, much faster than any DNS-based system could). Because of the elastic IP, the IP of the load balancer will never change. The only real downside to Zeus is that it’s limited to a single IP address (not Zeus’ fault, this is an Amazon limitation). So you can only do one SSL cert per Zeus cluster, and the costs can add up quick if you need lots of certs, since the cost of Zeus is probably 5-10x the cost of ELB (and worth every stinking penny too).

Oh, that and the elastic IP is only external (another Amazon limitation – you may see a trend here – ELB too has no static IP internally). So if you want to load balance internal resources, say web server 1 talking to web server 2, you either have to route that traffic out to the internet and back in, or point it at the internal IP of the Zeus load balancer and manually update the configuration if/when the load balancer fails over, because the internal IP will change. I was taught a long time ago to put everything behind a VIP, which means extensive use of internal load balancing for everything from HTTP to HTTPS to DNS to NTP to SMTP – everything. With the Citrix load balancers we’ve extended our intelligent load balancing to MySQL, since they have native support for MySQL now (I’m not aware of anyone else that does – Layer 4 load balancing doesn’t count here).

Amazon Cloud: Two power outages in two weeks

Filed under: Datacenter — Tags: , — Nate @ 11:54 am

By now you should know I’m no fan of Amazon’s cloud; it makes me feel like I’m stuck in the 90s when I use it. I’ve been using it quite a bit for the past two years (with two different companies) but I’m finally about to get the hell out of there. The last set of systems is set to migrate before my trip to Seattle.

Last week they had one outage in one of their availability zones, though it took them well over an hour to admit it was a power outage – they first tried to say “oh, some volumes are experiencing increased latency”. What a load of crap. It should take all of 5 seconds to know there is a power outage. The stuff I manage had minor impact, fortunately, since we are down to just a few things left; we lost some stuff but none of it critical.

Then last night they have another one, which seems to have made some news too.

A slew of sites, including Netflix, Instagram and Pinterest, have gone down this evening, thanks to “power issues” at Amazon’s Elastic Compute Cloud data center in North Virginia. The websites rely on Amazon’s cloud services to power their services. Some pretty violent storms in the region are apparently causing the problems.

This had slightly more impact on the stuff I’m responsible for; one of my co-workers handled the issues and it wasn’t much to do, fortunately. I can only imagine the havoc at a larger organization like one of the above that depends more heavily on their cloud.

What a lot of people don’t realize, though, is these two outages aren’t really considered outages in Amazon’s mind, at least for that region, because only one data center or part of one data center went offline. Their SLA is worded so that they exempt themselves from the effects of such an outage and put the onus on the customer to deal with it. I suspect these facilities aren’t even Tier IV, because Tier IV is expensive and Amazon is about cheap. If they were Tier IV, a simple storm wouldn’t have caused equipment to lose power.

I remember a couple years ago the company I was at had some gear co-located near Chicago at an Equinix site, and some big storms and flooding, if I remember right, rolled through. We didn’t have redundant power of course (more on that below), but there was no impact to the equipment other than an email to us saying the site was on generator power for some time and then another email saying the site was back on utility power.

There are exceptions of course, poor design being one. I think back to what was once Internap‘s premier data center in Seattle, Fisher Plaza, which was once plagued by power issues and eventually suffered more than 24 hours of downtime due to a fire, knocking out many well known sites like Bing Travel as well as many others. It took them months to repair the facility; they had generator trucks sitting out front providing power 24/7 during the repairs. From a storage perspective, I remember being told stories of at least one or two customers’ NetApp equipment taking more than 24 hours to come back online (file system checks), and I’m sure folks that had battery-backed cache were in sort of a panic not knowing when or if power would be restored to the facility. Some of my friends were hosted there at another company with a really small 3PAR array and were not worried though, because 3PAR systems dump their cache to an internal disk on the controller when the power goes out, so batteries are not required past that point. Since cache is mirrored, there are two copies of it stored on different disks. Some newer systems have fancy flash-backed cache that is even nicer.

Fisher Plaza for a while had about one power outage per year, every year for at least 3 years in a row, including the somewhat famous EPO event where someone went out of their way to hit the Emergency Power Off switch (there was no emergency) and shut down the facility. After that all customers had to go through EPO training, which was humorous.

Being the good operations person that I am, shortly after I started work in 2006 at a company that was hosted at Fisher Plaza, I started working on plans to move them out – the power issues were too much to bear. We still had about nine months left on our contract and my boss was unsure how we could leave before that was up given it would cost a lot. I had an awesome deal on the table from a local AT&T facility which I had good experiences with (though density-wise they are way outdated, and after an AT&T re-organization in around 2008 I wouldn’t even consider AT&T as a data center provider now). Anyways, I had this great deal and wanted to move, but we had a hard time getting past the fact that we still owed a ton on the Internap contract and we couldn’t get out of it. Then Fisher Plaza had another power outage (this was in 2006, the fire was three years later). The VP said to us something along the lines of “I don’t care what it takes, I want to get out of there now.” Music to my ears – things got moving quickly and we moved out within a month or so. I was hosted at that AT&T data center for a good 5 years personally, and the companies I was at were hosted there for, I want to say, a good 8-9 years between the two without a single power event that I am aware of. I was there once when the building lost power, but the data center floor was unaffected. I believe there were a few other power outages, but again nothing impacting customer equipment.

There are other bad designs out there too – personally I consider anything that relies on a flywheel UPS to be badly designed, because there isn’t enough time for on-site personnel to manually try to correct a situation before the UPS runs out of juice. I want at least 10-15 minutes at full load.

Internap later opened a newer, fancier data center down in Tukwila in a facility owned by Sabey. That is a massive campus; they claim 1.2M square feet of data center space. There is a large Microsoft presence there as well. On one of my tours of the facility I asked their technical people whether they use a real UPS or a flywheel, and they said a real UPS. They commented how Microsoft, literally next door, used flywheels, and said Microsoft is seemingly constantly running their generators (far more frequently than your typical routine load testing); they did not know specifically why but speculated maybe they don’t trust the flywheels, and laughed with me. That same Internap facility had another power outage shortly after it opened, though that one was human error. Apparently there was some fault in a UPS, some person did something bad, and the only way to fix it was to shut everything down. Internap claimed they addressed that problem by having every on-site action double checked and signed off. I know people that are hosted there and have not heard of issues since the new policies were put in place.

Another reason is being a cheap bastard. I think Amazon falls into this area – they address it for their own applications with application-level availability, global load balancing and fancy Citrix load balancers. I was at another company a few years ago that also fell into this area of being a cheap bastard and not wanting to invest in redundant power. People view power as a utility that won’t ever go down, especially in a data center – and this view is reinforced the longer you go without having a power outage. I remember a couple outages at a real cheap co-location the company was using in Seattle, where some other customer plugged a piece of fancy Cisco gear in and for some reason it tripped the UPS, which knocked out a half dozen of our racks because they didn’t have redundant power. So naturally we had an outage due to that. The same thing happened again a few weeks later, after the customer replaced the Cisco gear with a newer Cisco thing and the UPS tripped again. Don’t know why.

The back-end infrastructure was poorly designed as well: they had roughly two dozen racks all running off the same UPS, and none of them had redundant power (I thought, how hard can it be to alternate between UPSs every other rack? Apparently they didn’t think of that or didn’t want to spend for it). It was a disaster waiting to happen. They were lucky and did not have such a disaster while I was there. It was like pulling teeth to get them to commit to redundant power for the new 3PAR system, and even then they’d only agree to one UPS feed and one non-UPS feed. This had its own issues on occasion.

One of my former co-workers told me a story about a data center he used to work at – the worst of both worlds, bad design AND cheap bastard. They bought these generators and enclosed them somewhat in some sort of structure outside. Due to environmental regulations they could not test them very often, only a couple minutes a month or something like that. Maybe the generators were cheap crappy ones that belched out more pollution than others, I don’t know. But the point is they never could fully test them. They had a real power outage one day, and they went outside and watched as the generators kicked on; they were happy.

Then a few minutes later the generators shut down and the facility lost all power. WTF? They went and turned them on again, and a few minutes later they shut off again. Apparently the structure they built around the generators did not leave enough space for cooling, and the generators were overheating and shutting down.

Back to Amazon and their SLAs (or lack thereof). I’m torn between funny and sad when I see people attacking Amazon customers like Netflix or the other social things that are on their cloud when they go down as a result of an Amazon outage. They rag on the customers for not making their software more resilient against such things. Amazon expects you to do this – they do it, after all, and if Amazon can do it anyone can, right?

Yeah, reality is different. Most companies do not do that and probably never will. At a certain scale it makes sense, and for some applications it makes sense. For the vast majority it does not, and the proof is in the pudding – most companies don’t do it. I’ve worked at two different companies that built their apps from the ground up in Amazon and neither made any provisions for this aspect of availability. I know there are folks out there that DO do this, but they are in the small minority, the ones who think they are hip because they can survive a data center going down without impacting things.

It’s far simpler and cheaper to address the problem in a more traditional way, with highly available infrastructure, for the vast majority of applications anyway. Disasters do happen and you should still be prepared, but that’s far different from the Amazon model of “built to fail”. These aren’t the first power issues Amazon has had and they certainly won’t be the last.

The main point of this post is to illustrate the difference in how the SLAs are worded, how the particular service provider responds, and how customers respond to the event.

A counterexample I have brought up many times – a combination of a power issue AND a fire over at a Terremark facility a few years ago – resulted in no customer impact. Good design and no cheap bastards there.

Some irony here is that Amazon tries to recruit me about once every six months. I politely tell them I’m not interested – unless it’s a person I know, in which case I tell them why I’m not interested – and believe me, I’m being incredibly polite here.

The current state of Infrastructure as a Service cloud offerings is just a disaster in general (there are some exceptions to parts of the rules here and there). Really, everything about it is flawed, from the costs to the availability to the basic way you allocate resources. For those of you out there that use cloud offerings and feel like you’ve traveled back in time, I feel your pain – it’s been the most frustrating two years of my career by far. Fortunately that era is coming to a close in a couple of weeks, and boy does it feel good.

This blog had a many-hour outage recently. Of course it’s not powered by redundant systems, though it does have redundant power supplies (I suspect the rack doesn’t have true redundant power, but I don’t know – it’s a managed co-location, though I own the server). A few nights ago there were some networking issues; I don’t know the details and haven’t tried to find out. But the provider who gets me the service (I think they have a cage in the facility; they are a computer reseller) had their website on the same subnet as mine, and I saw that it was unreachable as well.

Whatever it was, it was not a power issue, since the uptime of my systems was unchanged once things got fixed. Though my bridging OpenBSD VM running pf on my ESXi system crashed for some reason (internal VMware error – maybe too many network errors), so I had to manually fire up the VM again before my other VMs could get internet access. Not the end of the world though, it’s just one small server running personal stuff. As you might know, I ran my server in the Terremark cloud for about a year while I transitioned between physical server hosts (the last server was built in 2004, this one about a year ago). When I started thinking about off-site backups, I very quickly determined that cloud wasn’t going to cut it for the costs, and it was far cheaper to just buy a server with RAID and put it in a co-lo. With roughly 3.6TB of usable capacity protected by RAID-10 on enterprise nearline SAS drives and a hardware RAID controller with battery-backed cache, I’m happy.
