TechOpsGuys.com Diggin' technology every day

June 30, 2012

Synchronized Reboot of the Internet

Filed under: linux — Tags: — Nate @ 7:37 pm

[UPDATE – I’ve been noticing some people claim that kernels newer than 2.6.29 are not affected. Well, I’ve got news for you: I have 200+ VMs running 2.6.32 that say otherwise (one person in the comments mentions kernel 3.2 is impacted too!) 🙂 ]

[ UPDATE 2 – this is a less invasive fix that my co-worker has tested on our systems:

date -s "`date -u`"

]
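For anyone who wants to script this across a lot of boxes, here is a minimal sketch of the same fix wrapped in a shell function. The `reset_clock` name and the `DRY_RUN` switch are made up for illustration and are not part of any tool; `$(...)` is just the modern spelling of the backticks above. Actually setting the clock needs root, and the sketch assumes GNU date.

```shell
#!/bin/sh
# Sketch: re-set the system clock to the time it already has.
# reset_clock and DRY_RUN are hypothetical names used for illustration only.
reset_clock() {
    # Capture the current time in UTC, exactly as `date -u` prints it.
    now="$(date -u)"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run without touching the clock.
        printf 'date -s "%s"\n' "$now"
    else
        # Same effect as: date -s "`date -u`"  (requires root, GNU date)
        date -s "$now"
    fi
}
```

Running `DRY_RUN=1 reset_clock` only prints the command it would run, which is handy for a sanity check before pushing it out; plain `reset_clock` as root applies it.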
Been fighting a little fire that I’m sure hundreds if not thousands of others are fighting as well. It started just before midnight UTC, when a leap second was inserted into our systems. That seems to have tripped a race condition in Linux, one that I assume most people thought was fixed, but I guess nobody tested it.

[3613992.610268] Clock: inserting leap second 23:59:60 UTC
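If you want to check whether a given box saw the leap second, grepping the kernel log for that message should be enough. A rough sketch, assuming the message text matches the line above; `leap_second_logged` is a made-up helper name, and it reads log text on stdin so it can be fed from `dmesg` or from a saved log file.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if the log text on stdin contains the
# leap second insertion message shown above.
# leap_second_logged is a hypothetical helper name.
leap_second_logged() {
    grep -q 'Clock: inserting leap second'
}

# Example use (commented out so nothing runs when the file is sourced):
# dmesg | leap_second_logged && echo "leap second was inserted on this host"
```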

 

The behavior, as I’m sure you’re all aware by now, is a spike in CPU usage. Normally our systems average under 8% CPU usage, and this pushed them up tenfold. Fortunately vSphere held up and we had the capacity to eat it; the resource pools helped make sure production got its share of CPU power. There was only minimal impact to customers, and our external alerting never even went off, which was a good sign.

CPU spike on a couple hundred VMs, all at the same time (the above cluster has 441 GHz of CPU resources)

We were pretty lost at first. Fortunately my co-worker had a thought that maybe it was leap second related. We dug into things more and eventually came across this page (thanks, Google, for being up to date), which confirmed the theory and confirmed we weren’t the only ones impacted. Fortunately our VMs run on a hypervisor that was not affected by the issue, so we saw no problems on the bare metal, only in the VMs. From the page:

Just today, Sat June 30th – starting soon after the start of the day GMT. We’ve had a handful of blades in different datacentres as managed by different teams all go dark – not responding to pings, screen blank.

They’re all running Debian Squeeze – with everything from stock kernel to custom 3.2.21 builds. Most are Dell M610 blades, but I’ve also just lost a Dell R510 and other departments have lost machines from other vendors too. There was also an older IBM x3550 which crashed and which I thought might be unrelated, but now I’m wondering.

It wasn’t long after that we started getting more confirmations of the issue from pretty much everyone out there. We haven’t dug into the root cause at this point; we’ve been busy rebooting Linux VMs, which seems to be a good workaround (we didn’t need the steps indicated on the page). Even our systems that were up to date with kernel patches as recently as a month ago were impacted. Red Hat is apparently issuing a new advisory for their systems, since they were impacted as well.

Some systems behaved well under the high load; others were so unresponsive they had to be power cycled. There was usually one process chewing through an abnormal amount of CPU. On the systems I saw it was mostly Splunk and autofs. I think that was just coincidence, though: perhaps those were the processes using CPU at the instant the leap second was inserted.
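To spot the process doing the chewing on a struggling box, something like the following can help. It is only a sketch: `top_cpu` is a made-up name, and it expects `pid pcpu command` lines on stdin, such as those produced by `ps -eo pid=,pcpu=,comm=`.

```shell
#!/bin/sh
# Sketch: print the N busiest processes from "pid pcpu command" lines on stdin.
# top_cpu is a hypothetical helper name; N defaults to 5.
top_cpu() {
    # Sort numerically on the second column (percent CPU), highest first,
    # then keep only the first N lines.
    LC_ALL=C sort -k2,2nr | head -n "${1:-5}"
}

# Example use (commented out so nothing runs when the file is sourced):
# ps -eo pid=,pcpu=,comm= | top_cpu 5
```

One caveat: `ps` reports CPU averaged over the lifetime of the process, so a sudden spike in a long-running daemon can look smaller here than it does in top.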

The internet is in the midst of a massive reboot. I pity the foo who has a massive number of systems and has to co-ordinate some complex massive reboot (unless there is another way – for me reboot was simplest and fastest).

I for one was not aware that a leap second was coming, or of the potential implications, and it’s obvious I’m not alone. I do recall leap seconds in the past not causing issues for any of the systems I managed. I logged into my personal systems, including the one that powers this blog, and there are no issues on them. My laptop runs Ubuntu 10.04 as well (same OS rev as the servers I’ve been rebooting for the past 2 hours) and there are no issues there either (been using it all afternoon).

Maybe someday someone will explain to me in a way that makes sense why we give a crap about adding a second. I really don’t care if the world is out of sync by a few seconds with the rest of the known universe. If it’s that important we should have a scientific time or something, and let the rest of the normal folks go about their way. Same goes for daylight saving time. Imagine the power bill as a result of this fiasco, with thousands to hundreds of thousands of servers spiking to 100% CPU usage all at the same time.

Microsoft will have a field day with this one I’m sure 🙂

 

5 Comments

  1. Sometimes it’s nice to know that you are not the only person on the planet with a big lot of servers acting up 🙁

    Comment by Jan Hugo Prins — June 30, 2012 @ 8:39 pm

  2. Always good to know! I can’t express the relief I felt when I came across that other web site with the quote I have in the post. These systems are still fairly new and I was wondering if we came across a new bug or something, since I have never seen anything like this in my roughly 17 years of managing systems.

    btw, thanks for the post!!

    hope your systems recover soon.

    Comment by Nate — June 30, 2012 @ 8:42 pm

  3. I am running Kernel 3.2.12

    => affected!

    Comment by sysop — July 1, 2012 @ 12:03 am

  4. Maybe someday someone will explain to me in a way that makes sense why we give a crap about adding a second, I really don’t care if the world is out of sync by a few seconds with the rest of the known universe, if it’s that important we should have a scientific time or something, and let the rest of the normal folks go about their way

    — that attitude is the single and only reason why bugs like that go lost.

    Comment by darkfader — July 1, 2012 @ 11:46 am

  5. thanks for the comment!

    I could never do QA myself, but I firmly believe that daylight saving time, leap seconds, leap years, all of that is a complete and total waste of time. If we got rid of them we wouldn’t have to care about those bugs to begin with. Think of the millions of devices that may have to be updated; computer servers are just the most basic and easiest to update. I believe it was pretty well documented that when DST was changed in the U.S. a few years back, it had pretty adverse effects: not only was everyone frantically updating their stuff (things like traffic lights were especially painful, I believe), but power usage actually went up instead of down, which was the opposite of what they had hoped.

    make it simpler, and life will be better. Ask pretty much anyone other than a hard core scientist if a leap second matters to them or their lives or their business and they’ll say no. All people care about is that time is in sync with each other, they don’t care if it’s in sync with the aliens in outer space.

    Comment by Nate — July 2, 2012 @ 12:57 am
