[UPDATE - I've been noticing some people claim that kernels newer than 2.6.29 are not affected. Well, I've got news for you: I have 200+ VMs running 2.6.32 that say otherwise (one person in the comments mentions kernel 3.2 is impacted too!) ]
[ UPDATE 2 - here is a less invasive fix that my co-worker has tested on our systems: ]
date -s "`date -u`"
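If you'd rather wrap that one-liner in something you can try safely first, here is a minimal sketch. The `reset_clock` function and the `DRY_RUN` variable are names I made up for this example; it assumes GNU `date` and root privileges when run for real, and by default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the UPDATE 2 workaround: set the system clock to its current value.
# reset_clock and DRY_RUN are made-up names for this example, not a standard tool.
# DRY_RUN defaults to 1, so nothing changes unless you explicitly opt in.
reset_clock() {
    now="$(date -u)"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Show the command instead of running it.
        printf 'would run: date -s "%s"\n' "$now"
    else
        # Needs root; this is the same command as the one-liner above.
        date -s "$now"
    fi
}

reset_clock
```

Run it as-is to see the command it would issue, then run it with `DRY_RUN=0` as root to actually apply the fix.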
I've been fighting a little fire that I'm sure hundreds if not thousands of others are fighting as well. It started just before midnight UTC, when a leap second was inserted into our systems. That seems to have tripped a race condition in Linux that I assume most people thought was fixed, but I guess nobody tested it.
[3613992.610268] Clock: inserting leap second 23:59:60 UTC
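If you want to check whether a given box logged that message, a rough sketch like the following should do. The `had_leap_second` function is a name I made up; it just greps whatever kernel log text you feed it on stdin for the message shown above, and the sample line is the one from our logs.

```shell
#!/bin/sh
# Returns 0 if the kernel log text on stdin contains the leap second message.
# had_leap_second is a made-up helper name for this example.
had_leap_second() {
    grep -q 'Clock: inserting leap second'
}

# Self-check using the log line from this post.
sample='[3613992.610268] Clock: inserting leap second 23:59:60 UTC'
if printf '%s\n' "$sample" | had_leap_second; then
    echo "leap second was inserted"
fi
```

On a live system you would feed it the real kernel log instead, for example `dmesg | had_leap_second`. Keep in mind the message only tells you the leap second was inserted, not that the box is actually misbehaving.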
The behavior, as I'm sure you're all aware by now, is a spike in CPU usage. Normally our systems average under 8% CPU usage, and this pushed them up roughly tenfold. Fortunately vSphere held up and we had the capacity to eat it; the resource pools helped make sure production got its share of CPU power. There was only minimal impact to customers, and our external alerting never even went off, which was a good sign.
We were pretty lost at first. Fortunately my co-worker had a thought that it might be leap second related. We dug into things more and eventually came across this page (thanks Google for being up to date), which confirmed the theory and confirmed we weren't the only ones impacted. Fortunately our VMs run on a hypervisor that was not affected by the issue, so we had no problems on the bare metal, only in the VMs. From the page:
Just today, Sat June 30th - starting soon after the start of the day GMT. We've had a handful of blades in different datacentres as managed by different teams all go dark - not responding to pings, screen blank.
They're all running Debian Squeeze - with everything from stock kernel to custom 3.2.21 builds. Most are Dell M610 blades, but I've also just lost a Dell R510 and other departments have lost machines from other vendors too. There was also an older IBM x3550 which crashed and which I thought might be unrelated, but now I'm wondering.
It wasn't long after that we started getting more confirmations of the issue from pretty much everyone out there. We haven't dug into the root cause at this point; we've been busy rebooting Linux VMs, which seems to be a good workaround (we didn't need the steps indicated on the page). Even our systems that were up to date with kernel patches as recently as a month ago were impacted. Red Hat is apparently issuing a new advisory for their systems, since they were impacted as well.
Some systems behaved well under the high load; others were so unresponsive they had to be power cycled. There was usually one process chewing through an abnormal amount of CPU. On the systems I saw it was mostly Splunk and autofs. I think that was just coincidence though, perhaps those were simply the processes using CPU at the instant the leap second was inserted.
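For spotting which process is doing the chewing on a given box, something like this sketch works. The `top_cpu` function is a name I made up; it sorts "PID %CPU COMMAND" lines from stdin by the CPU column and prints the busiest few. The sample lines are invented data for illustration, not output from our systems.

```shell
#!/bin/sh
# Reads "PID %CPU COMMAND" lines on stdin and prints the N busiest (default 5).
# top_cpu is a made-up helper name for this example.
top_cpu() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Invented sample data, just to show the sorting.
printf '%s\n' '101 3.0 sshd' '202 98.5 splunkd' '303 87.1 automount' | top_cpu 2
```

On a real system with procps you would feed it live data, for example `ps -eo pid=,pcpu=,comm= | top_cpu 5` (the trailing `=` signs suppress the header line).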
The internet is in the midst of a massive reboot. I pity the foo who has a massive number of systems and has to co-ordinate some complex massive reboot (unless there is another way - for me reboot was simplest and fastest).
I for one was not aware that a leap second was coming, or of the potential implications, and it's obvious I'm not alone. I do recall leap seconds in the past not causing issues for any of the systems I managed. I logged into my personal systems, including the one that powers this blog, and there are no issues on them. My laptop runs Ubuntu 10.04 as well (same OS rev as the servers I've been rebooting for the past 2 hours) and there are no issues there either (I've been using it all afternoon).
Maybe someday someone will explain to me in a way that makes sense why we give a crap about adding a second. I really don't care if the world is out of sync by a few seconds with the rest of the known universe. If it's that important we should have a scientific time or something, and let the rest of the normal folks go about their way. Same goes for daylight saving time. Imagine the power bill as a result of this fiasco, with thousands to hundreds of thousands of servers spiking to 100% CPU usage all at the same time.
Microsoft will have a field day with this one, I'm sure.