TechOpsGuys.com Diggin' technology every day

September 8, 2011

What an outage..

Filed under: Random Thought — Tags: — Nate @ 10:32 pm

I’ve caused my share of outages, whether in applications, systems, networking, or storage. My ratio of fixing outages to causing outages is quite good though, so overall I think I do alright.

But every time I am the cause of an outage it’s hard not to feel guilty in some way, right? Even if it was an honest mistake. I was just looking at the local news, which was reporting on the power outage in southern California and Arizona. They mentioned that an Arizona power company believes an employee working at a substation triggered the cascading failure, causing:

  • A power outage for up to 5 million people in two states
  • A ruined commute for those in San Diego tonight
  • The shutdown of a San Diego airport
  • School closures in San Diego tomorrow
  • Even a nuclear reactor being taken offline for safety

I’m not sure what kind of person this employee is, of course. It may have just been an honest mistake, or it may not have been a mistake at all; maybe they were doing exactly the right thing and something failed, who knows. But I certainly do feel for them, the sheer level of guilt has got to be hard to bear.

But at the same time, how many people can brag that they single-handedly took out a nuclear reactor?

I suppose the bigger issue is the design of the grid, and how one fault can cascade to impact so many. It’s reported that the outage has even spread to the northern portion of Mexico as well. Stuff like this really makes me fear the wide-scale deployment of the “smart grid” stuff, which I believe will make the grid far, far more vulnerable than it already is today.

2 Comments

  1. This outage reminds me of the one on the east coast a number of years ago. The way the power grid works seems to be less and less about fault tolerance and redundancy and more about how to get power from one part of the grid to another part of the grid.

    Sometimes, it feels like we are running on a RAID-0 striped array, if you know what I mean.

    It’s funny that it is going to be pinned on one employee, when the real issue should be why the system as a whole is showing such poor resilience, especially after what happened on the east coast and what recently happened in Fukushima, Japan.

    The fault at the sub-station should have been routed around and the system should have suffered no impact. Instead, we see the opposite: magnified impact. My guess is that the sub-station in question represents a choke point for power distribution throughout that region. Not unlike the long haul power that runs through a good part of California, across the bay, and into San Francisco.

    I’m guessing it is a lot like that game, “World of Goo”, where each node is tenuously connected to the next node, and as cities develop/evolve, the paths sometimes get compacted and collapsed down. I.e., 5 pathways between region A and region B get reduced down to 2 pathways, to cut costs and to allow for expansion of grid coverage to C and D. Those 2 pathways may originally have been able to carry the load from A to B, even if one path failed and everything had to rely on a single leg. But with the addition of coverage for C and D, as well as unexpected growth in A and B or shifting of power sourcing, failure of one path means a systemic failure for all: the remaining pathways don’t have sufficient spare capacity to spread the load, causing a cascade failure.

    Of course, the same problem extends to other situations, such as virtualization failover setups: surviving the failure of one node is contingent on the remaining nodes having sufficient capacity to take on the load of the failed node. More importantly, there needs to be sufficient capacity in any one node to take on the largest VM in the failed node… otherwise… cascade failure of your virtualization cluster as each node takes a shot at starting up the VM in question. Ah… fond memories of that… 🙂

    Capacity planning and fault tolerance/HA planning… you’d think that it would be a hard requirement for critical infrastructure… but sometimes, even the best of plans go sideways over time or if other factors come into play in the decision making process… -_-;;

    Comment by Wing Wong — September 9, 2011 @ 6:57 am
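The virtualization point in the comment above can be made concrete with a small check. This is only a rough sketch under assumptions that are not in the original post: memory is treated as the only resource, and the `Node` class, the `survives_failure` and `is_n_minus_1_safe` helpers, and the example cluster are all made up for illustration. Placement uses first-fit decreasing, a simple heuristic that can occasionally report "unsafe" when a cleverer packing exists, so read a `False` as a warning rather than a proof.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical host: total memory plus the memory of each running VM."""
    name: str
    capacity_gb: int
    vms_gb: list[int] = field(default_factory=list)

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - sum(self.vms_gb)


def survives_failure(nodes: list[Node], failed: str) -> bool:
    """True if every VM on the failed node fits on some surviving node."""
    orphans = next(n for n in nodes if n.name == failed).vms_gb
    free = {n.name: n.free_gb for n in nodes if n.name != failed}
    # First-fit decreasing: try to place the largest VM first.
    for vm in sorted(orphans, reverse=True):
        target = next((name for name, gb in free.items() if gb >= vm), None)
        if target is None:
            return False  # no single survivor has room for this VM
        free[target] -= vm
    return True


def is_n_minus_1_safe(nodes: list[Node]) -> bool:
    """True if the cluster can absorb the loss of any one node."""
    return all(survives_failure(nodes, n.name) for n in nodes)


# Illustrative cluster: three 64 GB hosts.
cluster = [
    Node("a", 64, [48]),      # one large VM
    Node("b", 64, [16, 16]),
    Node("c", 64, [16, 16]),
]

if __name__ == "__main__":
    spare = sum(n.free_gb for n in cluster if n.name != "a")
    print("aggregate spare without a:", spare, "GB")
    print("survives loss of a:", survives_failure(cluster, "a"))
    print("N-1 safe:", is_n_minus_1_safe(cluster))
```

In this made-up cluster, losing host `a` leaves 64 GB of spare memory in total, which looks like plenty for a 48 GB VM, yet neither surviving host has more than 32 GB free, so the VM cannot be restarted anywhere. That is the "largest VM" trap described in the comment: aggregate headroom is not the same as headroom on any one node.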

  2. “Skynet IS THE VIRUS!!!” – John Connor, Terminator 3 – Rise of the Machines

    Comment by Emory — September 13, 2011 @ 12:06 pm
