TechOpsGuys.com Diggin' technology every day

1Jun/12Off

London Internet Exchange downed by Loop

TechOps Guy: Nate

This probably doesn't happen very often at these big internet exchanges but found the news sort of interesting.

I had known  for a few years that the LINX was a dual vendor environment, one side was Foundry/Brocade the other was Extreme, they are one of the few places that go out of their way to advertise what they use. I'm sure it gets them a better discount :)   It seems the LINX replaced the Foundry/Brocade with Juniper at some point since I last checked(less than a year ago). Though their site still mentions usage of EAPS (Extreme's ring protocol) and MRP (Foundry's ring protocol). I assume Juniper has not adopted MRP, though they probably have something similar. Looking at the design of the Juniper LAN vs the Extreme LAN (and the Brocade LAN before Juniper), the Juniper one looks a lot more complicated.  I wonder if they are using Juniper's new protocol(s) to manage it? Qfabric I think it's called? It seems LINX still has some Brocade in one of their edge networks.

Apparently the Juniper side is what suffered the loop -

"Linx is trying to determine where the loop originated and we are also addressing why the protection on Juniper's LAN didn't work."

I wanted to point out again, since it's been a while since I covered it (and only then was it buried in the post, wasn't part of the title), that Extreme has a protocol (that as far as I know is unique - let me know if there is another vendor or protocol that is similar - note of course I am not referring to anything like STP) that can detect and recover(in some cases) loops automatically. I've only used it in detect mode to-date. I was also telling someone about this protocol who was learning the ropes on Extreme gear after coming from a Juniper background so thought I would mention it again.

The protocol is the Extreme Loop Recovery Protocol (ELRP). The documentation does a better job at explaining it than I can.

The Extreme Loop Recovery Protocol (ELRP) is used to detect network loops in a Layer 2 network. A switch running ELRP transmits multicast packets with a special MAC destination address out of some or all of the ports belonging to a VLAN. All of the other switches in the network treat this packet as a regular, multicast packet and flood it to all of the ports belonging to the VLAN.

When the packets transmitted by a switch are received back by that switch, this indicates a loop in the Layer 2 network. After a loop is detected through ELRP, different actions can be taken such as blocking certain ports to prevent loop or logging a message to system log. The action taken is largely dependent on the protocol using ELRP to detect loops in the network.

The design seems simple enough to me, I'm not sure why others haven't come up with something similar (or if they have let me know!)

It's rare to have a loop in a data center environment but I do remember a couple loops I came across in an office environment many years ago that ELRP helped trace down. I'm not sure what method one would use to trace down a loop without something like ELRP - perhaps just looking at port stats and trying to determine where the bulk of the traffic is and disabling ports or unplugging cables until it stops.

[Tangent]

I remember an outage one company I was at took one time to upgrade some of our older 10/100 3COM switches to gigabit Extreme switches. It was a rushed migration, I was working with the network engineer that we had, the switches were installed in a centralized location with tons of cables, none of which were labeled. So I guess it comes as little surprised while during the migration someone (probably me) happened to plug the same cable back into one of the switches causing a loop. It took a few minutes to track down, at one point our boss was saying get ready to roll back. The network engineer and I looked at each other and laughed there was no roll back, well not one that was going to be smooth it would of taken another hour of downtime to remove the Extreme switches and re-install the 3COM and re-cable stuff. Fortunately I found the loop. This was about a year or so before I was aware of the existence of ELRP. We discovered the loop mainly after all the switch lights started blinking in sequence, normally a bad thing. Then users reported they lost all connectivity.

One of my friends who is another network engineer told me a story when I was in Atlanta earlier in the year about a customer who was a university or something. They had major network performance problems but could not track them down. These problems had been going on for literally months. My friend went out as a consultant and they brought him into their server/network room and his jaw dropped, they had probably 2 dozen switches and ALL of them were blinking in sequence. He knew what the problem was right away and informed the customer. But the customer was adamant that the lights were supposed to blink that way and the problem was elsewhere(not kidding here). The customer had other issues like running overlapping networks on the same VLAN etc. My friend had a lot of suggestions for the customer but the customer felt insulted by him telling them their network had so many problems so they kicked him out and told the company not to send him back. A couple months later the customer went through some sort of audit process and failed miserably and grudgingly asked (begged) to get my friend back since he was the only one they knew that seemed to know what he was doing. He went back and fixed the network I assume (I forgot that last bit of the story).

[End Tangent]

ELRP can detect a loop immediately and give a very informative system log entry as to the port(s) the loop is occurring on so you can take action. It works best of course if it is running on all ports, so you can pinpoint down to the edge port itself. But if for some reason the edge is not an Extreme switch at least you can get it at a higher layer and can isolate it further that way.

You can either leave it running periodically every X seconds it will send a probe out, or you can run it on demand for a real time assessment. There is also integration with ESRP which I wrote about a while ago, although I don't use the integrated mode (see the original post as to how that works and why). I normally leave it running sending requests out at least say once every 30 seconds.

LINX had another outage (which was the last time I looked at their vendor stats) a couple of years ago (this one affected me since my company had gear hosted in London at the time and our monitors were tripped by this event), though no mention of which LAN the outage occurred on. One user wrote

It wasn't a port upgrade, a member's session was being turned up and due to miscommunication between the member's engineer and the LINX engineer a loop was somehow introduced in to the network which caused a broadcast storm and a switches CPU to max out cue packet loss and dropped BGP sessions.

As a cause to the outage that occurred two years ago. So I guess it was another loop! For all I know LINX is not running ELRP in their environment either.

It's not exactly advertised by Extreme if you talk to them, it's one of those things that's buried in the docs. Same goes for ESRP. Two really useful protocols that Extreme almost never mentions, two things that make them stand out in the industry and they don't talk about them. I'm told that one reason could be is they are proprietary(vs EAPS which is not and Extreme touts EAPS a lot but EAPS is layer 2 only!), though as I have mentioned in the past ESRP doesn't require any software at the edge to function and can support managed and unmanaged devices. So you don't require an Extreme-only network to run (just at the core, like most any other protocol). ELRP is even less stringent - can be run on any Extreme switch, no interoperability issues. If there were open variants of the protocols that'd be better of course, but again, these seem to be unique in the industry so tout what you got! Customers don't have to use them if they don't want to and it can make a network administrator's life vastly simpler in many cases by leveraging what you have available to you. Good luck integrating Extreme or Cisco or Brocade into Juniper's Qfabric ? Or into Force10's distributed core setup ? There are interoperability issues abound with most of the systems out there.