TechOpsGuys.com – Diggin' technology every day

March 9, 2010

Yawn..

Filed under: Networking — Tags: — Nate @ 9:42 am

I was just watching my daily morning dose of CNBC and they had all these headlines about how Cisco was going to make some earth-shattering announcement ("Change the internet forever"), and then the announcement hit: a new CRS-3 router, which claimed 12x faster performance than the competition. So naturally I was curious. Bob Pisani on the floor of the NYSE was saying how amazing it was that the router could download the Library of Congress in 1 second (he probably didn't understand the router would have no place to put it).

If I want a high-end router that means I'm a service provider, and in that case my personal preference would be for Foundry Networks (now Brocade). Juniper makes good stuff too of course, though honestly I am not nearly as well versed in their technology. Granted, I'll probably never work for such a company, since those companies are really big and I prefer small companies.

But in any case I wanted to illustrate (another) point. According to Cisco's own site, their fastest single-chassis system has a mere 4.48 terabits of switching capacity. This is the CRS-3, which I don't even see listed as a product on their site; perhaps it's yet to come. The biggest, baddest product they have on their site right now is a 16-slot CRS-1. That, according to their own site, has a total switching capacity of a paltry 1.2Tbps, and even worse, a per-slot capacity of 40Gbps (hello 2003).

So take a look at Foundry Networks (the Brocade name makes me shudder, I have never liked them) and their NetIron XMR series. From their documentation the "total switching fabric" ranges from 960 gigabits on the low end to 7.68 terabits on the high end. Switch forwarding capacity ranges from 400 gigabits to 3.2 terabits. This comes out to 120 gigabits of full-duplex switch fabric per slot (same across all models). While I haven't been able to determine precisely how long the XMR has been on the market, I have found evidence that it is at least nearly 3 years old.

To put it in another perspective: in a 48U rack with the new CRS-3 you can get 4.48 terabits of switching fabric (one chassis is 48U). With Foundry in the same rack you can get one XMR 32000 and one XMR 16000 (combined size 47U) for a total of 11.52 terabits of switching fabric. More than double the fabric in the same space, from a product that is 3 years old. And as you can imagine, in the world of IT 3 years is a fairly significant amount of time.
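If you want to sanity-check that rack math, here is a quick back-of-the-envelope sketch (Python; the fabric figures are the vendor-cited numbers from above, and the XMR 16000 figure is derived from the 11.52 terabit combined total):

crs3 = 4.48              # Tbps of fabric, one 16-slot CRS-3 chassis (48U)
xmr = 7.68 + 3.84        # Tbps, XMR 32000 + XMR 16000 (47U combined)
print(f"Foundry: {xmr:.2f} Tbps vs Cisco: {crs3:.2f} Tbps per rack"
      f" ({xmr / crs3:.1f}x)")   # ~2.6x, i.e. more than double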

And while I'm here talking about Foundry and Brocade, take a look at this from Brocade; it's funny, it's like something I would write. It compares the Brocade Director switches vs Cisco ("Numbers don't lie"). One of my favorite quotes:

To ensure accuracy, Brocade hired an independent electrician to test both the Brocade 48000 and the Cisco MDS 9513 and found that the 120 port Cisco configuration actually draws 1347 watts, 45% higher than Cisco’s claim of 931 watts. In fact, an empty 9513 draws more electrical current (5.6 amps) than a fully-populated 384 port Brocade 48000 (5.2 amps). Below is Brocade’s test data. Where are Cisco’s verified results?

Another:

With 33% more bandwidth per slot (64Gb vs 48Gb), three times as much overall bandwidth (1.5Tb vs 0.5 Tb) and a third the power draw, the Brocade 48000 is a more scalable building block, regardless of the scale, functionality or lifetime of the fabric. Holistically or not, Brocade can match the "advanced functionality" that Cisco claims, all while using far less power and for a much [the quote ends mid-sentence; I think whoever wrote it was in a hurry]

That’s just too funny.

March 1, 2010

The future of networking in hypervisors – not so bright

Filed under: Networking,Virtualization — Nate @ 10:15 pm

UPDATED Some networking companies see that they are losing control of data center networks when it comes to blades and virtualization. One has reacted by making their own blades; others have come up with strategies and collaborated on standards to try to take back the network by moving the traffic back into the switching gear. Yet another has licensed their OS to have another company make blade switches on their behalf.

Where at least part of the industry wants to go is to move the local switching out of the hypervisor and back into the Ethernet switches. Now this makes sense for the industry, because they are losing their grip on the network when it comes to virtualization. But this is going backwards in my opinion. Several years ago we had big chassis switches with centralized switch fabrics where (I believe, kind of going out on a limb here) if port 1 on blade 1 wanted to talk to port 2 on that same blade, the traffic had to go back to the centralized fabric before port 2 would see it. That's a lot of distance to travel. Fast forward a few years and now almost every vendor is advertising local switching, which eliminates this trip. It makes things faster, and more scalable.

Another similar evolution in switching design was moving from backplane systems to midplane systems. I only learned about some of the specifics recently; prior to that I really had no idea what the difference was between a backplane and a midplane. But apparently the idea behind a midplane is to drive significantly higher throughput on the system by putting the switching fabric closer to the line cards. An inch here, an inch there could mean hundreds of gigabits of lost throughput, or increased complexity/line noise etc. in order to achieve those high throughput numbers. Again, the idea is moving the fabric closer to what needs it, in order to increase performance.

You can see examples of midplane systems in blades with the HP c7000 chassis, or in switches with the Extreme Black Diamond 20808 (page 7). Both of them have things that plug into both the front and the back. I thought that was mainly due to space constraints on the front, but it turns out it seems more about minimizing the distance between the fabric on the back and the thing using the fabric on the front. Also note that the fabric modules on the rear are horizontal while the blades on the front are vertical; I think this allows the modules to further reduce the physical distance between the fabric and the device at the other end by directly covering more slots, so there is less distance to travel on the midplane.

Which brings me back to moving the switching out of the hypervisor: if VM #1 wants to talk to VM #2 on the same host, having that traffic go outside of the server, make a U-turn, and come right back in is stupid. Really stupid. It's the industry grasping at straws trying to maintain control when they should be innovating. It goes against the two evolutions in switching design I outlined above.

What I've been wanting to see myself is to integrate the switch into the server: have an X GbE chip that has the switching fabric built into it. Most modern network operating systems are pretty modular and portable (a lot of them seem to be based on Linux or BSD). I say integrate it onto the blade for best performance, and maybe use the distributed switch framework (or come up with some other more platform-independent way to improve management). The situation will only get worse in coming years: with VM servers potentially having hundreds of cores and TBs of memory at their disposal, you're at the point now where you can practically fit an entire rack of traditional servers onto one hypervisor.

I know that Extreme, for example, uses Broadcom in almost all of their systems, and Broadcom is what most server manufacturers use for their network adapters; even HP's Flex10 seems to be based on Broadcom. How hard can it be for Broadcom to make such a chip(set) so that companies like Extreme (or whomever else might use Broadcom in their switches) could program it with their own stuff to make it a mini switch?

From the Broadcom press release above (2008):

To date, Broadcom is the only silicon vendor with all of the networking components (controller, switch and physical layer devices) necessary to build a complete end-to-end 10GbE data center. This complete portfolio of 10GbE network infrastructure solutions enables OEM partners to enhance their next generation servers and data centers.

Maybe what I want makes too much sense and that’s why it’s not happening, or maybe I’m just crazy.

UPDATE – I just wanted to clarify my position here. What I'm looking for is essentially to offload the layer 2 switching functionality from the hypervisor to a chip on the server itself, whether that's a special 10GbE adapter that has switching fabric or a dedicated add-on card which only has the switching fabric. I'm not interested in offloading layer 3 stuff; that can be handled upstream. I am also interested in integrating things like ACLs, sFlow, QoS, rate limiting and perhaps port mirroring.
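To make that concrete, here is a toy sketch of the idea (plain Python with made-up names, not any real vendor API): an adapter-resident layer 2 switch that forwards VM-to-VM frames locally and only sends everything else up the wire.

class AdapterSwitch:
    """Toy model of a NIC with on-board layer 2 switching fabric."""
    def __init__(self):
        self.mac_table = {}          # MAC address -> local virtual port

    def learn(self, mac, vport):
        self.mac_table[mac] = vport

    def forward(self, dst_mac):
        # Local destination? Switch it on the adapter, no U-turn needed.
        if dst_mac in self.mac_table:
            return f"local vport {self.mac_table[dst_mac]}"
        return "uplink"              # let the upstream switch handle it

sw = AdapterSwitch()
sw.learn("00:0c:29:aa:aa:aa", 1)         # VM #1
sw.learn("00:0c:29:bb:bb:bb", 2)         # VM #2
print(sw.forward("00:0c:29:bb:bb:bb"))   # -> local vport 2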

ProCurve – Not my favorite

Filed under: Networking,Virtualization — Nate @ 10:06 pm

I gotta find something new to talk about, after this..

I was thinking this evening about my UCS/HP network shootout post from over the weekend, and thought maybe I came across too strong in favor of HP's networking gear.

As all three of you know, HP is not my favorite networking vendor. Not even my second favorite, or even my third.

But they do have some cool technology with this VirtualConnect stuff. I only wish blade interfaces were more standardized.

February 28, 2010

VMware dream machine

Filed under: Networking,Storage,Virtualization — Tags: , , , , , , — Nate @ 12:47 am

(Originally titled "forty-eight all round", but I like "VMware dream machine" more)

UPDATED I was thinking more about the upcoming 12-core Opterons and the next generation of HP c Class blades, and thought of a pretty cool configuration to have, hopefully it becomes available.

Imagine a full-height blade that is quad socket, 48 cores (91-115GHz), 48 DIMMs (192GB with 4GB sticks), 4x10Gbps Ethernet links and 2x4Gbps Fibre Channel links (a total of 48Gbps of full-duplex bandwidth). The new Opterons support 12 DIMMs per socket, allowing the 48 DIMM slots.

Why 4x10Gbps links? Well, I was thinking why not.. with full-height blades you can only fit 8 blades in a c7000 chassis. If you put in a pair of 2x10Gbps switches that gives you 16 ports. It's not much more $$ to double up on 10Gbps ports, especially if you're talking about spending upwards of say $20k on the blade (guesstimate) and another $9-15k on vSphere software per blade. And 4x10Gbps links gives you up to 16 virtual NICs per blade using VirtualConnect, each of them adjustable in 100Mbps increments.
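To illustrate the carving, here is a hypothetical allocation in Python (not a recommended layout; the 4-FlexNICs-per-port figure follows from the 16 virtual NICs over 4 ports above): each physical 10Gbps port splits into FlexNICs in 100Mbps steps, as long as the total stays within the port.

STEP_MBPS = 100                      # VirtualConnect adjustment granularity
PORT_MBPS = 10_000                   # one physical 10GbE port
alloc = [2000, 4000, 500, 3500]      # example FlexNIC split on one port (Mbps)

assert all(a % STEP_MBPS == 0 for a in alloc)   # 100Mbps increments only
assert sum(alloc) <= PORT_MBPS                  # cannot exceed the port
print(f"4 ports x 4 FlexNICs = 16 virtual NICs; port 1 carved as {alloc}")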

Also, given that it is a full-height blade, you have access to two slots' worth of I/O, which translates into 320Gbps of full-duplex fabric available to a single blade.

That kind of blade ought to handle just about anything you can throw at it. It's practically a supercomputer in and of itself. Right now HP holds the top spot for VMmark scores, with an 8-socket, 6-core system (48 total cores) outpacing even a 16-socket, 4-core system (64 total cores).

The 48 CPU cores will give the hypervisor an amazing number of combinations for scheduling vCPUs. Here's a slide from a presentation I attended last year which illustrates the concept behind the hypervisor scheduling single- and multi-vCPU VMs:

There is a PDF out there from VMware that talks about the math formulas behind it all; it has some interesting commentary on CPU scheduling with hypervisors:

[..]Extending this principle, ESX Server installations with a greater number of physical CPUs offer a greater chance of servicing competing workloads optimally. The chance that the scheduler can find room for a particular workload without much reshuffling of virtual machines will always be better when the scheduler has more CPUs across which it can search for idle time.

This is even cooler though; honestly, I can't pretend to understand the math myself:

Scheduling a two-VCPU machine on a two-way physical ESX Server host provides only one possible allocation for scheduling the virtual machine. The number of possible scheduling opportunities for a two-VCPU machine on a four-way or eight-way physical ESX Server host is described by combinatorial mathematics using the formula N! / (R!(N-R)!) where N=the number of physical CPUs on the ESX Server host and R=the number of VCPUs on the machine being scheduled. A two-VCPU virtual machine running on a four-way ESX Server host provides (4! / (2!(4-2)!)) which is (4*3*2 / (2*2)) or 6 scheduling possibilities. For those unfamiliar with combinatory mathematics, X! is calculated as X(X-1)(X-2)(X-3)…. (X-(X-1)). For example 5! = 5*4*3*2*1.

Using these calculations, a two-VCPU virtual machine on an eight-way ESX Server host has (8! / (2! (8-2)!) which is (40320 / (2*720)) or 28 scheduling possibilities. This is more than four times the possibilities a four-way ESX Server host can provide. Four-vCPU machines demonstrate this principle even more forcefully. A four-vCPU machine scheduled on a four-way physical ESX Server host provides only one possibility to the scheduler whereas a four-VCPU virtual machine on an eight-CPU ESX Server host will yield (8! / (4!(8-4)!) or 70 scheduling possibilities, but running a four-vCPU machine on a sixteen-way ESX Server host will yield (16! / (4!(16-4)!) which is (20922789888000 / ( 24*479001600) or 1820 scheduling possibilities. That means that the scheduler has 1820 unique ways in which it can place the four-vCPU workload on the ESX Server host. Doubling the physical CPU count from eight to sixteen results in 26 times the scheduling flexibility for the four-way virtual machines. Running a four-way virtual machine on a Host with four times the number of physical processors (16-way ESX Server host) provides over six times more flexibility than we saw with running a two-way VM on a Host with four times the number of physical processors (8-way ESX Server host).

Anyone want to try to extrapolate that and extend it to a 48-core system? 🙂
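I'll take my own bait. Here is a quick Python sketch using the formula from the quote (math.comb computes N! / (R!(N-R)!), Python 3.8+):

from math import comb    # comb(n, r) = n! / (r! * (n - r)!)

for vcpus in (1, 2, 4, 8):
    for cores in (8, 16, 48):
        print(f"{vcpus}-vCPU VM on {cores} cores: "
              f"{comb(cores, vcpus):,} scheduling possibilities")

# A 4-vCPU VM on a 48-core host: comb(48, 4) = 194,580 placements,
# over 100 times the 1,820 possibilities of the 16-way host quoted above.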

It seems like only yesterday that I was building DL380 G5 ESX 3.5 systems with 8 CPU cores, 32GB of RAM and 8x1Gbps links, thinking how powerful they were. This would be six of those in a single blade. And it only seems like a couple of weeks ago I was building VMware GSX systems on dual-socket, single-core systems with 16GB of RAM..

So, HP, do me a favor and make a G7 blade that can do this; that would make my day! I know fitting all of those components on a single full-height blade won't be easy. Looking at the existing BL685c blade, it looks like they could do it: remove the internal disks (who needs 'em, boot from SAN or something), and put in an extra 16 DIMMs for a total of 48.

I thought about using 8Gbps Fibre Channel but then it wouldn't be 48 all round 🙂

UPDATE Again I was thinking about this and wanted to compare the costs vs existing technology. I'm estimating roughly a $32,000 price tag for this kind of blade with vSphere Advanced licensing (note you cannot use Enterprise licensing on a 12-core CPU; hardware pricing extrapolated from the existing HP BL685c G6 quad-socket, 6-core blade system with 128GB of RAM). The approximate price of an 8-way, 48-core HP DL785 with 192GB, 4x10GbE and 2x4Gb Fibre Channel with vSphere licensing comes to roughly $70,000 (because VMware charges on a per-socket basis the licensing costs go up fast). Not only that, but you can only fit 6 of these DL785 servers in a 42U rack, while you can fit 32 of these blades in the same rack with room to spare. So less than half the cost, and 5 times the density (for the same configuration). The DL785 has an edge in memory slot capacity, which isn't surprising given its massive size; it can fit 64 DIMMs vs 48 on my VMware dream machine blade.

Compare that to a trio of HP BL495c blades, each with 12 cores and 64GB of memory: approximate pricing for those plus vSphere Advanced is $31,000, for a total of 36 cores and 192GB of memory. So for $1,000 more you can add an extra 12 cores, cut your server count by 66%, probably cut your power usage by some amount and improve consolidation ratios.
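Here is the back-of-the-envelope rack math behind those claims (all prices are my own guesstimates from above, not vendor quotes):

RACK_U = 42

# Dream blades: 10U c7000 chassis, 8 full-height blades, 48 cores per blade
chassis = RACK_U // 10                 # 4 chassis per rack
blades = chassis * 8                   # 32 blades
blade_cores = blades * 48              # 1,536 cores
blade_cost = blades * 32_000

# DL785: 7U each, 48 cores per server
dl785s = RACK_U // 7                   # 6 servers per rack
dl785_cores = dl785s * 48              # 288 cores
dl785_cost = dl785s * 70_000

print(f"Blades: {blades} systems / {blade_cores} cores / ${blade_cost:,}")
print(f"DL785s: {dl785s} systems / {dl785_cores} cores / ${dl785_cost:,}")
print(f"{blades / dl785s:.1f}x density, ${blade_cost / blade_cores:,.0f} vs "
      f"${dl785_cost / dl785_cores:,.0f} per core")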

So to summarize, two big reasons for this type of solution are:

  • More efficient consolidation on a per-host basis by having less “stranded” resources
  • More efficient consolidation on a per-cluster basis, because you can get more capacity within the 32-node limit of a VMware cluster (assuming you want to build a cluster that big..), again addressing the "stranded capacity" issue. Imagine what a resource pool could do with 3.3 THz of compute capacity and 9.2TB of memory, all with line-rate 40Gbps networking throughout, all within a single cabinet?

Pretty amazing stuff to me anyways.

[For reference – Enterprise Plus licensing would add an extra $1250/socket plus more in support fees. VMware support costs not included in above pricing.]

END UPDATE

February 27, 2010

Cisco UCS Networking falls short

Filed under: Networking,Virtualization — Tags: , , , , — Nate @ 4:33 am

UPDATED Yesterday when I woke up I had an email from Tolly in my inbox describing a new report comparing the networking performance of the Cisco UCS vs the HP c Class blade systems. Both readers of the blog know I haven't been a fan of Cisco for a long time (about 10 years, since I first started learning about the alternatives), and I'm a big fan of HP c Class (again, never used it, but planning on it). So as you can imagine I couldn't resist seeing what it said, considering the amount of hype that Cisco has managed to generate for their new systems (the sheer number of blog posts about it makes me feel sick at times).

I learned a couple of things from the report that I did not know about UCS before (I oftentimes just write their solutions off since they have a track record of underperformance, overpricing and needless complexity).

The first was that the switching fabric is external to the enclosure, so if two blades want to talk to each other that traffic must leave the chassis to do so, an interesting concept which can have significant performance and cost implications.

The second is that the current UCS design is 50% oversubscribed, which is what this report targets as a significant weakness of the UCS vs the HP c Class.

The midplane design of the c7000 chassis is something HP is pretty proud of (for good reason): it is capable of 160Gbps full duplex to every slot, totaling more than 5 terabits of fabric. When I talked to them last year they couldn't help but take shots at IBM's blade system, commenting on how it is oversubscribed and how you have to be careful configuring the system because of that oversubscription.

This c7000 fabric is far faster than most high-end chassis Ethernet switches, and should allow fairly transparent migration to 40Gbps Ethernet when the standard arrives, for those that need it. In fact HP already has 40Gbps InfiniBand modules available for c Class.

The test involved six blades from each solution. When testing throughput with four blades both solutions performed similarly (UCS was 0.76Gbit faster). Add two more blades and start jacking up the bandwidth requirements, and HP c Class scales linearly as the traffic goes up; UCS seems to scale linearly in the opposite direction. The end result is that with 60Gbit of traffic being requested (6 blades @ 10Gbps), HP c Class managed to push through 53.65Gbps, while Cisco UCS managed to cough up a mere 27.37Gbps. On UCS, pushing six blades at max performance actually resulted in less performance than four blades at max performance, significantly less, illustrating serious weaknesses in the QoS on the system (again, big surprise!).

The report mentions putting the Cisco UCS in a special QoS mode for the test, because without this mode performance was even worse. There is only 80Gbps of fabric available for use on the UCS (4x10Gbps full duplex). You can get a second fabric module for UCS but it cannot be used for active traffic, only as a backup.

UPDATE – A kind fellow over at Cisco took notice of our little blog here (thanks!!) and wanted to correct what they say is a bad test on the part of Tolly: apparently Tolly didn't realize that the fabrics could be used in active-active (maybe that complexity thing rearing its head, I don't know). But in the end I believe the test results are still valid, just at an incorrect scale. Each blade requires 20Gbps of full-duplex fabric in order to be non-blocking throughout. The Cisco UCS chassis provides 80Gbps of full-duplex fabric, allowing 4 blades to be non-blocking. HP by contrast allows up to three dual-port Flex10 adapters per half-height server, which requires 120Gbps of full-duplex fabric to support at line rate. Given that each slot supports 160Gbps of fabric, you could get another adapter in there, but I suspect there isn't enough real estate on the blade to connect the adapter! I'm sure 120Gbps of Ethernet on a single half-height blade is way overkill, but if it doesn't radically increase the cost of the system, as a techie myself I do like the fact that the capacity is there to grow into.

Things get a little more complicated when you start talking about non-blocking internal fabric (between blades) and the rest of the network, since HP designs their switches to support 16 blades and Cisco designs their fabric modules to support 8. You can see from the picture of the Flex10 switch that there are 8 uplink ports on it, not 16, but it's pretty obvious that is due to space constraints, because the switch is half width. END UPDATE

The point I am trying to make here isn't so much that HP's architecture is superior to Cisco's, or that HP is faster than Cisco. It's that HP is not oversubscribed and Cisco is. In a world where we have had non-blocking switch fabrics for nearly 15 years, it is disgraceful that a vendor would ship a solution where six servers cannot talk to each other without being blocked. I have operated 48-port gigabit switches which have 256 gigabits of switching fabric; that is more than enough for 48 systems to talk to each other in a non-blocking way. There are 10Gbps switches that have 500-800 gigabits of switching fabric, allowing 32-48 systems to talk to each other in a non-blocking way. These aren't exactly expensive solutions either. That's not even considering the higher-end backplane- and midplane-based systems that run into multiple terabits of switching fabric, connecting hundreds of systems at line rate.

I would expect such a poor design from a second-tier vendor, not from a vendor that has a history of making blade switches for several server manufacturers over several years.

So take the worst case: what if you want completely non-blocking fabric for each and every system? For me, I am looking to HP c Class and 10Gbps VirtualConnect mainly for intra-chassis communication within the vSphere environment. In this situation, with a cheap configuration on HP, you are oversubscribed 2:1 when talking outside of the chassis. For most situations this is probably fine, but say that isn't good enough for you. Well, you can fix it by installing two more 10Gbps switches in the chassis (each switch has 8x10GbE uplinks). That will give you 32x10Gbps uplink ports, enough for 16 blades each having 2x10Gbps connections. All line rate, non-blocking throughout the system. That is 320 gigabits vs the 80 gigabits available on Cisco UCS.

HP doesn't stop there: with 4x10Gbps switches you've only used up half of the available I/O slots on the c7000 enclosure. Can we say 640 gigabits of total non-blocking Ethernet throughput vs 80 gigabits on UCS (single chassis for both)? I mean, for those fans of running vSphere over NFS, you could install vSphere on a USB stick or SD card and dedicate the rest of the I/O slots to networking if you really need that much throughput.
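The oversubscription math for these scenarios fits in a few lines of Python (the fabric and uplink figures are the ones cited in this post):

def oversub(server_gbps, fabric_gbps):
    """Ratio of server-facing bandwidth to fabric/uplink bandwidth."""
    return server_gbps / fabric_gbps

# Cisco UCS: 8 blades x 20Gbps of server bandwidth vs 80Gbps of fabric
print(f"UCS:               {oversub(8 * 20, 80):.0f}:1")         # 2:1
# HP c7000 + 2 Flex10 switches: 16 blades x 20Gbps vs 16x10GbE uplinks
print(f"c7000, 2 switches: {oversub(16 * 20, 16 * 10):.0f}:1")   # 2:1
# HP c7000 + 4 Flex10 switches: same blades, 32x10GbE uplinks
print(f"c7000, 4 switches: {oversub(16 * 20, 32 * 10):.0f}:1")   # 1:1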

Of course this costs more than being oversubscribed; the point is that the customer can make this decision based on their own requirements, rather than having the limitation designed into the system.

Now think about this limitation in a larger-scale environment. Think about the vBlock again, from that new EMC/Cisco/VMware alliance. Set aside the fact that it's horribly overpriced (I think mostly due to EMC's side); this system is designed to be used by large-scale service providers. That means unpredictable loads from unrelated customers running on a shared environment. Toss in vMotion and DRS and you could be asking for trouble when it comes to this oversubscription stuff. DRS (as far as I know) relies entirely on CPU and memory usage; at some point I think it will take storage I/O into account as well. I haven't heard of it taking network congestion into account, though in theory it's possible. But it's much better to just have a non-blocking fabric to begin with; it will increase your utilization and efficiency, and allow you to sleep better at night.

It makes me wonder: how does Data Center Ethernet (or whatever it's called this week) hold up under these congestion conditions that the UCS suffers from? Lots of "smart" people spent a lot of time making Ethernet lossless, only to design the hardware so that it incurs significant loss in transit. In my experience systems don't behave in a predictable manner when storage is highly constrained.

I find it kind of ironic that a blade solution from the world's largest networking company would be so crippled when it comes to the networking of the system. Again, not a big surprise to me, but there are a lot of Cisco kids out there who drink the koolaid without thinking twice, and of course I couldn't resist ragging on Cisco again.

I won't bother to mention the recent 10Gbps Cisco Nexus test results that show how easily you can cripple its performance as well (while other manufacturers perform properly at non-blocking line rates); maybe I'll save that for another blog entry.

Just think: there is more throughput available to a single slot in an HP c7000 chassis than there is available to the entire chassis on a UCS. If you give Cisco the benefit of the second fabric module, setting aside the fact you can't use it in active-active, the HP c7000 enclosure has 32 times the throughput capacity of the Cisco UCS. That kind of performance gap even makes Cisco's switches look bad by comparison.

December 9, 2009

AT&T plans on stricter mobile data plans

Filed under: Networking — Nate @ 6:03 pm

You know one thing that really drives me crazy about users? It's those people who think they have a right to megabits, if not tens of megabits, of bandwidth for pennies a month. Those people who complain $50/mo is such a ripoff for 5-10Mbit broadband!

I have always had a problem with unlimited plans myself; I recall in the mid 90s getting kicked off more than a few ISPs for being connected to their modems 24/7 for days on end. The plan was unlimited, so I used it. I asked, even pleaded, for the ISPs to tell me what the real limit was. You know what? Of all the ones I tried at the time there was only one that would. I was in Orange County, California at the time and the ISP was neptune.net. I still recall the owner's answer to this day: he did the math calculating the number of hours in a day/week/month and said that's how many I could use. So I signed up and used that ISP for a few years (until I moved to Washington) and he never complained (and I almost never got a busy signal). I have absolutely no problem paying more for premium service; it's just that I appreciate full disclosure on any service I get, especially if it is advertised as unlimited.

Companies are starting to realize that the internet wasn't built to scale at the edge. It's somewhat fast at the core, but the pipes from the edge to the core are a tiny fraction of what they could be (and if we increased those edge pipes you would need to increase the core by an order, or orders, of magnitude). Take for example streaming video. There is almost non-stop chatter on the net about how people are going to ditch TV and watch everything (or many things) on the internet. I had a lengthy job interview with one such company that wanted to try to make that happen; they are now defunct, but they specialized in peer-to-peer video streaming with a touch of CDN. I remember the CTO telling me a stat he saw from Akamai, which is one of the largest CDNs out there (certainly the most well known, I believe), bragging at one point about having something like 10,000 simultaneous video streams flowing through their system (or maybe it was 50,000 or something).

Put that in perspective: think about the region around you, how many cable/satellite subscribers there are, and how well your local broadband provider could handle unicast streaming of video to so many of them from sites out on the net. Things will come to a grinding halt very quickly.
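A rough sketch of the unicast math, with illustrative numbers I am pulling out of the air:

viewers = 500_000     # assumption: a cable-TV-sized audience in one region
stream_mbps = 4       # assumption: one decent-quality unicast video stream
demand_gbps = viewers * stream_mbps / 1000
print(f"{demand_gbps:,.0f} Gbps of sustained unicast demand")   # 2,000 Gbps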

It certainly is a nice concept to be able to stream video (I love that video; it's the perfect example illustrating the promise of the internet) and other high-bit-rate content (maybe video games), but the fact is it just doesn't scale. It works fine when there are only a few users. We need an order (or orders) of magnitude more bandwidth toward the edge to be able to handle this. Or, in theory at least, high-grade multicast and vast amounts of edge caching. Though multicast is complicated enough that I'm not holding my breath for it being deployed on a wide scale on the internet anytime soon; the best hope might be when everyone is on IPv6, but I'm not sure. On paper it sounds good, but I don't know how well it might work in practice on a massive scale.

So as a result companies are wising up: a small percentage of users are abusing their systems by actually using them for what they are worth. The rest of the users haven't caught on yet. These power users are forcing the edge bandwidth providers to realize that the plans dreamed up by the marketing departments just aren't going to cut it (at least not right now, maybe in the future). So they are doing things like capping data transfers, charging slightly excessive fees, or cutting users off entirely.

The biggest missing piece of the puzzle has been an easy way for end users to know how much bandwidth they are using, so they can control the usage themselves and not blow their monthly cap in 24 hours. It seems that Comcast is working on this now, and AT&T is working on it for their wireless subscribers. That's great news. Provide solid limits for the various tiers of service, and provide an easy way for users to monitor their progress against those limits. I only wish wireless companies did that for their voice plans (how hard can it be for a phone to keep track of your minutes?). That said, I did sign up for Sprint's Simply Unlimited plan so I wouldn't have to worry about minutes myself; it saved a good chunk off my previous 2000-minute plan. Even though I don't use anywhere near what I used to (I seem to average 300-500 minutes/month at the most), I still like the unlimited plan just in case.

Anyways, I suppose it's unfortunate that the users get the shaft in the end. They should have gotten the shaft from the beginning, but I suppose the various network providers wanted to get their foot in the door with the users, get them addicted (or at least try), then jack up the rates later once they realized their original ideas were not possible.

Bandwidth isn't cheap. At low volumes it can cost upwards of $100/Mbit or even more at a data center (where you don't need to be concerned about telco charges or things like local loops). So if you think you're getting the shaft by paying $50/mo for a 10Mbit+ burstable connection, shut up and be thankful you're not paying more than 10x that.
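Do the arithmetic on that, using the $100/Mbit figure above:

transit_per_mbit = 100            # $/Mbit/month at low volume, per above
consumer_mbps, consumer_price = 10, 50
dedicated = consumer_mbps * transit_per_mbit
print(f"{consumer_mbps}Mbit at transit rates: ${dedicated:,}/mo vs "
      f"${consumer_price}/mo retail ({dedicated // consumer_price}x)")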

So no, I'm not holding my breath for wide-scale deployment of video streaming over the internet, or for wireless data plans that simultaneously allow you to download at multi-megabit speeds while providing truly unlimited data at consumer-level pricing. The math just doesn't work.

I'm not bitter or anything; you'd probably be shocked at how little bandwidth I actually use on my own broadband connection. It's a tiny amount, mainly because there isn't a whole lot of stuff on the internet that I find interesting anymore. I was much more excited back in the 90s, but as time has gone on my interest in the internet in general has declined (it probably doesn't help that my job for the past several years has been supporting various companies whose main business was internet-facing).

I suppose the next step beyond basic bandwidth monitoring might be something along the lines of internet roaming, in which you can get a data plan with a very high cap (or unlimited), but only for certain networks (perhaps mainly local ones, to avoid going over the backbones), and pay a different rate for general access to the internet. Myself, I'm very much for net neutrality only where it relates to restricting bandwidth providers from directly charging content companies for access to their users (e.g. Comcast charging Google extra so Comcast users can watch YouTube). They should be charging the users for that access, not the content providers.

(In case you're wondering what inspired this post, it was the AT&T iPhone data plan changes that I linked to above.)

December 2, 2009

Extremely Simple Redundancy Protocol

Filed under: Networking — Tags: , , , — Nate @ 7:31 am

ESRP. That is what I have started calling it, at least. The official designation is Extreme Standby Router Protocol. It's one of, if not the, main reason I prefer Extreme switches at the core of any layer 3 network. I'll try to explain why here, because Extreme really doesn't spend any time promoting this protocol; I'm still pushing them to change that.

I've deployed ESRP at two different companies over the past five years.

What are two basic needs of any modern network?

  1. Layer 2 loop prevention
  2. Layer 3 fault tolerance

Traditionally these are handled by separate protocols that are completely oblivious to one another, mainly some form of STP/RSTP and VRRP (or maybe HSRP if you're crazy). There have also long been interoperability issues among the various implementations of STP, further complicating things because STP often needs to run on every network device for it to work right.

With ESRP life is simpler.

Advantages of ESRP include:

  • Collapsing of layer 2 loop prevention and layer 3 fault tolerance (with IP/MAC takeover) into a single protocol
  • Can run in layer 2 only mode, layer 3 only mode, or combination mode (the default).
  • Sub-second convergence/recovery times.
  • Eliminates the need to run protocols of any sort on downstream network equipment
  • Virtually all downstream devices supported. Does not require an Extreme-only network; fully interoperable with other vendors like Cisco, HP, Foundry, Linksys, Netgear etc.
  • Supports both managed and unmanaged downstream switches
  • Able to override loop prevention on a per-port basis (e.g. hook a firewall or load balancer directly to the core switches, trusting they will handle loop prevention themselves in active/failover mode)
  • The "who is master?" question is determined by setting an ESRP priority level, a number from 0-254, with 255 reserved for the standby state.
  • Set up from scratch in as little as three commands(for each core switch)
  • Protect a new vlan with as little as 1 command (for each core switch)
  • Only one IP address per vlan needed for layer 3 fault tolerance(IP-based management provided by dedicated out of band management port)
  • Supports protecting up to 3000 vlans per ESRP instance
  • Optional “load balancing” by running core switches in active-active mode with some vlans on one, and others on the other.
  • Additional fail over based on tracking of pings, route table entries or vlans.
  • For small to medium sized networks you can use a pair of X450A(48x1GbE) or X650(24x10GbE) switches as your core for a very low priced entry level solution.
  • Mature protocol. I don’t know exactly how old it is, but doing some searches indicates at least 10 years old at this point
  • Can provide significantly higher overall throughput vs ring based protocols(depending on the size of the ring), as every edge switch is directly connected to the core.
  • Nobody else in the industry has a protocol that can do this. If you know of another protocol that combines layer 2 and layer 3 into a single protocol, let me know. For a while I thought Foundry's VSRP was it, but it turns out that is mainly layer 2 only. I swear I read a PDF that talked about limited layer 3 support in VSRP back in the 2004/2005 time frame, but not anymore. I haven't spent the time to determine the use cases between VSRP and Foundry's MRP, which sounds similar to Extreme's EAPS, a layer 2 ring protocol heavily promoted by Extreme.

Downsides to ESRP:

  • Extreme Proprietary protocol. To me this is not a big deal as you only run this protocol at the core. Downstream switches can be any vendor.
  • Perceived complexity due to the wide variety of options; but they are optional, basic configurations should work fine for most people, and it is simple to configure.
  • The default election algorithm includes port weighting, which can be good or bad depending on your point of view. Port weighting means that if you have an equal number of active links of the same speed on each core switch, and the master switch has a link go down, the network will fail over. If you have non-switches connected directly to the core (e.g. a firewall) I will usually disable the port weighting on those specific ports so I can reboot the firewall without causing the core network to fail over. I like port weighting myself, viewing it as the network trying to maintain its highest level of performance/availability. That is, who knows why that port was disconnected: bad cable? bad ASIC? bad port? Fail over to the other switch that has all of its links in a healthy state.
  • Not applicable to all network designs(is anything?)

The optimal network configuration for ESRP is very simple: two core switches cross-connected to each other (with at least two links), and a number of edge switches, each with at least one link to each core switch. You can have as few as three switches in your network, or you can have several hundred (as many as you can connect to your core switches; the max today, I think, is about 760 switches using high-density 1GbE ports on a Black Diamond 8900, reserving 8x1Gbps ports for the cross connect).

ESRP Mesh Network Design

ESRP Domains

ESRP uses a concept of domains to scale itself. A single switch is master of a particular domain which can include any number of vlans up to 3000. Health packets are sent for the domain itself, rather than the individual vlans dramatically simplifying things and making them more scalable simultaneously.

This does mean that if there is a failure in one vlan, all of the vlans for that domain will fail over, not just that one specific vlan. You can configure multiple domains if you want; I configure my networks with one domain per ESRP instance. Multiple domains can come in handy if you want to distribute the load between the core switches. A vlan can be a member of only one ESRP domain (I expect; I haven't tried to verify).

Layer 2 loop prevention

The way ESRP loop prevention works is that the links going to the slave switch are placed in a blocking state, which eliminates the need for downstream protocols and allows you to support even unmanaged switches transparently.

Layer 3 fault tolerance

Layer 3 fault tolerance in ESRP operates in two different modes depending on whether or not the downstream switches are Extreme. By default it assumes they are; you can override this behavior on a per-port basis. In an all-Extreme network ESRP uses EDP [Extreme Discovery Protocol] (similar to Cisco's CDP) to inform downstream switches that the core has failed over and to flush their forwarding entries for the core switch.

If the downstream switches are not Extreme switches and you leave the core switch in the default configuration, it will likely take some time (seconds, minutes) for those switches to expire their forwarding table entries and discover that the network has changed.

Port Restart

If you know you have downstream switches that are not Extreme, I suggest for best availability configuring the core switches to restart the ports those switches are on. Port restart is a feature of ESRP which causes the core switch to reset the links on the ports you configure, to try to force those switches to flush their forwarding tables. This process takes more time than in an Extreme-only network, but in my own tests, specifically with older Cisco layer 2 switches, F5 BigIP v9, and Cisco PIX, it takes less than one second (if you have a ping session going and trigger a failover event, rarely is a ping lost).

Host attached ports

If you are connecting devices like a load balancer, or a firewall directly to the switch, you typically want to hand off loop prevention to those devices, so that the slave core switch will allow traffic to traverse those specific ports regardless of the state of the network. Host attached mode is an ESRP feature that is enabled on a per-port basis.

Integration with ELRP

ESRP does not protect you from every type of loop in the network; by design it's intended to prevent a loop from occurring between the edge switch and the two core switches. If someone plugs an edge switch back into itself, for example, that will still cause a loop.

ESRP integrates with another Extreme specific protocol named ELRP or Extreme Loop Recovery Protocol. Again I know of no other protocol in the industry that is similar, if you do let me know.

What ELRP does is send packets out on the ports you configure and look at the number of responses. If there are more than it expects, it sees that as a loop. There are three modes to ELRP (this is getting a bit off topic but is still related). The simplest mode is one-shot mode, where you have ELRP send its packets once and report. The second mode is periodic mode, where you configure the switch to send packets periodically (I usually use 10 seconds or so), and it will log if any loops are detected (it tells you specifically which ports the loops are originating on).

The third mode is integrated mode, which is how it relates to ESRP. Myself, I don't use integrated mode, and I suggest you don't either, at least if you follow an architecture like mine. What integrated mode does is tell ESRP to fail over if a loop is detected, hoping that the standby switch has no such loop. In my setups the entire network is flat, so if there is a loop detected on one core switch, chances are extremely (no pun intended) high that the same loop exists on the other switch, so there's no point in trying to fail over. But I still configure all of my Extreme switches (both edge and core) with ELRP in periodic mode, so if a loop occurs I can track it down more easily.

Example of an ESRP configuration

We will start with this configuration:

  • A pair of Summit X450A-48T switches as our core
  • 4x1Gbps trunked cross connects between the switches (on ports 1-4)
  • Two downstream switches, each with 2x1Gbps uplinks on ports 5,6 and 7,8 respectively which are trunked as well.
  • One VLAN named “webservers” with a tag of 3500 and an IP address of 10.60.1.1
  • An ESRP domain named esrp-prod

The non ESRP portion of this configuration is:

enable sharing 1 grouping 1-4 address-based L3_L4
enable sharing 5 grouping 5-6 address-based L3_L4
enable sharing 7 grouping 7-8 address-based L3_L4
create vlan webservers
config webservers tag 3500
config webservers ipaddress 10.60.1.1 255.255.255.0
config webservers add ports 1,5,7 tagged

What this configuration does

  • Creates a port sharing group (802.3ad) grouping ports 1-4 into a virtual port 1.
  • Creates a port sharing group (802.3ad) grouping ports 5-6 into a virtual port 5.
  • Creates a port sharing group (802.3ad) grouping ports 7-8 into a virtual port 7.
  • Creates a vlan named webservers
  • Assigns tag 3500 to the vlan webservers
  • Assigns the IP 10.60.1.1 with the netmask 255.255.255.0 to the vlan webservers
  • Adds the virtual ports 1,5,7 in a tagged mode to the vlan webservers

The ESRP portion of this configuration is:

create esrp esrp-prod
config esrp-prod add master webservers
config esrp-prod priority 100
config esrp-prod ports mode 1 host
enable esrp

The only difference between the master and the slave is the priority. From 0-254, higher numbers mean higher priority; 255 is reserved for putting the switch in standby state.

What this configuration does

  • Creates an ESRP domain named esrp-prod.
  • Adds a master vlan to the domain, I believe the master vlan carries the control traffic
  • Configures the switch for a specific priority [optional – I highly recommend doing it]
  • Enables host attach mode for port 1, which is a virtual trunk for ports 1-4. This allows traffic for potentially other host attached ports on the slave switch to traverse to the master to reach other hosts on the network. [optional – I highly recommend doing it]
  • enables ESRP itself (you can use the command show esrp at this point to view the status)

Protecting additional vlans with ESRP

It is a simple one-line command on each core switch. Extending the example above, say you added a vlan named appservers with its associated parameters and wanted to protect it; the command is:

config esrp-prod add member appservers

That’s it.

Gotchas with ESRP

There is only one gotcha I can think of offhand that is specific to ESRP. I believe it is a bug; I reported it a couple of years ago (code rev 11.6.3.3 and earlier, current code rev is 12.3.x) and I don't know if it is fixed yet. If you are using port-restart-configured ports on your switches and you add a vlan to your ESRP domain, those links will get restarted (as expected); what is not expected is that this causes the network to fail over, because for a moment the port weighting kicks in, detects link failure, and forces the switch to a slave state. I think the software could be aware of why the ports are going down and not go to a slave state.

Somewhat related, again with port weightings: if you are connecting a new switch to the network and you happen to connect it to the slave switch first, port weighting will kick in, since the slave switch now has more active ports than the master, and will trigger ESRP to fail over.

The workaround to this, and in general it's a good practice anyway with ESRP, is to put the slave switch in a standby state when you are doing maintenance on it. This will prevent any unintentional network failovers from occurring while you're messing with ports/vlans etc. You can do this by setting the ESRP priority to 255; just remember to put it back to a normal priority after you are done. Even in a standby state, if you have ports that are in host attached mode (again, e.g. firewalls or load balancers), those ports are not impacted by any state changes in ESRP.

Sample Modern Network design with ESRP

Switches:

  • 2 x Extreme Networks Summit X650-24t with 10GbaseT for the core
  • 22 x Extreme Networks Summit X450A-48T, each with an XGM2-2xn expansion module which provides 2x10GbaseT uplinks, providing 1,056 ports of highest-performance edge connectivity (optionally select the X450e for lower, or the X350 for lowest-cost edge connectivity; feel free to mix and match, all of them use the same 10GbaseT uplink module).

Cross connect the X650 switches to each other using 2x10GbE links with CAT6A UTP cable. Connect each of the edge switches to each of the core switches with CAT5e/CAT6/CAT6a UTP cable. Since we are working at 10Gbps speeds there is no link aggregation/trunking needed at the edge (there is still aggregation used between the core switches), simplifying configuration even further.

Is a thousand ports not enough? Break out the 512Gbps stacking for the X650 and add another pair of X650s; your configuration changes to include:

  • Two stacks of 2 x Extreme Networks X650-24t switches in stacked mode with a 512Gbps interconnect (exceeding many chassis switch backplanes in performance).
  • 46 x 48-port edge switches providing 2,208 ports of edge connectivity.

Two thousand ports not enough, really? You can go further, though the stacking interconnect performance drops in half; add another pair of X650s and your configuration changes to include:

  • Two stacks of 3 x Extreme Networks X650-24t switches in stacked mode with a 256Gbps interconnect (still exceeding many chassis switch backplanes in performance).
  • 70 x 48-port edge switches providing 3,360 ports of edge connectivity.
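The port math behind all three designs follows one pattern: each edge switch burns one 10GbE port on each core stack, and the cores keep two 10GbE ports for their cross connect. A quick Python sketch:

def design(x650s_per_stack, cross_links=2, ports_per_edge_switch=48):
    core_ports = x650s_per_stack * 24         # X650-24t: 24x10GbE each
    edge_switches = core_ports - cross_links  # one uplink per edge switch
    return edge_switches, edge_switches * ports_per_edge_switch

for n in (1, 2, 3):
    switches, ports = design(n)
    print(f"{n} x X650 per stack: {switches} edge switches, {ports:,} ports")
# -> 22 / 1,056, then 46 / 2,208, then 70 / 3,360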

The maximum number of switches in an X650 stack is eight. My personal preference with this sort of setup is not to go beyond three. There's only so much horsepower to do all of the routing and related work, and when you're talking about more than three thousand ports, I just feel more comfortable if you move to a bigger switch beyond that.

Take a look at the Black Diamond 8900 series switch modules for the 8800 series chassis. It is a more traditional chassis-based core switch. The 8900 series modules are new, providing high-density 10GbE and even high-density 1GbE (96 ports per slot). It does not support 10GbaseT at the moment, but I'm sure that support isn't far off. It does offer a 24-port 10GbE line card with SFP+ ports (there is an SFP+ variant of the X650 as well). I believe the 512Gbps stacking between a pair of X650s is faster than the backplane interconnect on the Black Diamond 8900, which is between 80-128Gbps per slot depending on the size of the chassis (this performance is expected to double in 2010). While the backplane is not as fast, the CPUs are much faster and there is a lot more memory for routing/management tasks than is available on the X650.

The upgrade process for going from an X650-based stack to a Black Diamond-based infrastructure is fairly straightforward. They run the same operating system and use the same configuration files. You can take down your slave ESRP switch, copy the configuration to the Black Diamond, re-establish all of the links, and then repeat the process with the master ESRP switch. You can do all this with approximately one second of combined downtime.

So I hope, in part with this posting, you can see what draws me to the Extreme portfolio of products. It's not just the hardware or the lower cost, but the unique software components that tie it together. In fact, as far as I know Extreme doesn't even make their own network chipsets anymore. I think the last one was in the Black Diamond 10808 released in 2003, which is a high-end FPGA-based architecture (they call it programmable ASICs; I suspect that means high-end FPGAs, but I'm not certain). They primarily (if not exclusively) use Broadcom chipsets now. They've used Broadcom in their Summit series for many years, but their decision to stop making their own chips is interesting in that it lowers their costs quite a bit. And their software is modular enough to adapt to many configurations (e.g. their Black Diamond 10808 uses dual Pentium III CPUs, while the Summit X450 series uses ARM-based CPUs, I think).

November 24, 2009

Legacy CLI

Filed under: Networking — Tags: , — Nate @ 5:05 pm

One of the bigger barriers to adoption of new equipment often revolves around the user interface. If people have to adapt to something radically different, some of them will naturally resist. In the networking world, and switches in particular, Extreme Networks has been brave enough to go against the grain, toss out the legacy UI and start from scratch (they did this more than a decade ago), while most other companies out there tried to make their systems look/feel like Cisco, for somewhat obvious reasons.

Anyways, I've always thought highly of them for doing that: don't do what everyone else is doing just because they are doing it that way; do it better (if you can). I think they have accomplished that. Their configuration is almost readable in plain English, and the top-level commands are somewhat similar to 3PAR's in some respects:

  • create
  • delete
  • configure
  • unconfigure
  • enable
  • disable

Want to add a vlan? create vlan. Want to configure that vlan? configure vlan (or config vlan for short, or config <vlan name> for shorter). Want to turn on sFlow? enable sflow. You get the idea. There are of course many other commands, but the bulk of your work is spent with these. You can actually log in to an Extreme XOS-based switch that is on the internet; instructions are here. It seems to be a terminal server and you connect on the serial port, since you can do things like reboot the switch and wipe out the configuration without losing connectivity. If you want a more advanced online lab they have them, but they are not freely accessible.

Anyways, back on topic: legacy CLI. I first heard rumors of this about five years ago when I was looking at getting (and eventually did get) a pair of Black Diamond 10808 switches, which at the time was the first and only switch that ran Extremeware XOS. Something interesting I learned recently, which I had no idea was the case, is that Extremeware XOS is entirely XML-based. I knew the configuration file was XML-based, but they take it even further than that: commands issued on the CLI are translated into XML objects and submitted to the system transparently. Which I thought was pretty cool.

About three years ago I asked them about it again, and the legacy CLI project had been shelved, they said, due to lack of customer interest. But now it's back, and it's available.

Now really back on topic. The reason for this legacy CLI is so that people who are used to the 30+-year-old broken UI that others like Cisco use can have something similar on Extreme if they really want it. At the least it should smooth out a migration to the more modern UI and the concepts associated with Extremeware XOS (and Extremeware before it), an operating system that was built from the ground up with layer 3 services in mind (and the UI experience shows it). XOS was also built from the ground up (first released to production in December 2003) to support IPv6. I'm not a fan of IPv6 myself, but that's another blog entry.

It's not complete yet; right now it's limited to most of the layer 2 functions of the switch, and layer 3 stuff is not implemented at this point. I don't know if it will be implemented; I suppose that depends on customer feedback. But anyways, if you have a hard time adjusting to a more modern world, this is available for use. The user guide is here.

If you are like me and like reading technical docs, I highly recommend the Extremeware XOS Concepts Guide. There's so much cool stuff in there I don't know where to begin, and it's organized so well! They really did an outstanding job on their docs.

November 17, 2009

Affordable 10GbE has arrived

Filed under: Networking — Tags: , — Nate @ 6:00 pm

10 Gigabit Ethernet has been around for many years. For much of that time it has been (and with most vendors still is) restricted to more expensive chassis switches, and for most of these switches the 10GbE port density available is quite low as well, often maxing out at fewer than 10 ports per slot.

Within the past year Extreme Networks launched their X650 series of 1U switches, which currently consists of 3 models:

  • 24-port 10GbE SFP+
  • 24-port 10GbaseT first generation
  • 24-port 10GbaseT second generation (added a link to the press release; I didn't even know they announced the product yesterday, it's been available for a little while at least)

For those that aren't into networking too much, 10GbaseT is an Ethernet standard that provides 10 Gigabit speeds over standard CAT5e/CAT6/CAT6a cable.

All three of them are line rate and fully layer 3 capable, and they even have high-speed stacking (ranging from 40Gbps to 512Gbps depending on configuration). Really, nobody else in the industry has this combination at the moment, at least among:

  • Brocade (Foundry Networks) – Layer 2 only (L3 coming at some point via software update), no stacking, no 10GbaseT
  • Force10 Networks – Layer 2 only, no stacking, no 10GbaseT
  • Juniper Networks – Layer 2 only, no stacking, no 10GbaseT. An interesting tidbit here is that the Juniper 1U 10GbE switch is an OEM'd product, does not run their JunOS operating system, and will never have layer 3 support. They will have a proper 10GbE switch at some point, I'm sure, but they don't at the moment.
  • Arista Networks – Partial layer 3 (more coming in a software update at some point), no stacking; they do have 10GbaseT and offer a 48-port version of the switch.
  • Brocade 8000 – Layer 2 only, no stacking, no 10GbaseT (This is a FCoE switch but you can run 10GbE on it as well)
  • Cisco Nexus 5000 – Layer 2 only, no stacking, no 10GbaseT (This is a FCoE switch but you can run 10GbE on it as well)
  • Fulcrum Micro Monte Carlo – I had not heard of these guys until 30 seconds ago; found them just now. I'm not sure if this is a real product; it says reference design. I think you can get it, but it seems targeted at OEMs rather than end users. Perhaps this is what Juniper OEMs for their stuff (the Fulcrum Monaco looks the same as the Juniper switch). Anyways, they do have 10GbaseT, no mention of layer 3 that I can find beyond basic IP routing, and no stacking. Probably not something you want to use in your data center directly, given its reference design intentions.

The biggest complaints against 10GbaseT have been that it was late to market (the first switches appeared fairly recently), and that it is more power hungry. Fortunately for it, the adoption rate of 10GbE has been pretty lackluster over the past few years, with few deployments outside of really high-end networks, because the cost was too prohibitive.

As for the power usage, the earlier 10GbaseT switches did use more power, because it usually requires more power to drive signal over copper vs fiber. But the second-generation X650-24T from Extreme has lowered the power requirements by ~30% (a reduction of 200W per switch), making it draw less power than the SFP+ version of the product! All models have an expansion slot on the rear for stacking and additional 10GbE ports. For example, if you wanted all copper ports on the front but needed a few optical, you could get an expansion module for the back that provides 8x10GbE SFP+ ports. Standard, it comes with a module that has 4x1GbE SFP ports and 40Gbps stacking ports.

So what does it really cost? I poked around some sites trying to find some of the "better" fully layer 3 1U switches out there from various vendors to show how cost effective 10GbE can be; at least on a per-gigabit basis it is cheaper than 1GbE is today. This is street pricing, not list pricing, and not "back room" discount pricing. YMMV.

Vendor                      Model              Front Ports  Bandwidth, Front     Priced   Street     Cost per  Support
                                                            Ports (Full Duplex)  From     Price      Gigabit   Costs?
----------------------------------------------------------------------------------------------------------------------
Extreme Networks            X650-24t           24 x 10GbE   480 Gbps             CDW      $19,755 *  $41.16    Yes
Force10 Networks            S50N               48 x 1GbE     96 Gbps             Insight  $5,078     $52.90    Yes
Extreme Networks            X450e-48p          48 x 1GbE     96 Gbps             Dell     $5,479     $57.07    Optional
Extreme Networks            X450a-48t          48 x 1GbE     96 Gbps             Dell     $6,210     $64.69    Yes
Juniper Networks            EX4200             48 x 1GbE     96 Gbps             CDW      $8,323     $86.69    Yes
Brocade (Foundry Networks)  NetIron CES 2048C  48 x 1GbE     96 Gbps             Pending  Pending    Pending   Yes
Cisco Systems               3750E-48TD         48 x 1GbE     96 Gbps             CDW      $13,500    $140.63   Yes

* The Extreme X650 switch does not include a power supply by default (it has two internal power supply bays for AC or DC PSUs), so the price includes the cost of a single AC power supply.
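For what it's worth, the cost-per-gigabit column works out to street price divided by full-duplex front-panel bandwidth (ports x speed x 2), which a couple of lines of Python confirm:

def cost_per_gigabit(street_price, ports, gbps_per_port):
    return street_price / (ports * gbps_per_port * 2)   # x2 for full duplex

print(f"X650-24t:   ${cost_per_gigabit(19_755, 24, 10):.2f}")   # $41.16
print(f"3750E-48TD: ${cost_per_gigabit(13_500, 48, 1):.2f}")    # $140.63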

HP VirtualConnect for Dummies

Filed under: Networking,Storage,Virtualization — Tags: , , , — Nate @ 5:27 pm

Don't know what VirtualConnect is? Check this e-book out, available to the first 2,500 people that register. I just browsed over it myself and it seems pretty good.

I am looking forward to using the technology sometime next year (trying to wait for the 12-core Opterons before getting another blade system). It certainly looks really nice on paper, and the price is quite good as well compared to the competition. It was first introduced in 2006, I believe, so it's fairly mature technology.

