TechOpsGuys.com Diggin' technology every day

1 Mar 2010

The future of networking in hypervisors – not so bright

UPDATED Some networking companies see that they are losing control of data center networks when it comes to blades and virtualization. One has reacted by making its own blades; others have come up with strategies and collaborated on standards to try to take back the network by moving the traffic back into the switching gear. Yet another has licensed its OS to have another company make blade switches on its behalf.

Where at least part of the industry wants to go is to move the local switching out of the hypervisor and back into the Ethernet switches. Now this makes sense for the industry, because they are losing their grip on the network when it comes to virtualization. But in my opinion this is going backwards. Several years ago we had big chassis switches with centralized switch fabrics where (I believe, kind of going out on a limb here) if port 1 on blade 1 wanted to talk to port 2, the traffic had to go back to the centralized fabric before port 2 would see it. That's a lot of distance to travel. Fast forward a few years and now almost every vendor is advertising local switching, which eliminates that trip and makes things faster and more scalable.
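To make the hop argument concrete, here is a toy model in Python (purely my own illustration; the segment counts are schematic, not from any vendor's spec sheet) of the path a frame takes under each design:

```python
# Toy model: count the path segments a frame crosses between two ports
# on the same line card, under two chassis switch designs. Schematic
# only -- real fabrics have more stages, but the ratio is the point.

def hops_centralized(src_port, dst_port):
    """Centralized fabric: every frame makes the round trip."""
    return ["ingress line card", "central fabric", "egress line card"]

def hops_local(src_port, dst_port, same_card=True):
    """Local switching: same-card traffic never leaves the card."""
    if same_card:
        return ["line card (switched locally)"]
    return ["ingress line card", "fabric", "egress line card"]

print(len(hops_centralized(1, 2)))  # 3 segments
print(len(hops_local(1, 2)))        # 1 segment
```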

Another similar evolution in switching design was moving from backplane systems to midplane systems. I only learned some of the specifics recently; prior to that I really had no idea what the difference was between a backplane and a midplane. But apparently the idea behind a midplane is to drive significantly higher throughput by putting the switching fabric closer to the line cards. An inch here, an inch there could mean hundreds of gigabits of lost throughput, or increased complexity and line noise, in order to achieve those high throughput numbers. But again, the idea is moving the fabric closer to what needs it, in order to increase performance. You can see examples of midplane systems in blades with the HP c7000 chassis, or in switches with the Extreme Black Diamond 20808 (page 7). Both of them have things that plug into both the front and the back. I thought that was mainly due to space constraints on the front, but it turns out it seems to be more about minimizing the distance between the fabric on the back and the thing using the fabric on the front. Also note that the fabric modules on the rear are horizontal while the blades on the front are vertical; I think this allows the modules to further reduce the physical distance between the fabric and the device at the other end by directly covering more slots, so there is less distance to travel on the midplane.

If the switching moves out of the hypervisor, then when VM #1 wants to talk to VM #2 on the same host, that traffic has to go outside of the server, make a U-turn, and come right back into it. That's stupid. Really stupid. It's the industry grasping at straws trying to maintain control when they should be innovating, and it goes against the two evolutions in switching design I outlined above.
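Part of why the U-turn needs standards work at all: a classic bridge never forwards a frame back out the port it arrived on, so two VMs sharing one upstream switch port can't reach each other through that switch unless the port supports a "hairpin" (reflective relay) mode. A toy Python sketch of that rule, my own illustration:

```python
# Sketch of standard bridge behavior: a frame is never sent back out
# its ingress port. Two VMs behind the same switch port therefore need
# a "hairpin" (reflective relay) mode for the U-turn to work at all.

def forward(in_port, dst_port, hairpin=False):
    """Return the egress port, or None if the frame is dropped."""
    if dst_port == in_port and not hairpin:
        return None          # standard bridge: never reflect
    return dst_port

# Two VMs on one hypervisor share the same upstream switch port (say 5):
print(forward(5, 5))                # None -> dropped by a normal bridge
print(forward(5, 5, hairpin=True))  # 5 -> reflected back to the server
```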

What I've been wanting to see myself is the switch integrated into the server: an X GbE chip that has the switching fabric built into it. Most modern network operating systems are pretty modular and portable (a lot of them seem to be based on Linux or BSD). I say integrate it onto the blade for best performance, and maybe use the distributed switch framework (or come up with some other, more platform-independent way to improve management). The situation will only get worse in the coming years; with VM servers potentially having hundreds of cores and TBs of memory at their disposal, you're practically at the point now where you can fit an entire rack of traditional servers onto one hypervisor.

I know that, for example, Extreme uses Broadcom in almost all of their systems, and Broadcom is what most server manufacturers use for their network adapters; even HP's Flex10 seems to be based on Broadcom. How hard can it be for Broadcom to make such a chip(set), so that companies like Extreme (or whoever else might use Broadcom in their switches) could program it with their own stuff to make it a mini switch?
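As a sketch of what such a mini switch would have to do at its core, here is the basic layer 2 learning/forwarding logic in Python (my own simplified illustration of standard bridge behavior, not Broadcom's actual chip logic):

```python
# Minimal sketch of layer-2 learning and forwarding, the core job a
# switch-on-the-server chip would offload: learn source MACs per port,
# forward known unicast out one port, flood unknowns everywhere else.

class L2Switch:
    def __init__(self, num_ports):
        self.ports = range(num_ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port        # learn the sender
        out_port = self.mac_table.get(dst_mac)
        if out_port is not None and out_port != in_port:
            return [out_port]                    # known unicast
        # unknown destination (or broadcast): flood all but ingress
        return [p for p in self.ports if p != in_port]

sw = L2Switch(4)
print(sw.receive(0, "aa:aa", "bb:bb"))  # bb:bb unknown -> flood [1, 2, 3]
print(sw.receive(1, "bb:bb", "aa:aa"))  # aa:aa learned -> [0]
```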

From a Broadcom press release (2008):

To date, Broadcom is the only silicon vendor with all of the networking components (controller, switch and physical layer devices) necessary to build a complete end-to-end 10GbE data center. This complete portfolio of 10GbE network infrastructure solutions enables OEM partners to enhance their next generation servers and data centers.

Maybe what I want makes too much sense and that's why it's not happening, or maybe I'm just crazy.

UPDATE - I just wanted to clarify my position here: what I'm looking for is essentially to offload the layer 2 switching functionality from the hypervisor to a chip on the server itself, whether that's a special 10GbE adapter that has switching fabric or a dedicated add-on card which only has the switching fabric. I'm not interested in offloading layer 3 stuff; that can be handled upstream. I am also interested in integrating things like ACLs, sFlow, QoS, rate limiting and perhaps port mirroring.
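Of those features, sFlow is a good example of why hardware support matters: the heavy lifting is just statistical 1-in-N packet sampling, which an ASIC can do with a countdown counter. A toy Python sketch of the idea (my own illustration of the random-sampling concept described in the sFlow overview, not the actual wire protocol):

```python
import random

# Rough sketch of sFlow-style 1-in-N random packet sampling: a counter
# counts down per packet and, when it hits zero, the packet's header is
# copied to the collector and the counter is re-armed with a new random
# value. The monitoring cost stays tiny no matter the line rate.

def sample_stream(num_packets, rate, seed=42):
    """Count how many of num_packets get sampled at roughly 1-in-rate."""
    rng = random.Random(seed)
    sampled = 0
    # random skip with a mean of `rate` packets between samples
    countdown = rng.randint(1, 2 * rate - 1)
    for _ in range(num_packets):
        countdown -= 1
        if countdown == 0:
            sampled += 1
            countdown = rng.randint(1, 2 * rate - 1)  # re-arm
    return sampled

# A million packets at 1-in-1000 yield roughly a thousand samples,
# regardless of how fast those million packets arrived.
print(sample_stream(1_000_000, 1000))
```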

TechOps Guy: Nate

Comments (5) Trackbacks (3)
  1. Hi guys, love the blog!

    Not sure that incorporating a switch into the server is the answer. Other than the physical distance, an in-server switch is going to look just like an in-chassis switch. And that shorter distance isn’t really going to buy us anything, not in a properly built enterprise class blade chassis where the backplane is completely passive with only traces on the board connecting the “back” of the servers to the switching fabric I/O.

    I’d prefer to keep the switching for VM’s on the same host within the hypervisor itself. This keeps my switching in software, distance is not an issue, and the hop is truly a virtual hop.

    Plus, I’m all for making my server motherboards less complex, not more complex. Many companies are taking this approach in MB design, truly trying to minimize the number of on-board components to reduce potential component failures.

  2. Thanks for the comment! The issue I have with keeping switching in software is that it's not scalable. It's fine for basic things, but if you want more advanced functionality like sFlow (http://www.sflow.org/sFlowOverview.pdf) it really needs to be in hardware, especially if you're talking about 10GbE speeds.

    You could even go beyond sFlow and have CLEAR-Flow (http://www.extremenetworks.com//libraries/whitepapers/WPCLEAR-Flow_1083.pdf)

    Motherboards don't need to be more complex; it could just be another PCIe card, or it could be integrated into the 10GbE chip itself. From talking with a friend at Broadcom, their current 10GbE chip used by OEMs is already very sophisticated. The offloading should be transparent.

    The way the industry appears to be trying to move, though, is that there will be no local switching in the hypervisor at all: if VM #1 wants to talk to VM #2 on the same hypervisor, that traffic will go out the network adapter to a switch, and the switch sends it right back. I forget the standard off hand but they are working on it. The issue is that networking standards prevent a switch from sending traffic in this manner today, otherwise they'd be doing this now.
