TechOpsGuys.com Diggin' technology every day

March 11, 2010

Panasas NFS performance posted

Filed under: Storage — Nate @ 5:48 pm

I've heard of Panasas on occasion, and recently I saw a story or a link to them, so I decided to poke around and see what they do. I like technology.

Anyways, I was shocked to see their system design. I've seen systems like Isilon, Xiotech and Pillar that embed controllers in each of their storage shelves, which is an interesting concept for boosting performance, though given the added complexity in each shelf I imagine it can boost the cost by quite a bit too, I don't know.

But Panasas has taken it to an even further extreme, putting a disk controller in for every two disks in the system! I mean I'm sure it's great for maximum performance, but wow, it just seems like such massive overkill (which can be good for certain apps I'm sure). I was/am still shocked 🙂

So today I was poking around at the latest SPEC SFS results for NFS again, and saw they finally posted some numbers.

Fairly impressive numbers, but I just can't get past the number of CPUs they are using. They posted 77,137 IOPS with 160 disks hosting NAS data (80 SATA and 80 SSD). They used a total of 110 Intel CPUs (80 1.5GHz Celerons and 30 1.8GHz Pentium Ms) and 440 gigabytes of RAM cache.

By contrast, Avere, which I posted about recently (never used their stuff, never talked to them before), posted 131,591 IOPS with 72 disks hosting NAS data (48 15k SAS, 24 SATA), 14 Intel CPUs (2.5GHz quad core, so 56 cores) and 423 gigabytes of RAM cache. This is on a 6-node cluster. This Avere configuration is not using SSD (they have released an SSD version since these results were posted).

The bar certainly is being raised by these players implementing massive caches. NetApp showed off some pretty impressive numbers as well with their PAM last year, with more than 500GB of cache (PAM is a read cache only), though again not nearly as effective as Avere since they came in at 60,507 IOPS with 56 15k RPM disks.
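
To put the CPU and cache angle in perspective, here's a rough back-of-the-envelope sketch in Python using only the figures quoted above (nothing measured or re-run by me):

    # Rough efficiency math from the SPEC SFS numbers quoted above.
    systems = {
        # name: (IOPS, Intel CPUs, GB of RAM cache)
        "Panasas": (77137, 110, 440),
        "Avere":   (131591, 14, 423),
    }

    for name, (iops, cpus, cache_gb) in systems.items():
        print(f"{name:8s} {iops / cpus:8,.0f} IOPS per CPU   "
              f"{iops / cache_gb:6,.0f} IOPS per GB of cache")

Roughly 700 IOPS per CPU for Panasas versus over 9,000 for Avere, which is the gap that surprised me.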

March 10, 2010

Save 50% off vSphere Essentials for the next 90 days

Filed under: Virtualization — Nate @ 3:00 pm

Came across this today, which mentions you can save about 50% when licensing vSphere Essentials for the next ~90 days. As you may know, Essentials is a really cheap way to get your vSphere hosts managed by vCenter. For an average 16-blade system of dual-socket blades, as an example, it is 91% cheaper (a savings of ~$26,000) than going with vSphere Standard edition. Note that the vCenter included with Essentials needs to be thrown away if you're managing more than three hosts with it; you'll still need to buy vCenter Standard (regardless of which version of vSphere you buy).

March 9, 2010

The Atomic Unit of Compute

Filed under: Virtualization — Nate @ 5:16 pm

I found this pretty fascinating; as someone who has been talking to several providers, I think it raises some pretty good points.

[..]Another of the challenges you’ll face along the way of Cloud is that of how to measure exactly what it is you are offering. But having a look at what the industry is doing won’t give you much help… as with so many things in IT, there is no standard. Amazon have their EC2 unit, and state that it is roughly the equivalent of 1.0-1.2GHz of a 2007 Opteron or Xeon CPU. With Azure, Microsoft haven’t gone down the same path – their indicative pricing/sizing shows a base compute unit of 1.6GHz with no indication as to what is underneath. Rackspace flip the whole thing on it’s head by deciding that memory is the primary resource constraint, therefore they’ll just charge for that and presumably give you as much CPU as you want (but with no indication as to the characteristics of the underlying CPU). Which way should you go? IMHO, none of the above.[..]

We need to have a standard unit of compute that applies to virtual _and_ physical, new hardware and old, irrespective of AMD or Intel (or even SPARC or Power). And of course, it's not all just about GHz because all GHz are most definitely not equal and yes it _does_ matter to applications. And let's not forget the power needed to deliver those GHz.

In talking with Terremark, it seems their model is built around VMware resource pools, where they allocate you a set amount of GHz for your account. They have a mixture of Intel dual-socket systems and AMD quad-socket systems, and if you run a lot of multi-vCPU VMs you have a higher likelihood of ending up in the AMD pool vs the Intel one. I have been testing their vCloud Express product for my own personal needs (1 vCPU, 1.5GB RAM, 50GB HD), and noticed that my VM is on one of the AMD quad-socket systems.

Yawn..

Filed under: Networking — Nate @ 9:42 am

I was just watching some of my daily morning dose of CNBC and they had all these headlines about how Cisco was going to make some earth-shattering announcement ("Change the internet forever"), and then the announcement hit: a new CRS router that claimed 12x the performance of the competition. So naturally I was curious. Bob Pisani on the floor of the NYSE was saying how amazing it was that the router could download the Library of Congress in 1 second (he probably didn't understand the router would have no place to put it).

If I want a high end router that means I’m a service provider and in that case my personal preference would be for Foundry Networks (now Brocade). Juniper makes good stuff too of course though honestly I am not nearly as versed in their technology. Granted I’ll probably never work for such a company as those companies are really big and I prefer small companies.

But in any case I wanted to illustrate (another) point. According to Cisco, their new fastest single-chassis system has a mere 4.48 terabits of switching capacity. This is the CRS-3, which I don't even see listed as a product on their site; perhaps it's yet to come. The biggest, baddest product they have on their site right now is a 16-slot CRS-1. This, according to their own site, has a total switching capacity of a paltry 1.2Tbps, and even worse, a per-slot capacity of 40Gbps (hello 2003).

So take a look at Foundry Networks (the Brocade name makes me shudder, I have never liked them) and their NetIron XMR series. From their documentation, the "total switching fabric" ranges from 960 gigabits on the low end to 7.68 terabits on the high end. Switch forwarding capacity ranges from 400 gigabits to 3.2 terabits. This comes out to 120 gigabits of full-duplex switch fabric per slot (same across all models). While I haven't been able to determine precisely how long the XMR has been on the market, I have found evidence that it is nearly 3 years old.

To put it in another perspective, in a 48U rack with the new CRS-3 you can get 4.48 terabits of switching fabric (one chassis is 48U). With Foundry in the same rack you can get one XMR32k and one XMR16k (combined size 47U) for a total of 11.52 terabits of switching fabric. More than double the fabric in the same space, from a product that is 3 years old. And as you can imagine, in the world of IT 3 years is a fairly significant amount of time.
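
Here's the per-rack math written out as a quick sketch (the fabric numbers are the vendor-quoted figures above; I'm assuming the XMR16k has half the slots of the XMR32k at the same 120Gbps full duplex per slot, which is consistent with the 11.52 terabit total):

    # Per-rack switching fabric arithmetic from the vendor-quoted figures above.
    crs3_tbps = 4.48                 # one CRS-3 chassis, 48U
    xmr32k_tbps = 7.68               # NetIron XMR high end (32 slots)
    xmr16k_tbps = xmr32k_tbps / 2    # assumed: half the slots, same per-slot rate

    foundry_rack_tbps = xmr32k_tbps + xmr16k_tbps   # ~47U combined
    print(f"CRS-3 rack:   {crs3_tbps:.2f} Tbps")
    print(f"Foundry rack: {foundry_rack_tbps:.2f} Tbps "
          f"({foundry_rack_tbps / crs3_tbps:.1f}x the CRS-3)")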

And while I'm here talking about Foundry and Brocade, take a look at this from Brocade; it's funny, it's like something I would write. It compares the Brocade director switches vs Cisco ("Numbers don't lie"). One of my favorite quotes:

To ensure accuracy, Brocade hired an independent electrician to test both the Brocade 48000 and the Cisco MDS 9513 and found that the 120 port Cisco configuration actually draws 1347 watts, 45% higher than Cisco’s claim of 931 watts. In fact, an empty 9513 draws more electrical current (5.6 amps) than a fully-populated 384 port Brocade 48000 (5.2 amps). Below is Brocade’s test data. Where are Cisco’s verified results?

Another:

With 33% more bandwidth per slot (64Gb vs 48Gb), three times as much overall bandwidth (1.5Tb vs 0.5 Tb) and a third the power draw, the Brocade 48000 is a more scalable building block, regardless of the scale, functionality or lifetime of the fabric. Holistically or not, Brocade can match the “advanced functionality” that Cisco claims, all while using far less power and for a much [?? I think whoever wrote it was in a hurry]

That’s just too funny.

March 4, 2010

Dell/Denali Servers/Storage luncheon March 25th

Filed under: Events,Storage — Nate @ 11:06 am

Been a while since I posted an event, but if you're looking for new servers/storage for your Exchange setup, this event may be a good excuse to get away from work for a while.

Choose the Right Storage Solution for your Microsoft Exchange Environment
Thursday March 25th, 2010
11:30am – 1:30pm
El Gaucho
City Center Plaza
450 108th Ave NE
Bellevue, WA 98004

Join us for a complimentary technical seminar and learn how the Dell EqualLogic PS Series storage solution and Microsoft Exchange, deployed on Dell PowerEdge servers can deliver[..]

Myself, I don't expect to learn anything, and 3PAR storage can run Exchange for a large number of users (from these numbers you could extrapolate a max of 192,000 mailboxes on a single storage system, each with a heavy I/O profile), so I'm not really in the market for some EqualLogic storage. BUT I like to get away, especially if it's local. I do find it curious that the event is specifically about Exchange, that is, the mindset of dedicating storage to a particular application, when the industry trend seems to be leaning towards storage that is shared amongst many applications. Given that Microsoft doesn't appear to be an event sponsor, I find the framing doubly curious.

Thought this was interesting as well: Microsoft recommends RAID 1 for Exchange, but (from one of the links above)..

Internal tests performed by 3PAR show that using RAID 5 (7+1)—i.e., seven data blocks per parity block—demonstrated that the same simulated Exchange workload used for Exchange 2007 ESRP testing had disk latencies that were higher than RAID 1 but well within Microsoft’s recommendations[..]

Going from RAID 1+0 to RAID 5+0 (7+1) is a pretty dramatic shift, showing how fast their “Fast” RAID is, and of course if you find out you laid data out incorrectly you can fix it on the fly. I wonder what Dell will say about their stuff.
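
For a sense of why that shift is dramatic, here's a trivial sketch of the capacity side of the trade-off (general RAID math, not anything 3PAR-specific):

    # Usable capacity fraction for the two layouts discussed above.
    def usable_fraction(data_disks, protection_disks):
        return data_disks / (data_disks + protection_disks)

    raid10 = usable_fraction(1, 1)       # RAID 1+0: every block mirrored
    raid5_7p1 = usable_fraction(7, 1)    # RAID 5 (7+1): one parity per seven data blocks

    print(f"RAID 1+0 usable capacity:     {raid10:.1%}")
    print(f"RAID 5 (7+1) usable capacity: {raid5_7p1:.1%}")

In other words, 7+1 gets you from 50% usable capacity to 87.5%, provided the latency stays within acceptable limits, which is what the quote above is getting at.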

March 2, 2010

Avere front ending Isilon

Filed under: Storage — Nate @ 1:21 pm

UPDATED

How do all these cool people find our blog? A friendly fellow from Isilon commented that the article from The Register apparently isn't accurate, in that Avere is front-ending NetApp gear, not Isilon. But in any case I have been thinking about Avere and the Symantec stuff off and on recently anyways. END UPDATE

There's a really interesting article over at The Register about how Sony has deployed Avere clusters to front end their Isilon (and perhaps other) gear. A good quote:

The thing that grabs your attention here is that Avere is being used to accelerate some of the best scale-out NAS on the planet, not bog standard filers with limited scalability.

Avere certainly has some good performance metrics (pay attention to the IOPS per physical disk), and more recently they introduced a model that runs on top of SSD. I haven't seen any performance results for it yet, but I'm sure it's a significant boost. As The Register mentions in their article, if this technology really is good enough for this purpose, it has the potential (of course) to be extremely disruptive in the industry, wreaking havoc with many of the remaining (and very quickly dwindling) smaller scale-out NAS vendors. Kind of funny, really, seeing how Isilon spun the news.

From Avere's site, in talking about comparing SPEC SFS results:

A comparison of these results and the number of disks required shows that Avere used dramatically fewer disks. BlueArc used 292 disks to achieve 146,076 ops/sec with 3.34 ms ORT. Exanet used 592 disks to achieve 119,550 ops/sec with 2.07ms ORT (overall response time). HP used 584 disks to achieve 134,689 ops/sec and 2.53 ms ORT. Huawei Symantec used 960 disks to achieve 176,728 ops/sec with 1.67ms ORT. NetApp used 324 disks to achieve 120,011 ops/sec with 1.95ms ORT. By contrast, Avere used only 79 drives to achieve 131,591 ops/sec with 1.38ms ORT. Doing a little math, Avere achieves 3.3, 8.2, 7.2, 9.0, and 4.5 times more ops/sec per disk used than the other vendors.
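
Those ratios are easy to sanity-check; here's a quick sketch using only the numbers in the quote above:

    # Sanity check of the "ops/sec per disk" ratios from Avere's comparison above.
    avere_ops_per_disk = 131591 / 79
    others = {
        "BlueArc":         146076 / 292,
        "Exanet":          119550 / 592,
        "HP":              134689 / 584,
        "Huawei Symantec": 176728 / 960,
        "NetApp":          120011 / 324,
    }

    for vendor, ops_per_disk in others.items():
        print(f"{vendor:16s} {ops_per_disk:6.0f} ops/sec per disk   "
              f"Avere advantage: {avere_ops_per_disk / ops_per_disk:.1f}x")

That works out to the 3.3x through 9.0x figures they claim, with Avere at roughly 1,666 ops/sec per disk.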

Which got me thinking again: Symantec released a FileStore product last year, and my friends over at 3PAR were asking me if I was interested in it. To date I have not been, because the only performance numbers released so far have not been very efficient. And it's still a new product, so who knows how well it works in the real world, though Symantec does have a long history with file systems through the Veritas File System (VxFS).

Unfortunately there isn't much technical info on the FileStore product on their web site.

Built to run on commodity servers and most storage arrays, FileStore is an incredibly simple-to-install soft appliance. This combination of low-cost hardware, “pay as you grow” scalability and easy administration give FileStore a significant cost advantage over specialized appliances. With support for both SAN and iSCSI storage, FileStore delivers the performance needed for the most demanding applications.

It claims N-way active-active or active-passive clustering, up to 16 nodes in a cluster, up to 2PB of storage and 200 million files per file system, which for most people is more than enough. I don't know how it is licensed, though, or how well it scales on a single node. Could it run on the aforementioned 48-all-round system?

Where does 3PAR fit into this? Well, Symantec was the first company (so far the only one that I know of) to integrate Thin Reclamation into their file system, and it integrates really well with 3PAR arrays at least. The file system uses some sort of SCSI command that is passed back to the array when files are deleted/reclaimed, so the I/O never hits the spindles; the array transparently re-maps the blocks to be available for use.

3PAR Thin Reclamation for Veritas Storage Foundation keeps storage volumes thin over time by allowing granular, automated, non-disruptive space reclamation within the InServ array. This is accomplished by communicating deleted block information to the InServ using the Thin Reclamation API. Upon receiving this information, the InServ autonomically frees this allocated but unused storage space. The thin reclamation capabilities provide environments using Veritas Storage Foundation by Symantec an easy way to keep their thin volumes thin over time, especially in situations where a large number of writes and deletes occur.

But I was thinking that you could front end one of these FileStore clusters with an Avere cluster and get some pretty flexible, high-performing storage.

Something I’d like myself to explore at some point.

March 1, 2010

The future of networking in hypervisors – not so bright

Filed under: Networking,Virtualization — Nate @ 10:15 pm

UPDATED Some networking companies see that they are losing control of the data center network when it comes to blades and virtualization. One has reacted by making their own blades, others have come up with strategies and collaborated on standards to try to take back the network by moving the traffic back into the switching gear. Yet another has licensed their OS to have another company make blade switches on their behalf.

Where at least part of the industry wants to go is to move the local switching out of the hypervisor and back into the Ethernet switches. Now this makes sense for the industry, because they are losing their grip on the network when it comes to virtualization. But in my opinion this is going backwards. Several years ago we had big chassis switches with centralized switch fabrics where (I believe, kind of going out on a limb here) if port 1 on blade 1 wanted to talk to port 2 on the same blade, the traffic had to go back to the centralized fabric before port 2 would see it. That's a lot of distance to travel. Fast forward a few years and now almost every vendor is advertising local switching, which eliminates this trip and makes things faster and more scalable.

Another similar evolution in switching design was the move from backplane systems to midplane systems. I only learned some of the specifics recently; prior to that I really had no idea what the difference was between a backplane and a midplane. But apparently the idea behind a midplane is to drive significantly higher throughput through the system by putting the switching fabric closer to the line cards. An inch here, an inch there could mean hundreds of gigabits of lost throughput, or increased complexity/line noise, in order to achieve those high throughput numbers. But again, the idea is moving the fabric closer to what needs it, in order to increase performance. You can see examples of midplane systems in blades with the HP c7000 chassis, or in switches with the Extreme Black Diamond 20808 (page 7). Both of them have things that plug into both the front and the back. I thought that was mainly due to space constraints on the front, but it turns out it seems to be more about minimizing the distance between the fabric on the back and the thing using the fabric on the front. Also note that the fabric modules on the rear are horizontal while the blades on the front are vertical; I think this allows the modules to further reduce the physical distance between the fabric and the device at the other end by directly covering more slots, so there is less distance to travel on the midplane.

As for moving the switching out of the hypervisor: if VM #1 wants to talk to VM #2, having that traffic go outside of the server, make a U-turn, and come right back in is stupid. Really stupid. It's the industry grasping at straws trying to maintain control when they should be innovating. It goes against the two evolutions in switching design I outlined above.

What I've been wanting to see myself is to integrate the switch into the server: have a 10GbE chip that has the switching fabric built into it. Most modern network operating systems are pretty modular and portable (a lot of them seem to be based on Linux or BSD). I say integrate it onto the blade for best performance, and maybe use the distributed switch framework (or come up with some other, more platform-independent way to improve management). The situation will only get worse in coming years; with VM servers potentially having hundreds of cores and TBs of memory at their disposal, you're practically to the point now where you can fit an entire rack of traditional servers onto one hypervisor.

I know that, for example, Extreme uses Broadcom in almost all of their systems, and Broadcom is what most server manufacturers use for their network adapters; even HP's Flex10 seems to be based on Broadcom. How hard can it be for Broadcom to make such a chip(set) so that companies like Extreme (or whoever else might use Broadcom in their switches) could program it with their own stuff to make it a mini switch?

From the Broadcom press release above (2008):

To date, Broadcom is the only silicon vendor with all of the networking components (controller, switch and physical layer devices) necessary to build a complete end-to-end 10GbE data center. This complete portfolio of 10GbE network infrastructure solutions enables OEM partners to enhance their next generation servers and data centers.

Maybe what I want makes too much sense and that’s why it’s not happening, or maybe I’m just crazy.

UPDATE – I just wanted to clarify my position here: what I'm looking for is essentially to offload the layer 2 switching functionality from the hypervisor to a chip on the server itself, whether that's a special 10GbE adapter that has switching fabric or a dedicated add-on card which only has the switching fabric. I'm not interested in offloading layer 3 stuff; that can be handled upstream. I am also interested in integrating things like ACLs, sFlow, QoS, rate limiting and perhaps port mirroring.

ProCurve: Not my favorite

Filed under: Networking,Virtualization — Nate @ 10:06 pm

I gotta find something new to talk about, after this one.

I was thinking this evening about the UCS/HP network shootout post I wrote over the weekend, and maybe I came across too strong in favor of HP's networking gear.

As all three of you know, HP is not my favorite networking vendor. Not even my second favorite, or even my third.

But they do have some cool technology with this Virtualconnect stuff. I only wish blade interfaces were more standardized.

February 28, 2010

VMware dream machine

Filed under: Networking,Storage,Virtualization — Nate @ 12:47 am

(Originally titled "forty eight all round", but I like "VMware dream machine" more)

UPDATED I was thinking more about the upcoming 12-core Opterons and the next generation of HP c Class blades, and thought of a pretty cool configuration to have; hopefully it becomes available.

Imagine a full-height blade that is quad socket, 48 cores (91-115GHz), 48 DIMMs (192GB with 4GB sticks), 4x10Gbps Ethernet links and 2x4Gbps Fibre Channel links (a total of 48Gbps of full-duplex bandwidth). The new Opterons support 12 DIMMs per socket, allowing the 48 DIMM slots.

Why 4x10Gbps links? Well, I was thinking why not. With full-height blades you can only fit 8 blades in a c7000 chassis. If you put a pair of 2x10Gbps switches in, that gives you 16 ports. It's not much more $$ to double up on 10Gbps ports, especially if you're talking about spending upwards of say $20k on the blade (guesstimate) and another $9-15k on vSphere software per blade. And 4x10Gbps links gives you up to 16 virtual NICs per blade using VirtualConnect, each of them adjustable in 100Mbps increments.

Also given the fact that it is a full height blade, you have access to two slots worth of I/O, which translates into 320Gbps of full duplex fabric available to a single blade.

That kind of blade ought to handle just about anything you can throw at it. It's practically a supercomputer in and of itself. Right now HP holds the top spot for VMmark scores, with an 8-socket, 6-core system (48 total cores) outpacing even a 16-socket, 4-core system (64 total cores).

The 48 CPU cores will give the hypervisor an amazing number of combinations for scheduling vCPUs. Here’s a slide from a presentation I was at last year which illustrates the concept behind the hypervisor scheduling single and multi vCPU VMs:

There is a PDF out there from VMware that talks about the math behind it all, and it has some interesting commentary on CPU scheduling with hypervisors:

[..]Extending this principle, ESX Server installations with a greater number of physical CPUs offer a greater chance of servicing competing workloads optimally. The chance that the scheduler can find room for a particular workload without much reshuffling of virtual machines will always be better when the scheduler has more CPUs across which it can search for idle time.

This is even cooler, though honestly I can't pretend to understand the math myself:

Scheduling a two-VCPU machine on a two-way physical ESX Server hosts provides only one possible allocation for scheduling the virtual machine. The number of possible scheduling opportunities for a two-VCPU machine on a four-way or eight-way physical ESX Server host is described by combinatorial mathematics using the formula N! / (R!(N-R)!) where N=the number of physical CPUs on the ESX Server host and R=the number of VCPUs on the machine being scheduled.1 A two-VCPU virtual machine running on a four-way ESX Server host provides (4! / (2! (4-2)!) which is (4*3*2 / (2*2)) or 6 scheduling possibilities. For those unfamiliar with combinatory mathematics, X! is calculated as X(X-1)(X-2)(X-3)…. (X- (X-1)). For example 5! = 5*4*3*2*1.

Using these calculations, a two-VCPU virtual machine on an eight-way ESX Server host has (8! / (2! (8-2)!) which is (40320 / (2*720)) or 28 scheduling possibilities. This is more than four times the possibilities a four-way ESX Server host can provide. Four-vCPU machines demonstrate this principle even more forcefully. A four-vCPU machine scheduled on a four-way physical ESX Server host provides only one possibility to the scheduler whereas a four-VCPU virtual machine on an eight-CPU ESX Server host will yield (8! / (4!(8-4)!) or 70 scheduling possibilities, but running a four-vCPU machine on a sixteen-way ESX Server host will yield (16! / (4!(16-4)!) which is (20922789888000 / ( 24*479001600) or 1820 scheduling possibilities. That means that the scheduler has 1820 unique ways in which it can place the four-vCPU workload on the ESX Server host. Doubling the physical CPU count from eight to sixteen results in 26 times the scheduling flexibility for the four-way virtual machines. Running a four-way virtual machine on a Host with four times the number of physical processors (16-way ESX Server host) provides over six times more flexibility than we saw with running a two-way VM on a Host with four times the number of physical processors (8-way ESX Server host).

Anyone want to try to extrapolate that and extend it to a 48-core system? 🙂
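
Happy to oblige: here's a quick sketch of the C(N, R) = N! / (R!(N-R)!) formula from the quote, extended out to a 48-core host. (Treating each of the 48 cores as a schedulable physical CPU is my assumption; the paper was written in terms of physical CPUs.)

    # Scheduling possibilities per the formula quoted above:
    # C(N, R) = N! / (R! * (N - R)!), N = physical CPUs/cores, R = vCPUs in the VM.
    from math import comb  # math.comb needs Python 3.8+

    for vcpus in (2, 4):
        for cores in (4, 8, 16, 48):
            print(f"{vcpus}-vCPU VM on {cores:2d} cores: "
                  f"{comb(cores, vcpus):,} scheduling possibilities")
        print()

A four-vCPU VM on a 48-core host gives the scheduler 194,580 placement options, versus the 1,820 on a 16-way host from the quote above.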

It seems like only yesterday that I was building DL380 G5 ESX 3.5 systems with 8 CPU cores and 32GB of RAM, with 8x1Gbps links, thinking how powerful they were. This would be six of those in a single blade. And it only seems like a couple of weeks ago that I was building VMware GSX systems on dual-socket, single-core boxes with 16GB of RAM.

So, HP, do me a favor and make a G7 blade that can do this; that would make my day! I know fitting all of those components on a single full-height blade won't be easy. Looking at the existing BL685c blade, it looks like they could do it: remove the internal disks (who needs 'em, boot from SAN or something), and put in an extra 16 DIMMs for a total of 48.

I thought about using 8Gbps fiber channel but then it wouldn’t be 48 all round 🙂

UPDATE Again, I was thinking about this and wanted to compare the costs vs existing technology. I'm estimating roughly a $32,000 price tag for this kind of blade with vSphere Advanced licensing (note you cannot use Enterprise licensing on a 12-core CPU; hardware pricing extrapolated from the existing HP BL685c G6 quad-socket, 6-core blade with 128GB of RAM). The approximate price of an 8-way, 48-core HP DL785 with 192GB, 4x10GbE and 2x4Gb Fibre Channel with vSphere licensing comes to roughly $70,000 (because VMware charges on a per-socket basis, the licensing costs go up fast). Not only that, but you can only fit 6 of these DL785 servers in a 42U rack, while you can fit 32 of these blades in the same rack with room to spare. So less than half the cost, and 5 times the density (for the same configuration). The DL785 has an edge in memory slot capacity, which isn't surprising given its massive size; it can fit 64 DIMMs vs 48 on my VMware dream machine blade.

Compare that to a trio of HP BL495c blades, each with 12 cores and 64GB of memory: approximate pricing for those plus vSphere Advanced is $31,000 for a total of 36 cores and 192GB of memory. So for $1,000 more you can add an extra 12 cores, cut your server count by 66%, probably cut your power usage by some amount, and improve consolidation ratios.
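
Pulling those rough numbers together (all of them my own guesstimates from above, not vendor quotes), the per-core and per-rack picture looks something like this:

    # Rack-level roll-up of the rough estimates above.
    options = {
        # name: (estimated price per 48-core/192GB configuration, units per 42U rack)
        "Dream machine blade": (32000, 32),  # 4 x c7000 chassis, 8 full-height blades each
        "DL785 (8-socket)":    (70000, 6),   # ~7U per server
    }

    for name, (price, per_rack) in options.items():
        print(f"{name:20s} ${price / 48:>6,.0f} per core   "
              f"{48 * per_rack:5,} cores per 42U rack")

That works out to roughly $667 per core and 1,536 cores per rack for the blade, versus roughly $1,458 per core and 288 cores per rack for the DL785.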

So to summarize, two big reasons for this type of solution are:

  • More efficient consolidation on a per-host basis by having less “stranded” resources
  • More efficient consolidation on a per-cluster basis because you can get more capacity in the 32-node limit of a VMware cluster (assuming you want to build a cluster that big). Again addressing the "stranded capacity" issue. Imagine what a resource pool could do with 3.3 THz of compute capacity and 9.2TB of memory? All with line rate 40Gbps networking throughout? All within a single cabinet?

Pretty amazing stuff to me anyways.

[For reference – Enterprise Plus licensing would add an extra $1250/socket plus more in support fees. VMware support costs not included in above pricing.]

END UPDATE

February 27, 2010

Cisco UCS Networking falls short

Filed under: Networking,Virtualization — Nate @ 4:33 am

UPDATED Yesterday when I woke up I had an email from Tolly in my inbox describing a new report comparing the networking performance of the Cisco UCS vs the HP c Class blade systems. Both readers of the blog know I haven't been a fan of Cisco for a long time (about 10 years, since I first started learning about the alternatives), and I'm a big fan of HP c Class (again, never used it, but planning on it). So as you can imagine I couldn't resist reading what it said, considering the amount of hype that Cisco has managed to generate for their new systems (the sheer number of blog posts about it makes me feel sick at times).

I learned a couple of things from the report that I did not know about UCS before (I oftentimes just write their solutions off since they have a track record of underperformance, overpricing and needless complexity).

The first was that the switching fabric is external to the enclosure, so if two blades want to talk to each other that traffic must leave the chassis in order to do so, an interesting concept which can have significant performance and cost implications.

The second is that the current UCS design is 50% oversubscribed, which is what this report targets as a significant weakness of the UCS vs the HP c Class.

The midplane design of the c7000 chassis is something that HP is pretty proud of (for good reason): it is capable of 160Gbps full duplex to every slot, totaling more than 5 terabits of fabric. When I talked to them last year they couldn't help but take shots at IBM's blade system, commenting on how it is oversubscribed and how you have to be careful in how you configure the system because of that oversubscription.

This c7000 fabric is far faster than most high-end chassis Ethernet switches, and should allow fairly transparent migration to 40Gbps Ethernet when the standard arrives, for those that need it. In fact HP already has 40Gbps InfiniBand modules available for c Class.

The test involved six blades from each solution. When testing the throughput of four blades, both solutions performed similarly (UCS was 0.76Gbit faster). Add two more blades and start jacking up the bandwidth requirements, and HP c Class scales linearly as the traffic goes up, while UCS seems to scale linearly in the opposite direction. The end result is that with 60Gbit of traffic being requested (6 blades @ 10Gbps), HP c Class managed to push 53.65Gbps, while Cisco UCS managed to cough up a mere 27.37Gbps. On UCS, pushing six blades at max performance actually resulted in less performance than four blades at max performance, significantly less, illustrating serious weaknesses in the QoS on the system (again, big surprise!).

The report mentions putting the Cisco UCS in a special QoS mode for the test because without this mode performance was even worse. There is only 80Gbps of fabric available for use on the UCS (4x10Gbps full duplex). You can get a second fabric module for UCS but it cannot be used for active traffic, only as a backup.

UPDATE – A kind fellow over at Cisco took notice of our little blog here (thanks!!) and wanted to correct what they say is a bad test on the part of Tolly; apparently Tolly didn't realize that the fabrics could be used active-active (maybe that complexity thing rearing its head, I don't know). But in the end I believe the test results are still valid, just at an incorrect scale. Each blade requires 20Gbps of full-duplex fabric in order to be non-blocking throughout. The Cisco UCS chassis provides 80Gbps of full-duplex fabric, allowing 4 blades to be non-blocking. HP, by contrast, allows up to three dual-port Flex10 adapters per half-height server, which requires 120Gbps of full-duplex fabric to support at line rate. Given that each slot supports 160Gbps of fabric you could get another adapter in there, but I suspect there isn't enough real estate on the blade to connect it! I'm sure 120Gbps of Ethernet on a single half-height blade is way overkill, but if it doesn't radically increase the cost of the system, as a techie myself I do like the fact that the capacity is there to grow into.

Things get a little more complicated when you start talking about non-blocking internal fabric (between blades) and the rest of the network, since HP designs their switches to support 16 blades, and Cisco designs their fabric modules to support 8. You can see from the picture of the Flex10 switch that there are 8 uplink ports on it, not 16, but it's pretty obvious that is due to space constraints, because the switch is half width. END UPDATE

The point I am trying to make here isn't so much that HP's architecture is superior to Cisco's, or that HP is faster than Cisco. It's that HP is not oversubscribed and Cisco is. In a world where we have had non-blocking switch fabrics for nearly 15 years, it is disgraceful that a vendor would ship a solution where six servers cannot talk to each other without being blocked. I have operated 48-port gigabit switches which have 256 gigabits of switching fabric; that is more than enough for 48 systems to talk to each other in a non-blocking way. There are 10Gbps switches that have 500-800 gigabits of switching fabric, allowing 32-48 systems to talk to each other in a non-blocking way. These aren't exactly expensive solutions either. That's not even considering the higher-end backplane- and midplane-based systems that run into multiple terabits of switching fabric, connecting hundreds of systems at line rate.

I would expect such a poor design to come from a second-tier vendor, not a vendor that has a history of making blade switches for several manufacturers over several years.

So take the worst case: what if you want completely non-blocking fabric to each and every system? For me, I am looking at HP c Class and 10Gbps Virtual Connect mainly for intra-chassis communication within the vSphere environment. In this situation, with a cheap configuration on HP, you are oversubscribed 2:1 when talking outside of the chassis. For most situations this is probably fine, but say that wasn't good enough for you. Well, you can fix it by installing two more 10Gbps switches in the chassis (each switch has 8x10GbE uplinks). That will give you 32x10Gbps uplink ports, enough for 16 blades each having 2x10Gbps connections. All line rate, non-blocking throughout the system. That is 320 gigabits vs the 80 gigabits available on Cisco UCS.

HP doesn't stop there: with 4x10Gbps switches you've only used up half of the available I/O slots on the c7000 enclosure, so can we say 640 gigabits of total non-blocking Ethernet throughput vs 80 gigabits on UCS (single chassis for both)? I mean, for those fans of running vSphere over NFS, you could install vSphere on a USB stick or SD card and dedicate the rest of the I/O slots to networking if you really need that much throughput.
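
Here's the uplink math from the last two paragraphs written out (blade, port and uplink counts are the ones from the post; the 2-module case is the "cheap" configuration):

    # Uplink arithmetic from the paragraphs above: 16 half-height blades with
    # 2 x 10GbE each, and Flex10 modules with 8 x 10GbE uplinks apiece.
    blade_demand_gbps = 16 * 2 * 10       # 320 Gbps of blade-facing bandwidth
    uplink_gbps_per_module = 8 * 10       # 80 Gbps of uplinks per Flex10 module

    for modules in (2, 4):
        uplinks = modules * uplink_gbps_per_module
        print(f"{modules} Flex10 modules: {uplinks} Gbps of uplinks -> "
              f"{blade_demand_gbps / uplinks:.0f}:1 oversubscription leaving the chassis")

    # Filling all 8 I/O slots with Flex10 modules:
    print(f"8 Flex10 modules: {8 * uplink_gbps_per_module} Gbps of uplinks, "
          f"vs 80 Gbps of fabric on a UCS chassis (per the report above)")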

Of course this costs more than being oversubscribed, the point is the customer can make this decision based on their own requirements, rather than having the limitation be designed into the system.

Now think about this limitation in a larger-scale environment. Think about the vBlock again, from that new EMC/Cisco/VMware alliance. Set aside the fact that it's horribly overpriced (I think mostly due to EMC's side); this system is designed to be used by large-scale service providers. That means unpredictable loads from unrelated customers running on a shared environment. Toss in vMotion and DRS and you could be asking for trouble when it comes to this oversubscription stuff; DRS (as far as I know) bases its decisions entirely on CPU and memory usage. At some point I think it will take storage I/O into account as well. I haven't heard of it taking network congestion into account, though in theory it's possible. But it's much better to just have a non-blocking fabric to begin with; it will increase your utilization and efficiency, and allow you to sleep better at night.

Makes me wonder how Data Center Ethernet (or whatever it's called this week) holds up under these congestion conditions that the UCS suffers from. Lots of "smart" people spent a lot of time making Ethernet lossless, only to design the hardware so that it incurs significant loss in transit. In my experience systems don't behave in a predictable manner when storage is highly constrained.

I find it kind of ironic that a blade solution from the world's largest networking company would be so crippled when it comes to the networking of the system. Again, not a big surprise to me, but there are a lot of Cisco kids out there who drink the Kool-Aid without thinking twice, and of course I couldn't resist ragging on Cisco again.

I won't bother to mention the recent 10Gbps Cisco Nexus test results that show how easily you can cripple its performance as well (while other manufacturers perform properly at non-blocking line rates); maybe I'll save that for another blog entry.

Just think: there is more throughput available to a single slot in an HP c7000 chassis than there is available to the entire chassis on a UCS. Even if you give Cisco the benefit of the second fabric module, the HP c7000 enclosure has 32 times the throughput capacity of the Cisco UCS. That kind of performance gap even makes Cisco's switches look bad by comparison.

