TechOpsGuys.com Diggin' technology every day

June 21, 2010

HP BL685c G7 Launched – Opteron 6100

Filed under: News, Virtualization; posted by Nate @ 10:22 am

I guess my VMware dream machine will remain a dream for now. HP launched their next generation G7 Opteron 6100 blades today, and while they are still very compelling systems, the increase in die size that came with the 6100 (not surprising) was enough to remove the ability to have 4 CPU sockets AND 48 memory slots on one full height blade.

Still, it makes a very good comparison illustrating the elimination of the 4P tax, that is, the premium associated with quad socket servers. If you configure a BL485c G7 with 2×12-core CPUs and 128GB of memory (about $16,000) vs a BL685c G7 with 4×12-core CPUs and 256GB of memory (about $32,000), the cost per socket is about the same: twice the hardware for twice the price, no premium.

By contrast, configuring a BL685c G6 with six core CPUs (i.e. half the number of cores of the G7), with the same memory, same networking and same fiber channel, costs roughly $52,000.

These have new Flex Fabric 2 NICs, which according to the specs page seem to include iSCSI or FCoE support (I assume some sort of software licensing is needed to unlock the added functionality, though I can’t find evidence of it). Here is a white paper on the Flex Fabric stuff; from what I gather it’s just an evolutionary step of Virtual Connect. I of course have never had any real interest in FCoE (search the archives for details), but it’s nice I suppose that HP is giving the option to those that do want to jump on that wagon.

February 28, 2010

VMware dream machine

Filed under: Networking, Storage, Virtualization; posted by Nate @ 12:47 am

(Originally titled forty eight all round, I like VMware dream machine more)

UPDATED I was thinking more about the upcoming 12-core Opterons and the next generation of HP c Class blades, and thought of a pretty cool configuration to have, hopefully it becomes available.

Imagine a full height blade that is quad socket, 48 cores (91-115GHz in aggregate), 48 DIMMs (192GB with 4GB sticks), 4x10Gbps Ethernet links and 2x4Gbps fiber channel links (total of 48Gbps of full duplex bandwidth). The new Opterons support 12 DIMMs per socket, allowing the 48 DIMM slots.
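For anyone who wants to check the "48 all round" numbers, here is a quick sketch of the arithmetic. The 1.9-2.4GHz clock range is my assumption, picked because it reproduces the 91-115GHz aggregate figure; the names are just for illustration.

```python
# Back-of-the-envelope totals for the hypothetical 48-core blade.
SOCKETS = 4
CORES_PER_SOCKET = 12
DIMMS_PER_SOCKET = 12
DIMM_SIZE_GB = 4
# Assumption: per-core clock range that yields the 91-115GHz aggregate.
CLOCK_RANGE_GHZ = (1.9, 2.4)
ETHERNET_LINKS_GBPS = [10, 10, 10, 10]
FC_LINKS_GBPS = [4, 4]


def blade_totals():
    """Return the headline totals for one blade as a dict."""
    cores = SOCKETS * CORES_PER_SOCKET
    dimms = SOCKETS * DIMMS_PER_SOCKET
    return {
        "cores": cores,
        "aggregate_ghz": tuple(round(cores * c, 1) for c in CLOCK_RANGE_GHZ),
        "dimms": dimms,
        "memory_gb": dimms * DIMM_SIZE_GB,
        "io_gbps": sum(ETHERNET_LINKS_GBPS) + sum(FC_LINKS_GBPS),
    }


if __name__ == "__main__":
    for name, value in blade_totals().items():
        print(f"{name}: {value}")
```

Cores, DIMMs and I/O bandwidth all land on 48, which is where the original title came from.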

Why 4x10Gbps links? Well, I was thinking why not.. With full height blades you can only fit 8 blades in a c7000 chassis. If you put a pair of 2x10Gbps switches in that gives you 16 ports. It’s not much more $$ to double up on 10Gbps ports, especially if you’re talking about spending upwards of say $20k on the blade (guesstimate) and another $9-15k on vSphere software per blade. And 4x10Gbps links gives you up to 16 virtual NICs per blade using VirtualConnect, each of them adjustable in 100Mbps increments.

Also given the fact that it is a full height blade, you have access to two slots worth of I/O, which translates into 320Gbps of full duplex fabric available to a single blade.

That kind of blade ought to handle just about anything you can throw at it. It’s practically a super computer in and of itself. Right now HP holds the top spot for VMmark scores, with an 8 socket 6 core system (48 total cores) outpacing even a 16 socket 4 core system (64 total cores).

The 48 CPU cores will give the hypervisor an amazing number of combinations for scheduling vCPUs. Here’s a slide from a presentation I was at last year which illustrates the concept behind the hypervisor scheduling single and multi vCPU VMs:

There is a PDF out there from VMware that talks about the math formulas behind it all, it has some interesting commentary on CPU scheduling with hypervisors:

[..]Extending this principle, ESX Server installations with a greater number of physical CPUs offer a greater chance of servicing competing workloads optimally. The chance that the scheduler can find room for a particular workload without much reshuffling of virtual machines will always be better when the scheduler has more CPUs across which it can search for idle time.

This is even cooler though, honestly I can’t pretend to understand the math myself!

Scheduling a two-VCPU machine on a two-way physical ESX Server host provides only one possible allocation for scheduling the virtual machine. The number of possible scheduling opportunities for a two-VCPU machine on a four-way or eight-way physical ESX Server host is described by combinatorial mathematics using the formula N! / (R!(N-R)!) where N=the number of physical CPUs on the ESX Server host and R=the number of VCPUs on the machine being scheduled. A two-VCPU virtual machine running on a four-way ESX Server host provides (4! / (2!(4-2)!)) which is (4*3*2 / (2*2)) or 6 scheduling possibilities. For those unfamiliar with combinatory mathematics, X! is calculated as X(X-1)(X-2)(X-3)…. (X- (X-1)). For example 5! = 5*4*3*2*1.

Using these calculations, a two-VCPU virtual machine on an eight-way ESX Server host has (8! / (2!(8-2)!)) which is (40320 / (2*720)) or 28 scheduling possibilities. This is more than four times the possibilities a four-way ESX Server host can provide. Four-vCPU machines demonstrate this principle even more forcefully. A four-vCPU machine scheduled on a four-way physical ESX Server host provides only one possibility to the scheduler whereas a four-VCPU virtual machine on an eight-CPU ESX Server host will yield (8! / (4!(8-4)!)) or 70 scheduling possibilities, but running a four-vCPU machine on a sixteen-way ESX Server host will yield (16! / (4!(16-4)!)) which is (20922789888000 / (24*479001600)) or 1820 scheduling possibilities. That means that the scheduler has 1820 unique ways in which it can place the four-vCPU workload on the ESX Server host. Doubling the physical CPU count from eight to sixteen results in 26 times the scheduling flexibility for the four-way virtual machines. Running a four-way virtual machine on a Host with four times the number of physical processors (16-way ESX Server host) provides over six times more flexibility than we saw with running a two-way VM on a Host with four times the number of physical processors (8-way ESX Server host).

Anyone want to try to extrapolate that and extend it to a 48-core system? 🙂
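Taking a stab at it myself: the formula in the quote is just "N choose R", which Python's standard library exposes as `math.comb` (Python 3.8+). The sketch below only applies that formula; treating each of the 48 cores as an independently schedulable CPU is my assumption, and a real scheduler has other constraints (NUMA, cache sharing and so on) that this ignores. The function name is made up for illustration.

```python
from math import comb


def scheduling_possibilities(physical_cpus, vcpus):
    """N! / (R!(N-R)!): ways to pick `vcpus` distinct CPUs out of `physical_cpus`."""
    return comb(physical_cpus, vcpus)


if __name__ == "__main__":
    for n in (4, 8, 16, 48):
        for r in (2, 4):
            print(f"{r}-vCPU VM on {n} CPUs: {scheduling_possibilities(n, r):,}")
```

By that formula a two-vCPU VM on 48 cores has 1,128 possible placements and a four-vCPU VM has 194,580, a bit over 100 times what the sixteen-way host in the quote offers.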

It seems like only yesterday that I was building DL380G5 ESX 3.5 systems with 8 CPU cores and 32GB of RAM, with 8x1Gbps links, thinking of how powerful they were. This would be six of those in a single blade. And it only seems like a couple weeks ago I was building VMware GSX systems on dual socket single core servers with 16GB of RAM..

So, HP, do me a favor and make a G7 blade that can do this, that would make my day! I know fitting all of those components on a single full height blade won’t be easy. Looking at the existing BL685c blade, it looks like they could do it: remove the internal disks (who needs em, boot from SAN or something), and put in an extra 16 DIMMs for a total of 48.

I thought about using 8Gbps fiber channel but then it wouldn’t be 48 all round 🙂

UPDATE Again I was thinking about this and wanted to compare the costs vs existing technology. I’m estimating roughly a $32,000 price tag for this kind of blade with vSphere Advanced licensing (note you cannot use Enterprise licensing on a 12-core CPU; hardware pricing is extrapolated from the existing HP BL685G6 quad socket 6 core blade system with 128GB RAM). The approximate price of an 8-way 48-core HP DL785 with 192GB, 4x10GbE and 2x4Gb Fiber with vSphere licensing comes to roughly $70,000 (because VMware charges on a per socket basis the licensing costs go up fast). Not only that, but you can only fit 6 of these DL785 servers in a 42U rack, and you can fit 32 of these blades in the same rack with room to spare. So less than half the cost, and 5 times the density (for the same configuration). The DL785 has an edge in memory slot capacity, which isn’t surprising given its massive size: it can fit 64 DIMMs vs 48 on my VMware dream machine blade.

Compare that to a trio of HP BL495c blades, each with 12 cores and 64GB of memory: approximate pricing for that plus vSphere Advanced is $31,000 for a total of 36 cores and 192GB of memory. So for $1,000 more you can add an extra 12 cores, cut your server count by 66%, probably cut your power usage by some amount and improve consolidation ratios.
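To put the three options side by side, here is a small sketch that works out cost per core and rack density from the figures above. The prices are my rough estimates from this post, not quotes, and the dictionary keys are just labels I made up.

```python
# Rough figures from this post (estimates, not vendor quotes).
OPTIONS = {
    "dream blade": {"price": 32_000, "cores": 48, "memory_gb": 192, "servers": 1},
    "DL785": {"price": 70_000, "cores": 48, "memory_gb": 192, "servers": 1},
    "BL495c trio": {"price": 31_000, "cores": 36, "memory_gb": 192, "servers": 3},
}

# Servers that fit in one 42U rack, per the post.
PER_RACK = {"dream blade": 32, "DL785": 6}


def cost_per_core(name):
    """Dollars per CPU core for the named option."""
    option = OPTIONS[name]
    return option["price"] / option["cores"]


def density_ratio():
    """How many dream blades fit per rack for every DL785."""
    return PER_RACK["dream blade"] / PER_RACK["DL785"]


if __name__ == "__main__":
    for name in OPTIONS:
        print(f"{name}: ${cost_per_core(name):,.0f} per core")
    print(f"rack density vs DL785: {density_ratio():.1f}x")
```

That works out to roughly $667 per core for the blade, $1,458 for the DL785 and $861 for the BL495c trio, with a little over 5 blades per rack for every DL785.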

So to summarize, two big reasons for this type of solution are:

  • More efficient consolidation on a per-host basis by having fewer “stranded” resources
  • More efficient consolidation on a per-cluster basis because you can get more capacity in the 32-node limit of a VMware cluster (assuming you want to build a cluster that big..). Again addressing the “stranded capacity” issue. Imagine what a resource pool could do with 3.3THz of compute capacity and 9.2TB of memory? All with line rate 40Gbps networking throughout? All within a single cabinet?

Pretty amazing stuff to me anyways.

[For reference – Enterprise Plus licensing would add an extra $1250/socket plus more in support fees. VMware support costs not included in above pricing.]

END UPDATE

February 27, 2010

Cisco UCS Networking falls short

Filed under: Networking, Virtualization; posted by Nate @ 4:33 am

UPDATED Yesterday when I woke up I had an email from Tolly in my inbox describing a new report comparing the networking performance of the Cisco UCS vs the HP c Class blade systems. Both readers of the blog know I haven’t been a fan of Cisco for a long time (about 10 years, since I first started learning about the alternatives), and I’m a big fan of HP c Class (again, never used it, but planning on it). So as you can imagine I couldn’t resist reading what it said, considering the amount of hype that Cisco has managed to generate for their new systems (the sheer number of blog posts about it makes me feel sick at times).

I learned a couple things from the report that I did not know about UCS before (I often just write their solutions off since they have a track record of underperformance, overpricing and needless complexity).

The first was that the switching fabric is external to the enclosure, so if two blades want to talk to each other that traffic must leave the chassis in order to do so, an interesting concept which can have significant performance and cost implications.

The second is that the current UCS design is 50% oversubscribed, which is what this report targets as a significant weakness of the UCS vs the HP c Class.

The mid plane design of the c7000 chassis is something that HP is pretty proud of (for good reason). It is capable of 160Gbps full duplex to every slot, totaling more than 5 Terabits of fabric. When I talked to them last year they couldn’t help but take shots at IBM’s blade system, commenting on how it is oversubscribed and how you have to be careful in how you configure the system because of that oversubscription.

This c7000 fabric is far faster than most high end chassis Ethernet switches, and should allow fairly transparent migration to 40Gbps ethernet when the standard arrives for those that need it. In fact HP already has 40Gbps Infiniband modules available for c Class.

The test involved six blades from each solution. When testing throughput of four blades both solutions performed similarly (UCS was 0.76Gbit faster). Add two more blades and start jacking up the bandwidth requirements: HP c Class scales linearly as the traffic goes up, while UCS seems to scale linearly in the opposite direction. The end result is that with 60Gbit of traffic being requested (6 blades @ 10Gbps), HP c Class managed to choke out 53.65Gbps, and Cisco UCS managed to cough up a mere 27.37Gbps. On UCS, pushing six blades at max performance actually resulted in less performance than four blades at max performance, significantly less, illustrating serious weaknesses in the QoS on the system (again, big surprise!).
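To make those six-blade numbers easier to compare, here is a tiny sketch that turns them into a percentage of the requested load. It only uses the throughput figures quoted from the Tolly report above; the function name is mine.

```python
# Six blades each requesting 10Gbps, per the Tolly test described above.
REQUESTED_GBPS = 6 * 10

# Delivered aggregate throughput reported for each system.
DELIVERED_GBPS = {"HP c Class": 53.65, "Cisco UCS": 27.37}


def efficiency(delivered_gbps, requested_gbps=REQUESTED_GBPS):
    """Fraction of the requested bandwidth that was actually delivered."""
    return delivered_gbps / requested_gbps


if __name__ == "__main__":
    for system, delivered in DELIVERED_GBPS.items():
        print(f"{system}: {efficiency(delivered):.1%} of requested traffic")
```

That puts HP at about 89% of the requested traffic and UCS at about 46%.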

The report mentions putting Cisco UCS in a special QoS mode for the test because without this mode performance was even worse. There is only 80Gbps of fabric available for use on the UCS(4x10Gbps full duplex). You can get a second fabric module for UCS but it cannot be used for active traffic, only as a backup.

UPDATE: A kind fellow over at Cisco took notice of our little blog here (thanks!!) and wanted to correct what they say is a bad test on the part of Tolly. Apparently Tolly didn’t realize that the fabrics could be used in active-active (maybe that complexity thing rearing its head, I don’t know). But in the end I believe the test results are still valid, just at an incorrect scale. Each blade requires 20Gbps of full duplex fabric in order to be non blocking throughout. The Cisco UCS chassis provides for 80Gbps of full duplex fabric, allowing 4 blades to be non blocking. HP by contrast allows up to three dual port Flex10 adapters per half height server, which requires 120Gbps of full duplex fabric to support at line rate. Given each slot supports 160Gbps of fabric, you could get another adapter in there but I suspect there isn’t enough real estate on the blade to connect the adapter! I’m sure 120Gbps of ethernet on a single half height blade is way overkill, but if it doesn’t radically increase the cost of the system, as a techie myself I do like the fact that the capacity is there to grow into.

Things get a little more complicated when you start talking about non blocking internal fabric(between blades) and the rest of the network, since HP designs their switches to support 16 blades, and Cisco designs their fabric modules to support 8. You can see by the picture of the Flex10 switch that there are 8 uplink ports on it, not 16, but it’s pretty obvious that is due to space constraints because the switch is half width. END UPDATE

The point I am trying to make here isn’t so much the fact that HP’s architecture is superior to that of Cisco’s. It’s not that HP is faster than Cisco. It’s the fact that HP is not oversubscribed and Cisco is. In a world where we have had non blocking switch fabrics for nearly 15 years it is disgraceful that a vendor would have a solution where six servers cannot talk to each other without being blocked. I have operated 48-port gigabit switches which have 256 gigabits of switching fabric, which is more than enough for 48 systems to talk to each other in a non blocking way. There are 10Gbps switches that have 500-800 gigabits of switching fabric, allowing 32-48 systems to talk to each other in a non blocking way. These aren’t exactly expensive solutions either. That’s not even considering the higher end backplane and midplane based systems that run into the multiple terabits of switching fabric connecting hundreds of systems at line rates.

I would expect such a poor design to come from a second tier vendor, not a vendor that has a history of making networking gear for blade switches for several manufacturers for several years.

So say we take the worst case: what if you want completely non blocking fabric from each and every system? For me, I am looking to HP c Class and 10Gbps Virtual Connect mainly for intra chassis communication within the vSphere environment. In this situation, with a cheap configuration on HP, you are oversubscribed 2:1 when talking outside of the chassis. For most situations this is probably fine, but say that wasn’t good enough for you. Well, you can fix it by installing two more 10Gbps switches on the chassis (each switch has 8x10GbE uplinks). That will give you 32x10Gbps uplink ports, enough for 16 blades each having 2x10Gbps connections. All line rate, non blocking throughout the system. That is 320 Gigabits vs 80 Gigabits available on Cisco UCS.
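The 2:1 figure falls out of simple division, sketched below. I am assuming 16 half height blades with 2x10Gbps each and 8x10GbE uplinks per switch, as described above; the function and parameter names are hypothetical.

```python
def oversubscription(blades, links_per_blade, switches, uplinks_per_switch, link_gbps=10):
    """Ratio of blade-facing bandwidth to uplink bandwidth leaving the chassis.

    1.0 means line rate out of the chassis; anything above 1.0 is oversubscribed.
    """
    downlink_gbps = blades * links_per_blade * link_gbps
    uplink_gbps = switches * uplinks_per_switch * link_gbps
    return downlink_gbps / uplink_gbps


if __name__ == "__main__":
    # Cheap configuration: two switches with 8 uplinks each.
    print("2 switches:", oversubscription(16, 2, switches=2, uplinks_per_switch=8))
    # Two more switches: 32 uplinks for 32 blade-facing ports.
    print("4 switches:", oversubscription(16, 2, switches=4, uplinks_per_switch=8))
```

With two switches the ratio is 2.0, and with four it drops to 1.0, which is the non blocking case.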

HP doesn’t stop there. With 4x10Gbps switches you’ve only used up half of the available I/O slots on the c7000 enclosure, so can we say 640 Gigabits of total non-blocking ethernet throughput vs 80 gigabits on UCS (single chassis for both)? I mean, for those fans of running vSphere over NFS, you could install vSphere on a USB stick or SD card and dedicate the rest of the I/O slots to networking if you really need that much throughput.

Of course this costs more than being oversubscribed, the point is the customer can make this decision based on their own requirements, rather than having the limitation be designed into the system.

Now think about this limitation in a larger scale environment. Think about the vBlock again from that new EMC/Cisco/VMware alliance. Set aside the fact that it’s horribly overpriced (I think mostly due to EMC’s side). This system is designed to be used by large scale service providers. That means unpredictable loads from unrelated customers running on a shared environment. Toss in vMotion and DRS and you could be asking for trouble when it comes to this oversubscription stuff. DRS (as far as I know) bases its decisions entirely on CPU and memory usage. At some point I think it will take storage I/O into account as well. I haven’t heard of it taking into account network congestion, though in theory it’s possible. But it’s much better to just have a non blocking fabric to begin with: you will increase your utilization and efficiency, and sleep better at night.

Makes me wonder how Data Center Ethernet (or whatever it’s called this week) holds up under these congestion conditions that the UCS suffers from. Lots of “smart” people spent a lot of time making Ethernet lossless only to design the hardware so that it will incur significant loss in transit. In my experience systems don’t behave in a predictable manner when storage is highly constrained.

I find it kind of ironic that a blade solution from the world’s largest networking company would be so crippled when it comes to the network of the system. Again, not a big surprise to me, but I see a lot of Cisco kids out there who drink the koolaid without thinking twice, and of course I couldn’t resist ragging on Cisco again.

I won’t bother to mention the recent 10Gbps Cisco Nexus test results that show how easily you can cripple its performance as well (while other manufacturers perform properly at non-blocking line rates); maybe I will save that for another blog entry.

Just think, there is more throughput available to a single slot in an HP c7000 chassis than there is available to the entire chassis on a UCS. If you give Cisco the benefit of the second fabric module, setting aside the fact you can’t use it in active-active, the HP c7000 enclosure has 32 times the throughput capacity of the Cisco UCS. That kind of performance gap even makes Cisco’s switches look bad by comparison.
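For the curious, here is how the 32 times figure can be reproduced from the numbers in this post. The way I count is an assumption: the c7000 total counts both directions of the 160Gbps full duplex per slot across 16 slots (the "more than 5 Terabits" mentioned earlier), and the UCS total is two fabric modules at 80Gbps each.

```python
# Figures from this post; the counting method is an assumption (see above).
C7000_SLOTS = 16
C7000_SLOT_GBPS = 160      # full duplex, per slot
UCS_FABRIC_GBPS = 80       # per fabric module (4x10Gbps full duplex)
UCS_FABRIC_MODULES = 2


def c7000_total_gbps():
    """Total c7000 midplane fabric, counting both directions."""
    return C7000_SLOTS * C7000_SLOT_GBPS * 2


def ucs_total_gbps():
    """Total UCS chassis fabric with both fabric modules counted."""
    return UCS_FABRIC_GBPS * UCS_FABRIC_MODULES


if __name__ == "__main__":
    ratio = c7000_total_gbps() / ucs_total_gbps()
    print(f"c7000: {c7000_total_gbps()}Gbps, UCS: {ucs_total_gbps()}Gbps, ratio: {ratio:.0f}x")
```

That gives 5,120Gbps for the c7000 against 160Gbps for UCS, a ratio of 32.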

November 17, 2009

HP VirtualConnect for Dummies

Filed under: Networking, Storage, Virtualization; posted by Nate @ 5:27 pm

Don’t know what VirtualConnect is? Check this e-book out. Available to the first 2,500 people that register. I just browsed through it myself and it seems pretty good.

I am looking forward to using the technology sometime next year (trying to wait for the 12-core Opterons before getting another blade system). It certainly looks really nice on paper, and the price is quite good as well compared to the competition. I believe it was first introduced in 2006, so it’s fairly mature technology.

November 3, 2009

The new Cisco/EMC/VMware alliance – the vBlock

Filed under: Storage, Virtualization; posted by Nate @ 6:04 pm

Details were released a short time ago, thanks to The Register, on the vBlock systems coming from the new alliance of Cisco and EMC, who dragged along VMware (kicking and screaming, I’m sure). The basic gist of it is to be able to order a vBlock and have it be a completely integrated set of infrastructure ready to go: servers and networking from Cisco, storage from EMC, and hypervisor from VMware.

vBlock0 consists of rack mount servers from Cisco, and unknown EMC storage, price not determined yet

vBlock1 consists of 16-32 blade servers from Cisco and an EMC CX4-480 storage system. Price ranges from $1M – 2.8M

vBlock2 consists of 32-64 blade servers from Cisco and an EMC V-MAX. Starting price $6M.

Sort of like FCoE, sounds nice in concept but the details fall flat on their face.

First off is the lack of choice. Cisco’s blades are based entirely on the Xeon 5500s, which are, you guessed it, limited to two sockets, and at least at the moment limited to four cores. I haven’t seen word yet on whether the upcoming 8-core CPUs are socket/chip set compatible with existing systems or not (if so, wonderful for them..). Myself, I prefer more raw cores, and AMD is the one that has them today (Istanbul with 6 cores, Q1 2010 with 12 cores). But maybe not everyone wants that, so it’s nice to have choice. In my view HP blades win out here for having the broadest selection of offerings from both Intel and AMD. Combine that with their dense memory capacity (16 or 18 DIMM slots on a half height blade) and you get up to 1TB of memory in a blade chassis in an affordable configuration using 4GB DIMMs. Yes, Cisco has their memory extender technology, but again IMO, at least with the dual socket Xeon 5500 that it is linked to, the CPU core:memory density is way outta whack. It may make more sense when we have 16, 24, or even 32 cores on a system using this technology. I’m sure there are niche applications that can take advantage of it on a dual socket/quad core configuration, but the current Xeon 5500 is really holding them back with this technology.

Networking, it’s all FCoE based, I’ve already written a blog entry on that, you can read about my thoughts on FCoE here.

Storage, you can see how even with the V-MAX EMC hasn’t been able to come up with a storage system that can start on the smaller end of the scale, something that is not insanely unaffordable to 90%+ of the organizations out there. So on the more affordable end they offer you a CX4. If you are an organization that is growing you may find yourself outliving this array pretty quickly. You can add another vBlock, or you can rip and replace it with a V-MAX which will scale much better, but of course the entry level pricing for such a system makes it unsuitable for almost everyone to try to start out with even on the low end.

I am biased towards 3PAR of course, as both of the readers of the blog know, so do yourself a favor and check out their F and T series systems. If you really think you want to scale high go for a 2-node T800; the price isn’t that huge, and the only difference between a T400 and a T800 is the backplane. They use “blocks” to some extent, blocks being controllers (in pairs, up to four pairs) and disk chassis (40 disks per chassis, up to 8 per controller pair I think). Certainly you can’t go on forever, or can you? If you don’t imagine you will scale to really massive levels go for a T400 or even an F400. In all cases you can start out with only two controllers; the additional cost to give you the option of an online upgrade to four controllers is really trivial, and offers nice peace of mind. You can even go from a T400 to a T800 if you wanted, you just need to switch out the backplane (downtime involved). The parts are the same! The OS is the same! How much does it cost? Not as much as you would expect. When 3PAR announced their first generation 8-node system 7 years ago, entry level pricing started at $100k. You also get nice things like their thin built in technology, which will allow you to run those eager zeroed VMs for fault tolerance and not consume any disk space or I/O for the zeros. You can also get multi level synchronous/asynchronous replication for a fraction of the cost of others. I could go on all day but you get the idea. There are so many fiber ports on the 3PAR arrays that you don’t need a big SAN infrastructure, just hook your blade enclosures directly to the array.

And as for networking, hook the 10GbE Virtual Connect switches on your c Class enclosures to your existing infrastructure. I am hoping/expecting HP to support 10GbaseT soon, and drop the CX4 passive copper cabling. The Extreme Networks Summit X650 stands alone as the best 1U 10GbE (10GbaseT or SFP+) switch on the market. Whether it is line rate, or full layer 3, or high speed stacking, or lower power consuming 10GbaseT vs fiber optics, or advanced layer 3 networking protocols to simplify management, or price and ease of use, nobody else comes close. If you want bigger check out the Black Diamond 8900 series.

Second, you can see with their designs that after the first block or two the whole idea of a vBlock sort of falls apart. That is, pretty quickly you’re likely to just be adding more blades (especially if you have a V-MAX), rather than adding more storage and more blades.

Third, you get the sense that these aren’t really blocks at all. The first tier is composed of rack mount systems, the second tier is blade systems with CX4, the third tier is blade systems with V-MAX. Each tier has something unique, which hardly makes it a solution you can build as a “block” as you might expect from something called a vBlock. Given the prices here I am honestly shocked that the first tier is using rack mount systems. Blade chassis do not cost much, so I would have expected them to simply use a blade chassis with just one or two blades in it. It really shows that they didn’t spend much time thinking about this.

I suppose if you treated these as blocks in their strictest sense and said yes we won’t add more than 64 blades to a V-MAX, and add it like that you could get true blocks, but I can imagine the amount of waste doing something like that is astronomical.

I didn’t touch on VMware at all. I think their solution is solid, and they offer quite a few choices. I’m certain with this vBlock they will pimp the Enterprise Plus version of the software, but I really don’t see a big advantage to that version with such a small number of physical systems (a good chunk of the reason to go to it is improved management with things like host profiles and distributed switches). As another blogger recently noted, VMware has everything to lose out of this alliance. I’m sure they have been fighting hard to maintain their independence and openness, and this reeks of the opposite; they will have to stay on their toes for a while when dealing with their other partners like HP, IBM, NetApp, and others..
