Diggin' technology every day


VMware dream machine

TechOps Guy: Nate

(Originally titled forty eight all round, I like VMware dream machine more)

UPDATED I was thinking more about the upcoming 12-core Opterons and the next generation of HP c Class blades, and thought of a pretty cool configuration to have. Hopefully it becomes available.

Imagine a full height blade that is quad socket, 48 cores (91-115GHz), 48 DIMMs (192GB with 4GB sticks), 4x10Gbps Ethernet links and 2x4Gbps Fibre Channel links (total of 48Gbps of full duplex bandwidth). The new Opterons support 12 DIMMs per socket, allowing the 48 DIMM slots.
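Here's a quick sketch of how those totals add up. The per-core clock range of 1.9-2.4GHz is my assumption, worked backwards from the 91-115GHz aggregate figure; everything else comes straight from the configuration above.

```python
# Back-of-the-envelope totals for the hypothetical blade described above.
SOCKETS = 4
CORES_PER_SOCKET = 12
DIMMS_PER_SOCKET = 12
DIMM_SIZE_GB = 4
# Assumption: per-core clocks of 1.9-2.4GHz, inferred from the 91-115GHz aggregate.
CLOCK_RANGE_GHZ = (1.9, 2.4)

cores = SOCKETS * CORES_PER_SOCKET
dimms = SOCKETS * DIMMS_PER_SOCKET
memory_gb = dimms * DIMM_SIZE_GB
aggregate_ghz = tuple(round(cores * clock, 1) for clock in CLOCK_RANGE_GHZ)
io_gbps = 4 * 10 + 2 * 4  # 4x10Gbps Ethernet plus 2x4Gbps Fibre Channel

print(f"{cores} cores, {dimms} DIMMs, {memory_gb}GB RAM")
print(f"{aggregate_ghz[0]}-{aggregate_ghz[1]}GHz aggregate, {io_gbps}Gbps of I/O")
```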

Why 4x10Gbps links? Well I was thinking why not.. with full height blades you can only fit 8 blades in a c7000 chassis. If you put a pair of 2x10Gbps switches in, that gives you 16 ports. It's not much more $$ to double up on 10Gbps ports, especially if you're talking about spending upwards of say $20k on the blade (guesstimate) and another $9-15k on vSphere software per blade. And 4x10Gbps links gives you up to 16 virtual NICs using VirtualConnect per blade, each of them adjustable in 100Mbps increments.

Also given the fact that it is a full height blade, you have access to two slots worth of I/O, which translates into 320Gbps of full duplex fabric available to a single blade.

That kind of blade ought to handle just about anything you can throw at it. It's practically a supercomputer in and of itself. Right now HP holds the top spot for VMmark scores, with an 8 socket 6 core system (48 total cores) outpacing even a 16 socket 4 core system (64 total cores).

The 48 CPU cores will give the hypervisor an amazing number of combinations for scheduling vCPUs. Here's a slide from a presentation I was at last year which illustrates the concept behind the hypervisor scheduling single and multi vCPU VMs:

There is a PDF out there from VMware that talks about the math formulas behind it all, it has some interesting commentary on CPU scheduling with hypervisors:

[..]Extending this principle, ESX Server installations with a greater number of physical CPUs offer a greater chance of servicing competing workloads optimally. The chance that the scheduler can find room for a particular workload without much reshuffling of virtual machines will always be better when the scheduler has more CPUs across which it can search for idle time.

This is even cooler though, honestly I can't pretend to understand the math myself! -

Scheduling a two-VCPU machine on a two-way physical ESX Server hosts provides only one possible allocation for scheduling the virtual machine. The number of possible scheduling opportunities for a two-VCPU machine on a four-way or eight-way physical ESX Server host is described by combinatorial mathematics using the formula N! / (R!(N-R)!) where N=the number of physical CPUs on the ESX Server host and R=the number of VCPUs on the machine being scheduled. A two-VCPU virtual machine running on a four-way ESX Server host provides (4! / (2! (4-2)!) which is (4*3*2 / (2*2)) or 6 scheduling possibilities. For those unfamiliar with combinatory mathematics, X! is calculated as X(X-1)(X-2)(X-3)…. (X- (X-1)). For example 5! = 5*4*3*2*1.

Using these calculations, a two-VCPU virtual machine on an eight-way ESX Server host has (8! / (2! (8-2)!) which is (40320 / (2*720)) or 28 scheduling possibilities. This is more than four times the possibilities a four-way ESX Server host can provide. Four-vCPU machines demonstrate this principle even more forcefully. A four-vCPU machine scheduled on a four-way physical ESX Server host provides only one possibility to the scheduler whereas a four-VCPU virtual machine on an eight-CPU ESX Server host will yield (8! / (4!(8-4)!) or 70 scheduling possibilities, but running a four-vCPU machine on a sixteen-way ESX Server host will yield (16! / (4!(16-4)!) which is (20922789888000 / ( 24*479001600) or 1820 scheduling possibilities. That means that the scheduler has 1820 unique ways in which it can place the four-vCPU workload on the ESX Server host. Doubling the physical CPU count from eight to sixteen results in 26 times the scheduling flexibility for the four-way virtual machines. Running a four-way virtual machine on a Host with four times the number of physical processors (16-way ESX Server host) provides over six times more flexibility than we saw with running a two-way VM on a Host with four times the number of physical processors (8-way ESX Server host).

Anyone want to try to extrapolate that and extend it to a 48-core system? :)
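To take a stab at that extrapolation, here's a minimal sketch using the same N! / (R!(N-R)!) formula from the quote. The function name is my own invention, and it only counts raw core combinations; it doesn't claim to model how the real ESX scheduler weighs things like cache or NUMA locality.

```python
from math import comb

def scheduling_possibilities(physical_cpus: int, vcpus: int) -> int:
    """Ways to choose `vcpus` distinct cores out of `physical_cpus`: N! / (R!(N-R)!)."""
    return comb(physical_cpus, vcpus)

# Compare the host sizes from the VMware paper against a 48-core blade.
for physical_cpus in (4, 8, 16, 48):
    for vcpus in (2, 4):
        count = scheduling_possibilities(physical_cpus, vcpus)
        print(f"{vcpus}-vCPU VM on {physical_cpus}-way host: {count} possibilities")
```

By this count a four-vCPU VM on a 48-core host has 194,580 possible placements versus 1,820 on a sixteen-way host, more than 100 times the flexibility.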

It seems like only yesterday that I was building DL380G5 ESX 3.5 systems with 8 CPU cores and 32GB of RAM, with 8x1Gbps links, thinking of how powerful they were. This would be six of those in a single blade. And it only seems like a couple weeks ago I was building VMware GSX systems with dual socket single core systems and 16GB RAM..

So, HP do me a favor and make a G7 blade that can do this, that would make my day! I know fitting all of those components on a single full height blade won't be easy. Looking at the existing BL685c blade, it looks like they could do it: remove the internal disks (who needs em, boot from SAN or something), and put in an extra 16 DIMMs for a total of 48.

I thought about using 8Gbps Fibre Channel but then it wouldn't be 48 all round :)

UPDATE Again I was thinking about this and wanted to compare the costs vs existing technology. I'm estimating roughly a $32,000 price tag for this kind of blade and vSphere Advanced licensing (note you cannot use Enterprise licensing on a 12-core CPU, hardware pricing extrapolated from existing HP BL685G6 quad socket 6 core blade system with 128GB RAM). The approximate price of an 8-way 48-core HP DL785 with 192GB, 4x10GbE and 2x4Gb Fibre Channel with vSphere licensing comes to roughly $70,000 (because VMware charges on a per socket basis the licensing costs go up fast). Not only that but you can only fit 6 of these DL785 servers in a 42U rack, and you can fit 32 of these blades in the same rack with room to spare. So less than half the cost, and 5 times the density (for the same configuration). The DL785 has an edge in memory slot capacity, which isn't surprising given its massive size: it can fit 64 DIMMs vs 48 on my VMware dream machine blade.

Compared to a trio of HP BL495c blades, each with 12 cores and 64GB of memory: approximate pricing for that plus vSphere Advanced is $31,000 for a total of 36 cores and 192GB of memory. So for $1,000 more you can add an extra 12 cores, cut your server count by 66%, probably cut your power usage by some amount and improve consolidation ratios.
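To put the three options side by side, here's a rough cost-per-core sketch. All of the prices are my own estimates from above rather than vendor quotes, and the dictionary layout and function name are just for illustration.

```python
# All prices are the rough estimates quoted above, not vendor list prices.
configs = {
    "dream blade (4 socket)": {"price_usd": 32_000, "cores": 48, "servers": 1},
    "DL785 (8 socket)": {"price_usd": 70_000, "cores": 48, "servers": 1},
    "3x BL495c": {"price_usd": 31_000, "cores": 36, "servers": 3},
}

def cost_per_core(price_usd: float, cores: int) -> float:
    """Estimated dollars per CPU core, hardware plus vSphere licensing."""
    return price_usd / cores

for name, cfg in configs.items():
    per_core = cost_per_core(cfg["price_usd"], cfg["cores"])
    print(f"{name}: ${per_core:,.0f} per core across {cfg['servers']} server(s)")

# Rack density: 32 blades vs 6 DL785 servers in a 42U rack.
density_ratio = 32 / 6
print(f"Blade density advantage: {density_ratio:.1f}x")
```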

So to summarize, two big reasons for this type of solution are:

  • More efficient consolidation on a per-host basis by having fewer "stranded" resources
  • More efficient consolidation on a per-cluster basis because you can get more capacity in the 32-node limit of a VMware cluster (assuming you want to build a cluster that big..). Again addressing the "stranded capacity" issue. Imagine what a resource pool could do with 3.3 THz of compute capacity and 9.2TB of memory? All with line rate 40Gbps networking throughout? All within a single cabinet?

Pretty amazing stuff to me anyways.

[For reference - Enterprise Plus licensing would add an extra $1250/socket plus more in support fees. VMware support costs not included in above pricing.]



Cisco UCS Networking falls short

TechOps Guy: Nate

UPDATED Yesterday when I woke up I had an email from Tolly in my inbox, describing a new report comparing the networking performance of the Cisco UCS vs the HP c Class blade systems. Both readers of the blog know I haven't been a fan of Cisco for a long time (about 10 years, since I first started learning about the alternatives), and I'm a big fan of HP c Class (again never used it, but planning on it). So as you could imagine I couldn't resist reading what it said, considering the amount of hype that Cisco has managed to generate for their new systems (the sheer number of blog posts about it makes me feel sick at times).

I learned a couple things from the report that I did not know about UCS before (I often just write their solutions off since they have a track record of underperformance, overpricing and needless complexity).

The first was that the switching fabric is external to the enclosure, so if two blades want to talk to each other that traffic must leave the chassis in order to do so, an interesting concept which can have significant performance and cost implications.

The second is that the current UCS design is 50% oversubscribed, which is what this report targets as a significant weakness of the UCS vs the HP c Class.

The midplane design of the c7000 chassis is something that HP is pretty proud of (for good reason): it is capable of 160Gbps full duplex to every slot, totaling more than 5 Terabits of fabric. When I talked to them last year they couldn't help but take shots at IBM's blade system and comment on how it is oversubscribed, and how you have to be careful in how you configure the system because of that oversubscription.

This c7000 fabric is far faster than most high end chassis Ethernet switches, and should allow fairly transparent migration to 40Gbps ethernet when the standard arrives for those that need it. In fact HP already has 40Gbps Infiniband modules available for c Class.

The test involved six blades from each solution. When testing throughput of four blades both solutions performed similarly (UCS was 0.76Gbit faster). Add two more blades and start jacking up the bandwidth requirements: HP c Class scales linearly as the traffic goes up, while UCS seems to scale linearly in the opposite direction. The end result is that with 60Gbit of traffic being requested (6 blades @ 10Gbps), HP c Class managed to choke out 53.65Gbps, and Cisco UCS managed to cough up a mere 27.37Gbps. On UCS, pushing six blades at max performance actually resulted in less performance than four blades at max performance, significantly less, illustrating serious weaknesses in the QoS on the system (again big surprise!).
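To put those six-blade numbers in perspective, here's a small sketch that turns them into a percentage of the requested load. The throughput figures are the ones quoted from the Tolly report above; the function name is just mine.

```python
def efficiency_pct(delivered_gbps: float, offered_gbps: float) -> float:
    """Delivered throughput as a percentage of the offered load."""
    return round(delivered_gbps / offered_gbps * 100, 1)

offered_gbps = 6 * 10  # six blades, each requesting 10Gbps

results_gbps = {"HP c Class": 53.65, "Cisco UCS": 27.37}
for system, delivered in results_gbps.items():
    pct = efficiency_pct(delivered, offered_gbps)
    print(f"{system}: {delivered}Gbps of {offered_gbps}Gbps requested ({pct}%)")
```

That works out to roughly 89% of the requested load for HP and under 46% for UCS.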

The report mentions putting Cisco UCS in a special QoS mode for the test because without this mode performance was even worse. There is only 80Gbps of fabric available for use on the UCS(4x10Gbps full duplex). You can get a second fabric module for UCS but it cannot be used for active traffic, only as a backup.

UPDATE - A kind fellow over at Cisco took notice of our little blog here (thanks!!) and wanted to correct what they say is a bad test on the part of Tolly. Apparently Tolly didn't realize that the fabrics could be used in active-active (maybe that complexity thing rearing its head, I don't know). But in the end I believe the test results are still valid, just at an incorrect scale. Each blade requires 20Gbps of full duplex fabric in order to be non blocking throughout. The Cisco UCS chassis provides for 80Gbps of full duplex fabric, allowing 4 blades to be non blocking. HP by contrast allows up to three dual port Flex10 adapters per half height server, which requires 120Gbps of full duplex fabric to support at line rate. Given each slot supports 160Gbps of fabric, you could get another adapter in there but I suspect there isn't enough real estate on the blade to connect the adapter! I'm sure 120Gbps of ethernet on a single half height blade is way overkill, but if it doesn't radically increase the cost of the system, as a techie myself I do like the fact that the capacity is there to grow into.

Things get a little more complicated when you start talking about non blocking internal fabric(between blades) and the rest of the network, since HP designs their switches to support 16 blades, and Cisco designs their fabric modules to support 8. You can see by the picture of the Flex10 switch that there are 8 uplink ports on it, not 16, but it's pretty obvious that is due to space constraints because the switch is half width. END UPDATE

The point I am trying to make here isn't so much the fact that HP's architecture is superior to that of Cisco's. It's not that HP is faster than Cisco. It's the fact that HP is not oversubscribed and Cisco is. In a world where we have had non blocking switch fabrics for nearly 15 years it is disgraceful that a vendor would have a solution where six servers cannot talk to each other without being blocked. I have operated 48-port gigabit switches which have 256 gigabits of switching fabric, which is more than enough for 48 systems to talk to each other in a non blocking way. There are 10Gbps switches that have 500-800 gigabits of switching fabric, allowing 32-48 systems to talk to each other in a non blocking way. These aren't exactly expensive solutions either. That's not even considering the higher end backplane and midplane based systems that run into the multiple terabits of switching fabric, connecting hundreds of systems at line rates.

I would expect such a poor design to come from a second tier vendor, not a vendor that has a history of making networking gear for blade switches for several manufacturers for several years.

So take the worst case: what if you want completely non blocking fabric from each and every system? For me, I am looking to HP c Class and 10Gbps Virtual Connect mainly for intra-chassis communication within the vSphere environment. In this situation with a cheap configuration on HP, you are oversubscribed 2:1 when talking outside of the chassis. For most situations this is probably fine, but say that wasn't good enough for you. Well you can fix it by installing two more 10Gbps switches on the chassis (each switch has 8x10GbE uplinks). That will give you 32x10Gbps uplink ports, enough for 16 blades each having 2x10Gbps connections. All line rate, non blocking throughout the system. That is 320 Gigabits vs 80 Gigabits available on Cisco UCS.
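The uplink math above can be sketched in a few lines. This is a simplified model that assumes every blade port and every uplink runs at 10Gbps and that traffic is spread evenly; the function and parameter names are made up for illustration.

```python
def oversubscription_ratio(blades: int, ports_per_blade: int,
                           switches: int, uplinks_per_switch: int,
                           port_gbps: int = 10) -> float:
    """Ratio of blade-facing bandwidth to uplink bandwidth (1.0 means non blocking)."""
    downstream_gbps = blades * ports_per_blade * port_gbps
    upstream_gbps = switches * uplinks_per_switch * port_gbps
    return downstream_gbps / upstream_gbps

# Cheap configuration: two switches with 8 uplinks each, 16 half height blades.
cheap = oversubscription_ratio(blades=16, ports_per_blade=2, switches=2, uplinks_per_switch=8)
# Add two more switches for 32 uplink ports in total.
full = oversubscription_ratio(blades=16, ports_per_blade=2, switches=4, uplinks_per_switch=8)

print(f"Two switches: {cheap:.0f}:1 oversubscribed")
print(f"Four switches: {full:.0f}:1 (non blocking)")
```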

HP doesn't stop there. With 4x10Gbps switches you've only used up half of the available I/O slots on the c7000 enclosure; can we say 640 Gigabits of total non-blocking ethernet throughput vs 80 gigabits on UCS (single chassis for both)? I mean for those fans of running vSphere over NFS, you could install vSphere on a USB stick or SD card and dedicate the rest of the I/O slots to networking if you really need that much throughput.

Of course this costs more than being oversubscribed, the point is the customer can make this decision based on their own requirements, rather than having the limitation be designed into the system.

Now think about this limitation in a larger scale environment. Think about the vBlock again from that new EMC/Cisco/VMware alliance. Set aside the fact that it's horribly overpriced (I think mostly due to EMC's side). This system is designed to be used in large scale service providers. That means unpredictable loads from unrelated customers running on a shared environment. Toss in vMotion and DRS, and you could be asking for trouble when it comes to this oversubscription stuff. DRS (as far as I know) bases its decisions entirely on CPU and memory usage. At some point I think it will take storage I/O into account as well. I haven't heard of it taking into account network congestion, though in theory it's possible. But it's much better to just have a non blocking fabric to begin with: it will increase your utilization and efficiency, and allow you to sleep better at night.

Makes me wonder how Data Center Ethernet (whatever it's called this week?) holds up under these congestion conditions that the UCS suffers from. Lots of "smart" people spent a lot of time making Ethernet lossless only to design the hardware so that it will incur significant loss in transit. In my experience systems don't behave in a predictable manner when storage is highly constrained.

I find it kind of ironic that a blade solution from the world's largest networking company would be so crippled when it came to the network of the system. Again, not a big surprise to me, but there are a lot of Cisco kids out there I see that drink their koolaid without thinking twice, and of course I couldn't resist ragging on Cisco again.

I won't bother to mention the recent 10Gbps Cisco Nexus test results that show how easily you can cripple its performance as well (while other manufacturers perform properly at non-blocking line rates), maybe I'll save that for another blog entry.

Just think, there is more throughput available to a single slot in an HP c7000 chassis than there is available to the entire chassis on a UCS. If you give Cisco the benefit of the second fabric module, setting aside the fact you can't use it in active-active, the HP c7000 enclosure has 32 times the throughput capacity of the Cisco UCS. That kind of performance gap even makes Cisco's switches look bad by comparison.


SSD Not ready yet?

TechOps Guy: Nate

SSD and storage tiering seem to be hot topics these days. Certain organizations are pushing them pretty hard, though it seems the "market" is not buying the hype, or doesn't see the cost benefit (yet).

In the consumer space SSD seems to be problematic, with seemingly widespread firmware issues, performance issues, and even reliability issues. In the enterprise space most storage manufacturers have yet to adopt it, and I've yet to see a storage array that has enough oomph to drive SSD effectively (TMS units aside). It seems SSD really came out of nowhere and none of the enterprise players have systems that can drive the IOPS that SSD can drive.

And today I see news saying that STEC stock has tanked because they yet again came out and said EMC customers aren't buying SSD, so they aren't selling as much stuff as they thought.

With this delay in adoption for the enterprise space it makes me wonder if STEC will even be around in the future. HDD manufacturers, like enterprise storage companies, sort of missed the boat when it came to SSD, but such a slow adoption rate may allow the manufacturers of spinning rust to catch up and win back the business that they lost to STEC in the meantime.

Then there's the whole concept around automagic storage tiering at the sub volume level. It sounds cool on paper, though I'm not yet convinced of its effectiveness in the real world, mainly due to the delay involved in a system detecting particular hot blocks/regions and moving them to SSD; maybe by the time they are moved the data is no longer needed. I've not yet talked with someone with real world experience with this sort of thing, so I can only speculate at this point. Compellent of course has the most advanced automagic storage tiering today, and they promote it pretty heavily. I've only talked to one person who's worked with Compellent and he said he specifically only recommended their gear for smaller installs. I've never seen SPC-1 numbers posted by Compellent so at least in my mind their implementation remains in question, while the core technology certainly sounds nice.

Coincidentally, Compellent's stock took a similar 25% haircut recently after their earnings were released, I guess expectations were too high.

I'd like to see a long running test, along the lines of what NetApp submitted for SPC-1, for the same array, two tests, one with automagic storage tiering turned on, the other without, and see the difference. I'm not sure how SPC-1 works internally, if it is a suitable test to illustrate automagic storage tiering or not, but at least it's a baseline that can be used to compare with other systems.


Uptime matters

TechOps Guy: Nate

A friend of mine sent me a link to this xkcd comic and said it reminded him of me, I thought it was fitting given the slogan on the site.

Devotion to Duty


AMD 12-core chips on schedule

TechOps Guy: Nate

I came across this article a few days ago on Xbitlabs and was surprised it didn't seem to get replicated elsewhere. I found it while playing with a stock tracking tool on my PDA (I was looking at news regarding AMD). I'm not an investor but I find the markets interesting and entertaining at times.

Anyways it mentioned some good news from my perspective, which is that the 12-core Opterons (I'd rather call them that than their code name because the code names quickly become confusing, I used to stay on top of all the CPU specs back in the Socket 7 days) are on track to ship this quarter. I was previously under the impression, I guess incorrectly, that they would ship by the end of next quarter, and that it was Intel's 8-core chips that would ship this quarter.

From the article

AMD Opteron “Magny-Cours” processor will be the first chip for the AMD G34 “Maranello” platform designed for Opteron processors 6000-series with up to 16 cores, quad-channel memory interface, 2 or 4 sockets, up to 12 memory modules per socket and some server and enterprise-specific functionality. Magny-Cours microprocessors feature two six-core or quad-core dies on one piece of substrate.

I read another article recently on The Register which mentioned AMD's plans to take the chip to 16 cores in 2011. I've been eagerly waiting for the 12-core chips for some time now, mainly for virtualization: having the extra cores gives more CPU scheduler options when scheduling multi vCPU virtual machines. And it further increases the value of dual socket systems, allowing 24 real cores in a dual socket configuration, which to me is just astonishing. And having the ability to have 24 memory sockets on a dual socket system is also pretty amazing. I have my doubts that anyone can fit 24 memory modules on a single half height blade but who knows. Right now to my knowledge HP has the densest half height blade as far as memory is concerned, with 18 DIMMs for a Xeon 5500-based system and 16 DIMMs for a 6-core Opteron-based system. IBM recently announced a new more dense blade with 18 slots but it appears it is full height, so it doesn't really qualify. I think a dual socket full height blade is a waste of space. Some Sun blades have good densities as well though I'm not well versed in their technology.


Why I hate the cloud

TechOps Guy: Nate

Ugh, I hate all this talk about the cloud. For the most part, from what I can see it's a scam to sell mostly overpriced/high margin services to organizations who don't know any better. I'm sure there are plenty of organizations out there that have IT staff that aren't as smart as my cat, but there are plenty that have people that are smarter too.

The whole cloud concept is sold pretty well, I have to admit. It frustrates me so much I don't know how to properly express it. The marketing behind the cloud is such that it gives some people the impression that they can get nearly unlimited resources at their disposal, with good SLAs, good performance and pay pennies on the dollar.

It's a fantasy. That reality doesn't exist. Now sure the cost models of some incompetent organizations out there might be bad enough to the point that clouds make a lot of sense. But again there are quite a few that already have a cost effective way of operating. I suppose I am not the target customer, as every cloud provider I have talked to or seen cost analysis for has come in at a MINIMUM of 2.5-3x more expensive than doing it in house, going as high as 10x. Even the cheap crap that Amazon offers is a waste of money.

From my perspective, a public cloud (by which I mean an external cloud service provider, vs hosting "cloud" in house by way of virtual machines, grid computing and the like) has a few use cases:

  1. Outsourced infrastructure for very small environments. I'm talking single digit servers here, low utilization etc.
  2. Outsourced "managed" cloud services, which would replace managed hosting(in the form of dedicated physical hardware) primarily to gain the abstraction layer from the hardware to handle things like fault tolerance and DR better. Again really only cost effective for small environments.
  3. Peak capacity processing - sounds good on paper, but you really need a scale-out application to be able to handle it, very few applications can handle such a situation gracefully. That is being able to nearly transparently shift compute resources to a remote cloud on demand for short periods of time to handle peak capacity. But I can't emphasize enough the fact that the application really has to be built from the ground up to be able to handle such a situation. A lot of the newer "Web 2.0" type shops are building(or have built) such applications, but of course the VAST majority of applications most organizations will use were never designed with this concept in mind. There are frequently significant concerns surrounding privacy and security.

I'm sure you can extract other use cases, but in my opinion those other use cases assume a (nearly?) completely incompetent IT/Operations staff and/or management layers that prevent the organization from operating efficiently. I believe this is common in many larger organizations unfortunately, which is one reason I steer clear of them when looking for employment.

It just drives me nuts when I encounter someone who either claims the cloud is going to save them all the money in the world, or someone who is convinced that it will (but they haven't yet found the provider that can do it).

Outside of the above use cases, I would bet money that any reasonably efficient IT shop (usually involves a team of 10 or fewer people) can do this cloud thing far cheaper than any service provider would offer the service to them. And if a service provider did happen to offer at or below cost pricing I would call BS on them. Either they are overselling oversubscribed systems that they won't be able to sustain, or they are buying customers so that they can build a customer base. Even Amazon, which people often say is the low cost leader for cloud, is FAR more expensive than doing it in house in every scenario I have seen.

Almost equally infuriating to me are those that believe all virtualization solutions are created equal, and that oh we can go use the free stuff (i.e. "free" Xen) rather than pay for vSphere. I am the first to admit that vSphere Enterprise Plus is not worth the $$ for virtually all customers out there, but there is a TON of value available in the lower end versions of VMware. Much like Oracle, sadly it seems when many people think of VMware they immediately gravitate towards the ultra high end and say "oh no it's too expensive!". I've been running ESX for a few years now and have gotten by just fine without DRS, without host profiles, without distributed switches, without vMotion, without storage vMotion, the list goes on..! Not saying they aren't nice features, but if you are cost conscious you often need to ask yourself, while those are nice to have, do you really need them? I'd wager frequently the answer is no.


Is Virtualisation ready for prime time?

TechOps Guy: Nate

The Register asked that question and some people responded, anyone familiar?

When was your first production virtualisation deployment and what did it entail? My brief story is below(copied from the comments of the first article, easier than re-writing it).

My first real production virtualization deployment was back in mid 2004 I believe, using VMware GSX, I think v3.0 at the time (now called VMware Server).

The deployment was an emergency decision that followed a failed software upgrade to a cluster of real production servers that was shared by many customers. The upgrade was supposed to add support for a new customer that was launching within the week(they had already started a TV advertising campaign). Every attempt was made to make the real deployment work but there were critical bugs and it had to get rolled back, after staying up all night working on it people started asking what we were going to do next.

One idea(forgot who maybe it was me) was to build a new server with vmware and transfer the QA VM images to it(1 tomcat web server, 1 BEA weblogic app server, 1 win2k SQL/IIS server, the main DB was on Oracle and we used another schema for that cluster on our existing DB) and use it for production, that would be the fastest turnaround to get something working. The expected load was supposed to be really low so we went forward. I spent what felt like 60 of the next 72 hours getting the systems ready and tested over the weekend with some QA help, and we launched on schedule on the following Monday.

Why VMs and not real servers? Well we already had the VM images, and we were really short on physical servers, at least good ones anyways. Back then building a new server from scratch was a fairly painful process, though not as painful as integrating a brand new environment. What would usually take weeks of testing we pulled off in a couple of days. I remember one of the tough/last issues to track down was a portion of the application failing due to a missing entry in /etc/hosts (a new portion of functionality that not many were aware of).

The second time I've managed to make The Register(yay!), the first would be a response to my Xiotech speculations a few months back.
