Mar/100
The Atomic Unit of Compute
TechOps Guy: Nate
I found this pretty fascinating. As someone who has been talking to several providers, I think it raises some pretty good points.
[..]Another of the challenges you’ll face along the way of Cloud is that of how to measure exactly what it is you are offering. But having a look at what the industry is doing won’t give you much help… as with so many things in IT, there is no standard. Amazon have their EC2 unit, and state that it is roughly the equivalent of 1.0-1.2GHz of a 2007 Opteron or Xeon CPU. With Azure, Microsoft haven’t gone down the same path – their indicative pricing/sizing shows a base compute unit of 1.6GHz with no indication as to what is underneath. Rackspace flip the whole thing on it’s head by deciding that memory is the primary resource constraint, therefore they’ll just charge for that and presumably give you as much CPU as you want (but with no indication as to the characteristics of the underlying CPU). Which way should you go? IMHO, none of the above.[..]
We need to have a standard unit of compute, that applies to virtual _and_ physical, new hardware and old, irrespective of AMD or Intel (or even SPARC or Power). And of course, it’s not all just about GHz because all GHz are most definitely not equal and yes it _does_ matter to applications. And lets not forget the power needed to deliver those GHz.
In talking with Terremark it seems their model is built around VMware resource pools, where they allocate you a set amount of GHz for your account. They have a mixture of Intel dual socket systems and AMD quad socket systems, and if you run a lot of multi vCPU VMs you have a higher likelihood of ending up in the AMD pool vs the Intel one. I have been testing their vCloud Express product for my own personal needs (1 vCPU, 1.5GB RAM, 50GB HD), and noticed that my VM is on one of the AMD quad socket systems.
Mar/100
Yawn..
TechOps Guy: Nate
I was just watching some of my daily morning dose of CNBC and they had all these headlines about how Cisco was going to make some earth shattering announcement (“Change the internet forever”). Then the announcement hit: some new CRS-1 router that claimed 12x faster performance than the competition. So naturally I was curious. Bob Pisani on the floor of the NYSE was saying how amazing it was that the router could download the Library of Congress in 1 second (he probably didn’t understand the router would have no place to put it).
If I want a high end router that means I’m a service provider and in that case my personal preference would be for Foundry Networks (now Brocade). Juniper makes good stuff too of course though honestly I am not nearly as versed in their technology. Granted I’ll probably never work for such a company as those companies are really big and I prefer small companies.
But in any case I wanted to illustrate (another) point. According to Cisco’s own site, their fastest single chassis system has a mere 4.48 terabits of switching capacity. This is called the CRS-3, which I don’t even see listed as a product on their site; perhaps it’s yet to come. The biggest, baddest product they have on their site right now is a 16-slot CRS-1. This, according to their own site, has a total switching capacity of a paltry 1.2Tbps, and even worse a per-slot capacity of 40Gbps (hello 2003).
So take a look at Foundry Networks (the Brocade name makes me shudder, I have never liked them) and their NetIron XMR series. From their documentation the “total switching fabric” ranges from 960 gigabits on the low end to 7.68 terabits on the high end. Switch forwarding capacity ranges from 400 gigabits to 3.2 terabits. This comes out to 120 gigabits of full duplex switch fabric per slot (same across all models). While I haven’t been able to determine precisely how long the XMR has been on the market, I have found evidence that it is at least nearly 3 years old.
To put it in another perspective, in a 48U rack with the new CRS-3 you can get 4.48 terabits of switching fabric (1 chassis is 48U). With Foundry in the same rack you can get one XMR32k and one XMR16k (combined size 47U) for a total of 11.52 terabits of switching fabric. More than double the fabric in the same space, from a product that is 3 years old. And as you can imagine in the world of IT, 3 years is a fairly significant amount of time.
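For anyone who wants to check the rack math, here is a quick Python sketch that uses only the figures quoted above. The slot count of 32 for the XMR32k is my assumption, inferred from the model name rather than taken from the documentation.

```python
# Switching fabric figures quoted in the post, in gigabits per second.
CRS3_FABRIC_GBPS = 4480        # one CRS-3 chassis (fills a 48U rack)
XMR32K_FABRIC_GBPS = 7680      # high end of the NetIron XMR range
FOUNDRY_RACK_GBPS = 11520      # XMR32k + XMR16k in 47U, as stated above

# Assumption: the XMR32k has 32 line card slots (inferred from its name).
XMR32K_SLOTS = 32

# What the XMR16k must contribute for the stated rack total to hold.
xmr16k_fabric_gbps = FOUNDRY_RACK_GBPS - XMR32K_FABRIC_GBPS

# Total fabric per slot, then halved to express it as full duplex.
per_slot_full_duplex_gbps = XMR32K_FABRIC_GBPS // XMR32K_SLOTS // 2

# How much more fabric the Foundry pair gives in the same rack.
rack_ratio = FOUNDRY_RACK_GBPS / CRS3_FABRIC_GBPS

print(f"XMR16k fabric: {xmr16k_fabric_gbps} Gbps")
print(f"XMR32k per slot: {per_slot_full_duplex_gbps} Gbps full duplex")
print(f"Foundry rack vs CRS-3 rack: {rack_ratio:.2f}x")
```

Under that slot assumption the per-slot number agrees with the 120 gigabits mentioned earlier, and the rack ratio lands comfortably above 2x.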
And while I’m here and talking about Foundry and Brocade, take a look at this from Brocade. It’s funny, it’s like something I would write. It compares the Brocade Director switches vs Cisco (“Numbers don’t lie”). One of my favorite quotes:
To ensure accuracy, Brocade hired an independent electrician to test both the Brocade 48000 and the Cisco MDS 9513 and found that the 120 port Cisco configuration actually draws 1347 watts, 45% higher than Cisco’s claim of 931 watts. In fact, an empty 9513 draws more electrical current (5.6 amps) than a fully-populated 384 port Brocade 48000 (5.2 amps). Below is Brocade’s test data. Where are Cisco’s verified results?
Another:
With 33% more bandwidth per slot (64Gb vs 48Gb), three times as much overall bandwidth (1.5Tb vs 0.5 Tb) and a third the power draw, the Brocade 48000 is a more scalable building block, regardless of the scale, functionality or lifetime of the fabric. Holistically or not, Brocade can match the “advanced functionality” that Cisco claims, all while using far less power and for a much [?? I think whoever wrote it was in a hurry]
That’s just too funny.
Mar/104
The Smooth F5 Big-IP LTM Upgrade That Wasn’t
TechOps Guy: Tycen
A few weeks ago I attended an F5/VMware/Dell luncheon (where Dell failed to show up, something about a prelim ship date of 3 weeks out). After the event I talked to a couple of F5 engineers and asked them about upgrading our Big-IP LTM 3600’s from 9.4.7 to their latest 10.1.0. We have a redundant pair of 3600’s in active/standby mode. According to them, it was as easy as upgrading the standby node, failing over, and then upgrading the other node. We have a pretty basic config, not a lot of nodes/pools/virtual servers and no add-on modules (at this time). We used the default partitioning. Easy as pie.
I followed this F5 guide which for me basically boiled down to these steps:
1. # mkdir /shared/images
2. copy ISO to /shared/images
3. # cd /shared/images
4. # im – This copies over the image2disk utility, and then presents a status message, which lets you know that the im command is no longer supported, and tells you how to proceed
5. # image2disk --instslot=HD1.2 --format=volumes
6. # switchboot -b HD1.2
7. reboot
The trouble started with step 5 above. It gave me an error that I needed to re-activate my keys. Not a big problem, but still made me nervous since I had a narrow window to do this upgrade in. But, re-activation was easy through the web interface (System > License > Re-activate).
The next issue was scarier: after I re-activated, I re-issued the command in step 5 and the Big-IP rebooted automatically (no mention of this in the upgrade doc linked to above). And it took FOREVER to reboot. I’m sure it was doing a lot of really tricky stuff (reformatting and upgrading), but still it was an anxious wait. For me it was about 12 minutes (the linked upgrade guide says between 3 and 7 minutes). I was just about to put my shoes on and head to the datacenter (30 miles away) when the pings started responding.
This is where things got ugly. When the newly upgraded node came back online, it took over and became the ACTIVE node! I was just barely getting logged into it when my internal monitoring reported that the load balancer had failed over. And that wouldn’t have been too bad, because of course I had done a config sync before I started this whole process, except that the now active node couldn’t load its config (more on that below). It was sitting there with a blank config (it had the correct self IPs and HA config) and users were getting nothing, not even a maintenance page. So, I forced it to standby so the other node could at least serve a maintenance page while I figured out why it wasn’t loading the config.
The bigip.conf file was there and looked intact. I can’t remember now what pointed me in the right direction (maybe while doing a b load), but I finally figured out that it was missing some class files in /var/class. I had previously used Jason Rahm’s maintenance page generator script, which creates some class files used for hosting a maintenance page. Apparently the upgrade wiped out those files and the config wouldn’t load without them. (Sidenote: the iRule generated by that script isn’t compatible with 10.x, but there is a new version of the script, v2, that detects what code you’re running and builds the iRule accordingly. I have yet to use it to generate a new maintenance page and iRule.) I rsync’d the class files from the other Big-IP and that allowed the config to load. I was then able to fail back to the Big-IP with the 10.x code and it seems to be working fine. Now I just need to update the other node and pray it doesn’t try to take over after reboot.
The first node I updated was set as the preferred active node (System > High Availability > Redundancy), so maybe that’s why it took over after the upgrade/reboot. But, that would be a bug in my opinion since the other node was healthy and active. Setting this to “None” might have kept the unwanted failover from happening, but I’m not going to downgrade and find out.
Another (minor) annoying thing was that the SSH authorized_keys were wiped out, so some monitoring scripts I had set up didn’t work until I added the monitoring host’s key back in to the authorized_keys file.
One final thing, I did not need to do step 6. Running the switchboot command w/o any arguments shows that HD1.2 is the default and only boot image. And, as I detailed above, the reboot in step 7 was done for me – whether I was ready for it or not.
All in all, it was not a smooth upgrade. But, I’m sure there are a lot worse things that could have happened. And, hey, at least now 10.x has vim!
Mar/100
Dell/Denali Servers/Storage luncheon March 25th
TechOps Guy: Nate
Been a while since I posted an event, but if you’re looking for new servers/storage for your Exchange setup this event may be a good excuse to get away from work for a while.
Join us for a complimentary technical seminar and learn how the Dell EqualLogic PS Series storage solution and Microsoft Exchange, deployed on Dell PowerEdge servers can deliver[..]
Myself, I don’t expect to learn anything, and 3PAR storage can run Exchange for a large number of users (from these numbers you could extrapolate a max of 192,000 mailboxes on a single storage system, each with a heavy I/O profile), so I’m not really in the market for some EqualLogic storage. BUT I like to get away, especially if it’s local. I do find it curious that the event is specifically about Exchange, which reflects the mindset of dedicating storage to a particular application when the industry trend seems to be leaning towards storage that is shared amongst many applications. Given that Microsoft doesn’t appear to be an event sponsor, I find this idea curious.
Thought this was interesting as well, Microsoft recommends RAID 1 for Exchange but (from one of the links above)..
Internal tests performed by 3PAR show that using RAID 5 (7+1)—i.e., seven data blocks per parity block—demonstrated that the same simulated Exchange workload used for Exchange 2007 ESRP testing had disk latencies that were higher than RAID 1 but well within Microsoft’s recommendations[..]
Going from RAID 1+0 to RAID 5+0 (7+1) is a pretty dramatic shift, showing how fast their “Fast” RAID is, and of course if you find out you laid data out incorrectly you can fix it on the fly. I wonder what Dell will say about their stuff.
Mar/102
Avere front ending Isilon
TechOps Guy: Nate
UPDATED
How do all these cool people find our blog? A friendly fellow from Isilon commented that apparently the article from The Register isn’t accurate in that Avere is front ending NetApp gear not Isilon. But in any case I have been thinking about Avere and the Symantec stuff off and on recently anyways.. END UPDATE
A really interesting article over at The Register about how Sony has deployed an Avere cluster(s) to front end their Isilon (and perhaps other) gear. A good quote:
The thing that grabs your attention here is that Avere is being used to accelerate some of the best scale-out NAS on the planet, not bog standard filers with limited scalability.
Avere certainly has some good performance metrics (pay attention to the IOPS per physical disk), and more recently they introduced a model that runs on top of SSD. I haven’t seen any performance results for it yet but I’m sure it’s a significant boost. As The Register mentions in their article, if this technology really is good enough for this purpose it has the potential (of course) to be extremely disruptive in the industry, wreaking havoc with many of the remaining (and very quickly dwindling) smaller scale out NAS vendors. Kind of funny really seeing how Isilon spun the news.
From Avere’s site, in talking about comparing Spec SFS results:
A comparison of these results and the number of disks required shows that Avere used dramatically fewer disks. BlueArc used 292 disks to achieve 146,076 ops/sec with 3.34 ms ORT. Exanet used 592 disks to achieve 119,550 ops/sec with 2.07ms ORT (overall response time). HP used 584 disks to achieve 134,689 ops/sec and 2.53 ms ORT. Huawei Symantec used 960 disks to achieve 176,728 ops/sec with 1.67ms ORT. NetApp used 324 disks to achieve 120,011 ops/sec with 1.95ms ORT. By contrast, Avere used only 79 drives to achieve 131,591 ops/sec with 1.38ms ORT. Doing a little math, Avere achieves 3.3, 8.2, 7.2, 9.0, and 4.5 times more ops/sec per disk used than the other vendors.
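The "little math" in that quote is easy to reproduce. Here is a small Python sketch that recomputes ops/sec per disk from the numbers quoted above and compares each vendor against Avere. The dictionary layout and function name are just my own choices for illustration.

```python
# SPEC SFS figures as quoted above: vendor -> (disks used, ops/sec achieved).
results = {
    "BlueArc": (292, 146_076),
    "Exanet": (592, 119_550),
    "HP": (584, 134_689),
    "Huawei Symantec": (960, 176_728),
    "NetApp": (324, 120_011),
    "Avere": (79, 131_591),
}


def ops_per_disk(vendor: str) -> float:
    """Return achieved ops/sec divided by the number of disks used."""
    disks, ops = results[vendor]
    return ops / disks


avere = ops_per_disk("Avere")

# How many times more ops/sec per disk Avere gets than each other vendor.
ratios = {
    vendor: round(avere / ops_per_disk(vendor), 1)
    for vendor in results
    if vendor != "Avere"
}

for vendor, ratio in ratios.items():
    print(f"Avere vs {vendor}: {ratio}x ops/sec per disk")
```

The ratios come out to the same 3.3, 8.2, 7.2, 9.0 and 4.5 that Avere lists, so at least their arithmetic checks out.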
Which got me thinking again. Symantec last year released a Filestore product, and my friends over at 3PAR were asking me if I was interested in it. So far I have not been, because the only performance numbers released to date have not been very efficient. And it’s still a new product so who knows how well it works in the real world, granted that Symantec does have a history of file systems with their Norton File System (NFS) product.
Unfortunately there isn’t much technical info on the Filestore product on their web site.
Built to run on commodity servers and most storage arrays, FileStore is an incredibly simple-to-install soft appliance. This combination of low-cost hardware, “pay as you grow” scalability and easy administration give FileStore a significant cost advantage over specialized appliances. With support for both SAN and iSCSI storage, FileStore delivers the performance needed for the most demanding applications.
It claims N-way active-active or active-passive clustering, up to 16 nodes in a cluster, up to 2PB of storage and 200 million files per file system, which for most people is more than enough. I don’t know how it is licensed though, or how well it scales on a single node. Could it run on the aforementioned 48-all-round system?
Where does 3PAR fit into this? Well, Symantec was the first company (so far the only one that I know of) to integrate Thin Reclamation into their file system, which integrates really well with 3PAR arrays at least. The file system uses some sort of SCSI command which is passed back to the array when files are deleted/reclaimed, so the I/O never hits the spindles; the array transparently re-maps the blocks to be available for use.
3PAR Thin Reclamation for Veritas Storage Foundation keeps storage volumes thin over time by allowing granular, automated, non-disruptive space reclamation within the InServ array. This is accomplished by communicating deleted block information to the InServ using the Thin Reclamation API. Upon receiving this information, the InServ autonomically frees this allocated but unused storage space. The thin reclamation capabilities provide environments using Veritas Storage Foundation by Symantec an easy way to keep their thin volumes thin over time, especially in situations where a large number of writes and deletes occur.
But I was thinking that you could front end one of these Filestore clusters with an Avere cluster and get some pretty flexible high performing storage.
Something I’d like to explore myself at some point.
Mar/102
The future of networking in hypervisors – not so bright
TechOps Guy: Nate
UPDATED Some networking companies see that they are losing control of the data center networks when it comes to blades and virtualization. One has reacted by making their own blades; others have come up with strategies and are collaborating on standards to try to take back the network by moving the traffic back into the switching gear. Yet another has licensed their OS to have another company make blade switches on their behalf.
Where at least part of the industry wants to go is to move the local switching out of the hypervisor and back into the Ethernet switches. Now this makes sense for the industry, because they are losing their grip on the network when it comes to virtualization. But this is going backwards in my opinion. Several years ago we had big chassis switches with centralized switch fabrics where (I believe, kind of going out on a limb here) if port 1 on blade 1 wanted to talk to port 2, the traffic had to go back to the centralized fabric before port 2 would see it. That’s a lot of distance to travel. Fast forward a few years and now almost every vendor is advertising local switching, which eliminates this trip. It makes things faster and more scalable.
Another similar evolution in switching design was moving from backplane systems to midplane systems. I only learned about some of the specifics recently; prior to that I really had no idea what the difference was between a backplane and a midplane. But apparently the idea behind a midplane is to drive significantly higher throughput on the system by putting the switching fabric closer to the line cards. An inch here, an inch there could mean hundreds of gigabits of lost throughput or increased complexity/line noise etc in order to achieve those high throughput numbers. But again, the idea is moving the fabric closer to what needs it, in order to increase performance. You can see examples of midplane systems in blades with the HP c7000 chassis, or in switches in the Extreme Black Diamond 20808 (page 7). Both of them have things that plug into both the front and the back. I thought that was mainly due to space constraints on the front, but it turns out it seems more about minimizing the distance of connectivity between the fabric on the back and the thing using the fabric on the front. Also note that the fabric modules on the rear are horizontal while the blades on the front are vertical. I think this allows the modules to further reduce the physical distance between the fabric and the device at the other end by directly covering more slots, so there is less distance to travel on the midplane.
Moving the switching out of the hypervisor, if VM #1 wants to talk to VM #2, having that go outside of the server and make a U-turn and come right back into it is stupid. Really stupid. It’s the industry grasping at straws trying to maintain control when they should be innovating. It goes against the two evolutions in switching designs I outlined above.
What I’ve been wanting to see myself is the switch integrated into the server. Have an X GbE chip that has the switching fabric built into it. Most modern network operating systems are pretty modular and portable (a lot of them seem to be based on Linux or BSD). I say integrate it onto the blade for best performance, maybe use the distributed switch framework (or come up with some other more platform independent way to improve management). The situation will only get worse in coming years, with VM servers potentially having hundreds of cores and TBs of memory at their disposal. You’re practically at the point now where you can fit an entire rack of traditional servers onto one hypervisor.
I know that for example Extreme uses Broadcom in most all of their systems, and Broadcom is what most server manufacturers use as their network adapters; even HP’s Flex10 seems to be based on Broadcom? How hard can it be for Broadcom to make such a chip(set) so that companies like Extreme (or whoever else might use Broadcom in their switches) could program it with their own stuff to make it a mini switch?
From the Broadcom press release above (2008):
To date, Broadcom is the only silicon vendor with all of the networking components (controller, switch and physical layer devices) necessary to build a complete end-to-end 10GbE data center. This complete portfolio of 10GbE network infrastructure solutions enables OEM partners to enhance their next generation servers and data centers.
Maybe what I want makes too much sense and that’s why it’s not happening, or maybe I’m just crazy.
UPDATE - I just wanted to clarify my position here, what I’m looking for is essentially to offload the layer 2 switching functionality from the hypervisor to a chip on the server itself. Whether it’s a special 10GbE adapter that has switching fabric or a dedicated add-on card which only has the switching fabric. Not interested in offloading layer 3 stuff, that can be handled upstream. Also interested in integrating things like ACLs, sFlow, QoS, rate limiting and perhaps port mirroring.
Mar/100
ProCurve Not my favorite
TechOps Guy: Nate
I gotta find something new to talk about, after this..
I was thinking this evening and thought about my UCS/HP network shootout post I posted over the weekend and thought maybe I came across too strong in favor of HP’s networking gear.
As all three of you know, HP is not my favorite networking vendor. Not even my second favorite, or even my third.
But they do have some cool technology with this Virtual Connect stuff. I only wish blade interfaces were more standardized.
Feb/100
Forty-eight all round
TechOps Guy: Nate
I was thinking more about the upcoming 12-core Opterons and the next generation of HP c Class blades, and thought of a pretty cool configuration to have, hopefully it becomes available.
Imagine a full height blade that is quad socket, 48 cores (91-115GHz), 48 DIMMs (192GB with 4GB sticks), 4×10Gbps Ethernet links and 2×4Gbps fiber channel links (total of 48Gbps of full duplex bandwidth). The new Opterons support 12 DIMMs per socket, allowing the 48 DIMM slots.
Why 4×10Gbps links? Well I was thinking why not.. with full height blades you can only fit 8 blades in a c7000 chassis. If you put a pair of 2×10Gbps switches in that gives you 16 ports. It’s not much more $$ to double up on 10Gbps ports, especially if you’re talking about spending upwards of say $12k on the blade (guesstimate) and another $9-15k on vSphere software per blade. And 4×10Gbps links gives you up to 16 virtual NICs using Virtual Connect per blade, each of them adjustable in 100Mbps increments.
Also given the fact that it is a full height blade, you have access to two slots worth of I/O, which translates into 320Gbps of full duplex fabric available to a single blade.
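To show where all the 48s come from, here is a quick Python sketch of the arithmetic. The counts are the ones in this post. The per-core clock range and the per-slot fabric figure are not stated directly, so I back them out from the totals above rather than assuming a specific CPU model.

```python
# Blade configuration as described in the post.
SOCKETS = 4
CORES_PER_SOCKET = 12      # the upcoming 12-core Opterons
DIMMS_PER_SOCKET = 12
GB_PER_DIMM = 4

cores = SOCKETS * CORES_PER_SOCKET
dimms = SOCKETS * DIMMS_PER_SOCKET
memory_gb = dimms * GB_PER_DIMM

# 4 x 10Gbps Ethernet plus 2 x 4Gbps fiber channel.
link_gbps = 4 * 10 + 2 * 4

# Derived, not stated in the post: the per-core clock implied by the
# aggregate 91-115GHz range.
low_clock_ghz = 91 / cores
high_clock_ghz = 115 / cores

# Derived: a full height blade spans two I/O slots with 320Gbps in total.
per_slot_fabric_gbps = 320 / 2

print(f"{cores} cores, {dimms} DIMMs, {memory_gb}GB, {link_gbps}Gbps of links")
print(f"Implied clock: {low_clock_ghz:.1f}-{high_clock_ghz:.1f}GHz per core")
print(f"Fabric per I/O slot: {per_slot_fabric_gbps:.0f}Gbps")
```

Swapping in 8Gbps fiber channel would push the link total to 56Gbps, which is exactly why it breaks the pattern.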
That kind of blade ought to handle just about anything you can throw at it. It’s practically a super computer in and of itself. Right now HP holds the top spot for VMmark scores, with an 8 socket 6 core system (48 total cores) outpacing even a 16 socket 4 core system (64 total cores).
The 48 CPU cores will give the hypervisor an amazing number of combinations for scheduling vCPUs. Here’s a slide from a presentation I was at last year which illustrates the concept behind the hypervisor scheduling single and multi vCPU VMs:
There is a PDF out there from VMware that talks about the math formulas behind it all, it has some interesting commentary on CPU scheduling with hypervisors:
[..]Extending this principle, ESX Server installations with a greater number of physical CPUs offer a greater chance of servicing competing workloads optimally. The chance that the scheduler can find room for a particular workload without much reshuffling of virtual machines will always be better when the scheduler has more CPUs across which it can search for idle time.
This is even cooler though, honestly I can’t pretend to understand the math myself! -
Scheduling a two-VCPU machine on a two-way physical ESX Server hosts provides only one possible allocation for scheduling the virtual machine. The number of possible scheduling opportunities for a two-VCPU machine on a four-way or eight-way physical ESX Server host is described by combinatorial mathematics using the formula N! / (R!(N-R)!) where N=the number of physical CPUs on the ESX Server host and R=the number of VCPUs on the machine being scheduled.1 A two-VCPU virtual machine running on a four-way ESX Server host provides (4! / (2! (4-2)!) which is (4*3*2 / (2*2)) or 6 scheduling possibilities. For those unfamiliar with combinatory mathematics, X! is calculated as X(X-1)(X-2)(X-3)…. (X- (X-1)). For example 5! = 5*4*3*2*1.
Using these calculations, a two-VCPU virtual machine on an eight-way ESX Server host has (8! / (2! (8-2)!) which is (40320 / (2*720)) or 28 scheduling possibilities. This is more than four times the possibilities a four-way ESX Server host can provide. Four-vCPU machines demonstrate this principle even more forcefully. A four-vCPU machine scheduled on a four-way physical ESX Server host provides only one possibility to the scheduler whereas a four-VCPU virtual machine on an eight-CPU ESX Server host will yield (8! / (4!(8-4)!) or 70 scheduling possibilities, but running a four-vCPU machine on a sixteen-way ESX Server host will yield (16! / (4!(16-4)!) which is (20922789888000 / ( 24*479001600) or 1820 scheduling possibilities. That means that the scheduler has 1820 unique ways in which it can place the four-vCPU workload on the ESX Server host. Doubling the physical CPU count from eight to sixteen results in 26 times the scheduling flexibility for the four-way virtual machines. Running a four-way virtual machine on a Host with four times the number of physical processors (16-way ESX Server host) provides over six times more flexibility than we saw with running a two-way VM on a Host with four times the number of physical processors (8-way ESX Server host).
Anyone want to try to extrapolate that and extend it to a 48-core system?
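I'll take a stab at it. The formula in the VMware quote is the standard "N choose R" combination count, which Python exposes as `math.comb` (Python 3.8 and later). The sketch below treats all 48 cores as interchangeable physical CPUs, which is my simplifying assumption; it ignores NUMA boundaries and whatever else a real scheduler takes into account. The function name is mine, not VMware's.

```python
from math import comb


def scheduling_possibilities(physical_cpus: int, vcpus: int) -> int:
    """Count the ways to pick `vcpus` CPUs out of `physical_cpus`.

    This is N! / (R!(N-R)!) from the quoted VMware document.
    """
    return comb(physical_cpus, vcpus)


# Sanity check against the figures quoted above.
assert scheduling_possibilities(4, 2) == 6
assert scheduling_possibilities(8, 2) == 28
assert scheduling_possibilities(8, 4) == 70
assert scheduling_possibilities(16, 4) == 1820

# Extend it to the hypothetical 48-core blade.
for vcpus in (2, 4, 8):
    count = scheduling_possibilities(48, vcpus)
    print(f"{vcpus}-vCPU VM on 48 cores: {count:,} possibilities")
```

By this count a four-vCPU VM on 48 cores has 194,580 possible placements, about 107 times what the 16-way host in the quote offers. Take it as a rough illustration of the trend, not a statement about how ESX actually schedules.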
It seems like only yesterday that I was building DL380G5 ESX 3.5 systems with 8 CPU cores and 32GB of ram, with 8×1Gbps links thinking of how powerful they were. This would be six of those in a single blade. And only seems like a couple weeks ago I was building VMware GSX systems with dual socket single core systems and 16GB ram..
So, HP do me a favor and make a G7 blade that can do this, that would make my day! I know fitting all of those components on a single full height blade won’t be easy. Looking at the existing BL685c blade, it looks like they could do it, remove the internal disks(who needs em, boot from SAN or something), and put an extra 16 DIMMs for a total of 48.
I thought about using 8Gbps fiber channel but then it wouldn’t be 48 all round
Feb/102
Cisco UCS Networking falls short
TechOps Guy: Nate
UPDATED Yesterday when I woke up I had an email from Tolly in my inbox, describing a new report comparing the networking performance of the Cisco UCS vs the HP c Class blade systems. Both readers of the blog know I haven’t been a fan of Cisco for a long time (about 10 years, since I first started learning about the alternatives), and I’m a big fan of HP c Class (again never used it, but planning on it). So as you could imagine I couldn’t resist reading what it said, considering the amount of hype that Cisco has managed to generate for their new systems (the sheer number of blog posts about it makes me feel sick at times).
I learned a couple things from the report that I did not know about UCS before (I often times just write their solutions off since they have a track record of under performance, over price and needless complexity).
The first was that the switching fabric is external to the enclosure, so if two blades want to talk to each other that traffic must leave the chassis in order to do so, an interesting concept which can have significant performance and cost implications.
The second is that the current UCS design is 50% oversubscribed, which is what this report targets as a significant weakness of the UCS vs the HP c Class.
The mid plane design of the c7000 chassis is something that HP is pretty proud of (for good reason), capable of 160Gbps full duplex to every slot, totaling more than 5 Terabits of fabric. When I talked to them last year they couldn’t help but take shots at IBM’s blade system, commenting on how it is oversubscribed and how you have to be careful in how you configure the system because of that oversubscription.
This c7000 fabric is far faster than most high end chassis Ethernet switches, and should allow fairly transparent migration to 40Gbps ethernet when the standard arrives for those that need it. In fact HP already has 40Gbps Infiniband modules available for c Class.
The test involved six blades from each solution. When testing throughput of four blades both solutions performed similarly (UCS was 0.76Gbit faster). Add two more blades and start jacking up the bandwidth requirements. HP c Class scales linearly as the traffic goes up; UCS seems to scale linearly in the opposite direction. End result is with 60Gbit of traffic being requested (6 blades @ 10Gbps), HP c Class managed to choke out 53.65Gbps, and Cisco UCS managed to cough up a mere 27.37Gbps. On UCS, pushing six blades at max performance actually resulted in less performance than four blades at max performance, significantly less. That illustrates serious weaknesses in the QoS on the system (again big surprise!).
The report mentions putting Cisco UCS in a special QoS mode for the test because without this mode performance was even worse. There is only 80Gbps of fabric available for use on the UCS(4×10Gbps full duplex). You can get a second fabric module for UCS but it cannot be used for active traffic, only as a backup.
UPDATE – A kind fellow over at Cisco took notice of our little blog here (thanks!!) and wanted to correct what they say is a bad test on the part of Tolly. Apparently Tolly didn’t realize that the fabrics could be used in active-active (maybe that complexity thing rearing its head, I don’t know). But in the end I believe the test results are still valid, just at an incorrect scale. Each blade requires 20Gbps of full duplex fabric in order to be non blocking throughout. The Cisco UCS chassis provides for 80Gbps of full duplex fabric, allowing 4 blades to be non blocking. HP by contrast allows up to three dual port Flex10 adapters per half height server, which requires 120Gbps of full duplex fabric to support at line rate. Given each slot supports 160Gbps of fabric, you could get another adapter in there but I suspect there isn’t enough real estate on the blade to connect the adapter! I’m sure 120Gbps of ethernet on a single half height blade is way overkill, but if it doesn’t radically increase the cost of the system, as a techie myself I do like the fact that the capacity is there to grow into.
Things get a little more complicated when you start talking about non blocking internal fabric(between blades) and the rest of the network, since HP designs their switches to support 16 blades, and Cisco designs their fabric modules to support 8. You can see by the picture of the Flex10 switch that there are 8 uplink ports on it, not 16, but it’s pretty obvious that is due to space constraints because the switch is half width. END UPDATE
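To put numbers on the oversubscription argument, here is a rough Python sketch using the Tolly results and the fabric figures from the update above. The way I turn port counts into "full duplex fabric" (ports x 10Gbps x 2 directions) is my own reading of how the 20Gbps and 120Gbps figures were arrived at, so treat that helper as an assumption.

```python
# Tolly test: six blades each requesting 10Gbps.
REQUESTED_GBPS = 6 * 10
delivered_gbps = {"HP c Class": 53.65, "Cisco UCS": 27.37}

# Fraction of the requested traffic each system actually delivered.
delivered_fraction = {
    name: gbps / REQUESTED_GBPS for name, gbps in delivered_gbps.items()
}


def fabric_needed_gbps(ports_10g: int) -> int:
    """Fabric needed for line rate, counting both directions.

    Assumption: each 10Gbps port needs 10Gbps in and 10Gbps out.
    """
    return ports_10g * 10 * 2


UCS_FABRIC_GBPS = 80
per_blade_gbps = fabric_needed_gbps(1)                 # one 10Gbps port per blade
non_blocking_blades = UCS_FABRIC_GBPS // per_blade_gbps

# HP: three dual port Flex10 adapters on one half height blade.
hp_max_blade_gbps = fabric_needed_gbps(3 * 2)

for name, fraction in delivered_fraction.items():
    print(f"{name}: {fraction:.1%} of requested traffic delivered")
print(f"UCS blades that can run non blocking: {non_blocking_blades}")
print(f"HP half height blade at full tilt: {hp_max_blade_gbps}Gbps of fabric")
```

With these inputs HP delivers roughly 89% of what was asked for and UCS roughly 46%, and the 80Gbps UCS fabric covers four blades, which matches the point where the Tolly results fall apart.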
The point I am trying to make here isn’t so much the fact that HP’s architecture is superior to that of Cisco’s. It’s not that HP is faster than Cisco. It’s the fact that HP is not oversubscribed and Cisco is. In a world where we have had non blocking switch fabrics for nearly 15 years it is disgraceful that a vendor would have a solution where six servers cannot talk to each other without being blocked. I have operated 48-port gigabit switches which have 256 gigabits of switching fabric; that is more than enough for 48 systems to talk to each other in a non blocking way. There are 10Gbps switches that have 500-800 gigabits of switching fabric allowing 32-48 systems to talk to each other in a non blocking way. These aren’t exactly expensive solutions either. That’s not even considering the higher end backplane and midplane based systems that run into the multiple terabits of switching fabric connecting hundreds of systems at line rates.
I would expect such a poor design to come from a second tier vendor, not a vendor that has a history of making networking gear for blade switches for several manufacturers for several years.
So take the worst case: what if you want a completely non blocking fabric from each and every system? For me, I am looking to HP c Class and 10Gbps Virtual Connect mainly for intra-chassis communication within the vSphere environment. In this situation, with a cheap configuration on HP, you are oversubscribed 2:1 when talking outside of the chassis. For most situations this is probably fine, but say that wasn’t good enough for you. Well, you can fix it by installing two more 10Gbps switches on the chassis (each switch has 8×10GbE uplinks). That will give you 32×10Gbps uplink ports, enough for 16 blades each having 2×10Gbps connections. All line rate, non blocking throughout the system. That is 320 Gigabits vs 80 Gigabits available on Cisco UCS.
HP doesn’t stop there. With 4×10Gbps switches you’ve only used up half of the available I/O slots on the c7000 enclosure. Can we say 640 Gigabits of total non-blocking ethernet throughput vs 80 Gigabits on UCS (single chassis for both)? I mean, for those fans of running vSphere over NFS, you could install vSphere on a USB stick or SD card and dedicate the rest of the I/O slots to networking if you really need that much throughput.
Of course this costs more than being oversubscribed, the point is the customer can make this decision based on their own requirements, rather than having the limitation be designed into the system.
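The uplink math from the last few paragraphs can be sketched like this. The port speed and uplinks per switch come from the text above; the "cheap configuration" being two switches is inferred from "installing two more" getting you to 32 uplinks, and the helper names are hypothetical.

```python
PORT_GBPS = 10           # each Flex10 port and each uplink runs at 10Gbps
UPLINKS_PER_SWITCH = 8   # each switch has 8x10GbE uplinks


def uplink_capacity_gbps(switches):
    """Total uplink bandwidth leaving the chassis for a given switch count."""
    return switches * UPLINKS_PER_SWITCH * PORT_GBPS


def blade_demand_gbps(blades, ports_per_blade):
    """Bandwidth the blades could push if every port ran at line rate."""
    return blades * ports_per_blade * PORT_GBPS


demand = blade_demand_gbps(16, 2)  # 16 blades, 2x10Gbps each

# Inferred cheap config (2 switches), the non blocking config (4),
# and every I/O slot filled with a switch (8).
for switches in (2, 4, 8):
    capacity = uplink_capacity_gbps(switches)
    print(switches, capacity, demand / capacity)
```

With two switches the ratio comes out at 2:1, with four it is 1:1, and with all eight slots filled the uplink capacity reaches the 640 Gigabits mentioned above.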
Now think about this limitation in a larger scale environment. Think about the vBlock again from that new EMC/Cisco/VMware alliance. Set aside the fact that it’s horribly overpriced (I think mostly due to EMC’s side). This system is designed to be used by large scale service providers. That means unpredictable loads from unrelated customers running on a shared environment. Toss in vMotion and DRS, and you could be asking for trouble when it comes to this oversubscription stuff. DRS (as far as I know) bases its decisions entirely on CPU and memory usage. At some point I think it will take storage I/O into account as well. I haven’t heard of it taking network congestion into account, though in theory it’s possible. But it’s much better to just have a non blocking fabric to begin with; you will increase your utilization and efficiency, and sleep better at night.
Makes me wonder how Data Center Ethernet (whatever it’s called this week) holds up under the congestion conditions that the UCS suffers from. Lots of “smart” people spent a lot of time making Ethernet lossless, only to design the hardware so that it will incur significant loss in transit. In my experience systems don’t behave in a predictable manner when storage is highly constrained.
I find it kind of ironic that a blade solution from the world’s largest networking company would be so crippled when it comes to the network of the system. Again, not a big surprise to me, but there are a lot of Cisco kids out there that drink the koolaid without thinking twice, and of course I couldn’t resist ragging on Cisco again.
I won’t bother to mention the recent 10Gbps Cisco Nexus test results that show how easily you can cripple its performance as well (while other manufacturers perform properly at non-blocking line rates). Maybe I will save that for another blog entry.
Just think, there is more throughput available to a single slot in a HP c7000 chassis than there is available to the entire chassis on a UCS. If you give Cisco the benefit of the second fabric module, setting aside the fact you can’t use it in active-active, the HP c7000 enclosure has 32 times the throughput capacity of the Cisco UCS. That kind of performance gap even makes Cisco’s switches look bad by comparison.
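One way to reproduce that 32x figure, assuming it comes from 16 half height slots at the 160Gbps per slot mentioned in the update, measured against 80Gbps for the whole UCS chassis with both fabric modules counted. This is my reconstruction of the arithmetic, not a vendor number.

```python
C7000_SLOTS = 16           # assumption: 16 half height slots in a c7000
C7000_GBPS_PER_SLOT = 160  # per slot fabric figure quoted in the update
UCS_CHASSIS_GBPS = 80      # whole chassis, both fabric modules counted


def chassis_capacity_gbps(slots, gbps_per_slot):
    """Aggregate fabric capacity across every slot in the enclosure."""
    return slots * gbps_per_slot


c7000_total = chassis_capacity_gbps(C7000_SLOTS, C7000_GBPS_PER_SLOT)
ratio = c7000_total / UCS_CHASSIS_GBPS

print(c7000_total)  # 2560
print(ratio)        # 32.0
# A single c7000 slot versus the entire UCS chassis:
print(C7000_GBPS_PER_SLOT > UCS_CHASSIS_GBPS)  # True
```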
Feb/100
SSD Not ready yet?
TechOps Guy: Nate
SSD and storage tiering seem to be hot topics these days, certain organizations are pushing them pretty hard, though it seems the “market” is not buying the hype, or doesn’t see the cost benefit(yet).
In the consumer space SSD seems to be problematic, with seemingly widespread firmware issues, performance issues, and even reliability issues. In the enterprise space most storage manufacturers have yet to adopt it, and I’ve yet to see a storage array that has enough oomph to drive SSD effectively (TMS units aside). It seems SSD really came out of nowhere and none of the enterprise players have systems that can drive the IOPS that SSD can deliver.
And today I see news saying that STEC stock has tanked because they yet again came out and said EMC customers aren’t buying SSD, so they aren’t selling as much stuff as they thought.
With this delay in adoption in the enterprise space, it makes me wonder if STEC will even be around in the future. HDD manufacturers, like enterprise storage companies, sort of missed the boat when it came to SSD, but such a slow adoption rate may allow the manufacturers of spinning rust to catch up and win back the business that they lost to STEC in the meantime.
Then there’s the whole concept around automagic storage tiering at the sub volume level. It sounds cool on paper, though I’m not yet convinced of its effectiveness in the real world, mainly due to the delay involved in a system detecting particular hot blocks/regions and moving them to SSD. Maybe by the time they are moved the data is no longer needed. I’ve not yet talked with someone with real world experience with this sort of thing, so I can only speculate at this point. Compellent of course has the most advanced automagic storage tiering today, and they promote it pretty heavily. I’ve only talked to one person who’s worked with Compellent, and he said he specifically only recommended their gear for smaller installs. I’ve never seen SPC-1 numbers posted by Compellent, so at least in my mind their implementation remains in question, while the core technology certainly sounds nice.
Coincidentally, Compellent’s stock took a similar 25% haircut recently after their earnings were released. I guess expectations were too high.
I’d like to see a long running test, along the lines of what NetApp submitted for SPC-1, for the same array, two tests, one with automagic storage tiering turned on, the other without, and see the difference. I’m not sure how SPC-1 works internally, if it is a suitable test to illustrate automagic storage tiering or not, but at least it’s a baseline that can be used to compare with other systems.
