TechOpsGuys.com Diggin' technology every day

August 3, 2011

VMware revamps vSphere 5 licensing again

Filed under: Virtualization — Nate @ 5:40 pm

I guess someone high up over there was listening; it's nice to see the community had some kind of impact. VMware has adjusted its policies to some degree. The result is far from perfect, but it is more bearable than the original plan.

The conspiracy theorist in me thinks VMware put bogus numbers out there to begin with, never having any intention of following through with them, just to gauge the reaction, and then adjusted them to what they probably would have offered originally, trying to make people feel like they “won” by getting VMware to reduce the impact to some degree.

vSphere Enterprise List Pricing comparison (w/o support)

# of Sockets | RAM    | vSphere 4 Enterprise | vSphere 5 Enterprise (old) | vSphere 5 Enterprise (new) | Cost increase over vSphere 4
2            | 256GB  | 2 Licenses - $5,750  | 8 Licenses - $23,000       | 4 Licenses - $11,500       | 100%
4            | 512GB  | N/A                  | 16 Licenses - $46,000      | 8 Licenses - $23,000       | N/A
8            | 1024GB | N/A                  | 32 Licenses - $92,000      | 16 Licenses - $46,000      | N/A

vSphere Enterprise+ List Pricing comparison (w/o support)

# of Sockets | RAM    | vSphere 4 Enterprise+ | vSphere 5 Enterprise+ (old)    | vSphere 5 Enterprise+ (new)    | Cost increase over vSphere 4
2            | 256GB  | 2 Licenses - $6,990   | 5 Licenses (240GB) - $17,475   | 3 Licenses (288GB) - $10,485   | 50% higher
4            | 512GB  | 4 Licenses - $13,980  | 11 Licenses (528GB) - $38,445  | 5 Licenses (480GB) - $17,475   | 25% higher
8            | 1024GB | 8 Licenses - $27,960  | 21 Licenses (1008GB) - $73,995 | 11 Licenses (1056GB) - $38,445 | 37% higher

There were other changes too; see the official VMware blog post above for the details. They quadrupled the amount of vRAM available for the free ESXi to 32GB, which I still think is not enough; it should be at least 128GB.

Also, of course, the vRAM entitlements are pooled across licenses, so the numbers shift a bit depending on the number of hosts and how memory is spread among them.

One of the bigger changes is that VMs larger than 96GB will not need more than one license. Though I can't imagine there are many 96GB VMs out there… even with one license, if I wanted several hundred gigs of RAM for a system I would put it on real hardware and get more CPU cores to boot (it's not unlikely you would have 48-64+ CPU cores for such a system, which is far beyond where vSphere 5 can scale for a single VM).

I did some rounding in the price estimates, because the amounts of RAM specified do not divide cleanly into the per-license vRAM entitlements.
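
If you want to sanity check the tables above or run your own configurations, here is a minimal sketch of the math as I understand it. The per-CPU vRAM entitlements (32GB/48GB per Enterprise/Enterprise Plus license under the original vSphere 5 scheme, 64GB/96GB under the revised one) and the per-license list prices are simply what the table figures imply, so treat them as assumptions rather than official numbers.

```python
import math

# Per-CPU vRAM entitlements in GB, as implied by the tables above -- assumptions,
# not official figures; check VMware's licensing documents before relying on them.
VRAM_PER_LICENSE = {
    ("Enterprise", "old"): 32,
    ("Enterprise", "new"): 64,
    ("Enterprise+", "old"): 48,
    ("Enterprise+", "new"): 96,
}

# Approximate list price per CPU license, without support (also implied by the tables).
LIST_PRICE = {"Enterprise": 2875, "Enterprise+": 3495}

def licenses_needed(sockets, ram_gb, edition, scheme, round_to_nearest=False):
    """One license per socket at minimum, plus enough licenses to cover the host's RAM.
    The tables in this post round to the nearest license; real purchasing rounds up."""
    covered = ram_gb / VRAM_PER_LICENSE[(edition, scheme)]
    for_ram = round(covered) if round_to_nearest else math.ceil(covered)
    return max(sockets, for_ram)

def list_cost(sockets, ram_gb, edition, scheme):
    licenses = licenses_needed(sockets, ram_gb, edition, scheme, round_to_nearest=True)
    return licenses * LIST_PRICE[edition]

if __name__ == "__main__":
    # Reproduces the vSphere 5 Enterprise columns of the first table.
    for sockets, ram in [(2, 256), (4, 512), (8, 1024)]:
        old = list_cost(sockets, ram, "Enterprise", "old")
        new = list_cost(sockets, ram, "Enterprise", "new")
        print(f"{sockets} sockets / {ram}GB: ${old:,} (old scheme) vs ${new:,} (new scheme)")
```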

It seems VMware has effectively priced their “Enterprise” product out of the market if you have any more than a trivial amount of memory. vSphere 4 Enterprise was, of course, limited to 256GB of RAM, but look at its cost compared to the new scheme; the difference is pretty staggering.

Quad socket 512GB looks like the best bet on these configurations anyways.

I would still like to see pricing based more on features than on hardware. E.g. give me vSphere Standard edition with 96GB of vRAM licensing per CPU, because a lot of those things in Enterprise+ I don't need (some are nice to have, but very few are critical for most people, I believe). As it is, users are forced into the higher tiers by the arbitrary limits set on the licensing; it's not as bad as the original vSphere 5 pricing, but still pretty bad for some users when compared to vSphere 4.

Or give me free ESXi with the ability to individually license software features such as vMotion etc on top of it on a per-socket basis or something.

I think the licensing scheme needs more work. VMware could also do their customers a favor by communicating how this will change in the future; as bigger and bigger machines come out, it's logical to expect the memory limits to be increased over time.

The biggest flaw in the licensing scheme remains that it measures what is provisioned rather than what is used. There is no excuse for this from VMware, since they own the hypervisor and have all the data.

Billing based on what is provisioned rather than what is used is the biggest scam of this whole cloud era.

July 20, 2011

VMware Licensing models

Filed under: Virtualization — Nate @ 5:38 am

[ was originally combined with another post but I decided to split out ]

VMware has provided its own analysis of their customers' hardware deployments and is telling folks that ~95% of their customers won't be impacted by the licensing changes. I feel pretty confident that most of those customers are likely massively under-utilizing their hardware. I feel confident because I went through that phase as well. Very, very few workloads are truly CPU bound, especially with 8-16+ cores per socket.

It wouldn't surprise me at all if many of those customers change their strategy pretty dramatically when they go to refresh their hardware, provided the licensing permits it. The new licensing makes me think we should bring back 4GB memory sticks and 1GbE. It is very wasteful to assign 11 CPU licenses to a quad socket system with 512GB of memory; at the absolute minimum, memory-only licenses should be available at a significant discount over CPU+memory licenses. Not only that, but large amounts of memory are actually affordable now. It's still hard for me to get used to the idea that you can have a machine with a TB of memory in it for around $100k; it wasn't TOO long ago that it would have run you ten times that.

And as to VMware's own claims that this new scheme will help align ANYTHING better by using memory pools across the cluster, just keep this in mind: before this change we didn't have to care about memory at all. Whether we used 1% or 95%, whether some hosts used all of their RAM and others used hardly any, it didn't matter. VMware is not making anything simpler. I read somewhere about them saying some crap about aligning more with IT as a service. Are you kidding me? How many buzzwords do we need here?

The least VMware can do is license based on usage. Remember: pay for what you use, not what you provision. And when I say usage I mean actual usage, not charging me for the memory my Linux systems are allocating towards (frequently) empty disk buffers (which goes back to the memory balloon argument). If I allocate 32GB of RAM to a VM that is only using 1GB of memory, I should be charged for 1GB, not 32GB. Using vSphere's own active memory monitor would be an OK start.

Want to align better and be more dynamic? Align based on memory usage and CPU usage: let me run unlimited cores on the cluster and monitor actual usage on a per-socket basis, so if on average you're using 40% of your CPU, you only need 40% of the licensing (say you bill on the 95th percentile, similar to bandwidth). I still much prefer a flat licensing model in almost any arrangement over usage based, but if you're going to make it usage based, really make it usage based.
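
To make the bandwidth-style idea concrete, here is a minimal sketch of 95th percentile billing applied to per-socket CPU (or active memory) samples. This is purely my own illustration with made-up sample data, not anything VMware offers: collect a usage sample every few minutes, throw away the top 5% of samples for the billing period, and bill on the highest remaining one.

```python
def percentile_95(samples):
    """Classic burstable-billing math: sort the samples, discard the top 5%,
    and bill on the highest remaining sample."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    index = int(len(ordered) * 0.95) - 1   # e.g. 8,640 five-minute samples/month -> index 8207
    return ordered[max(index, 0)]

# Hypothetical five-minute CPU utilization samples for a 4-socket host (fractions of total).
cpu_samples = [0.35, 0.40, 0.38, 0.92, 0.41, 0.37, 0.39, 0.36, 0.42, 0.40]  # ...8,640 in a real month

sockets = 4
billable = percentile_95(cpu_samples)
print(f"Bill for {billable * sockets:.1f} of {sockets} socket licenses "
      f"({billable:.0%} utilization at the 95th percentile)")
```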

Oh yeah, and forget about anything that charges you per VM too (hello SRM). That's another bogus licensing scheme. It goes completely against the trend of splitting workloads up into more isolated VMs, and instead favors fewer, much larger VMs that are doing a lot of things at the same time. Even on my own personal co-located ESXi server I have 5 VMs; I could consolidate them down to two and provide similar end-user services, but it's much cleaner to do it in 5 for my own sanity.

All of this new licensing stuff also makes me think back to a project I was working on about a year ago, trying to find some way of doing DR in the cloud. The break-even point for doing it in house vs. any cloud on the market (I looked at about 5 different ones at the time) was never more than 3 months. In one case the up-front costs for the cloud were 4 times the cost of doing it internally. The hardware needs were modest in my opinion, with the physical hardware not even requiring two full racks of equipment. The #1 cost driver was memory and #2 was CPU; storage was a distant third, coming in at about 10-15% of the total cost of the cloud solution, assuming the storage the providers spec'd could meet the IOPS and throughput requirements.

Since most of my VMware deployments have been in performance-sensitive situations (lots of Java), I run the systems with zero swapping; everything in memory has to stay in physical RAM.

Cluster DRS

Filed under: Virtualization — Nate @ 12:05 am

Given the recent price hikes that VMware is imposing on its customers (because they aren't making enough money, obviously), and looking at the list of new things in vSphere 5 and being, well, underwhelmed (compared to vSphere 4), I brainstormed a bit and thought about what kind of things I'd like to see VMware add.

VMware seems to be getting more aggressive in going after service providers (their early attempts haven't been successful; it seems they have fewer partners now than a year ago, and by the way I am a vCloud Express end-user at the moment). An area where VMware has always struggled is the scalability of their clusters (granted, such figures have not been released for vSphere 5, but I am not holding my breath for a 10-100x+ increase in scale).

Whether it's the number of virtual machines in a cluster, the number of nodes, the scalability of the VMFS file system itself (assuming that's what you're using), etc.

For the most part, of course, a cluster is like a management domain, which means it is, in a way, a single point of failure. So it's pretty common for people to build multiple clusters when they have a decent number of systems; if someone has 32 servers, it is unlikely they are going to build a single 32-node cluster.

A feature I would like to see is Cluster DRS, and Cluster HA. Say for example you have several clusters: some are very memory heavy for loading a couple hundred VMs per host (typically 4-8 sockets with several hundred gigs of RAM), others are compute heavy with very low CPU consolidation ratios (probably dual socket with 128GB or less of memory). Each cluster by itself is a stand-alone cluster, but there is loose logic that binds them together to allow the seamless transport of VMs between clusters, either for load balancing or for fault tolerance. Combine and extend regular DRS to span clusters; on top of that you may need transparent Storage vMotion (if required), along with the possibility of mapping storage on the target host (on the fly) in order to move the VM over (the forthcoming storage federation technologies could really help make hypervisor life simpler here, I think).

Maybe a lot of this could be done using yet another management cluster of some kind, a sort of independent proxy of things (running on independent hardware and perhaps even dedicated storage). In the unlikely event of a catastrophic cluster failure, the management cluster would pick up on this, move the VMs to other clusters and restart them (provided there are sufficient resources, of course!). In very large environments it may not be possible to map everything to everywhere, which would require multiple Storage vMotions in order to get the VM from the source to a destination the target host can access; if this can be done at the storage layer via the block-level replication stuff first introduced in VAAI, that could greatly speed up what otherwise might be a lengthy process.

Since it is unlikely anyone is going to be able to build a single cluster with shared storage that spans a great many systems (100s+) and have it be bulletproof enough to provide 99.999% uptime, this kind of capability would be a stop gap, providing the flexibility and availability of a single massive cluster while reducing the complexity of trying to build software that can actually pull off the impossible (or what seems impossible today).

On the topic of automated cross cluster migrations, having global spare hardware would be nice too, much like most storage arrays have global hot spares, which can be assigned to any degraded RAID group on the system regardless of what shelf it may reside on. Global spare servers would be shared across clusters, and assigned on demand. A high end VM host is likely to cost upwards of $50,000+ in hardware these days, multiply by X number of clusters and well.. you get the idea.

While I’m here, I might as well say I’d like the ability to hot remove memory, Hyper-V has dynamic memory which seems to provide this functionality. I’m sure the guest OSs would need to be re-worked a bit too in order to support this, since in the physical world it’s not too common to need to yank live memory from a system. In the virtual world it can be very handy.

Oh and I won’t forget – give us an ability to manually control the memory balloon.

Another area that could use some improvement is vMotion compatibility. There is EVC, but last I read you still couldn't cross processor manufacturers when doing vMotion with EVC. KVM can apparently do it today.

July 12, 2011

VMware jacks up prices too

Filed under: Virtualization — Nate @ 4:34 pm

Not exactly hot on the heels of Red Hat’s 260% price increase, VMware has done something similar with the introduction of vSphere 5 which is due later this year.

The good: They seem to have eliminated the # of core/socket limit for each of the versions, and have raised the limit of vCPUs per guest to 8 from 4 on the low end, and to 32 from 8 on the high end.

The bad: They have tied licensing to the amount of memory on the server. Each CPU license is granted a set amount of memory it can address.

The ugly: The amount of memory addressable per CPU license is really low.

Example 1 – 4x[8-12] core CPUs with 512GB memory

  • vSphere 4 cost with Enterprise Plus w/o support (list pricing)  = ~$12,800
  • vSphere 5 cost with Enterprise Plus w/o support (list pricing)  = ~$38,445
  • vSphere 5 cost with Enterprise w/o support (list pricing)         = ~$46,000
  • vSphere 5 cost with Standard w/o support (list pricing)           = ~$21,890

So you pay almost double for the low end version of vSphere 5 vs the highest end version of vSphere 4.

Yes you read that right, vSphere 5 Enterprise costs more than Enterprise Plus in this example.
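
For what it's worth, those vSphere 5 figures fall straight out of the per-CPU vRAM entitlements (24GB for Standard, 32GB for Enterprise, 48GB for Enterprise Plus, per the "currently" numbers later in this post) and the per-CPU list prices the totals imply (roughly $995, $2,875 and $3,495 without support; my own back-of-the-envelope figures, not official quotes). With 512GB of RAM you need roughly 512/48 = 11 Enterprise Plus licenses (~$38,445), 512/32 = 16 Enterprise licenses ($46,000), or 512/24 = about 22 Standard licenses (~$21,890). Enterprise Plus comes out cheaper than Enterprise here simply because its larger vRAM entitlement covers the same memory with fewer licenses.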

Example 2 – 8×10 core CPUs with 1024GB memory

  • vSphere 4 cost with Enterprise Plus w/o support (list pricing) = ~$25,600
  • vSphere 5 cost with Enterprise Plus w/o support (list pricing) = ~$76,890

It really is an unfortunate situation. While it is quite common to charge per CPU socket, or in some cases per CPU core, I have not heard of a licensing scheme that charges for memory.

I have been saying that I expect to be using VMware vSphere myself until the 2012 time frame, at which point I hope KVM is mature enough to be a suitable replacement (I realize there are some folks out there using KVM now; it's just not mature enough for my own personal taste).

The good news, if you can call it that, is as far as I can tell you can still buy vSphere 4 licenses, and you can even convert vSphere 5 licenses to vSphere 4 (or 3). Hopefully VMware will keep the vSphere 4 license costs around for the life of (vSphere 4) product, which would take customers to roughly 2015.

I have not seen much info about what is new in vSphere 5; for the most part all I see are scalability enhancements for the ultra high end (e.g. 36Gbit/s network throughput, 1 million IOPS, supporting more vCPUs per VM; the number of customers that need that I can probably count on one hand). With vSphere 4 there were many good technological improvements that made it compelling for pretty much any customer to upgrade (unless you were using RDM with SAN snapshots); I don't see the same in vSphere 5, at least at the core hypervisor level. My own personal favorites among the vSphere 4 enhancements over 3 were ESXi boot from SAN, Round Robin MPIO, and the significant improvements in the base hypervisor code itself.

I can't think of a whole lot of things I would want to see in vSphere 5 that aren't already in vSphere 4; my needs are somewhat limited though. Most of the features in vSphere 4 are nice to have, though for my own needs they are not requirements. For the most part I'd be happy on vSphere Standard edition (with vMotion, which was added to the Standard edition license list about a year ago); the only reason I go for the higher-end versions is the license limitations on hardware. The base hypervisor has to be solid as a rock though.

In my humble opinion, the memory limits should look more like

  • Standard = 48GB (Currently 24GB)
  • Enterprise = 96GB (Currently 32GB)
  • Enterprise Plus = 128GB (Currently 48GB)

It just seems wrong to have to load 22 CPU licenses of vSphere on a host with 8 CPUs and 1TB of memory.

I remember upgrading from ESX 3.5 to 4.0, it was so nice to see that it was a free upgrade for those with current support contracts.

I have been a very happy, loyal and satisfied user and customer of VMware's products since 1999; put simply, they have created some of the most robust software I have ever used (second perhaps to Oracle). Maybe I have just been lucky over the years, but the number of real problems (e.g. ones that caused downtime) I have had with their products has been tiny; I don't think I'd need more than one hand to count them. I have never once had an ESX or GSX server crash, for example. I see mentions of the PSOD that ESX belches out on occasion, but I have yet to see it in person myself.

I've really been impressed by the quality and performance (even going back as far as my first e-commerce launch on VMware GSX 3.0 in 2004, when we did more transactions the first day than we were expecting for the entire first month), so I'm happy to admit I have become loyal to them over the years (for good reason, IMO). Pricing moves like this, though, are very painful, and it will be difficult to break that addiction.

This also probably means that if you want to use the upcoming Opteron 6200 16-core CPUs (also due in Q3) with vSphere, you will probably have to use vSphere 5, since vSphere 4 is restricted to 12 cores per socket (though it would be interesting to see what would happen if you tried).

If I’m wrong about this math please let me know, I am going by what I read here.

Microsoft’s gonna have a field day with these changes.

And people say there’s no inflation going on out there..

sigh

January 31, 2011

Terremark snatched by Verizon

Filed under: General,Virtualization — Nate @ 9:34 pm

Sorry to my three readers out there for not posting recently; I've been pretty busy! And to me there haven't been many events in the tech world in the past month or so that have gotten me interested enough to write about them.

One recent event that did was Verizon’s acquisition of Terremark, a service I started using about a year ago.

I was talking with a friend of mine recently; he was thinking about either throwing a 1U server into a local co-location facility or playing around with one of the cloud service providers. Since I am still doing both (been too lazy to completely move out of the co-lo…), I gave him my own thoughts, and it sort of made me think more about the cloud in general.

What do I expect from a cloud?

When I'm talking cloud I'm mainly referring to IaaS, or Infrastructure as a Service. Setting aside cost modelling and the like for a moment, I expect the IaaS to more or less just work. I don't want to have to care about:

  • Power supply failure
  • Server failure
  • Disk drive failure
  • Disk controller failure
  • Scheduled maintenance (e.g. host server upgrades either software or hardware, or fixes etc)
  • Network failure
  • UPS failure
  • Generator failure
  • Dare I say it ? A fire in the data center?
  • And I absolutely want to be able to run what ever operating system I want, and manage it the same way I would manage it if it was sitting on a table in my room or office. That means boot from an ISO image and install like I would anything else.

Hosting it yourself

I’ve been running my own servers for my own personal use since the mid 90s. I like the level of control it gives me and the amount of flexibility I have with running my own stuff. Also gives me a playground on the internet where I can do things. After multiple power outages over the first part of the decade, one of which lasted 28 hours, and the acquisition of my DSL provider for the ~5th time, I decided to go co-lo. I already had a server and I put it in a local, Tier 2 or Tier 3 data center. I could not find a local Tier 4 data center that would lease me 1U of space. So I lacked:

  • Redundant Power
  • Redundant Cooling
  • Redundant Network
  • Redundant Servers (if my server chokes hard I’m looking at days to a week+ of downtime here)

For the most part I guess I had been lucky; the facility had one, maybe two outages since I moved in about three years ago. The bigger issue was that my server was aging and the disks were failing; it was a pain to replace them, and it wasn't going to be cheap to replace the system with something modern and capable of running ESXi in a supported configuration (my estimates put the cost at a minimum of $4k). Add to that the fact that I need such a tiny amount of server resources.

Doing it right

So I had heard of Terremark from my friends over at 3PAR, and you know I like 3PAR, and they use VMware and I like VMware. So I decided to go with them rather than the other providers out there; they had a decent user interface and I got up and going fairly quickly.

So I've been running it for almost a year with pretty much no issues. I wish they had a bit more flexibility in the way they provision networking, but nothing is perfect (well, unless you have the ability to do it yourself).

From a design perspective, Terremark has done it right, whether it's providing an easy-to-use interface to provision systems, using advanced technology such as VMware, 3PAR, and NetScaler load balancers, or building their data centers to be even fire proof.

Having the ability to do things like vMotion or Storage vMotion is absolutely critical for a service provider; I can't imagine anyone being able to run a cloud with a diverse set of customers without such functionality. Having things like 3PAR's persistent cache is critical as well, to keep performance up in the event of planned or unplanned downtime in the storage controllers.

I look forward to the day where the level of instrumentation and reporting in the hypervisors allow billing based on actual usage, rather than what is being provisioned up front.

Sample capabilities

In case you're a less technical user, I wanted to outline a few of the abilities that the technology Terremark uses offers their customers –

Memory Chip Failure (or any server component failure or change)

Most modern servers have sensors on them and for the most part are able to accurately predict when a memory chip is behaving badly, warning the operator of the machine to replace it. But unless you're running on some very high end specialized equipment (which I assume Terremark is not, because it would cost too much for their customers to bear), the operator needs to take the system offline in order to replace the bad hardware. So what do they do? They tell VMware to move all of the customer virtual machines off the affected server onto other servers; this is done without customer impact, and the customer never knows it is going on. The operator can then take the machine offline, replace the faulty components, and reverse the process.

Same applies to if you need to:

  • Perform firmware or BIOS updates/changes
  • Perform Hypervisor updates/patches
  • Maybe you're retiring an older type of server and moving to a more modern system

Disk failure

This one is pretty simple: a disk fails in the storage system and the vendor is dispatched to replace it, usually within four hours. But they may opt to wait a longer period of time for whatever reason; with 3PAR it doesn't really matter. There are no dedicated hot spares, so you're really in no danger of losing redundancy; the system rebuilds quickly using a many:many RAID relationship and is fully redundant once again in a matter of hours (vs. days with older systems and whole-disk-based RAID).

Storage controller software upgrade

There are fairly routine software upgrades on modern storage systems; the software feature set seems to just grow and grow. So the ability to perform the upgrade without disrupting users for too long (maybe a few seconds) is really important with a diverse set of customers, because there will probably be no good time where all customers say it's OK to have some downtime. So having highly available storage with the ability to maintain performance while a controller is offline, by mirroring the cache elsewhere, is a very useful feature to have.

Storage system upgrade (add capacity)

Being able to add capacity without disruption and dynamically re-distribute all existing user data across all new as well as current disk resources on-line to maximize performance is a boon for customers as well.

UPS failure (or power strip/PDU failure)

Unlike the small dinky UPS you may have in your house or office, UPSs in data centers typically power up to several hundred machines, so if one fails you may be in for some trouble. But with redundant power you have little to worry about; the other power supply takes over without interruption.

If a server power supply blows up it has the ability to take out the entire branch or even whole circuit that it’s connected to. But once again redundant power saves the day.

Uh-oh I screwed up the network configuration!

Well now you've done it: you hosed the network (or maybe your system just dropped off the network for some reason, a flaky network driver or something) and you can't connect to your system via SSH or RDP or whatever you were using. Fear not: establish a VPN to the Terremark servers and you can get console access to your system. If only the console worked from Firefox on Linux... can't have everything I guess. Maybe they will introduce support for vSphere 4.1's virtual serial concentrators soon.

It just works

There are some applications out there that don't need the level of reliability that the infrastructure Terremark uses can provide, and they prefer to distribute things over many machines or many data centers or something; that's fine too. But most apps, almost all apps in fact, make the same common assumption, perhaps you can call it the lazy assumption: they assume that it will just work. Which shouldn't surprise many, because achieving that level of reliability at the application layer alone is an incredibly complex task to pull off. So instead you have multiple layers of reliability under the application, each handling a subset of availability, layers that have been evolving for years or even decades in some cases.

Terremark just works. I’m sure there are other cloud service providers out there that work too, I haven’t used them all by any stretch(nor am I seeking them for that matter).

Public clouds make sense, as I’ve talked about in the past for a subset of functionality, they have a very long ways to go in order to replace what you can build yourself in a private cloud (assuming anyone ever gets there). For my own use case, this solution works.

November 11, 2010

Extreme VMware

Filed under: Networking,Virtualization — Nate @ 7:29 pm

So I was browsing some of the headlines of the companies I follow during lunch and came across this article (seems available on many outlets), which I thought was cool.

I've known VMware has been a very big, happy user of Extreme Networks gear for a good long time now, though I wasn't aware of anything public about it, at least until today. It really makes me feel good that despite VMware's partnerships with EMC and NetApp, which include Cisco networking gear, at the end of the day they chose not to run Cisco for their own business.

But going beyond even that it makes me feel good that politics didn’t win out here, obviously the people running the network have a preference, and they were either able to fight, or didn’t have to fight to get what they wanted. Given VMware is a big company and given their big relationship with Cisco I would kind of think that Cisco would try to muscle their way in. Many times they can succeed depending on the management at the client company, but fortunately for the likes of VMware they did not.

SYDNEY, November 12. Extreme Networks, Inc., (Nasdaq: EXTR) today announced that VMware, the global leader in virtualisation and cloud infrastructure, has deployed its innovative enterprise, data centre and Metro Ethernet networking solutions.

VMware’s network features over 50,000 Ethernet ports that deliver connectivity to its engineering lab and supports the IT infrastructure team for its converged voice implementation.

Extreme Networks met VMware’s demanding requirements for highly resilient and scalable network connectivity. Today, VMware’s thousands of employees across multiple campuses are served by Extreme Networks’ leading Ethernet switching solutions featuring 10 Gigabit Ethernet, Gigabit Ethernet and Fast Ethernet, all powered by the ExtremeXOS® modular operating system.

[..]

“We required a robust, feature rich and energy efficient network to handle our data, virtualised applications and converged voice, and we achieved this through a trusted vendor like Extreme Networks, as they help it to achieve maximum availability so that we can drive continuous development,” said Drew Kramer, senior director of technical operations and R&D for VMware. “Working with Extreme Networks, from its high performance products to its knowledgeable and dedicated staff, has resulted in a world class infrastructure.”

Nice to see technology win out for once instead of back room deals which often end up screwing the customer over in the long run.

Since I'm here I guess I should mention the release of the X460 series of switches, which came out a week or two ago, intended to replace the now 4-year-old X450 series (both “A” and “E”). Notable differences and improvements include:

  • Dual hot swap internal power supplies
  • User swappable fan tray
  • Long distance stacking over 10GbE – up to 40 kilometers
  • Clear-Flow is now available when the switches are stacked (prior hardware switches could not be stacked and use Clear-Flow)
  • Stacking module is now optional (X450 it was built in)
  • Standard license is Edge license (X450A was Advanced Edge) – still software upgradable all the way to Core license (BGP etc). My favorite protocol ESRP requires Advanced Edge and not Core licensing.
  • Hardware support for IPFIX, which they say is complementary to sFlow
  • Lifetime hardware warranty with advanced hardware replacement (X450E had lifetime, X450A did not)
  • Layer 3 Virtual Switching (yay!) – I first used this functionality on the Black Diamond 10808 back in 2005, it’s really neat.

The X460 seems to be aimed at the mid to upper range of GbE switches, with the X480 being the high end offering.

November 4, 2010

Chicken and the egg

Filed under: Random Thought,Storage,Virtualization — Nate @ 5:24 pm

Random thought time! I came across an interesting headline on Chuck's Blog: Attack of the Vblock Clones.

Now I'm the first to admit I didn't read the whole thing, but the basic gist is that if you want a fully tested, integrated stack (of course you know I don't like these stacks, they restrict you too much; the point of open systems is that you can connect many different types of systems together and have them work, but anyways), then you should go with their VBlock because it's there now, tested, deployed, etc. Other recently announced initiatives are responses to the VBlock and VCE, Arcadia(sp?) etc.

I've brought up 3cV before, something that 3PAR coined back almost 3 years ago now. It is, in their words, a “Validated Blueprint of 3PAR, HP, and VMware Products Can Halve Costs and Floor Space”.

And for those that don’t know what 3cV is, a brief recap –

The Elements of 3cV
3cV combines the following products from 3PAR, HP, and VMware to deliver the virtual data center:

  • 3PAR InServ Storage Server featuring Virtual Domains and thin technologies—The leading utility storage platform, the 3PAR InServ is a highly virtualized tiered-storage array built for utility computing. Organizations creating virtualized IT infrastructures for workload consolidation use the 3PAR InServ to reduce the cost of allocated storage capacity, storage administration, and the SAN infrastructure.
  • HP BladeSystem c-Class—The No. 1 blade infrastructure on the market for datacenters of all sizes, the HP BladeSystem c-Class minimizes energy and space requirements and increases administrative productivity through advantages in I/O virtualization, power and cooling, and manageability. (1)
  • VMware Infrastructure—Infrastructure virtualization suite for industry-standard servers. VMware Infrastructure delivers the production-proven efficiency, availability, and dynamic management needed to build the responsive data center.

Sounds to me like 3cV beat VBlock to the punch by quite a ways. It would have been interesting to see how Dell would have handled the 3cV solution had they managed to win the bidding war, given they don't have anything that competes effectively with c-Class. But fortunately HP won out, so 3cV can be just that much more official.

It's not sold as a pre-packaged product, I guess you could say, but how hard is it to say: I need this much CPU, this much RAM, this much storage; HP, go get it for me. Really, it's not hard. The hard part is all the testing and certification. Even if 3cV never existed, you can bet your ass that it would work regardless. It's not that complicated, really. Even if Dell had managed to buy 3PAR and killed off the 3cV program because they wouldn't want to directly promote HP's products, you could still buy the 3PAR from Dell and the blades from HP and have it work. But of course you know that.

The only thing missing from 3cV is I’d like a more powerful networking stack, or at least sFlow support. I’ll take Flex10 (or Flexfabric) over Cisco any day of the week but I’d still like more.

I don’t know why this thought didn’t pop into my head until I read that headline, but it gave me something to write about.

But whatever, that’s my random thought of the day/week.

October 21, 2010

Red Hat wants to end “IT Suckage”

Filed under: Datacenter,Virtualization — Nate @ 8:50 am

Read an interesting article over on The Register with a lot of comments by a Red Hat executive.

And I can't help but disagree with a bunch of the stuff the executive says. But it could be because the executive is looking at, and talking with, big bloated slow-moving organizations that have a lot of incompetent people in their ranks (the “never got fired for buying X” mantra), instead of smaller, more nimble, more leading-edge organizations willing, ready and able to take some additional “risk” for a much bigger return, such as running virtualized production systems. That seems like a common concept to many, but I know there are a bunch of people out there who aren't convinced it will work. By the way, I ran my first VMware in production in 2004 and saved my company BIG BUCKS with the customer (that's a long story, and an even longer weekend).

OK so this executive says

After all, processor and storage capacity keep tracking along on their respective Moore’s and Kryder’s Laws, doubling every 18 months, and Gilder’s Law says that networking capacity should double every six months. Those efficiencies should lead to comparable economies. But they’re not.

I was just thinking this morning about the price and capacity of the latest systems(sorry keep going back to the BL685c G7 with 48 cores and 512GB of ram 🙂 ).

I remember back in 2004/2005 time frame the company I was at paying well over $100,000 for a 8-way Itanium system with 128GB of memory to run Oracle databases. The systems of today whether it is the aforementioned blade or countless others can run circles around such hardware now at a tiny fraction of the price. It wasn’t unreasonable just a few short years ago to pay more than $1M for a system that had 512GB of memory and 24-48 CPUs, and now you can get it for less than $50,000(in this case using HP web pricing). That big $1M system probably consumed at least 5-10kW of power and a full rack as well, vs now the same capacity can go for ~800W(100% load off the top of my head) and you can get at least 32 of them in a rack(barring power/cooling constraints).

Granted that big $1M system was far more redundant and available than the small blade or rack mount server, but at the time if you wanted so many CPU cores and memory in a single system you really had no choice but to go big, really big. And if I was paying $1M for a system I’d want it to be highly redundant anyways!

With networking, well 10GbE has gotten to be dirt cheap, just think back a few years ago if you wanted a switch with 48 x 10GbE ports you’d be looking at I’d say $300k+ and it’d take the better part of a rack. Now you can get such switches in a 1U form factor from some vendors(2U from others), for sub $40k?

With storage, well, spinning rust hasn't evolved all that much over the past decade in terms of performance, unfortunately, but technologies like distributed RAID have managed to extract an enormous amount of untapped capacity out of the spindles that older architectures are simply unable to exploit. More recently, with the introduction of SSDs and the sub-LUN automagic storage tiering technology that is emerging (I think it's still a few years away from being really useful), you can really get a lot more bang out of your system. EMC's FAST Cache looks very cool too, from a conceptual perspective at least; I've never used it and don't know anyone who has, but I do wish 3PAR had it! Assuming I understand the technology right, the key is that the SSDs are used for both read and write caching, versus something like the NetApp PAM card which is only a read cache. Neither FAST Cache nor PAM is enough to make me want to use those platforms for my own stuff.

The exec goes on to say

Simply put, Whitehurst’s answer to his own question is that IT vendors suck, and that the old model of delivering products to customers is fundamentally broken.

I would tend to agree for the most part, but there are those out there that really are awesome. I was lucky enough to find one such vendor, and a few such manufacturers. As one vendor I deal with says, they work with the customer, not with the manufacturer; they work to give the customer what is best for them. So many vendors I have dealt with over the years are really lazy when it comes down to it: they only know a few select solutions from a few big-name organizations and give blank stares if you go outside their realm of comfort (random thought: I got the image of Speed Bump, the roadkill possum from a really old TV series called Liquid Television that I watched on MTV for a brief time in the 90s).

By the same token, while most IT vendors suck, most IT managers suck too, for the same reason. Probably because most people suck; that may be what it comes down to at the end of the day. IT, as you well know, is still an emerging industry, still a baby really, evolving very quickly, but it has a ways to go. So like with anything, the people out there who can best leverage IT are few and far between. Most of the rest are clueless, like my first CEO about 10-11 years ago, who was convinced he could replace me with a tech head from Fry's Electronics (despite my 3 managers telling him he could not). About a year after I left the company he did in fact hire such a person; only problem was that individual never showed up for work (maybe he forgot).

Exec goes on to say..

“Functionality should be exploding and costs should be plummeting — and being a CIO, you should be a rock star and out on the golf course by 3 pm,” quipped Whitehurst to his Interop audience.

That is in fact what is happening, provided you're choosing the right solutions and have the right people to manage them. The possibilities are there; most people just don't realize it or don't have the capacity to evolve into what could be called the next generation of IT. They have been doing the same thing for so long, it's hard to change.

Speaking of being a rock star and out on the golf course by 3pm, I recall two things I've heard in the past year or so:

The first one used the golf course analogy. It came from a local VMware consulting shop that has a bunch of smart folks working for them, and I thought it was a really funny strategy that could work quite well in many cases. The person took the industry average of, say, 2-3 days to provision a new physical system, and said that in the virtual world you shouldn't tell your customers you can provision that new system in ten minutes; tell them it will take you 2-3 days, spend the ten minutes doing what you need, and spend the rest of the time on the golf course.

The second one was from a 3PAR user, I believe, who told one of their internal customers/co-workers something along the lines of “You know how I tell you it takes me a day to provision your 10TB of storage? Well I lied, it only takes me about a minute”.

For me, I’m really too honest I think, I tell people how long I think it will really take and at least on big projects am often too optimistic on time lines. Maybe I should take up Scotty’s strategy and take my time lines and multiply them by four to look like a miracle worker when it gets done early. It might help to work with a project manager as well, I haven’t had one for any IT projects in more than five years now. They know how to manage time (if you have a good one, especially one experienced with IT not just a generic PM).

Lastly the exec says

The key to unlocking the value of clouds is open standards for cloud interoperability, says Whitehurst, as well as standardization up and down the stack to simplify how applications are deployed. Red Hat’s research calculates that about two-thirds of a programmer’s time is spent worrying about how the program will be deployed rather than on the actual coding of the program.

Worrying about how the program will be deployed is a good thing, an absolutely good thing. Rewinding again to 2004, I remember a company meeting where one of the heads of the company stood up and said something along the lines of: 2004 was the year of operations, we worked hard to improve how the product operates, and the next phase is going back to feature work for customers. I couldn't believe my ears. That year was the worst for operations, filled with half-implemented software solutions that actually made things worse instead of better; outages increased, stress increased, turnover increased.

The only thing I could do from an operations perspective was buy a crap load of hardware and partition the application to make it easier to manage. We ended up with tons of excess capacity, and the development teams were obviously unable to make the design changes we needed to improve the operations of the application, but we at least had something that was more manageable. The deployment and troubleshooting teams were so happy when the new stuff was put into production: no longer did they have to try to parse gigabyte-sized log files trying to find which errors belonged to which transactions from which subsystem. Traffic for different subsystems was routed to different physical systems, so if there was an issue with one type of process you went to server farm X to look at it; problem resolution was significantly faster.

I remember having one conversation with a software architect in early 2005 about a particular subsystem that was very poorly implemented (or maybe even designed); it caused us massive headaches in operations, non-stop problems really. His response was: well, I invited you to an architecture meeting in January of 2004 to talk about this but you never showed up. I don't remember the invite, but if I saw it I know why I didn't show up; it's because I was buried in production outages 24/7 and had no time to think more than 24 hours ahead, let alone think about a software feature that was months away from deployment. I just didn't have the capacity; I was running on fumes for more than a year.

So yes, if you are a developer please do worry about how it is deployed, never stop worrying. Consult your operations team (assuming they are worth anything), and hopefully you can get a solid solution out the door. If you have a good experienced operations team then it’s very likely they know a lot more about running production than you do and can provide some good insight into what would provide the best performance and uptime from an operations perspective. They may be simple changes, or not.

One such example: I was working at a now-defunct company that had a hard-on for Ruby on Rails. They were developing app after app on this shiny new platform. They were seemingly trying to follow Service Oriented Architecture (SOA), something I ironically learned about at a Red Hat conference a few years ago (I didn't know there was an acronym for that sort of thing, it seemed so obvious). I had a couple of really simple suggestions for them to take into account for how we would deploy these new apps. Their original intentions called for basically everything running under a single Apache instance (across multiple systems), and, for example, if Service A wanted to talk to Service B then it would talk to that service on the same server. My suggestions, which we went with, involved two simple concepts:

  • Each application had its own Apache instance, listening on its own port
  • Each application lived behind a load balancer virtual IP with associated health checking, with all application-to-application communication flowing through the load balancer (a minimal sketch of the pattern follows below)
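
The apps in question were Rails behind Apache; the sketch below just illustrates the pattern in Python with made-up hostnames and ports, not how we actually wired it up. Each service listens on its own port and exposes a trivial health check for the load balancer, and calls to other services always go through the load balancer's virtual IP rather than to localhost.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVICE_PORT = 8101                       # hypothetical: every app gets its own port
ORDERS_VIP = "http://orders.lb.internal"  # hypothetical load balancer VIP, never localhost

class ReportService(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # The load balancer polls this; failing it pulls this instance out of rotation.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        elif self.path == "/report":
            # Talk to the other service through its VIP so the LB handles health and failover.
            with urllib.request.urlopen(f"{ORDERS_VIP}/orders/today", timeout=5) as resp:
                orders = json.load(resp)
            body = json.dumps({"order_count": len(orders)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", SERVICE_PORT), ReportService).serve_forever()
```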

Towards the end we had upwards of I’d say 15 of these apps running on a small collection of servers.

The benefits are pretty obvious, but the developers weren't versed in operations, which is totally fine; they don't need to be (though it can be great when they are, I've worked with a few such people, though they are VERY RARE). That's what operations people do, and you should involve them in your development process.

As for cloud standards, folks are busy building those as we speak and type. VMware seems to be the furthest along from an infrastructure cloud perspective, I believe; I wouldn't expect them to lose their leadership position anytime soon, as they have an enormous amount of momentum behind them and it takes a lot to counter that.

About a year ago I was talking to some former co-workers who told me another funny story: they were launching a new version of software to production, and the software had been crashing their test environments daily for about a month. They had a go/no-go meeting in which everyone involved with the product said NO GO. But management overrode them, and they deployed it anyway. The result? A roughly 14-hour production outage while they tried to roll the software back. I laughed and said, things really haven't changed since I left, have they?

So the solutions are there; the software and hardware companies have been evolving their stuff for years. The problem is that the concepts can become fairly complex when talking about things like capacity utilization and stranded resources. Getting the right people in place to not only find such solutions but also deploy and manage them can really go a long way, but those people are rare at this point.

I haven’t been writing too much recently been really busy, Scott looks to be doing a good job so far though.

 

October 11, 2010

Qlogic answers my call for help

Filed under: Networking,Virtualization — Nate @ 8:53 am

THANK YOU QLOGIC. I have been a long-time user of QLogic stuff and like them a lot. If you have been reading this blog for a while, you may have noticed that earlier in the year I was criticizing the network switch industry (my favorite manufacturers included) for going down the route of trying to “reclaim the network” by working on standards that would move inter-VM switching traffic out of the host and back into the network switches. I really think the whole concept is stupid, a desperate attempt to hold onto what will be a dramatically declining ports market in the coming years. Look no further than my recent post on testing the limits of virtualization.

My answer to the dilemma ? Put a layer 2 hardware switching fabric into the server, less latency, faster performance.

And Qlogic has done just that. I will refrain from using colorful metaphors to describe my glee, but I certainly hope this is a trend going forward.

According to our friends at The Register, Qlogic has released new Converged Network Adapters (CNA) that includes an integrated layer 2 switch for virtual machines.

EMEA Marketing head for QLogic, Henrik Hansen, said: “Within the ASIC we have embedded a layer 2 Ethernet switch [and] can carve up the two physical ports into 4 NIC partitions or NPARs, which can each be assigned to a specific VM. There can be eight of them with dual-port product.” An Ethernet message from one VM to another in the same server goes to the QLogic ASIC and is switched back to the target VM. This is reminiscent of Emulex’ VNIC feature.

From the specs:

  • PCI Express Gen2 x8
  • Dual 10Gbps and quad 1Gbps ports on a single controller
  • Integrated 10GBase-KR and 10GBase-T PHYs
  • Concurrent TCP/IP, FCoE, and iSCSI protocol support with full hardware offload
  • Industry standard SR-IOV and QLogic’s switch-agnostic NIC Partitioning (NPAR)
  • Wake-on-LAN including Magic Packet recognition
  • Common drivers and API’s with existing QLogic NIC, FCoE, and iSCSI products

Side note: I love that they have 10GbaseT too!!

I think the ASIC functionality needs more work as it seems limited to supporting only a couple VMs rather than being a more generic switching fabric but we gotta start somewhere!

The higher-end 8200 CNA looks like it has much of the same technology available in the HP FlexFabric (which I know is at least partly based on QLogic technology, though it might not be these specific ASICs; I don't know).

VMflex. With QLogic’s new VMflex technology, one Converged Network Adapter is viewed by the server operating system (OS) as a flexible mix (up to  four per physical port) of standalone NICs, FCoE adapters, and iSCSI adapters, with the ability to allocate guaranteed bandwidth to each virtual adapter.  This unique feature can be switch dependent or switch agnostic— it is not necessary to pair an 8200 Series adapter with any specific 10GbE switch model to enable partitioning.

I would love to see more technical information on the VMFlex and the layer 2 switching fabric, I tried poking around on Qlogic’s site but didn’t come up with anything too useful.

So I say again, thank you Qlogic, and I hope you have started a trend here. I firmly believe that offloading the switching functionality to an ASIC rather than performing it in software is critical, and when you have several hundred VMs running on a single server not wasting your uplink bandwidth to talk between them is just as critical. The functionality of the ASIC need not offer too much, for me I think the main things would be vlan tagging and sFlow, some folks may want QoS as well.

My other request (I don't know if it is already possible or not) is to be able to run a mix of jumbo frames and standard frame sizes on different virtual NICs riding on the same physical network adapter, without configuring everything for jumbo frames, because that causes compatibility issues (especially for anything using UDP!).

The networking industry has it backwards in my opinion, but I can certainly understand the problem they face.

October 8, 2010

I/O Virtualization for mortals

Filed under: Networking,Virtualization — Nate @ 10:24 pm

This product isn't that new, but I haven't seen many people talk about it; I first came across it a few weeks ago and it certainly looked very innovative.

I don’t know who started doing I/O virtualization first, maybe it was someone like Xsigo, or maybe it was HP with their VirtualConnect or maybe it was someone else, but the space has heated up in the past couple of years.

Neterion, a name that sounds familiar but I can't quite place it… a company by the name of Exar may have bought them or something. But anyways, they have an interesting virtualized NIC, the X3120 V-NIC, which looks pretty cool:

Neterion’s family of 10 Gigabit Ethernet adapters offer a unique multi-channel device model. Depending upon the product, a total of between eight and seventeen fully independent, hardware-based transmit and receive paths are available; each path may be prioritized for true Quality-of-Service support.

I/O Virtualization Support

  • Special “multi-function PCI device” mode brings true IOV to any industry-standard server. In multi-function mode, up to 8 physical functions are available (more in ARI-capable systems). Each physical function appears to the system as an independent Ethernet card
  • Unique, hardware-based multi-channel architecture mitigates head-of-line blocking and allows direct data transfer between hardware channels and host-based Virtual Machines without hypervisor intervention (greatly reducing CPU workload)
  • VMware® NetQueue support
  • Dedicated per-VF statistics and interrupts
  • Support for function-level reset (FLR)
  • Fully integrated Layer 2 switching function

I removed some bullet points to shorten the entry a bit, and those were things I wasn't exactly sure what they did anyways! Anyone know what ARI means above?

Never used the product, but it is very nice to see such a product in the marketplace, to get Virtual Connect “like” functionality (at least as far as the virtual NICs go; I know there are a lot of other advantages to VC) in your regular rack mount systems from any vendor, and at least potentially connect to any 10GbE switch; as far as I can tell there are no special requirements for a specific type of switch.
