I’ve seen a couple of different articles from our friends at The Register on the launch of the HP IaaS cloud as a public beta. There really isn’t a whole lot of information yet, but one thing seems unfortunately clear: HP has embraced the same backwards thinking as Amazon when it comes to provisioning, going against the knowledge and experience we’ve all gained in the past decade around sharing resources and oversubscription.
Yes, it seems they are going to have fixed instance sizes and no apparent support for resource pools. This is especially depressing coming from someone like HP, who has technology like thin provisioning and partnerships with all of the major hypervisor players.
Is the software technology at the hypervisor just not there yet to provide such a service? vSphere 5, for example, supports 1600 resource pools per cluster. I don’t like the licensing model of 5, so I built my latest cluster on 4.1, which supports 512 resource pools per cluster. Not a whole lot in either case, but then again cluster sizes are fairly limited anyway.
There’s no doubt that, gigabyte for gigabyte, DAS is cheaper than something like a 3PAR V800. But with fixed allocation sizes from the likes of Amazon, it’s not uncommon to have disk utilization rates hovering in the low single digits. I’ve seen it at two different companies, and guess what: everyone else on the teams (all of whom have had more Amazon experience than me) was just as unsurprised as I was.
So you take this cheap DAS and you apply a 4 or 5% utilization rate to it, and all of a sudden it’s not so cheap anymore, is it? Why is utilization so low? Well in Amazon (since I haven’t used HP’s cloud), it’s primarily low because that DAS is not protected; if the server goes down or the VM dies, the storage is gone. So people use other methods to protect their more important data. You can have the OS and log files and stuff on there, no big deal if that goes away, but again, you’re talking about maybe 3-5GB of data (which is typical for me at least). Then the rest of the disk goes unused.
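To put rough numbers on that, here is a minimal sketch of the arithmetic: divide the raw price per gigabyte by the fraction of the disk that actually holds data. The $0.10/GB price is a made-up placeholder, not a quote from any vendor. The 5% figure is the utilization described above, and the 20% figure is the "worst case on older systems" number from this post.

```python
def effective_cost_per_used_gb(raw_cost_per_gb: float, utilization: float) -> float:
    """Cost of each gigabyte that actually holds data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return raw_cost_per_gb / utilization


def break_even_price_ratio(low_util: float, high_util: float) -> float:
    """How many times cheaper (raw $/GB) the poorly utilized storage must be
    just to match the better utilized storage on cost per used gigabyte."""
    if not 0 < low_util <= high_util <= 1:
        raise ValueError("expected 0 < low_util <= high_util <= 1")
    return high_util / low_util


if __name__ == "__main__":
    das_raw = 0.10  # hypothetical raw $/GB for cheap DAS, illustration only
    print(f"DAS at 5% utilization: ${effective_cost_per_used_gb(das_raw, 0.05):.2f} per used GB")
    print(f"Break-even vs. 20% utilization: {break_even_price_ratio(0.05, 0.20):.0f}x cheaper raw")
```

In other words, at 5% utilization every used gigabyte costs 20 times the sticker price, and the DAS has to be four times cheaper per raw gigabyte just to tie with storage that is only 20% utilized.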
Go to the most inefficient storage company in the world and even they will drool at the prospect of replacing storage that you’re only getting 5% utilization out of! Because really, even the worst efficiency is maybe 20% on older systems w/o thin provisioning.
Even if the storage IS fully protected, the fixed allocation units are still way out of whack and they can’t be shared! I may need a decent amount of CPU horsepower and/or (more likely) memory to run a bloated application, but I don’t need several hundred gigabytes of storage attached to each system when 20GB would be MORE than enough (my average OS+App installation comes in at under 3GB and that’s with a lot of stuff installed)! I’d rather take that several hundred gigabytes, both in terms of raw space and IOPS, and give them to database servers or something like that (in theory at least; the underlying storage in this case is poor so I wouldn’t want to use it for that anyway).
This is what 3PAR was built to solve: drive utilization (way) up, while simultaneously providing the high availability and software features of a modern storage platform. Others do the same too of course, with varying degrees of efficiency.
So that’s storage. Next take CPU. The industry average pre-virtualization was in the sub 20% utilization range, and my own personal experience says it’s in the sub 10% range for the most part. There was a quote from a government official a couple of years back that talked about how their data centers were averaging about 7% utilization. I’ve done a few virtualization projects over the years and my experience shows me that even after systems have been virtualized, the VMware hosts themselves are at low utilization from a CPU perspective.
I documented two projects in particular while I was at a company a few years back, the most extreme perhaps being roughly 120 VMs on 5 servers, four of them being HP DL585 G1s, which were released in 2005. They had 64GB of memory in them but they were old boxes. I calculated that the newer Opteron 6100, when it was released, had literally 12 times the CPU power (according to SPEC numbers at least) of the Opteron 880s that we had at the time. Anyways, even with these really old servers the cluster averaged under 40% CPU, with peaks to maybe 50 or 60%. Memory usage was pretty constant at around 70-75%. Imagine translating that workload from those ancient servers onto something more modern and you’d likely see CPU usage rates drop to single digits while memory usage remains constant.
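The "single digits" estimate can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the same number of hosts, that CPU demand stays fixed, and that the 12x SPEC-based speedup applies linearly across the whole cluster (only four of the five servers were DL585 G1s, so even that is an approximation).

```python
def projected_cpu_utilization(current_util: float, speedup: float) -> float:
    """Estimate CPU utilization after moving a fixed workload to hosts that
    are `speedup` times faster, assuming perfectly linear scaling."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return current_util / speedup


if __name__ == "__main__":
    speedup = 12.0  # Opteron 6100 vs. Opteron 880, per the SPEC comparison above
    for label, util in (("average", 0.40), ("peak", 0.60)):
        projected = projected_cpu_utilization(util, speedup)
        print(f"{label}: {util:.0%} on the old hosts -> about {projected:.1%} on the new ones")
```

So a 40% average lands at roughly 3% and a 60% peak at about 5%, while the 70-75% memory figure does not move at all, because faster CPUs do nothing to shrink the working set.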
I have no doubt that the likes of HP and Amazon are building their cloud to specifically not oversubscribe – to assume that people will utilize all of the CPU allocated to them as well as memory and disk space. So they have fixed building blocks to deal with and they carve them up accordingly.
The major fault with the design, of course, is that the vast majority of workloads do not fit in such building blocks and will never come close to utilizing all of the resources that are provided, thus wasting an enormous amount of resources in the environment. What’s Amazon’s solution to this? Build your apps to better utilize what they provide. Basically, work around their limitations. Naturally, most people don’t do that, so resources end up being wasted on a massive scale.
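A small sketch of how that waste comes about with fixed building blocks: pick the smallest instance that satisfies the workload on every dimension, then look at what is left stranded. The instance sizes below are invented for illustration and are not Amazon’s or HP’s actual offerings; the workload is the memory-heavy, disk-light kind described earlier in this post.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    cpu_ghz: float
    ram_gb: float
    disk_gb: float


# Hypothetical fixed instance sizes, smallest first (not any provider's real menu).
INSTANCE_SIZES = [
    Shape(2, 4, 100),
    Shape(4, 8, 200),
    Shape(8, 16, 400),
    Shape(16, 32, 800),
]


def smallest_fit(need: Shape) -> Shape:
    """Return the smallest fixed size that covers the need on every dimension."""
    for size in INSTANCE_SIZES:
        if all(have >= want for have, want in zip(size, need)):
            return size
    raise ValueError("no instance size is large enough")


def stranded(need: Shape) -> Shape:
    """Resources paid for but left unused once the workload is placed."""
    size = smallest_fit(need)
    return Shape(*(have - want for have, want in zip(size, need)))


if __name__ == "__main__":
    bloated_app = Shape(cpu_ghz=1, ram_gb=14, disk_gb=20)
    print("placed on:", smallest_fit(bloated_app))
    print("stranded: ", stranded(bloated_app))
```

Because the memory requirement alone forces the third size up, the CPU and disk that come bundled with it sit idle and cannot be handed to another VM. With a resource pool, the same leftovers would simply remain available to whatever else runs in the pool.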
I’ve worked for really nothing but software development companies for almost 10 years now and I have never really seen even one company or group or developer ever EVER design/build for the hardware. I have been part of teams that have tried to benchmark applications and buy the right-sized hardware, but it really never works out in the end because a simple software change can throw all those benchmarks and testing out the window overnight (not to mention how traditionally difficult it is to replicate real traffic in a test environment; I’ve yet to see it done right myself for any even moderately complex application). The far easier solution to take is, of course, resource pools and variably allocated resources.
Similarly, this model, along with the per-VM licensing model of so many different products out there, goes against the trend that has given us more VM sprawl, I guess. Instead of running a single server with a half dozen different apps, it’s become good practice to split those apps up. This fixed allocation unit of the cloud discourages such behavior by dramatically increasing the cost of doing it. You still incur additional costs by doing it on your own gear: memory overhead for multiple copies of the OS (assuming that memory de-dupe doesn’t work, which for me on Linux it doesn’t), or disk overhead (assuming your array doesn’t de-dupe, which 3PAR doesn’t, but the overhead is so trivial here that it is a rounding error). But those incremental costs pale in comparison to the massive increases in cost in the cloud, again because of those fixed allocation units.
I have seen no mention of it yet, but I hope HP has at least integrated the ability to do live migration of VMs between servers. The hypervisor they are using supports it of course, but I haven’t seen any details yet from people using the service as to how it operates.
I can certainly see a need for cheap VMs on throwaway hardware. I see an even bigger need for the more traditional customers (who make up the vast, vast majority of the market) to have this model of resource pools instead. If HP were to provide both services, plus a unified management UI, that really would be pretty nice to see.
The concept is not complicated, and is so obvious it dumbfounds me why more folks aren’t doing it (my only thought is that perhaps the technology these folks are using isn’t capable). IaaS won’t be worthwhile to use, in my opinion, until we have that sort of system in place.
HP is obviously in a good position when it comes to providing 3PAR technology as a cloud. Since they own the thing, their support costs would be a fraction of what their customers pay and they would be able to consume unlimited software for nothing. Software typically makes up at least half the cost of a 3PAR system (the SPC-1 results and costs of course only show the bare minimum software required). Their hardware costs would be significantly less as well since they would not need much (any?) margin on it.
I remember SAVVIS a few years ago wanting to charge me ~$200,000/year for 20TB usable on 10k RPM storage on a 3PAR array, when I could have bought 20TB usable on 15k RPM storage on a 3PAR array (+ hosting costs) for less than one year’s costs at SAVVIS. I heard similar stories from 3PAR folks, where customers would go out to the cloud to get pricing thinking it might be cheaper than doing it in house, but always came back being able to show massive cost savings by keeping things in house.
They are also in a good position as a large server manufacturer to get amazing discounts on all of their stuff, and again of course don’t have to make as much margin for these purposes (I imagine at least). Of course it’s a double-edged sword, pissing off current and potential customers that may use your equipment to try to compete in that same space.
I still have hope that, given HP’s strong presence in the enterprise, their in-house technology and their technology partners, they will at some point offer an enterprise-grade cloud: something where I can allocate a set amount of CPU and memory, maybe even get access to a 3PAR array using their Virtual Domain software, and then provision whatever I want within those resources. Billing would be based on some combination of a fixed price for base services and a variable price based on actual utilization (bill for what you use rather than what you provision), with perhaps some minimum usage thresholds (because someone has to buy the infrastructure to run the thing). So say I want a resource pool with 1TB of RAM and 500GHz of CPU. Maybe I am forced to pay for 200GB of RAM and 50GHz of CPU as a baseline, then anything above that is measured and billed accordingly.
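That billing idea can be sketched in a few lines. The pool and baseline figures (1TB/200GB of RAM, 500GHz/50GHz of CPU) come from the example above; the per-unit rates and the base fee are pure placeholders, and the class and function names are hypothetical rather than anything HP offers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    pool_limit: float  # most the pool can ever consume
    baseline: float    # minimum billed, even if usage is lower
    rate: float        # hypothetical price per unit per billing period


def billable_units(res: Resource, measured_usage: float) -> float:
    """Bill for measured usage, but never below the baseline or above the pool."""
    capped = min(measured_usage, res.pool_limit)
    return max(capped, res.baseline)


def bill(resources: list[Resource], usage: dict[str, float], base_fee: float) -> float:
    """Fixed base fee plus metered charges for every resource in the pool."""
    total = base_fee
    for res in resources:
        total += billable_units(res, usage.get(res.name, 0.0)) * res.rate
    return total


if __name__ == "__main__":
    pool = [
        Resource("ram_gb", pool_limit=1000, baseline=200, rate=5.0),   # rates are made up
        Resource("cpu_ghz", pool_limit=500, baseline=50, rate=10.0),
    ]
    # RAM usage is under the baseline, CPU usage is over it.
    print(bill(pool, {"ram_gb": 120, "cpu_ghz": 80}, base_fee=1000.0))
```

With these made-up rates, RAM is billed at the 200GB floor even though only 120GB was used, while CPU is billed at the measured 80GHz. The provider gets a guaranteed minimum to cover the infrastructure and the customer pays for what was used rather than for the whole 1TB/500GHz pool.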
Don’t let me down HP.