TechOpsGuys.com Diggin' technology every day

November 6, 2009

Thin Provisioning strategy with VMware

Filed under: Storage,Virtualization — Tags: , , , — Nate @ 3:44 pm

Since the announcement of thin provisioning built into vSphere I have seen quite a few blog posts on how to take advantage of it but haven’t seen anything that matches my strategy which has served me well utilizing array-based thin provisioning technology. I think it’s pretty foolproof..

The man caveat is that I assume you have a decent amount of storage available on your system, that is your VMFS volumes aren’t the only thing residing on your storage. On my current storage array,written VMFS data accounts for maybe 2-3 % of my storage. On the storage array I had at my last company it was probably 10-15%. I don’t believe in dedicated storage arrays myself. I prefer nice shared storage systems that can sustain random and sequential I/O from any number of hosts and distributed that I/O across all of the resources for maximum efficiency.  So my current array has most of it’s space set aside for a NFS cluster, and then there is a couple dozen terabytes set aside for SQL servers and VMware. The main key is being able to share the same spindles across dozens or even hundreds of LUNs.

There has been a lot of debate over the recent years about how best to size your VMFS volumes. The most recent data I have seen suggests somewhere between 250GB and 500GB. There seems to be unanimous opinion out there not to do something crazy and use 2TB volumes. The exact size depends on your setup. How many VMs, how many hosts, how often you use snapshots, how often you do vMotion, as well as the amount of I/O that goes on. The less of all of those the larger the volume can potentially be.

My method is so simple. I chose 1TB as my volume sizes, thin provisioned of course.  I utilize the default lazy zero VMFS mode and do not explicitly turn on thin provisioning on any VMDK files. There’s no real point if you already have it in the array. So I create 1TB volumes, and I begin creating VMs on them. I try to stop when I get to around 500GB of allocated(but not written) space. That is VMware thinks it is using 500GB, but it may only be using 30GB. This way I know, the system will never use more than 500GB. Pretty simple. Of course I have enough space in reserve that if something crazy were to happen the volume could grow to 500GB and not cause any problems. Even with my current storage array operating in the neighborhood of 89% of total capacity, that still leaves me with several terabytes of space I can use in an emergency.

If I so desire I can go beyond the 500GB at any time without an issue. If I chose not to then I haven’t wasted any space because nothing is written to those blocks. My thin provisioning system is licensed based on written data, so if I have 10TB of thin provisioning on my system I can, if I want create 100TB of thin provisioned volumes, provided I don’t write more than 10TB to them. So you see there really is no loss in making a larger volume when the data is thin provisioned on the array. Why not make it 2TB or even bigger? Well really I can’t see a time when I would EVER want a 2TB VMFS volume which is why I picked 1TB.

I took the time in my early days working with thin provisioning to learn the growth trends of various applications and how best to utilize them to get maximum gain out of thin provisioning.  With VMs that means having a small dedicated disk for OS and swap, and any data resides on other VMDKs or preferably on a NAS or for databases on raw devices(for snapshot purposes). Given that core OSs don’t grow much there isn’t much space needed(I default to 8GB) for the OS, and I give the OS a 1GB swap partition.  For additional VMDKs or raw devices I always use LVM. I use it to assist me in automatically detecting what devices a particular volume are on, I use it for naming purposes, and I use it to forcefully contain growth. Some applications are not thin provisioning friendly but I’d like to be able to expand the volume on demand without an outage. Online LVM resize and file system resize allows this without touching the array. It really doesn’t take much work.

On my systems I don’t really do vMotion(not licensed), I very rarely use VMFS snapshots(few times a year), the I/O on my VMFS volumes is tiny despite having 300+ VMs running on them. So in theory I probably could get away with 1TB or even 2TB VMFS volume sizes, but why lock myself into that if I don’t have to? So I don’t.

I also use dedicated swap VMFS volumes so I can monitor the amount of I/O going on with swap from an array perspective. Currently I have 21 VMware hosts connected to our array totalling 168 CPU cores, and 795GB of memory. Working to retire our main production VMware hosts, many of which are several years old(re-purposed from other applications). Now that I’ve proven how well it can work on existing hardware and the low cost version the company is ready to gear up a bit more and commit more resources to a more formalized deployment utilizing the latest hardware and software technology. You won’t catch me using the enterprise plus or even the enterprise version of VMware though, cost/ benefit isn’t there.

November 3, 2009

The new Cisco/EMC/Vmware alliance – the vBlock

Filed under: Storage,Virtualization — Tags: , , , , , , , — Nate @ 6:04 pm

Details were released a short time ago thanks to The Register on the vBlock systems coming from the new alliance of Cisco and EMC, who dragged along Vmware(kicking and screaming I’m sure). The basic gist of it is to be able to order a vBlock and have it be a completely integrated set of infrastructure ready to go, servers and networking from Cisco, storage from EMC, and Hypervisor from VMware.

vBlock0 consists of rack mount servers from Cisco, and unknown EMC storage, price not determined yet

vBlock1 consists 16-32 blade servers from Cisco and EMC CX4-480 storage system. Price ranges from $1M – 2.8M

vBlock2 consists of 32-64 blade servers from Cisco and an EMC V-MAX. Starting price $6M.

Sort of like FCoE, sounds nice in concept but the details fall flat on their face.

First off is the lack of choice. That is Cisco’s blades are based entirely on the Xeon 5500s, which are, you guessed it limited to two sockets. And at least at the moment limited to four cores. I haven’t seen word yet on compatibility with the upcoming 8-core cpus if they are socket/chip set compatible with existing systems or not(if so, wonderful for them..). Myself I prefer more raw cores, and AMD is the one that has them today(Istanbul with 6 cores, Q1 2010 with 12 cores). But maybe not everyone wants that so it’s nice to have choice. In my view HP blades win out here for having the broadest selection of offerings from both Intel and AMD. Combine that with their dense memory capacity(16 or 18 DIMM slots on a half height blade), allows you up to 1TB of memory in a blade chassis in an afforadable confiugration using 4GB DIMMs. Yes Cisco has their memory extender technology but again IMO at least with a dual socket Xeon 5500 that it is linked to the CPU core:memory density is way outta whack. It may make more sense when we have 16, 24, or even 32 cores on a system using this technology. I’m sure there are niche applications that can take advantage of it on a dual socket/quad core configuration, but the current Xeon 5500 is really holding them back with this technology.

Networking, it’s all FCoE based, I’ve already written a blog entry on that, you can read about my thoughts on FCoE here.

Storage, you can see how even with the V-MAX EMC hasn’t been able to come up with a storage system that can start on the smaller end of the scale, something that is not insanely unaffordable to 90%+ of the organizations out there. So on the more affordable end they offer you a CX4. If you are an organization that is growing you may find yourself outliving this array pretty quickly. You can add another vBlock, or you can rip and replace it with a V-MAX which will scale much better, but of course the entry level pricing for such a system makes it unsuitable for almost everyone to try to start out with even on the low end.

I am biased towards 3PAR of course as both of the readers of the blog know, so do yourself a favor and check out their F and T series systems, if you really think you want to scale high go for a 2-node T800, the price isn’t that huge, the only difference between a T400 and a T800 is the backplane. They use “blocks” to some extent, blocks being controllers(in pairs, up to four pairs), disk chassis(40 disks per chassis, up to 8 per controller pair I think). Certainly you can’t go on forever, or can you? If you don’t imagine you will scale to really massive levels go for a T400 or even a F400.  In all cases you can start out with only two controllers the additional cost to give you the option of an online upgrade to four controllers is really trivial, and offers nice peace of mind. You can even go from a T400 to a T800 if you wanted, just need to switch out the back plane (downtime involved). The parts are the same! the OS is the same! How much does it cost? Not as much as you would expect. When 3PAR announced their first generation 8-node system 7 years ago, entry level price started at $100k. You also get nice things like their thin built in technology which will allow you to run those eager zeroed VMs for fault tolerance and not consume any disk space or I/O for the zeros. You can also get multi level synchronous/asynchronous replication for a fraction of the cost of others. I could go on all day but you get the idea. There are so many fiber ports on the 3PAR arrays that you don’t need a big SAN infrastructure just hook your blade enclosures directly to the array.

And as for networking hook your 10GbE Virtual Connect switches on your c Class enclosures to your existing infrastructure. I am hoping/expecting HP to support 10GbaseT soon, and drop the CX4 passive copper cabling. The Extreme Networks Summit X650 stands alone as the best 1U 10GbE (10GbaseT or SFP+) switch on the market. Whether it is line rate, or full layer 3, or high speed stacking, or lower power consuming 10GbaseT vs fiber optics,  or advanced layer 3 networking protocols to simplify management,  price and ease of use — nobody else comes close. If you want bigger check out the Black Diamond 8900 series.

Second you can see with their designs that after the first block or two the whole idea of a vBlock sort of falls apart. That is pretty quickly your likely to just be adding more blades(especially if you have a V-MAX), rather than adding more storage and more blades.

Third you get the sense that these aren’t really blocks at all. The first tier is composed of rack mount systems, the second tier is blade systems with CX4, the third tier is blade systems with V-MAX. Each tier has something unique which hardly makes it a solution you can build as a “block” as you might expect from something called a vBlock. Given the prices here I am honestly shocked that the first tier is using rack mount systems. Blade chassis do not cost much, I would of expected them to simply use a blade chassis with just one or two blades in it. Really shows that they didn’t spend much time thinking about this.

I suppose if you treated these as blocks in their strictest sense and said yes we won’t add more than 64 blades to a V-MAX, and add it like that you could get true blocks, but I can imagine the amount of waste doing something like that is astronomical.

I didn’t touch on Vmware at all, I think their solution is solid, and they have quite a bit of choices. I’m certain with this vBlock they will pimp the enterprise plus version of software, but I really don’t see a big advantage of that version with such a small number of physical systems(a good chunk of the reason to go to that is improved management with things like host profiles and distributed switches). As another blogger recently noted, Vmware has everything to lose out of this alliance, I’m sure they have been fighting hard to maintain their independence and openness, this reeks of the opposite, they will have to stay on their toes for a while when dealing with their other partners like HP, IBM, NetApp, and others..

August 4, 2009

Will it hold?

Filed under: Monitoring,Storage — Tags: , , — Nate @ 10:21 pm

I went through a pretty massive storage refresh earlier this year which cut our floorspace in half, power in half, disks in half etc. Also improved performance at the same time. It’s exceeded my expectations, more recently though I have gotten worried as far as how far will the cache+disks scale to before they run out of gas. I have plans to increase the disk count by 150% (from 200 to 300) at the end of the year, but will we last until then? My  past(admittedly limited) storage experience  says we should already be having lots of problems but we are not. The system’s architecture and large caches are absorbing the hit, the performance remains high and very responsive to the servers. How long will that hold up though?  There are thousands of metrics available to me but the one metric that is not available is cache utilization, I can get hit ratios on tons of things, but no info on how full the cache is at any particular period of time(for either NAS or SAN).

To illustrate my point, here is a graphic from my in-house monitoring showing sustained spindle response times over 60 milliseconds:

Physical Disk response time

Physical Disk response time

And yet on the front end, response times are typically 2 milliseconds:

Fiber channel response time to NAS cluster

Fiber channel response time to NAS cluster

There are spikes of course, there is a known batch job that kicks off tons of parallel writes which blows out the cache on occasion, a big gripe I have with the developers of the app and their inability to(so far) throttle their behavior. I do hold my breath on occasion when I personally witness the caches(if you add up both NAS+SAN caches it’s about 70GB of mirrored memory) getting blown out. But as you can see both on the read and especially write side the advanced controllers are absorbing a huge hit. And the trend over the past few months has been a pretty steep climb upwards as more things run on the system. My hope is things level off soon, that hasn’t happened yet.

The previous arrays I have used would not of been able to sustain this, by any stretch.

Will it hold?

« Newer Posts

Powered by WordPress