TechOpsGuys.com Diggin' technology every day

August 18, 2009

It’s not a bug, it’s a feature!

Filed under: Storage, Uncategorized, Virtualization — Nate @ 5:01 pm

I must be among a tiny minority of people who have automated database snapshots moving between systems on a SAN.

Earlier this year I set up an automated snapshot process to snapshot a production MySQL database and bring it over to QA. This runs every day, and runs fine as-is. There is another on-demand process to copy byte-for-byte the same production MySQL DB to another QA MySQL server (typically run once every month or two, and it runs fine too!).

I also set up a job to snapshot all of the production MySQL DBs (3 currently) and bring them to a dedicated “backup” VM, which then backs up the data and compresses it onto our NFS cluster. This runs every day, and runs fine as-is.
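
For anyone curious what such a job looks like, a stripped-down sketch of the daily prod-to-QA flow is below. The mysql statements are standard, but the array-side commands (snapshot_create, snapshot_delete, lun_export, lun_unexport) and all of the hostnames and volume names are placeholders for illustration; substitute whatever your SAN’s CLI actually provides.

#!/bin/bash
# Sketch of a daily "snapshot prod MySQL, present it to QA" job.
# Array commands and names below are placeholders, not any specific vendor's CLI.

# 1. Stop QA MySQL and unmount yesterday's snapshot.
ssh root@qa-db "service mysql stop && umount /var/lib/mysql"

# 2. Unpresent and delete yesterday's snapshot on the array.
ssh sanadmin@array "lun_unexport qa-mysql-snap qa-db"
ssh sanadmin@array "snapshot_delete qa-mysql-snap"

# 3. Briefly quiesce production MySQL and take a fresh snapshot while the
#    lock is held (the mysql client's "system" command runs a shell command
#    from within the same session, so the lock stays in place).
#    Credentials are assumed to come from ~/.my.cnf.
mysql -h prod-db <<'EOF'
FLUSH TABLES WITH READ LOCK;
system ssh sanadmin@array "snapshot_create prod-mysql-vol qa-mysql-snap"
UNLOCK TABLES;
EOF

# 4. Present the new snapshot to the QA host, rescan, mount, start MySQL.
ssh sanadmin@array "lun_export qa-mysql-snap qa-db"
ssh root@qa-db "rescan-scsi-bus.sh && mount /dev/mysql-snap /var/lib/mysql && service mysql start"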

ENTER VMWARE VSPHERE.

Apparently they introduced new “intelligence” into the vSphere storage system that tries to be smarter about which storage devices are present. This totally breaks these automated processes. Because the data on the LUN is different after I remove the LUN, delete the snapshot, create a new one, and re-present the LUN, vSphere says HEY THERE IS DIFFERENT DATA SO I’LL GIVE IT A UNIQUE UUID (never mind the fact that it is the SAME LUN). During that process the guest VM loses connectivity to the original storage (of course) and never regains it, because vSphere thinks the LUN is different and so doesn’t give the VM access to it. The only fix at that point is to power off the VM, delete all of the raw device maps, re-create all of the raw device maps and then power on the VM again. @#)!#$ No, you can’t gracefully halt the guest OS, because there are missing LUNs and the guest will hang on shutdown.

So I filed a ticket with VMware. The support team worked on it for a couple of weeks, escalating it everywhere, but as far as anyone could tell it’s “doing what it’s supposed to do”. They can’t imagine how this process works in ESX 3.5, except for the fact that ESX 3.5 was more “dumb” when it came to this sort of thing.

IT’S RAW FOR A REASON. DON’T TRY TO BE SMART WHEN IT COMES TO A RAW DEVICE MAP, THAT’S WHY IT’S RAW.

http://www.vmware.com/pdf/esx25_rawdevicemapping.pdf

With ESX Server 2.5, VMware is encouraging the use of raw device mapping in the following
situations:
• When SAN snapshot or other layered applications are run in the virtual machine. Raw
device mapping better enables scalable backup offloading systems using the features
inherent to the SAN.

[..]

HELLO ! SAN USER HERE TRYING TO OFFLOAD BACKUPS!

Anyway, there are a few workarounds for these processes going forward:
– Migrate these LUNs to software iSCSI instead of Fiber Channel; there is a performance hit (not sure how much)
– Keep one or more ESX 3.5 systems around for this type of work
– Use physical servers for things that need automated snapshots

The VMware support rep sounded about as frustrated with the situation as I was/am. He did appear to try his best, but this behavior by vSphere is just unacceptable. After all, it works flawlessly in ESX 3.5!

WAIT! This brokenness extends to NFS as well!

I filed another support request on a kinda-sorta-similar issue a couple of weeks ago regarding NFS datastores. Our NFS cluster operates with multiple IP addresses; many (all?) active-active NFS clusters have at least two IPs (one per controller). Once again vSphere assigns a unique ID based on the IP address, rather than the host name, to identify the NFS system. As a result, if I use the host name on multiple ESX servers it is pretty much guaranteed that I will not be able to migrate a VM that lives on NFS from one host to another, because vSphere identifies the volumes differently when they are accessed via different IPs. And if I try to rename the volume to match what is on the other system, it tells me there is already a volume with that name (when there is not), so I cannot rename it. The only workaround is to hard-code the IP on each host, which is not a good solution because you lose multi-node load balancing at that point. Fortunately I have a Fiber Channel SAN as well and have migrated all of my VMs off of NFS onto Fiber Channel, so this particular issue doesn’t impact me. But I wanted to illustrate that this same sort of behavior with UUIDs is not unique to SAN; it can easily affect NAS as well.
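
To make that workaround concrete, here is roughly how it looks from the ESX service console with esxcfg-nas; the hostnames, IPs and share paths are made up for illustration.

# What I want: mount by the cluster's host name on every ESX host. The name
# can resolve to either controller, so two hosts can end up identifying the
# same datastore with different UUIDs, and migration between them fails.
esxcfg-nas -a -o nas-cluster.example.com -s /vol/vmstore vmstore_nfs

# The workaround: every host mounts via the same hard-coded controller IP so
# the identities match; but now all NFS traffic hits one controller and the
# multi-node load balancing is gone.
esxcfg-nas -a -o 10.0.0.11 -s /vol/vmstore vmstore_nfs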

You may not be impacted by the NFS issue if your NFS system is unable to serve out the same file system from multiple controllers simultaneously. I believe most fall into this category, being limited to one file system per controller at any given point in time. Our NFS cluster does not have this limitation.

August 17, 2009

FCoE Hype

Filed under: Storage — Nate @ 6:41 pm

I feel like I’ve been bombarded by hype about FCoE (Fiber Channel over Ethernet) over the past five months or so, and wanted to rant a bit about it. I’ve been to several conferences and they all seem to hammer on it.

First a little background on what FCoE is and this whole converged networking stuff that some companies are pushing.

The idea behind it is to combine Fiber Channel and traditional Ethernet networking onto a single cable. So you have two 10 Gigabit connections coming out of your server that carry both your traditional FC traffic and your regular networking. The HBA presents itself to the server as independent FC and Ethernet connectivity. From a 10,000 foot view it sounds like a really cool thing to have, but then you get into the details.

They re-worked the foundations of Ethernet networking to be better suited for storage traffic, which is a good thing, but it simultaneously makes this new FCoE technology incompatible with all existing Ethernet switches. You don’t get a true “converged” network based on Ethernet, and in many cases you can’t even use the same cabling as you can for 10GbE. You cannot “route” your storage (FC) traffic across a traditional 10GbE switch despite it running over “Ethernet”.

The way it’s being pitched, for the most part, is as somewhat of an aggregation layer: you link your servers to an FCoE switch, and that switch splits the traffic out, uplinking 10GbE to upstream 10GbE switches and FC traffic to FC switches (or FC storage). So what are you left with?

  • You still need two separate networks – one for your regular Ethernet traffic, the other for the FCoE traffic
  • You still need to do things like zone your SAN, as the FCoE adapter presents itself as Fiber Channel HBAs
  • At least right now you pay quite a premium for the FCoE technology. From numbers I’ve seen (mostly list pricing on both sides), an FCoE solution can cost 2x more than a 10GbE + 8Gb Fiber Channel solution (never mind that the split solution, as an aggregate, can deliver much more performance).
  • With more and more people deploying blades these days, you’re really not cutting much of the cable clutter with FCoE, as your cables are already aggregated at the chassis level. I even saw one consultant who seemed to imply that some people use cables to connect their blades to their blade chassis, which was very confusing. Reduce your cable clutter! Cut your cables in half! Going from four, or even six, cables down to two really isn’t much to get excited about.

What would I like to see? Let the FCoE folks keep their stuff; if it makes them happy I’m happy for them. What I’d like to see as far as this converged networking goes is more 10GbE iSCSI converged HBAs. I see that Chelsio has one, for example, which combines 10GbE iSCSI offload and a 10GbE NIC in one package. I have no experience with their products so I don’t know how good it is or isn’t. I’m not personally aware of any storage arrays that have 10GbE iSCSI connectivity, though I haven’t checked recently. But what I’d like to see as an alternative is more focus on standardized Ethernet as a storage transport, rather than this incompatible stuff.

Ethernet switches are so incredibly fast these days, and cheap! Line-rate, non-blocking 1U 10GbE switches are dirt cheap, and many of them can even do 10GbE over regular old Cat 5E. Though I’m sure Cat 6A would provide better performance and/or latency. But the point I’m driving towards is not having to care what I’m plugging into; have it just work.

Maybe I’m just mad because I got somewhat excited about the concept of FCoE and feel totally let down by the details.

What I’d really like to see is an HP VirtualConnect 10GbE “converged” iSCSI+NIC. That’d just be cool. Toss onto that the ability to run a mix of jumbo and non-jumbo frames on the same NIC (different VLANs of course). Switches can do it, NICs should be able to do it too! I absolutely want jumbo frames on any storage network, but I probably do not want jumbo frames on my regular network for compatibility reasons.
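
For what it’s worth, the host side of that mix is already doable with plain VLAN tagging, at least on Linux. Here’s a minimal sketch; interface names, VLAN IDs and addresses are made up, and the parent NIC has to carry the largest MTU you use.

# Parent 10GbE NIC carries the largest frame size needed by any VLAN on it.
ip link set eth2 mtu 9000

# Storage VLAN (tagged 200) runs jumbo frames...
ip link add link eth2 name eth2.200 type vlan id 200
ip link set eth2.200 mtu 9000
ip addr add 10.1.200.10/24 dev eth2.200
ip link set eth2.200 up

# ...while the regular data VLAN (tagged 100) stays at 1500 for compatibility.
ip link add link eth2 name eth2.100 type vlan id 100
ip link set eth2.100 mtu 1500
ip addr add 10.1.100.10/24 dev eth2.100
ip link set eth2.100 up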

August 6, 2009

Spreading the Load

Filed under: Storage — @ 12:24 pm

I’m sure there are a number of articles out there on 3PAR’s Dynamic Optimization, but I thought it would be worth adding one more “holy cow this is easy!” post. My company just added 8 more drives to our 3PAR E200, bringing the total spindle count from 24 to 32. In the past, using another vendor’s SAN, taking advantage of the space on these new drives meant carving out a new LUN. If you wanted to use the space on all 32 drives collectively (in a single LUN, for example) it would mean copying all the data off, recreating your LUN(s) and copying the data back. Not with 3PAR. First of all, thanks to their “chunklet” technology, carving out LUNs is a thing of the past. You can create and delete multiple virtual LUNs (VLUNs) on the fly. I won’t go into the details of that here, but instead want to look at their Dynamic Optimization feature.

With Dynamic Optimization, after adding those 8 new drives I can then rebalance my VLUNs across all 32 drives – taking advantage of the extra spindles and increasing IOPS (and space). Now comes the part about it being easy. It is essentially 3 commands for a single volume – obviously the total number of commands will vary based on your volumes and common provisioning groups (cpgs).

createcpg -t r5 -ha mag -ssz 9 RAID5_SLOWEST_NEW
The previous command creates a new CPG that is spread out across all of the disks. You can do a lot with CPGs, but we use them in a pretty flat manner, just to define the RAID level and where the data resides on the platter (inside or outside). The “-t r5” flag defines the RAID type (RAID 5 in this case). The “-ha mag” flag defines the level of redundancy for this CPG (in this case, at the magazine level, which on the E200 equates to disk level). The “-ssz 9” flag defines the set size for the RAID level (in this case 8+1 – obviously a slower RAID layout but easy on the overhead). “RAID5_SLOWEST_NEW” is the name I’m assigning to the CPG.

tunevv usr_cpg RAID5_SLOWEST_NEW -f MY_VIRTVOL1
The “tunevv” command is used for each virtual volume I want to migrate to the newly created CPG in the previous command. It tells the SAN to move the virtual volume MY_VIRTVOL1 to the CPG RAID5_SLOWEST_NEW.

Then, once all of your volumes on a particular CPG are moved to a new CPG, the final command is to delete the old CPG and regain your space.
removecpg RAID5_SLOWEST_OLD

If you start running low on space before you get to the removecpg command (when you’re moving multiple volumes, for example), you can always issue a compactcpg command that will shrink your “old” CPG and free up some space until you finish moving your volumes. Or, if you’re not moving all the volumes off the old CPG, then be sure to issue a compactcpg when you’re finished to reclaim that space.
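
For example, with the CPG names used above, that mid-migration reclaim is a single command:

compactcpg RAID5_SLOWEST_OLD
# shrinks the old CPG and gives its unused space back to the system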

Dynamic Optimization can also be used to migrate a volume from one RAID level to another using commands similar to the ones above. At a previous company we moved a number of volumes from RAID 1 to RAID 5 because we needed the extra space that RAID 5 gives. And, due to the speed of the 3PAR SAN, we hardly noticed a performance hit! In that case, the entire DO operation was done by a 3PAR engineer from his BlackBerry while sitting at a restaurant in another state.

Oh, did I mention this is all done LIVE with zero downtime? In fact, I’m doing it on our production SAN right now in the middle of a weekday while the system is under load. There is a performance hit to the system in terms of disk I/O, but the system will throttle the CPG migration as needed to give priority to your applications/databases.

You can queue up multiple tunevv commands at the same time (I think 4 is the max) – each command kicks off a background task that you can check on with the showtask command.
showtask

Through this process I’ve created new CPGs that are configured the same as my old CPGs (in terms of RAID level and physical location on the platter), except that the new CPGs are spread across all 32 disks and not just my original 24. Then I moved my VLUNs from the old CPGs to the new CPGs. And finally, I deleted the old CPGs. Now all of my CPGs, and the VLUNs they contain, are spread across all 32 disks, thereby increasing the IOPS and space available to them.
createcpg -t r5 -ha mag -ssz 9 RAID5_SLOWEST_NEW
tunevv usr_cpg RAID5_SLOWEST_NEW -f MY_VIRTVOL1
# repeat as many times as needed for each virtual volume in that CPG
removecpg RAID5_SLOWEST_OLD
showtask
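
And if you want to sanity-check things along the way, the basic show commands (no flags needed) will let you watch the new CPG grow and the old one drain as the tasks run:

showcpg    # CPGs with their used and free space
showvv     # the virtual volumes and their current state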

August 5, 2009

FTP to your tape drive

Filed under: Storage — Nate @ 7:53 pm

Just got done with an evaluation of a new product on the market, the Cache-A Prime Cache tape drive. It is based on the same technology as the Quantum SuperLoader 3A. Quantum as a company hasn’t been doing too hot recently; I was told that they basically let go of the entire team responsible for developing the software for the SuperLoader 3A. Cache-A then went in and either bought or at least licensed the software to continue development at their company. Their first product, the Prime Cache, was released in late June 2009; it is a single LTO-4 tape drive hooked up to a small computer running Fedora 10. It has a fairly easy-to-use web-based UI.

You can use either FTP or CIFS to interface with the device and upload files. It stages the files on a local internal disk, which they call the VTAPE; once a file is uploaded to the share, the system automatically sends it to tape. Eventually it will support NFS as well. It does have the ability to mount remote NFS/CIFS shares and back them up directly, though there are some limitations in the current software release. I was unable to get it to see any files on our main NAS cluster, which runs Samba for CIFS, and was unable to mount NFS volumes: it currently depends on another software package (forgot the name) which broadcasts the available NFS exports to the network for the device to see, with no ability yet to manually input an NFS server/mount point.

I like the concept, because being able to use FTP or even smbclient on the command line to tie directly into a tape drive from backup scripts is real handy for me. On top of that, pretty much any system on the network being able to push files to the tape without going through special backup software has its appeal as well. Most of our data that needs to be backed up is spread out all over the place, and I have scripts that gather the paths and file names of the files that need backing up. Our MySQL database backups are also heavily scripted, involving snapshots from the SAN etc. So being able to put a few lines of code in a script to pass the files along to the tape is nice.
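
As a rough illustration of why this appeals to me, here is the kind of thing a backup script could do at the end of a run. The hostname, share name and credentials are made up, and the smbclient/ftp invocations are just the stock client tools, nothing Cache-A specific.

#!/bin/bash
# Push the night's backup artifacts to the tape appliance's CIFS/FTP share.
# "tape-appliance", "vtape" and the credentials are made up for illustration.

BACKUPS="/backups/mysql-prod1-$(date +%F).tar.gz /backups/app-configs-$(date +%F).tar.gz"

# Option 1: CIFS via smbclient, one "put" per file.
for f in $BACKUPS; do
    smbclient //tape-appliance/vtape -U backup%secret -c "put $f $(basename $f)"
done

# Option 2: plain FTP, driven non-interactively from the script.
ftp -n tape-appliance <<EOF
user backup secret
binary
put /backups/mysql-prod1-$(date +%F).tar.gz
bye
EOF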

The system is quite new, so it has some bugs, and some things aren’t implemented yet, like the ability to delete files directly from the tape or erase/format the tape without using the web UI; that is coming (along with an API), though there’s no ETA. The device retails for about $7k I believe, which is roughly half the cost of the SuperLoader 3A. Granted, this is just one tape drive, no autoloader yet. On the other hand it is LTO-4, and the SuperLoader 3A is LTO-3 (with no expectation of it ever getting to LTO-4).

I’ll certainly be following this product and company pretty closely in the future, as I really like the direction they are going. This is certainly a very innovative product; other than the SuperLoader I haven’t seen any other product like it on the market.

August 4, 2009

Will it hold?

Filed under: Monitoring, Storage — Nate @ 10:21 pm

I went through a pretty massive storage refresh earlier this year which cut our floorspace in half, power in half, disks in half, etc., and improved performance at the same time. It has exceeded my expectations. More recently, though, I have gotten worried about how far the cache+disks will scale before they run out of gas. I have plans to increase the disk count by 50% (from 200 to 300) at the end of the year, but will we last until then? My past (admittedly limited) storage experience says we should already be having lots of problems, but we are not. The system’s architecture and large caches are absorbing the hit; performance remains high and very responsive to the servers. How long will that hold up, though? There are thousands of metrics available to me, but the one metric that is not available is cache utilization. I can get hit ratios on tons of things, but no info on how full the cache is at any particular period of time (for either NAS or SAN).

To illustrate my point, here is a graphic from my in-house monitoring showing sustained spindle response times over 60 milliseconds:

Physical Disk response time

And yet on the front end, response times are typically 2 milliseconds:

Fiber channel response time to NAS cluster

There are spikes, of course. There is a known batch job that kicks off tons of parallel writes and blows out the cache on occasion, a big gripe I have with the developers of the app and their inability (so far) to throttle its behavior. I do hold my breath on occasion when I personally witness the caches (if you add up both NAS+SAN caches it’s about 70GB of mirrored memory) getting blown out. But as you can see, on both the read side and especially the write side, the advanced controllers are absorbing a huge hit. And the trend over the past few months has been a pretty steep climb upwards as more things run on the system. My hope is that things level off soon; that hasn’t happened yet.

The previous arrays I have used would not have been able to sustain this, by any stretch.

Will it hold?
