TechOpsGuys.com Diggin' technology every day

August 13, 2010

Do you really need RAID 6

Filed under: Storage — Nate @ 11:34 pm

I’ve touched on this topic before but I don’t think I’ve ever done a dedicated entry on it. I came across a blog post from Marc Farley that got me thinking about it again. He talks about a leaked document from EMC trying to educate their sales force on how to fight 3PAR in the field. One of the issues raised is 3PAR’s lack of RAID 6 (never mind the fact that this is no longer true; 2.3.1 introduced RAID 6 (aka RAID DP) in early January 2010).

RAID 6, from 3PAR’s perspective, was for the most part just a check box: there are customers out there with hard requirements who disregard the underlying technology and won’t even entertain a product unless it meets their own criteria.

What 3PAR did in their early days was really pretty cool: the way they virtualize the disks in the system distributes the RAID across many, many disks. On larger arrays you can have well over 100,000 RAID arrays on the system. This provides a few advantages (a quick simulation sketch follows the list):

  • Evenly distributes I/O across every available spindle
  • Parity is distributed across every available spindle – no dedicated parity disks
  • No dedicated hot spare spindles
  • Provides a many:many relationship for RAID rebuilds
    • Which gives the benefit of near zero impact to system performance while the RAID is rebuilt
    • Also increases rebuild performance by orders of magnitude (depending on # of disks)
  • Only data that has been written to the disk is rebuilt
  • Since there are no spare spindles, only spare “space” on each disk, the system handles multiple disk failures gracefully. Say you have 10 disks fail over a period of a month and for whatever reason you do not have the drives replaced right away: the system will automatically allocate more “spare” space as long as there is available space on the system. On a traditional array you may find yourself low on, or even out of, hot spares after multiple disks fail, which makes you much more nervous and anxious to replace those disks than you would be on a 3PAR system (or a similarly designed system).
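
To illustrate the many:many rebuild point, here is a quick, hypothetical simulation: scatter chunklets across a pool of disks at random (a stand-in for what a chunklet-based array does, not 3PAR’s actual placement logic), “fail” one disk, and count how many other disks hold pieces of the degraded RAID sets and therefore participate in the rebuild. The pool size, chunklet counts and set size are made-up numbers.

#!/bin/bash
# Hypothetical sketch of the many:many rebuild relationship: scatter chunklets
# across a pool of disks at random (a stand-in for chunklet placement, not
# 3PAR's real algorithm, which also avoids putting set members on the same
# disk or cage), "fail" disk 0, and count how many other disks hold pieces of
# the degraded RAID sets and therefore help with the rebuild.
# Pool size, chunklets per disk and set size are made-up numbers.

DISKS=40
CHUNKLETS_PER_DISK=100
SETSIZE=8                      # e.g. a 7+1 RAID 5 set

declare -A PEERS               # disks sharing a RAID set with disk 0

SETS=$(( DISKS * CHUNKLETS_PER_DISK / SETSIZE ))

for ((s=0; s<SETS; s++)); do
    members=""
    degraded=0
    for ((m=0; m<SETSIZE; m++)); do
        d=$(( RANDOM % DISKS ))              # random placement, close enough here
        members="$members $d"
        if [ "$d" -eq 0 ]; then degraded=1; fi
    done
    if [ "$degraded" -eq 1 ]; then           # this set lost a member
        for d in $members; do
            if [ "$d" -ne 0 ]; then PEERS[$d]=1; fi
        done
    fi
done

echo "Disks helping rebuild disk 0: ${#PEERS[@]} of $(( DISKS - 1 )) survivors"

Run it a few times and just about every surviving disk shows up as a rebuild participant, which is why the per-disk rebuild load stays tiny and the rebuild finishes quickly.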

So do you need RAID 6?

To my knowledge the first person to raise this question was Robin from Storage Mojo, who a bit over three years ago wrote a blog post arguing that RAID 5 would lose its usefulness in 2009. I have been following Robin for a few years (online anyway); he seems like a really smart guy and I won’t try to dispute the math. I can certainly see how traditional RAID arrays with large SATA disks running RAID 5 are in quite a pickle, especially if there is a large data:parity ratio.

In the same article he speculates on when RAID 6 will become as “useless” as RAID 5.

I think what it all really comes down to is a couple of things:

  • How fast can your storage system rebuild from a failed disk?
    • For distributed RAID this is determined by the number of disks participating in the RAID arrays and the amount of load on the system, because when a disk fails it isn’t one RAID array that goes into degraded mode but potentially hundreds of them, which triggers all of the remaining disks to help in the rebuild.
    • For 3PAR systems at least, it is also determined by how much data has actually been written to the disk.
  • What is the likelihood that a 2nd disk (in the case of RAID 5) or two more disks (RAID 6) will fail during this time?

3PAR is not alone with distributed RAID. As I have mentioned before, others that I know of with similar technology include at least Compellent, Xiotech and IBM XIV. I bet there are others as well.

From what I understand of Xiotech’s technology, I don’t *think* RAID arrays can span their ISE enclosures; I believe they are limited to a single enclosure (by contrast, I believe a LUN can span enclosures). So, for example, if there are 30 disks in the enclosure and a disk fails, the maximum number of disks that can participate in the rebuild is 30. Though in reality I think the number is lower given how they build RAID at the level of disk heads; the number of individual RAID arrays is far fewer compared to 3PAR’s chunklet-based RAID.

I’ve never managed to get in-depth info on Compellent’s or IBM XIV’s designs with regard to how their RAID arrays are constructed, though I haven’t tried any harder than looking at what is publicly available on their web sites.

Distributed RAID really changes the game in my opinion as far as RAID 5’s effective life span (same goes for RAID 6 of course).

Robin posted a more recent entry several months ago about the effectiveness of RAID 6, and besides one of the responders being me, another person replied with a story that made me both laugh and feel sorry for the guy: a horrific experience with RAID 6 on Solaris ZFS with Sun hardware –

Depending on your Recovery Time Objectives, RAID6 and other dual-parity schemes (e.g. ZFS RAIDZ2) are dead today. We know from hard experience.

Try 3 weeks to recover from a dual-drive failure on 8x 500GB ZFS RAIDZ2 array.

It goes like this:
– 2 drives fail
– Swap 2 drives (no hot spares on this array), start rebuild
– Rebuild-while-operating took over one week. How much longer, we don’t know because …
– 2 more drives failed 1 week into the rebuild.
– Start restore from several week old LTO-4 backup tapes. The tapes recorded during rebuild were all corrupted.
– One week later, tape restore is finished.
– Total downtime, including weekends and holidays – about 3 weeks (we’re not a 24xforever shop).

Shipped chassis and drives back to vendor – No Trouble Found!

For any system that takes longer than, say, 48 hours to rebuild, you probably do want that extra level of protection, whether it is dual parity or maybe even triple parity (something I believe ZFS offers now?).

Add to that disk enclosure/chassis/cage (3PAR term) availability, which means you can lose an entire shelf of disks without disruption; on their S/T class systems 40 disks can go down and you’re still OK (protection against a shelf failing is the default configuration and is handled automatically; it can be disabled at the user’s request since it does limit your RAID options based on the number of shelves you have). So not only do you need to suffer a double disk failure, but that 2nd failure has to:

  • Be in a DIFFERENT drive chassis than the original disk failure
  • Hit a disk that holds portions of the RAID set(s) that were also located on the original disk that failed

But if you can recover from a disk failure in, say, 4 hours, even on a 2TB disk with RAID 5, do you really need RAID 6? I don’t know exactly what the math might look like, but I would be willing to bet that a system that takes 3 days to rebuild a RAID 6 volume has about as much chance of suffering a triple disk failure as a system that takes 4 hours (or less) to rebuild a RAID 5 array has of suffering a double disk failure.
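
Since I just admitted I don’t know the math, here is one rough, hedged way to sketch it, using a simple Poisson approximation: count any additional disk failure during the rebuild window as fatal (which actually overstates the risk for distributed RAID, since that second disk also has to be in a different cage and overlap the degraded RAID sets). The MTBF, disk count and rebuild times are assumptions I picked for illustration, not vendor figures.

#!/bin/bash
# Back-of-envelope only: compare the exposure of a fast RAID 5 rebuild with a
# slow RAID 6 rebuild using a simple Poisson approximation.  Any additional
# disk failure during the rebuild window is counted as fatal, which overstates
# the distributed-RAID case (see the cage/overlap conditions above).
# MTBF, disk count and rebuild times are assumptions, not vendor figures.

MTBF_HOURS=500000   # assumed per-disk MTBF

exposure() {
    local disks="$1" rebuild_hours="$2" more="$3" label="$4"
    awk -v n="$disks" -v t="$rebuild_hours" -v k="$more" \
        -v mtbf="$MTBF_HOURS" -v label="$label" 'BEGIN {
        lambda = n * t / mtbf                  # expected failures in the window
        term = exp(-lambda); cum = 0           # Poisson tail: P(at least k more)
        for (i = 0; i < k; i++) { cum += term; term *= lambda / (i + 1) }
        printf "%-24s expected failures=%.4f  P(>=%d more)=%.2e\n", label, lambda, k, 1 - cum
    }'
}

exposure 200 4  1 "RAID 5, 4 hour rebuild"    # one more failure loses data
exposure 200 72 2 "RAID 6, 3 day rebuild"     # two more failures lose data

With these made-up inputs the two probabilities land within a factor of a few of each other, which is really the point: the length of the exposure window drives the risk at least as much as the number of parity disks does.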

Think about the probability of the two bullet points above on how a 2nd drive must fail in order to cause data loss, combine that with the fast rebuild of distributed RAID, and consider whether or not you really need RAID 6. Do you want to take the I/O hit? Sure, it is an easy extra layer of protection, but you might be protecting yourself against something about as likely to happen as a natural disaster taking out your data center.

I mentioned to my 3PAR rep a couple of weeks ago that, in theory, RAID 6 combined with “cage level availability” has the potential to protect against two shelves of disks failing (so you could lose up to 80 disks on the big arrays) without impact. I don’t know if 3PAR went this far when engineering their RAID 6; I’ve never seen it mentioned, so I suspect not, but I don’t think there is anything that would stop them from offering this level of protection, at least with RAID 6 in a 6+2 layout.
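
A toy way to convince yourself the 6+2 idea works on paper: place an imaginary 6+2 set with one member in each of 8 cages, fail every possible pair of cages, and check that the set never loses more than the two members RAID 6 can tolerate. This is my own sketch of the layout, not 3PAR’s actual placement algorithm.

#!/bin/bash
# Toy check: an imaginary 6+2 RAID set with one member per cage across 8
# cages, failing every possible pair of cages.  Not 3PAR's real layout logic.

CAGES=8
SET=(0 1 2 3 4 5 6 7)   # cage holding each of the 8 members (one per cage)

worst=0
for ((a=0; a<CAGES-1; a++)); do
    for ((b=a+1; b<CAGES; b++)); do
        lost=0
        for cage in "${SET[@]}"; do
            if [ "$cage" -eq "$a" ] || [ "$cage" -eq "$b" ]; then
                lost=$((lost+1))
            fi
        done
        if [ "$lost" -gt "$worst" ]; then worst=$lost; fi
    done
done

echo "Worst case: $worst members lost with any two cages down (RAID 6 tolerates 2)"

The catch, of course, is that the placement has to guarantee a set never puts two members in the same cage, which is exactly what a stricter form of cage level availability would have to enforce.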

Personally, I speculate that on a decently sized 3PAR system (say 200-300 disks) SATA disks probably have to reach 5-8TB in size before I would really think hard about RAID 6. That won’t stop their reps from officially recommending RAID 6 with 2TB disks though.

I can certainly understand the population at large coming to the conclusion that RAID 5 is no longer useful, because probably 99.999% of the RAID arrays out there (stand alone arrays as well as arrays in servers) are not running on distributed RAID technology. So they don’t realize that another way to better protect your data is to make sure the degraded RAID arrays are rebuilt (much) faster, lowering the chance of additional disk failures occurring at the worst possible time.

It’s nice that they offer the option; let the end user decide whether or not to take advantage of it.

November 24, 2009

81,000 RAID arrays

Filed under: Storage, Virtualization — Nate @ 2:56 pm

I keep forgetting to post about this; I find the number interesting myself. It is the number of mini RAID arrays on my storage system, which has 200 spindles, which comes out to about 400 RAID arrays per disk! Why so many? It allows for maximum distribution of storage space and I/O across the system, as well as massively parallel RAID rebuilds, since every disk in the system participates when a disk fails. That leads to faster rebuild times and much better service times during rebuilds.

While 3PAR talks a lot about their mini RAID arrays (composed of virtual 256MB disks, aka chunklets), it turns out there really isn’t an easy way to query how many there are; I suppose they expect it to be so abstracted that you should not care. But I like to poke around, if you haven’t noticed already!

The little script to determine this number is:

#!/bin/bash
# Count the mini RAID arrays behind every logical disk on the array.
# "showld -d" is sent to the 3PAR CLI over ssh; for each LD, the raw size (MB)
# divided by the 256MB chunklet size and by the RAID set size gives the number
# of mini arrays, and a running total is kept as we go.

ARRAY="mrt"      # hostname of the array's CLI interface
ARRAYS_TOTAL=0

echo "showld -d" | ssh $ARRAY | grep cage | while read line;
do
        LD_NAME=`echo $line | awk '{print $2}'`    # logical disk name
        RAWSIZE=`echo $line | awk '{print $7}'`    # raw size of the LD in MB
        SETSIZE=`echo $line | awk '{print $10}'`   # RAID set size (disks per set)
        ARRAYS=`echo "${RAWSIZE}/256/${SETSIZE}" | bc`
        ARRAYS_TOTAL=`echo "${ARRAYS_TOTAL}+${ARRAYS}" | bc`
        echo "NAME:${LD_NAME} Raw Size:${RAWSIZE}  Set Size:${SETSIZE} Micro arrays in LD:${ARRAYS}  Total Micro arrays so far:${ARRAYS_TOTAL}"
done

Hopefully my math is right..

Output looks like:

NAME:log2.0 Raw Size:40960  Set Size:2 Micro arrays in LD:80  Total Micro arrays so far:80
NAME:log3.0 Raw Size:40960  Set Size:2 Micro arrays in LD:80  Total Micro arrays so far:160
NAME:pdsld0.0 Raw Size:49152  Set Size:2 Micro arrays in LD:96  Total Micro arrays so far:256
[..]
NAME:tp-7-sd-0.242 Raw Size:19968  Set Size:6 Micro arrays in LD:13  Total Micro arrays so far:81351
NAME:tp-7-sd-0.243 Raw Size:19968  Set Size:6 Micro arrays in LD:13  Total Micro arrays so far:81364

Like the mini RAID arrays, the logical disks (the command above is showld, i.e. show logical disks) are created, maintained and deleted automatically by the system, another layer of abstraction that you really never need to concern yourself with.

The exact number is 81,364, which is up from about 79,000 in June of this year. To me at least it’s a pretty amazing number when I think about it: 80,000+ little arrays working in parallel. How does the system keep track of it all?

3PAR isn’t unique in this technology, though I think maybe they were first. I believe Compellent has something similar, and Xiotech constructs RAID arrays at the disk head level, which I find very interesting; I didn’t know it was possible to “target” areas of the disk as specifically as a head. Of these three implementations, though, I think the 3PAR one is the most advanced because it’s implemented in hardware (Compellent’s is software) and it’s much more granular (400 per disk in this example; Xiotech would have up to 8 per disk, I think).

The disks are not full yet either; they are running at about 87% of capacity, so maybe room for another 10,000 RAID arrays on them or something..
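
A quick back-of-the-envelope check on that guess, assuming the micro array count grows roughly in proportion to used capacity (it may not exactly, since thin provisioned LDs grow on demand):

echo "81364 / 0.87 - 81364" | bc
# => 12157, so roughly 12,000 more micro arrays if they scale with used space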

I learned pretty quick there’s a lot more to storage than just the number/type of disks..

(Filed under virtualization as well since this is a virtualized storage post)

August 4, 2009

Will it hold?

Filed under: Monitoring, Storage — Nate @ 10:21 pm

I went through a pretty massive storage refresh earlier this year which cut our floor space in half, power in half, disks in half, etc., and improved performance at the same time. It has exceeded my expectations, though more recently I have gotten worried about how far the cache+disks will scale before they run out of gas. I have plans to increase the disk count by 50% (from 200 to 300) at the end of the year, but will we last until then? My past (admittedly limited) storage experience says we should already be having lots of problems, but we are not. The system’s architecture and large caches are absorbing the hit; performance remains high and very responsive to the servers. How long will that hold up, though? There are thousands of metrics available to me, but the one metric that is not available is cache utilization. I can get hit ratios on tons of things, but no info on how full the cache is at any particular period of time (for either NAS or SAN).

To illustrate my point, here is a graphic from my in-house monitoring showing sustained spindle response times over 60 milliseconds:

Physical Disk response time

And yet on the front end, response times are typically 2 milliseconds:

Fiber channel response time to NAS cluster

There are spikes of course; there is a known batch job that kicks off tons of parallel writes and blows out the cache on occasion, a big gripe I have with the developers of the app and their inability (so far) to throttle its behavior. I do hold my breath on occasion when I personally witness the caches (if you add up both NAS+SAN caches it’s about 70GB of mirrored memory) getting blown out. But as you can see, on both the read and especially the write side, the advanced controllers are absorbing a huge hit. The trend over the past few months has been a pretty steep climb upwards as more things run on the system. My hope is that things level off soon; that hasn’t happened yet.
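
For what it’s worth, the “how long until the cache blows out” question is simple arithmetic once you guess at the rates: if the burst writes come in faster than the spindles can drain them, the write cache fills in roughly cache size divided by (ingest rate minus drain rate). The rates below are illustrative guesses, not measurements from our arrays; only the ~70GB cache figure comes from the paragraph above.

#!/bin/bash
# Rough cache blow-out estimate: minutes of sustained burst before the write
# cache is full and latency falls back to the spindles.  Rates are made up.
CACHE_GB=70        # combined NAS+SAN mirrored cache (from above)
INGEST_MBS=900     # hypothetical burst write rate from the batch job
DRAIN_MBS=600      # hypothetical rate the back-end spindles can drain

echo "scale=1; (${CACHE_GB} * 1024) / (${INGEST_MBS} - ${DRAIN_MBS}) / 60" | bc
# => 3.9, i.e. about four minutes of sustained burst with these numbers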

The previous arrays I have used would not have been able to sustain this, by any stretch.

Will it hold?
