TechOpsGuys.com Diggin' technology every day

August 21, 2012

When is a million a million?

Filed under: Storage — Tags: — Nate @ 7:24 pm

I was doing my usual reading of The Register, specifically this article, and something popped into my mind that I wanted to write a brief note about.

I came across the SPC-1 results for the Huawei OceanStor Dorado 5100 before I saw the article and didn’t think a whole lot about it.

I got curious when I read the news article though, so I did some quick math – the Dorado 5100 is powered by 96 x 200GB SSDs and 96GB of cache in a dual controller active-active configuration, putting out an impressive 600,000 IOPS with the lowest latency (by far) that I have seen anyways. They also had a somewhat reasonable unused storage ratio of 32.35% (I would have liked to have seen much better given the performance of the box but I'll take what I can get).

But the numbers aren't too surprising – I mean SSDs are really fast, right? What got me curious though is the # of IOPS coming out of each SSD to the front end, which in this case comes to 6,250 IOPS/SSD. Compared to some of the fastest disk-based systems this is about 25x faster per disk than spinning rust. There is no indication that I can see, at least, that tells what specific sort of SSD technology they are using (other than SLC). But 6,250 per disk seems like a far cry from the tens of thousands of IOPS many SSDs claim to be able to do.

I’m not trying to say it’s bad or anything but I found the stat curious.

I went ahead and looked at another all-SSD solution, the IBM V7000 – this time 18 x 200GB SSDs providing roughly 120,000 IOPS, also with really good latency, with 16GB of data cache between the pair of controllers. Once again the numbers come to roughly 6,600 IOPS per SSD. IBM ran at an even better unused storage ratio of just under 15% – hard to get much better than that.

Texas Memory Systems (recently acquired by IBM) posted results for their RamSan-630 about a year ago, with 20 x 640GB SSDs pushing out roughly 400,000 IOPS with pretty good latency. This time however the numbers change – around 20,000 IOPS per SSD here, and as far as I can tell there is no RAM cache either. The TMS system came in at a 20% unused storage ratio.

While there are no official results, HP did announce not long ago an 'all SSD' variant of the P10000 (just realized it is kind of strange to have two sub models (V400 and V800, which were the original 3PAR models) of the larger P10000 line), which they said would get the same 450,000 IOPS on 512 x SSDs. The difference here is pretty stark, with each SSD theoretically putting out only 878 IOPS (so roughly 3.5x faster than spinning rust).

I know that originally 3PAR chose a slower STEC Mach8IOPS SSD primarily due to cost (it was something like 60% cheaper). STEC's own website shows the same SSD getting 10,000 IOPS (on a read test – whereas the disk they compared it to seemed to give around 250 IOPS). Still though, you can tap out the 8 controllers with almost 1/4th the number of disks supported when using these SSDs. I don't know whether or not the current generation of systems uses the same SSD.

I'll be the first to admit an all-SSD P10000 doesn't make a lot of sense to me, though it's nice that customers have that option if that's what they want (I never understood why all-SSD was not available before – that didn't make sense either). HP says it is 70% less expensive than an all-disk variant, though they are not specific about whether they are using 100GB SSDs (I assume they are) vs 200GB SSDs.

Both TMS and Huawei advertise their respective systems as doing "1 million IOPS" – I suppose if you took one of each and striped them together, that's about what you'd get! It sort of reminds me of a slide show presentation I got from Hitachi right before their AMS2000 series launched – one of the slides showed the # of IOPS from cache (they did not have a number for IOPS from disk at the time), which didn't seem like a terribly useful statistic.

So here you have individual SSDs providing anywhere from 900 to 20,000 IOPS per disk on the same test…
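
For reference, here is the quick math behind those per-SSD numbers, using the published figures quoted above (just a back-of-the-envelope sketch; the ~250 IOPS spinning-disk baseline is my own rough assumption):

    # Back-of-the-envelope: front-end IOPS per SSD for the systems discussed above.
    # Totals are the published SPC-1 results (or HP's announced figure for the P10000).
    systems = {
        "Huawei Dorado 5100":  (600_000,  96),
        "IBM V7000 (all-SSD)": (120_000,  18),
        "TMS RamSan-630":      (400_000,  20),
        "HP P10000 (all-SSD)": (450_000, 512),
    }

    for name, (total_iops, ssd_count) in systems.items():
        per_ssd = total_iops / ssd_count
        # ~250 IOPS is a rough figure for a fast spinning disk on this kind of test
        print(f"{name:22s} {per_ssd:8,.0f} IOPS/SSD  (~{per_ssd / 250:.0f}x spinning rust)")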

I'd really love to see SPC-1 results for the likes of Pure Storage, Nimble Storage, Nimbus Storage, and perhaps even someone like Tintri, just to see how they measure up on a common playing field, with a non-trivial utilization rate. Especially with claims like this from Nimbus saying they can do 20M IOPS/rack – does that mean at 10% of usable capacity or greater than 50%? I really can't imagine what sort of workload would need that kind of I/O but there's probably one or two folks out there that can leverage it.

We now take you back to your regularly scheduled programming..

June 25, 2012

Exanet – two years later – Dell Fluid File system

Filed under: Storage — Tags: , — Nate @ 10:21 pm

I was an Exanet customer a few years ago, up until they crashed. They had a pretty nice scale-out NFS cluster – well, at least it worked well for us at the time and it was really easy to manage.

Dell bought them over two years ago, hired many of the developers and has been making the product better, I guess, over the past couple of years. Really I think they could have released a product – wait for it – a couple of years ago, given that Exanet was simply a file system that ran on top of CentOS 4.x at the time. Dell was in talks with Exanet at the time they crashed to make Exanet compatible with an iSCSI back end (because really, who else makes a NAS head unit that can use iSCSI as back end disk?). So even that part of the work was pretty much done.

It was about as compatible as you could get really. It would be fairly trivial to certify it against pretty much any back end storage. But Dell didn't do that; they sat on it, making it better (one would have to hope at least). I think at some point along the line, perhaps even last year, they released something in conjunction with Equallogic – I believe that was going to be their first target at least, but with so many different names for their storage products I'm honestly not sure if it has come out yet or not.

Anyways that’s not the point of this post.

Exanet clustering, as I've mentioned before, was sort of like 3PAR for file storage. It treated files like 3PAR treats chunklets. It was highly distributed (but lacked the data movement and re-striping abilities that 3PAR has had for ages).

Exanet File System Daemon (FSD) – a software controller for files in the file system, typically one per CPU core. A file had a primary FSD and a secondary FSD, and new files would be distributed evenly across all FSDs.

One of the areas where the product needed more work, I thought, was being able to scale up more. It was a 32-bit system – so it inherited your typical 32-bit problems, like memory performance going in the tank when you try to address large amounts of memory. When their Sun was about to go super nova they told me they had even tested up to 16-node clusters on their system; they could go higher, there just wasn't customer demand.

3PAR too was a 32-bit platform for the longest time, but those limitations were less of an issue for it because so much of the work was done in hardware – it even has physical separation of the memory used for the software vs the data cache. Unlike Exanet, which did everything in software and of course shared memory between the OS and data cache. Each FSD had its own data cache, something like up to 1.5GB per FSD.

Requests could be sent to any controller, any FSD; if that FSD was not the owner of the file it would send a request over a back end cluster interconnect and proxy the data for you, much like 3PAR does in its clustering.
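
To make the ownership/proxy idea concrete, here is a toy model of it (purely my own illustration – the FSD count, the hashing scheme and the function names are all made up; this is not Exanet's actual code):

    # Toy illustration of the FSD ownership/proxy concept described above.
    import hashlib

    NUM_FSDS = 8  # e.g. one FSD per CPU core across the cluster

    def owner_fsds(path):
        """Pick a primary and secondary FSD for a file, spreading files evenly."""
        h = int(hashlib.md5(path.encode()).hexdigest(), 16)
        primary = h % NUM_FSDS
        secondary = (primary + 1) % NUM_FSDS  # simple "next FSD" placement
        return primary, secondary

    def handle_read(receiving_fsd, path):
        """Any FSD can receive a request; a non-owner proxies it over the interconnect."""
        primary, _ = owner_fsds(path)
        if receiving_fsd == primary:
            return f"FSD {receiving_fsd}: serving {path} from its own cache/back end"
        return (f"FSD {receiving_fsd}: proxying {path} over the cluster "
                f"interconnect to owner FSD {primary}")

    print(handle_read(3, "/exports/home/nate/report.txt"))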

I believed it was a great platform to just throw a bunch of CPU cores and gobs of memory at – it ran on an x86-64 PC platform (an IBM dual socket quad core box was their platform of choice at the time). 8, 10 and 12 core CPUs were just around the corner, as were servers which could easily get to 256GB or even 512GB of memory. When you're talking software licensing costs in the tens of thousands of dollars – give me more cores and RAM, the cost is minimal on such a commodity platform.

So you can probably understand my disappointment when I came across this a few minutes ago, which tries to hype up the upcoming Exanet platform.

  • Up to 8 nodes and 1PB of storage (Exanet could do this and more 4 years ago – though in this case it may be a Compellent limitation, as they may not support more than two Compellent systems behind an Exanet cluster – the docs are unclear) — Originally Exanet was marketed as a system that could scale to 500TB per 2-node pair. Unofficially they preferred you had less storage per pair (how much less was not made clear – at my peak I had around, I want to say, 140TB raw managed by a 2-node cluster? It didn't seem to have any issues with that, we were entirely spindle bound)
  • Automatic load balancing (this could be new – assuming it does what it implies – though the more I think about it, I'd bet it does not do what I think it should do and probably does the same load balancing Exanet did four years ago, which was less load balancing and more round robin distribution)
  • Dual processor quad core with 24GB – Same controller configuration I got in 2008 (well, the CPU cores are newer) — Exanet's standard was 16GB at the time but you could get a special order and do 24GB, though there was some problem with 24GB at the time that we ran into during a system upgrade – I forget what it was.
  • Back end connectivity – 2 x 8Gbps FC ports (switch required) — my Exanet was 4Gbps I believe and was directly connected to my 3PAR T400, queue depths maxed out at 1500 on every port.
  • Async replication only – Exanet had block based async replication in late 2009/early 2010. Prior to that they used a bastardized form of rsync (I never used either technology)
  • Backup power – one battery per controller. Exanet used old fashioned UPSs in their time, not sure if Dell integrated batteries into the new systems or what.
  • They dropped support for Apple File Protocol. That was one thing that Exanet prided themselves on at the time – they even hired one of the guys that wrote the AFP stack for Linux, they were the only NAS vendor (that I can recall) at the time that supported AFP.
  • They added support for NDMP – something BlueArc touted to us a lot at the time but we never used it, wasn’t a big deal. I’d rather have more data cache than NDMP.

I mean, from what I can see there really isn't much progress over the past two years. I really wanted to see things like:

  • 64-bit (the max memory being 24GB implies to me still a 32-bit OS+ file system code)
  • Large amounts of memory – at LEAST 64GB per controller – maybe make it fancy and make it flash-backed? RAM IS CHEAP.
  • More cores! At least 16 cores per controller, though I'd be happier to see 64 per controller (4x Opteron 6276 @ 2.3GHz per controller) – especially for something that hasn't even been released yet. Maybe based on a Dell R815 or R820.
  • At least a 16-node configuration (the number of blades you can fit in a Dell blade chassis (perhaps running Dell M620s), not to mention this level of testing was pretty much complete two and a half years ago).
  • SSD Integration of some kind – meta data at least? There is quite a bit of meta data mapping all those files to FSDs and LUNs etc.
  • Clearer indication that the system supports dynamic re-striping as well as LUN evacuation (LUN evacuation especially was something I wanted to leverage at the time, as the more LUNs you had the longer the system took to fail over. In my initial Exanet configuration the 3PAR topped out at 2TB LUNs; later they expanded this to 16TB but there was no way from the Exanet side to migrate to them, and Exanet, being fully distributed, worked best if the back end was balanced, so it wasn't a best practice to have a bunch of 2TB LUNs then start growing by adding 16TB LUNs – you get the idea) – the more I look at this pdf the less confident I am that they have added this capability (that PDF also indicates using iSCSI as a back end storage protocol).
  • No clear indication that they support read-write snapshots yet (all indications point to no). For me at the time it wasn't a big deal – snapshots were mostly used for recovering things that were accidentally deleted. They claim high performance with their redirect on write – though in my experience performance was not high. It was adequate with some tuning; they claimed unlimited snapshots at the time, but performance did degrade on our workloads with a lot of snapshots.
  • A low end version that can run in VMware – I know they can do it because I have an email here from 2 years ago that walks you through step by step instructions installing an Exanet cluster on top of VMware.
  • Thin provisioning friendly – Exanet wasn’t too thin provisioning friendly at the time Dell bought them – no indication from what I’ve seen says that has changed (especially with regards to reclaiming storage). The last version Exanet released was a bit more thin provisioning friendly but I never tested that feature before I left the company, by then the LUNs had grown to full size and there wasn’t any point in turning it on.

I can only react based on what I see on the site – Dell isn't talking too much about this at the moment it seems, unless perhaps you're a close partner and sign an NDA.

Perhaps at some point I can connect with someone who has in depth technical knowledge as to what Dell has done with this fluid file system over the past two years, because really all I see from this vantage point is they added NDMP.

I’m sure the code is more stable, easier to maintain perhaps, maybe they went away from the Outlook-style GUI, slapped some Dell logos on it, put it on Dell hardware.

It just feels like they could have launched this product more than two years ago minus the NDMP support (it would take about an hour to put in the Dell logos, and say another week to certify some Dell hardware configuration).

I wouldn't imagine the SpecSFS performance numbers would have changed a whole lot as a result; maybe it would be 25-35% faster with the newer CPU cores (those SpecSFS results are almost four years old). Well, performance could be boosted more by the back end storage. Exanet used to use the same cheap LSI crap that BlueArc used to use (and perhaps still does in some installations on the low end). Exanet even went to the IBM OEM version of LSI, and wow have I heard a lot of horror stories about that too (like entire arrays going off line for minutes at a time and IBM not being able to explain how/why, then all of a sudden they come back as if nothing happened). But one thing Exanet did see time and time again: performance on their systems literally doubled when 3PAR storage was used (vs their LSI storage). So I suspect fancy Compellent tiered storage with SSDs and such would help quite a bit in improving front end performance on SpecSFS. But that was true when the original results were put out four years ago too.

What took so long? Exanet had promise, but at least so far it doesn’t seem Dell has been able to execute on that promise. Prove me wrong please because I do have a soft spot for Exanet still 🙂

June 22, 2012

NetApp Cluster SPC-1

Filed under: Storage — Tags: , — Nate @ 7:19 pm

Sorry for the off topic posts recently – here is something a little more on topic.

I don't write about NetApp much, mainly because I believe they have some pretty decent technology – they aren't a Pillar or an Equallogic. Though sometimes I poke fun. BTW, did you hear about that senior Oracle guy that got canned recently and the comments he made about Sun? Oh my, was that funny. I can only imagine what he thought of Pillar. Then there are the folks that are saying Oracle is heavily discounting software so they can sell hardware at list price, thus propping up the revenues; the net result is Oracle software folks hate Sun. Not a good situation to be in. I don't know why Oracle couldn't have just been happy owning the BEA JRockit JVM and let Sun wither away.

Anyways…

NetApp tried to make some big news recently when they released their newest OS, Ontap 8.1.1. For such a minor version number change (8.1 -> 8.1.1) they sure did try to raise a big fuss about it. Shortly after 8.1 came out I came across the blog of some NetApp guy who was touting this release quite heavily. I was interested in some of the finer points and tried to ask some fair technical questions – I like to know the details. Despite being a 3PAR person I tried really hard to be polite and balanced, and the blogger was very thoughtful, informed and responsive and gave a great reply to my questions.

Anyways, I'm still sort of unclear on what is really new in 8.1.1 vs 8.1 – it sounds to me like it's just some minor changes from a technical side and they slapped some new marketing on top of it. Well, I think the new Hybrid aggregates are perhaps specifically new to 8.1.1 (also, I think, some new form of Ontap that can run in a VM for small sites). Maybe 8.1 by itself didn't make a big enough splash. Or maybe 8.1.1 is what 8.1 was supposed to be (I think I saw someone mention that perhaps 8.1 was a release candidate or something). The SpecSFS results posted by NetApp for their clusters are certainly pretty impressive from a raw performance standpoint. They illustrate excellent scalability up to 24 nodes.

But the whole story isn't told in the SpecSFS results – partially because things like cost are not disclosed in the results, and partially because they don't illustrate the main weakness of the system: it's not a single file system, and it's not automatically balanced from either a load or a space perspective.

But I won't harp on that much – this post is about their recent SPC-1 results, which I just stumbled upon. These are the first real SPC-1 results NetApp has posted in almost four years – you sort of have to wonder what took them so long. I mean they did release some SPC-1E results a while back but those are purely targeting energy measurements. For me at least, energy usage is probably not even in the top 5 things I look for when I want some new storage. The only time I really care about energy usage is if the installation is really, really small – I mean like the whole site being less than one rack. Energy efficiency is nice but there are a lot of things that are higher on my priority list.

This SPC-1 result from them is built using a 6-node cluster, 3TB of flash cache and 288 GB of data cache spread across the controllers, and only 432 disks – 144 x 450GB per pair of controllers protected with RAID DP. The cost given is $1.67M for the setup. They say it is list pricing – so not being a customer of theirs I’m not sure if it’s apples to apples compared to other setups – some folks show discounted pricing and some show list – I would think it would heavily benefit the tester to illustrate the typical price a customer would pay for the configuration.

  • 250,039 IOPS @ 3.35ms latency ($6.69 per SPC-1 IOP)
  • 69.8TB Usable capacity ($23,947 per Usable TB)
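
Those two per-unit figures fall straight out of the headline numbers – a quick sanity check (using the rounded $1.67M figure, so expect small rounding differences against the published values):

    # Sanity check of the per-unit SPC-1 metrics quoted above, from the rounded totals.
    total_cost_usd = 1_670_000   # ~$1.67M list price for the tested configuration
    spc1_iops      = 250_039
    usable_tb      = 69.8

    print(f"${total_cost_usd / spc1_iops:.2f} per SPC-1 IOP")    # ~$6.68
    print(f"${total_cost_usd / usable_tb:,.0f} per usable TB")   # ~$23,926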

Certainly a very respectable I/O number and really amazing latency – I think this is the first real SPC-1 result that is flash accelerated (as opposed to being entirely flash).

What got me thinking though was the utilization. I ragged on what could probably be considered a tier 3 or 4 storage company a while back for just inching by the SPC-1 minimum efficiency requirements. The maximum unused storage cannot exceed 45% and that company was at 44.77%.

Where's NetApp with this? Honestly, higher than I thought, especially considering RAID DP – they are at 43.20% unused storage. I mean really – would it not make more sense to simply use RAID 10 and get the extra performance? I understand that NetApp doesn't support RAID 10, but it just seems a crying shame to have such low utilization of the spindles. I really would have expected the Flash Cache to allow them to drive utilization up. But I suppose they decided to inch out more performance at the cost of usable capacity. I'd honestly be fascinated to see results when they drive the unused storage ratio down to, say, 20%.

The flash cache certainly does a nice job at accelerating reads and letting the spindles run more writes as a result. Chuck over at EMC wrote an interesting post where he picked apart the latest NetApp release. What I found interesting from an outsider perspective is how so much of this new NetApp technology feels bolted on rather than integrated. They seem unable to adapt the core of their OS to this (now old) scale out Spinnaker stuff even after all these years have elapsed. From a high level perspective the new announcements really do sound pretty cool. But once I got to know more about what's on the inside, I became less enthusiastic about them. There's some really neat stuff there but at the same time some pretty dreadful shortcomings in the system still (see the NetApp blog posts above for info).

The plus side though is that at least parts of NetApp are becoming more up front with where they target their technology. Some of the posts I have seen recently both in comments on The Register as well as the NetApp blog above have been really excellent. These posts are honest in that they acknowledge they can’t be everything to everyone, they can’t be the best in all markets. There isn’t one storage design to rule them all. As EMC’s Chuck said – compromise. All storage systems have some degree of compromise in them, NetApp always seems to have had less compromise on the features and more compromise on the design. That honesty is nice to see coming from a company like them.

I met with a system engineer of theirs about a year ago now when I had a bunch of questions to ask and I was tired of getting pitched nothing but dedupe. This guy from NetApp came out and we had a great talk for what must’ve been 90 minutes. Not once was the word dedupe used and I learned a whole lot more about the innards of the platform. It was the first honest discussion I had had with a NetApp rep in all the years I had dealt with them off and on.

At the end of the day I still wasn't interested in using the storage but felt that, hey – if some day I really feel a need to combine the best storage hardware with what many argue is the best storage software (management headaches aside, e.g. no true active-active automagic load balanced clustering), I can – just go buy a V-series and slap it in front of a 3PAR. I did it once before (really only because there was no other option at the time). I could do it again. I don't plan to (at least not at the company I'm at now). But the option is there. Just as long as I don't have to deal with the NetApp team in the Northwest and their dirty underhanded threatening tactics. I'm in the Bay Area now so that shouldn't be hard. The one surprising thing I heard from the reps here is they still can't do evaluations. Which just seems strange to me. The guy told me if a deal hinged on an evaluation he wouldn't know what to do.

3PAR of course has no such flash cache technology shipping today, something I've brought up with the head of HP storage before. I have been wanting them to release something like it for some time now (more specifically, something more like EMC's FAST Cache – EMC has made some really interesting investments in flash over recent years, but like NetApp, at least for me the other compromises involved in using an EMC platform don't make me want to use it over a 3PAR even though they have this flash technology). I am going to be visiting 3PAR HQ soon and will learn a lot of cool things I'm sure that I won't be able to talk about for some time to come.

June 11, 2012

3PAR and NPIV

Filed under: Storage — Tags: , — Nate @ 7:20 am

I was invited to a little preview of some of the storage things being announced at HP Discover last week, just couldn’t talk about it until the announcement. Since I was busy in Amsterdam all last week I really didn’t have a lot of time to think about blogging here.

But I'm back and am mostly adjusted to the time zone differences, I hope. HP had at least two storage related announcements last Monday: one related to scaling of their StoreOnce dedupe setup and another related to 3PAR. The StoreOnce announcement seemed to be controversial; since I really have a minimal amount of exposure to that sort of product I won't talk about it much – on the surface it sounded pretty impressive, but if the EMC claims are true then it's unfortunate.

Anyways, on to the 3PAR announcement which, while it had a ton of marketing around it, basically comes down to three words:

3PAR Supports NPIV (finally)

NPIV, in a nutshell and the way I understand it, is a way of virtualizing connections between points in a fibre channel network. Most often in the past it seems to have been used to present storage directly to VM hosts, via FC switches. NPIV is also used by HP's VirtualConnect technology on the FC side to connect the VC modules to an NPIV-aware FC switch (which is pretty much all of them these days?), and then the switch connects to the storage (duh). I assume that NPIV is required by Virtual Connect because the VC module isn't really a switch, it's more of a funky bridge.

Because 3PAR did not support NPIV (for what reason I don't know – I kept asking them about it for years but never got a solid response as to why not or when they might support it) there was no way to directly connect a Virtual Connect module (either the new FlexFabric or the older dedicated FC VC modules) to a 3PAR array – you had to have a switch as a middleman. Which just seemed like a waste. I mean here you have a T or now a V-class system with tons of ports, you have these big blade chassis with a bunch of servers in them, with the VC modules acting like a switch (acting as in aggregating points), and you can't directly connect it to the 3PAR storage! It was an unfortunate situation. Even going back to the 3cV, which was a bundle of sorts of 3PAR, HP c-Class Blades and VMware (long before HP bought 3PAR of course), I would have thought getting NPIV support would have been a priority, but it didn't happen until now (well, last Monday I suppose).

So at scale you have up to 96 host fibre channel ports on a V400 or 192 FC ports on a V800, operating at 8Gbps. At a maximum you could get by with 48 blade enclosures (2 FC/VC modules each with a single connection) on a V400, or of course double that to 96 on a V800. Cut it in half if you want higher redundancy with dual paths on each FC/VC module. That's one hell of a lot of systems directly connected to the array. Users may wish to stick to a single connection per VC module, allowing the 2nd connection to be connected to something else, maybe another 3PAR array – you still have full redundancy with two modules and one path per module. 3PAR 4Gbps HBAs (note the V-class has 8Gbps) have queue depths of something like 1,536 (not sure what the 8Gbps HBAs have). If you're leveraging full height blades you get 8 per chassis, so in an absolute worst case scenario you could set a queue depth of 192/server (I use 128/server on my gear). You could probably pretty safely go quite a bit higher, though more thought may have to be had in certain circumstances. I've found 128 has been more than enough for my own needs.
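
The per-server queue depth budgeting is just the division above, but here it is spelled out (a rough sketch – the port queue depth and blade count are the figures from the paragraph, everything else is up to your own comfort level):

    # Rough per-server HBA queue depth budget: divide the array port's queue depth
    # by the number of servers sharing that port (figures from the paragraph above).
    def queue_depth_per_server(port_queue_depth, servers_sharing_port):
        return port_queue_depth // servers_sharing_port

    port_qd = 1536   # queue depth of a 3PAR 4Gbps host port (per the text above)
    blades  = 8      # full height blades per c-Class chassis behind one VC uplink

    print(queue_depth_per_server(port_qd, blades))   # 192 -> worst-case ceiling
    # In practice 128/server (what I run) leaves comfortable headroom on the port.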

It's cost effective today to easily get 4TB worth of memory per blade chassis, memory being the primary driver of VM density, so you're talking anywhere from 96 – 384 TB of memory hooked up to a single 3PAR array. From a CPU perspective, anywhere from 7,680 CPU cores all the way up to 36,684 CPU cores in front of a single storage system – a system that has been tested to run at over 450,000 SPC-1 IOPS. The numbers are just insane.

All we need now is a flat ethernet fabric to connect the Virtual Connect switches to – oh wait, we have that too, though it's not from HP. A single pair of Black Diamond X-Series switches could scale to the max here as well, supporting a full eight 10Gbit/second connections per blade chassis with 96 blade chassis directly connected – which, guess what, is the maximum number of 10GbE ports on a pair of FlexFabric Virtual Connect modules (assuming you're using two ports for FC). Of course all of the bandwidth is non blocking. I don't know what the state of interoperability is, but Extreme touts their VEPA support in scaling up to 128,000 VMs in an X-Series, and Virtual Connect appears to tout their own VEPA support as well. Given the lack of more traditional switching functionality in the VC modules it would probably be advantageous to leverage VEPA (whether or not this extends to the hypervisor I don't know – I suspect not, based on what I last heard at least from VMware, though I believe it is doable in KVM) to route that inter-server traffic through the upstream switches in order to gain more insight into it and even control it. If you have upwards of 80Gbps of connectivity per chassis anyways it seems there'd be abundant bandwidth to do it. All HP needs to do now is follow Dell and revise their VC modules to natively support 40GbE (the Dell product is a regular blade Ethernet switch by contrast and is not yet shipping).

You'd have to cut at least one chassis out of that configuration (or reduce port counts) in order to have enough ports on the X-Series to uplink to other infrastructure. (When I did the original calculations I forgot there would be two switches not one, so there's more than enough ports to support 96 blade chassis between a pair of X-8s going full bore with 8x10GbE/chassis, and you could even use M-LAG to go active-active if you prefer.) I'm thinking load balancers, and some sort of scale-out NAS for file sharing, maybe the interwebs too.

Think about that, up to 30,000 cores, more than 300 TB of memory, sure you do have a bunch of bridges, but all of it connected by only two switches, and one storage array (perhaps two). Just insane.

One HP spokesperson mentioned that even a single V800 isn't spec'd to support their maximum blade system configuration of 25,000 VMs. 25k VMs on a single array does seem quite high (that comes to an average of 18 SPC-1 IOPS/VM), but it really depends on what those VMs are doing. I don't see how folks can go around tossing solutions about saying X number of VMs when workloads and applications can vary so widely.

So in short, the announcement was simple – 3PAR supports NPIV now – the benefits of that simple feature addition are pretty big though.

April 20, 2012

Oracle not afraid to leverage Intel architecture

Filed under: Storage — Tags: , — Nate @ 11:28 am

I have bitched and griped in the past about how some storage companies waste their customers' time, money and resources by not leveraging the Intel/commodity CPU architecture that some of them tout so heavily.

Someone commented on here in response to my HP SPC-2 results pointing out that the new Oracle 7240 ZFS system has some new SPC-2 results that are very impressive, and today I stumbled upon an article from our friends at The Register which talks about a similar 7240 system being tested in the SpecSFS benchmark with equally impressive results.

The main things missing for me with the NFS results are the ability to provide them over a single file system (not just a global name space as NetApp tries to advertise, but truly a single file system), oh, and of course the disclosure of costs with the test.

This 7240 system must be really new, when I went to investigate it recently the product detail pages on Oracle’s own site were returning 404s, but now they work.

I’ll come right out and say it – I’ve always been a bit leery of the ZFS offerings for a true high availability solution, I wrote a bit about this topic a while ago. Though that article focused mainly on people deploying ZFS on cheap crap hardware because they think they can make an equivalent enterprise offering by slapping some software on top of it.

I’m also a Nexenta customer for a very small installation (NAS only, back end is 3PAR). I know Nexenta and Oracle ZFS are worlds apart but at least I am getting some sort of ZFS exposure. ZFS has a lot of really nice concepts, it’s just a matter of how well it works in practice.

For example I was kind of shocked to learn that if a ZFS file system gets full you can’t delete files off of it. I saw one post of a person saying they couldn’t even mount the file system because it was full. Recently I noticed on one of my Nexenta volumes a process that kicks in when a volume gets 50% full. They create a quota’d file system on the volume of 100MB in size, so that when/if the file system is full you can somehow remove this data and get access to your file system again. Quite a hack.

I’ve seen another thread or two about existing Sun ZFS customers who have gotten very frustrated with the lack of support Oracle has given them since Oracle took the helm.

ANYWAYS, back to the topic of exploiting x86-64 architecture. Look at this –

ZFS Storage array base specifications

Clearly Oracle is embracing the processing and memory power that is available to them and I have to give them mad props for that – I wish other companies did the same, the customer would be so much better off.

They do it also by keeping the costs low (relative to the competition anyways), which is equally impressive. Oracle is a company of course that probably likes to drive margins more than most any other company out there, so it is nice to see them doing this.

My main question is – what of Pillar? What kind of work is being done there? I haven't noticed anything since Pillar went home to the Larry E mothership. Is it just dying on the vine? Are these ZFS systems still not suitable for certain situations which Pillar is better at supporting?

Anyways, I can’t believe I’m writing about an Oracle solution twice in the same week but these are both nice things to see come out of one of the bigger players out there.

April 12, 2012

3PAR Zero detection and snapshots

Filed under: Storage — Tags: , — Nate @ 9:00 am

UPDATED(again)

I've been holding onto this post to get confirmation from HP support; now that I have it, here goes.

UPDATE 1

A commenter, Karl, made a comment that made my brain think about this another way, and it turns out I was stupid in my original assessment as to what the system was storing for the snapshot. For some reason I was thinking the system was storing the zeros, when in fact it was storing the data the zeros were replacing.

So I’m a dumbass for thinking that. homer moment *doh*

BUT the feature still appears broken, in that if there is in fact 200GB of data written to the snapshot, that implies the zeros overwrote 200GB worth of non-zeroed data – and that data should have been reclaimed from the source volume. In this case it was not; only a tiny fraction was (1,536MB of logical storage, or 12x128MB chunks of data). So at the very least the bulk of the data usage should have been moved from the source volume to the snapshot (snapshot space is allocated separately from the source volume so it's easy to see which is using what). CURRENTLY the volume is showing 989GB of reserved space on the array with 120GB of written snapshot data and 140GB of file system data, or around 260GB of total data, which should come out to around 325GB of physical data in RAID 5 3+1 – not 989GB. But that space reclaiming technology is another feature – thin copy reclamation – which reclaims space from deleted snapshots.

So, sorry for being a dumbass about the original premise of the post; for some reason my brain got confused by the results of the tests, and it wasn't until Karl's comment that I thought about it from the other angle.

I am talking to some more technical / non support people on Monday about this.

And thanks Karl 🙂

UPDATE 2

I got some good information from senior technical folks at 3PAR and it turns out the bulk of my problems are related to bugs in how one part reports raw space utilization (resulting in wildly inaccurate info), and a bug with regards to a space reclamation feature that was specifically disabled by a software patch on my array in order to fix another bug with space reclamation. So the fix for that is to update to a newer version of code which has that particular problem fixed for good (I hope?). I don't think I'd ever have gotten that kind of information out of the technical support team.

So in the end not much of a big issue after all, just confused by some bugs and functionality that was disabled and me being stupid.


END UPDATE

A lot of folks over the years have tried to claim I am a paid shill for 3PAR, or Extreme or whatever. All I can say is I’m not compensated by them for my posts in any direct way (maybe I get better discounts on occasion or something I’m not sure how that stuff factors in but in any case those benefits go to my companies rather than me).

I do knock them when I think they need to be knocked though. Here is something that made me want to knock 3PAR, well more than knock, more like kick in the butt, HARD and say W T F.

I was a very early adopter of the T-class of storage system, getting it in house just a couple months after it was released. It was the first system from them which had the thin built in – the thin reclamation and persistence technology integrated into the ASIC – only I couldn’t use it because the software didn’t exist at the time.

 

3PAR "Stay Thin" strategy - wasn't worth $100,000 in licensing for the extra 10% of additional capacity savings for my first big array.

 

That was kind of sad but it could have been worse – the competition that we were evaluating was Hitachi, who had just released their AMS2000 series of products literally about a month after the T-class was released. Hitachi had no thin provisioning support whatsoever on the AMS2000 when it was launched. That came about seven months later. If you required thin provisioning at the time you had to buy a USP or (more common for this scenario due to costs at the time) a USP-V, which supported TP, and put the AMS2000 behind it. Hitachi refused to even give us a ballpark price as to the cost of TP on the AMS2000 whenever it was going to be released. I didn't need an exact price, just tell me – is it going to be $5,000, or $25,000, or maybe $100,000 or more? Should be a fairly simple process, at least from a customer perspective, especially given they already had such licensing in place on their bigger platform. In the end I took that bigger platform's licensing costs (since they refused to give those to me too) and extrapolated what the cost might look like on the AMS line. I got the info from Storage Mojo's price list and basically took their price and cut it in half to take into account discounts and stuff. We ended up obviously not going with HDS so I don't know what it would have really cost us in the end.

OK, steering the tangent closer to the topic again..bear with me.

Which got me wondering – given it is an ASIC and not an FPGA, they really have to be damn sure it works when they ship the product, otherwise it can be an expensive proposition to replace the ASICs if there is a design problem with the chip. After all, the CPUs aren't really in the data path of the stuff flowing through the system, so it would be difficult to work around ASIC faults in software (if it was possible at all).

So I waited, and waited for the new thin stuff to come out, thinking since I had thin provisioning licensed already I would just install the new software and get the new features.

Then it was released – more than a year after I got the T400 – but it came with a little surprise: additional licensing costs associated with the software, something nobody ever told me about (sounds like it was a last minute decision). If I recall right, for the system I had at the time, if we wanted to fully license thin persistence it was going to be an extra $100,000 in software. We decided against it at the time; it really wasn't worth the price for what we'd reclaim. Later on 3PAR offered to give us the software for free if we bought another storage array for disaster recovery (which we were planning to) – but the disaster recovery project got canned so we never got it.

Another licensing feature of this new software was that in order to get to the good stuff, the thin persistence, you had to license another product – Thin Conversion – whether you wanted it or not (I did not – really you might only need Thin Conversion if you're migrating from a non-thin storage system).

Fast forward almost two years and I'm at another company with another 3PAR. There was a thin provisioning licensing snafu with our system, so for the past few months (and for the next few) I'm operating on an evaluation license which basically has all the features unlocked – including the thin reclamation tech. I had noticed recently that some of my volumes are getting pretty big – per the request of the DBA we have, I agreed to make these volumes quite large – 400GB each. What I normally do is create the physical volume at 1 or 2TB (in this case 2TB), then I create a logical volume that is more in line with what the application actually needs (which may be as low as say 40GB for the database), then grow it online as the space requirements increase.

3PAR's early marketing at least tried to communicate that you can do away with volume management altogether. While certainly technically possible, I don't recommend that you take that approach. Another nice thing about volume management is being able to name the volumes with decent names, which is very helpful when moving snapshots between systems, especially with MPIO and multiple paths and multiple snapshots on one system – with LVM it's as simple as can be, without it I really don't want to think about it. The only downside is you can't easily mount a snapshot back to the originating system because the LVM UUID will conflict, and changing that ID is not (or was not – it's been a couple years since I looked into it) too easy, blocking access to the volume. Not a big deal though; the number of times I felt I wanted to do that was once.

This is a strategy I came up with going back almost six years to my original 3PAR box and it has worked quite well over the years. Originally, resizing was an offline operation since the kernel that we had at the time (Red Hat Enterprise 4.x) did not support online file system growth; it does now (and has for a while), I think since maybe 4.4/4.5 and certainly ever since v5.

Once you have a grasp of the growth pattern of your application it's not difficult to plan for. Getting the growth plan in the first place could be complex though, given the dedicate-on-write technology – you had to (borrowing a term from Apple here) think different. It obviously wasn't enough to just watch how much disk space was being consumed on average; you had to take into account space being written vs being deleted and how effective the file system was at re-utilizing deleted blocks. In the case of MySQL – being as inefficient as it is – you also had to take into account space utilization required by things like ALTER TABLE statements, in which MySQL makes a copy of the entire table with your change then drops the original. Yeah, real thin friendly there.
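
A toy version of that kind of planning looks like the following (all the numbers here are made up for illustration – the point is just that the ALTER TABLE copy means budgeting for a full extra copy of the biggest table on top of normal growth):

    # Toy estimate of peak allocated space on a thin volume backing MySQL:
    # projected steady growth, plus headroom for ALTER TABLE which copies the
    # entire table before dropping the original (and those blocks stay allocated).
    def peak_space_gb(current_data_gb, monthly_growth_gb, months_ahead, largest_table_gb):
        projected = current_data_gb + monthly_growth_gb * months_ahead
        return projected + largest_table_gb

    print(peak_space_gb(current_data_gb=140, monthly_growth_gb=10,
                        months_ahead=6, largest_table_gb=60))   # -> 260.0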

Given this kind of strategy it is more difficult to gauge just exactly how much you're saving with thin provisioning. I mean, on my original 3PAR I was about 300-400% over subscribed (which at the time was considered extremely high – I can't count the number of hours I spent achieving that); I recall David Scott saying at that conference I was at that the average customer was 300% oversubscribed. On my current system I am 1300% over subscribed, mainly because I have a bunch of databases and I make them all 2TB volumes. I can say with a good amount of certainty that they will probably never get remotely close to 2TB in size, but it doesn't affect me otherwise so I give them what I can (all my boxes on this array are VMware ESX 4.1, which of course has a 2TB limit – the bulk of these volumes are raw device mapped to leverage SAN-based snapshots as well as, to a lesser extent, to individually manage and monitor space and I/O metrics).

At the time my experience was compounded by the fact that I was still very green when it came to storage (I’d like to think I am more blue now at least whatever that might mean). Never really having dabbled much in it prior, choosing instead to focus on networking and servers. All big topics, I couldn’t take them all on at once 🙂

So my point is – even though 3PAR has had this technology for a while now – I really have never tried it. In the past couple months I have run the Microsoft sdelete tool on the 3 Windows VMs I do have to support my vCenter stuff (everything else is Linux) – but honestly I don't think I bothered to look to see if any space was reclaimed or not.

Now back on topic

Anyways, I have this one volume that was consuming about 300GB of logical space on the array when it had maybe 140GB of space written to the file system (which is 400GB). Obviously a good candidate for space reclamation, right? I mean the marketing claims you can gain 10% more space, in this case I’m gaining a lot more than that!

So I decided – hey how bout I write a basic script that writes out a ton of zeros to the file system to reclaim this space (since I recently learned that the kernel code required to do fancier stuff like fstrim [updated that post with new information at the end since I originally wrote it] doesn’t exist on my systems). So I put a basic looping script in to write 100MB files filled with zeros from /dev/zero.
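
The script was nothing fancier than this sort of thing (a rough reconstruction rather than the exact script I ran – the target path, file size, pause interval and free-space cutoff are all illustrative, and the file size is easy to bump to 128MB chunks to test the alignment theory support brings up further down):

    # Rough reconstruction of the zero-fill approach: write large files of zeros
    # until the file system is nearly full, pausing between files so the front end
    # storage ports don't get flooded. Paths and sizes here are just examples.
    import os, shutil, time

    TARGET_DIR = "/mnt/dbvol/zerofill"   # directory on the file system to fill
    FILE_SIZE  = 100 * 1024 * 1024       # 100MB per file (try 128MB chunks too)
    PAUSE_SECS = 30                      # spread the writes out over time
    HEADROOM   = 2 * 1024 * 1024 * 1024  # stop when less than ~2GB is free

    os.makedirs(TARGET_DIR, exist_ok=True)
    zeros = b"\0" * (1024 * 1024)        # write in 1MB chunks of zeros

    count = 0
    while shutil.disk_usage(TARGET_DIR).free > HEADROOM + FILE_SIZE:
        path = os.path.join(TARGET_DIR, f"zero_{count:05d}.bin")
        with open(path, "wb") as f:
            for _ in range(FILE_SIZE // len(zeros)):
                f.write(zeros)
            f.flush()
            os.fsync(f.fileno())         # make sure the zeros actually hit the array
        count += 1
        time.sleep(PAUSE_SECS)

    print(f"wrote {count} zero-filled files; delete {TARGET_DIR} once reclamation is done")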

I watched it as it filled up the file system over time (I spaced out the writing so as not to flood my front end storage connections), and watched it reclaim very little space – at the end of writing roughly 200GB of data it had reclaimed maybe 1-2GB from the original volume. I was quite puzzled to say the least. But that's not the topic of this post now is it.

I was shocked, awed, flabbergasted by the fact that my operation actually CONSUMED an additional 200GB of space on the system (space filled with zeros). Why did it do this? Apparently because I had created a snapshot of the volume earlier in the day and the changes were being kept track of, thus consuming the space. Never mind the fact that the system is supposed to drop the zeros even if it doesn't reclaim space – it doesn't appear to do so when there are snapshots on the volume, so the effects were a double negative – it didn't reclaim any space from the original, and it actually consumed a ton more space (more than 2x the original volume size) due to the snapshot.

Support claims minimal space was reclaimed by the system because I wrote files in 100MB blocks instead of 128MB blocks. I find it hard to believe that out of 200GB of files I wrote there were not more 128MB contiguous blocks of zeros. But I will try the test again with 128MB files on that specific volume after I can contact the people that are using the snapshot, so I can delete the snapshot and re-create it to reclaim that 200GB of space. Hell, I might as well not even use the snapshot and just create a full physical copy of the volume.

Honestly I'm sort of at a loss for words as to how stupid this is. I have loved 3PAR through thick and thin for a long time (and I've had some big thicks over the years that I haven't written about here anyways..), but this one I felt compelled to write about. A feature so heavily marketed, so heavily touted on the platform is rendered completely ineffective when a basic function like snapshots is in use. Of course the documentation has nothing on this – I was looking through all the docs I had on the technology when I was running this test on Thursday and basically all they said was enable zero detection on the volume (disabled by default) and watch it work.

I've heard a lot of similar types of things (a feature heavily touted but it doesn't work under load, or doesn't work period) about the likes of NetApp, EMC etc. This is a rare one for 3PAR, in my experience at least. My favorite off the top of my head was NetApp's testing of EMC CX-3 performance with snapshots enabled. That was quite a shocker to me when I first saw it – roughly a 65% performance drop over the same system without snapshots.

Maybe it is a limitation of the ASIC itself – going back to my speculation about design issues and not being able to work around them in software. Maybe this limitation is not present in the V-class which is the next generation ASIC. Or maybe it is, I don’t know.

HP Support says this behavior is as designed. Well, I'm sure more than one person out there would agree it is a stupid design if so. I can't help but think it is a design flaw – not an intentional one – or a design aspect they did not have time to address in order to get the T-series of arrays out in a timely manner (I read elsewhere that the ASIC took much longer than they thought to design, which I think started in 2006 – and was at least partially responsible for them not having PCI Express support when the ASIC finally came out). I sent them an email asking if this design was fixed in the V-class, and will update if they respond. I know plenty of 3PAR folks (current and former) read this too so they may be able to comment (anonymously or not..).

As for why more space was not reclaimed in the volume: I ran another test on Friday on another volume without any snapshots, which should have reclaimed a couple hundred gigs, but according to the command line it reclaimed nothing. Support points me to logs saying 24GB was reclaimed, but that is not reflected in the command line output showing the raw volume size on the system. Still working with them on that one. My other question to them is why 24GB? I wrote zeros to the end of the file system – there were 0 bytes left. I have some more advanced logging things to do for my next test.

While I’m here I might as well point out some of the other 3PAR software or features I have not used, let’s see

  • Adaptive optimization (sub LUN tiering – licensed separately)
  • Full LUN-based automated tiering (which I believe is included with Dynamic optimization) – all of my 3PAR arrays to-date have had only a single tier of storage from a spindle performance perspective though had different tiers from RAID level perspectives
  • Remote Copy – for the situations I have been in I have not seen a lot of value in array-based replication. Instead I use application based. The one exception is if I had a lot of little files to replicate; using block based replication is much more efficient and scalable. Array-based replication really needs application level integration, and I'd rather have real time replication from the likes of Oracle (not that I've used it in years, though I do miss it, really not a fan of MySQL) or MySQL than having to co-ordinate snapshots with the application to maintain consistency (and in the case of MySQL there really is no graceful way to take snapshots, again, unlike Oracle – I've been struggling recently with a race condition somewhere in an app or in MySQL itself which pretty much guarantees MySQL slaves will abort with error code 1032 after a simple restart of MySQL – this error has been shown to occur upwards of 15 minutes AFTER the slave has gotten back in sync with the master – really frustrating when trying to deal with snapshots and getting those kinds of issues from MySQL). I have built my systems, for the most part, so they can be easily rebuilt, so I really don't have to protect all of my VMs by replicating their data; I just have to protect/replicate the data I need in order to reconstruct the VM(s) in the event I need to.
  • Recovery manager for Oracle (I licensed it once on my first system but never ended up using it due to limitations in it not being able to work with raw device maps on vmware – I’m not sure if they have fixed that by now)
  • Recovery manager for all other products (SQL server, Exchange, and VMware)
  • Virtual domains (useful for service providers I think mainly)
  • Virtual lock (used to lock a volume from having data deleted or the volume deleted for a defined period of time if I recall right)
  • Peer motion

3PAR Software/features I have used (to varying degrees)

  • Thin Provisioning (for the most part pretty awesome but obviously not unique in the industry anymore)
  • Dynamic Optimization (oh how I love thee) – the functionality this provides I think for the most part is still fairly unique, pretty much all of it being made possible by the sub disk chunklet-based RAID design of the system. Being able to move data around in the array between RAID levels, between tiers, between regions of the physical spindles themselves (inner vs outer tracks), really without any limit as to how you move it (e.g. no limitations like aggregates in the NetApp world), all without noticeable performance impact is quite amazing (as I wrote a while back I ran this process on my T400 once for four SOLID MONTHS 24×7 and nobody noticed).
  • System Tuner (also damn cool – though never licensed it only used it in eval licenses) – this looks for hot chunklets and moves them around automatically. Most customers don’t need this since the system balances itself so well out of the box. If I recall right, this product was created in response to a (big) customer’s request mainly to show that it could be done, I am told very few license it since it’s not needed. In the situations where I used it it ended up not having any positive(or negative) effect on the situation I was trying to resolve at the time.
  • Virtual Copy (snapshots – both snapshots and full volume copies) – I've written tons of scripts to use this stuff, mainly with MySQL and Oracle (see the sketch after this list).
  • MPIO Software for MS Windows – worked fine – really not much to it, just a driver. Though there was some licensing fee 3PAR had to pay MS for the software or development efforts they leveraged to build it – otherwise the drivers could have been free.
  • Host Explorer (pretty neat utility that sends data back through the SCSI connection from the server to the array including info like OS version, MPIO version, driver versions etc – doesn’t work on vSphere hosts because VMware hasn’t implemented support for those SCSI commands or something)
  • System Reporter – Collects a lot of data, though from a presentation perspective I much prefer my own cacti graphs
  • vCenter Plugin for the array – really minimal set of functionality compared to the competition – a real weak point for the platform. Unfortunately it hasn’t changed much in the almost two years since it was released – hoping it gets more attention in the future, or even in the present. As-is, I consider it basically useless and don’t use it. I haven’t taken advantage of the feature on my own system since I installed the software to verify that it’s functional.
  • Persistent Cache – an awesome feature in 4+ node systems that allows re-mirroring of cache to another node in the system in the event of planned or unplanned downtime on one or more nodes in the cluster (while I had this feature enabled – it was free with the upgrade to 2.3.1 on systems with 4 or more nodes – I never actually had a situation where I was able to take advantage of it before I left the company with that system).
  • Autonomic Groups – group volumes and systems together and make managing mappings of volumes to clusters of servers very easy. The GUI form of this is terrible and they are working to fix it. I literally practically wiped out my storage system when I first tried this feature using the GUI. It was scary the damage I did in the short time I had this (even more so given the number of years I've used the platform for). Fortunately the array that I was using was brand new and had really no data on it (literally). Since then – CLI for me, safer and much more clear as to what is going on. My friends over at 3PAR got a lot of folks involved over there to drive a priority plan to fix this functionality, which they admit is lacking. What I did wipe out were my ESX boot volumes, so I had to re-create the volumes and re-install ESX. Another time I wiped out all of my fibre channel host mappings and had to re-establish those too. Obviously on a production system this would have resulted in massive data loss and massive downtime. Fortunately, again, it was still at least 2 months from being a production system and had a trivial amount of data. When autonomic groups first came out I was on my T400 with a ton of existing volumes; migrating existing volumes to groups likely would have been disruptive, so I only used groups for new resources, and I didn't get much exposure to the feature at the time.
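
Since I mentioned the snapshot scripting above: conceptually those scripts boil down to something like the following (a simplified sketch of my own, not 3PAR's or MySQL's official tooling – the SSH target, volume names and the array snapshot command are placeholders, so check the InForm CLI docs for the real syntax):

    # Simplified sketch of coordinating a consistent MySQL snapshot with an
    # array-based virtual copy. Host names, volume names and the snapshot command
    # are placeholders for illustration only.
    import subprocess
    import pymysql   # assumes the PyMySQL client library is available

    ARRAY_SSH = ["ssh", "3paradm@array-mgmt"]        # array management address (example)
    SNAP_CMD  = "createsv -ro snap-db01 vv-db01"     # placeholder snapshot command/volume names

    conn = pymysql.connect(host="db01", user="backup", password="secret")
    try:
        with conn.cursor() as cur:
            # Quiesce MySQL: flush tables and block writes while the array takes
            # its point-in-time copy, then release the lock immediately after.
            cur.execute("FLUSH TABLES WITH READ LOCK")
            try:
                subprocess.run(ARRAY_SSH + [SNAP_CMD], check=True)
            finally:
                cur.execute("UNLOCK TABLES")         # never leave the DB locked
    finally:
        conn.close()

A real script would also want to record the binary log position while the lock is held (for rebuilding slaves) and handle errors far more defensively, but that's the general shape of it.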

That turned out to be A LOT longer than I expected.

This is probably the most negative thing I’ve said about 3PAR here. This information should be known though. I don’t know how other platforms behave – maybe it’s the same. But I can say in the nearly three years I have been aware of this technology this particular limitation has never come up in conversations with friends and contacts at 3PAR. Either they don’t know about it either or it’s just one of those things they don’t want to admit to.

It may turn out that using SCSI UNMAP to reclaim space, rather than writing zeros, is much more effective, thus rendering the additional cost of thin licensing worthwhile. But not many things support that yet. As mentioned earlier, VMware specifically recommends disabling support for UNMAP in ESX 5.0 and has disabled it in subsequent releases because of performance issues.

Another thing that I found interesting is that in the CLI itself, 3PAR specifically recommends keeping zero detection disabled unless you’re doing data migration, because under heavy load it can cause issues –

Note: there can be some performance implication under extreme busy systems so it is recommended for this policy to be turned on only  during Fat to Thin and re-thinning process and be turned off during normal operation.

Which to some degree defeats the purpose? Some 3PAR folks have told me that information is out of date and only relates to legacy systems. Which didn’t really make sense to me, since there are no legacy systems that support zero detection – it is hard wired into the ASIC. 3PAR goes around telling folks that zero detection on other platforms is no good because of the load it introduces, but then says that their own system behaves in a similar way. Now to give them credit I suspect it is still quite likely a 3PAR box can absorb that hit much better than any other storage platform out there, but it’s not as if you’re dealing with a line-rate operation – there clearly seems to be a limit to what the ASIC can process. I would like to know what an extremely busy system looks like – how much I/O as a percentage of controller and/or disk capacity?
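
For what it’s worth, the fat-to-thin dance that note describes looks roughly like this – a sketch with made-up volume and mount point names, and the policy keywords written from memory, so double-check them against your own CLI:

    # on the array (3PAR CLI): enable zero detection just for the re-thin
    setvv -pol zero_detect myvol
    # on the host: fill the free space with zeros so the array can reclaim it, then clean up
    dd if=/dev/zero of=/mnt/myvol/zerofile bs=1M ; rm -f /mnt/myvol/zerofile ; sync
    # on the array: turn it back off per the note above
    setvv -pol no_zero_detect myvol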

Bottom line – at this point I’m even more glad I didn’t license the more advanced thinning technologies on my bigger T400 way back when.

I suppose I need to go back to reclaiming space the old fashioned way – data migration.

4,000+ words woohoo!

March 23, 2012

Hitachi trounces XIV in SPC-2 Costs

Filed under: Storage — Tags: , , — Nate @ 9:37 am

This really sort of surprised me. I came across an HP storage blog post which mentioned some new SPC-2 results for the P9500 aka Hitachi VSP. Naturally I expected the system to cost quite a bit and offer good performance, but I was really not expecting these results.

A few months ago I wrote about what seemed like pretty impressive numbers from IBM XIV (albeit at a high cost) – I didn’t realize how high that cost was until these latest results came out.

Not that any of my workloads are SPC-2-like (SPC-2 is primarily a throughput benchmark). I mean, if I had a data warehouse I’d probably run HP Vertica (which slashes I/O requirements due to its design), negating the need for such a high performing system, and if I was streaming media I would probably be running some sort of NAS – maybe Isilon or DDN, BlueArc – I don’t know. I’m pretty sure I would not be using one of these kinds of arrays though.

Anyways, the raw numbers came down to this:

IBM XIV

  • 7.4GB/sec throughput
  • $152.34 per MB/sec of throughput (42MB/sec per disk)
  • ~$7,528 per usable TB (~150TB Usable)
  • Total system cost – $1.1M for 180 x 2TB SATA disks and 360GB cache

HP P9500 aka Hitachi VSP

  • 13.1GB/sec throughput
  • $88.34 per MB/sec of throughput (26MB/sec per disk)
  • ~$9,218 per usable TB (~126TB Usable)
  • Total system cost – $1.1M for 512 x 300GB 10k SAS disks and 512GB cache
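
Quick sanity check on the per-disk figures – just the rounded throughput numbers above divided by spindle count (the per-MB/sec and per-TB costs in the disclosures use the exact figures, hence the small differences):

    echo "scale=1; 7400/180" | bc      # XIV:   ~41.1 MB/sec per disk
    echo "scale=1; 13100/512" | bc     # P9500: ~25.5 MB/sec per disk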

The numbers are just startling to me – I never really expected the cost of the XIV to be so high in comparison to something like the P9500. In my original post I suspected that any SPC-1 numbers coming out of XIV (based on the SPC-2 configuration cost anyways) would put the XIV as the most expensive array on the market (per IOP), which is unfortunate given its limited scalability: 180 disks, 7200RPM-only and RAID 10 only. I wonder what, if anything (other than margin), keeps the price so high on XIV.

I’m sure a good source for getting the cost lower on the P9500 side was the choice to use RAID 5 instead of RAID 10. The previous Hitachi results, released in 2008 for the previous-generation USP platform, used mirroring. And of course XIV only supports mirroring.

It seems clear to me that the VSP is the winner here, I suspect the XIV probably includes more software out of the box, while the VSP is likely not an all-inclusive system.

IBM gets some slack cut to them since they were doing an SPC-2/E energy efficiency test, though not too much, since if you’re spending $1M on a storage system the cost of energy isn’t going to be all that important (at least given average U.S. energy rates). I’m sure the P9500 with its 2.5″ drives is pretty energy efficient on its own anyways.

Where XIV really fell short was on the last test, Video on Demand – for some reason the performance tanked, to less than 50% of the other tests (a full 10 gigabytes/second less than the VSP). I’m not sure what the weightings are for each of the tests, but if IBM had been lucky and the VOD test wasn’t there it would have helped them a lot.

The XIV as tested is maxed out, so any expansion would require an additional XIV. The P9500 is nowhere close to maxed out (though throughput could be maxed out, I don’t know).

March 20, 2012

Storage Reclamation under Linux

Filed under: Storage — Nate @ 1:32 pm

UPDATED – Oh the good ol’ days. I remember when I first started using thin provisioning in late 2006, learning the ropes, the ins and outs… I can’t count how many times I did data migrations between volumes to reclaim space at the time.

Times have changed though – a few years ago many companies started introducing thin reclamation technologies, which provide a means for the host to tell the storage which blocks are no longer in use so the shared storage system can reclaim them for use in other volumes.

Initially software support for this was almost non-existent. About the only thing out there was a Microsoft tool called sdelete, which was never designed with storage reclamation in mind (I assume..), but it became the tool of choice for early reclamation setups since it had the ability to zero out deleted space in a volume. Of course it only works on Windows boxes.

Later came support from Symantec in their Veritas products, and VMware in their VAAI technologies (though in VMware’s case it seems they suggest you disable support because storage arrays can’t keep up, which drives latency up – I wonder if that includes 3PAR?). Then Oracle announced their own reclamation system, which they built in co-operation with 3PAR. Myself at least, I have not seen many other announcements for such technology.

A few days ago I came across this post from Compellent which talks about support for the SCSI UNMAP command in Red Hat Enterprise 6, apparently integrated into the ext4 file system (or somewhere else in the stack so that it is transparent – no special tool needed to reclaim space). That sounded really cool. My company is an Ubuntu shop at the moment, and as far as I can tell there is no such support in Ubuntu at this time (at least not with 10.04 LTS).

One of my co-workers pointed me to another tool called fstrim, which seems to do something similar as sdelete.

Fstrim is used on a mounted filesystem to discard (or “trim”) blocks which are not in use by the filesystem. This is useful for solid-state drives (SSDs) and thinly-provisioned storage.

I have not yet tried it. YAT (Yet Another Tool) I came across is zerofree, but that one is not really too useful since it needs the file system to be mounted read-only or not mounted at all.

fstrim is part of the util-linux package in newer Debian-based distributions (Ubuntu 10.04 excluded), so it should be supported by your distro if you’re on a current release. It is also part of the latest versions of Red Hat Enterprise.
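
Assuming a new enough kernel and filesystem, the two approaches look something like this – device and mount point names are made up, and I’d obviously try it somewhere non-production first:

    # real-time reclamation: ext4 issues TRIM/UNMAP as files are deleted
    mount -o discard /dev/mapper/array-vol1 /data

    # batch reclamation: discard all blocks not in use by the mounted filesystem
    fstrim -v /data        # -v reports how many bytes were discarded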

Any other thin reclamation tools out there (for any platform), or any other thin reclamation support built into other software stacks?

UPDATE – Upon further investigation I believe I determined that this new Linux functionality was introduced into the kernel as something called FITRIM, which Ubuntu says was part of the 2.6.36 kernel. I see an XFS page that mentions real-time reclamation (what RHEL 6 does) is apparently part of the 3.0 kernel, so I assume RHEL back-ported those changes.
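
One quick way to see whether a given block device even advertises discard/UNMAP support on one of these newer kernels (sdb is just a placeholder):

    cat /sys/block/sdb/queue/discard_max_bytes     # 0 means the device will not accept discards
    cat /sys/block/sdb/queue/discard_granularity   # smallest unit the device will reclaim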

March 17, 2012

Who uses Legacy storage?

Filed under: Random Thought,Storage — Tags: — Nate @ 3:34 pm

Still really busy these days and haven’t had time to post much, but I was just reading the LinkedIn profile of someone who works at a storage company and it got me thinking.

Who uses legacy storage? It seems almost everyone these days tries to benchmark their storage system against legacy storage.  Short of something like maybe direct attached storage which has no functionality, legacy storage has been dead for a long time now. What should the new benchmark be? How can you go about (trying to) measuring it?  I’m not sure what the answer is.

When is thin, thin?

One thing that has been on my mind a lot on this topic recently is how 3PAR constantly harps on about their efficient allocation at 16kB blocks. I think I’ve tried to explain this in the past but I wanted to write about it again. I wrote a comment about it on an HP blog recently – I don’t think they published the comment though (haven’t checked for a few weeks, maybe they did). But they try to say they are more efficient (by dozens or hundreds of times) than other platforms because of this 16kB allocation thing-a-ma-bob.

I’ve never seen this as an advantage to their platform. Whether you allocate in 16kB chunks or perhaps 42MB chunks in the case of Hitachi, it’s still a tiny amount of data in any case and really is a rounding error. If you have 100 volumes and they all have 42MB of slack hanging off the back of them, that’s 4.2GB of data, it’s nothing.

What 3PAR doesn’t tell you is that this 16kB allocation unit is what a volume draws from a storage pool (Common Provisioning Group in 3PAR terms – basically a storage template or policy which defines things like RAID type, disk type, placement of data, protection level etc). They don’t tell you up front how much storage these pools themselves grab from the system at a time, which is in part based on the number of controllers in the system.

If your volumes max out a CPG’s allocated space and it needs more, it won’t grab 16kB – it will grab (usually at least) 32GB, and this is adjustable. This is, I believe, part of how 3PAR minimizes the impact of thin provisioning under large amounts of I/O: it allocates these pools in larger chunks of space up front. They even suggest that if you have a really large amount of growth you increase the allocation unit even higher.

Growth Increments for CPGs on 3PAR

I bet you haven’t heard HP/3PAR say their system grows in 128GB increments recently 🙂

It is important to note, or to remember, that a CPG can be home to hundreds of volumes, so it’s up to the user – if they only have one drive type, for example, maybe they only want one CPG. But I think as they use the system they will likely go down a similar path to the one I have and end up with more.

If you only have one or two CPGs on the system it’s probably not a big deal, though the space does add up. Still I think for the most part even this level of allocation can be a rounding error. Unless you have a large number of CPGs.

Myself, on my 3PAR arrays I use CPGs not just for determining data characteristics of the volumes but also for organizational purposes / space management. So I can look at one number and see all of the volumes dedicated to development purposes are X in size, or set an aggregate growth warning on a collection of volumes. I think CPGs work very well for this purpose. The flip side is you can end up wasting a lot more space. Recently on my new 3PAR system I went through and manually set the allocation level of a few of my CPGs from 32GB down to 8GB because I know the growth of those CPGs will be minimal. At the time I had maybe 400-450GB of slack space in the CPGs, not as thin as they may want you to think (I have around 10 CPGs on this array). So I changed the allocation unit and compacted the CPGs which reclaimed a bunch of space.
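
From memory the commands were something along these lines – the CPG name is made up and the option names may be slightly off, so treat it as a sketch rather than gospel:

    setcpg -sdgs 8g dev_fc_r5     # drop the CPG growth increment from 32GB to 8GB
    compactcpg dev_fc_r5          # hand the unused allocated space back to the system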

Again, in the grand scheme of things that’s not that much data.

For me 3PAR has always been more about higher utilizations, which are made possible by the chunklet design and the true wide striping, the true active-active clustered controllers – one of the only (perhaps the first?) storage designs in the industry that goes beyond two controllers – and the ASIC acceleration which is at the heart of the performance and scalability. Then there is the ease of use and stuff, but I won’t talk about that anymore, I’ve already covered it many times. One of my favorite aspects of the platform is the fact that they use the same design on everything from the low end to the high end – the only difference really is scale. It’s also part of the reason why their entry level pricing can be quite a bit higher than entry level pricing from others, since there is extra sauce in there that the competition isn’t willing or able to put in their low end box(es).

Sacrificing for data availability

I was talking to Compellent recently learning about some of their stuff for a project over in Europe and they told me their best practice (not a requirement) is to have 1 hot spare of each drive type (I think drive type meaning SAS or SATA, I don’t think drive size matters but am not sure) per drive chassis/cage/shelf.

They, like many other array makers, don’t seem to support the use of low-parity RAID (like RAID 50 3+1 or 4+1). They (like others) lean towards higher data:parity ratios, I think in part because they have dedicated parity disks (they either had a hard time explaining to me how data is distributed or I had a hard time understanding it, or both..). Dedicating 25% of your spindles to parity is very excessive, but in the 3PAR world dedicating 25% of your capacity to parity is not excessive (when compared to RAID 10, where there is a 50% overhead anyways).

There are no dedicated parity, or dedicated spares on a 3PAR system so you do not lose any I/O capacity, in fact you gain it.

The benefits of a RAID 50 3+1 configuration are twofold – you get pretty close to RAID 10 performance, and you can most likely (depending on the # of shelves) suffer a shelf failure without data loss or downtime (downtime may vary depending on your I/O requirements and I/O capacity after those disks are gone).

It’s a best practice (again, not a requirement) in the 3PAR world to provide this level of availability (losing an entire shelf), not because you lose shelves often but just because it’s so easy to configure and is self managing. With a 4- or 8-shelf configuration I do like RAID 50 3+1. In an 8-shelf configuration maybe I have some data volumes that don’t need as much performance, so I could go with a 7+1 configuration and still retain shelf-level availability.
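
In CPG terms that is just the RAID set size plus the availability setting – a sketch from memory with made-up CPG names, so the exact flags may be slightly off:

    createcpg -t r5 -ssz 4 -ha cage FC_r5_fast    # 3+1, each member of a set on a different cage/shelf
    createcpg -t r5 -ssz 8 -ha cage FC_r5_bulk    # 7+1 for less performance-critical volumes (needs 8 shelves)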

Or, with CPGs you could have some volumes retain shelf-level availability and other volumes not have it, up to you. I prefer to keep all volumes with shelf level availability. The added space you get with a higher data:parity ratio really has diminishing returns.
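
To put rough numbers on those diminishing returns – usable fraction is just data disks over total disks in the set, ignoring spare chunklets and metadata:

    echo "scale=1; 100*3/4" | bc    # 3+1 -> 75.0% usable
    echo "scale=1; 100*5/6" | bc    # 5+1 -> 83.3% usable
    echo "scale=1; 100*7/8" | bc    # 7+1 -> 87.5% usable

Going from 3+1 all the way to 7+1 only buys you another 12.5 points of usable capacity.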

Here’s a graphic from 3PAR which illustrates the diminishing returns (at least on their platform – I think the application they used to measure was an Oracle DB):

The impact of RAID on I/O and capacity

3PAR can take this to an even higher extreme on their lower-end F-class series, which uses daisy chaining in order to get to full capacity (max chain length is 2 shelves). There is an availability level called port-level availability which I always knew was there but never really learned what it truly was until last week.

Port-level availability applies only to systems that have daisy-chained chassis and protects the system from the failure of an entire chain – so two drive shelves, basically. Like the other forms of availability this is fully automated, though if you want to go out of your way to take advantage of it you need to use a RAID level that is compatible with your setup; otherwise the system will automatically default to a lower level of availability (or will prevent you from creating the policy in the first place because it is not possible on your configuration).

Port-level availability does not apply to the S/T/V series of systems as there is no daisy chaining done on those boxes (unless you have a ~10-year-old S-series system, which did support chaining – up to 2,560 drives on that first-generation S800 – back in the days of 9-18GB disks).

February 3, 2012

IBM shows it still has some horses left

Filed under: Storage — Tags: , , — Nate @ 11:15 am

I noticed a few days ago that IBM posted some new SPC-1 results based on their SVC system, this time using different back-end storage – their Storwize product (something I had not heard of before, but I don’t pay too close attention to what IBM does, they have so many things it’s hard to keep track).

The performance results are certainly very impressive, coming in at over 520,000 IOPS at a price of $6.92 per IOP. This is the sort of result I was kind of expecting from the Hitachi VSP a while back. IBM tested with 1,920 drives, the same number as the 3PAR V800. They bested the 3PAR performance by a good 70,000 IOPS with half the latency, on the same number of disks and with less data cache.

The capacity numbers were, and still are, sort of difficult to interpret – they seem to give conflicting information. IBM is using ~138TB of disk space to protect ~99TB of disk space, while 3PAR is using ~263TB of disk space to protect ~263TB of disk space. Both results say there is 30TB+ of “unused storage” in that protection scheme.

Bottom line is the IBM box is presented with roughly 280TB of storage, and of that, 100TB is usable, or about 35%. That brings their cost per usable TB number to $36,881/TB vs the 3PAR V800 which is roughly $12,872. The V800 I/O cost was $6.59, which IBM comes real close to.
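
The math there is straightforward – these use the rounded disclosed totals, so they land in the same ballpark as the exact figures:

    echo "scale=1; 100*99/280" | bc         # ~35.3% of the presented capacity is usable
    echo "scale=0; 520000*6.92/99" | bc     # ~$36.3k per usable TB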

IBM has apparently gone the same route as HDS in that the only 3.5″ drives they support on their Storwize systems are 3TB SATA disks. They hamper their own cost structure by not supporting larger 3.5″ 15k RPM SAS disks, which just doesn’t make sense to me. There are 300GB 15k SAS drives out and Storwize doesn’t support those either (yet, at least).

It took about five pages of scripting to configure the system from the looks of the full disclosure report.

Certainly looks like a halfway decent system. I mean, if you compare it to the VSP for example: it has the same array virtualization abilities with the SVC, it is sporting almost double the number of disk drives and almost double the raw performance, and configuration at least appears to be less complicated. It uses those power-efficient 2.5″ disks just like the VSP. It also costs quite a bit less than the VSP on both a per-IOP and per-TB basis, and it appears to have mainframe support for those that need that. From the looks of Seagate’s 15k RPM disks at least, the 2.5″ drives have an average of 15% less latency for random reads and writes than their 3.5″ counterparts. I thought the difference might be bigger than that given how much less distance the disk heads have to travel.

If I was in the market for such a big system these results wouldn’t lead me away from 3PAR, at least based on the disclosed pricing of each system (and the level of complexity to configure it). I was interviewing a candidate a few weeks ago and this guy had a strong storage background – he had worked for Symantec, I think, and for a while was doing some sort of storage consulting at various companies. I asked him how he provisioned storage, what his strategies were. His response was quite surprising. He said usually the vendors come out, deploy their systems and provision everything up front – all he does is carve out LUNs and present them to users. He had never been involved in the architecture planning or deployment of a storage system. He acted as if what he was doing was the standard practice (maybe it is at large companies – I’ve never worked at such an organization), and that it was perfectly normal.

But it certainly seems like a good system when put up against at least the VSP, and probably the V-MAX too.

I’ve always been interested in the SVC by itself – certainly seems like a cool concept. I’ve never used one of course, but having the ability to cluster at that intermediate level (in this case an 8-node cluster, which may be the max, I’m not sure) and then scale out storage behind it is appealing. Clearly they’ve shown with this that you can pump one hell of a lot of I/O through the thing. They also seem to have SSD tiering support built into it, which is nice as well.

Hopefully HP can come up with something similar at some point, as much as they talk smack about the likes of SVC today.
