TechOpsGuys.com Diggin' technology every day

25 Jun 2012

Exanet – two years later – Dell Fluid File system

TechOps Guy: Nate

I was an Exanet customer a few years ago up until they crashed. They had a pretty nice NFS cluster for scale out, well at least it worked well for us at the time and it was really easy to manage.

Dell bought them over two years ago, hired many of the developers, and has presumably been making the product better over the past couple of years. Really I think they could have released a product - wait for it - a couple of years ago, given that Exanet was simply a file system that ran on top of CentOS 4.x at the time. Dell was in talks with Exanet at the time they crashed to make Exanet compatible with an iSCSI back end (because really, who else makes a NAS head unit that can use iSCSI as a back end disk?). So even that part of the work was pretty much done.

It was about as compatible as you could get really. It would be fairly trivial to certify it against pretty much any back end storage. But Dell didn't do that, they sat on it making it better (one would have to hope at least). I think at some point along the line, perhaps even last year, they released something in conjunction with Equallogic - I believe that was going to be their first target at least - but with so many different names for their storage products I'm honestly not sure if it has come out yet or not.

Anyways that's not the point of this post.

Exanet clustering, as I've mentioned before, was sort of like 3PAR for file storage. It treated files like 3PAR treats chunklets. It was highly distributed (but lacked data movement and re-striping abilities that 3PAR has had for ages).

Exanet File System Daemon (FSD) - a software controller for files in the file system, typically one per CPU core. Each file had a primary FSD and a secondary FSD. New files would be distributed evenly across all FSDs.

One of the areas where I thought the product needed more work was being able to scale up more. It was a 32-bit system, so it inherited your typical 32-bit problems like memory performance going in the tank when you try to address large amounts of memory. When their Sun was about to go super nova they told me they had even tested up to 16-node clusters on their system; they could go higher, there just wasn't customer demand.

3PAR too was a 32-bit platform for the longest time, but those limitations were less of an issue for it because so much of the work was done in hardware - it even has physical separation of the memory used for the software vs the data cache, unlike Exanet, which did everything in software and of course shared memory between the OS and data cache. Each FSD had its own data cache, something like up to 1.5GB per FSD.

Requests could be sent to any controller, any FSD. If that FSD was not the owner of the file it would send a request on a back end cluster interconnect and proxy the data for you, much like 3PAR does in its clustering.
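To make the routing idea concrete, here is a minimal sketch in Python. It is only a toy model of the behavior described above, not Exanet's actual code: the `Cluster` and `Fsd` classes, the round robin placement, and picking the next FSD in line as the secondary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Fsd:
    """One file system daemon (hypothetical model, one per CPU core)."""
    name: str
    files: set = field(default_factory=set)


class Cluster:
    def __init__(self, fsd_names):
        self.fsds = [Fsd(n) for n in fsd_names]
        self.owner = {}                          # path -> (primary, secondary)
        self._next = cycle(range(len(self.fsds)))

    def create(self, path):
        # Assumption: new files are spread with plain round robin.
        i = next(self._next)
        primary = self.fsds[i]
        secondary = self.fsds[(i + 1) % len(self.fsds)]
        primary.files.add(path)
        self.owner[path] = (primary, secondary)
        return primary.name

    def read(self, path, via):
        """Serve a request that landed on FSD `via`; proxy if it is not the owner."""
        primary, _ = self.owner[path]
        if primary.name == via:
            return f"{via} served {path} locally"
        # In the real system this hop went over the back end cluster interconnect.
        return f"{via} proxied {path} from {primary.name}"


cluster = Cluster(["fsd0", "fsd1", "fsd2", "fsd3"])
for p in ["/a", "/b", "/c", "/d", "/e"]:
    cluster.create(p)

print(cluster.read("/b", via="fsd1"))   # fsd1 served /b locally
print(cluster.read("/b", via="fsd3"))   # fsd3 proxied /b from fsd1
```

The point of the sketch is just that the client never needs to know which FSD owns a file; any entry point works, at the cost of an extra hop when it is not the owner.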

I believed it was a great platform to just throw a bunch of CPU cores and gobs of memory at; it runs on an x86-64 PC platform (IBM dual socket quad core was their platform of choice at the time). 8, 10 and 12 core CPUs were just around the corner, as were servers which could easily get to 256GB or even 512GB of memory. When you're talking software licensing costs being in the tens of thousands of dollars - give me more cores and ram, the cost is minimal on such a commodity platform.

So you can probably understand my disappointment when I came across this a few minutes ago, which tries to hype up the upcoming Exanet platform.

  • Up to 8 nodes and 1PB of storage (Exanet could do this and more 4 years ago - though in this case it may be a Compellent limitation as they may not support more than two Compellent systems behind an Exanet cluster - docs are unclear) -- Originally Exanet was marketed as a system that could scale to 500TB per 2-node pair. Unofficially they preferred you had less storage per pair (how much less was not made clear - at my peak I had around, I want to say, 140TB raw managed by a 2-node cluster? It didn't seem to have any issues with that; we were entirely spindle bound)
  • Automatic load balancing (this could be new - assuming it does what it implies - though the more I think about it, I'd bet it does not do what I think it should do and probably does the same load balancing Exanet did four years ago, which was less load balancing and more round robin distribution)
  • Dual processor quad core with 24GB - Same controller configuration I got in 2008 (well the CPU cores are newer) -- Exanet's standard was 16GB at the time but you could get a special order and do 24GB, though there was some problem with 24GB at the time that we ran into during a system upgrade; I forget what it was.
  • Back end connectivity - 2 x 8Gbps FC ports (switch required) -- my Exanet was 4Gbps I believe and was directly connected to my 3PAR T400, queue depths maxed out at 1500 on every port.
  • Async replication only - Exanet had block based async replication in late 2009/early 2010. Prior to that they used a bastardized form of rsync (I never used either technology)
  • Backup power - one battery per controller. Exanet used old fashioned UPSs in their time, not sure if Dell integrated batteries into the new systems or what.
  • They dropped support for Apple Filing Protocol. That was one thing that Exanet prided themselves on at the time - they even hired one of the guys that wrote the AFP stack for Linux, they were the only NAS vendor (that I can recall) at the time that supported AFP.
  • They added support for NDMP - something BlueArc touted to us a lot at the time but we never used it, wasn't a big deal. I'd rather have more data cache than NDMP.

I mean from what I can see I don't really see much progress over the past two years. I really wanted to see things like

  • 64-bit (the max memory being 24GB implies to me still a 32-bit OS+ file system code)
  • Large amounts of memory - at LEAST 64GB per controller - maybe make it fancy and make it flash-backed? RAM IS CHEAP.
  • More cores! At least 16 cores per controller, though I'd be happier to see 64 per controller (4x Opteron 6276 @ 2.3GHz per controller) - especially for something that hasn't even been released yet. Maybe based on Dell R815 or R820
  • At least 16-node configuration (the number of blades you can fit in a Dell blade chassis, perhaps running Dell M620, not to mention this level of testing was pretty much complete two and a half years ago).
  • SSD Integration of some kind - meta data at least? There is quite a bit of meta data mapping all those files to FSDs and LUNs etc.
  • Clearer indication that the system supports dynamic re-striping as well as LUN evacuation. LUN evacuation especially was something I wanted to leverage at the time, as the more LUNs you had the longer the system took to fail over. In my initial Exanet configuration the 3PAR topped out at 2TB LUNs; later they expanded this to 16TB but there was no way from the Exanet side to migrate to them, and Exanet being fully distributed worked best if the back end was balanced, so it wasn't a best practice to have a bunch of 2TB LUNs and then start growing by adding 16TB LUNs. The more I look at this PDF the less confident I am in them having added this capability (that PDF also indicates using iSCSI as a back end storage protocol).
  • No clear indication that they support read-write snapshots yet (all indications point to no). For me at the time it wasn't a big deal, snapshots were mostly used for recovering things that were accidentally deleted. They claim high performance with their redirect on write - though in my experience performance was not high. It was adequate with some tuning, they claimed unlimited snapshots at the time, but performance did degrade on our workloads with a lot of snapshots.
  • A low end version that can run in VMware - I know they can do it because I have an email here from 2 years ago that walks you through step by step instructions installing an Exanet cluster on top of VMware.
  • Thin provisioning friendly - Exanet wasn't too thin provisioning friendly at the time Dell bought them - no indication from what I've seen says that has changed (especially with regards to reclaiming storage). The last version Exanet released was a bit more thin provisioning friendly but I never tested that feature before I left the company, by then the LUNs had grown to full size and there wasn't any point in turning it on.

I can only react based on what I see on the site - Dell isn't talking too much about this at the moment it seems, unless perhaps you're a close partner and sign an NDA.

Perhaps at some point I can connect with someone who has in depth technical knowledge as to what Dell has done with this fluid file system over the past two years, because really all I see from this vantage point is they added NDMP.

I'm sure the code is more stable, easier to maintain perhaps, maybe they went away from the Outlook-style GUI, slapped some Dell logos on it, put it on Dell hardware.

It just feels like they could have launched this product more than two years ago minus the NDMP support (take about 1 hour to put in the Dell logos, and say another week to certify some Dell hardware configuration).

I wouldn't imagine the SpecSFS performance numbers would have changed a whole lot as a result, maybe it would be 25-35% faster with the newer CPU cores (those SpecSFS results are almost four years old). Well, performance could be boosted more by the back end storage. Exanet used to use the same cheap LSI crap that BlueArc used to use (perhaps still does in some installations on the low end). Exanet even went to the IBM OEM version of LSI and wow have I heard a lot of horror stories about that too (like entire arrays going off line for minutes at a time and IBM not being able to explain how/why, then all of a sudden they come back as if nothing happened). But one thing Exanet did see time and time again: performance on their systems literally doubled when 3PAR storage was used (vs their LSI storage). So I suspect fancy Compellent tiered storage with SSDs and such would help quite a bit in improving front end performance on SpecSFS. But that was true when the original results were put out four years ago too.

What took so long? Exanet had promise, but at least so far it doesn't seem Dell has been able to execute on that promise. Prove me wrong please because I do have a soft spot for Exanet still :)

8 Sep 2011

HDS absorbs BlueArc

TechOps Guy: Nate

It seems HDS has finally decided to buy out BlueArc after what was either two or three failed attempts at an IPO.

BlueArc, along with my buddies over at 3PAR, is among the few storage companies that put real silicon to work in their systems for the best possible performance. Their architecture is quite impressive and the performance (that is for their mid range system) shows.

I have only been exposed to their older stuff (5-6 year old technology) directly, not their newer technology. But even their older stuff was very fast and efficient, very reliable and had quite a few nifty features as well. I think they were among the first to do storage tiering (for them at the file level).

[ warning - a wild tangent gets thrown in here somewhere ]

While their NAS technology was solid (IMO), their disk technology was not. They relied on LSI storage, and the quality of the storage was very poor over all. First off, whoever set up the system we had set it up with everything running RAID 5 12+1. Then there were the long RAID rebuild times, the constant moving hot spots because of the number of different tiers of storage we had, and the fact that the 3 Titan head units were not clustered so we had to take hard downtime for software upgrades (not BlueArc's fault other than perhaps making it too expensive to be able to do clustered heads when the company bought the stuff - long before I was there). Every time we engaged with BlueArc 95% of our complaints were about the disk. For the longest time they tried to insist that "disk doesn't matter". That you could put any storage system behind the BlueArc and it would be the same.

After the 3rd or 4th engagement BlueArc significantly changed their tune (not sure what prompted it); they now acknowledged the weakness of the low tier storage and were promoting the use of HDS AMS storage (USP was well, waaaaaaaay out of our price range) since they were a partner of HDS back then as well. The HDS proposal fell far short of the design I had with 3PAR, and at the time Exanet was 3PAR's partner of choice.

If I could have chosen I would have used BlueArc for NAS and 3PAR for disk. 3PAR was open to the prospect of course, BlueArc claimed they had contacted 3PAR to start working with them but 3PAR said that never happened. Later BlueArc acknowledged they were not going to try to work with 3PAR (or any other storage company other than LSI or HDS - I think 3PAR was one digit too long for them to handle).

Given the BlueArc system lacked the ability to provide us with any truly useful disk performance statistics, it was tough coming up with a configuration that I thought would work as a replacement. There were a large number of factors involved, and any one of them had a fairly wide margin of error. You could say I pulled a number out of my ass, but I did do more calculations than that (I have about a dozen pages of documentation I wrote at the time on the project). Really, at the end of the day, it was a stab in the dark as far as initial configuration.

BlueArc as a company at the time didn't really have their support stuff all figured out yet. The first sign was when a scheduled downtime for a software upgrade that was intended to take 2-3 hours ended up taking 10-11 hours, because there was a problem and BlueArc lacked the proper escalation procedures to resolve it quickly enough. Their CEO sent us a letter later saying that they fixed that process in the company. The second sign was when I went to them and asked them to confirm the drive type/size of all of our disks so I could do some math for the replacement system. They did a new audit (had to be on site to do it for some reason), and it turns out we had about 80 more spindles than they thought we had (we bought everything through them). I don't know how you lose track of that many disks for support but somehow it fell through the cracks. Another issue we had was we paid BlueArc to relocate the system to another facility (again before I was at the company), and whoever moved it didn't do a good job; they accidentally plugged both power supplies of a single shelf into the same PDU. Fortunately it was a non production system. A PSU blew at one point that took out the PDU, which then took out that shelf, which then took out the file system the shelf was on.

Even after all of that my main problem with their solution was the disks. LSI was not up to snuff and the proposal from HDS wasn't going to cut it. I told my management that there is no doubt that HDS could come up with a solution that would work -- it's just that what they have proposed will not (they didn't even have thin provisioning at the time. 3PAR was telling me HDS was pairing USP-Vs along with AMSs in order to try to compete in the meantime. They did not propose that to us). They proposed a combination of poor performing SATA, on RAID-6 no less, for bulk storage and higher performing 15k RPM disks for higher tier/less storage. HDS/BlueArc felt it was equivalent to what I had specified through 3PAR and Exanet, not understanding the architectural advantages the 3PAR system had over the proposed HDS design (going into specifics will take too long, you probably know them by now anyways if you're here). Not to mention what seemed like sheer incompetence among the HDS team that was supporting us; it seemed nothing I asked them they could answer without engaging someone from Japan, and even then I rarely got a comprehensible answer.

So in the end we ended up replacing a 4-rack BlueArc system with what could have been a single rack 3PAR + a few rack units for the Exanet, but we had to put the 3PAR in two racks due to weight constraints in the data center. We went from 500+ disks (mix of SATA-I and 10k RPM FC) to 200 disks of SATA-II (RAID 5 5+1). With the change we got the advantage of being able to run fibre channel (which we ran to all of our VM boxes as well as primary databases) and iSCSI (which we used here and there; 3PAR's iSCSI support has never been as good as I would have liked to have seen it, though for anything serious I'd rather use FC anyways, and that's what 3PAR's customers did, which led to some neglect on the iSCSI front).

Half the floor space, half the power usage, roughly the same amount of usable storage, about the same amount of raw storage. We put the 3PAR/Exanet system through its paces with our most I/O intensive workload at the time and it absolutely screamed. I mean it exceeded everyone's expectations (mine included). But that was only the beginning.

This is a story I like to tell on 3PAR reference calls when I do them which is becoming more and more rare these days. In the early days of our 3PAR/Exanet deployment the Exanet engineer tried to assure me that they were thin provisioning friendly, he had personally used 3PAR+Exanet in the past and it worked fine. So with time constraints and stuff I provisioned a file system on the Exanet box not thinking too much about the 3PAR end of things. It's thin provisioning friendly right? RIGHT?

Well not so much. Before you knew it the system was in production and we started dumping large amounts of data on it, and deleting large amounts of data on it. I found out in a few weeks the Exanet box was preferring to allocate new space rather than reclaim deleted space. I did some calculations and the result was not good. If we let the system continue at this rate we were going to exceed the amount of capacity on the 3PAR box if the Exanet file system was allowed to grow to its full potential. Not good.. Compound that with the fact that we were at the maximum addressable capacity of a 2-node 3PAR box: if I had to add even 1 more disk to the system (not that adding 1 disk is possible in that system due to the way disks are added, minimum is 4), I would have had to put in 2 more controllers. Which as you might expect is not exactly cheap. So I was looking at what could have been either a very costly downtime to do data migration or a very costly upgrade to correct my mistake.
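The failure mode is easy to model. The sketch below is a toy simulation, not a description of how Exanet or 3PAR actually allocate space: the block counts and daily churn are made-up numbers, and it assumes the array never learns about deletes, so any block ever written stays allocated on the back end.

```python
def backend_allocated(days, daily_write, fs_blocks, reuse_freed):
    """Blocks the thin-provisioned array has allocated after `days`.

    Each day the file system writes `daily_write` blocks and then deletes the
    same amount, so live data stays flat. Hypothetical units throughout.
    """
    never_used = fs_blocks      # blocks the file system has not touched yet
    freed = 0                   # blocks written once, then deleted
    for _ in range(days):
        need = daily_write
        if reuse_freed:
            take = min(need, freed)     # thin friendly: reuse deleted space first
            freed -= take
            need -= take
        take = min(need, never_used)    # otherwise grab never-touched space
        never_used -= take
        need -= take
        freed -= min(need, freed)       # out of fresh space: forced to reuse
        freed += daily_write            # the day's deletes
    return fs_blocks - never_used


# Same workload, two allocation policies.
print(backend_allocated(30, 10, 1000, reuse_freed=True))    # 10
print(backend_allocated(30, 10, 1000, reuse_freed=False))   # 300
```

With the same flat amount of live data, the policy that prefers new space keeps growing the back end allocation until it hits the full size of the file system, which is exactly the situation where the LUNs end up fully provisioned.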

Dynamic optimization to the rescue. This really saved my ass. I mean really, it did. When I built the system I used RAID 5 3+1 for performance (for 3PAR that is roughly 8% slower than RAID 10, and 3PAR fast RAID 5 is probably on par with many other vendors' RAID 10 due to the architecture).

So I ran some more calculations and determined if I could get to RAID 5 5+1 I would have enough space to survive. So I began the process, converting roughly a half dozen LUNs at a time. 24 hours a day, 7 days a week. It took longer than I expected, the 3PAR was routinely getting hammered from daily activities from all sides. It took about 5 months in the end to convert all of the volumes. Throughout the process nobody noticed a thing. The array was converting volumes for 24 hours a day for 5 months straight and nobody noticed (except me, who was baby sitting it hoping I could beat the window). If I recall right I probably had 3-4 weeks of buffer space; if my conversions had taken an extra month I would have exceeded the capacity of the system. So, I got lucky I suppose, but I also bought the system knowing I could make such course corrections online without impacting applications for just that kind of event -- I just didn't expect the event to be so soon and on such a large scale.
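For reference, the space math behind that conversion is simple. This is a quick sketch counting parity overhead only; it ignores spares and any array metadata, which is a simplification, and the function name is made up for the example.

```python
from fractions import Fraction


def raid5_usable(data_disks, parity_disks=1):
    """Fraction of raw capacity left for data in a RAID 5 set."""
    return Fraction(data_disks, data_disks + parity_disks)


old = raid5_usable(3)           # RAID 5 3+1 -> 3/4 usable
new = raid5_usable(5)           # RAID 5 5+1 -> 5/6 usable

extra_usable = new / old - 1    # more usable space from the same raw disks
raw_saved = 1 - old / new       # less raw space for the same data

print(f"3+1 usable: {float(old):.1%}")                    # 75.0%
print(f"5+1 usable: {float(new):.1%}")                    # 83.3%
print(f"extra usable, same raw: {float(extra_usable):.1%}")  # 11.1%
print(f"raw saved, same data:  {float(raw_saved):.1%}")      # 10.0%
```

So going from 3+1 to 5+1 buys back roughly a tenth of the raw space the existing data occupied, at the cost of wider stripes.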

One of the questions I had for HDS at the time we were looking at them was could they do the same online RAID conversions. The answer? Absolutely they can. But the fine print was (assuming it still is) you needed blank disks to do the migration to. Since 3PAR's RAID is performed at the sub disk level no blank disks are required, only blank "chunklets" as they call them. Basically you just need enough empty space on the array to mirror the LUN/Volume to the new RAID level and then you break the mirror and eliminate the source (this is all handled transparently with a single command and some patience depending on system load and volume of data in flight).
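A rough way to picture the "mirror to the new RAID level, then drop the source" process is below. It is a sketch under stated assumptions, not 3PAR's algorithm: volume sizes and free space are hypothetical, volumes are converted one at a time, and only parity overhead is counted.

```python
def raw_needed(usable_gb, data, parity=1):
    """Raw space a volume occupies at a given RAID 5 layout."""
    return usable_gb * (data + parity) / data


def convert_all(volumes_gb, free_raw_gb, old=(3, 1), new=(5, 1)):
    """Convert volumes one at a time; return free raw space after each step."""
    history = []
    for usable in volumes_gb:
        target = raw_needed(usable, *new)
        if target > free_raw_gb:
            raise RuntimeError("not enough free space to build the new copy")
        free_raw_gb -= target                     # build the copy at the new level
        free_raw_gb += raw_needed(usable, *old)   # then release the source copy
        history.append(free_raw_gb)
    return history


# Two hypothetical 600GB volumes with 1000GB of free raw space to start.
print(convert_all([600, 600], 1000))
```

Each completed conversion hands back more space than it borrowed, so only enough free space for the volumes currently in flight is needed, rather than a set of blank disks.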

As time went on we loaded the system with ever more processes and connected ever more systems to it as we got off the old BlueArc(s). I kept seeing the IOPS (and disk response times) on the 3PAR going up..and up.. I really thought it was going to choke, I mean we were pushing the disks hard, with sustained disk response times in the 40-50ms range at times (with rare spikes to well over 100ms). I just kept hoping for the day when we would be done and the increase would level off, and it did for the most part eventually. I built my own custom monitoring system for the array for performance trending, since I didn't like the query based tool they provided as much as what I could generate myself (despite the massive amount of time it took to configure my tool).

I did not know a 7200RPM SATA disk could do 127 IOPS of random I/O.

We had this one process that would dump up to say 50GB of data from upwards of 40-50 systems simultaneously as fast as they could go. Needless to say when this happened it blew out all of the caches across the board and brought things to a grinding halt for some time (typically 30-60 seconds). I would see the behavior on the NAS system and log in to my monitoring tool and just see it hang while it tried to query the database (which was on the storage). I would cringe, and wait for the system to catch up. We tried to get them to re-design the application so it was more thoughtful of the storage but they weren't able to. Well they did re-design it one time (for the worse). I tried to convince them to put it on Fusion IO on local storage in the servers but they would have no part of it. Ironically not long after I left the company they went out and bought some Fusion IO for another project. I guess as long as the idea was not mine it was a good one.. The storage system was entirely a back office thing, no real time end user transactions ever touched it, which meant we could live with the higher latency by pushing the SATA drives 30-50% beyond engineering specifications.
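To put the 127 IOPS figure and the "30-50% beyond spec" comment in perspective, here is some back of the envelope math. The rotational latency part is plain physics; the average seek time is an assumed, illustrative value (not a number from any drive's data sheet), and the 127 IOPS peak and the 30-50% figure may not refer to the same measurement.

```python
def avg_rotational_latency_ms(rpm):
    """Half a revolution, on average, before the sector comes around."""
    return 60_000 / rpm / 2


def random_iops(rpm, avg_seek_ms):
    """Rule-of-thumb random IOPS: one seek plus half a rotation per I/O."""
    return 1000 / (avg_seek_ms + avg_rotational_latency_ms(rpm))


def implied_rated_iops(observed, over_fraction):
    """Rating implied if `observed` is `over_fraction` beyond it."""
    return observed / (1 + over_fraction)


print(f"{avg_rotational_latency_ms(7200):.2f} ms")   # 4.17 ms
# 8.5 ms average seek is an assumption for illustration only.
print(f"{random_iops(7200, 8.5):.0f} IOPS")
print(f"{implied_rated_iops(127, 0.5):.0f}-{implied_rated_iops(127, 0.3):.0f} IOPS")
```

The last line is just the range a drive would have to be rated at for 127 IOPS to be 30-50% over it; deeper queues and higher response times are what let a disk go past the simple one-I/O-at-a-time estimate.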

At the end of the first full year of operation we finally got budget to add capacity to the system. We had shrunk the overall theoretical I/O capacity by probably 2/3rds vs the previous array, and had absorbed what seemed like almost 200% growth on top of that during the first year, and the system held up. I probably wouldn't have believed it if I didn't see it (and live it) personally. I hammered 3PAR as often as I could to increase the addressable capacity of their systems, which was limited by the operating system architecture. It doesn't take a rocket scientist to see that their systems had 4GB of control cache (per controller), which is a common limit for 32-bit software. But the software enhancement never came while I was there at least. It is there in some respect in the new V-class, though as mentioned the V-class seems to have had an arbitrary raw capacity limit placed on it that does not align with the amount of control cache it can have (up to 32GB per controller). With 64-bit software and more control cache I could have doubled or tripled the capacity of the system without adding controllers.
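The 4GB number falls straight out of the address width. A tiny sketch of the arithmetic, assuming a flat address space and ignoring tricks like PAE:

```python
def addressable_gib(address_bits):
    """Memory reachable with a flat address of this width, in GiB."""
    return 2 ** address_bits / 2 ** 30


def min_address_bits(gib):
    """Smallest address width that can cover `gib` GiB (whole number) of memory."""
    total_bytes = gib * 2 ** 30
    return (total_bytes - 1).bit_length()


print(addressable_gib(32))     # 4.0
print(min_address_bits(4))     # 32
print(min_address_bits(32))    # 35
```

So 4GB of control cache is exactly what a 32-bit address can reach, and the 32GB per controller mentioned for the V-class needs at least 35 bits of address, which in practice means 64-bit software.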

Adding the two extra controllers did give us one thing I wanted - Persistent cache, that's just an awesome technology to have and you simply can't do that kind of thing on a 2-controller system. Also gave us more ports than I knew what to do with.

What happened to the BlueArc? Well after about 10 months of trying to find someone to sell it to - or give it to -- we ended up paying someone to haul it away. When HDS/BlueArc was negotiating with us on their solution they tried to harp on how we could leverage our existing disk from BlueArc in the new solution as another tier. I didn't have to say it my boss did which made me sort of giggle - he said the operational costs of running the old BlueArc disk (support was really high, + power and co-lo space) was more than the disks were worth, BlueArc/HDS didn't have any real response to that. Other than perhaps to nod their heads acknowledging that we're smart enough to realize that fact.

I still would like to use BlueArc again, I think it's a fine platform, I just want to use my own storage on it :)

This ended up being a lot longer than I expected! Hope you didn't fall asleep. Just got right to 2600 words.. there.

12 Dec 2010

Dell and Exanet: MIA

TechOps Guy: Nate

The thoughts around Dell buying Compellent made me think back to Dell's acquisition of the IP and some engineering employees of Exanet, as The Register put it, a crashed NAS company.

I was a customer and user of Exanet gear for more than a year, and at least in my experience it was a solid product, very easy to use, decent performance and scalable. The back end architecture to some extent mirrored the 3PAR hardware-based architecture but in software, really a good design in my opinion.

Basic Exanet Architecture

Their standard server at the time they went under was an IBM x3650, dual proc quad core Intel Xeon 5500-based platform with 24GB of memory.

Each server had multiple software processes called fsds, or file system daemons; they ran one fsd per core. Each fsd was responsible for a portion of the file system (x number of files); they load balanced it quite well, I never had to manually re-balance or anything. Each fsd was allocated its own memory space used for itself as well as cache; if I recall right the default was around 1.6GB per fsd.

Each NAS head unit had back end connectivity to all of the other NAS units in the cluster (minimum 2, maximum tested at the time they went under was 16). A request for a file could come in on any node, any link. If the file wasn't home to that node it would transparently forward the request to the right node/fsd to service the request on the back end. Much like how 3PAR's backplane forwards requests between controllers.

Standard for back end network was 10Gbps on their last models.

As far as data protection, the use of "commodity" servers did have one downside: they had to use UPS systems as their battery backup to ensure enough time for the nodes to shut down cleanly in the event of a power failure. This could present problems at some data centers as operating a UPS in your own rack can be complicated from a co-location point of view (think EPO etc). Another design similarity between Exanet and 3PAR is their use of internal disks to flush cache to, which is something I suppose Exanet was forced into doing. Other storage manufacturers use battery backed cache in order to survive power outages of some duration, but both Exanet and 3PAR dump their cache to an internal disk so that the power outage can last for a day, a week, or even a month and it won't matter, data integrity is not compromised.

32-bit platform

The only thing that held it back was they didn't have enough time or resources to make the system fully 64-bit before they went under; that would have unlocked a whole lot of additional performance they could have gotten. Being locked into a 32-bit OS really limited what they could do on a single node, and as processors became ever more powerful they really had to make the jump to 64-bit.

Exanet was entirely based on "commodity" hardware, not only were they using x86 CPUs but their NAS controllers were IBM 2U rackmount servers running CentOS 4.4 or 4.5 if I recall right.

To me, as previous posts have implied, if you're going to base your stuff on x86 CPUs, go all out, it's cheap anyways. I would have loved to have seen a 32-48 core Exanet NAS controller with 512GB-1TB of memory on it.

Back to Dell

Dell originally went into talks with Exanet a while back because Exanet was willing to certify Equallogic storage as a back end provider of disk to an Exanet cluster, using iSCSI in between the Exanet cluster and the Equallogic storage, since nobody else in the industry seemed willing to have their NAS solution talk to a back end iSCSI system. As far as I know the basic qualifications for this solution were completed in 2009, quite a ways before they ran out of cash.

Why did Exanet go under? I believe primarily because the market they were playing in was too small with too few players in it, not enough deals to go around, so whoever had the most resources to outlast the rest would come out on top. In this case I believe it was Isilon, even though they too were taken out by EMC; from the looks of their growth it didn't seem like they were in a fine position to continue to operate independently. With Ibrix and Polyserve going to HP and Onstor going to LSI, I'm still convinced BlueArc will go to HDS at some point (they are once again filing for IPO but word on the street is they aren't in very good shape), I suspect after they fail to IPO and go under. They have a very nice NAS platform, but HDS has their hands tied in supporting 3rd party storage other than HDS product; BlueArc OEMs LSI storage like so many others.

About a year ago SGI OEM'd one of BlueArc's products though recently I have looked around the SGI site and see no mention of it. Either they have abandoned it (more likely) or are just really quiet. Since I know SGI is also a big LSI shop I wonder if they are making the switch to Onstor. One industry insider I know suspects LSI is working on integrating the Onstor technology directly into their storage systems rather than having an independent head unit, which makes sense if they can make it work.

But really my question is why hasn't Dell announced anything related to the Exanet technology? They could have, quite possibly within a week or two, had a system running and certified on Dell PowerEdge equipment and been selling to both existing Exanet customers as well as new ones. The technology worked fine, it was really easy to set up and use, and it's not as if Dell has another solution in house that competes with it. AND since it was an entirely software based solution there were really no costs involved in manufacturing. Exanet had more than one PB-sized deal in the works at the time they went under, that's a lot of good will Dell just threw away. But hey, what do you expect, it's Dell. Thankfully they didn't get their dirty paws on 3PAR.

When I looked at how a NetApp system was managed compared to the Exanet my only response was You're kidding, right?

Time will tell if anything ever comes of the technology.

I really wanted 3PAR to buy them of course, they were very close partners with 3PAR and both pitched each other's products at every opportunity. Exanet would go out of their way to push 3PAR storage whenever possible because they knew how much trouble the LSI storage could be, and they were happy to get double the performance per spindle off 3PAR vs LSI. But I never did get an adequate answer out of 3PAR as to why they did not pursue Exanet; they were in the early running but pulled out for whatever reason. The price tag of less than $15M was a steal.

Now that 3PAR is with HP we'll see what they can do with Ibrix, I knew of more than one customer that migrated off of things like Ibrix and Onstor to Exanet, HP has been pretty silent about Ibrix since they bought them as far as I know. I have no idea how much R&D they have pumped into it over the years or what their plans might be.
