TechOpsGuys.com Diggin' technology every day

September 22, 2010

The Cloud: Grenade fishing in a barrel

Filed under: Datacenter — Tags: — Nate @ 10:03 pm

I can’t help but laugh. I mean, I’ve been involved in several initiatives surrounding the cloud, and so many people out there think the cloud is efficient and cost-effective. Whoever came up with the whole concept deserves to have their own island (or country) by now.

Because, really, competing against the cloud is like grenade fishing in a barrel. Shooting fish in a barrel isn’t easy enough, really it’s not!

Chuck from EMC talked earlier in the year with the folks at Pfizer about their use of the Amazon cloud and the real story behind it. It's an interesting read that really shows the value you can get from the cloud if you use it right.

R+D’s use of HPC resources is unimaginably bursty and diverse, where on any given day one of 1000 different applications will be run. Periodically enormous projects (of very short duration!) come up very quickly, driven by new science or insights, which sometimes are required to make key financial or strategic decisions with vast amounts of money at stake for the business.

As a result, there’s no real ability to forecast or plan in any sort of traditional IT sense.  The HPC team has to be able to respond in a matter of days to huge requests for on-demand resources — far outside the normal peaks and valleys you’d find in most traditional IT settings.

But use cases like that are still few and far between. Contrast that with the use cases for having your own cloud (of sorts), where there is a lot more value. It would not surprise me if, over time, Pfizer keeps expanding its internal HPC capacity as it gets a better grasp of its average utilization rate, and hosts more and more work internally instead of going to Amazon. In these early days they simply don't have enough data to predict how much capacity they need. They may never get completely out of the cloud; I'm just saying the high watermark (for lack of a better term) can be monitored so that there is less significant "bursting" to the cloud.

Now if Pfizer can never really get a grip on forecasting their HPC requirements, then they might just keep using the cloud, but I suspect at the end of the day they will get better at forecasting. They obviously have the talent internally to manage this very tricky balance of cloud and internal HPC. The cloud people would have you believe it's a simple thing to do; it's really not, especially for off-the-shelf applications. If you had seen the numbers I have seen, you'd shake your head too. That was more or less my reaction when I did come across a genuinely good use case for the cloud earlier this year.

I could see paying a lot more for premium cloud services if I got more, but I don't get more; in fact I get less, a LOT less, than doing it myself. For my own personal "server" in the Terremark cloud I can live with it, since my needs are tiny. (Though now that I think about it, they couldn't even give me a second NAT address for a second VM for SMTP purposes. I had to open a second account and put the second VM in it just to get the address. The cost is the same for me either way, but it's more complicated than it should be, and having two accounts with the same name confused their back end badly enough that I had to engage support more than once to get all the issues fixed.) But for real work, no way.

Yet so many sheep out there still buy the hype, hook, line and sinker.

Which can make jobs for people like me harder. I've heard the same story time and time again from people in my position: the PHBs are so sold on the cloud concept that they can't comprehend why it's so much more expensive than doing it yourself, so they want you to justify it six ways from Sunday. They know there's something wrong with your math but they can't say what, so they want you to try to prove yourself wrong when you're not. It works out at the end of the day; it just takes some time to break through that wall.

Then there's the argument the cloud people make. I was involved in one deal earlier in the year, the usual situation, and the cloud provider asked, "Well, do you really have the staff to manage all of this?" I said, "IT IS A RACK AND A HALF OF EQUIPMENT. HOW MANY PEOPLE DO I NEED, REALLY?" They were just as oblivious to that as the PHBs were to the cloud costs.

On an unrelated note, has anyone else experienced massive slowdowns with Wikipedia's DNS infrastructure? It takes FOREVER to resolve their domains for me, while every other domain resolves quickly. I run my own DNS, so maybe there is something wrong on my end; I haven't investigated yet.
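If I ever do investigate, the first sanity check would just be timing lookups through my own resolver. Something like this quick Python sketch would do it (the domain list is just what I would happen to test, and it uses whatever resolver the system is configured with):

    import socket
    import time

    # Time a few lookups through the system's configured resolver.
    # wikipedia.org is the slow one for me; the rest are controls.
    for name in ("wikipedia.org", "en.wikipedia.org", "google.com", "theregister.co.uk"):
        start = time.time()
        try:
            addr = socket.gethostbyname(name)
            print("%-22s %-15s %6.3f sec" % (name, addr, time.time() - start))
        except socket.gaierror as err:
            print("%-22s lookup failed (%s) after %.3f sec" % (name, err, time.time() - start))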

September 21, 2010

Online Schema Changes for MySQL

Filed under: General — Tags: , — Nate @ 9:08 pm

Looks like Facebook released a pretty cool tool that apparently provides the ability to perform MySQL schema changes online, something most real databases take for granted.
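I haven't dug through Facebook's actual code, but the general shadow-table trick behind online schema changes is easy to sketch: build a copy of the table with the new schema, capture ongoing writes with triggers, bulk-copy the existing rows, then atomically swap names. A heavily simplified Python sketch follows; the table and column names, connection details, and the single-trigger shortcut are all made up for illustration, and the real tool copies rows in chunks and handles updates, deletes and a pile of edge cases this ignores.

    import MySQLdb  # assumes the MySQL-python driver and a reachable test server

    # Hypothetical table with columns (id, last_login); the change adds an index.
    table, shadow = "user_profile", "user_profile__osc_new"

    steps = [
        # 1. Empty copy of the table, with the schema change applied to the copy.
        "CREATE TABLE %s LIKE %s" % (shadow, table),
        "ALTER TABLE %s ADD INDEX idx_last_login (last_login)" % shadow,
        # 2. Capture new writes to the live table (real tools also handle UPDATE and
        #    DELETE via a deltas table; one INSERT trigger keeps the sketch short).
        "CREATE TRIGGER %s_osc_ins AFTER INSERT ON %s FOR EACH ROW "
        "INSERT INTO %s (id, last_login) VALUES (NEW.id, NEW.last_login)" % (table, table, shadow),
        # 3. Bulk-copy existing rows (the real tool does this in small chunks).
        "INSERT IGNORE INTO %s SELECT * FROM %s" % (shadow, table),
        # 4. Atomic swap; the old table can be dropped afterwards.
        "RENAME TABLE %s TO %s_old, %s TO %s" % (table, table, shadow, table),
    ]

    conn = MySQLdb.connect(host="localhost", user="dba", passwd="secret", db="app")
    cur = conn.cursor()
    for sql in steps:
        cur.execute(sql)
    conn.commit()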

Another thing noted by our friends at The Register was how extensively Facebook leverages MySQL. I was working on a project revolving around Apache Hadoop, and someone involved with it was under the incorrect assumption that Facebook stores most of its data in Hadoop.

At Facebook, MySQL is the primary repository for user data, with InnoDB the accompanying storage engine.
[..]
All Callaghan will say is that the company runs “X thousands” of MySQL servers. “X” is such a large number, the company needed a way of making index changes on live machines.

I wouldn't be surprised if they had a comparable number of MySQL servers to servers running Hadoop. After all, Yahoo! is the biggest Hadoop user, and at my last count had "only" about 25,000 servers running the software.

It certainly is unfortunate to see so many people out there latch onto some solution and think it can solve all of their problems.

Hadoop is a good example; lots of poor assumptions are made about it. It's designed to do one thing, and it does that one thing fairly well. But when you think you can adapt it into a more general-purpose storage system, it starts falling apart, which is completely understandable: it wasn't designed for that purpose. Many people don't understand that simple point, though.

Another poor use of Hadoop is trying to shoehorn a real-time application on top of it; it just doesn't work. Yet there are people out there (I've talked to some of them in person) who have devoted significant developer resources to that angle. Spend thirty minutes researching the topic and you realize pretty quickly that it is a wasted effort. Google couldn't even do it!

Speaking of Hadoop, and Oracle for that matter, it seems Oracle announced a Hadoop-style system yesterday at OpenWorld, only Oracle's version appears to be orders of magnitude faster (and orders of magnitude more expensive, given the amount of flash it uses).

Using the skinnier and faster SAS disks, Oracle says that the Exadata X2-8 appliance can deliver up to 25GB/sec of raw disk bandwidth on uncompressed data and 50GB/sec across the flash drives. The disks deliver 50,000 I/O operations per second (IOPs), while the flash delivers 1 million IOPs. The machine has 100TB of raw disk capacity per rack and up to 28TB of uncompressed user data. The rack can load data at a rate of 5TB per hour. Using the fatter disks, the aggregate disk bandwidth drops to 14GB/sec, but the capacity goes up to 336TB and the user data space grows to 100TB.

The system is backed by an InfiniBand network; I didn't notice specifics but assume 40Gbps per system.

Quite impressive indeed. Like Hadoop, this Exadata system is optimized for throughput; it can do IOPS pretty well too, but it's clear that throughput is the goal. By contrast, a more traditional SAN manages single-digit gigabytes per second even at the ultra high end, at least on the industry-standard SPC-2 benchmark.

  • IBM DS8700 rated at around 7.2 Gigabytes/second with 256 drives and 256GB cache costing a cool $2 million
  • Hitachi USP-V rated at around 8.7 Gigabytes/second with 265 drives and 128GB cache costing a cool $1.6 million

Now it's not really an apples-to-apples comparison of course, but it gives some frame of reference.
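To put a rough number on that frame of reference, the back-of-the-envelope math on cost per GB/sec of SPC-2-style throughput for the two arrays above looks like this (using only the prices and results quoted; Exadata is left out since I don't have its pricing):

    # Cost per GB/sec of large-block throughput, using the figures quoted above.
    arrays = {
        "IBM DS8700":    (7.2, 2000000),   # GB/sec, approximate price in USD
        "Hitachi USP-V": (8.7, 1600000),
    }
    for name, (gb_per_sec, price) in sorted(arrays.items()):
        print("%-14s ~$%.0fk per GB/sec" % (name, price / gb_per_sec / 1000.0))

Roughly $278k and $184k per GB/sec of throughput, respectively.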

It seems to scale really well according to Oracle –

Ellison is taking heart from the Exadata V2 data warehousing and online transaction processing appliance, which he said now has a $1.5bn pipeline for fiscal 2011. He also bragged that at Softbank, Teradata’s largest customer in Japan, Oracle won a deal to replace 60 racks of Teradata gear with three racks of Exadata gear, which he said provided better performance and which had revenues that were split half-and-half on the hardware/software divide.

From 60 to 3? Hard to ignore those sorts of numbers!

Oh, and speaking of Facebook, and Hadoop, and Oracle: as part of my research into Hadoop I came across this. I don't know how up to date it is, but I thought it was neat. Oracle DB is one product I do miss using. The company is filled with scumbags, to be sure; I had to educate their own sales people on their licensing the last time I dealt with them. But it is a nice product, it works really well, and in my opinion it's pretty easy to use, especially with Enterprise Manager (cursed by DBAs from coast to coast, I know!). It makes MySQL look like a text-file-based key-value store by comparison.

Anyways onto the picture!

Oh my god! Facebook is not only using Hadoop, but they are using MySQL, normal NAS storage, and even Oracle RAC! Who’da thunk it?

Is there a tool or solution that does everything well? The more generic the approach, the harder it is to pull off, which is why solutions that come close typically cost a significant amount of money: there is significant value in what the product provides. If even the largest open source platform in the world (Linux) hasn't been able to do it (look how many big-time open source advocates run OS X on their desktops, yet how few run OS X on their servers), who can?

That’s what I thought.

(posted from my Debian Lenny workstation with updates from my Ubuntu Lucid Lynx laptop)

September 17, 2010

No more Cranky Geeks?

Filed under: News,Random Thought — Tags: — Nate @ 7:46 am

What!! I just noticed that the only online video feed I watch, Cranky Geeks, seems to be coming to an end? That sucks! I didn't stumble upon the series until about a year and a half ago on my TiVo, and I've been a big fan ever since. I rarely learned anything from the shows, but I did like observing the conversations. It's not quite the technical depth I get into, but it's a far cry from the typical "tech TV" shows that never seem to go beyond overclocking and which motherboard and video card to use for the latest games.

I know I'm a hell of a lot crankier than anyone I ever saw on the show, but they did bitch about some things. Quite a few video blogs, for lack of a better word, have bitten the dust in recent months; I guess the economy is taking its toll.

[Begin Another Tangent –]

I believe we are entering the second phase of the great depression (how long until we are solidly in it, I'm not sure; we won't know until we're there): the phase where states realize their budget shortfalls are too big for short-term budget gimmicks and make drastic cuts and tax hikes, which further damage the economy. I don't blame anyone in particular for our situation; it has been festering for more than thirty years. It's like trying to stop an avalanche with, I don't know, a snow plow?

This is what happens when you give people every incentive possible to pull demand forward: eventually you run out of gimmicks and are faced with a very large chasm that will only be healed by time. Just look at Japan.

I have seen lots of folks say this is not as bad as the real Great Depression, but they aren't taking into account the massive social safety nets deployed over the past 40-50+ years. I just saw a news report last night saying the rate of poverty among children is the same as it was in the 1960s. And the cost of living in the U.S. is so high that an income at the U.S. poverty line would put you in the upper middle class in many other countries.

Not sustainable, and as time goes on more and more people are realizing it. Unfortunately it is too late for many; they will be left behind, permanently.

My suggestion? Read the infrastructure report card. Yes, I know infrastructure spending is not a short-term stimulus, but we should take advantage of lower prices for wages and materials and rebuild the country. It will take years, maybe even a couple of decades, but we need it. Long-term problems call for long-term solutions.

[End Another Tangent –]

I hope the show doesn't go away, but it looks like it's essentially gone, and I just added the link to the blogroll a few days ago!

Noticed this from John in the comments –

The two companies couldn’t come to any agreement. This is a problem when you personally do not own the show. The fact is the show is not what advertising agencies want. They want two minute shows with a 15 second pre-roll ad at the beginning. They see no market for anything with a long format unless it is on network TV.

The irony is that the demographics for the show should be at $100/per k levels if they understood anything at all.

It’s amazing that we managed to get 4 1/2 years out of the show.

RIP

Sigh

RIP Cranky Geeks, I shall miss you greatly.

September 16, 2010

How High?

Filed under: Random Thought — Tags: , — Nate @ 6:35 pm

I have a little applet on my Ubuntu desktop that tracks a few stocks of companies I'm interested in (I don't invest in anything), and I thought it was pretty crazy how close the 3PAR stock price got to the offer price today: as high as $32.98. Everyone of course knows the final price will be $33; that folks are trading the stock with only $0.02 of margin seems pretty insane to me.

It looks a fair sight better than the only public company I have ever worked for; I'm surprised they're even still around!

I never bought any options, which was a good thing I guess, because from the day I was hired the stock never did anything but go down. I think my options were in the ~$4.50 range (this was 2000-2002).

Just dug this up; I remember being so proud that my company was on TV! Not quite as weird as watching the freeinternet.com commercials back when I worked there, a company that spent $7 million a month on bandwidth it didn't know it had and wasn't utilizing. Of course, by the time they found out it was too late.

My company at the top of the list! I miss Tom Costello; he was a good NASDAQ floor guy. The screenshot is from March 2002. Also crazy that the Dow is only 68 points higher today than it was eight years ago.

Fusion IO now with VMware support

Filed under: Storage,Virtualization — Tags: , , , , , — Nate @ 8:58 am

About damn time! I read earlier in the year on their forums that they were planning ESX support for their next release of code, originally expected sometime in March/April or so. But that time came and went, and I saw no new updates.

I saw that Fusion IO put on a pretty impressive VDI demonstration at VMworld, so I figured they must have VMware support now, and of course they do.

I would be very interested to see how performance could be boosted and VM density increased by leveraging local Fusion IO storage for swap in ESX. I know of a few 3PAR customers that say they get double the VM density per host versus other storage because of the better I/O they get from 3PAR, though of course Fusion IO is quite a bit snappier still.

With VMware's ability to set the swap file location on a per-host basis, it's pretty easy to configure. To take advantage of it, though, I think you'd have to disable memory ballooning in the guests to force the host to swap. I don't think I would go so far as to put individual guest swap partitions on the local Fusion IO for the guests to swap to directly, at least not when I'm using a shared storage system.
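For what it's worth, disabling ballooning per guest is just an advanced setting in the VM's configuration (sched.mem.maxmemctl = "0" caps the balloon at zero). Below is a rough helper sketch; editing the .vmx offline, the example path, and the helper itself are my own assumptions, and the VM would need to be powered off for the change to stick:

    import re
    import sys

    def disable_ballooning(vmx_path):
        """Set sched.mem.maxmemctl = "0" in a guest's .vmx, replacing any existing value."""
        with open(vmx_path) as f:
            lines = [l for l in f if not re.match(r'\s*sched\.mem\.maxmemctl\s*=', l)]
        lines.append('sched.mem.maxmemctl = "0"\n')
        with open(vmx_path, "w") as f:
            f.writelines(lines)

    if __name__ == "__main__":
        # e.g. python disable_balloon.py /vmfs/volumes/datastore1/myvm/myvm.vmx
        disable_ballooning(sys.argv[1])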

I just checked again, and as far as I can tell the only blade player offering Fusion IO modules is still the HP c-Class, in the form of their IO Accelerator. With up to two expansion slots on the half-width blades and three on the full-width blades, there's plenty of room for the 80 or 160GB SLC models or the 320GB MLC model. And if you were really crazy, I guess you could use the "standard" Fusion IO cards with the blades via the PCI Express expansion module, though that seems more geared towards video cards, as upcoming VDI technologies leverage hardware GPU acceleration.

HP’s Fusion IO-based I/O Accelerator

Fusion IO claims the device can write 5TB per day for 24 years; even if you cut that to 2TB per day for 5 years, it's quite an amazing claim.
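The arithmetic behind that claim is fun on its own; a couple of lines of Python puts it in perspective (just multiplying out the numbers above):

    # Total data written under Fusion IO's claim vs. a much more conservative cut.
    claimed_tb      = 5 * 365 * 24   # 5TB/day for 24 years -> 43,800 TB (~43.8 PB)
    conservative_tb = 2 * 365 * 5    # 2TB/day for 5 years  ->  3,650 TB (~3.65 PB)
    print("claimed: ~%.1f PB, conservative: ~%.2f PB"
          % (claimed_tb / 1000.0, conservative_tb / 1000.0))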

From what I have seen (I can't speak from personal experience just yet), the biggest advantage Fusion IO has over more traditional SSDs is write performance; of course, to get optimal write performance you do need to sacrifice space.

Unlike drive form factor devices, the ioDrive can be tuned to achieve a higher steady-state write performance than what it is shipped with from the factory.

September 15, 2010

Time to drop a tier?

Filed under: Networking — Tags: , — Nate @ 8:30 am

Came across an interesting slide show at Network World, The Ultimate Guide to the Flat Data Center Network. From page 7:

All of the major switch vendors have come out with approaches that flatten the network down to two tiers, and in some cases one tier. The two-tier network eliminates the aggregation layer and creates a switch fabric based on a new protocol dubbed TRILL for Transparent Interconnection of Lots of Links. Perlman is a member of the IETF working group developing TRILL.

For myself, I have been designing two-tier networks for about six years now with my favorite protocol, ESRP. I won't go into too much detail this time around (click the link for an in-depth article), but here is a diagram I modified from Extreme to show what my deployments have looked like:

Sample ESRP Mesh network

ESRP is simple to manage, scalable, and mature, and with a mesh design like the one above, the only place it needs to run is on the core. The edge switches can be any model from any vendor; managed and even unmanaged switches will work without trouble. Failover is sub-second, not quite the 25-50ms that EAPS provides for voice-grade service. I haven't had any way to measure it accurately, but I'd say it's reasonable to expect a ~500ms failover in an all-Extreme network (where the switches communicate via EDP), or ~750-1000ms with switches that are not Extreme.

Why ESRP? Well because as far as I have seen since I started using it, there is no other protocol on the market that can do what it can do (at all, let alone as easily as it can do it).

Looking at TRILL briefly, it is unclear to me whether it provides layer 3 fault tolerance or whether you still must use a second protocol like VRRP, ESRP or HSRP (ugh!) to do it.

The indication I get is that it is a layer 2-only protocol. If that is the case, it seems very short-sighted to design a fancy new protocol like that and not integrate at least optional layer 3 support; we've been running layer 3 on switches for more than a decade.

In case you didn't know, or didn't click the link yet, ESRP by default runs at both layer 2 and layer 3, though it can optionally be configured to run at only one layer if you prefer.

September 12, 2010

Crazy Seagate Statistics

Filed under: Storage — Tags: — Nate @ 1:09 pm

It's been a while since I visited Seagate's blog, but I just did, and the numbers in the most recent entry are pretty eye-popping.

  • A drive’s recording head hovers above the disks at a height of 100 atoms, 100 times thinner than a piece of paper
  • Seagate clean rooms are 100 times cleaner than a hospital operating room
  • Seagate can analyze over 1.5 Million drives at a time
  • Seagate builds 6 hard drives, hybrid drives, and solid state drives every second
  • Every single drive travels through over 1000 manufacturing steps

[Begin First Tangent –]

If you're using a Seagate SATA disk, do yourself a favor and don't let the temperature of the drive drop below 20 degrees Celsius 🙂

I read an interesting article recently on the various revenue numbers of the big drive manufacturers, and the numbers were surprising to me.

Hitachi GST had revenues of $4.8bn in 2009.
[..]
Seagate’s fiscal 2010 revenue of $11.4bn
[..]
Western Digital’s latest annual revenue of $9.8bn

I really had no idea Western Digital was so big! Since they don't (and I'm not sure they ever did) participate in the SCSI / Fibre Channel / SAS arena, that leaves them out of the enterprise space for the most part (I never really saw their Raptor line of drives get adopted, too bad!). Of course "enterprise SATA" has taken off quite a bit in recent years, but I would think it still pales in comparison to enterprise SAS/SCSI/FC. But maybe not; I don't know, I haven't looked into the details.

I thought Hitachi was a lot bigger, especially since Hitachi bought the disk division from IBM way back when. I used to be a die-hard fan of IBM disks, up until the 75GXP fiasco; I'm still wary of them even now. I still have a CD-ROM filled with "confidential" information from the class action suit I played a brief part in (the judge kicked me out because he wanted to consolidate the suit to folks in California). Very interesting stuff, not that I remember much of it; I haven't looked at it in years.

The 75GXP was the only drive where I've ever suffered a "double disk failure" before I could get a replacement in, and it only happened once. My company had three "backup" servers, one at each office site. Each one had, I think, 5 x 100GB disks (or was it another size; this was back in 2001) in RAID 5, connected to a 3Ware 7000-series controller. One Friday afternoon one of the disks in my local office's server failed, so I called to get an RMA; about two hours later another disk failed in a remote office, so I called to get that one RMA'd too. The next day the replacement disk for my local server arrived, but it was essentially DOA from what I recall, so the system kept running in degraded mode (come on, how many people's servers in 2001 had hot spares? That's what I thought). There was nobody in the office with the other degraded server, so its drive was set to arrive on Monday to be replaced. On Sunday that same weekend a second disk in the remote server failed, killing the RAID array of course. In the end that particular case wasn't a big deal; it was a backup server after all, and everything on it was duplicated at least once at another site. But it was still a pain. If memory serves I had a good 15-20 75GXP disks fail over the period of a year or so (home and work combined), all of them at what I would consider low duty cycle, hardly being stressed at all. In all cases the data lost wasn't a big deal; the bigger pain was re-installing the systems, which took more time than anything else. Especially the Solaris systems.

[End First Tangent –]
[Begin Second Tangent — ]

One thing that brings back fond childhood memories related to Seagate is where they are based: Scotts Valley, California. I wouldn't consider it part of Silicon Valley itself, but it's about as close as you can get. I spent a lot of time in Scotts Valley as a kid; I grew up in Boulder Creek, California (until I was about 12 anyway), which is about 10 miles from Scotts Valley. I considered it (and probably still do) the first "big town" near home, the one with things like movie theaters and arcades. I didn't find out Seagate was based there until a few years ago, but for some reason it makes me proud(?) that such a giant is located in such a tiny town so close to what I consider home.

[End Second Tangent –]

Google waves goodbye to Mapreduce

Filed under: News — Tags: , — Nate @ 9:05 am

From the group of people that brought the MapReduce algorithm to a much broader audience (despite the concepts being decades old): Google has now outgrown it and is moving on, according to our friends at The Register.

The main reason is that MapReduce was hindering their ability to provide near real-time updates to their index, so they migrated their search infrastructure to a Bigtable distributed database. They also optimized the next-generation Google File System for this database, making it inappropriate for more general uses.

MapReduce is a sequence of batch operations, and generally, Lipkovits explains, you can’t start your next phase of operations until you finish the first. It suffers from “stragglers,” he says. If you want to build a system that’s based on series of map-reduces, there’s a certain probability that something will go wrong, and this gets larger as you increase the number of operations. “You can’t do anything that takes a relatively short amount of time,” Lipkovitz says, “so we got rid of it.”

I have to wonder how much this new distributed-database-backed index was responsible for Google's ability to absorb upwards of a seven-fold increase in search traffic when the Google Instant feature launched.

I interviewed at a company a couple of months ago that was trying to use Hadoop and MapReduce for near real-time operations (the product had not launched yet), and I thought that wasn't a very good use of the technology. It's a batch processing system. Google of course realized this and ditched it when it could no longer scale to the performance levels they needed (despite an estimated 1.8 million servers at their disposal).
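To make the batch-barrier point concrete, here is a toy sketch in plain Python (nothing to do with Hadoop's actual implementation): each phase is a full barrier, so nothing downstream can start, and no result can be served, until the entire previous phase finishes, and chaining more phases just adds more barriers.

    from collections import defaultdict

    def map_phase(records, map_fn):
        # Every record must be mapped before any reduce can begin.
        return [kv for rec in records for kv in map_fn(rec)]

    def reduce_phase(pairs, reduce_fn):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return dict((key, reduce_fn(key, values)) for key, values in groups.items())

    docs = ["the cloud is hype", "the index is batch"]

    # Phase 1: word count.
    counts = reduce_phase(map_phase(docs, lambda d: [(w, 1) for w in d.split()]),
                          lambda word, ones: sum(ones))

    # Phase 2 consumes phase 1's *complete* output; a new document arriving now
    # means rerunning both phases from scratch rather than updating incrementally.
    by_count = reduce_phase(map_phase(list(counts.items()), lambda kv: [(kv[1], kv[0])]),
                            lambda count, words: sorted(words))
    print(by_count)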

As more things move closer to real time, I can't help but wonder when all those other companies that have hopped on the Hadoop/MapReduce bandwagon will realize this and try once again to follow the bread crumbs Google is dropping.

I just hope for those organizations, that they don’t compete with Google in any way, because they will be at a severe disadvantage from a few angles:

  • Google has a near-infinite amount of developer resources internally, and as one Yahoo! person said, "[Google] clearly demonstrated that it still has the organizational courage to challenge its own preconceptions."
  • Google has near-infinite hardware capacity and economies of scale. What another company might pay $3,000-5,000 for, Google probably pays less than $1,000 for. They are the largest single organization using computers in the world. They are known for getting custom CPUs, and everyone operating at cloud scale uses specialized motherboard designs. They build their own switches and routers (maybe). Though last I heard, they are still a massive user of Citrix NetScaler load balancers.
  • Google of course operates its own high-end, power-efficient data centers, which means they get many more servers per kW than you can in a typical data center. I wrote earlier in the year about a new container that supports 45 kilowatts per rack, more than ten times your average data center.
  • Google is the world's third largest internet carrier and, thanks to peering agreements, pays almost nothing for bandwidth.

Google will be releasing more information about their new system soon; I can already see the army of minions out there gearing up to try to duplicate the work and remain competitive. Ha ha! I wouldn't want to be them, that's all I can say 🙂

[Google’s] Lipkovitz stresses that he is “not claiming that the rest of the world is behind us.”

Got to admire the modesty!

September 9, 2010

Availability vs Reliability with ZFS

Filed under: Storage — Tags: — Nate @ 7:37 pm

ZFS doesn't come up for me all that often, but with the recent news of the settlement of the lawsuits I wanted to talk a bit about it.

It all started about two years ago, when some folks at the company I was at proposed using cheap servers and ZFS to address our 'next generation' storage needs. At the time we had a bunch of tier 2 storage behind some really high-end NAS head units (not configured in any sort of fault-tolerant manner).

Anyway, in doing some research I came across a fascinating email thread; the most interesting post was this one, and I'll reproduce it here because I really couldn't have said it better myself –

I think there’s a misunderstanding concerning underlying concepts. I’ll try to explain my thoughts, please excuse me in case this becomes a bit lengthy. Oh, and I am not a Sun employee or ZFS fan, I’m just a customer who loves and hates ZFS at the same time

You know, ZFS is designed for high *reliability*. This means that ZFS tries to keep your data as safe as possible. This includes faulty hardware, missing hardware (like in your testing scenario) and, to a certain degree, even human mistakes.

But there are limits. For instance, ZFS does not make a backup unnecessary. If there’s a fire and your drives melt, then ZFS can’t do anything. Or if the hardware is lying about the drive geometry. ZFS is part of the operating environment and, as a consequence, relies on the hardware.

so ZFS can’t make unreliable hardware reliable. All it can do is trying to protect the data you saved on it. But it cannot guarantee this to you if the hardware becomes its enemy.

A real world example: I have a 32 core Opteron server here, with 4 FibreChannel Controllers and 4 JBODs with a total of [64] FC drives connected to it, running a RAID 10 using ZFS mirrors. Sounds a lot like high end hardware compared to your NFS server, right? But … I have exactly the same symptom. If one drive fails, an entire JBOD with all 16 included drives hangs, and all zpool access freezes. The reason for this is the miserable JBOD hardware. There’s only one FC loop inside of it, the drives are connected serially to each other, and if one drive dies, the drives behind it go downhill, too. ZFS immediately starts caring about the data, the zpool command hangs (but I still have traffic on the other half of the ZFS mirror!), and it does the right thing by doing so: whatever happens, my data must not be damaged.

A “bad” filesystem like Linux ext2 or ext3 with LVM would just continue, even if the Volume Manager noticed the missing drive or not. That’s what you experienced. But you run in the real danger of having to use fsck at some point. Or, in my case, fsck’ing 5 TB of data on 64 drives. That’s not much fun and results in a lot more downtime than replacing the faulty drive.

What can you expect from ZFS in your case? You can expect it to detect that a drive is missing and to make sure, that your _data integrity_ isn’t compromised. By any means necessary. This may even require to make a system completely unresponsive until a timeout has passed.

But what you described is not a case of reliability. You want something completely different. You expect it to deliver *availability*.

And availability is something ZFS doesn’t promise. It simply can’t deliver this. You have the impression that NTFS and various other Filesystems do so, but that’s an illusion. The next reboot followed by a fsck run will show you why. Availability requires full reliability of every included component of your server as a minimum, and you can’t expect ZFS or any other filesystem to deliver this with cheap IDE hardware.

Usually people want to save money when buying hardware, and ZFS is a good choice to deliver the *reliability* then. But the conceptual stalemate between reliability and availability of such cheap hardware still exists – the hardware is cheap, the file system and services may be reliable, but as soon as you want *availability*, it’s getting expensive again, because you have to buy every hardware component at least twice.

So, you have the choice:

a) If you want *availability*, stay with your old solution. But you have no guarantee that your data is always intact. You’ll always be able to stream your video, but you have no guarantee that the client will receive a stream without drop outs forever.

b) If you want *data integrity*, ZFS is your best friend. But you may have slight availability issues when it comes to hardware defects. You may reduce the percentage of pain during a disaster by spending more money, e.g. by making the SATA controllers redundant and creating a mirror (than controller 1 will hang, but controller 2 will continue working), but you must not forget that your PCI bridges, fans, power supplies, etc. remain single points of failures why can take the entire service down like your pulling of the non-hotpluggable drive did.

c) If you want both, you should buy a second server and create a NFS cluster.

Hope I could help you a bit,

Ralf

The only thing somewhat lacking from the post is that it makes creating an NFS cluster sound like it isn't a very complex thing to do either. Tightly coupling anything is pretty complicated, especially when it needs to be stateful (for lack of a better word); in this case the data must be in sync, and then there's the IP-level fault tolerance, optional MAC takeover, handling failover of NFS clients to the backup system, failing back, performing online upgrades, etc. Just look at the guide Red Hat wrote for building an HA NFS cluster with GFS, or just look at the diagram on page 21 if you don't want to read the whole thing! Hell, I'll put the diagram here because we need more color; note that Red Hat forgot network and fiber switch fault tolerance –

That. Is. A. Lot. Of. Moving. Parts. I was actually considering deploying this at a previous company (not the one where the ZFS discussion came up), but budgets were slashed and I left shortly before the company (and the economy) really nosedived.

Also note that the above example only covers the NFS portion of the cluster; they don't talk about how the back-end storage is protected. GFS is a shared file system, so the assumption is that you are operating on a SAN of some sort. In my case I was planning to use our 3PAR E200 at the time.

It's unlike, say, providing fault tolerance for a network device (setting aside stateful firewalls in this example), because the TCP stack in general is a very forgiving system. Storage, on the other hand, makes so many assumptions about things "just working" that, as you know as well as I do, when storage breaks, everything above it usually breaks hard too, and in crazy complicated ways (I just love seeing that "D" state in the Linux process list after a storage event). Stateful firewall replication is fairly simple by contrast.

Also, I suspect all the fancy data-integrity protection is for naught when running ZFS on top of things like RAID controllers or higher-end storage arrays, because of the added abstraction layer(s) that ZFS has no control over, which is probably why so many folks prefer to do the RAID in ZFS itself and use "raw" disks.

I think ZFS has some great concepts in it. I've never used it because its usability on Linux has been very limited (and I haven't had a need big enough to justify deploying a Solaris system), but I certainly give mad props to the evil geniuses who created it.

ZFS Free and clear.. or is it?

Filed under: News,Random Thought,Storage — Tags: , — Nate @ 7:03 pm

So, NetApp and Oracle kissed and made up recently over the ZFS lawsuits they had against each other, according to our best friends at The Register

Whatever the reasons for the mutual agreement to dismiss the lawsuits, ZFS technology product users and end-users can feel relieved that a distracting lawsuit has been cleared away.

Since the terms of the settlement, or whatever you want to call it, have not been disclosed and there has been no apparent further comment from either side, I certainly wouldn't jump to the conclusion that other ZFS users are in the clear. I view it this way: if you're running ZFS on Solaris you're fine, and if you're using OpenSolaris you're probably fine too. But if you're using it on BSD, or even Linux (or whatever other platforms folks have tried to port ZFS to over the years), anything that isn't directly controlled by Oracle, I wouldn't be wiping the sweat from my brow just yet.

As is typical in such cases, the settlement (at least from what I can see) is specifically between the two companies; there have been no statements or promises from either side from a broader technology standpoint.

I don't know what OS folks like Coraid and Compellent use on their ZFS devices, but I recall that when investigating NAS options for home use I was checking out Thecus, a model like the N770+, and among the features was a ZFS option. The default file system was ext3, with XFS supported as well. While I am not certain, I was pretty convinced the system was running Linux rather than OpenSolaris, given that it supported XFS and ext3. I ended up not going with Thecus because, as far as I could tell, they were using software RAID. Instead I bought a new workstation (my previous computer was many years old) and put in a 3Ware 9650SE RAID controller (with a battery backup unit and 256MB of write-back cache) along with four 2TB disks in RAID 1+0.

Now, as an end user I can see not really being concerned; it is unlikely NetApp or Oracle will go after end users running ZFS on Linux or BSD or whatever. But if you're building a product based on it (with the intention of selling or licensing it) and you aren't using an 'official' version, I would stay on my toes. If your product doesn't compete against any of NetApp's product lines you may skirt by without attracting attention, and as long as you're not too successful Oracle probably won't come kicking down your door.

Unless of course further details are released and the air is cleared more about ZFS as a technology in general.

Interestingly enough, I was reading a discussion (on Slashdot, I think) around the time Oracle bought Sun, when folks became worried about the future of ZFS in the open source world. Some were suggesting that, as far as Linux was concerned, the answer was btrfs, the Linux community's response to ZFS. Something I didn't know at the time was that btrfs is apparently also heavily supported by Oracle (or at least it was; I don't track progress on that project).

Yes, I know btrfs is GPL, but as I'm sure you know, a file system is a complicated beast to get right. If Oracle's involvement in the project is significant and they choose, for whatever reason, to drop support and move resources to ZFS, that could leave a pretty big gap that would be hard to fill. Just because the code is there doesn't mean it's going to magically write itself. I'm sure others contribute; I just don't know the ratio of support from Oracle versus outsiders. I recall reading at one point that something like 75-85% of OpenOffice development was done directly by Sun engineers. Just something to keep in mind.

I miss reiserfs. I really did like reiserfs v3 way back when. And v4 certainly looked promising (never tried it).

This reminds me of the classic argument so many people make for using open source (not that I don't like open source; I use it all the time): that if there is a bug in the program you can go in and fix it yourself. My experience at many companies is the opposite: they encounter a bug and go through the usual community channels to try to get a fix. I would say it's safe to assume that well over 98% of users of open source code have no ability to comprehend or fix the source they are working with, and that comes from my own experience working at essentially nothing but software companies over the past 10 years. And before anyone asks, I believe it's equally improbable that a company would hire a contractor to fix a bug in an open source product. I'm sure it happens, but it's pretty rare given the number of users out there.

