TechOpsGuys.com Diggin' technology every day

March 23, 2012

Hitachi trounces XIV in SPC-2 Costs

Filed under: Storage — Tags: , , — Nate @ 9:37 am

This really sort of surprised me. I came across an HP storage blog post which mentioned some new SPC-2 results for the P9500, aka the Hitachi VSP. Naturally I expected the system to cost quite a bit and offer good performance, but I was really not expecting these results.

A few months ago I wrote about what seemed like pretty impressive numbers from IBM XIV (albeit at a high cost); I didn’t realize just how high of a cost until these latest results came out.

Not that any of my workloads are SPC-2 related (SPC-2 is primarily a throughput benchmark). If I had a data warehouse I’d probably run HP Vertica (which slashes I/O requirements due to its design), negating the need for such a high-performing system. If I was streaming media I would probably be running some sort of NAS – maybe Isilon or DDN, BlueArc – I don’t know. I’m pretty sure I would not be using one of these kinds of arrays though.

Anyways, the raw numbers came down to this:

IBM XIV

  • 7.4GB/sec throughput
  • $152.34 per MB/sec of throughput (42MB/sec per disk)
  • ~$7,528 per usable TB (~150TB Usable)
  • Total system cost – $1.1M for 180 x 2TB SATA disks and 360GB cache

HP P9500 aka Hitachi VSP

  • 13.1GB/sec throughput
  • $88.34 per MB/sec of throughput (26MB/sec per disk)
  • ~$9,218 per usable TB (~126TB Usable)
  • Total system cost – $1.1M for 512 x 300GB 10k SAS disks and 512GB cache
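Just to show where the derived figures come from: the published SPC-2 numbers use the exact tested prices, so a back-of-the-envelope check using the rounded ~$1.1M totals above lands close to, but not exactly on, those figures. A quick sketch:

awk 'BEGIN {
  # XIV: ~7.4GB/sec, 180 disks, ~150TB usable, ~$1.1M total
  printf "XIV: ~%.0f MB/sec per disk, ~$%.0f per usable TB, ~$%.0f per MB/sec\n", 7400/180, 1100000/150, 1100000/7400
  # P9500/VSP: ~13.1GB/sec, 512 disks, ~126TB usable, ~$1.1M total
  printf "VSP: ~%.0f MB/sec per disk, ~$%.0f per usable TB, ~$%.0f per MB/sec\n", 13100/512, 1100000/126, 1100000/13100
}'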

The numbers are just startling to me; I never really expected the cost of the XIV to be so high in comparison to something like the P9500. In my original post I suspected that any SPC-1 numbers coming out of XIV (based on the SPC-2 configuration cost anyways) would put the XIV as the most expensive array on the market (per IOP), which is unfortunate given its limited scalability of 180 disks, 7200RPM-only and RAID 10 only. I wonder what, if anything (other than margin), keeps the price so high on XIV.

I’m sure a good part of getting the cost lower on the P9500 side was the choice to use RAID 5 instead of RAID 10. The previous Hitachi result, released in 2008 for the previous-generation USP platform, used mirroring. And of course XIV only supports mirroring.

It seems clear to me that the VSP is the winner here. I suspect the XIV probably includes more software out of the box, while the VSP is likely not an all-inclusive system.

IBM gets some slack cut to them since they were doing an SPC-2/E energy efficiency test, though not too much, since if you’re spending $1M on a storage system the cost of energy isn’t going to be all that important (at least given average U.S. energy rates). I’m sure the P9500 with its 2.5″ drives is pretty energy efficient on its own anyways.

Where XIV really fell short was on the last test, for Video on Demand; for some reason the performance tanked, coming in at less than 50% of the other tests (a full 10 Gigabytes/second less than VSP). I’m not sure what the weightings are for each of the tests, but if IBM was lucky and the VOD test wasn’t there it would have helped them a lot.

The XIV as tested is maxed out, so any expansion would require an additional XIV. The P9500 is nowhere close to maxed out (though throughput could be maxed out, I don’t know).

March 20, 2012

Storage Reclamation under Linux

Filed under: Storage — Nate @ 1:32 pm

UPDATED – Oh the good ol’ days. I remember when I first started using thin provisioning in late 2006, learning the ropes, learning the ins and outs… I can’t count how many times I did data migrations between volumes to reclaim space at the time.

Times have changed. A few years ago many companies started introducing thin reclamation technologies, which provide a means for the host to communicate to the storage which blocks are not being used, so they can be reclaimed by the shared storage system for use in other volumes.

Initially software support for this was almost non-existent, short of a Microsoft tool called sdelete. Though it was never designed with storage reclamation in mind (I assume), it became the tool of choice for early reclamation systems since it had the ability to zero out deleted space in a volume. Of course it only worked on Windows boxes.

Later came support from Symantec in their Veritas products, and VMware in their VAAI technologies (though in VMware’s case it seems they suggest you disable support because storage arrays can’t keep up, which drives latency up – I wonder if that includes 3PAR?). Then Oracle announced their own reclamation system, which they built in co-operation with 3PAR. Myself at least, I have not seen many other announcements for such technology.

A few days ago I came across this post from Compellent which talks about support for the SCSI UNMAP command in Red Hat Enterprise Linux 6, apparently integrated into the ext4 file system (or somewhere else in the layer so that it is transparent; no special tool needed to reclaim space). That sounded really cool. My company is an Ubuntu shop at the moment, and as far as I can tell there is no such support in Ubuntu at this time (at least not with 10.04 LTS).
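From what I can tell, that transparent support boils down to mounting ext4 with the discard option, so deletes issue TRIM/UNMAP down to the array as they happen. A sketch (I haven’t tried this on RHEL 6 myself, and the device/mount point here are just examples):

# /etc/fstab - mount ext4 with online discard
/dev/mapper/vg01-data   /data   ext4   defaults,discard   0 2

# or enable it on an already-mounted filesystem
mount -o remount,discard /data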

One of my co-workers pointed me to another tool called fstrim, which seems to do something similar to sdelete.

Fstrim is used on a mounted filesystem to discard (or “trim”) blocks which are not in use by the filesystem. This is useful for solid-state drives (SSDs) and thinly-provisioned storage.

I have not yet tried it. YAT (Yet Another Tool) I came across is zerofree, but this tool is not really too useful since it needs the file system to be in read only mode or totally unmounted.

fstrim is part of the util-linux package in newer Debian-based distributions (Ubuntu 10.04 excluded), so it should be supported by your distro if you’re on a current release. It is also part of the latest versions of Red Hat Enterprise Linux.
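Usage looks simple enough – something along these lines (again, I haven’t actually run it yet, so treat this as a sketch, and the mount points are just examples):

# Trim unused blocks on one filesystem and report how much was discarded
sudo fstrim -v /data

# Or batch it up, e.g. from a weekly cron job
for fs in / /var /data; do fstrim -v "$fs"; done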

Any other thin reclamation tools out there (for any platform), or any other thin reclamation support built into other software stacks ?

UPDATE – Upon further investigation I believe I determined that this new Linux functionality was introduced into the kernel as something called FITRIM, which Ubuntu says was part of the 2.6.36 kernel. I see an XFS page that mentions real-time reclamation (what RHEL 6 does) is apparently part of the 3.0 kernel, so I assume RHEL backported those changes.

March 19, 2012

Apple and their dividends

Filed under: Random Thought — Nate @ 4:20 pm

I watch quite a bit of CNBC despite never having invested a dime in the markets. I found this news of Apple doing a stock buyback and offering a dividend curious.

People have been clamoring for a dividend from Apple for a while now, wanting Apple to return some of that nearly $100B cash stockpile to investors. I’m no fan of Apple (to put it mildly), never really have been, but for some reason I always thought Apple should NOT offer a dividend nor do a stock buyback.

The reason? Just look at the trajectory of the stock: investors are being rewarded by a large amount already, from about $225 two years ago to near $600 now. Keep the cash; sooner or later their whole iOS ecosystem will start to fizzle and they will go into decline again. When that will be I don’t know, but the cash could be used to sustain them for quite some time to come. As long as the stock keeps going up at a reasonable rate year over year (I hardly think what has happened to their stock is reasonable, but that’s why I’m not an investor), with some room for corrections here and there, keep the cash.

Same goes for stock buybacks. These seem to be tools for companies that really don’t have any other way to boost their stock price – a sort of last resort when they are all out of ideas. Apple doesn’t seem to be near that situation.

Even with the dividend I saw people complaining this morning that it was too low.

Apple could do something crazy with their cash stockpile too – perhaps buy HP (market cap of $48B), or Dell ($30B market cap), if for nothing else than diversification. Not that I’d be happy with an Apple acquisition of HP… Apple wouldn’t have to do anything with the companies, just let them run like they are now. For some reason I thought both HP and Dell’s market caps were much higher (I suppose they were, just a matter of time frame).

What do I think Apple should do with their stock? A massive split (10:1?), something they are apparently considering. Not for any reason other than to allow the little guy to get in on the action more easily; a $600/share price is kind of excessive, not Berkshire Hathaway excessive ($122k/share) but still excessive.

With the Nasdaq hitting new 10+ year highs in recent days/weeks, I would be curious to see the charts of the indexes that have Apple in them, and how they would look without the ~50-fold increase in Apple’s stock over the past 10 years. I seem to recall seeing/hearing that Dow Jones won’t put Apple in their DJI index because it would skew the numbers too much due to how massive it is. One article here says that if Apple had been put in the DJI in 2009 instead of Cisco, the Dow would be past 15,000 now.

CNBC reported that Steve Jobs was against dividends.

10GbaseT making a comeback ?

Filed under: Networking — Tags: — Nate @ 12:20 pm

Say it’s true… I’ve been a fan of 10GbaseT for a while now, though it hasn’t really caught on in the industry. Off the top of my head I can only think of Arista and Extreme who have embraced the standard from a switching perspective, with everyone else going with SFP+, or XFP or something else. Both Arista and Extreme obviously have SFP+ products as well, maybe XFP too, though I haven’t looked into why someone would use XFP over SFP+ or vice versa.

From what I know, the biggest thing that has held back adoption of 10GbaseT has been power usage, and I think other industry organizations had given up waiting for 10GbaseT to materialize. Cost was somewhat of a factor too: I recall at least with Extreme their 24-port 10GbaseT switch was about $10k more than their SFP+ switch (without any SFP+ adapters or cables), so it was priced similarly to an optical switch that was fairly fully populated with modules, making entry-level pricing quite a bit higher if you only needed say 10 ports initially.

But I have read two different things (and heard a third) recently, which I’m sure are related, and which hopefully point to a turning point in 10GbaseT adoption.

The first was a banner on Arista’s website.

The second is this blog post talking about a new 10GbaseT chip from Intel.

Then the third thing I probably can’t talk about, so I won’t 🙂

I would love to have 10GbaseT over the passive copper cabling that most folks use now; that stuff is a pain to work with. While there are at least two switching companies that have 10GbaseT (I recall a Dell blade switch that had 10GbaseT support too), the number of NICs out there that support it is just about as sparse.

Not only that, but I do like to color code my cables, and while CAT6 cables are obviously easy to get in many colors, it’s less common and harder to get those passive 10GbE cables in multiple colors; it seems most everyone just has black.

Also, cable lengths are quite a bit more precise with CAT6 than with passive copper. For example from Extreme at least (I know I could go 3rd party if I wanted), their short cables are 1 meter and 3 meters. There’s a massive amount of distance between those two. CAT6 can be easily customized to any length, and pre-made cables (I don’t make my own) can fairly easily be found in 1 foot (or even half a foot) increments.

SFP Passive copper 10GbE cable

I wonder if there are (or will there be) 10GbaseT SFP+ GBICs (so existing switches could support 10GbaseT without wholesale replacement)? I know there are 1GbE SFP+ GBICs.

 

March 17, 2012

Who uses Legacy storage?

Filed under: Random Thought,Storage — Tags: — Nate @ 3:34 pm

Still really busy these days and haven’t had time to post much, but I was just reading the LinkedIn profile of someone who works at a storage company and it got me thinking.

Who uses legacy storage? It seems almost everyone these days tries to benchmark their storage system against legacy storage. Short of something like direct attached storage, which has no functionality, legacy storage has been dead for a long time now. What should the new benchmark be? How can you go about (trying to) measure it? I’m not sure what the answer is.

When is thin, thin?

One thing that has been on my mind a lot on this topic recently is how 3PAR constantly harps on about their efficient allocation at 16kB blocks. I think I’ve tried to explain this in the past but I wanted to write about it again. I wrote a comment on it on an HP blog recently; I don’t think they published the comment though (haven’t checked for a few weeks, maybe they did). But they try to say they are more efficient (by dozens or hundreds of times) than other platforms because of this 16kB allocation thing-a-ma-bob.

I’ve never seen this as an advantage to their platform. Whether you allocate in 16kB chunks or perhaps 42MB chunks in the case of Hitachi, it’s still a tiny amount of data in any case and really is a rounding error. If you have 100 volumes and they all have 42MB of slack hanging off the back of them, that’s 4.2GB of data; it’s nothing.

What 3PAR doesn’t tell you is that this 16kB allocation unit is what a volume draws from a storage pool (a Common Provisioning Group in 3PAR terms – basically a storage template or policy which defines things like RAID type, disk type, placement of data, protection level, etc.). They don’t tell you up front how big the increments are that these storage pools themselves grow by, which is in part based on the number of controllers in the system.

If your volumes max out a CPG’s allocated space and it needs more, it won’t grab 16kB, it will grab (usually at least) 32GB – this is adjustable. This is, I believe, in part how 3PAR minimizes the impact of thin provisioning under large amounts of I/O, because it allocates these pools in larger chunks up front. They even suggest that if you have a really large amount of growth you increase the allocation unit even higher.

Growth Increments for CPGs on 3PAR

I bet you haven’t heard HP/3PAR say their system grows in 128GB increments recently 🙂

It is important to note, or to remember, that a CPG can be home to hundreds of volumes, so it’s up to the user; if they only have one drive type, for example, maybe they only want one CPG. But I think as they use the system they will likely go down a similar path to the one I have and end up with more.

If you only have one or two CPGs on the system it’s probably not a big deal, though the space does add up. Still I think for the most part even this level of allocation can be a rounding error. Unless you have a large number of CPGs.

Myself, on my 3PAR arrays I use CPGs not just for determining the data characteristics of the volumes but also for organizational purposes / space management. So I can look at one number and see all of the volumes dedicated to development purposes are X in size, or set an aggregate growth warning on a collection of volumes. I think CPGs work very well for this purpose. The flip side is you can end up wasting a lot more space. Recently on my new 3PAR system I went through and manually set the allocation level of a few of my CPGs from 32GB down to 8GB because I know the growth of those CPGs will be minimal. At the time I had maybe 400-450GB of slack space in the CPGs – not as thin as they may want you to think (I have around 10 CPGs on this array). So I changed the allocation unit and compacted the CPGs, which reclaimed a bunch of space.
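For reference, the commands involved were something along these lines – I’m going from memory of the InForm CLI here and the CPG name is just an example, so double-check the exact option names against the CLI reference:

# Shrink the growth increment (SD growth size) on a CPG down to 8GB
setcpg -sdgs 8g CPG-DEV-FC-R5
# Give the allocated-but-unused space in that CPG back to the system's free pool
compactcpg CPG-DEV-FC-R5
# Then check how much space the CPGs are still holding on to
showcpg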

Again, in the grand scheme of things that’s not that much data.

For me 3PAR has always been more about higher utilizations, which are made possible by the chunklet design and the true wide striping, the true active-active clustered controllers – one of the only (perhaps the first?) storage designs in the industry that goes beyond two controllers – and the ASIC acceleration which is at the heart of the performance and scalability. Then there is the ease of use and such, but I won’t talk about that anymore, I’ve already covered it many times. One of my favorite aspects of the platform is the fact that they use the same design on everything from the low end to the high end; the only difference really is scale. It’s also part of the reason why their entry-level pricing can be quite a bit higher than entry-level pricing from others, since there is extra sauce in there that the competition isn’t willing or able to put in their low-end boxes.

Sacrificing for data availability

I was talking to Compellent recently learning about some of their stuff for a project over in Europe and they told me their best practice (not a requirement) is to have 1 hot spare of each drive type (I think drive type meaning SAS or SATA, I don’t think drive size matters but am not sure) per drive chassis/cage/shelf.

They, like many other array makers, don’t seem to support the use of low-parity RAID (like RAID 50 3+1, or 4+1); they (like others) lean towards higher data:parity ratios, I think in part because they have dedicated parity disks (they either had a hard time explaining to me how data is distributed or I had a hard time understanding, or both). In that world dedicating 25% of your spindles to parity is very excessive, but in the 3PAR world dedicating 25% of your capacity to parity is not excessive (when compared to RAID 10, where there is a 50% overhead anyways).

There are no dedicated parity disks, or dedicated spares, on a 3PAR system, so you do not lose any I/O capacity; in fact you gain it.

The benefits of a RAID 50 3+1 configuration are a couple fold – you get pretty close to RAID 10 performance, and you can most likely (depending on the number of shelves) suffer a shelf failure without data loss or downtime (downtime may vary depending on your I/O requirements and the I/O capacity left after those disks are gone).

It’s a best practice (again, not a requirement) in the 3PAR world to provide this level of availability (losing an entire shelf), not because you lose shelves often but just because it’s so easy to configure and is self-managing. With a 4- or 8-shelf configuration I do like RAID 50 3+1. In an 8-shelf configuration maybe I have some data volumes that don’t need as much performance, so I could go with a 7+1 configuration and still retain shelf-level availability.

Or, with CPGs, you could have some volumes retain shelf-level availability and other volumes not have it – up to you. I prefer to keep all volumes with shelf-level availability. The added space you get with a higher data:parity ratio really has diminishing returns.
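A quick back-of-the-envelope on the capacity side shows why (using 40 equal disks and ignoring spare space and metadata):

awk 'BEGIN {
  disks = 40
  printf "RAID 10 (1+1): %d disks worth of usable capacity (50%% overhead)\n", disks/2
  printf "RAID 5  (3+1): %d disks worth of usable capacity (25%% overhead)\n", disks*3/4
  printf "RAID 5  (7+1): %d disks worth of usable capacity (12.5%% overhead)\n", disks*7/8
}'

Going from mirroring to 3+1 buys you ten disks worth of capacity; going from 3+1 all the way to 7+1 only buys you five more.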

Here’s a graphic from 3PAR which illustrates the diminishing returns (at least on their platform; I think the application they used to measure was an Oracle DB):

The impact of RAID on I/O and capacity

3PAR can take this to an even higher extreme on their lower-end F-class series, which uses daisy chaining in order to get to full capacity (max chain length is 2 shelves). There is an availability level called port level availability, which I always knew was there but never really learned what it truly was until last week.

Port level availability applies only to systems that have daisy-chained chassis and protects the system from the failure of an entire chain – so two drive shelves, basically. Like the other forms of availability this is fully automated, though if you want to go out of your way to take advantage of it you need to use a RAID level that is compliant with your setup to leverage port level availability; otherwise the system will automatically default to a lower level of availability (or will prevent you from creating the policy in the first place because it is not possible on your configuration).

Port level availability does not apply to the S/T/V series of systems as there is no daisy chaining done on those boxes (unless you have a ~10 year old S-series system which they did support chaining – up to 2,560 drives on that first generation S800 – back in the days of 9-18GB disks).

February 17, 2012

Random Irony – MySQL wanting .NET

Filed under: Random Thought — Tags: — Nate @ 10:22 pm

So I am here late at night waiting for Citrix to call me back about an issue on one of my Netscalers, so I went to install an eval of the 3PAR System Reporter – last time I used it, it was Windows-only, even though it was based off of Apache, MySQL and some CGI. Out of instinct I installed a Windows VM to install this on, and since it is an eval I am just going to put MySQL on the same VM, because our array is small and there just isn’t enough data to tax it. Later I found out I can use Linux now! But for Linux it needs 32-bit, so for this eval I’ll stick to Windows since I’m almost done setting it up and I don’t have 32-bit Linux handy at the moment.

I’ve never been a big fan of System Reporter; it seems very much like the tools from EMC anyways, and I’ve heard similar about the tools from HDS. So I guess it is typical, which is probably why I am not fond of it. But installing it is faster and easier than getting my home-grown stuff online, so in the meantime I’ll use it till I get some spare time to install my custom stuff (which uses perl scripts, rrdtool and cacti to display the data – along with a lot of time to create the graphs).

So I was installing some things and went to install MySQL, and much to my surprise, it wanted .NET. Oracle owns MySQL, they own Java, MySQL is of course open source, and I’m sure it’s got more deployments on Linux and the like than Windows, so it puzzles me after all that, that the MySQL folks go out of their way to use .NET for whatever they are using it for.

Thought it was funny anyways

Where's my NET?

Little update – got a cryptic error while trying to install system reporter.. so I guess here comes another support case for HP!

February 16, 2012

30 billion packets per second

Filed under: Networking — Tags: — Nate @ 11:28 pm

Still really busy getting ready to move out of the cloud. Ran into a networking issue today: I noticed my 10GbE NICs are going up and down pretty often. Contacted HP, and apparently these Qlogic 10GbE NICs have overheating issues, and the thing that has fixed it for other customers is to set the BIOS to increased cooling, basically ramp up those RPMs (as well as a firmware update, which is already applied). Our systems may have more issues given that each server has two of these NICs right on top of each other, so maybe we need to split them apart; waiting to see what support says. There haven’t been any noticeable issues from the link failures since everything is redundant across both NICs (both NIC ports on the same card seem to go down at about the same time).

Anyways, you know I’ve been talking about it a lot here, and it’s finally arrived (well, a few days ago): the Black Diamond X-Series (self-proclaimed world’s largest cloud switch). Officially announced almost a year ago, their marketing folks certainly liked to grab the headlines early. They did the same thing with their first 24-port 10GbE stackable switch.

The numbers on this thing are just staggering: 30 billion packets per second, and a choice of switching fabrics based on either 2.5Tbps (meaning up to 10Tbps in the switch) or 5.1Tbps (with up to 20Tbps in the switch). They are offering 48x10GbE line cards as well as 12- and 24x40GbE line cards (all line rate, duh). You already know it has up to 192x40GbE or 768x10GbE (via breakout cables) in a 14U footprint – half the size of the competition, which was already claiming to be a fraction of the size of their other competition.

5Tbps Switching fabric module (max 4 per switch)

 

Power rated at 5.5 watts per 10GbE port or 22W per 40GbE port.

They are still pushing their Direct Attach approach, support for which still hasn’t exactly caught on in the industry. I didn’t really want Direct Attach but I can see the value in it. They apparently have a solution for KVM ready to go, though their partner in crime that supported VMware abandoned the 10GbE market a while ago (it required special NIC hardware/drivers for VMware). I’m not aware of anyone that has replaced that functionality from another manufacturer.

I found it kind of curious that they rev’d their operating system three major versions for this product (from version 12 to 15). They did something similar when they re-wrote their OS about a decade ago (jumping from version 7 to 10).

Anyways, not much else to write about. I do plan to write more on the cloud soon, I’ve just been so busy recently – this is just a quick post / random thought.

So the clock starts – how long until someone else comes out with a midplane-less chassis design?

February 3, 2012

Don’t push code on a Friday damnit

Filed under: General,Random Thought — Tags: — Nate @ 9:35 pm

I hate it when people want to push code on a Friday. Here it is, Friday: I was working to wind up a few last tasks before going home when my phone went off saying part of my company’s site was not working right.

After some investigation with a developer we discovered it was an issue with Facebook (ugh, how I hate thee): their code was breaking because Facebook was broken (they were not aware this would happen; I am sure they will fix it going forward).

So while they are working to work around the issue, I ran a search for it, and it seems to be a more widespread problem caused by a Facebook software deployment done today, Friday, at nearly 7PM!

F F S

My monitoring first detected the failure at 6:49 PM, so they weren’t even done deploying by the time it failed. It was intermittent for a few minutes, then went hard down at around 7:11PM.

@$#$ facebook. Thanks for screwing me, and who knows how many others by deploying code on a Friday night.

Of course the developers on our end deserve some of the heat as well. But I am nitpicking about code deployments on a Friday, not code bugs.

IBM shows it still has some horses left

Filed under: Storage — Tags: , , — Nate @ 11:15 am

I noticed a few days ago that IBM posted some new SPC-1 results based on their SVC system, this time using different back-end storage – their Storwize product (something I had not heard of before, but I don’t pay too close attention to what IBM does; they have so many things it’s hard to keep track).

The performance results are certainly very impressive, coming in at over 520,000 IOPS at a price of $6.92 per IOP. This is the sort of result I was kind of expecting from the Hitachi VSP a while back. IBM tested with 1,920 drives, the same number as the 3PAR V800. They bested the 3PAR performance by a good 70,000 IOPS with half the latency, on the same number of disks and with less data cache.

The capacity numbers were, and still are, sort of difficult to interpret; they seem to give conflicting information. IBM is using ~138TB of disk space to protect ~99TB of data, while 3PAR is using ~263TB of disk space to protect ~263TB. Both results say there is 30TB+ of “unused storage” in that protection scheme.

Bottom line is the IBM box is presented with roughly 280TB of storage, and of that, 100TB is usable, or about 35%. That brings their cost per usable TB to $36,881/TB, vs the 3PAR V800 at roughly $12,872/TB. The V800 I/O cost was $6.59 per IOP, which IBM comes real close to.

IBM has apparently gone the same route as HDS in that the only 3.5″ drives they support on their Storwize systems are 3TB SATA disks. They hamper their own cost structure by not supporting larger 3.5″ 15k RPM SAS disks, which just doesn’t make sense to me. There are 300GB 15k SAS drives out and Storwize doesn’t support those either (yet, at least).

It took about five pages of scripting to configure the system from the looks of the full disclosure report.

It certainly looks like a halfway decent system. I mean, if you compare it to the VSP for example, it has the same array virtualization abilities with the SVC, it is sporting almost double the number of disk drives and almost double the raw performance, and configuration at least appears to be less complicated. It uses those power-efficient 2.5″ disks just like the VSP. It also costs quite a bit less than the VSP on both a per-IOP and per-TB basis. It also appears to have mainframe support for those that need that. From the looks of Seagate’s 15k RPM disks, at least, the 2.5″ drives have on average 15% less latency for random reads and writes than their 3.5″ counterparts. I thought the difference might be bigger than that given how much less distance the disk heads have to travel.

If I was in the market for such a big system, these results wouldn’t lead me away from 3PAR, at least based on the pricing disclosed for each system (and the level of complexity to configure). I was interviewing a candidate a few weeks ago and this guy had a strong storage background. Having worked for Symantec, I think, for a while he was doing some sort of storage consulting at various companies. I asked him how he provisioned storage, what his strategies were. His response was quite surprising. He said usually the vendors come out, deploy their systems and provision everything up front; all he does is carve out LUNs and present them to users. He had never been involved in the architecture planning or deployment of a storage system. He acted as if what he was doing was the standard practice (maybe it is at large companies, I’ve never worked at such an organization), and that it was perfectly normal.

But it certainly seems like a good system when put up against at least the VSP, and probably the V-MAX too.

I’ve always been interested in the SVC by itself; it certainly seems like a cool concept. I’ve never used one of course, but I like the ability to cluster at that intermediate level (in this case an 8-node cluster, which may be the max, I’m not sure) and then scale out storage behind it. Clearly they’ve shown with this that you can pump one hell of a lot of I/O through the thing. They also seem to have SSD tiering support built into it, which is nice as well.

Hopefully HP can come up with something similar at some point, as much as they talk smack about the likes of SVC today.

Making the easy stuff hard, the hard stuff possible

Filed under: linux — Tags: — Nate @ 5:16 am

First off, sorry for being away for so long; I’ve been really, really busy preparing a new data center deployment to migrate my company into, out of the cloud. The last time I did anything remotely resembling this was in 2007, though this time there are some extra layers involved that I didn’t have back then. It is certainly an interesting experience configuring the software and infrastructure from the absolute ground up, having nothing to base it off of (other than past experience, obviously!). I mean, we have our stuff in a public cloud now, but there are so many things that are different from an infrastructure perspective that little of it transfers over.

I wanted to write about a sort of topic that I haven’t really written about before: a systems management tool named Chef from a Seattle-based company named Opscode. It’s billed as a next-generation tool that is supposed to make your life easier, more advanced than older tools like Puppet and Cfengine.

I’ll start off by saying I have a very strong background in Cfengine, having used it since late 2004 at three different companies. My techniques and approaches evolved significantly over the years, and my last deployment was quite good in my opinion, considering I had to adapt an existing Cfengine deployment made by folks who didn’t know what they were doing into something that worked well, and doing so in a four-nines environment. That was not easy; as you know, one wrong command or config in one of these tools can wreak havoc, as I know first hand. I grew to like Cfengine a lot, and there was really nothing that I needed it to do that it couldn’t do for me. I knew its limitations well and it was simple to use.

I was introduced to Chef in the summer of 2010 when I went to the headquarters of Opscode and met their senior staff, including one of the co-founders I believe. They gave us their PowerPoint presentation on what Chef was, how it worked, what it could do, and why it exists.

It certainly came across as a very impressive tool, being able to do tons of things that Cfengine could not do, and it had a lot of concepts that sounded like they could be useful. At the same time, however, it looked incredibly complicated.

I raised my concerns with their senior staff on that very first day and we had about a 15-minute discussion on it. I’m not a programmer, nor do I ever intend to be. There is a very big line that I refuse to cross, between scripting tools in perl and bash to help make my life easier, and full-on code. A developer at my company constantly jokes that I say I am not a programmer yet I come up with complicated regexes and scripts to do things they don’t understand how to come up with on their own.

They tried to reassure me that learning Chef is no different than learning the syntax of an Apache configuration file, or DNS, or something like that. I didn’t really buy it, but was still willing to give the tool a shot since it sounded like a nice level of systems management that you could achieve with it. I still joke with my co-workers and current boss (who was my boss at the time too) on this very topic; they all remember that conversation to this day.

Chef is written in Ruby, and is very Ruby-centric. I guess you could say I am very biased against Ruby given my past experience supporting Ruby (on Rails) applications.

So here I am, almost 18 months later and things haven’t changed much. My dislike of Ruby continues, and is perhaps even stronger now having used Chef.

My first Chef implementation about a year ago was fraught with frustration at almost every turn. I could (and still can) see the promise in the tools it provides the user, but it’s just so difficult to work with, especially coming from a Cfengine background (and a lack of programming experience), that for my first iteration I dumbed it down a whole bunch, making the logic very Cfengine-like, at least as much as I could. I didn’t use any data bags, any attributes, no templates, nothing like that. I had (and still have) a very hard time finding usable examples for many things in Chef. They have a big repository of sample cookbooks – but to me, for the most part, those are not usable, because while they are examples they don’t go into detail as to specifically, literally, what each line of code does. Chef apparently uses this for its template language; I looked at it a couple of times and really I could not make heads or tails of it.

I like to tell people that Chef makes the easy things hard, and the hard things possible. It seems very clear to me that they attacked the hard things in system management first, before addressing the easy things. I remember seeing something in their documentation around the concept of the holy grail of the single-instance copy, which fits along those lines well. The idea is you have one small bit of code that can be adapted to (m)any environments and situations, using the templates to pull attributes and values from data bags or other sources to make something on the fly.

The concept is novel for sure. Coming from a Cfengine background I am very used to duplicating config stanzas for different environments, making static config files, one for each environment, or something like that. I’ve been doing it so long it’s second nature.

Where the Opscode folks and I seem to part ways is our priorities. Their priority is to turn system management into code and automate it to the point where it scales to a million systems. Mine is less ambitious: I want it to be easy to manage, and it can scale to a few thousand systems at the most, since going beyond that gets so cookie cutter that it’s not fun anymore. I can certainly see the value of such an approach when dealing with massive environments that are changing all the time. At most companies, though, this situation doesn’t exist – at most companies things are fairly static; you get a new system here and there, a new environment maybe once a quarter at the most. Maybe some big project comes along that increases your system count by a large amount for some special purpose.

I have absolutely no problem maintaining separate config files for each environment and having different config stanzas in the config management tool to push those files out. Not only is this approach simpler (in my view), it gives much more insight – perhaps that is a good word – into what is actually happening. I mean, if you have a template filled with things that are pulling values dynamically from a half dozen or more different sources, you really have no idea what that file looks like until it lands on the server in question. I like to be able to open the file and look at the settings rather than hunt down the various flags and values that can come from these various sources Chef provides.

I’m not building new environments every day; the level of change in general is quite small (as it has been over the past decade at the companies I have worked at), so I don’t need the level of dynamic ability that Chef provides because it doesn’t help me that much.

I came up with a new saying a few months ago after dealing with Chef: if it’s not friends with sed, awk and grep then it’s not friends with me. Chef, being very developer-centric, uses a lot of JSON to store and manage its various configurations. JSON is very much not friendly to sed, awk and grep, and so it frustrates me greatly whenever I have to deal with it.

Because we are moving into a self-managed data center environment we needed a way to provision systems. My background is Red Hat/CentOS, Kickstart and Cfengine. We have Ubuntu, <nothing>, and Chef. I came up with a system that for now uses VMware templates (my first ever use of VMware templates) and some custom scripting to integrate with Chef and do other provisioning tasks. It works; it’s not as nice as Kickstart, but it works. So, speaking of this and of JSON, there is a bootstrap process Chef needs to do in order to get itself registered with the Chef service. This involves creating a bit of JSON that Chef can read. The standard way of Chef bootstrapping is a sort of push approach, where there is a management agent that waits for a system to be provisioned, then ssh’s to the system and runs a bunch of stuff. I wanted a pull approach, where the system is provisioned, boots up and configures itself. So I came up with this little bash snippet to construct the JSON file:

echo -n "Making first-boot.json ..."
echo -n "{ \"run_list\": [ ">/etc/chef/first-boot.json;
export ROLES=`grep ROLE /root/00-50-* |head -n 1 | sed s'/.*=//'g | sed s'/,/ /'g` &&
for ROLE in $ROLES; do echo -n \"role[${ROLE}]\",;done | sed s'/\,$//'g >>/etc/chef/first-boot.json;
echo -n " ] }" >>/etc/chef/first-boot.json

That /root/00-50-* file is a configuration file named after the MAC address of the VM. This is based on my older Kickstart stuff, which has been extended to support Chef. It stores things like the IP address, host name and default gateway for the network, then the Chef environment, Chef role(s), and Chef organization. It’s a simple text file format that looks like VARIABLE=value, one variable per line.
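To give a rough idea, one of those files looks something like this – the variable names and values here are just illustrative, not the exact ones from my files:

# /root/00-50-56-ab-cd-ef  (hypothetical MAC address and values)
IPADDR=10.10.4.21
NETMASK=255.255.255.0
GATEWAY=10.10.4.1
HOSTNAME=web01.dev.example.com
CHEF_ENVIRONMENT=development
ROLE=base,webserver
CHEF_ORG=mycompany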

My point in pasting that bash snippet is the ugly lengths I have to go to in order to simulate valid JSON output using my own regular tool set. Remember, I am NOT a programmer!

The scripting works fine (at least so far – I’ve built a dozen or so different roles and systems with it), but it shouldn’t be that complicated.

For those of you more experienced with VMware templates: I noticed there is the ability to customize a template so that VMware can set the IP address, host name, etc. of the guest OS. When I saw this I spent a good two hours trying to get it to work, but no matter what I tried VMware said my configuration was not supported and it would not let me customize. I have read conflicting reports as to whether or not it is possible on Ubuntu. I am running ESX 4.1 with vCenter 5.0. I think if I was running vCenter 4.x it would work fine, but Ubuntu and other “non tier 1” operating system support for template customization is no longer supported in the 5.0 products. Oftentimes when I see “not supported”, especially when something used to work, it means that it might work, but don’t ask us for help if it blows up. Maybe coincidence or not, but as I said, no matter what I did the customization boxes were greyed out and I could not get vCenter 5.0 to work with Ubuntu.

At the end of the day it doesn’t matter though; I had what was, to me at least, a good provisioning process I could adapt from my Kickstart days, a process that works well on both physical and virtual machines – something that leverages the MAC address or the serial number (in the case of physical machines) for unique identification.

With regards to how I used to do things with Cfengine, it was simpler than Chef. Cfengine operates more on trust than Chef. Chef uses public/private keys to authenticate systems, and these keys have to be in the right place in order for a system to get registered. This is good for untrusted networks, like public clouds (ugh). Cfengine works more on trust, where you can (or at least I did) assign network ranges whose IPs are trusted, and a new system could just register itself without any special configuration; the keys would be generated automatically and exchanged between the cfengine client and server. I had my cfengine configuration, for the most part, dynamic based on the host name of the server. Most of my major Cfengine classes ran a simple grep on the file that had the host name in it; if the host name matched a particular pattern the system was automatically included in the right classes. With Chef life is different – I can’t do that. I have to specifically define which role(s) or recipes a system has up front, because the system will only download the cookbooks it is specifically configured to use. This isn’t a big deal but it is an extra step that I’m not used to having to do.
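Just to illustrate that extra step, assigning the run list to a node up front looks something like this (hypothetical node and role names, and the knife syntax is from memory, so check the docs):

# Tell the Chef server up front what this node should run
knife node run_list add web01.dev.example.com 'role[base]'
knife node run_list add web01.dev.example.com 'role[webserver]'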

Sample CFengine class definition:

ENV_CORPDMZ     = ( ReturnsZero(/bin/egrep -q "^HOSTNAME=corpdmz" /etc/sysconfig/network) )

With Cfengine, prior to implementing the hostname-based approach, adding a new server involved manually editing the master cfengine configuration so that it was aware of the new system that was about to come online. I still had to edit this file on occasion, if there were special configs needed for a server, but for the most part, for like systems – web servers and such – I did not.

Which sort of brings me to the next topic – recruiting talent that can use Chef. I’ve been managing server systems for about 17 years now (wow, has it really been that long?). It’s clear to me after 18 months of Chef that I lack the knowledge to use the tool effectively (though that hasn’t stopped me from using it at this point). But knowing that, and having worked with people at my previous company who used Chef and seeing the tool present them with a similar level of frustration (if not more), I can see Chef being a real sticking point in finding talent that is capable of managing it. My company is actively recruiting senior systems people (well, one person), and thinking about the candidates I have spoken with so far, along with candidates I have spoken to in the past, I honestly can think of perhaps one or two people over the years that I know who could handle Chef, and one of them is a full-time programmer now (when I met him he was hired to be on my operations team back in 2003).

Well, short of the co-worker I have now, who does quite a wonderful job deploying and managing Chef and who wrote the vast majority of the Chef stuff at my current company. It’s really well done, but even now that a lot of the hard work was done by him, in a very Chef-like way, I constantly struggle to add new stuff in or to change existing things because it’s so dynamic. I see a value for something – where is it coming from? Is it from the node? Environment? Data bag? Attribute? Something else?

So I see Chef somewhat like I see Hadoop as far as what skill sets are needed and who can provide them. One of my previous companies was working on migrating towards Hadoop, and a big complaint I heard from them about Hadoop (and I have heard it from others since) is finding talent that knows the product. With the likes of Yahoo, Google, and other big companies with very deep pockets and big data aspirations able to pay out the wazoo for Hadoop talent, small companies just can’t compete. The number of people qualified to do Hadoop right vs the number of people that can do SQL – well, it’s obvious, right?

I see the same with Chef. It’s a powerful tool but it’s just not there yet with regards to usability. I can see it being a very useful tool for those same kinds of companies who manage huge fleets of systems and have a very dynamic environment. One such place is HP – someone I know is going to work for HP Cloud because he knows Chef. I assume he is probably pretty good at Chef by now, though the caveat with him is he has a strong Ruby programming background, so it’s no real surprise that he could pick Chef up.

I filed several feature requests and bug reports on the Chef support site about a year ago when I was first interacting with it, though I don’t think much of it made it through. One thing I’d really like is a good way to do in-line editing of text files. At least at the time, the Chef mantra was “find another way to do it”, which a friend of mine says is the same thing Puppet people say. So how do I go about adding an entry to /etc/hosts?
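For comparison, the “friends with grep” way of making sure a hosts entry exists is a one-liner (hypothetical address and names, of course):

# Append the entry only if a line for that IP isn't already there
grep -q '^10\.10\.4\.50[[:space:]]' /etc/hosts || echo '10.10.4.50   db01.example.com db01' >> /etc/hosts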

Another thing I’d like to be able to do is bulk file copies from the cookbook that preserve ownership and permissions from the source files (e.g. having a directory tree with various owners/groups/permissions and copying it all at once); I don’t think that is possible still. At the time the Opscode people suggested I use rsync for that.

Another thing I’d like is to be able to host cookbooks internally while using the external service for other things. This is mainly for security purposes; I feel more at ease when my core data stays within the confines of my network, on systems under my direct control.

Another thing I’d like to see, which I have mentioned to Opscode in one way or another as well, is a more abstracted configuration language. I think I called it idiot mode or something. The Ruby syntax they use, while I’m sure it’s great for Ruby people, really sucks for people like me. I’m fine with a reduced subset of functionality in exchange for idiot mode, because it’s likely that I won’t use that functionality to begin with (at least not initially). Make the learning curve to actually using the tool less steep.

At one point Opscode was interested in talking to me about a full-time position being an advocate for their platform. I just couldn’t go through with it; I just can’t get excited about the platform after all the frustration it has given me. I certainly see the promise and will continue trying, but I think some fundamental things need to be done to the system in order to make it more usable.

So, in the end, I see Chef as a very powerful tool, a very useful tool for those with the skills to handle the power it gives you. If I were deploying a new environment today I would certainly NOT use Chef; I would use Cfengine. I don’t want to discourage people from using Chef – it is a good tool – just realize the much higher level of investment you need in order to properly leverage it and try to weigh that against the benefits. For me, the hard things that are made possible by Chef really involve a trivial amount of time. I dare say I have spent FAR more time trying to work with Chef on these hard things (understanding the concepts, code, etc.) than just flat out doing them by hand the old-fashioned way.

You might want to ask – why haven’t I tried Puppet? My answer would be: to date I haven’t had a reason to. I’ve had a few brief discussions with people who use Puppet over the years (including those who have used Cfengine as well) and asked them why I should use Puppet over Cfengine. For the most part the response was that there’s nothing really revolutionary in Puppet, so if you’re happy with Cfengine then stick with it. There are a few things Puppet apparently does better (what they are I don’t remember), but in my talks with people there wasn’t anything that made me want to jump on Puppet. There were things that sounded nice (like Chef has), but not enough return to justify the investment in time to make a migration when, as I mentioned earlier, Cfengine does pretty much everything I need it to do.

With Cfengine I could probably train a systems person up on the basics in literally an afternoon. My Cfengine configurations were not complicated. With Chef, well here I am at 18 months and still lost.

3,400 words, I think that’s a record for me for a published blog post. Should get back to sleep now, started writing this at about 3:30AM.

