TechOpsGuys.com Diggin' technology every day

July 24, 2013

Opscode is learning

Filed under: Random Thought — Tags: — Nate @ 10:11 am

A few months ago I wrote a long rant on how Opscode has a lot to learn about running operations.

Problems included:

  • status.opscode.com site returning broken HTTP Location headers which broke standards compliant browsers (I had reported this issue to them through multiple channels and support tickets for a good 7 months)
  • Taking scheduled downtime in the middle of a business day

It APPEARS someone read that post, because recently the status site was fixed. It now redirects to opscode.tumblr.com and since that time I have seen no issues with the site.

Also I see they have a scheduled downtime for some of their databases and they are scheduling it for 9PM Pacific time (Opscode is HQ’d in Pacific time), instead of say one in the afternoon. Obviously people in far time zones may not like that as much, but it makes sense to their U.S. customers(which I’d imagine is the bulk of their customer base but I don’t know).

They’ve also gone through some effort to post analysis on outages/performance issues recently as well which is nice.

I have two remaining requests, in case Opscode is reading:

  • Schedule downtime further in advance, the most recent announcement provides about 48 hours of notification, I think it’d be better to provide one week notice. Take a look at what other service providers do for planned outages, my experience says 48 hours is not sufficient notice for scheduled downtime. If it’s an emergency, then obviously a shorter window is acceptable just say it’s an emergency and try to explain why it’s an emergency.
  • Provide actual dates and times for the posts on the status site. Now it just says things like “17 hours ago” or “5 days ago”.
  • Be consistent on the time zone used. Some posts use UTC, others(scheduled events) refer to Pacific time. I don’t care which myself (well honestly I prefer Pacific since I am in that zone, but I can understand using UTC too).
  • Provide pro-active notification of any customer impacting maintenance.  Maybe all of their other customers follow them on twitter, I don’t know. I don’t use twitter myself. So having an email notification option (perhaps opt in by default) to customer addresses registered with the platform for such things would be good to consider.

Now as for Chef, there’s tons of things that could be improved with Chef to make it easier to use.. My latest issue is whenever I pull up the JSON to edit an environment, or a node or whatever the order is not consistent. My co-worker says the data is not ordered, and it has never been consistent for him, for me the issues just started a few weeks/month or two ago. It’s quite annoying. For example if I want to change the role of a node, I would knife node edit <hostname>, then skip to the end of the file, and change the role.  Now sometimes the role is at the top of the file, other times it is at the bottom (it’s never shown up as in the middle).

Pick a way to display the information and display it consistently! How hard is that to do.. It’s not as if I can pipe the JSON to the sort command and have it sort for me. I’ve never liked JSON for that reason — my saying is If it’s not friendly with grep and sed it’s not a friend to me. Or something like that ..  JSON seems to be almost exactly the opposite of what Linux admins want to deal with, it’s almost as bad as binary data, I just hate it. If I don’t have to deal with it (e.g. it’s used in applications and I never see it) – fine go nuts. Same goes for XML. I used to support a big application whose developers were gung ho for XML config files, we literally had several hundred. It would take WEEKS (literally) of configuration auditing(line by line) prior to deployment – and even then there was still problems. It was a massive headache. Using JSON just brings me back to those days.

The syntax is so delicate as well, one extra comma, or missing quote or anything the whole thing blows up(it wouldn’t be so bad if the tool ran a simple syntax check and pointed out what the error was and returned you to the editor to fix it telling you what line it was on, but in this case it just bombs and you lose all your changes — Opscode folks – look at visudo – it does this better..)

The only thing worse(off the top of my head) than the syntax for the chef recipes itself, is the JSON.. or maybe that should be vise versa..

Opscode and Chef are improving I guess is the point, maybe in the next decade or so it will become a product that I would consider usable to mere mortals.

April 9, 2013

Opscode Chef folks still have a lot to learn

Filed under: Random Thought — Tags: — Nate @ 8:01 pm

The theme for this post is: BASIC STUFF. This is not rocket science.

A while back I wrote a post (wow has it really been over a year since that post!) about Chef and my experience with it for what was at the time the past two years, I think I chose a good title for it –

Making the easy stuff hard, and the hard stuff possible

Which still sums up my thoughts today. This post was inspired by something I just read on the Opscode Chef status site.

While I’m on the subject of that damn status site I’ll tell you what – I filed a support ticket with them back in AUGUST 2012 – yes people that is EIGHT MONTHS ago, to report to them that their status site doesn’t #@$@ work.  Well at least most of the time it doesn’t #@$@! work. You see a lot of times the site returns an invalid Location: header which is relative instead of absolute, and standards based browsers(e.g. Firefox), don’t like that so I get a pretty error message that says the site is down, basically. I can usually get it to load after forcing a refresh 5-25 times.

This is not the kind of message you want to serve from your "status" site

I first came across this site when Opscode was in the midst of a fairly major outage. So naturally I feel it’s important that the web site that hosts your status page work properly. So I filed the ticket, after going back and forth with support, I determined the reason for the browser errors and they said they’d look into it. There wasn’t a lot they claimed they could do because the site was hosted with another provider (Tumbler or something??).

That’s no excuse.

So time passes, and nothing gets done. I mentioned a while back I met some of the senior opscode staff a few years ago, so I directly reached out to the Chief Operating Officer of Opscode (who is a very technical guy himself) to plead with him FIX THE DAMN SITE. If Tumbler is not working then host it elsewhere, it is trivial to setup that sort of site, I mean just look at the content on the site! I was polite in my email to him. He responded and thanked me.

So more time passes, and nothing happens. So in early January I filed another support ticket outlining the reason behind their web site errors and asked that they fix their site. This time I got no reply.

More time passes. I was bored tonight so I decided to hit the site again, guess what? Yeah, they haven’t done squat.

How incompetent are these people? Sorry maybe it is not incompetence but laziness.  If you can’t be bothered to properly operate the site take the site down.

So anyway I was on their site and noticed this post from last week

Chef 0.9.x Client EOL

Since we stopped supporting Chef 0.9.x June 11, 2012 we decided it is a good time to stop all API support for Chef 0.9.x completely.

Starting tomorrow the api.opscode.com service will no longer support requests from Chef 0.9.x clients.

ref: http://www.opscode.com/blog/2012/05/10/chef-0-9-eol/

I mean it doesn’t take a rocket scientist to read that and not think immediately how absurd that is. It’s one thing to say you are going to stop supporting something that is fine. But to say OH WE DECIDED TO STOP SUPPORT, TODAY IS YOUR LAST DAY.

So I go to the page they reference above and it says

On or after June 11th, we’ll deploy a change to Hosted Chef that will disable all access to Hosted Chef for 0.9 clients, so you will want to make sure you’ve upgraded before then.

Last I checked, it is nowhere near June 11th. (now that I think of it maybe they meant last year, they don’t say for sure).  In any case there was extremely poor notification on this – and how much work does it take to maintain servers running chef 0.9 ? So you can stop development on it, no new patches. Big deal.

This has absolutely no impact on anything I do because we have been on Chef 0.10 forever. But the fact they would even consider doing something like this just shows how poorly run things are over there.

How can they expect customers to take them seriously by doing stuff like this? This is BASIC STUFF. REAL BASIC.

Something else that caught my eye recently as I was doing some stuff in Chef, was their APIs seemed to be down completely. So I hopped on over to the status site after forcing a refresh a dozen or more times to get it to load and saw

Hosted Chef Maintenance Underway

The following systems are unavailable while Hosted Chef is migrated from MySQL to PostgreSQL.

– The Hosted Chef Platform including the API and Management Console

– Opscode Support Ticketing System

– Chef Community Site

Apparently they had announced it on the site one or more days prior(can’t tell for sure now since both posts say posted 1 week ago). But they took the APIs down at 2:00 PM Pacific time! (they are based in Seattle so that’s local time for them). Who in their right mind takes their stuff down in the middle of the afternoon intentionally for a data migration? BASIC STUFF PEOPLE. And their method of notification was poor as well, nobody at my company(we are a paying customer) had any idea it was happening. Fortunately it had only a minor impact on us. I just got lucky when I happened to try to use their API at the exact moment they took it down.

Believe me there are plenty of times when one of our developers comes up to me and says OH #@$ WE NEED THIS CONFIGURATION SETTING IN PRODUCTION NOW! As you might imagine most of that is in Chef, so we rely on that functioning for us at all times. Unscheduled down time is one thing, but this is not excusable. At the very least you could migrate customers in smaller batches(with downtime for any given customer measured in seconds – maybe the really big customers take longer but they can work with those individually to schedule a good time). If they didn’t build the product to do that they should go back to the drawing board.

My co-worker was recently playing around with a slightly newer build of Chef 0.10.x that he thinks we should upgrade to (ours is fairly out of date – primarily because we had some major issues on a newer build at the time). He ran into a bunch of problems including Opscode changing some major things around within a minor release breaking a bunch of stuff. Just more signs of how cavalier they are, typical modern “web 2.0” developer types, that don’t know anything about stability.

Maybe I was lucky I don’t know. But I basically ran the same version of CFengine v2 for nearly 7 years without any breakage (hell I can’t remember encountering a single issue I considered a bug!), across three different companies. I want my configuration system to be stable, fast and simple to manage. Chef is none of those, the more I use it the more I dislike it. I still believe it is a good product and has it’s niche, but it’s got a looooooooong way to go to win over people like me.

As a CFengine employee put it in my last post, Chef views things as configuration as code, and CFengine views them as configuration as documentation. I’m far in the documentation camp. I believe in proper naming conventions whether it is servers, or load balancer addresses, or storage volumes, mount points on servers etc. Also I believe strongly in a good descriptive domain name (have always used the airport codes like most other folks). None of this randomly generated crap(here’s looking at you Amazon). If you are deploying 10,000 servers that are doing the same thing you can still number them in some sort of sane manor. I’ve always been good at documentation, it does take work, and I find more often than not most people are overwhelmed by what I write (you may get the idea with what I have written here) so they often don’t read it — but it is there and I can direct them to it. I take lots of screen shots and do a lot of step by step guides.

On a side note, this configuration as documentation is a big reason why I do not look forward to IPv6.

Chef folks will say go read the code!  That can be a pretty dangerous thing to say, really, it is. I mean just yesterday or was it the day before, I was trying to figure out how a template on a server was getting a particular value. Was it coming from the cookbook attributes? from the role? from the environment? I looked everywhere and I could not find the values that were being populated — and the values I specified were being ignored. So I passed this task to my co-worker who I have to acknowledge has been a master in Chef, he has written most of what we have, and while I can manage to tweak stuff here and there, the difficult stuff I give him because if I don’t my fist will go through the desk or perhaps the monitor (desk is closer), after a couple hours working with Chef.  A tool is not supposed to make you get so frustrated.

So I ask him to look into it, and quickly I find HIM FIGHTING CHEF! OH MY THE IRONY. He was digging up and down and trying to set things but Chef was undoing them and he was cursing and everything. I loved it. It’s what I go through all the time.  After some time he eventually found the issue, the values were being set in another cookbook and they conflicted.

So he worked on it for a bit, and decided to hard code the values for a time while he looked into a better solution. So he deployed this better solution and it had more problems. The most recent thing is for some reason Chef was no longer able to successfully complete a run on certain types of servers(other types were fine though). He’s working on fixing it.

I know he can do it, he’s a really smart guy I just wanted to write about that story – I’m not the only one that has these problems.

Sure I’d love to replace Chef  with something else. But it’s not a priority I want to try to shove in my boss’ face (who likes the concept of Chef). I have other fish to fry, and as long as I have this guy doing the dirty work well it’s not as much of a pain for me.

Tracking down conflicting things in CFengine was really simple for me – probably because I wasn’t trying anything too over the top with configuration. Opscode guys liked to say, oh wouldn’t it be great if you could have one configuration stanza that could adapt to ANY SITUATION.

I SAY NO. —-  IT! IS! NOT! GREAT!

It might be nice in some situations but in many others it just gives me a headache. I like to be able to look at a config and say THAT IS GOING TO SERVER X, EXACTLY HOW IT SITS NOW. Sure I have to duplicate configs and files for different environments and such but really at the end of the day – at all of the companies I have worked at — IT’S NOT A BIG DEAL. In the grand scheme of things. If your configuration is so complex that you need all of this maybe you should step back and consider if you are doing something wrong – does it really need to be that complex? Why?

Oh and don’t get me started on that #$@ damn ruby syntax in the Chef configuration files. Oh you need quotes around a string that is nothing more than a word? You puke with a cryptic stack trace if you don’t have that? Oh you puke with a cryptic stack trace unless these two configuration settings are on their own lines? Come on, this is stupid. I go back to this post on Ruby, how I am reminded of it almost every time I use Chef. I had to support Ruby+Rails apps back from 2006-2008 and it was a bad experience. Left a bad taste in my mouth for Ruby. Chef just keeps on piling on the crap. I’ll fully admit I am very jaded against Ruby (and Chef for that matter). I think for good reason. How’s that saying go? Burn me once shame on you, burn me 500 times shame on me?

With the background that some of these folks have at Opscode it’s absolutely stunning to me the number of times they have shot themselves in the feet over the past few years, on such BASIC THINGS.  Maybe that’s how things are done at the likes of Amazon I don’t know, never worked there(knew many that did and do though, general consensus is stay away).

In my neck of the woods people take more care in what they do.

I’ll end this again by mentioning I could train someone on CFEngine in an afternoon, Chef – here I am 2 and a half years later and still struggling.

(In case your wondering YES I run Ubuntu 10.04 LTS on my laptop and desktop (guess what – it is about to go EOL too) – I have no plans to change, because it’s stable, and it does the job for me. I run Debian STABLE on my servers because – IT’S STABLE. No testing, no unstable, no experimental. Tried and true. The new UI stuff in the newer Ubuntu is just scary for me, I have no interest in trying it.)

Ok that’s enough for this rant I guess.  Thanks for listening.

May 29, 2012

More Chef pain

Filed under: General — Tags: — Nate @ 10:30 am

I wrote a while back about growing pains with Chef, which is the newish hyped up system management tool. I’ve been having a couple other frustrations with it in the past few months and needed a place to gripe.

The first issue started a couple of months ago where some systems were for some reason restarting Splunk every single time chef ran. It may of been going on longer than that but that’s when I first noticed it. After a couple hours of troubleshooting I tracked it down to chef seemingly randomizing the attributes for the configuration resulting in writing a new configuration (that was the same configuration, just in a different order) every time and triggering a restart. I think it was isolated primarily to the newer version(s) of chef (maybe specific to 0.10.10). My co-worker who knows more chef than I (and the more I use chef the more I really want cfengine – disclaimer I’ve only used cfengine v2 to-date), says after spending some time troubleshooting himself that the only chef solution might be to somehow set the order of the attributes in a static fashion (probably some ruby thing that lets you do that? I don’t know). In any case he hasn’t spent time on doing that and it’s over my head so these boxes just sit there restarting splunk once or twice an hour. They make up a small portion of the systems, the vast majority are not affected by this behavior.

So this morning I am alerted to a failure in some infrastructure that still lives in EC2 (oh how I hate thee), turns out the disk is going bad and I need to build a new system to replace it. So I do, and chef spits out one of it’s usual helpful error messages

 [Tue, 29 May 2012 16:35:36 +0000] ERROR: link[/var/log/myapp] (/var/cache/chef/cookbooks/web/recipes/default.rb:50:in `from_file') had an error:
 link[/var/log/myapp] (web::default line 50) had an error: TypeError: can't convert nil into String
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:106:in `stat'
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:106:in `stat'
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:61:in `set_owner'
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:30:in `set_all'
 /usr/lib/ruby/vendor_ruby/chef/mixin/enforce_ownership_and_permissions.rb:33:in `enforce_ownership_and_permissions'
 /usr/lib/ruby/vendor_ruby/chef/provider/link.rb:96:in `action_create'
 /usr/lib/ruby/vendor_ruby/chef/resource.rb:454:in `send'
 /usr/lib/ruby/vendor_ruby/chef/resource.rb:454:in `run_action'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:49:in `run_action'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `each'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection.rb:94
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:116:in `call'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:116:in `call_iterator_block'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:85:in `step'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:104:in `iterate'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:55:in `each_with_index'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection.rb:92:in `execute_each_resource'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:80:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/client.rb:330:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/client.rb:163:in `run'
 /usr/lib/ruby/vendor_ruby/chef/application/client.rb:254:in `run_application'
 /usr/lib/ruby/vendor_ruby/chef/application/client.rb:241:in `loop'
 /usr/lib/ruby/vendor_ruby/chef/application/client.rb:241:in `run_application'
 /usr/lib/ruby/vendor_ruby/chef/application.rb:70:in `run'
 /usr/bin/chef-client:25

So I went to look at this file, on line 50, looked perfectly reasonable, there hasn’t been any changes to this file in a long time and has worked up until now. What a TypeError is I don’t know(it’s been explained to me before but I forgot what it was 30 seconds after it was explained), I’m not a developer(hey, fancy that). I have seen it tons of times before though, it was usually a syntax problem (tracking down the right syntax has been a bane for me in Chef, it’s so cryptic, just like the stack trace above).

So I went to the Chef website to verify the syntax, and yep, at least according to those docs it was right. So, WTF?

I decided to delete the user and group config values, and ran chef again, and it worked! Well until the next TypeError, rinse and repeat about four more times and I finally got chef to complete. Now for all I know my modifications to make the recipes work on this chef will break on the others. Fortunately I was able to figure this syntax error out, usually I just bang my head on my desk for two hours until it’s covered in blood and then wait for my co worker to come figure it out(he’s in a vastly different time zone from me).

So what’s next? I get an alert for the number of apache processes on this host, and that brings back another memory with regards to Chef attributes. I haven’t specifically looked into this issue again but am quite certain I know what the issue is – just no idea how to fix it. The issue the last time this came up was that Chef could not decide on what type of EC2 (ugh) instance this system is, there are different thresholds for different sizes. Naturally one would expect chef to check to see what size, it’s not as if Amazon has the ability to dynamically change sizes on you right? But for some reason again chef thinks it is size A on one run and size B on another run. Makes no sense. Thus the alerts when it gets incorrectly set to the wrong size. Again – this only seems to impact the newest version(s) of Chef.

I’m sure it’s something we’re doing wrong, or if it was VMware it would be something Chef was doing wrong before and is doing right now, what we’re doing hasn’t changed and now all of a sudden is broken. I believe another part of the issue is the legacy EC2 bootstrap process pulls in the latest chef during build, vs our new stuff(non EC2) maintains a static version, less surprises.

Annoyed to have to come back from a nice short holiday to have to immediately deal with two things I hate to deal with – Chef and EC2.

This coming trip to Amsterdam will provide the infrastructure to move the vast majority of the remaining EC2 stuff out of EC2, so am excited about that portion of the trip at least. Getting off of chef is another project I don’t feel like tackling now since I’m in the minority as to my feelings for it. I just try to minimize my work in it for my own sanity, there’s lots of other things I do instead.

On an unrelated note, for some reason during a recent apt-get upgrade my Debian system pulled in what feels like a significantly newer version of WordPress, though I think the version number only changed a little(I don’t recall what the original version number was). I did a major Debian 5.0->6.0 Upgrade a couple of months ago, but this version came in after, has a bunch of UI changes. I’m not sure if it breaks anything, I think I need to re-test how the site renders in IE9 as I manually patched a file after getting a report that it didn’t work right, and the most recent update may of overwritten that fix.

February 3, 2012

Making the easy stuff hard, the hard stuff possible

Filed under: linux — Tags: — Nate @ 5:16 am

First off, sorry for being away for so long, I’ve been really, really busy preparing a new data center deployment to migrate my company out of the cloud into. The last time I did anything remotely resembling this was in 2007, though this time there are some extra layers involved that I didn’t have back then. It is certainly an interesting experience though configuring the software and infrastructure from the absolute ground up, having nothing to base it off of (other than past experience obviously!). I mean we have our stuff in a public cloud now but there are so many things that are different from an infrastructure perspective that little of it transfers over.

I wanted to write about a sort of topic that I haven’t really written about before. It’s about a systems management tool named Chef from a Seattle-based company named Opscode. It’s supposed to be a next generation tool that is supposed to make your life easier, more advanced than older tools like Puppet and Cfengine.

I’ll start off by saying I have a very strong background in Cfengine, having had used it since late 2004, at three different companies. My techniques and approaches evolved significantly over the years, and my last deployment was quite good in my opinion considering I had to adapt an existing Cfengine deployment made by folks who didn’t know what they were doing into something that worked well, and doing so in a 4 nines environment. That was not easy, as you know one wrong command or config in one of these tools can wreck havoc as I know first hand. I grew to like Cfengine a lot, and there was really nothing that I needed it to do that it couldn’t do for me. I knew it’s limitations well and it was simple to use.

I was introduced to Chef in the summer of 2010 when I went to the headquarters of Opscode and met their senior staff including one of the co-founders I believe. They gave us their powerpoint presentation on what Chef was, how it worked, what it could do, why it exists.

It certainly came across as a very impressive tool, being able to do tons of things that Cfengine could not do, had a lot of concepts that sounded like they could be useful. At the same time however it looked incredibly complicated.

I raised my concerns with their senior staff on that very first day and we had about a 15 minute discussion on it. I’m not a programmer, nor do I ever intend to be. I have a very big line that I refuse to cross from scripting tools in perl & bash to help make my life easier to full on code. A developer at my company constantly jokes that I say I am not a programmer yet I come up with complicated regexes and scripts to do things they don’t understand how to come up with on their own.

They tried to re-assure me that learning Chef is no different than learning the syntax of an Apache configuration file, or DNS or something like that. I didn’t really buy it, but was still willing to give the tool a shot since it sounded like a nice level of systems management that you could achieve with it. I still joke with my co-workers and current boss( who was my boss at the time too) on this very topic, they all remember that conversation to this day.

Chef is written in Ruby, and is very Ruby-centric. I guess you could say I am very biased against Ruby given my past experience supporting Ruby (on Rails) applications.

So here I am, almost 18 months later and things haven’t changed much. My dislike of Ruby continues, and is perhaps even stronger now having used Chef.

My first chef implementation about a year ago was fraught with frustration at almost every turn. I could(and still can) see the promise in the tools it provides the user with but it’s just so difficult to work with especially coming from a Cfengine background(and lack of programming experience) that for my first iteration I dumbed it down a whole bunch, making the logic very Cfengine like, at least as much as I could. I didn’t use any data bags, any attributes, no templates, nothing like that. I had (and still have) a very hard time finding usable examples for many things in Chef. They have a big repository of sample cookbooks – but to me for the most part those are not usable, because while examples they don’t go into details as to specifically, literally what each line of code does. Chef apparently uses this for it’s template language, I looked at it a couple of times – and really I could not make heads or tails of it.

I like to tell people that Chef makes the easy things hard, and the hard things possible. It seems very clear to me that they attacked the hard things in system management first before addressing the easy things. I remember seeing something in their documentation around the concept of the holy grail in the single instance copy, which fits along those lines well. The idea is you have one small bit of code that can be adapted to (m)any environments and situations, using the templates to pull attributes and values from data bags or other sources to make something on the fly.

The concept is novel for sure, coming from a Cfengine background I am very used to duplicating config stanzas, for different environments, making static config files, one for each environment or something like that. I’ve been doing it so long it’s second nature.

Where the opscode folks and I seem to part ways is our priorities. Their priority is to turn the system management into code and automate it to the point where it scales to a million systems. Mine is less ambitious, I want it to be easy to manage and it can scale to a few thousand systems at the most, since going beyond that gets so cookie cutter that it’s not fun anymore. I can certainly see the value of such an approach when dealing with massive environments that are changing all the time. Most companies though this situation doesn’t exist – most companies things are fairly static, you get a new system here and there, you get a new environment maybe once a quarter at the most. Maybe some big project comes along that increases your system count by a large amount for some special purpose.

I have absolutely no problem in maintaining separate config files for each environment and having different config stanzas in the config management tool to push those files out. Not only is this approach simpler (in my view) it gives much more, insight – perhaps is a good word into what is actually happening. I mean if you have a template filled with things that are pulling values dynamically from a half dozen or more different sources you really have no idea what that file really looks like until it lands on the server in question. I like to be able to open the file and look at the settings rather than hunt down the various flags and values that can come from these various sources chef provides.

I’m not building new environments every day, the level of change in general is quite small (as it has been over the past decade at companies I have worked at), I don’t need the level of dynamic abilities that Chef provides because it doesn’t help me that much.

I came up with a new saying a few months ago after dealing with Chef. If it’s not friends with sed, awk and grep then it’s not friends with me. Chef, being very developer-centric uses a lot of JSON to store and manage it’s various configurations. JSON is very much not friendly to sed, awk and grep, and so it frustrates me greatly whenever I have to deal with it.

Because we are moving into a self managed data center environment we needed a way to provision systems. My background is Red Hat/CentOS, Kickstart and Cfengine. We have Ubuntu, <nothing>, and Chef. I came up with a system that for now uses VMware templates (my first ever use of VMware templates) and some custom scripting to integrate with Chef and do other provisioning tasks. It works, it’s not as nice as Kickstart but it works. So speaking of this, and JSON there is a bootstrap process Chef needs to do in order to get itself registered and stuff with the Chef service. This involves creating a bit of JSON that Chef can read. The standard way of Chef bootstrap is a sort of push approach, where there is a management agent that waits for a system to be provisioned, then ssh’s to the system and runs a bunch of stuff. I wanted a pull approach, where the system is provisioned and boots up and configures itself. So I came up with this little bash snippet to construct this JSON file

echo -n "Making first-boot.json ..."
echo -n "{ \"run_list\": [ ">/etc/chef/first-boot.json;
export ROLES=`grep ROLE /root/00-50-* |head -n 1 | sed s'/.*=//'g | sed s'/,/ /'g` &&
for ROLE in $ROLES; do echo -n \"role[${ROLE}]\",;done | sed s'/\,$//'g >>/etc/chef/first-boot.json;
echo -n " ] }" >>/etc/chef/first-boot.json

That /root/00-50-* file is a configuration file named after the MAC address of the VM. This is based on my older kickstart stuff which has been extended to support Chef. It stores things like IP address, Host name, default gateway, for the network, then Chef environment, Chef Role(s), and Chef Organization. It’s a simple text file format, that looks like VARIABLE=value, one VARIABLE per line.

My point with pasting that code is the ugly length I have to go through to simulate valid JSON output using my own regular tool set. Remember I am NOT a programmer!

The scripting works fine(at least so far, built a dozen or so different roles and systems), but it shouldn’t be that complicated.

For those of you more experienced with Vmware templates I noticed there is the ability to customize a template so that Vmware can set the IP address, host name etc of the guest OS. When I saw this I spent a good two hours trying to get it to work, but no matter what I tried Vmware said my configuration was not supported and it would not let me customize. I have read conflicting reports as to whether or not it is possible on Ubuntu. I am running ESX 4.1 with vCenter 5.0. I think if I was running vCenter 4.x it would work fine, but Ubuntu and other “non tier 1” operating system support for template customization is no longer supported in the 5.0 products. Often times when I see “not supported” especially when something used to work, it means that it might work but don’t ask us for help if it blows up. Maybe coincidence or not but as I said no matter what I did, the customization boxes were greyed out and I could not get vCenter 5.0 to work with Ubuntu.

At the end of the day it doesn’t matter though, I had, what was to me at least a good provisioning process I could adapt from my Kickstart days, a process that works well on both physical as well as virtual machines. Something that leverages the MAC address or the serial number(in the case of physical machines) for unique identification.

With regards on how I used to do things with Cfengine, it was simpler than Chef. Cfengine operates more on trust than Chef. Chef uses public/private keys to authenticate systems, and these keys have to be in the right place in order for a system to get registered. This is good for untrusted networks, like public clouds(ugh). Cfengine works more on trust, where you can (or at least I did) assign network ranges where the IPs are trusted, and a new system could just register itself without any special configuration. The keys would be generated automatically and exchanged between cfengine client and server. I had my cfengine configuration, for the most part dynamic based on the host name of the server. Most of my major Cfengine classes ran a simple grep on the file name that had the host name in it, if the host name matched a particular pattern it was automatically included in the right classes. With Chef life is different, I can’t do that. I have to specifically define which role(s) or recipes a system has up front. Because the system will only download cookbooks that it is specifically configured for using. This isn’t a big deal but is an extra step that I’m not used to having to do.

Sample CFengine class defitition:

ENV_CORPDMZ     = ( ReturnsZero(/bin/egrep -q "^HOSTNAME=corpdmz" /etc/sysconfig/network) )

With Cfengine, prior to implementing the hostname-based approach, adding a new server with Cfengine involved manually editing the master cfengine configuration so that it was aware of the new system that was about to come online. I still had to edit this file on occasion, if there were special configs needed for a server, but for the most part, for like systems, web servers and the like I did not.

Which sort of brings me to the next topic – recruiting talent that can use Chef.  I’ve been managing server systems for about 17 years now, wow has it really been that long.  It’s clear to me after 18 months of chef I lack the knowledge to be able to effectively use the tool (though it hasn’t stopped me from using it at this point), but knowing that, and working with people at my previous company with Chef and seeing the tool present them with a similar level of frustration (if not more), I can see Chef being a real sticking point finding talent that is capable of managing it. My company is actively recruiting senior systems people(well one person) and the candidates that I have spoken with so far, along with candidates I have spoken to in the past, I honestly can think of perhaps one or two people over the years that I know that could handle Chef, and one of them is a full time programmer now (when I met him he was hired to be on my operations team back in 2003).

Well short of the co-worker I have now who does quite a wonderful job in deploying and managing Chef, who wrote the vast majority of Chef stuff at my current company. It’s really well done, but even now that a lot of the hard work was done by him, in a very chef-like way I constantly struggle to add new stuff in, or to change existing things because it’s so dynamic. I see a value for something – where is it coming from? is it from the node? environment ? data bag? attribute? something else?

So I see Chef somewhat like I see Hadoop as far as what skill sets are needed and who can provide them. One of my previous companies was working on migrating towards Hadoop and a big complaint I heard from them about Hadoop (and I have heard it from others since) is finding talent that knows the product. With the likes of Yahoo, Google, and other big companies with very deep pockets and big data aspirations they can afford to pay out the wazoo for Hadoop talent, something small companies just can’t compete with. The number of people qualified to do Hadoop right vs the number of people that can do SQL, well it’s obvious, right.

I see the same with Chef. It’s a powerful tool but it’s just not there yet with regards to usability, I can see it being a very useful tool for the likes of those same kinds of companies who manage very huge fleets of systems and have a very dynamic environment. One such place is HP, whom someone I know is going to work for HP Cloud, because he knows Chef. I assume he is probably pretty good at Chef by now, though the caveat with him is he has a strong Ruby programming background. So it’s no real surprise that he could pick Chef up.

I filed several feature requests and bug reports on the Chef support site about a year ago when I was first interacting with it, though I don’t think much made it through. One thing I’d really like is a good way to do in-line editing of text files. At least at the time the Chef mantra was “find another way to do it”, which a friend of mine says is the same thing Puppet people say. So how do I go about adding an entry to /etc/hosts?

Another thing I’d like to be able to do is bulk file copies from the cookbook and preserve ownership and permissions from the source files(e.g. having a directory tree with various owners/groups/permissions and copying it all at once), I don’t think that is possible still. At the time the Opscode people suggested I use rsync for that.

Another thing I’d like is to be able to host cookbooks internally while using the external service for other things. This is mainly for security purposes I feel more at ease when my core data stays within the confines of my network, on systems under my direct control.

Another thing I’d like to see which I have mentioned to Opscode in one way or another as well is a more abstracted configuration language. I think I called it idiot mode or something. The Ruby syntax they use, while I’m sure it’s great for ruby people really sucks for people like me. I’m fine with a reduced subset of functionality that may be provided by idiot mode, because it’s likely that I won’t use that functionality to begin with(at least not initially). Make the learning curve to actually using the tool less steep.

At one point Opscode was interested in talking to me about a full time position being an advocate for their platform. I just couldn’t go through with it, I just can’t get excited about the platform after all the frustration it has given me. I certainly see the promise and will continue trying,  but I think some fundamental things need to be done to the system in order to make it more usable.

So, in the end, I see Chef as a very powerful tool, a very useful tool for those with the skills that can handle the power it gives you. If I were deploying a new environment today I would certainly NOT use Chef, I would use Cfengine. I don’t want to discourage people from using Chef, it is a good tool, just realize the much higher level of investment you need in order to properly leverage it and try to weigh that against the benefits. For me, the hard things that are made possible by Chef really involve a trivial amount of time. I dare say I have spent FAR more time trying to work with Chef on these hard things (understanding the concepts, code etc) than just flat out doing it by hand the old fashioned way.

You might want to ask – why haven’t I tried Puppet? My answer would be – to-date I haven’t had a reason to. I’ve had a few brief discussions with people who use Puppet over the years(including those who have used Cfengine as well) and asked them why should I use Puppet over Cfengine. For the most part the response was there’s nothing really revolutionary in Puppet so if your happy with Cfengine then stick to it. There are a few things Puppet apparently does better (What they are I don’t remember), but in my talks with people there wasn’t anything — anything that made me want to jump on Puppet. There was things that sounded nice (like Chef has), but not enough return to justify the investment in time to make a migration when, as I mentioned earlier Cfengine does pretty much everything I need it to do.

With Cfengine I could probably train a systems person up on the basics in literally an afternoon. My Cfengine configurations were not complicated. With Chef, well here I am at 18 months and still lost.

3,400 words, I think that’s a record for me for a published blog post. Should get back to sleep now, started writing this at about 3:30AM.

Powered by WordPress