I wrote a while back about growing pains with Chef, the newish, hyped-up system management tool. I've had a couple of other frustrations with it over the past few months and needed a place to gripe.
The first issue started a couple of months ago, when some systems began restarting Splunk every single time chef ran. It may have been going on longer than that, but that's when I first noticed it. After a couple hours of troubleshooting I tracked it down to chef seemingly randomizing the order of the attributes for the configuration, resulting in it writing a new configuration (the same configuration, just in a different order) on every run and triggering a restart. I think it was isolated primarily to the newer version(s) of chef (maybe specific to 0.10.10). My co-worker, who knows more chef than I do (and the more I use chef the more I really want cfengine - disclaimer: I've only used cfengine v2 to date), spent some time troubleshooting it himself and says the only chef solution might be to somehow set the order of the attributes in a static fashion (probably some ruby thing that lets you do that? I don't know). In any case he hasn't spent time on doing that, and it's over my head, so these boxes just sit there restarting splunk once or twice an hour. They make up a small portion of the systems; the vast majority are not affected by this behavior.
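As best I understand it, the fix would be to make the rendered output deterministic no matter what order the attributes arrive in. Here's a minimal plain-Ruby sketch of the idea (the attribute names are made up, and this is outside of Chef entirely): sort the keys before rendering, so two runs that assemble the same attributes in different orders still produce byte-identical files.

```ruby
# Two logically identical attribute sets whose keys arrive in different orders
run_a = { 'index' => 'main', 'host' => 'web01', 'sourcetype' => 'access' }
run_b = { 'sourcetype' => 'access', 'index' => 'main', 'host' => 'web01' }

# Rendering naively preserves insertion order, so file contents differ
# from run to run even though the configuration is the same - which a
# template/file resource treats as a change, triggering the restart
naive_render = ->(h) { h.map { |k, v| "#{k} = #{v}" }.join("\n") }

# Sorting the keys first makes the output stable regardless of the
# order the attributes were assembled in
stable_render = ->(h) { h.sort.map { |k, v| "#{k} = #{v}" }.join("\n") }

naive_render.call(run_a) == naive_render.call(run_b)   # => false
stable_render.call(run_a) == stable_render.call(run_b) # => true
```

If the template rendered only sorted output, the "same config in a different order" problem should go away, and with it the spurious restart notification.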
So this morning I am alerted to a failure in some infrastructure that still lives in EC2 (oh how I hate thee); turns out the disk is going bad and I need to build a new system to replace it. So I do, and chef spits out one of its usual helpful error messages:
[Tue, 29 May 2012 16:35:36 +0000] ERROR: link[/var/log/myapp] (/var/cache/chef/cookbooks/web/recipes/default.rb:50:in `from_file') had an error: link[/var/log/myapp] (web::default line 50) had an error: TypeError: can't convert nil into String
/usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:106:in `stat'
/usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:106:in `stat'
/usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:61:in `set_owner'
/usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:30:in `set_all'
/usr/lib/ruby/vendor_ruby/chef/mixin/enforce_ownership_and_permissions.rb:33:in `enforce_ownership_and_permissions'
/usr/lib/ruby/vendor_ruby/chef/provider/link.rb:96:in `action_create'
/usr/lib/ruby/vendor_ruby/chef/resource.rb:454:in `send'
/usr/lib/ruby/vendor_ruby/chef/resource.rb:454:in `run_action'
/usr/lib/ruby/vendor_ruby/chef/runner.rb:49:in `run_action'
/usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `converge'
/usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `each'
/usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `converge'
/usr/lib/ruby/vendor_ruby/chef/resource_collection.rb:94
/usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:116:in `call'
/usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:116:in `call_iterator_block'
/usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:85:in `step'
/usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:104:in `iterate'
/usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:55:in `each_with_index'
/usr/lib/ruby/vendor_ruby/chef/resource_collection.rb:92:in `execute_each_resource'
/usr/lib/ruby/vendor_ruby/chef/runner.rb:80:in `converge'
/usr/lib/ruby/vendor_ruby/chef/client.rb:330:in `converge'
/usr/lib/ruby/vendor_ruby/chef/client.rb:163:in `run'
/usr/lib/ruby/vendor_ruby/chef/application/client.rb:254:in `run_application'
/usr/lib/ruby/vendor_ruby/chef/application/client.rb:241:in `loop'
/usr/lib/ruby/vendor_ruby/chef/application/client.rb:241:in `run_application'
/usr/lib/ruby/vendor_ruby/chef/application.rb:70:in `run'
/usr/bin/chef-client:25
So I went to look at this file, at line 50, and it looked perfectly reasonable - there haven't been any changes to this file in a long time and it has worked up until now. What a TypeError is, I don't know (it's been explained to me before, but I forgot what it was 30 seconds after it was explained) - I'm not a developer (hey, fancy that). I have seen it tons of times before though, and it was usually a syntax problem (tracking down the right syntax has been a bane for me in Chef; it's so cryptic, just like the stack trace above).
So I went to the Chef website to verify the syntax, and yep, at least according to those docs it was right. So, WTF?
I decided to delete the user and group config values, ran chef again, and it worked! Well, until the next TypeError - rinse and repeat about four more times and I finally got chef to complete. Now for all I know my modifications to make the recipes work on this chef will break on the others. Fortunately I was able to figure this syntax error out; usually I just bang my head on my desk for two hours until it's covered in blood and then wait for my co-worker to come figure it out (he's in a vastly different time zone from me).
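For what it's worth, here's a plain-Ruby guess at what was going on: Ruby's core file methods raise exactly that TypeError when handed nil where they expect a string, which fits with deleting the user/group values making the error go away. The guard helper below is my own hypothetical sketch, not anything Chef provides.

```ruby
# Guarding for nil before it reaches File.stat - safe_stat is a
# hypothetical helper for illustration
def safe_stat(path)
  raise ArgumentError, "no path given" if path.nil?
  File.stat(path)
end

# This is the failure mode the link resource apparently hit when its
# owner/group handling ended up with nil internally: File.stat raises
# TypeError when handed nil instead of a string
begin
  File.stat(nil)
rescue TypeError => e
  # e.message reads along the lines of "can't convert nil into String"
  # (exact wording varies by Ruby version)
end
```

A clearer error up front ("no path given") would have saved a lot of head-banging compared to a TypeError buried twenty frames deep in chef internals.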
So what's next? I get an alert for the number of apache processes on this host, which brings back another memory with regard to Chef attributes. I haven't specifically looked into this issue again, but I am quite certain I know what it is - just no idea how to fix it. The issue the last time this came up was that Chef could not decide what type of EC2 (ugh) instance this system is, and there are different thresholds for different sizes. Naturally one would expect chef to check what size it is; it's not as if Amazon has the ability to dynamically change sizes out from under you, right? But for some reason chef thinks it is size A on one run and size B on another run. Makes no sense. Thus the alerts when it gets incorrectly set to the wrong size. Again - this only seems to impact the newest version(s) of Chef.
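I said I have no idea how to fix it, but one approach that occurs to me would be to pin the detected size on the first run rather than trusting a fresh (and occasionally wrong) detection every time. A hypothetical plain-Ruby sketch - the cache path and the instance names are made up for illustration:

```ruby
require 'tmpdir'

# Hypothetical cache location - made up for illustration
CACHE = File.join(Dir.tmpdir, 'instance_type.cache')

# Trust the first detected value and pin it, instead of re-detecting
# (and occasionally mis-detecting) the instance size on every run
def pinned_instance_type(detected)
  if File.exist?(CACHE)
    File.read(CACHE).strip      # later runs reuse the pinned value
  else
    File.write(CACHE, detected) # first run records what it saw
    detected
  end
end
```

Subsequent runs would then pick their thresholds off the pinned value, so a flaky detection can't flip the monitoring config back and forth between runs.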
I'm sure it's something we're doing wrong - or, if this were VMware, it would be something Chef was doing wrong before and is doing right now - but what we're doing hasn't changed and now all of a sudden it's broken. I believe another part of the issue is that the legacy EC2 bootstrap process pulls in the latest chef during build, whereas our new (non-EC2) stuff maintains a static version - fewer surprises.
Annoying to come back from a nice short holiday and immediately have to deal with two things I hate dealing with - Chef and EC2.
This coming trip to Amsterdam will provide the infrastructure to move the vast majority of the remaining EC2 stuff out of EC2, so I'm excited about that portion of the trip at least. Getting off of chef is another project I don't feel like tackling now, since I'm in the minority as to my feelings for it. I just try to minimize my work in it for my own sanity; there are lots of other things I do instead.
On an unrelated note, for some reason during a recent apt-get upgrade my Debian system pulled in what feels like a significantly newer version of WordPress, though I think the version number only changed a little (I don't recall what the original version number was). I did a major Debian 5.0->6.0 upgrade a couple of months ago, but this version came in after that, and it has a bunch of UI changes. I'm not sure if it breaks anything; I think I need to re-test how the site renders in IE9, as I manually patched a file after getting a report that it didn't work right, and the most recent update may have overwritten that fix.
[DANGER: NON TECHNICAL CONTENT BELOW]
Well next Saturday at least. The project kept getting delayed but I guess it's finally here. Going to Amsterdam for a week to build out a little co-location for the company. Whenever I brought up the subject over the past 7-8 months or so people almost universally responded with jealousy, wanting to go themselves. I suppose this may be a shock but I really don't want to go. I'm not fond of long trips especially on airplanes (don't mind long road trips though, at least I can stop whenever and stretch, or take a scenic route and take pictures).
The more I looked at Amsterdam the less interested I was in going there - or Europe in general. It seems like a very old, quaint, cultured place - really the polar opposite of what I'm interested in. I traveled a lot around Asia growing up (lived in Asia for nearly six years), and had a quick trip down to Australia. So I feel I can confidently say that I have traveled a bunch and really don't feel like traveling more; I've seen a lot - a lot of things that I am not interested in - and am not (yet) aware of other things that may interest me. I also don't like big crowded cities. I swear I've spent more time in Seattle than San Francisco since I moved here almost a year ago (not that I enjoy Seattle - but I have a few specific destinations to go to there, where I really have none in SFO at this point).
People obviously bring up the things you can do legally in the Netherlands that you can't do legally here (like that stops anybody from doing them here). It's really not a big deal to me. I actually looked quite a bit at the red light district, and didn't see anything that got me excited.
The one thing I did sign up for when I was booking the trip on Orbitz (my friend's site hipmunk.com directed me to Orbitz) was a canal pizza cruise, which seems up my alley - if only they served the one alcoholic drink I order. I really can't stand the taste or even the smell of things like wine or beer (or coffee or tea, while I'm at it - I'm sure there's something wrong with me in that dept...). I've looked around for other things, but so much of it seems culture-related and my attention span for that stuff lasts about 60 seconds. I also don't know how much free time I may have there - maybe very little. On my two trips to Atlanta for the build-out there I had a COMBINED maybe 7 hours of free time (total time there about 12 days), with many 15+ hour work days and more than one dinner at a gas station well past midnight.
I would like to find a place like this over there, since it's so close to Germany. I asked one guy that is over near Amsterdam (runs a VMware blog), though he didn't have a whole lot of suggestions. I have a good friend from when I was a kid that lives in Denmark, which I thought was close - not close enough though (~700km away), haven't seen that guy in maybe 17-18 years.
I emailed two of my very well-traveled friends who know me well and know what I like, and neither of them had any ideas either.
In the remote chance that any of the 9 readers I have knows of a good place(s) to visit or thing(s) to do in Amsterdam let me know.
So I suspect, short of that little canal cruise, I'll spend all of my time at the Holiday Inn and at the data center. Save my cash for my next trip - I intend to spend a week in Seattle towards the end of next month. My company has a satellite office there, so I plan to work during the day and have fun at night. Much celebration will ensue. I'm obviously very excited to go back to my favorite places, neither of which I have found replacements for in the Bay Area.
I've seen a couple of different articles from our friends at The Register on the launch of the HP IaaS cloud as a public beta. There really isn't a whole lot of information yet, but one thing seems unfortunately clear - HP has embraced the same backwards thinking as Amazon when it comes to provisioning, going against the knowledge and experience we've all gained in the past decade around sharing resources and oversubscription.
Yes - it seems they are going to have fixed instance sizes and no apparent support for resource pools. This is especially depressing coming from someone like HP, which has technology like thin provisioning and partnerships with all of the major hypervisor players.
Is the software technology at the hypervisor level just not there yet to provide such a service? vSphere 5, for example, supports 1600 resource pools per cluster. I don't like the licensing model of 5, so I built my latest cluster on 4.1 - which supports 512 resource pools per cluster. Not a whole lot in either case, but then again cluster sizes are fairly limited anyway.
There's no doubt that, gigabyte for gigabyte, DAS is cheaper than something like a 3PAR V800. But with fixed allocation sizes from the likes of Amazon, it's not uncommon to have disk utilization rates hovering in the low single digits. I've seen it at two different companies - and guess what - everyone else on the teams (all of whom had more Amazon experience than me) was just as unsurprised as I was.
So you take this cheap DAS, apply a 4 or 5% utilization rate to it, and all of a sudden it's not so cheap anymore, is it? Why is utilization so low? Well, in Amazon (since I haven't used HP's cloud), it's primarily low because that DAS is not protected: if the server goes down or the VM dies, the storage is gone. So people use other methods to protect their more important data. You can keep the OS and log files and stuff on there - no big deal if that goes away - but again, you're talking about maybe 3-5GB of data (which is typical for me at least). The rest of the disk goes unused.
Go to the most inefficient storage company in the world and even they will drool at the prospect of replacing storage that you're only getting 5% utilization out of! Because really, even the worst efficiency is maybe 20% on older systems w/o thin provisioning.
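The math here is simple enough to sketch. The prices below are made-up illustrative numbers, not quotes from anybody - the point is that what matters is the cost per gigabyte you actually use, which is the raw price per gigabyte divided by utilization:

```ruby
# Made-up illustrative prices: cheap unprotected DAS vs. shared array
# storage. Effective cost = raw price per GB / utilization.
das_cost_per_gb   = 0.10  # assumed raw $/GB for throwaway DAS
das_utilization   = 0.05  # the ~5% utilization described above
array_cost_per_gb = 1.00  # assumed raw $/GB for a shared array
array_utilization = 0.80  # achievable with thin provisioning/pooling

das_effective   = das_cost_per_gb / das_utilization     # $/GB actually used
array_effective = array_cost_per_gb / array_utilization # $/GB actually used
# at these (made-up) numbers the "10x more expensive" array ends up
# cheaper per used gigabyte: $2.00 vs $1.25
```

So even with a 10x premium on raw capacity, the shared storage comes out ahead once utilization enters the picture - which is exactly why the 5% figure matters so much.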
Even if the storage IS fully protected, the fixed allocation units are still way out of whack, and they can't be shared! I may need a decent amount of CPU horsepower and/or (more likely) memory to run a bloated application, but I don't need several hundred gigabytes of storage attached to each system when 20GB would be MORE than enough (my average OS+App installation comes in at under 3GB, and that's with a lot of stuff installed)! I'd rather take those several hundred gigabytes - both in terms of raw space and IOPS - and give them to database servers or something like that (in theory at least; the underlying storage in this case is poor, so I wouldn't want to use it for that anyway).
This is what 3PAR was built to solve - drive utilization (way) up, while simultaneously providing the high availability and software features of a modern storage platform. Others do the same too, of course, with varying degrees of efficiency.
So that's storage - next take CPU. The industry average pre-virtualization was in the sub-20% utilization range; my own personal experience says it's in the sub-10% range for the most part. There was a quote from a government official a couple of years back about how their data centers were averaging about 7% utilization. I've done a few virtualization projects over the years, and my experience shows that even after systems have been virtualized, the vmware hosts themselves are at low utilization from a CPU perspective.
Two projects in particular that I documented while I was at a company a while back - the most extreme perhaps being roughly 120 VMs on 5 servers, four of them HP DL585 G1s, which were released in 2005. They had 64GB of memory on them, but they were old boxes. I calculated that the newer Opteron 6100, when it was released, had literally 12 times the CPU power (according to SPEC numbers at least) of the Opteron 880s we had at the time. Anyway, even with these really old servers the cluster averaged under 40% CPU, with peaks to maybe 50 or 60%. Memory usage was pretty constant at around 70-75%. Translate that workload from those ancient servers onto something more modern and you'd likely see CPU usage rates drop to single digits while memory usage remains constant.
I have no doubt that the likes of HP and Amazon are building their cloud to specifically not oversubscribe - to assume that people will utilize all of the CPU allocated to them as well as memory and disk space. So they have fixed building blocks to deal with and they carve them up accordingly.
The major fault with the design, of course, is that the vast majority of workloads do not fit in such building blocks and will never come close to utilizing all of the resources that are provided - thus wasting an enormous amount of resources in the environment. What's Amazon's solution to this? Build your apps to better utilize what they provide - basically, work around their limitations. Which, naturally, most people don't do, so resources end up being wasted on a massive scale.
I've worked for really nothing but software development companies for almost 10 years now, and I have never really seen even one company or group or developer ever, EVER design/build for the hardware. I have been part of teams that tried to benchmark applications and buy right-sized hardware, but it never really works out in the end, because a simple software change can throw all those benchmarks and tests out the window overnight (not to mention how traditionally difficult it is to replicate real traffic in a test environment - I've yet to see it done right myself for even a moderately complex application). The far easier solution, of course, is resource pools and variably allocated resources.
Similarly, this model, along with the per-VM licensing model of so many different products out there, goes against the trend that has allowed us to have more VM sprawl, I guess. Instead of running a single server with a half dozen different apps, it's become good practice to split those apps up. The fixed allocation unit of the cloud discourages such behavior by dramatically increasing the cost of doing it. You still incur additional costs doing it on your own gear - memory overhead for multiple copies of the OS (assuming memory de-dupe doesn't work, which for me on Linux it doesn't), or disk overhead (assuming your array doesn't de-dupe, which 3PAR doesn't - but the overhead is so trivial here that it is a rounding error). But those incremental costs pale in comparison to the massive increases in cost in the cloud, because again of those fixed allocation units.
I have seen no mention of it yet, but I hope HP has at least integrated the ability to do live migration of VMs between servers. The hypervisor they are using supports it of course, I haven't seen any details from people using the service as to how it operates yet.
I can certainly see a need for cheap VMs on throwaway hardware. I see an even bigger need for the more traditional customers (who make up the vast, vast majority of the market) to have this model of resource pools instead. If HP were to provide both services - and a unified management UI - that really would be pretty nice to see.
The concept is not complicated, and is so obvious it dumbfounds me why more folks aren't doing it (my only thought is that perhaps the technology these folks are using isn't capable) - IaaS won't be worthwhile to use, in my opinion, until we have that sort of system in place.
HP is obviously in a good position when it comes to providing 3PAR technology as a cloud: since they own the thing, their support costs would be a fraction of what their customers pay, and they would be able to consume unlimited software for nothing. Software typically makes up at least half the cost of a 3PAR system (the SPC-1 results and costs, of course, only show the bare minimum software required). Their hardware costs would be significantly less as well, since they would not need much (any?) margin on it.
I remember SAVVIS a few years ago wanting to charge me ~$200,000/year for 20TB usable on 10k RPM storage on a 3PAR array, when I could have bought 20TB usable on 15k RPM storage on a 3PAR array (+ hosting costs) for less than one year's costs at SAVVIS. I heard similar stories from 3PAR folks, where customers would go out to the cloud to get pricing, thinking it might be cheaper than doing it in house, but always came back able to show massive cost savings by keeping things in house.
They are also in a good position, as a large server manufacturer, to get amazing discounts on all of their stuff, and again of course don't have to make as much margin for these purposes (I imagine at least). Of course, it's a double-edged sword: trying to compete in that space risks pissing off current and potential customers that may use your equipment.
I still have hope that, given HP's strong presence in the enterprise, their in-house technology, and their technology partners, they will at some point offer an enterprise-grade cloud: something where I can allocate a set amount of CPU and memory, maybe even get access to a 3PAR array using their Virtual Domain software, and then provision whatever I want within those resources. Billing would be based on some combination of a fixed price for base services and a variable price based on actual utilization (bill for what you use rather than what you provision), with perhaps some minimum usage thresholds (because someone has to buy the infrastructure to run the thing). So say I want a resource pool with 1TB of ram and 500GHz of CPU. Maybe I am forced to pay for 200GB of ram and 50GHz of CPU as a baseline, and then anything above that is measured and billed accordingly.
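That baseline-plus-metered model is easy enough to sketch. All the rates and units below are made up for illustration - the shape of it is just "pay max(usage, baseline) per resource":

```ruby
# A sketch of the billing model described above - pay a fixed baseline
# commitment, then metered usage for anything over it. All rates and
# minimums here are made-up illustrative numbers.
def monthly_bill(used_ghz, used_gb_ram,
                 base_ghz: 50, base_gb: 200,   # the forced minimums
                 ghz_rate: 2.0, gb_rate: 1.0)  # assumed $/GHz and $/GB
  billable_ghz = [used_ghz, base_ghz].max
  billable_gb  = [used_gb_ram, base_gb].max
  billable_ghz * ghz_rate + billable_gb * gb_rate
end

monthly_bill(30, 100)   # an idle pool still pays the 50GHz/200GB baseline
monthly_bill(400, 800)  # a busy pool pays for what it actually used
```

The baseline covers the provider's fixed infrastructure cost, while everything above it tracks real consumption rather than provisioned capacity - which is the whole point.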
Don't let me down HP.
I wrote a couple of times about the return of 10GbaseT, a standard that tried to come out a few years ago but for various reasons didn't quite make it. I just noticed that two new 10GbaseT switching products were officially announced a few days ago at Interop Las Vegas. They are, of course from Extreme, and they are, of course not shipping yet (and knowing Extreme's recent history with product announcements it may be a while before they do actually ship - though they say for the 1U switch by end of year).
The new products are:
- 48 port 10Gbase-T module for the Black Diamond X-series - for up to 384 x 10GbaseT ports in a 14U chassis - note this is of course half the density you can achieve using the 40GbE modules and break out cables, there's only so many plugs you can put in 14U!
- Summit X670V-48t (I assume that's what it'll be called) - a 48-port 10GbaseT switch with 40GbE uplinks (similar to the Arista 7100 - the only 48-port 10GbaseT switch I'm personally aware of - just with faster uplinks and I'm sure there will be stacking support for those that like to stack)
From this article, a list price of about $25k is claimed for the 1U switch, which is a good price - about the same as the existing 24-port X650 10GbaseT product. It's also in line with the current generation X670V-48x, a 48-port SFP+ product, so there's little to no premium for the copper, which is nice to see! (Note there is a lower-cost X670 (non-"V") without 40GbE capability, available for about half the cost of the "V" model.)
Much of the hype seems to be around the new Intel 10Gbase-T controller that is coming out with the latest CPUs from them.
With the Intel Ethernet Controller X540, Intel is delivering on its commitment to drive down the costs of 10GbE. We’ve ditched two-chip 10GBASE-T designs of the past in favor of integrating the media access controller (MAC) and physical layer (PHY) controller into a single chip. The result is a dual-port 10GBASE-T controller that’s not only cost-effective, but also energy-efficient and small enough to be included on mainstream server motherboards. Several server OEMs are already lined up to offer Intel Ethernet Controller X540-based LOM connections for their Intel Xeon processor E5-2600 product family-based servers.
Broadcom also recently announced (and is perhaps already shipping?) their own next-generation 10GbaseT chips, built for LOM (among other things), which apparently can push power utilization down to under 2W per port using a 10-meter mode - and 10m is plenty long enough for most connections, of course! Given that Broadcom also has a quad-port version of this chipset, could they be the ones powering the newest boxes from Oracle?
Will Broadcom be able to keep their strong hold on the LOM market (I really can't remember the last time I came across Intel NICs on motherboards outside of maybe Supermicro or something)?
So the question remains: when will the rest of the network industry jump on board, after having been burned somewhat in the past by the first iterations of 10GbaseT?