The whole SDN thing has bugged me for a long time now. The amount of hype behind it has driven me mad. I have asked folks to explain to me what SDN is, and I have never really gotten a good answer. I have a decent background in networking but it's never been my full time responsibility (nor do I wish it to be).
I am happy to report that it seems I am not crazy. Yesterday I came across an article on Slashdot from the inventor of OpenFlow, the same guy that sold his little networking startup Nicira for a cool $1.2B (and people thought HP paid too much for 3PAR).
Reading that made me feel better.
On top of that our friends over at El Reg recently described SDN as an industry "hype gasm". That too was refreshing to see. Finally more people are starting to cut through the hype.
I've always felt that the whole SDN thing that is going on is entirely too narrow in vision - seemingly focused almost entirely on switching & routing. Most of the interesting stuff happens higher up in the advanced layer 7 load balancing where you have more insight as to what is actually traversing the wire from an application perspective.
I have no doubt that the concepts behind SDN will be/are very useful for massive scale service providers and such (though somehow they managed without it as it is trying to be defined now anyway). I don't see it as very useful for most of the rest of organizations, unlike say virtualized storage.
I cringed badly when I first saw the term software defined storage last year. It just makes me shudder at the amount of hype people might try to pump into that. HP seems to be using this term more and more often. I believe others are too, though I can't bring myself to Google the term.
Maybe I'm not alone in this world after all. I have ranted and raved about how terrible Amazon's cloud is for years now. I used it at two different companies for around two years (it has been almost a year since I last used it at this point) and it was by far the worst experience of my professional career. I could go on for an entire afternoon listing all the problems and missing abilities, features etc. that I experienced -- not to mention the costs.
A week ago I quoted Piston’s CTO saying that there was a “giant explosion” of companies moving off of Amazon Web Services (AWS). At the time, I noted that he had good reason to say that, since he started a company that builds software used by companies to build private clouds.
Enterprises that are moving to private clouds tend to be those that had developers start using the cloud without permission.
Other businesses are “trying to get out,” he said. AWS has made its compute services very sticky by making them difficult for users to remove workloads like databases to run them elsewhere.
Myself I know of several folks who have stuff in Amazon, and for the most part the complaints are similar to mine. Very few that I have heard of are satisfied. The people that seem to be satisfied (in my experience) are those that don't see the full picture, or don't care. They may be satisfied because they don't want to worry about infrastructure no matter the cost, or they want to be able to point the finger at an external service provider when stuff breaks (Amazon's support is widely regarded as worthless).
"We have thousands of AWS customers, and we have not found anyone who is happy with their tech support," says Laderman.
I was at a company paying six figures a month in fees and they refused to give us any worthwhile support. Any company in the enterprise space would have been more than happy to permanently station an employee on site to make sure the customer is happy for that kind of money. Literally everyone who used the Amazon stuff in the company hated it, and the company wanted Amazon to come help show us the way -- and they said no.
I am absolutely convinced (as I've seen it first and second hand) in many cases the investors in the startups have conflicts of interest and want their startups to use Amazon because the investors benefit from them growing as well. Amazon then uses this marketing stuff to pimp to other customers. This of course happens all over the place with other companies, but there are a lot of folks that are invested in Amazon relatively speaking compared to most other companies.
There's no need for me to go into specifics as to why Amazon sucks here - for those you can see some of the past posts. This is just a quickie.
Anyway, that's it for now.. I saw the article and it made me smile.
HP made a little bit of headlines recently when they officially unveiled their first set of ultra dense micro servers, under the product name Moonshot. Originally speculated as likely being an ARM-platform, it seems HP has surprised many in making this first round of products Intel Atom based.
They are calling it the world's first software defined server. Ugh. I can't tell you how sick I feel whenever I hear the term software defined <insert anything here>.
In any case I think AMD might take issue with that, with their SeaMicro unit which they acquired a while back. I was talking with them as far back as 2009 I believe and they had their high density 10U virtualized Intel Atom-based platform (I have never used SeaMicro though I knew a couple of folks that worked there), complete with integrated switching, load balancing and virtualized storage (the latter two HP is lacking).
Unlike legacy servers, in which a disk is unalterably bound to a CPU, the SeaMicro storage architecture is far more flexible, allowing for much more efficient disk use. Any disk can mount any CPU; in fact, SeaMicro allows disks to be carved into slices called virtual disks. A virtual disk can be as large as a physical disk or it can be a slice of a physical disk. A single physical disk can be partitioned into multiple virtual disks, and each virtual disk can be allocated to a different CPU. Conversely, a single virtual disk can be shared across multiple CPUs in read-only mode, providing a large shared data cache. Sharing of a virtual disk enables users to store or update common data, such as operating systems, application software, and data cache, once for an entire system
Really the technology that SeaMicro has puts the Moonshot Atom systems to shame. SeaMicro has the advantage that this is their 2nd or 3rd (or perhaps more) generation product. Moonshot is on its first gen.
Moonshot provides 45 hot pluggable single socket dual core Atom processors, each with 8GB of memory and a single local disk in a 4.5U package.
SeaMicro provides up to 256 sockets of dual core Atom processors, each with 4GB of memory and virtualized storage. Or you can opt for up to 64 sockets of either quad core Intel Xeon or eight core AMD Opteron, with up to 64GB/system (32GB max for Xeon). All of this in a 10U package.
Let's expand a bit more - Moonshot can get 450 servers (900 cores) and 3.6TB of memory in a 47U rack. SeaMicro can get 1,024 servers (2,048 cores) and 4TB of memory in a 47U rack. If that is not enough memory you could switch to Xeon or Opteron with a similar power profile, at the high end 2,048 Opteron cores with 16TB of memory (AMD uses a custom Opteron 4300 chip in the SeaMicro system - a chip not available for any other purpose). Or maybe you mix/match. There are also fewer systems to manage - HP having 10 systems per rack, and SeaMicro having 4. I harped on HP's SL-series a while back for similar reasons.
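For anyone who wants to check my math, here is a rough sketch of the per-rack numbers in Python. The chassis sizes, socket counts and memory figures are the ones quoted above; the function name and the assumption that every chassis is fully populated and nothing else takes up rack space are mine.

```python
RACK_U = 47  # rack height used in the comparison above


def rack_totals(chassis_u, servers_per_chassis, cores_per_server, gb_per_server):
    """Return (chassis, servers, cores, memory_gb) for one fully populated rack.

    Assumes only whole chassis count and that no rack units are lost to
    switches, PDUs and so on (a simplification).
    """
    chassis = int(RACK_U // chassis_u)
    servers = chassis * servers_per_chassis
    return chassis, servers, servers * cores_per_server, servers * gb_per_server


# Moonshot: 4.5U chassis, 45 dual core Atom servers, 8GB each
moonshot = rack_totals(4.5, 45, 2, 8)
# SeaMicro with Atom: 10U chassis, 256 dual core sockets, 4GB each
seamicro_atom = rack_totals(10, 256, 2, 4)
# SeaMicro with Opteron: 10U chassis, 64 eight core sockets, 64GB each
seamicro_opteron = rack_totals(10, 64, 8, 64)

for name, totals in [("Moonshot", moonshot),
                     ("SeaMicro Atom", seamicro_atom),
                     ("SeaMicro Opteron", seamicro_opteron)]:
    chassis, servers, cores, memory_gb = totals
    print(f"{name}: {chassis} chassis, {servers} servers, "
          f"{cores} cores, {memory_gb} GB")
```

The memory comes out in GB, so the 3.6TB, 4TB and 16TB figures above are the same numbers rounded to TB.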
SeaMicro also has dedicated external storage which I believe extends upon the virtualization layer within the chassis but am not certain.
All in all it appears SeaMicro has been years ahead of Moonshot before Moonshot ever hit the market. Maybe HP should have scrapped Moonshot and taken out SeaMicro when they had the chance.
At the end of the day I don't see anything to get excited about with Moonshot - unless perhaps it's really cheap (relative to SeaMicro anyway). The micro server concept is somewhat risky in my opinion. I mean if you really got your workload nailed down to something specific and you can fit it into one of these designs then great. Obviously the flexibility of such micro servers is very limited. SeaMicro of course wins here too, given that an 8 core Opteron with 64GB of memory is quite flexible compared to the tiny Atom with tiny memory.
I have seen time and time again people get excited about this and say oh how they can get so many more servers per watt vs the higher end chips. Most of the time they forget to realize how few workloads are CPU bound, and simply slapping a hypervisor on top of a system with a boatload of memory can get you significantly more servers per watt than a micro server could hope to achieve. HOWEVER, if your workload can effectively exploit the micro servers, drive utilization up etc, then it can be a real good solution -- in my experience those sorts of workloads are the exception rather than the rule, I'll put it that way.
It seems that HP is still evaluating whether or not to deploy ARM processors in Moonshot - in the end I think they will - but won't have a lot of success - the market is too niche. You really have to go to fairly extreme lengths to have a true need for something specialized like ARM. The complexities in software compatibility are not trivial.
I think HP will not have an easy time competing in this space. The hyper scale folks like Rackspace, Facebook, Google, Microsoft etc all seem to be doing their own thing, and are unlikely to purchase much from HP. At the same time there of course is Seamicro, amongst other competitors (Dell DCS etc) who are making similar systems. I really don't see anything that makes Moonshot stand out, at least not at this point. Maybe I am missing something.
I don't know what is going on but for some reason this blog has been getting a lot more SPAM comments recently. I mean normally Akismet takes care of everything and MAYBE one gets through a MONTH, but eleven have gotten through today alone (update: now 14).
I haven't been keeping track, but that little counter on the right side is up to almost 75,300 now -- the last time I recall noticing it I thought it was below 30,000..
The Akismet plugin says it is operational and the API key I am using is valid, and all servers are reachable.
I wonder what is going on, maybe today is just my lucky day.
The theme for this post is: BASIC STUFF. This is not rocket science.
A while back I wrote a post (wow has it really been over a year since that post!) about Chef and my experience with it for what was at the time the past two years. I think I chose a good title for it -
Making the easy stuff hard, and the hard stuff possible
Which still sums up my thoughts today. This post was inspired by something I just read on the Opscode Chef status site.
While I'm on the subject of that damn status site I'll tell you what - I filed a support ticket with them back in AUGUST 2012 - yes people that is EIGHT MONTHS ago, to report to them that their status site doesn't #@$@ work. Well at least most of the time it doesn't #@$@! work. You see a lot of the time the site returns an invalid Location: header which is relative instead of absolute, and standards based browsers (e.g. Firefox) don't like that, so I get a pretty error message that basically says the site is down. I can usually get it to load after forcing a refresh 5-25 times.
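To show what I mean by relative vs absolute, here is a minimal Python sketch. The host name and path are made up for illustration (I am not reproducing the actual header the status site sends), and the two helper functions are my own names. The HTTP/1.1 spec at the time (RFC 2616) called for an absolute URI in Location; a lenient client can work around a relative one by resolving it against the URL it requested.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute(location):
    """True if a Location value carries both a scheme and a host."""
    parts = urlsplit(location)
    return bool(parts.scheme and parts.netloc)


def resolve_location(request_url, location):
    """Resolve a (possibly relative) Location value against the request URL."""
    return urljoin(request_url, location)


# Hypothetical values, for illustration only
request_url = "http://status.example.com/"
bad_header = "/maintenance"

print(is_absolute(bad_header))
print(resolve_location(request_url, bad_header))
```

The real fix is on the server side of course: send the full URL in the header in the first place.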
I first came across this site when Opscode was in the midst of a fairly major outage. So naturally I feel it's important that the web site that hosts your status page work properly. So I filed the ticket, after going back and forth with support, I determined the reason for the browser errors and they said they'd look into it. There wasn't a lot they claimed they could do because the site was hosted with another provider (Tumbler or something??).
That's no excuse.
So time passes, and nothing gets done. I mentioned a while back I met some of the senior Opscode staff a few years ago, so I directly reached out to the Chief Operating Officer of Opscode (who is a very technical guy himself) to plead with him to FIX THE DAMN SITE. If Tumbler is not working then host it elsewhere, it is trivial to set up that sort of site, I mean just look at the content on the site! I was polite in my email to him. He responded and thanked me.
So more time passes, and nothing happens. So in early January I filed another support ticket outlining the reason behind their web site errors and asked that they fix their site. This time I got no reply.
More time passes. I was bored tonight so I decided to hit the site again, guess what? Yeah, they haven't done squat.
How incompetent are these people? Sorry maybe it is not incompetence but laziness. If you can't be bothered to properly operate the site take the site down.
So anyway I was on their site and noticed this post from last week
Chef 0.9.x Client EOL
Since we stopped supporting Chef 0.9.x June 11, 2012 we decided it is a good time to stop all API support for Chef 0.9.x completely.
Starting tomorrow the api.opscode.com service will no longer support requests from Chef 0.9.x clients.
I mean it doesn't take a rocket scientist to read that and immediately think how absurd it is. It's one thing to say you are going to stop supporting something, that is fine. But to say OH WE DECIDED TO STOP SUPPORT, TODAY IS YOUR LAST DAY is something else.
So I go to the page they reference above and it says
On or after June 11th, we’ll deploy a change to Hosted Chef that will disable all access to Hosted Chef for 0.9 clients, so you will want to make sure you’ve upgraded before then.
Last I checked, it is nowhere near June 11th (now that I think of it maybe they meant last year, they don't say for sure). In any case there was extremely poor notification on this - and how much work does it take to maintain servers running Chef 0.9? So you can stop development on it, no new patches. Big deal.
This has absolutely no impact on anything I do because we have been on Chef 0.10 forever. But the fact they would even consider doing something like this just shows how poorly run things are over there.
How can they expect customers to take them seriously by doing stuff like this? This is BASIC STUFF. REAL BASIC.
Something else that caught my eye recently as I was doing some stuff in Chef, was their APIs seemed to be down completely. So I hopped on over to the status site after forcing a refresh a dozen or more times to get it to load and saw
Hosted Chef Maintenance Underway
The following systems are unavailable while Hosted Chef is migrated from MySQL to PostgreSQL.
- The Hosted Chef Platform including the API and Management Console
- Opscode Support Ticketing System
- Chef Community Site
Apparently they had announced it on the site one or more days prior (can't tell for sure now since both posts say posted 1 week ago). But they took the APIs down at 2:00 PM Pacific time! (They are based in Seattle so that's local time for them.) Who in their right mind takes their stuff down in the middle of the afternoon intentionally for a data migration? BASIC STUFF PEOPLE. And their method of notification was poor as well, nobody at my company (we are a paying customer) had any idea it was happening. Fortunately it had only a minor impact on us. I just got lucky when I happened to try to use their API at the exact moment they took it down.
Believe me there are plenty of times when one of our developers comes up to me and says OH #@$ WE NEED THIS CONFIGURATION SETTING IN PRODUCTION NOW! As you might imagine most of that is in Chef, so we rely on that functioning for us at all times. Unscheduled down time is one thing, but this is not excusable. At the very least you could migrate customers in smaller batches (with downtime for any given customer measured in seconds - maybe the really big customers take longer but they can work with those individually to schedule a good time). If they didn't build the product to do that they should go back to the drawing board.
My co-worker was recently playing around with a slightly newer build of Chef 0.10.x that he thinks we should upgrade to (ours is fairly out of date - primarily because we had some major issues on a newer build at the time). He ran into a bunch of problems including Opscode changing some major things around within a minor release breaking a bunch of stuff. Just more signs of how cavalier they are, typical modern "web 2.0" developer types, that don't know anything about stability.
Maybe I was lucky, I don't know. But I basically ran the same version of CFengine v2 for nearly 7 years without any breakage (hell I can't remember encountering a single issue I considered a bug!), across three different companies. I want my configuration system to be stable, fast and simple to manage. Chef is none of those, the more I use it the more I dislike it. I still believe it is a good product and has its niche, but it's got a looooooooong way to go to win over people like me.
As a CFengine employee put it in my last post, Chef views things as configuration as code, and CFengine views them as configuration as documentation. I'm far in the documentation camp. I believe in proper naming conventions whether it is servers, or load balancer addresses, or storage volumes, mount points on servers etc. Also I believe strongly in a good descriptive domain name (have always used the airport codes like most other folks). None of this randomly generated crap (here's looking at you Amazon). If you are deploying 10,000 servers that are doing the same thing you can still number them in some sort of sane manner. I've always been good at documentation, it does take work, and I find more often than not most people are overwhelmed by what I write (you may get the idea with what I have written here) so they often don't read it -- but it is there and I can direct them to it. I take lots of screen shots and do a lot of step by step guides.
On a side note, this configuration as documentation is a big reason why I do not look forward to IPv6.
Chef folks will say go read the code! That can be a pretty dangerous thing to say, really, it is. I mean just yesterday or was it the day before, I was trying to figure out how a template on a server was getting a particular value. Was it coming from the cookbook attributes? from the role? from the environment? I looked everywhere and I could not find the values that were being populated -- and the values I specified were being ignored. So I passed this task to my co-worker who I have to acknowledge has been a master in Chef, he has written most of what we have, and while I can manage to tweak stuff here and there, the difficult stuff I give him because if I don't my fist will go through the desk or perhaps the monitor (desk is closer), after a couple hours working with Chef. A tool is not supposed to make you get so frustrated.
So I ask him to look into it, and quickly I find HIM FIGHTING CHEF! OH MY THE IRONY. He was digging up and down and trying to set things but Chef was undoing them and he was cursing and everything. I loved it. It's what I go through all the time. After some time he eventually found the issue, the values were being set in another cookbook and they conflicted.
So he worked on it for a bit, and decided to hard code the values for a time while he looked into a better solution. So he deployed this better solution and it had more problems. The most recent thing is for some reason Chef was no longer able to successfully complete a run on certain types of servers (other types were fine though). He's working on fixing it.
I know he can do it, he's a really smart guy. I just wanted to write about that story - I'm not the only one that has these problems.
Sure I'd love to replace Chef with something else. But it's not a priority I want to try to shove in my boss' face (who likes the concept of Chef). I have other fish to fry, and as long as I have this guy doing the dirty work well it's not as much of a pain for me.
Tracking down conflicting things in CFengine was really simple for me - probably because I wasn't trying anything too over the top with configuration. Opscode guys liked to say, oh wouldn't it be great if you could have one configuration stanza that could adapt to ANY SITUATION.
I SAY NO. ---- IT! IS! NOT! GREAT!
It might be nice in some situations but in many others it just gives me a headache. I like to be able to look at a config and say THAT IS GOING TO SERVER X, EXACTLY HOW IT SITS NOW. Sure I have to duplicate configs and files for different environments and such but really at the end of the day - at all of the companies I have worked at -- IT'S NOT A BIG DEAL. In the grand scheme of things. If your configuration is so complex that you need all of this maybe you should step back and consider if you are doing something wrong - does it really need to be that complex? Why?
Oh and don't get me started on that #$@ damn ruby syntax in the Chef configuration files. Oh you need quotes around a string that is nothing more than a word? You puke with a cryptic stack trace if you don't have that? Oh you puke with a cryptic stack trace unless these two configuration settings are on their own lines? Come on, this is stupid. I go back to this post on Ruby, how I am reminded of it almost every time I use Chef. I had to support Ruby+Rails apps back from 2006-2008 and it was a bad experience. Left a bad taste in my mouth for Ruby. Chef just keeps on piling on the crap. I'll fully admit I am very jaded against Ruby (and Chef for that matter). I think for good reason. How's that saying go? Burn me once shame on you, burn me 500 times shame on me?
With the background that some of these folks have at Opscode it's absolutely stunning to me the number of times they have shot themselves in the feet over the past few years, on such BASIC THINGS. Maybe that's how things are done at the likes of Amazon, I don't know, I never worked there (knew many that did and do though, general consensus is stay away).
In my neck of the woods people take more care in what they do.
I'll end this again by mentioning I could train someone on CFEngine in an afternoon, Chef - here I am 2 and a half years later and still struggling.
(In case you're wondering YES I run Ubuntu 10.04 LTS on my laptop and desktop (guess what - it is about to go EOL too) - I have no plans to change, because it's stable, and it does the job for me. I run Debian STABLE on my servers because - IT'S STABLE. No testing, no unstable, no experimental. Tried and true. The new UI stuff in the newer Ubuntu is just scary for me, I have no interest in trying it.)
Ok that's enough for this rant I guess. Thanks for listening.
Just a quick note -- I am in the midst of upgrading this server from 32-bit Debian to 64-bit. I really didn't think I needed 64-bit, but as time has gone on the processes on this system seem to have outgrown the 32-bit kernel. I recently doubled the memory size on the host server to 16GB, so there's plenty of RAM to go around for the moment.
If you see anything around here that appears more broken than normal let me know, thanks.
El Reg a few days ago posted some commentary from the CTO of Rackspace
Major adoption of public cloud computing services by large companies won't happen until the current crop of IT workers are replaced by kiddies who grew up with Facebook, Instagram, and other cloud-centric services
Which is true to some extent -- though I still feel the bigger issues with the public cloud are cost and features. If a public cloud company can offer comparable capabilities vs operating in house at a comparable (or less - given the cloud company should have bigger economies of scale) cost then I can see cloud really taking off.
As I've harped on again and again - one of the key cost things would be billing based on utilization, not based on what is provisioned (you could have a minimum commit rate as is often negotiated in deals for internet bandwidth). If you provision a 2 CPU VM with 6GB of memory and 90% of the time it sits at 1% CPU usage and 1GB of memory then you should not be charged the same as if you were consuming 100% CPU and 95% memory.
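Here is a rough sketch of what I have in mind, in Python. The VM size and the utilization numbers are the ones from my example; the rates, the 720 hour month, the hourly sampling and the function name are all made up for illustration and are not anybody's real pricing.

```python
HOURS_PER_MONTH = 720  # assumed 30-day month

# Hypothetical rates, not any real provider's pricing
CPU_RATE = 0.02   # dollars per CPU-hour actually consumed
GB_RATE = 0.005   # dollars per GB-hour of memory actually consumed


def monthly_bill(samples, cpu_rate, gb_rate, min_commit=0.0):
    """Bill on measured usage, with an optional minimum commit.

    samples is a list of (cpus_used, gb_used) tuples, one per hour.
    """
    usage = sum(cpus * cpu_rate + gb * gb_rate for cpus, gb in samples)
    return max(usage, min_commit)


# 2 CPU / 6GB VM: idle at 1% CPU and 1GB, busy at 100% CPU and 95% memory
idle_hour = (2 * 0.01, 1.0)
busy_hour = (2 * 1.00, 6 * 0.95)

idle_hours = HOURS_PER_MONTH * 9 // 10        # idle 90% of the time
busy_hours = HOURS_PER_MONTH - idle_hours
samples = [idle_hour] * idle_hours + [busy_hour] * busy_hours

by_usage = monthly_bill(samples, CPU_RATE, GB_RATE)
by_provisioned = monthly_bill([(2, 6)] * HOURS_PER_MONTH, CPU_RATE, GB_RATE)

print(f"billed on utilization: ${by_usage:.2f}")
print(f"billed on provisioned: ${by_provisioned:.2f}")
```

The min_commit argument is the equivalent of the minimum commit rate on a bandwidth deal: the provider still gets a floor, and the customer only pays more when the VM actually does more.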
Some folks think it is a good idea to host non production stuff in a cloud and host production in house -- to me non production is where even more of the value comes from. Most of the time the non production environments (at least at the companies I have worked at in the past decade) operate at VERY low utilization rates 99.9% of the time. So they can be over subscribed even more. At my organization for example we started out with basically two or three non production environments, now we are up to 10, and the costs to support the extra 7-8 were minimal (relative to hosting them in a public cloud). For the databases I set up a snapshot system for these environments so not only can we refresh the data with minimal downtime to the environments (about 2 minutes each vs a full day each) but each environment typically consumes less than 10% of the disk space that would normally be consumed had the environment had a full independent copy of the data.
Another thing is give the customers the benefit of things like thin provisioning, data compression, deduplication. Some workloads behave better than others, so present this utilization data to the customer and include it in the billing. Myself I like to provision multi TB volumes for almost everything, and I use LVM to restrict their growth. So if the time comes and some volume needs to get bigger I just lvextend the volume and resize the file system (both are online operations), and I don't have to touch the hypervisor, the storage, or anything. If some application may need a massive amount of storage (have not had one that did yet that used storage through the hypervisor) -- as in many many TB -- then I could allocate many volumes at once to the system, and grow them the same way over time. Perhaps a VM would have 2 to 10TB of space provisioned to it but may only use a few hundred gigs for the next year or so -- nothing is wasted, because the excess is not used. There's no harm in doing that. Though I have not seen or heard of a cloud company that offers something like this. I think a large chunk of the reason is the technology doesn't exist yet to do it for hundreds or thousands of small/medium customers.
Most important of all - the cloud needs to be a true utility - 99.99% uptime demonstrated over a long period of time. No requirements for "built to fail", all failures should be transparent to the end user. Bonus points if you have the ability to have VMware-style fault tolerance (though something that can support multi CPUs) with millisecond fail over w/o data loss. It will take a long time for the IaaS of the world to get there, but perhaps SaaS can be there already. PaaS I'm not sure, I've never dealt with that though. All of the major IaaS companies have had large scale outages and/or degraded performance.
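To put the 99.99% number in perspective, here is a quick back of the envelope calculation in Python. The function name is mine, and it simply treats every minute of the period as counting toward uptime (no carve outs for scheduled maintenance).

```python
def downtime_minutes(uptime_percent, days=365):
    """Minutes of downtime allowed over a period at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


for target in (99.9, 99.99):
    per_year = downtime_minutes(target)
    per_month = downtime_minutes(target, days=30)
    print(f"{target}% uptime: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

At 99.99% that works out to under an hour of downtime for the entire year, which is why a single large scale outage is enough to blow the budget.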
The one area where public cloud does well is the ability to get something from nothing up and going quickly, or perhaps up and going in a part of the country or world in which you don't have a facility. Though the advantage there isn't all that great. Take my company back when we were hosted at Amazon on the east coast: the time came to bring up a site for our UK customers and we decided to host it on the east coast because the time frame to adapt everything (Chef etc) to work properly in another Amazon region was too tight to pull off. So we never used that region. Eventually we provisioned real equipment which I deployed in Amsterdam last summer to replace the last of our Amazon stuff.
Another article on a similar topic, this time from ComputerWorld, which noted the shift from in house data centers to service providers, though it seems more focused on literally in house data centers (vs "in house" with a co-location provider). They cite lack of available talent to manage these resources. These employees would rather work for a larger organization with more career opportunities than a small shop.
I'm sort of the opposite -- I would not like to work for a large company of any kind. Much prefer small companies, with small teams. The average team size I have worked in since 2006 has been 3 people. The amount of work required to maintain our own infrastructure is actually quite a bit less than managing cloud stuff.
I guess I am the exception rather than the rule again here. I had my annual review recently and in it I wrote there was no career advancement for me at the current company, I had higher growth expectations of the company I am at -- but I am not complaining. I'll admit that the stuff I am doing now is not as exciting as it has been in the past. I'm fairly certain we could not hire someone else in the team because they would get bored and leave quickly. Me -- at least for now -- I don't mind being bored. It is a good change of pace after my run in the trenches along the front lines of technology from 2003-2011. I could do this for another year I imagine (if not longer).
As I watch the two previous companies I worked for wither and die slow deaths (and the one before them died years ago -- so basically all the jobs I had from 2006-2011 were at companies that are dead or dying) it's a good reminder to me to be thankful for where I am at. Still a small growing company with a good culture, good people, and everything runs really really well (sometimes so well it sort of scares me for some reason).
Another good reminder is I had lunch with a couple of friends while up in Seattle -- they work for a company that has been on its death bed for years now. I asked them what is keeping the company going and they said hope, or something like that (I also never knew why they stuck around for as long as they have). Not long after I left, the company laid off a bunch of folks (they were not included in the layoff). The company is dying every bit as much as the other two companies I worked for. I guess the main difference is I decided to jump ship long ago while they stuck it out for reasons unknown.
Time to close techopsguys?
I apologize again for not posting nearly as much as I used to -- there just continues to be a dearth of topics and news that I find interesting in recent months. I am not sure if it is me that is changing or if things have really gotten boring since the new year. I have contemplated closing the blog entirely, just to lower people's expectations (or eliminate them) about seeing new stuff here. I've poured myself out all over this site the past few years and it's just become really hard to find things to write about now. I don't want the site to turn into a blog that is updated a couple times a year.
So I will likely close it in the coming months unless the situation changes. It has been a good run, from an idea from my former co-workers that I thought I'd be a minor contributor on to a full fledged site where I wrote nearly 400 articles, and a few hundred thousand words. Wow that is a lot.. My former co-workers bailed on the site years ago citing lack of time. Time is certainly something I have, what I have more of is a lack of things to write about.
I've had an offer to become an independent contributor to our friends over at El Reg - something that sounded cool at first, though now that I've thought about it I am not going to do it. I don't feel comfortable about the level of quality I could bring (granted not all of their stuff is high quality either but I tend to hold myself to high standards). On a personal blog I can compromise more on quality, lean into more of my own personal biases, and I have less responsibility in general.
I have seen them take on a couple other bloggers such as myself in recent months and have noticed the quality of their work is not good. In some cases it is sort of depressing (why would you write about that?????????). That sort of stuff belongs on a personal blog not on a news site.
I'll have to settle for the articles where they mentioned my name in them, those I am still sort of proud of for some reason.