(You can see part two of my thoughts on containers here.)
I'll probably regret this post for a little while at least. Because I happened to wake up at about 3AM this morning and found myself not really falling back asleep quickly and I was thinking about this article I read last night on Docker containers and a couple skype chats I had with people regarding it.
Like many folks, I've noticed a drastic increase in the hype around containers, specifically Docker stuff over the past year or so. Containers are nothing new, there have been implementations of them for a decade on Solaris anyway. I think Google has been using them for about a decade, and a lot of the initial container infrastructure in Linux (cgroups etc) came from Google. Red Hat has been pushing containers through their PaaS system Openshift I want to say for at least five years now since I first started hearing about it.
My personal experience with containers is limited - I have deployed a half dozen containers in production for a very specific purpose a little over a year ago using LXC on Ubuntu. No Docker here, I briefly looked into it at the time and saw it provided no value to what I do so I went with plain LXC. The containers have worked well since deployment and have completely served their purpose. There are some serious limitations to how containers work(in the Linux kernel) which today prevent them from being used in a more general sense, but I'll get to that in another post perhaps (this one ended up being longer than I thought). Perhaps since last year's deployment some of those issues have been addressed I am not sure. I don't run bleeding edge stuff.
I'll give a bit of my own personal experience here first so you get an idea where I am coming from. I have been working in technical operations for organizations (five of them at this point) running more or less SaaS services (though internally I've never heard that term tossed about at any company, we do run a hosted application for our customers) for about 12 years now. I manage servers, storage, networking, security to some degree, virtualization, monitoring etc(the skills and responsibilities have grown over time of course). I know I am really good at what I do(tried to become less modest over recent couple of years). The results speak for themselves though.
My own personal experience again here - I can literally count on one hand the number of developers I have worked with over the years that stand out as what I might call "operationally excellent". Fully aware of how their code or application will work in production and builds things with that in mind, or knows how to engage with operations in really productive way to get questions answered on to how best to design or build things. I have worked with dozens of developers(probably over 100 at this point), some of them try to do this, others don't even bother for some reason or another. The ones I can count on one hand though, truly outstanding, a rare breed.
Onto the article. It was a good read, my only real question is does this represent what a typical Docker proponent thinks of when they think of how great Docker is or how it's the future etc. Or is there a better argument. Hopefully this represents what a typical Docker person thinks so I'm not wasting my time here.
So, to address point by point, try to keep it simple
Up until now we’ve been deploying machines (the ops part of DevOps) separately from applications (the dev part). And we’ve even had two different teams administering these parts of the application stack. Which is ludicrous because the application relies on the machine and the OS as well as the code, and thinking of them separately makes no sense. Containers unify the OS and the app within the developer’s toolkit.
This goes back to experience. It is quite ludicrous to expect the developers to understand how to best operate infrastructure components, even components to operate their own app (such as MySQL) in a production environment. I'm sure there are ones out there that can effectively do it(I don't claim to be a MySQL expert even myself having worked with it for 15 years now I'll happily hand that responsibility to a DBA as do most developers I have worked with), but I would wager that number is less than 1 in 50.
Operating things in a development environment is one thing, go at it, have a VM or a container or whatever that has all of your services. Operating correctly in production is a totally different animal. In a good production environment (hopefully in at least one test environment as well) you have significantly more resources to throw at your application to get more out of it. Things that are just cost prohibitive or even impossible to deploy at a tiny scale in a development environment(when I say development environment I imply that it runs on a developer laptop or something). Even things like connectivity to external dependencies likely don't exist in a development environment. For all but the most basic of applications production will always be significantly different in many ways. That's just how it is. You can build production so it's really close to other environments or even exactly the same but then you are compromising on so much functionality, performance, scalability that you've really just shot yourself in the foot and you should hope you don't get to any kind of thing resembling scale (not "web scale" mind you) because it's just going to fall apart.
Up until now, we’ve been running our service-oriented architectures on AWS and Heroku and other IaaSes and PaaSes that lack any real tools for managing service-oriented architectures. Kubernetes and Swarm manage and orchestrate these services
First off I'm happy to admit I've never heard of Kubernets and Swarm, I have heard of Heroku but no idea what it does. I have used AWS in the past (for about 2 years - worst experience of my professional career, I admit I do have PTSD when it comes to Amazon cloud).
I have worked with service-oriented architectures for the past 12 years. My very first introduction to SaaS was an EXTREMELY complicated Java platform that ran primarily on Weblogic+Oracle DB on the back end, with Apache+Tomcat on the front end. Filled with Enterprise Java Beans(EJB), and just dozens of services. Their policy was very tight, NOTHING updates the DB directly without going through a service. No "manual" fixes or anything via SQL(only company I've worked at with that kind of policy). Must write an app or something to interface with a service to fix issues. They stuck to it from what I recall while I was there anyway, I admire them for that.
At one point I took my knowledge of the app stack, and proposed an very new architecture for operational deployment, it was much more expensive, because this was before wide spread use of VM technology or containers in general. We split the tomcat tiers up for our larger customers into isolated pools(well over 200 new physical servers! that ran at under 5% cpu in general!). The code on all systems was the same but we used the load balancer to route traffic for various paths to different sets of servers. To some extent this was for scaling but the bigger problem this "solved" was something more simple operationally (but was not addressed to-date in the app) - logging. This app generated gobs of logging from tons of different subsystems (all of it going to centralized structure on each system) making it very difficult to see what log events belonged to what subsystem. It could take hours to trace transactions through the system. Something as simple as better logging, which developers ignored forever, we were able to address by splitting things out. The project started out small scale but ballooned quickly as other people piled in. Approvals came fast and hard for everything. My manager said "aim for the sky because they will reject some things". I aimed for the sky and got everything I asked for(several million $ worth). I believe eventually they moved to a VM model a few years after I left. We tried to get the developers to fix the code, it never happened, so we did what we had to do to make things more manageable. I recall most everyone's gleeful reaction the first time they started using the new systems. I know it's hard to believe, you had to be there to see what a mess it was.
Though the app itself was pretty terrible. I remember two members of my team quit within a month and both said something along the lines of "we've been at the company 6-9 months and still don't understand how the application works" (and their job was in part supporting production issues like mine was, I was as close to an expert in the operation of that application as one could get, it wasn't easy). The data flows of this application were a nightmare, it was my first experience working in SaaS, so as far as I knew it was "normal". But if I were exposed to that today I would run away screaming. So. many. outages. (bad code, and incredibly over designed) I remember one developer saying "why weren't you in our planning meeting last year when we were building this stuff?" I said back something like "I was too busy pulling 90 hour weeks just keeping the application running, I had no time for anything else". I freely admit these days I burned out hard core at that company, took me more than three years to recover. I don't regret it, it was a good experience, I learned a lot, I had some fun. But it cost me a lot as well. I would not do it again at this point in my career, but if I had the ability to talk to my 2003 self I would tell me to do it.
My first exposure to "micro services" was roughly 2007 at another SaaS company, these services specifically were built with Ruby on Rails of all things. There were a few different ways to approach deploying it. By this time I had started using VMware ESX (my first production deployment of VMware was GSX 3.0 in 2004 in a limited scope production deployment at the previous company I referred to).
Perhaps the most common way would of just been to have an apache instance and the various applications inside of it, keep it simple. Another approach might of been to leverage VMware in a larger scope and build VMs for each micro service (each one had a different code base in subversion, not like it was a bunch of services in a single code base). I took a different approach though, an approach I thought was better(at the time anyway, I still think it was a good choice). I decided to deploy each service on it's own apache instance(each listening on a different port) on a single OS image (CentOS or Fedora at the time) running on physical hardware. We had a few "web" servers each running probably 15 apache instances with custom configurations managed by CFengine. The "micro services" talked to each other through a F5 BigIP load balancer. We had other services on these boxes as well, the company had a mod_perl based application stack, and another Tomcat-based application, these all ran on the same set of servers.
A common theme for this for me is, twelve years of working with services oriented architectures, and eight years of working with "micro services" and I've never needed special sauce to manage them.
Up until now, we have used entire operating systems to deploy our applications, with all of the security footprint that they entail, rather than the absolute minimal thing which we could deploy. Containers allow you to expose a very minimal application, with only the ports you need, which can even be as small as a single static binary.
This point seems kind of pointless to me. Operating systems are there to provide the foundation of the application. I like the approach of trying to keep things common. That is the same operating system across as many of the components as possible - keeping in mind there are far more systems involved than just the servers that "run the application". While saying minimal exposure is a nice thing to have, at the end of the day it really doesn't matter(it doesn't noticeably or in most cases measurably impact operation of the application, but it does improve manageability).
Up until now, we have been fiddling with machines after they went live, either using “configuration management” tools or by redeploying an application to the same machine multiple times. Since containers are scaled up and down by orchestration frameworks, only immutable images are started, and running machines are never reused, removing potential points of failure.
I'll preface this by saying I have never worked for an organization that regularly or even semi regularly scaled up and scaled down their infrastructure(even while working in Amazon cloud). Not only have they never really done it, but they've never really needed to. I'm sure there are some workloads that can benefit from this, but I'm also sure the number is very small. For most you define a decent amount of headroom for your application to burst into and you let it go, and increase it as required(if required) as time goes on with good monitoring tools.
I'll also say that since I led the technical effort behind moving my current organization out of Amazon cloud in late 2011(that is what I was hired to do, I was not going to work for another company that used them on a regular basis. Per earlier point we actually intended to auto scale up and down but at the end of the day it didn't happen), we have not had to rebuild a VM, ever. Well over three years now with never having had to rebuild a VM (well there is one exception where we retired one of our many test environments at one point only to have a need for it again a few months later, NOTHING like our experience with public cloud though). So yeah, the lifetimes of our systems are measured in years, not in hours, days, or weeks. Reliability is about as good as you can get in my opinion(again record speaks for itself). We've had just two sudden hardware failures causing VMs to fail in the past 3 and a half years. In both cases VMware High availability automatically restarted the affected VMs on other hosts within a minute, and HP's automatic server recovery rebooted the hosts in question (in both cases had to get system boards replaced).
Some people when thinking of public cloud say "oh but how can we operate this better than Amazon, or Microsoft etc". I'm happy to admit now that I KNOW I can operate things better than Amazon, Microsoft, Google etc. I've demonstrated it for the past decade, and will continue to do so. Maybe I am unique, I am not sure (I don't go out of my way to socialize with other people like me). There is a big caveat to that statement that again I'm happy to admit to. The "model' of many of the public cloud players is radically different from my model. The assumptions they make are not assumptions I make (and vise versa). Their model in order to operate well you really have to design your app(s) to handle it right. My model you don't. I freely admit my model would not be good for "web scale", just like their model is not good for the scale any company I have worked at for the past 12 years. Different approaches to solve similar issues.
Up until now, we have been using languages and frameworks that are largely designed for single applications on a single machine. The equivalent of Rails’ routes for service-oriented architectures hasn’t really existed before. Now Kubernetes and Compose allow you to specify topologies that cross services.
I'll mostly have to skip this one as it seems very code related, I don't see how languages and frameworks have any bearing on underlying infrastructure.
Up until now, we’ve been deploying heavy-weight virtualized servers in sizes that AWS provides. We couldn’t say “I want 0.1 of a CPU and 200MB of RAM”. We’ve been wasting both virtualization overhead as well as using more resources than our applications need. Containers can be deployed with much smaller requirements, and do a better job of sharing.
I haven't used AWS in years, in part because I believe they have a broken model, something I have addressed many times in the past. I got upset with HP when they launched their public cloud and launched with a similar model. I believe I understand why they do things this way (because doing it "right" at "large scale" is really complicated).
So this point is kind of moot. I mean people have been able to share CPU resources across VMs for well over a decade at this point(something that isn't possible in all major public cloud providers). I also share memory to an extent(this is handled transparently with the hypervisor). There is certainly overhead associated with VM, and with a "full" operating system image, but that is really the price you pay for the flexibility, and manageability that those systems offer. It's a price I'm willing to pay in a heartbeat, because I know how to run systems well.
Up until now, we’ve been deploying applications and services using multi-user operating systems. Unix was built to have dozens of users running on it simultaneously, sharing binaries and databases and filesystems and services. This is a complete mismatch for what we do when we build web services. Again, containers can hold just simple binaries instead of entire OSes, which results in a lot less to think about in your application or service.
(side note, check the container host operating system, yeah that one that is running all of the native processes in the same kernel - yes a multi user operating system running dozens of services on it simultaneously, a container is just a veil..)
This is another somewhat moot point. Having a lot less to think about, certainly from an OS perspective to me makes things more complicated. If your systems are so customized that each one is different that makes life more difficult. For me I can count on a common set of services and functionality being available on EVERY system. And yes, I even run local postfix services on EVERY system(oh, sorry that is some overhead). To be clear postfix is there as a relay for mail which is then forwarded to a farm of load balanced utility services which then forward onto external SMTP services. This is so I can do something as simple as "cat some file| mail my.email.address" and have it work.
Now we do limit what services run, e.g. Chef(current case) or CFengine(prior companies) only runs our core services, extra things that I never use are turned off. Some services are rare or special. Like FTP for example. I do have a couple of uses cases for running a FTP server still, and in those cases FTP services only run on the systems that need it. And obviously from an application standpoint not all apps run on all app servers. But this kind of stuff is pretty obvious.
At the end of the day having these extra services provides convenience not only to us, but to the developers as well. Take postfix as an example. Developers came to me one day saying they were changing how they send email, instead of interfacing with some external provider via web API their new provider they will interface with SMTP. So where do they direct mail to? My answer was simple - in all cases, in all environments send mail to localhost, we'll handle it from there. Sure you can put custom logic in your application or custom configurations for each environment if you want to send directly to our utility servers, but we sure as hell don't want you to try to send mail directly from the web servers to the external parties, that's just bad practice (for anything resembling a serious application assuming many servers supporting it and not just a single system running all services). The developers can easily track progress of the mail as it arrives on the locahost MTA, and is then immediately routed to the appropriate utility server farm (different farms for different environments due to network partitioning to prevent QA systems for example from talking to production, also each network zone(3 major zones) has a unique outbound NAT address, which is handy in the event of needing IP filters or something).
So again, worrying about these extra services is worrying about nothing in my opinion. My own experience says that the bulk of the problems with applications are code based, sometimes design, sometimes language, sometimes just bugs. Don't be delusional and think that by deploying containers that will somehow magically make the code better and the application scale and be highly available. It's addressing the wrong problem, it's a distraction.
Don't get me wrong though, I do believe containers do have valid use cases, which I may cover in another post this one is already pretty long. I do use some containers myself (not Docker). I do see value in providing an "integrated" experience (download one file and get everything - even though that has been a feature in virtualized environments for probably close to a decade with OVF as well). That is not a value to me, because as an experienced professional it is rare that something works properly "out of the box", at least as far as applications go. Just look for example at how Linux distributions package applications, many have their own approach on where to put files, how to manage things etc. That's the simplest example I can give. But I totally get it for constrained environments it's nice to be able to get up to speed quickly with a container. There are trade offs certainly once you get to "walking" (thinking baby step type stuff here).
There is a lot of value that operating systems and hypervisors and core services provide. There is overhead associated with them certainly. In true hyper scale setups this overhead is probably not warranted (Google type scale). I will be happy to argue till I am somewhat blue in the face that 99.99% of organizations will never be at that scale, and trying to plan for that from the outset is a mistake(and I'd wager almost all that try, fail because they over design or under design), because there is not one size that fits all. You build to some reasonable level of scale, then like anything new requirements likely come in, and you re evaluate, re-write, re-factor or whatever.
It's 5AM now, I need to hit the gym.