TechOpsGuys.com Diggin' technology every day

October 23, 2012

Should System admins know how to code?

Filed under: linux — Tags: — Nate @ 11:57 am

Just read the source article, and the discussion on Slashdot was far more interesting.

It’s been somewhat of a delicate topic for me, having been a system admin of sorts for about sixteen years now, primarily on the Linux platform.

For me, more than anything else, you have to define what code is. Long ago I drew a line in the sand: I have no interest in being a software developer. I do plenty of scripting in Perl & Bash, primarily for monitoring purposes and to aid in some of the more basic areas of running systems.

Since this blog covers 3PAR I suppose I should start there – I’ve written scripts to do snapshots and integrate them with MySQL (still in use today) and Oracle (haven’t used this side of things since 2008). This is a couple thousand lines of script (I don’t like to use the word code because to me it implies some sort of formal application). I’d wager 99% of that is to support the Linux end of things and 1% to support 3PAR. When I left one company, I turned these scripts over to the people who were going to try to take on my responsibilities. They had minimal scripting experience and their eyes glazed over pretty quickly while I walked them through the process. They feared the 1,000 line script, even though for the most part the system was very reliable and not difficult to recover from failures, even if you had no scripting experience. For managing snapshots with MySQL (integrated with a storage platform), I’m not aware of any out of the box tool that can handle this, so you sort of have no choice but to glue your own together. With Oracle and MSSQL, tools are common, maybe even DB2 – but MySQL is left out in the cold.
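
For readers who have not glued this sort of thing together before, here is a minimal sketch of one common pattern for a consistent MySQL snapshot on a storage array: block writes, snapshot the volume, release the lock. It is written in Python rather than Perl or Bash, and it is not the scripts described above. `run_sql` and `take_snapshot` are hypothetical callables standing in for a MySQL session and for whatever array command creates the snapshot; the only real statements are MySQL’s `FLUSH TABLES WITH READ LOCK` and `UNLOCK TABLES`.

```python
from typing import Callable


def snapshot_with_read_lock(
    run_sql: Callable[[str], None],
    take_snapshot: Callable[[], None],
) -> None:
    """Hold a global read lock only for as long as the snapshot takes.

    run_sql must send every statement over the SAME MySQL session,
    because the lock is dropped as soon as that session disconnects.
    Both callables are hypothetical placeholders for this sketch.
    """
    run_sql("FLUSH TABLES WITH READ LOCK")  # close tables, block writes
    try:
        take_snapshot()  # e.g. ask the array for a volume snapshot
    finally:
        # Always release the lock, even if the snapshot step failed.
        run_sql("UNLOCK TABLES")


if __name__ == "__main__":
    steps = []
    snapshot_with_read_lock(steps.append, lambda: steps.append("SNAPSHOT"))
    print(steps)
```

The `try`/`finally` is the part worth copying: a failed snapshot should never leave the database locked.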

I wrote my own Perl-based tool to log in to 3PAR arrays, get their metrics and populate RRD files (I use Cacti to present that data since it has a nice UI, but Cacti could not collect data like I can, so that stuff is run outside of Cacti). Another thousand lines of script here.
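
To give a feel for the RRD side, here is a small sketch (Python, not the Perl tool itself) that turns one sample of array metrics into an `rrdtool update` command line. The metric names, the file name and the assumption that the RRD was created with matching data sources are all made up for illustration; how the numbers get pulled off the array is left out entirely.

```python
from typing import Dict, List


def rrd_update_args(rrd_path: str, timestamp: int,
                    metrics: Dict[str, float]) -> List[str]:
    """Build the argument list for one `rrdtool update` call.

    metrics maps data-source names (hypothetical here) to values.
    The --template option names the data sources so the order of
    the values does not have to match the order in the RRD file.
    """
    names = list(metrics)
    template = ":".join(names)
    values = ":".join(str(metrics[name]) for name in names)
    return ["rrdtool", "update", rrd_path,
            "--template", template,
            f"{timestamp}:{values}"]


if __name__ == "__main__":
    sample = {"read_iops": 1200, "write_iops": 450}
    args = rrd_update_args("array01.rrd", 1350000000, sample)
    print(" ".join(args))
    # To actually run it: subprocess.run(args, check=True)
```

Building the argument list separately from running it keeps the formatting easy to check by eye before anything touches a real RRD file.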

Perhaps one of the coolest things I wrote was a file distribution system a few years ago to replace a product we used in house called R1 Repliweb (it looks like they have since been acquired by somebody else). Repliweb is a fancy file distribution system that primarily ran on Windows, but the company I was at was using the Linux agents to pass files around. I suppose I could write a full ~1200 word post about that project alone (if you’re interested in hearing that let me know), but basically I replaced it with an architecture of load balancers, VMs, a custom version of SSH, rsync, some help from CFengine and about 200 lines of script. This not only dramatically improved scalability, reliability also went literally to 100%. I never had a single failure while I was there (the system was self healing, though I did have to turn off rsync’s auto resume feature because it didn’t work for this project); the system had been in place about 12-16 months when I left.

So back to the point – to code or not to code. I say not to code, for the most part at least (again back to what code means – in my context it means programming: if you’re directly using APIs then you’re programming, if you’re using tools to talk to APIs then you’re scripting). Don’t make things too complicated. I’ve worked with a lot of system admins over the years and the number that can script well, or code, is very small. I don’t see that number increasing. Network engineers are even worse – I’ve never seen a network engineer do anything other than completely manually. I think storage is similar.

If you start coding your infrastructure you make it even more difficult to bring new people on board, to maintain this stuff, and to run it moving forward. If you happen to be in an environment that is experiencing explosive growth and you’re adding dozens or hundreds of servers constantly then yes this can make a lot of sense. But most companies aren’t like that and never will be.

It’s hard enough to hire people these days; if you go about raising the bar to even higher levels you’re never going to find anyone. I think of the Hadoop end of the market – those folks are always struggling to hire because the skill is so specialized, and there are so few people out there that can do it. Most companies can’t compete with the likes of Microsoft, Yahoo and other big orgs with their compensation and benefits packages.

You will, no doubt, spend more on things like software and hardware for things that some fancy DevOps god could do in 10 lines of Ruby while they sleep. Good luck finding and retaining such a person though, and if you feel you need redundancy so someone can take a real vacation, yeah that’s gonna be tough. There is a lot more risk, in my opinion, in having a lot of code running things if you lack the resources to properly maintain it. This is a problem even at scale. I’ve heard on several occasions that the big Amazon themselves customized CFengine v1 way back when with so much extra stuff. Then v2 (and since then v3) came around with all sorts of new things, and guess what – Amazon couldn’t upgrade because they had customized it too much. I’ve heard similar things about other technologies Amazon has adopted. They are stuck because they customized it too much and can’t upgrade.

I’ve talked to a ton of system admin candidates over the past year, and I think it is fair to say the number I would feel comfortable having take over the “code” on our end is zero. Granted not even I can handle the excellent code written by my co-worker. I like to tell people I can do simple stuff in 10 minutes on CFengine and it will take me four hours to do things the Chef way on Chef; my eyes will bleed and my blood will boil in the process.

The method I’d use on CFengine you could say “sucks” compared to Chef, but it works, and is far easier to manage. I can bring almost anyone up to speed on the system in a matter of hours, whereas Chef takes a strong Ruby background to use (myself I am going on nearly two and a half years with Chef and I haven’t made much progress other than I feel I can speak with authority on how complex it is).

Sure it can be nice to have APIs for everything, fancy automation everywhere – but you need to pick your battles. When you’re dealing with a cloud organization like Amazon you almost have to code – to deal with all of their faults and failures and just overall stupid broken designs and everything that goes along with it. Learning to code most likely takes the experience from absolutely infuriating (where I stand) to almost manageable (costs and architecture aside here).

When you’re dealing with your own stuff, where you don’t have to worry about IPs changing at random because some host has died, and where you can change your CPU or memory configuration with a few mouse clicks and not have to re-build your system from scratch, the amount of code you need shrinks dramatically, lowering the barriers to entry.

After having worked in the Amazon cloud for more than two years, both my co-workers (who have much more experience in it than I do) and I believe that it actually takes more effort and expertise to properly operate something in there vs doing it on your own. It’s the total opposite of how cloud is viewed by management.

Obviously it is easier said than done, just look at the sheer number of companies that go down every time Amazon has an outage or their service is degraded. The most recent one was yesterday. It’s easy for some to blame the customer for not doing the right thing, but at the end of the day most companies would rather work on the next feature to attract customers and let something else handle fault tolerance. Only the most massive companies have resources to devote to true “web scale” operation. Shoehorning such concepts onto small and medium businesses is just stupid, and the wrong set of priorities.

Someone made a comment recently that made me laugh (not at them, but more at the situation). They said they performed some task to make my life easier in the event we need to rebuild a server (a common occurrence in EC2). I couldn’t help but laugh because we hadn’t rebuilt a single server since we left EC2 (coming up on one year in a few months here).

I think it’s great that equipment manufacturers are making their devices more open and more programmatic, adding APIs and other things to make automation easier. I think it’s primarily great because then someone else can come up with the glue that can tie it all together.

I don’t believe system admins should have to interact with such interfaces directly.

At the same time I don’t expect developers to understand operations in depth. Hopefully they have enough experience to be able to handle basic concepts like load balancing (e.g. store session data in some central place, preferably not a traditional SQL database). The whole world often changes from running an application in a development environment to running it in production. The developers use their experience to write the best code that they can, and the systems folks manage the infrastructure (whether it is cloud based or home grown) and operate it in the best way possible, whether that means separating out configuration files so people can’t easily see passwords, inserting load balancers in between tiers, splitting out how application code is deployed, or something as simple as log rotation scripts.

If you were to look at my scripts you may laugh (depending on your skill level) – I try to keep them clean but they are certainly not up to programmer standards; no, I’ve never used “use strict” in Perl for example. My scripting is simple, so doing things sometimes takes me many more lines than it would take someone more experienced in the trade. This has its benefits though – it makes it easier for more people to be able to follow the logic should they need to, and it still gets the job done.

The original article seemed to focus more on scripts, while the discussion on Slashdot at some points really got into programming, with one person saying they wrote Apache modules?!

As one person in the discussion thread on Slashdot pointed out, heavy automation can hurt just as much as help. One mistake in the wrong place and you could take the systems down far faster than you can recover them. This has happened to me on more than one occasion of course. One time in particular I was looking at a CFengine configuration file, saw some logic that appeared to be obsolete, and removed a single character (a ! which told CFengine not to apply that configuration to that class); then CFengine went and wiped out my Apache configurations. When I made the change I was very sure that what I was doing was right, but in the end it wasn’t. That happened seven years ago but I still remember it like it was yesterday.
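
That one character is easier to appreciate with a toy model. The sketch below is Python, not CFengine syntax, and handles only a single class name with an optional leading `!`; the host names, class names and inventory are assumptions for illustration. It shows how dropping the negation flips a rule from “every host except the web servers” to “only the web servers”.

```python
from typing import Dict, List, Set


def rule_applies(class_expr: str, host_classes: Set[str]) -> bool:
    """Evaluate a one-term class expression against a host.

    "webserver"  -> true on hosts that have the class
    "!webserver" -> true on hosts that do NOT have the class
    Real CFengine expressions also allow AND/OR; this toy does not.
    """
    if class_expr.startswith("!"):
        return class_expr[1:] not in host_classes
    return class_expr in host_classes


def affected_hosts(class_expr: str,
                   hosts: Dict[str, Set[str]]) -> List[str]:
    """Return the sorted names of hosts the rule would run on."""
    return sorted(name for name, classes in hosts.items()
                  if rule_applies(class_expr, classes))


if __name__ == "__main__":
    # Hypothetical inventory: host name -> classes defined on it.
    hosts = {
        "web01": {"linux", "webserver"},
        "web02": {"linux", "webserver"},
        "db01": {"linux", "database"},
    }
    print(affected_hosts("!webserver", hosts))  # before the edit
    print(affected_hosts("webserver", hosts))   # after removing "!"
```

Listing the hosts a changed rule would touch, before letting it run, is a cheap sanity check for exactly this kind of edit.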

System administrators should not have to program – scripting certainly is handy and I believe it is important (not critical – it’s not at the top of my list of skills when hiring), just keep an eye out for complexity and supportability when you’re doing that stuff.

6 Comments

  1. “I’ve never seen a network engineer do anything other than completely manually.”
    OI!

    Most worth while NE’s script well. The top guys that I’ve worked with do anyway, especially in lab environments.

    Many NE’s ‘used’ to code in a past life as NE’s are typically computer science or mechanical engineers who realized that they hate coding or that building an airplane for the military is actually not very fun.

    Comment by Will Hogan — October 25, 2012 @ 10:54 am

  2. yeah ‘worth while’ being the key there of course I suppose 🙂

    Even for the ones I’ve worked with that could not or did not script, it did not impair them all that much, not as much as other skills gaps that they had anyways. The last formal NE I worked with had an issue with their Cisco firewalls for about a year; it took me literally five seconds to come up with the solution (he had the thing configured to have states last for a week or longer, and the state table was filling up). Don’t know why he didn’t think of that – nor did Cisco support apparently. He had been a network engineer for many years at that point and now works at a fairly large company doing similar things (haven’t talked to him in years, have no need to talk with him again).

    That NE even went as far as to ask again even after I told him the problem “Why did the backup firewall fail too?” — DUH! The states are replicated!

    Another NE I worked with struggled hard when we moved from L2 switching to L3 switching. I was new to L3 at the time myself, never having specialized in networking, but I was seemingly constantly having to remind him of the routing involved with our many VLANs and tiers.

    I’ve heard similar stories about many other NEs as well; the majority of them seem to lack some of the most basic skills. It’s similar with system admins, I think, at least based on the candidates I have spoken to over the past year. There are exceptions of course, but it just reinforces the point to me not to make things too complicated, or so advanced that you can’t find anyone qualified enough to manage it. That totally includes things like writing significant amounts of custom code for things.

    thanks for the comment!!

    Comment by Nate — October 25, 2012 @ 11:12 am

  3. Do you think the question could be better rephrased as “Should a sysadmin be conversant with theory as much as practise?”. Living in a heterogeneous shop, UNIX and Linux guys are still less of a worry, but if a Windows admin has not at least read and comprehended Mark Russinovich, I’d get worried, very worried. It may even be more acute in a Windows environment, since PowerShell is basically talking straight to DCOM, and not only loony powerful, but whilst I think in many ways it is harder to hose a system under Windows, there’s a tremendous amount of small things you can mess with that fewer operators understand well. Raymond Chen’s The Old New Thing blog is exemplar extraordinary of how nuanced the (often historical) structure of Windows code can be. Also, with security, inheriting a VMS approach, literally every call to every DLL can be locked down, or not, or spoken to in unfortunate ways, much of which is undocumented. I can break any analogy I make about this, but if you are working in a public IP space, no matter how removed by other layers or security, I totally want to know whoever touches a Win box has far deeper knowledge than product specialities or “just” robust network theory. In no way am I pumping the Microsoft server environment, but I continually find useful faculties it possesses that are barely even touted by Microsoft themselves which have been very cool. In a perverse way it appeals to the hacker in me, that there is much to be discovered. In comparison, I think much of what a UNIX system is, is clearly exposed, well known, pretty thoroughly discussed, because UNIX simply, uh, started out to try to be simple. Many things I spotted because they are basically VMS code snatched by Dave Cutler. File version naming with appended [1], [2] .. etc is a basic function in VMS, and started showing through the cracks in recent years.
My private theory is Cutler literally did lift the VMS source printouts, and all Microsoft ever did was release already existing capabilities, not so much by writing them, but pointing out where they were and providing a Win32 API. I obviously cannot prove that theory, and it may sound conspiracy stuff, but there’s too much coincidence, and Cutler delivered NT suspiciously quickly. What the operative aspect in this is, is the number of ways you can call the OS layers above the NT kernel, and to I suspect nobody’s surprise, this often leads to horror stories. Equally, I believe Msft had problems officially releasing feature sets that effectively did exist, because of vast and often unscrutinised legacy code, dating from the days they relied on way too many external ISVs to provide simple functionality, even as far as decent file system journalling (Veritas). When I go back to a UNIX box, apart from quirks, I feel a sense of just dealing with a glorified IO and scheduling system. That’s sometimes a breath of fresh air, sometimes a cold breeze.

    Maybe we could adjust the question once more: “Should you teach your admins programming?”. I think it pays off. I don’t mean the failure that is a CS degree biased around Java, but getting at least as far as one can go with the classics such as SICP and TAOCP, and good foundation math.

    My apologies for the Wincentric talk, but it’s been on my mind since Nate wrote this entry and I read it. Server 2012 introduced some 2000 “commandlets” or powershell functions, I am still simply gawking at. But it added another thing, that the entire management UI now drives the command line, and so whatever you just did, you can get the script for. That’s awesome if you know your system well, but I am preparing for some major ouches once that has the network effect amongst less able admins.

    My answer to “Should system admins know how to code?” is simply, from a personal view: Yes, and if not teach them or get them learning and encourage it. It may not be every case you might want to get admins to move into a dev team, nor does every shop have a dev team. But I think it highly productive to get a two way street going on there. Many times more so between DB admins and whoever writes to the database. I think that pain is all too easy to understand, sadly. Having an admin who is skilled enough to see what apps are doing to the IO can be wonderful. If you write code, get that man talking to who writes the code!

    I guess personally I just think that defined roles are useful only for sorting out entry level salary rates, and what questions you might best ask when hiring. After that, if you have the basics right, I mean a tight system that is not having to be continually fixed, then by my view it’s a pool of talent and people. That does not go down well at all, in some contexts. The man who is my business partner is an old school database guy. He understands optimisations something rotten. But he learned those on tiny hardware, and almost everything really is different now. I wish I could threaten him with a month forced sysadmin work! It really does happen, in so many fields, where you get truly specialised talent, that you must make them face off and talk about how to balance their needs, but what frustrates me in IT, is that people see the need far less naturally, for such collaboration. Engineers in other fields just don’t have such problems to pull in the other people and their perspectives, in my opinion. Can you imagine such fine distinctions being argued over when building a ‘scraper? That would be negligent and dangerous. So if devs can’t learn from the admins, I reckon someone is outsourcing the foundations to an imaginary perfect contractor on the planet Zog. To me, that’s not acceptable.

    I may simply be lucky (or unlucky, in terms of business prowess) to have a small enough shop I can shove people together in a smallish space, but considering the sheer compute you can do on modest budgets, with very few people, I think it as critical as if we were FB scale. My story of trying to be Jack Of All Trades might be fun to tell sometime via the vignettes of arguments I have had when my tiny outfit even tries to hire someone who has the skill, but has lived in a defined and compartmentalised role a while. But I have yet to find anything better as a generality, than to regularly expose one side of the shop to one another, deliberately bringing in one side into the other’s meetings as “observers” at regular times when it is not an attempt to be “cross discipline” but a real meeting. Smart people spot things, tend to ask, and if not I go ask of one “side” what they thought or understood of the other.

    The only other thing I could suggest is truly blindingly obvious: build good libraries, and duplicate them wherever it’s a bit long a walk to the usual shelves. If a good admin is one who has nothing to do, well, I tell ya, I’ve got some reading to suggest!

    You could be also evil, and try one I got away with just once: provide the admins with beer and pizza – this during a work day you let them off duty – on condition they totally let rip scribbling down all their most hated nuisances of any time. Deal is I retype those, and put my name to them. (I’m immune, I founded and am a majority owner) . .then, because hackers like to get into the zone, but often don’t get there because they are distracted and so then are usually in a frustrated unproductive limbo, then when I know I’m not pissing them off by hurting their work, I hand “my” rant to them as a review they need to read over. If the moans and rants are any good, and you pick your moment, it sure gets some comment. Or is blanked entirely, but I think this does make it sink home that the systems we all rely on are not to be taken for granted. If asked why I “wrote” such a thing, I suggest that we’d all be pissed if the electric utility failed on us, and make rude accountant or lawyer jokes . . now, that was a bit jokey, and not really evil, but it has happened that programmer dude wrote epic detailed comment on sysadmin moaning, explaining the reasons why he was stuck having to load a system a certain way. That was pure gold. The more you find problems, I guess, the more you can solve.

    Bottom line: teach sysadmins how to code, and seriously, please please teach “higher users” how to administrate.

    Sometime, when I am a bit more practised at this trick, I must add how I use “false bureaucracy” to effect changes of attitude. I am not seasoned in this, at all, because we have been too small and tight an outfit for too long. But when I grew up, you had to “hack” bureaucracy to get TI to send you opamp and other component specs sheets (14 volumes beautifully bound in TI yellow), or at least you had to if you were still at grade school . . and so I sussed that game, but then loathed any form of bureaucracy the minute I had to go earn a living. Now, just in case my company ever grows, and actually needs such layers of formality, I figured I might try to reverse it. Say we need to become more formal for audits because we obviously intend to expand. I am experimenting with almost joke layers of frustration, where the thing that causes the procedural glitch or rejection or screw up in the application is nothing about the process, but something that makes the applicant address whatever has been bugging me about them or the work they are doing lately. I might get okay enough at this, if I can spare more time. But I’m dead set privately on seeing if my little joke plan can implement good things and still be a joke.

    Comment by John (other John) — November 17, 2012 @ 7:11 pm

  4. Theory vs practice – yeah for sure as you move up in the technical ranks both are really important. So many (maybe all?) of these solutions out there, whether it’s hardware, software, open source software, networking or whatever, are built to do one thing or another, but how well they can accomplish that is often a very different story. One really good sales guy I know for example was recently billed as “someone who is great at selling stuff that doesn’t work”. I laughed when I heard that, but it made sense to me given what he has historically sold.

    Totally agree that windows admins are more typically operators rather than administrators or engineers, though of course there are some pretty amazing windows admins/engineers out there I’m sure (I haven’t worked with any personally).

    I’ve never been a fan of PowerShell myself, since it is object based and I am used to string based scripting in Linux. I can understand the reason for PowerShell to be object based but to me at least it makes things more complicated than they need to be (ironic for something like Windows which in general is easier to manage than Linux for a newbie).

    totally agree about the internals of Unix being much simpler than Windows. Though Microsoft has obviously been better at their UI. I encountered the most frustrating thing ever on Windows recently, a file on the file system that was not visible nor readable by 32-bit applications.

    http://www.west-wind.com/weblog/posts/2008/Aug/09/Editing-Applicationhostconfig-on-64-bit-Win2008

    It just blows my mind that is even possible.

    If admins want to learn how to do development work that’s fine. Though I believe they should first conquer other areas such as networking, storage and all of the concepts and stuff on how they interact before going into a totally separate realm like programming. I know for most that is enough to occupy a career. Most of the admins I know are very weak in networking, and storage.

    With so many things “converging” especially with hypervisors and stuff it’s important to know how these systems work, and so many just don’t.

    I believe it’s far more useful to first know how the servers are interacting with the other components in the environment than knowing how the internals of the applications interact with other things.

    I’m totally with you on the small shop vs large shop. I’ve never worked at a place where I didn’t have access to all layers of the infrastructure. I’d go mad if I was at a silo’d shop, where you have to co-ordinate with other teams on their areas of expertise when troubleshooting something. As-is I can find the issues so quickly! As I’m sure you’ve experienced as well.

    thanks for the comment(again) !!

    Comment by Nate — November 20, 2012 @ 9:26 am

  5. Just for a thought, anyone who knew how a NT box worked, never needed PowerShell. Ever since C-sharp came about, I can’t think of much else you needed. Before that, you did have to delve deeper.

    But essentially, a windows box is a living and breathing piece of code. You may not care for the toilet breath that can cause people to faint at a hundred yards, but just about anything and everything is accessible, modifiable, and manipulable at levels UNIX systems would rather you do not touch. Or in fact never thought to implement. There’s more than a hint of MULTICS in the object security system. Which is what has made it so exploitable, because few who write apps recognise the depth of the security implementation.

    Neither do I agree much with the object based nature of PowerShell, but that comes from the n-way security of every part of NT, that you can, if you choose, lock down every last DLL so it may only speak to what you allow, only address what you would like, and it was built, ground up, as a component system, each component regulated by lower kernel calls. The mess that it became was, in my opinion, because everyone was too lazy to care for all this potential sophistication. That’s not how normal people like to write applications. Compare with LISP, where you can write functional code as simply as you wish, and invoke the CLOS any time you really want to mess with structures, or macros if you dare. So much of the recent moves in underlying windows systems mirrors parts of how LISP operates, it became interesting, a few years back.

    You are so right, that finding anyone who genuinely has appreciation of how windows works, is a nightmare, both in management and ops, and makes HR departments evaporate in little white clouds. I think about $200K is entry level for what interests my company now.

    But I personally believe there is enough of value in there, that if one can find the right people – or just those who are interested – you can do very good things with a windows system. Did you note, e.g. that servers now reboot cold without a domain controller? Try that with anything else. Absurdly, the emphasis on managing operators and not real admins, let alone programmers, has led to an explosion of useful features.

    What is lacking, is any kind of boot camp, to take Linux admins into this world, and make it interesting for them. I increasingly see it as a division between who wishes to operate a system, and who wants to control it at the deepest level. Yes, I see that operating means getting things done. Same way programmers “hack” to build a tool. But also I see how to revitalize my company, I want a level shop floor, all hands on deck. And as low level as we can get. I think that is going to be rather expensive! But the complexity of what we aim to achieve demands that. Or, rather, we see enough to bet on windows now, but are truly scared because we also see the pitfalls.

    Are there in fact people still there, under 40 plus years, with sufficient experience? I mean, who grew up with this. Just as the airlines are suffering inability to recruit pilots now, because the generation who flew for the USAF are almost gone. Right there, is a reason not to fly budget. I don’t, because of plain fear.

    I don’t think this is about sysadmins versus programmers at all. If you want to learn a system, learn it. All of it. Then you can start to play. Make it more than having cool kit in one’s datacenter, and genuflecting to their usually inane ideas of usability!

    all my best ~ j

    Comment by John (other John) — December 6, 2012 @ 11:14 pm

  6. Taking a linux admin (at least a highly seasoned one) and putting them in a windows world is quite a culture shock. To do simple things it’s fine, but to do larger scale stuff it’s a totally different way of thinking (or at least it seems to be for me). You can to some degree shoehorn some Linux stuff onto Windows with Cygwin (which I’ve used for years though haven’t seriously used it on a Windows server since 2002). But of course the concepts are very different, most everything is hidden behind binaries whether it is things like services, or the dreaded registry.

    I at least have never come across a person who was adept at both Linux and Windows to the extent that I have come across adept Linux folks (and I know there are adept Windows folks but I may only know one or two myself).

    I do see this cloud stuff having another impact on the economy though – making it harder to find highly skilled folks to run infrastructures. Which, in its own way, drives even more companies to use a service provider cloud type model, even though it often costs massive amounts more than doing it internally. If they can’t find (or can’t retain) competent people then they need cloud. Though more often than not, in the case of Amazon anyway, they don’t realize that they also need highly skilled people to operate that as well (I argue even more highly skilled because the system is so broken, difficult to use, and built to fail).

    Comment by Nate — December 18, 2012 @ 11:20 am
