Jun/100
40 Million IOPS in two racks
TechOps Guy: Nate
Fusion IO does it again: another astonishing level of performance in such an efficient design. From the case study:
LLNL used Fusion’s ioMemory technology to create the world’s highest performance storage array. Using Fusion’s ioSANs and ioDrive Duos, the cluster achieves an unprecedented 40,800,000 IOPS and 320GB/s aggregate bandwidth.
Incredibly, Fusion’s ioMemory allowed LLNL to accomplish this feat in just two racks of appliances– something that would take a comparable hard disk-based solution over 43 racks. In fact, it would take over 100 of the SPC-1 benchmark’s leading all-flash vendor systems combined to match the performance, at a cost of over $300 million.
40 million IOPS at ~250 IOPS per 15K RPM disk: you’re talking 160,000 disk drives.
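The back-of-the-envelope math is easy to check in a few lines of Ruby. This is only a rough sketch: the ~250 IOPS per 15K RPM disk is the ballpark estimate used above, and the constant and method names are made up for illustration.

```ruby
# Rough estimate of how many 15K RPM disks it takes to match a given IOPS figure.
# IOPS_PER_15K_DISK is the ballpark number from the post, not a measured value.
IOPS_PER_15K_DISK = 250

def disks_needed(target_iops, iops_per_disk = IOPS_PER_15K_DISK)
  # Round up: you cannot buy a fraction of a disk.
  (target_iops.to_f / iops_per_disk).ceil
end

puts disks_needed(40_000_000)  # the rounded 40 million figure
puts disks_needed(40_800_000)  # the case study's 40,800,000 figure
```

With the rounded figure this gives 160,000 disks; with the case study's exact figure it comes to 163,200.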
Not all flash is created equal, of course, and many people don’t understand that. They just see that this one is cheap and that one is not, without having any clue why (shocker).
It’s just flat out irresponsible to ignore such an industry-changing technology, especially for workloads that deal with small (sub-TB) amounts of data.
Jun/100
Investing in IT vs spending in IT
TechOps Guy: Nate
My good friend Chuck over at EMC (OK, we’ve never met, but he seems like a nice guy; we could be friends) wrote an interesting article about Investing in IT vs Spending on IT. I thought it was a really good read. I hadn’t thought of things in that way, but it made me realize I am one who wants to invest in IT infrastructure, even if it means paying more up front; the value add of some solutions is just difficult to put numbers on.
Take storage for example, since Chuck is a storage guy. There’s a lot more to storage than cost per TB, cost per IOP, cost per usable TB, and even more than cost of power+cooling for the solution. The smaller things really do add up over time, but how do you put numbers on them? Take something as simple as granular monitoring: when I went through a storage refresh a while back, the established vendor really had no way of measuring the performance of their own system in order to develop a plan for a suitable technology refresh. It wasn’t a small system either; it was a big, fairly expensive (for the time) one.
Would you have expected to replace one storage system with another that had less than half the number of disks and roughly 75% less raw IOPS (on paper)? Would you have expected the new system to not only outperform and out-scale the old one but also absorb a significant amount of growth over the following year before needing an upgrade? If you’re a normal person I would expect you not to expect that. But that’s what happened.
My approach is to establish a track record at an organization. This may take a few months or a year (maybe much longer if it’s a big company). Once you have established X number of successful projects, a higher degree of trust is put in you, and you get more lateral control and influence over how things work. Less hand-holding and fewer minute justifications are required to get your point across, and you can get more things done.
Maybe that thinking is too logical, I don’t know. It’s how I think, though. I put more faith in people the more I see how good they are at their jobs. If they turn out to provide good solutions, or even good angles of thought, I believe I can rely on them to do that line of work rather than looking over their shoulder double-checking everything. I think it’s how you can scale. Of course not all employees measure up; I would say that, especially in larger organizations, most do not (government is especially bad, I hear).
No one person can run it all, as much as they’d like to. I’ve tried, and the results, while not horrible, weren’t as good as having more people doing the work. I learned the hard way to delegate more work, whether to co-workers, contractors, or even vendors. People take vendors for granted, but there is a lot of experience and knowledge they can bring to the table. Not all vendor teams are created equal, of course.
If you just want to spend on IT, don’t hire someone like me; I don’t want to work for you. If you want to invest in IT, to give your organization more leadership in new technologies that can improve efficiencies and lower costs, then you may want someone like me. That is why I gravitate towards smaller, more technology-driven organizations. They usually don’t have the economies of scale to do things as well as the big guns out there, so it’s up to people like me to develop innovative solutions to compete differently. If you read the blog you’ll see I don’t subscribe to any one vendor stack. I like many different products from many different vendors depending on what the requirements are.
From a vendor perspective (since it’s been 5 years since I worked with a contractor of sorts) I do like to have a good relationship with the vendor, they can be a valuable source of information. Vendors either love me or hate me, it really depends on their products, as folks that have worked with me can attest. It also depends on how technical the vendor can get with me. I like to go deep into the technical realm. And I believe I do challenge the System Engineers at my vendors with tough questions. Those that don’t measure up don’t last long. I have high expectations of myself, and I have high expectations of those around me, frequently too high. I don’t like to play political games where you try to screw them over because you know they’ll screw you back the first chance you get. Having a good relationship is one of those things it’s hard to put a number on. To me it’s worth a decent amount.
Jake, another person on this blog (hi Jake!), is similar, though he’s a lot more loyal than me, which again can be a good thing. Changing technology paths every 15 minutes is not a good idea, and having a dozen different server vendors in your racks because different ones provided 5-10% better pricing at that particular time of day is not a good idea either.
Speaking of Jake, I remember when I first started at my previous company and they were negotiating with Oracle on licensing. They were out of compliance (they paid for Oracle SE One but were using Oracle EE) and were facing hefty fines. I proposed an alternative solution, going to Oracle Standard Edition (which is significantly different from SE One), which would have saved significant amounts of money with really no loss in functionality (for our apps at the time). I was a new employee (literally a few weeks in) and Jake dismissed my opinion, which I could understand: at the time I was new and had no track record, and nobody knew if I knew what I was talking about. That was OK, though. So they paid their fines and licensed some new Oracle stuff as part of the settlement.
The next year rolled around and Oracle came back again to do an audit, and once again found massive numbers of violations and the company was once again facing large amounts of penalties to get back in compliance. Apparently the previous process wasn’t as transparent as they expected, either the Oracle rep was misleading the company or was generally incompetent, I don’t know since I wasn’t involved in those talks.
Once again I strongly urged the company to migrate to Standard Edition to slash licensing costs, and this time they listened to me. It took a few weeks to get all of the environments migrated over, including a full weekend of my time migrating production, doing all sorts of unsupported things (value adds for you) to minimize downtime (while you can typically go from Oracle SE to EE without downtime, you can’t do it the other way around). I went the extra mile to establish a standby DB server with log replication and consistent database backups (because you can’t run RMAN against a standby DB, or at least you couldn’t on 10gR2 at the time). All of it worked great, and we (as expected) slashed our Oracle licensing fees.
Of course I didn’t have to do that. I could have sat by and watched them pay up in fees again (several hundred thousand dollars in total). But I did do it; I went to them and said I’m willing to work my ass off for several weeks to save you money. Many people I’ve come across, I think, don’t have the dedication to volunteer for such an effort. They’ll of course do it if asked, but frequently won’t push hard so they can work more. What did I get out of it? I suppose more than anything a sense of accomplishment and pride. I certainly didn’t get any monetary rewards from the company. I didn’t get to re-allocate that portion of the budget towards things we were in very desperate need of.
The only frustrating part of the whole situation was that when we originally licensed Oracle EE, the optimal CPU configuration was the fastest dual-core CPUs you could get. So we ordered an HP DL380 G5, I think it was, with two dual-core CPUs. Given the system was marked as compatible with quad-core CPUs, I figured it would be an easy switch when or if we went to Standard Edition (which charges per socket, not per core, a fact I had to correct Oracle’s own reps on more than one occasion). But when the time came, it turned out that we had to replace the motherboard on the HP system because the particular part number we had was not compatible with quad core. It took lots of support calls and HP reps insisting that our system was compatible before someone dug further into the details and found out it was not. But we got the board and CPUs replaced and still, of course, came out way ahead.
When I come up with solutions it’s not half assed. You may have a problem and ask me and I may have an immediate solution for your problem, but it’s not because I just read about it on slashdot that morning. My solutions are heavily thought out over a period of months or years (usually years), and it’s not obvious to people that I work with (or for, often enough) how much thought actually went into a particular solution regardless of the amount of time that elapsed since you posed the question to me. I love technology and I am always on the hunt for what I consider best of breed in whatever industry that the product is in. I’m not afraid to get my hands dirty, I’m not afraid to stand by my decisions in the event I make a mistake, and I really like to operate in an environment of trust.
Would it surprise you that I led an effort to launch an e-commerce site on top of VMware GSX back in early 2004 so my company’s customer would be satisfied? How many of you were running production facing VMware servers back then? Were they doing credit card transactions? I only did it because the company’s software failed to install properly during a system upgrade, and in order to keep the customer happy we decided to build them their own stand alone cluster, went from 0 to fully functional and tested in about 96 hours, most of that time was NOT sleeping.
And before you ask, NO, I am not one of those people who is going to suggest an open source solution for every problem on the planet just because it’s free. I use open source where I believe it adds value regardless of the cost, and use commercial, closed platforms (whether it’s VMware or even Oracle) where I believe they can add value. Don’t equate creative solutions with using free software across the board. That’s just as stupid as using a closed source ecosystem for all of your IT infrastructure. You won’t catch me trying to replace your Active Directory server with a Samba+LDAP system. You could have caught me trying to do that 10 years ago. I’m long past all that.
I can only speak for myself, but let me do my job and you won’t be disappointed. I’m not afraid to say I am one of those people who can do some pretty amazing things given the right resources; if you’re on LinkedIn you can check the recommendations on my profile for some examples.
So, round about, thanks Chuck that was a good read. Getting all of this written down really makes me feel a bit better too.
May/100
That’s not a knife…
TechOps Guy: Nate
There’s been a lot of talk (no thanks to Cisco/EMC) about infrastructure blocks recently. I never liked the concept, and I still don’t. I think it makes sense in the SMB world where you have very limited IT staff and they need a canned, integrated solution. Companies like HP and IBM have been selling these sorts of mini stacks for years. As for Microsoft, I think they have a “Small business” version of their server platform which includes a bunch of things integrated together as well.
I think the concept falls apart at scale though, I’m a strong believer in best of breed technologies, and what is best of breed really depends on the requirements of the organization. I have my own favorites of course for the industries I’ve been working with/in for the past several years but I know they don’t apply to everyone.
I was reading up yesterday on some new containerized data centers that SGI released in their Ice Cube series. The numbers are just staggering.
In their most dense configuration, in 320 square feet of space consuming approximately 1 megawatt of power you can have either:
- More than 45,000 CPU cores
- More than 29 Petabytes of storage
In both cases you can get roughly 45kW per rack, while today most legacy data centers top out at between 2-5kW per rack.
Stop and think about that for a minute: think about the space, think about the density. 320 square feet is smaller than even a studio apartment, though in Japan it may be big enough to house a family of 10-12 (I hear space is tight over there).
How’s that for an infrastructure block? And yes, you can stack one on top of another:
ICE Cube utilizes an ISO standard commercially available 9.5′ x 8′ x 40′ container. SGI intentionally designed the offering such that the roof of the container is clear of obstruction and fully capable of utilizing its stacking container feature. Because of this, SGI is positioned to supply a compelling density multiplier for future expansion of the data center. If installed in a location without overhead height restriction the 9.5′ x 8′ x 40′ containers in our primary product offering can be stacked up to three-high, thus allowing customers to double or triple the per square foot density of the facility over the already industry-leading density of a single ICE Cube.
All of this made me think of a particular scene from an ’80s movie.
Really makes these other blocks some vendors are talking about sound like toys by comparison, doesn’t it?
Nov/090
Thin Provisioning strategy with VMware
TechOps Guy: Nate
Since the announcement of thin provisioning built into vSphere I have seen quite a few blog posts on how to take advantage of it, but I haven’t seen anything that matches my strategy, which has served me well utilizing array-based thin provisioning technology. I think it’s pretty foolproof.
The main caveat is that I assume you have a decent amount of storage available on your system, that is, your VMFS volumes aren’t the only thing residing on your storage. On my current storage array, written VMFS data accounts for maybe 2-3% of my storage. On the storage array I had at my last company it was probably 10-15%. I don’t believe in dedicated storage arrays myself. I prefer nice shared storage systems that can sustain random and sequential I/O from any number of hosts and distribute that I/O across all of the resources for maximum efficiency. So my current array has most of its space set aside for an NFS cluster, and then there are a couple dozen terabytes set aside for SQL servers and VMware. The main key is being able to share the same spindles across dozens or even hundreds of LUNs.
There has been a lot of debate over the recent years about how best to size your VMFS volumes. The most recent data I have seen suggests somewhere between 250GB and 500GB. There seems to be unanimous opinion out there not to do something crazy and use 2TB volumes. The exact size depends on your setup. How many VMs, how many hosts, how often you use snapshots, how often you do vMotion, as well as the amount of I/O that goes on. The less of all of those the larger the volume can potentially be.
My method is simple. I chose 1TB as my volume size, thin provisioned of course. I utilize the default lazy zero VMFS mode and do not explicitly turn on thin provisioning on any VMDK files. There’s no real point if you already have it in the array. So I create 1TB volumes, and I begin creating VMs on them. I try to stop when I get to around 500GB of allocated (but not written) space. That is, VMware thinks it is using 500GB, but it may only be using 30GB. This way I know the system will never use more than 500GB. Pretty simple. Of course I have enough space in reserve that if something crazy were to happen the volume could grow to 500GB and not cause any problems. Even with my current storage array operating in the neighborhood of 89% of total capacity, that still leaves me with several terabytes of space I can use in an emergency.
If I so desire I can go beyond the 500GB at any time without an issue. If I choose not to, then I haven’t wasted any space because nothing is written to those blocks. My thin provisioning system is licensed based on written data, so if I have 10TB of thin provisioning on my system I can, if I want, create 100TB of thin provisioned volumes, provided I don’t write more than 10TB to them. So you see there really is no loss in making a larger volume when the data is thin provisioned on the array. Why not make it 2TB or even bigger? Well, really I can’t see a time when I would EVER want a 2TB VMFS volume, which is why I picked 1TB.
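To make the bookkeeping concrete, here is a minimal Ruby sketch of the three numbers involved: the exported volume size, the space VMware thinks it has allocated, and the data actually written on the array. Everything in it (the `ThinVolume` struct, the method names, the sample figures) is hypothetical and only illustrates the 500GB soft cap and the written-data license described above.

```ruby
# Hypothetical model of array-side thin provisioning accounting.
# exported_gb  = size of the volume presented to VMware (1TB here)
# allocated_gb = what VMware believes is in use (sum of VMDK sizes)
# written_gb   = what has really been written on the array
ThinVolume = Struct.new(:name, :exported_gb, :allocated_gb, :written_gb)

SOFT_CAP_GB = 500 # stop adding VMs around this much allocated space

# Would adding a VM of this size keep the volume under the soft cap?
def room_for_new_vm?(volume, vm_size_gb, soft_cap_gb = SOFT_CAP_GB)
  volume.allocated_gb + vm_size_gb <= soft_cap_gb
end

# The license counts written data only, regardless of exported size.
def within_license?(volumes, licensed_written_gb)
  written = volumes.inject(0) { |sum, v| sum + v.written_gb }
  written <= licensed_written_gb
end

volumes = [
  ThinVolume.new("vmfs01", 1024, 480, 30),
  ThinVolume.new("vmfs02", 1024, 200, 12),
]

puts room_for_new_vm?(volumes[0], 40)     # 480 + 40 is over the soft cap
puts room_for_new_vm?(volumes[1], 40)     # 200 + 40 fits
puts within_license?(volumes, 10 * 1024)  # 42GB written against a 10TB license
```

The exported size never enters either check, which is the point: with array-side thin provisioning, only allocation (for safety) and written data (for licensing) matter.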
I took the time in my early days working with thin provisioning to learn the growth trends of various applications and how best to utilize them to get maximum gain out of thin provisioning. With VMs that means having a small dedicated disk for OS and swap, while any data resides on other VMDKs, or preferably on a NAS, or for databases on raw devices (for snapshot purposes). Given that core OSs don’t grow much, there isn’t much space needed for the OS (I default to 8GB), and I give the OS a 1GB swap partition. For additional VMDKs or raw devices I always use LVM. I use it to assist me in automatically detecting what devices a particular volume is on, I use it for naming purposes, and I use it to forcefully contain growth. Some applications are not thin provisioning friendly, but I’d like to be able to expand the volume on demand without an outage. Online LVM resize and file system resize allow this without touching the array. It really doesn’t take much work.
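As a rough illustration of the online resize step, the Ruby helper below only builds the commands and prints them; it does not run anything. It assumes an ext3/ext4 file system (hence `resize2fs`) and free extents in the volume group. The volume group and logical volume names are placeholders, not the ones used on my systems.

```ruby
# Dry-run helper: returns the commands for growing a logical volume online.
# Assumptions: ext3/ext4 file system, free space already in the volume group.
def online_grow_commands(vg, lv, grow_by_gb)
  device = "/dev/#{vg}/#{lv}"
  [
    "lvextend -L +#{grow_by_gb}G #{device}", # grow the logical volume
    "resize2fs #{device}",                   # grow the file system to fill it
  ]
end

# Placeholder names, printed for review rather than executed.
online_grow_commands("vg_data", "lv_app", 10).each { |cmd| puts cmd }
```

Keeping each logical volume deliberately small and growing it in steps like this is what contains a thin-provisioning-unfriendly application.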
On my systems I don’t really do vMotion (not licensed), I very rarely use VMFS snapshots (a few times a year), and the I/O on my VMFS volumes is tiny despite having 300+ VMs running on them. So in theory I probably could get away with 1TB or even 2TB VMFS volume sizes, but why lock myself into that if I don’t have to? So I don’t.
I also use dedicated swap VMFS volumes so I can monitor the amount of I/O going on with swap from an array perspective. Currently I have 21 VMware hosts connected to our array, totaling 168 CPU cores and 795GB of memory. I’m working to retire our main production VMware hosts, many of which are several years old (re-purposed from other applications). Now that I’ve proven how well it can work on existing hardware and the low cost version, the company is ready to gear up a bit more and commit more resources to a more formalized deployment utilizing the latest hardware and software technology. You won’t catch me using the Enterprise Plus or even the Enterprise version of VMware though; the cost/benefit isn’t there.
Aug/090
Are My Emails Getting Through? The Need to Monitor Email Deliverability Part II
TechOps Guy: Dave
I wanted to follow up on Jason’s post about determining if your e-mails are getting through with what we actually implemented. In order to find out whether the big guys (Hotmail, Gmail, AOL or EarthLink) have accepted our (opt-in) e-mail message, I created the following Nagios check script:
```ruby
#!/usr/bin/ruby
# Nagios check: log in to a mailbox (POP3, secure POP3, IMAP or Hotmail) and
# count recent messages whose subject or body matches a search string.
# Written for Ruby 1.8; the absolute require paths are specific to our hosts.
require '/usr/local/nagios/libexec/pop_ssl'
require 'net/imap'
require 'date'
require 'date/format'
# require 'imap'
require '/usr/lib64/ruby/gems/1.8/gems/hotmailer-1.0.1/lib/hotmailer.rb'
require 'rubygems'
# require 'hotmailer'
require 'getoptlong'
# require 'rdoc/usage'

# SASL PLAIN response: NUL, user, NUL, password.
class PlainAuthenticator
  def process(data)
    return "\0#{@user}\0#{@password}"
  end

  private

  def initialize(user, password)
    @user = user
    @password = password
  end
end

Net::IMAP.add_authenticator('PLAIN', PlainAuthenticator)

opts = GetoptLong.new(
  [ '--help',      '-h', GetoptLong::NO_ARGUMENT ],
  [ '--mailhost',  '-m', GetoptLong::REQUIRED_ARGUMENT ],
  [ '--username',  '-u', GetoptLong::REQUIRED_ARGUMENT ],
  [ '--password',  '-p', GetoptLong::REQUIRED_ARGUMENT ],
  [ '--port',      '-o', GetoptLong::REQUIRED_ARGUMENT ],
  [ '--search',    '-s', GetoptLong::REQUIRED_ARGUMENT ],
  [ '--age',       '-a', GetoptLong::REQUIRED_ARGUMENT ],
  [ '--transport', '-t', GetoptLong::REQUIRED_ARGUMENT ]
)

mailhost = nil
username = nil
password = nil
searchString = nil
transport = "POP3"
port = nil
age = 1

opts.each do |opt, arg|
  case opt
  when '--help'
    # RDoc::usage
  when '--mailhost'
    mailhost = arg
  when '--username'
    username = arg
  when '--password'
    password = arg
  when '--search'
    searchString = arg
  when '--port'
    port = arg
  when '--age'
    # The option was declared but never read, so age was always 1.
    age = arg.to_i
  when '--transport'
    transport = arg
  end
end

i = 0

# Figure out how far back we will search in a mailbox
day = Date.today - age
imapFormatedDate = day.strftime("%d-%b-%Y")

if transport == "securePOP3"
  Net::POP3.enable_ssl(OpenSSL::SSL::VERIFY_NONE)
end

if (transport == "POP3") or (transport == "securePOP3")
  pop = Net::POP3.new(mailhost, port)
  pop.start(username, password)
  unless pop.mails.empty?
    pop.each_mail do |m|
      date = nil
      # Pull the message date out of the Date: header
      m.header.each_line do |h|
        if h =~ /^Date:/i
          date = Date.parse(h)
        end
      end
      # Skip messages with no parseable Date header
      if date and date >= day
        m.pop.each_line do |f|
          # puts "#{f}"
          if f =~ /#{searchString}/
            i += 1
          end
        end
      end
    end
  end
  pop.finish
end

if transport == "IMAP"
  imap = Net::IMAP.new(mailhost)
  imap.login(username, password)
  messages = imap.status("inbox", ["MESSAGES"])
  if messages["MESSAGES"] >= 1
    imap.select('INBOX')
    imap.search(["SINCE", imapFormatedDate]).each do |message_id|
      msg = imap.fetch(message_id, "(UID RFC822.SIZE ENVELOPE BODY[TEXT])")[0]
      body = msg.attr["BODY[TEXT]"]
      # puts "#{body}"
      envelope = imap.fetch(message_id, "ENVELOPE")[0].attr["ENVELOPE"]
      if (envelope.subject =~ /#{searchString}/) or (body =~ /#{searchString}/)
        # puts "#{envelope.from[0].name}: \t#{envelope.subject}"
        i += 1
      end
      # Flag the message as deleted so the test mailbox does not fill up
      imap.store(message_id, "+FLAGS", [:Deleted])
    end
    imap.logout
  end
end

if transport == "hotmail"
  hotty = Hotmailer.new(username, password)
  hotty.login
  messages = hotty.messages
  messages.each do |f|
    # The scraped date has no year; 2007 is hard-coded here
    tempdate = f.date + " 2007"
    tempdate.gsub!(/(.*)( )(.*)/, '\1 \3')
    date = Date.parse(tempdate)
    if date >= day
      if (f.subject =~ /#{searchString}/) or (f.body =~ /#{searchString}/)
        i += 1
      end
    end
  end
end

# Nagios exit codes: 0 = OK, 2 = CRITICAL
if i >= 1
  puts "Found #{i} Messages matching \"#{searchString}\""
  exit(0)
else
  puts "No Messages matching \"#{searchString}\""
  exit(2)
end
```
The script should be fairly useful even today, with the exception of checking Hotmail: since I originally wrote this, Hotmail has redesigned their interface, breaking the hotmailer module I found, which screen-scraped the site.
So good luck, and I hope your properly opted-in mail is getting through.
Jul/090
Defining a Solid Escalation Plan
TechOps Guy: Jason
Definition and adherence to an escalation plan can provide clarity in the event of a high priority issue within a Web 2.0 or other company.
In order to create a solid escalation plan you must possess a thorough understanding of the level of service you plan to deliver to your customers; note that you will be staffing against this goal.
Once your SLA exists and you have a staff in place to support that commitment, then it’s about defining the process with which your Technology teams respond to issues in their efforts to adhere to the SLA.
Priority Definition & Distinctions
I believe there are essentially 4 priorities and/or categories in classifying issues: Priority 0, 1, 2 & 3. Note: This will vary depending on your company’s service or application(s).
Priority 0
This is very bad. Something with Ping/Pipe/Power generally went bad; in short, infrastructure is the culprit. This definition is reserved for issues that prevent your Web application from functioning at all; usually this is an issue at the Network/Firewall/Router or Power level. These should be very rare occurrences and are likely due to an ISP failure or inadequate preparation for a power failure. If your monitoring and alerting is set up correctly you should be notified of this issue before your customers discover it, but you will likely have downtime.
Priority 1
This is also very bad. This classification is usually reserved for cases where core functionality of the Web application is not working at all for any customer, or everything is not working for a highly visible customer. The smoking gun in these situations is likely a configuration error during a deployment, a serious defect not caught before deployment, or a sudden rush in traffic or use of a feature that caused a massive performance problem, rendering the application almost useless. These occurrences can be due to any number of issues, but if your application-level monitoring and alerting is intelligent and set up correctly, you should be notified of the issue before your customers discover it. You may not have downtime associated with this issue, but you will likely have to deploy a fix to address these problems.
Priority 2
This can be more common than most Technical Operations teams of Web 2.0 companies would like to admit. Web applications and their specific feature sets can have issues: not working as intended, failing under load, generating server errors, etc.
Priority 3
These are very common for most Web 2.0 companies. These issues are of the non-critical bug variety: broken links, reporting failures.
Service Level Targets
Now that you have a good understanding of each priority definition it is important that we set some service targets that will help us achieve the SLA we’ve agreed to with our customers. Generally there are 4 areas that I feel are important to define for each priority categorization: Call Back Target, Start Work Target, After Hours, Deployment Target.
Call Back Target
This is the amount of time the person on call has to contact the person who escalated the issue; my guideline has always been 15 minutes.
Start Work Target
This is the amount of time the person on call has to get to a place to start work on the issue; my guideline varies depending on the priority of the issue–anywhere from 15 minutes to the next business day.
After Hours
This is a guideline for deciding how to react after core business hours. For example, are issues triaged until resolved regardless of time, or can they be tackled the next business day?
Deployment Target
This is a guideline for determining when we would deploy a fix once one was created; same business day, next business day, or the weekly deployment.
Notifications
So, after you’ve classified your issues and committed to your service level targets, you need to communicate the severity of issues with your web application to your internal team. The easiest way to do this is to have an email list set up for each of P0, P1, P2 and P3 issues. When an issue is first encountered, the person on call responds within their start work target by sending an email to the appropriate list.
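As a rough sketch of how priorities, targets and notification lists could hang together, here is a small Ruby lookup table. Only the 15-minute call-back guideline and the "15 minutes to next business day" start-work range come from the text above; the values in between and the list addresses are placeholders you would replace with your own.

```ruby
# Hypothetical escalation matrix. The P1/P2 start-work values and all of the
# email addresses are illustrative assumptions, not recommendations.
ESCALATION_PLAN = {
  "P0" => { :call_back_min => 15, :start_work => "within 15 minutes",
            :list => "p0-issues@example.com" },
  "P1" => { :call_back_min => 15, :start_work => "within 15 minutes",
            :list => "p1-issues@example.com" },
  "P2" => { :call_back_min => 15, :start_work => "within 4 hours",
            :list => "p2-issues@example.com" },
  "P3" => { :call_back_min => 15, :start_work => "by the next business day",
            :list => "p3-issues@example.com" },
}

# Build the notification the on-call person sends for a new issue.
def notification_for(priority, summary)
  plan = ESCALATION_PLAN.fetch(priority) # raises on an unknown priority
  {
    :to      => plan[:list],
    :subject => "[#{priority}] #{summary}",
    :body    => "Call back within #{plan[:call_back_min]} minutes; " \
                "start work #{plan[:start_work]}.",
  }
end

note = notification_for("P0", "Site unreachable from the ISP")
puts note[:to]
puts note[:subject]
puts note[:body]
```

Keeping the targets in one table means the on-call person never has to remember them under pressure; the priority alone decides who gets told and how fast work must start.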
Ready to see it all put together? See below:
Jul/091
The Release Management Process
TechOps Guy: Jason
Reliable, Repeatable, Results Over Time. One of my mentors over the years, David Gedye, pounded in my head early in my career that my goal as an Ops/IT expert was to achieve “Reliable, Repeatable, Results Over Time.” In everything I did, he would hammer this phrase home.
Every company that has a website or web application should have a disciplined Release Management Process. There are so many benefits to getting this right, and I’m sure everyone is familiar with what happens when it goes wrong. A poor process usually means Development throwing code over the fence to Test, which subsequently throws it over to the Technical Operations team. The end result is usually configuration issues, integration issues, or last-minute bug fixes that do not get thoroughly tested for regressions. One of my strengths is being able to walk into an organization and either establish or help improve the current Release Management Process. To do this successfully I first examine the following things:
- What environments are involved in the development, testing and production of your application?
- How are these environments configured and who owns them?
- How are the virtual teams communicating when a release is ready to move from one environment to the next?
- Is there a tracking system for these releases?
- What technologies are used to store and secure the source code?
- What technologies are used to deploy the code from one environment to another?
Environments
If I could start from scratch and had the headcount and funds to do so, I would deploy the following 5 environments:
Development, Test, Sandbox, Staging & Production
While most organizations likely already take advantage of a Development and a Production environment, the other 3 environments can provide great value. Let me explain.
Development |
Description: Primary environment to perform feature development and unit testing owned by the Development team |
Support Policy: OPS supports the hardware and OS level support, Development supports the Application |
|
Deployments: Performed by Developers at their discretion |
|
Version: Running R1.2 (or 2 versions ahead of Production) |
|
Test |
Description: Primary environment to perform initial integration testing, basic performance/load testing owned by the Quality Assurance (Test) team |
| Support Policy: OPS supports the hardware and OS level support, Quality Assurance (Test) supports the Application |
|
| Deployments: Performed by Software Test Engineers at their discretion |
|
| Version: Likely running R1.0, R1.1 and R1.2 - since this environment is not a copy of Production there are likely multiple instances with multiple versions |
|
Sandbox |
Description: Environment to test complete builds of the application(s); full integration testing occurs here with recent backups of sanitized Production data; usually working on v. + 1 of Production. |
Support Policy: OPS supports hardware, OS and Application level support |
|
Deployments: Performed by OPS once signed off on by Development and Test; Release Form must be completed prior to deployment |
|
Version: Running R1.1 (or 1 version ahead of Production) |
|
Staging |
Description: Environment to used primary for hotfixes, data only and small code changes. Environment is needed as to not disrupt build testing in SANDBOX. This is definitely a luxury item as it can be costly to manage the additional burden of equipment/OS/application. Code base should match Production. |
Support Policy: OPS supports hardware, OS and Application level support |
|
Deployments: Performed by OPS once signed off on by Development and Test; Release Form must be completed prior to deployment |
|
Version: Running R1.0 (Runs identical code to Production) |
|
Production |
Description: This is the LIVE environment where the customers use the application. |
Support Policy: OPS supports hardware, OS and Application level support |
|
Deployments: Performed by OPS once signed off on by Development and Test; Release Form must be completed prior to deployment |
|
Version: Running R1.0 |
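If you want scripts to know who deploys where, the environment matrix above can be captured as plain data. This is only a rough sketch: the `Environment` class, its field names and the `needs_release_form` helper are made up for illustration, and it assumes Development and Test need no Release Form, since the form is only mentioned for Sandbox, Staging and Production.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    name: str
    app_support: str                # team that supports the application layer
    deployed_by: str                # who performs deployments
    versions_ahead: Optional[int]   # releases ahead of Production; None = mixed versions
    release_form_required: bool     # assumption: only OPS-run environments need the form


ENVIRONMENTS = [
    Environment("Development", "Development", "Developers", 2, False),
    Environment("Test", "Quality Assurance (Test)", "Software Test Engineers", None, False),
    Environment("Sandbox", "OPS", "OPS", 1, True),
    Environment("Staging", "OPS", "OPS", 0, True),
    Environment("Production", "OPS", "OPS", 0, True),
]


def needs_release_form(env_name: str) -> bool:
    """Return True if a Release Form must be completed before deploying."""
    for env in ENVIRONMENTS:
        if env.name.lower() == env_name.lower():
            return env.release_form_required
    raise KeyError(env_name)
```

A deployment script could call `needs_release_form("Sandbox")` before doing anything else and refuse to continue if the form is missing.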
Types of Fixes & Proper Lead Time
One of the most difficult problems a team needs to resolve is setting appropriate expectations with internal customers, external customers and application users regarding the amount of time it takes to properly release items from Development to Production. Oftentimes these expectations are not documented, which can only lead to disappointment in the response to customers’ critical issues, too little time to ensure quality testing and little or no practice in deploying these fixes/feature sets. So the main question is: how do we classify releases, and how much lead time does each classification require to ensure a high level of quality while still being very responsive to customer issues? Below I’ve outlined a strategy that helps tackle some of these difficult issues.
Data fix |
Description: This often occurs within ASP applications where a sample of data needs cleaning up, logically deleted or otherwise altered |
Environments: Data fixes depending on scale and risk are often run and verified in the Staging environment before moving onto Production |
|
Version: 1.01, .01 indicates a revision to the application |
|
Lead Time: Same day turn-around; if it is in by 3pm it can be deployed later that evening during a scheduled deployment |
|
Hotfix |
Description: Generally a break/fix situation with an application |
Environments: Hotfixes are generally run and verified in the Sandbox environment before moving to Staging and ultimately onto Production |
|
Version: 1.01, .01 indicates a revision to the application |
|
Lead Time: Potential Same day turn-around, but would prefer a full day’s notice. A full day’s notice would allow for a full day of testing before the deployment date and allow an on and offshore test team to verify a fix; again if it is in by 3pm it can be deployed later that evening during a scheduled deployment. |
|
Incremental Release |
Description: Incremental Releases include the introduction/modification of new features, slight updates/tweaks to the UI, collection of bug fixes that have been triaged into the release |
Environments: Incremental Releases are generally run and verified in the Sandbox environment before moving to Staging and ultimately onto Production |
|
Version: 1.2, .2 indicates a revision to the application |
|
Lead Time: 2 full days’ notice. This will allow Operations to retrieve an identical copy of data from Production to do a full and accurate deployment in Sandbox. The restoration of Production data takes time to complete, so the additional day’s notice is needed. Again, if it is in by 3pm it can be deployed 2 days later during a scheduled deployment. |
|
Milestone Release |
Description: These are the biggies! Large feature set deployments, architecture changes, data model changes and additional bug fixes that have been triaged into the release |
Environments: Milestone Releases are generally run and verified in the Sandbox environment before moving to Staging and ultimately onto Production |
|
Version: 2.0, the leading 2 indicates a new major version of the application |
|
Lead Time: 5 full days’ notice. This will allow Operations to retrieve an identical copy of data from Production to do a full and accurate deployment in Sandbox. The restoration of Production data takes time to complete, so the additional days’ notice is needed. Again, if it is in by 3pm it can be deployed 5 days later during a scheduled deployment. |
* Lead time descriptions can always be a little fuzzy. Let’s just state that the lead times above encourage a rapid response to customer issues while still leaving enough time for Quality Assurance testing, not just Black Box testing.
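To make the lead times concrete, here is a small sketch that turns the classifications above into an earliest deployment date. Two things are my assumptions rather than part of the policy: a request that arrives after the 3pm cutoff is treated as if it came in the next day, and lead days are counted as calendar days (weekends are ignored). The names `LEAD_DAYS` and `earliest_deployment` are hypothetical.

```python
from datetime import date, datetime, time, timedelta

CUTOFF = time(15, 0)  # "in by 3pm"

# Lead time in days for each release classification described above.
LEAD_DAYS = {
    "data fix": 0,             # same-day turn-around
    "hotfix": 0,               # same day is possible; a full day's notice is preferred
    "incremental release": 2,
    "milestone release": 5,
}


def earliest_deployment(release_type: str, submitted: datetime) -> date:
    """Return the earliest scheduled-deployment date for a request."""
    lead = LEAD_DAYS[release_type.lower()]
    start = submitted.date()
    # Assumption: missing the cutoff pushes the request to the next day.
    if submitted.time() > CUTOFF:
        start += timedelta(days=1)
    # Assumption: calendar days, no weekend or holiday handling.
    return start + timedelta(days=lead)
```

For example, an Incremental Release submitted at 2pm on a given day comes back as two days later, while the same request at 4pm slips by one more day.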
Hardware Selection
Hardware selection is driven completely by the budget you have to spend. If the web farm for your application is 10 web servers, you likely will not need more than 2 web servers to fully simulate the Production environment in Sandbox and Staging. Also, if you are not performance/load testing in Sandbox and/or Staging, then you can get away with cheaper desktops or inexpensive rack-mounted servers rather than the beefier hardware you are likely running in Production.
Comments About Configuration
I’m not a big fan of consolidating multiple services on 1 server in Sandbox, unless it is similarly configured in Production. Whether it’s football, baseball or Operations, you need to practice like you’re gonna play; that is the motivation here.
Tracking a Release
It seems obvious that a release should be documented and recorded to review over time. The obvious solution is to create a database where you can store information on releases. The following represents what I would consider the ideal information you should capture in your database.
| Variable | Selection Options | Description |
| Environment | Select Drop-down | Sandbox, Staging, Production |
| Release Type | Select Drop-down | Hot Fix, Service Release, Full Release, Configuration |
| Application | Select Drop-down | Jobster Service Corporate Website JIVE – Delight Highdeal Jobster Search Static Content Coffee Robot UJobs |
| Version | | |
| Release Instructions | Text area for sets of instructions | |
| Requested By | Select Drop-down | |
| Development Lead | Select Drop-down | |
| QA Lead | Select Drop-down | |
| Notify | Pre-selected Groups | Technical Operations, Quality Assurance/Test, Development, Program Management |
| Comments | Text area for comments about the release | Typically errors encountered upon deployment, etc. |
| Deployment Start Time | Entered by the Release Tracker | |
| Deployment End Time | Entered by the Release Tracker | |
| Submit to QA Time | Entered by the Release Tracker | |
| Delete Time | Entered by the Release Tracker | |
| Failed Time | Entered by the Release Tracker | |
| Passed QA Time | Entered by the Release Tracker | |
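As a starting point, the fields above could be modeled as a simple record before you commit to a database schema. This is a sketch only: the `ReleaseRequest` class and its attribute names are my own invention, the types are guesses, and only the two drop-downs whose options are listed in full (Environment and Release Type) are validated.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

ENVIRONMENT_OPTIONS = ("Sandbox", "Staging", "Production")
RELEASE_TYPE_OPTIONS = ("Hot Fix", "Service Release", "Full Release", "Configuration")


@dataclass
class ReleaseRequest:
    # Filled in by the person requesting the release.
    environment: str
    release_type: str
    application: str
    version: str
    release_instructions: str
    requested_by: str
    development_lead: str
    qa_lead: str
    notify: List[str] = field(default_factory=list)
    comments: str = ""
    # Timestamps entered by the Release Tracker as the release progresses.
    deployment_start_time: Optional[datetime] = None
    deployment_end_time: Optional[datetime] = None
    submit_to_qa_time: Optional[datetime] = None
    delete_time: Optional[datetime] = None
    failed_time: Optional[datetime] = None
    passed_qa_time: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Reject values that are not in the drop-down lists.
        if self.environment not in ENVIRONMENT_OPTIONS:
            raise ValueError(f"unknown environment: {self.environment!r}")
        if self.release_type not in RELEASE_TYPE_OPTIONS:
            raise ValueError(f"unknown release type: {self.release_type!r}")
```

Each attribute maps one-to-one onto a column, so moving this into a real table later should be mostly mechanical.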
Each time a Release Tracker Deployment Request Form is completed, the following individuals are notified: the Requester, the Development Lead for the application, the QA Lead for the application, anyone else selected under Notify Groups and the entire Technical Operations team.
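The notification rule is easy to get subtly wrong (people end up with duplicate emails when they wear two hats), so here is a minimal sketch of building the recipient list. The function name `recipients` and its parameters are hypothetical, and it assumes each person or group is identified by a plain string such as an email address.

```python
from typing import Iterable, List


def recipients(
    requester: str,
    development_lead: str,
    qa_lead: str,
    notify_groups: Iterable[str],
    tech_ops_team: Iterable[str],
) -> List[str]:
    """Build the notification list in a stable order, without duplicates."""
    ordered = [requester, development_lead, qa_lead, *notify_groups, *tech_ops_team]
    seen = set()
    result = []
    for address in ordered:
        if address not in seen:  # skip anyone already on the list
            seen.add(address)
            result.append(address)
    return result
```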
Source Code Repositories
Visual SourceSafe, Subversion, CVS, Source Depot, etc.: your source code needs to be stored in a system that allows for labels, branching and versioning. The key to any Release Tracking process is making sure you are able to roll back to a previous version of the source code should the need arise. All of the above source code repositories support a labeling or versioning system. To make the Release Tracker process work, you must have a Label or Version correctly identified in your source code repository. It is this version that will be tracked from Deployment Request to Deploying Notification to Deployed and Ready for QA, to Passed/Failed QA; should the process fail anywhere along the way, there should be a clear version that the Technical Operations team can roll back to.
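The life cycle of a labeled version can be sketched as a small state machine. This is an illustration under my own assumptions, not a description of any particular tool: the `Stage` and `TrackedRelease` names are hypothetical, a failure is allowed at any stage before QA sign-off, and the rollback target is simply the previously deployed label.

```python
from enum import Enum


class Stage(Enum):
    REQUESTED = "Deployment Request"
    DEPLOYING = "Deploying Notification"
    READY_FOR_QA = "Deployed and Ready for QA"
    PASSED_QA = "Passed QA"
    FAILED = "Failed"


# Allowed forward moves; failure is handled separately by fail().
NEXT_STAGE = {
    Stage.REQUESTED: Stage.DEPLOYING,
    Stage.DEPLOYING: Stage.READY_FOR_QA,
    Stage.READY_FOR_QA: Stage.PASSED_QA,
}


class TrackedRelease:
    def __init__(self, label: str, previous_label: str) -> None:
        self.label = label                    # label/version in the source repository
        self.previous_label = previous_label  # known-good label to roll back to
        self.stage = Stage.REQUESTED

    def advance(self) -> Stage:
        """Move to the next stage, refusing to move past a final state."""
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"cannot advance from {self.stage.value}")
        self.stage = NEXT_STAGE[self.stage]
        return self.stage

    def fail(self) -> str:
        """Mark the release failed and return the label to roll back to."""
        if self.stage in (Stage.PASSED_QA, Stage.FAILED):
            raise ValueError(f"cannot fail from {self.stage.value}")
        self.stage = Stage.FAILED
        return self.previous_label
```

The point of keeping `previous_label` on the record is that whoever is on call never has to guess which version to restore.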

