19
Jun/10
0

40 Million IOPS in two racks

TechOps Guy: Nate

Fusion-io does it again: another astonishing level of performance in a very efficient design. From the case study:

LLNL used Fusion’s ioMemory technology to create the world’s highest performance storage array. Using Fusion’s ioSANs and ioDrive Duos, the cluster achieves an unprecedented 40,800,000 IOPS and 320GB/s aggregate bandwidth.
Incredibly, Fusion’s ioMemory allowed LLNL to accomplish this feat in just two racks of appliances– something that would take a comparable hard disk-based solution over 43 racks. In fact, it would take over 100 of the SPC-1 benchmark’s leading all-flash vendor systems combined to match the performance, at a cost of over $300 million.

40 million IOPS at ~250 IOPS per 15K RPM disk: you’re talking 160,000 disk drives.
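The arithmetic behind that drive count can be checked with a quick sketch. The per-disk figure is the rough estimate used above, and the helper name `disks_needed` is made up for illustration:

```ruby
# Back-of-the-envelope check of the drive count quoted above.
# IOPS_PER_15K_DISK is the rough per-spindle estimate from the post,
# not a measured value; disks_needed is a hypothetical helper name.
TARGET_IOPS       = 40_000_000
IOPS_PER_15K_DISK = 250

def disks_needed(target_iops, iops_per_disk)
  (target_iops.to_f / iops_per_disk).ceil
end

puts disks_needed(TARGET_IOPS, IOPS_PER_15K_DISK)   # 160000
puts disks_needed(40_800_000, IOPS_PER_15K_DISK)    # 163200 for the exact case-study figure
```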

Not all flash is created equal, of course; many people don’t understand that. They just see that this one is cheap and that one is not, without having any clue why (shocker).

It’s just flat out irresponsible to ignore such an industry-changing technology, especially for workloads that deal with small (sub-TB) amounts of data.

11
Jun/10
0

Investing in IT vs spending in IT

TechOps Guy: Nate

My good friend Chuck over at EMC (OK, we’ve never met, but he seems like a nice guy; we could be friends) wrote an interesting article about Investing in IT vs Spending on IT. I thought it was a really good read. I hadn’t thought of things in that way, but it made me realize I am one who wants to invest in IT infrastructure, even if it means paying more up front; the value add of some solutions is just difficult to put numbers on.

Take storage for example, since Chuck is a storage guy. There’s a lot more to storage than cost per TB, cost per IOP, cost per usable TB, and even more than cost of power+cooling for the solution. The smaller things really do add up over time, but how do you put numbers on them? Take something as simple as granular monitoring: when I went through a storage refresh a while back, the established vendor really had no way of measuring the performance of their own system to develop a plan for a suitable technology refresh. It wasn’t a small system either; it was a big, fairly expensive (for the time) one.

Would you have expected to replace one storage system with another that had less than half the number of disks, and roughly 75% less raw IOPS (on paper)? Would you have expected the new system to not only outperform and outscale the old one but continue to absorb a significant amount of growth over the following year before needing an upgrade? If you’re a normal person I would expect you to not expect that. But that’s what happened.

My approach is to establish a track record at an organization. This may take a few months, or it may take a year (or much longer if it’s a big company). Once you have established X number of successful projects, a higher degree of trust is put in you to have more lateral control and influence on how things work. Less hand holding and fewer minute justifications are required to get your point across, and you can get more things done.

Maybe that thinking is too logical, I don’t know. It’s how I think, though. I put more faith in people the more I see how good they are at their jobs. If they turn out to provide good solutions or even good angles of thought, I believe I can rely on them to do that line of work rather than looking over their shoulder double-checking everything. I think it’s how you can scale. Of course not all employees measure up; I would say that, especially in larger organizations, most do not (government is especially bad, I hear).

No one person can run it all, as much as they’d like. I’ve tried, and the results, while not horrible, weren’t as good as having more people doing the work. I learned the hard way to delegate more work, whether it’s to co-workers, to contractors, or even to vendors. People take vendors for granted, but there is a lot of experience and knowledge they can bring to the table. Not all vendor teams are created equal, of course.

If you just want to spend on IT, don’t hire someone like me; I don’t want to work for you. If you want to invest in IT, to give your organization more leadership in new technologies that can improve efficiencies and lower costs, then you may want someone like me. That is why I gravitate towards smaller, more technology-driven organizations. They usually don’t have the economies of scale to do things as well as the big guns out there, so it’s up to people like me to develop innovative solutions to compete differently. If you read the blog you’ll see I don’t subscribe to any one vendor stack. I like many different products from many different vendors depending on what the requirements are.

From a vendor perspective (since it’s been 5 years since I worked with a contractor of sorts) I do like to have a good relationship with the vendor, they can be a valuable source of information. Vendors either love me or hate me, it really depends on their products, as folks that have worked with me can attest. It also depends on how technical the vendor can get with me. I like to go deep into the technical realm. And I believe I do challenge the System Engineers at my vendors with tough questions. Those that don’t measure up don’t last long. I have high expectations of myself, and I have high expectations of those around me, frequently too high. I don’t like to play political games where you try to screw them over because you know they’ll screw you back the first chance you get. Having a good relationship is one of those things it’s hard to put a number on. To me it’s worth a decent amount.

Jake, another person on this blog (hi Jake!), is similar, though he’s a lot more loyal than me, which again can be a good thing as well. Changing technology paths every 15 minutes is not a good idea, and having a dozen different server vendors in your racks because different ones provided 5-10% better pricing at that particular time of day is not a good idea either.

Speaking of Jake, I remember when I first started at my previous company and they were negotiating with Oracle on licensing. They were out of compliance (they had paid for Oracle SE One but were using Oracle EE) and were facing hefty fines. I proposed an alternative (going to Oracle Standard Edition, which is significantly different from SE One), which would have saved significant amounts of money with really no loss in functionality (for our apps at the time). I was a new (literally a few weeks) employee and Jake dismissed my opinion, which I could understand: I was new and had no track record, and nobody knew if I knew what I was talking about. It was OK though, so they paid their fines and licensed some new Oracle stuff as part of the settlement.

The next year rolled around and Oracle came back again to do an audit, and once again found massive numbers of violations and the company was once again facing large amounts of penalties to get back in compliance. Apparently the previous process wasn’t as transparent as they expected, either the Oracle rep was misleading the company or was generally incompetent, I don’t know since I wasn’t involved in those talks.

Once again I strongly urged the company to migrate to Standard Edition to slash licensing costs, and this time they listened to me. It took a few weeks to get all of the environments migrated over, including a full weekend of my time migrating production, doing all sorts of unsupported things to get it done (value adds for you) and to minimize downtime (while you can typically go from Oracle SE to EE without downtime, you can’t do it the other way around). I went the extra mile to establish a standby DB server with log replication and consistent database backups (because you can’t run RMAN against a standby DB, or at least you couldn’t on 10gR2 at the time). All of it worked great, and we (as expected) slashed our Oracle licensing fees.

Of course I didn’t have to do that. I could have sat by and watched them pay up in fees again (several hundred thousand dollars in total). But I did do it. I did go to them and say I’m willing to work my ass off for several weeks to do this to save you money. Many people I’ve come across don’t, I think, have the dedication to volunteer for such an effort; they’ll of course do it if asked, but frequently won’t push hard so they can work more. What did I get out of it? I suppose more than anything a sense of accomplishment and pride. I certainly didn’t get any monetary rewards from the company. I didn’t get to re-allocate that portion of the budget towards things we were in very desperate need of.

The only frustrating part of the whole situation was that when we licensed Oracle EE originally, the optimal CPU configuration at the time was the fastest dual core CPUs you could get. So we ordered an HP DL380 G5, I think it was, with dual proc dual core CPUs. Given the system was marked as compatible with quad core CPUs, I figured it would be an easy switch when or if we went to Standard Edition (which charges per socket, not per core, a fact I had to correct Oracle’s own reps on more than one occasion). But when the time came it turned out that we had to replace the motherboard on the HP system because the particular part number we had was not compatible with quad core. It took lots of support calls and HP reps insisting that our system was compatible before someone dug further into the details and found out it was not. But we got the board and CPUs replaced and still of course came out way ahead.

When I come up with solutions it’s not half assed. You may have a problem and ask me and I may have an immediate solution for your problem, but it’s not because I just read about it on slashdot that morning. My solutions are heavily thought out over a period of months or years (usually years), and it’s not obvious to people that I work with (or for, often enough) how much thought actually went into a particular solution regardless of the amount of time that elapsed since you posed the question to me. I love technology and I am always on the hunt for what I consider best of breed in whatever industry that the product is in. I’m not afraid to get my hands dirty, I’m not afraid to stand by my decisions in the event I make a mistake, and I really like to operate in an environment of trust.

Would it surprise you that I led an effort to launch an e-commerce site on top of VMware GSX back in early 2004 so my company’s customer would be satisfied? How many of you were running production facing VMware servers back then? Were they doing credit card transactions? I only did it because the company’s software failed to install properly during a system upgrade, and in order to keep the customer happy we decided to build them their own stand alone cluster, went from 0 to fully functional and tested in about 96 hours, most of that time was NOT sleeping.

And before you ask, NO, I am not one of those people who is going to suggest an open source solution for every problem on the planet just because it’s free. I use open source where I believe it adds value regardless of the cost, and use commercial, closed platforms (whether it’s VMware or even Oracle) where I believe they can add value. Don’t equate creative solutions with using free software across the board. That’s just as stupid as using a closed source ecosystem for all of your IT infrastructure. You won’t catch me trying to replace your Active Directory server with a Samba+LDAP system. You could have caught me trying to do that 10 years ago. I’m long past all that.

I can only speak for myself, but let me do my job and you won’t be disappointed. I’m not afraid to say I am one of those people who can do some pretty amazing things given the right resources. If you’re on LinkedIn you can check the recommendations on my profile for some examples.

So, round about, thanks Chuck that was a good read. Getting all of this written down really makes me feel a bit better too.

28
May/10
0

That’s not a knife…

TechOps Guy: Nate

There’s been a lot of talk (no thanks to Cisco/EMC) about infrastructure blocks recently. I never liked (and still don’t like) the concept. I think it makes sense in the SMB world where you have very limited IT staff and they need a canned, integrated solution. Companies like HP and IBM have been selling these sorts of mini stacks for years. As for Microsoft, I think they have a “Small business” version of their server platform which includes a bunch of things integrated together as well.

I think the concept falls apart at scale though, I’m a strong believer in best of breed technologies, and what is best of breed really depends on the requirements of the organization. I have my own favorites of course for the industries I’ve been working with/in for the past several years but I know they don’t apply to everyone.

I was reading up yesterday on some new containerized data centers that SGI released in their Ice Cube series. The numbers are just staggering.

In their most dense configuration, in 320 square feet of space consuming approximately 1 megawatt of power you can have either:

  • More than 45,000 CPU cores
  • More than 29 Petabytes of storage

In both cases you can get roughly 45kW per rack, while today most legacy data centers top out at between 2-5kW per rack.

Stop and think about that for a minute. Think about the space, think about the density. 320 square feet is smaller than even a studio apartment, though in Japan it may be big enough to house a family of 10-12 (I hear space is tight over there).

How’s that for an infrastructure block? And yes, you can stack one on top of another:

ICE Cube utilizes an ISO standard commercially available 9.5′ x 8′ x 40′ container. SGI intentionally designed the offering such that the roof of the container is clear of obstruction and fully capable of utilizing its stacking container feature. Because of this, SGI is positioned to supply a compelling density multiplier for future expansion of the data center. If installed in a location without overhead height restriction the 9.5′ x 8′ x 40′ containers in our primary product offering can be stacked up to three-high, thus allowing customers to double or triple the per square foot density of the facility over the already industry-leading density of a single ICE Cube.

All of this made me think of a particular scene from an ’80s movie.

Really makes these other blocks some vendors are talking about sound like toys by comparison, doesn’t it?

6
Nov/09
0

Thin Provisioning strategy with VMware

TechOps Guy: Nate

Since the announcement of thin provisioning built into vSphere I have seen quite a few blog posts on how to take advantage of it, but I haven’t seen anything that matches my strategy, which has served me well utilizing array-based thin provisioning technology. I think it’s pretty foolproof.

The main caveat is that I assume you have a decent amount of storage available on your system, that is, your VMFS volumes aren’t the only thing residing on your storage. On my current storage array, written VMFS data accounts for maybe 2-3% of my storage. On the storage array I had at my last company it was probably 10-15%. I don’t believe in dedicated storage arrays myself. I prefer nice shared storage systems that can sustain random and sequential I/O from any number of hosts and distribute that I/O across all of the resources for maximum efficiency. So my current array has most of its space set aside for an NFS cluster, and then there are a couple dozen terabytes set aside for SQL servers and VMware. The main key is being able to share the same spindles across dozens or even hundreds of LUNs.

There has been a lot of debate over the recent years about how best to size your VMFS volumes. The most recent data I have seen suggests somewhere between 250GB and 500GB. There seems to be unanimous opinion out there not to do something crazy and use 2TB volumes. The exact size depends on your setup. How many VMs, how many hosts, how often you use snapshots, how often you do vMotion, as well as the amount of I/O that goes on. The less of all of those the larger the volume can potentially be.

My method is simple. I chose 1TB as my volume size, thin provisioned of course. I utilize the default lazy zero VMFS mode and do not explicitly turn on thin provisioning on any VMDK files. There’s no real point if you already have it in the array. So I create 1TB volumes, and I begin creating VMs on them. I try to stop when I get to around 500GB of allocated (but not written) space. That is, VMware thinks it is using 500GB, but it may only be using 30GB. This way I know the system will never use more than 500GB. Pretty simple. Of course I have enough space in reserve that if something crazy were to happen the volume could grow to 500GB and not cause any problems. Even with my current storage array operating in the neighborhood of 89% of total capacity, that still leaves me with several terabytes of space I can use in an emergency.

If I so desire I can go beyond the 500GB at any time without an issue. If I choose not to, then I haven’t wasted any space because nothing is written to those blocks. My thin provisioning system is licensed based on written data, so if I have 10TB of thin provisioning licensed on my system I can, if I want, create 100TB of thin provisioned volumes, provided I don’t write more than 10TB to them. So you see there really is no loss in making a larger volume when the data is thin provisioned on the array. Why not make it 2TB or even bigger? Well, really I can’t see a time when I would EVER want a 2TB VMFS volume, which is why I picked 1TB.
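The placement rule described above (stop adding VMs once allocated space on a volume reaches the soft cap, regardless of how little has actually been written) can be sketched roughly as follows. This is only an illustration: the `Volume` struct, the helper names, and all the sample numbers are hypothetical, and nothing here talks to vSphere or to a storage array.

```ruby
# Sketch of the allocation rule: place a new VM disk on the first volume
# where the space VMware has allocated (not the space actually written)
# would stay at or under the soft cap.
# Volume, fits? and place_vm are hypothetical names; the data is made up.
Volume = Struct.new(:name, :size_gb, :allocated_gb, :written_gb)

ALLOCATION_CAP_GB = 500  # soft cap on a 1TB thin provisioned volume

def fits?(volume, vmdk_gb, cap_gb = ALLOCATION_CAP_GB)
  volume.allocated_gb + vmdk_gb <= cap_gb
end

def place_vm(volumes, vmdk_gb)
  target = volumes.find { |v| fits?(v, vmdk_gb) }
  return nil unless target          # every volume is at its cap
  target.allocated_gb += vmdk_gb    # written_gb only grows as data lands
  target
end

volumes = [
  Volume.new("vmfs01", 1024, 495, 30),
  Volume.new("vmfs02", 1024, 120, 12)
]

chosen = place_vm(volumes, 9)       # a hypothetical 9GB VM disk
puts chosen.name                    # vmfs02, since vmfs01 would exceed the cap
```

Because the cap is only a convention, raising it later is a one-line change, which mirrors the point that going past 500GB costs nothing until blocks are actually written.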

I took the time in my early days working with thin provisioning to learn the growth trends of various applications and how best to utilize them to get maximum gain out of thin provisioning. With VMs that means having a small dedicated disk for OS and swap, with any data residing on other VMDKs, or preferably on a NAS, or for databases on raw devices (for snapshot purposes). Given that core OSs don’t grow much, there isn’t much space needed for the OS (I default to 8GB), and I give the OS a 1GB swap partition. For additional VMDKs or raw devices I always use LVM. I use it to assist me in automatically detecting what devices a particular volume is on, I use it for naming purposes, and I use it to forcefully contain growth. Some applications are not thin provisioning friendly, but I’d like to be able to expand the volume on demand without an outage. Online LVM resize and file system resize allow this without touching the array. It really doesn’t take much work.

On my systems I don’t really do vMotion (not licensed), I very rarely use VMFS snapshots (a few times a year), and the I/O on my VMFS volumes is tiny despite having 300+ VMs running on them. So in theory I probably could get away with 1TB or even 2TB VMFS volume sizes, but why lock myself into that if I don’t have to? So I don’t.

I also use dedicated swap VMFS volumes so I can monitor the amount of I/O going on with swap from an array perspective. Currently I have 21 VMware hosts connected to our array, totalling 168 CPU cores and 795GB of memory. I am working to retire our main production VMware hosts, many of which are several years old (re-purposed from other applications). Now that I’ve proven how well it can work on existing hardware and the low cost version, the company is ready to gear up a bit more and commit more resources to a more formalized deployment utilizing the latest hardware and software technology. You won’t catch me using the Enterprise Plus or even the Enterprise version of VMware though; the cost/benefit isn’t there.

25
Aug/09
0

Are My Emails Getting Through? The Need to Monitor Email Deliverability Part II

TechOps Guy: Dave

I wanted to follow up on Jason’s post about determining if your e-mails are getting through with what we actually implemented. In order to find out whether the big guys (Hotmail, Gmail, AOL or EarthLink) have accepted our (opt-in) e-mail message, I created the following Nagios check script:

checkmail.rb:
#!/usr/bin/ruby
 
 require '/usr/local/nagios/libexec/pop_ssl'
 require 'net/imap'
 require 'date'
 require 'date/format'
# require 'imap'
 require '/usr/lib64/ruby/gems/1.8/gems/hotmailer-1.0.1/lib/hotmailer.rb'
 require 'rubygems'
# require 'hotmailer'
 require 'getoptlong'
# require 'rdoc/usage'
 
 
class PlainAuthenticator
 
  def process(data)
    # SASL PLAIN response: NUL, user, NUL, password
    return "\0#{@user}\0#{@password}"
  end
 
  private
 
  def initialize(user, password)
    @user = user
    @password = password
  end
 
end
 
Net::IMAP.add_authenticator('PLAIN', PlainAuthenticator)
 
    opts = GetoptLong.new(
      [ '--help', '-h', GetoptLong::NO_ARGUMENT ],
      [ '--mailhost', '-m', GetoptLong::REQUIRED_ARGUMENT ],
      [ '--username', '-u', GetoptLong::REQUIRED_ARGUMENT ],
      [ '--password', '-p', GetoptLong::REQUIRED_ARGUMENT ],
      [ '--port', '-o', GetoptLong::REQUIRED_ARGUMENT ],
      [ '--search', '-s', GetoptLong::REQUIRED_ARGUMENT ],
      [ '--age', '-a', GetoptLong::REQUIRED_ARGUMENT ],
      [ '--transport', '-t', GetoptLong::REQUIRED_ARGUMENT ]
    )
 
mailhost = nil
username = nil
password = nil
searchString = nil
transport="POP3"
port = nil
age = 1
 
    opts.each do |opt, arg|
      case opt
        when '--help'
#          RDoc::usage
        when '--mailhost'
                mailhost = arg
        when '--username'
                username = arg
        when '--password'
                password = arg
        when '--search'
                searchString = arg
        when '--port'
                port = arg.to_i
        when '--age'
                # Number of days back to search (defaults to 1)
                age = arg.to_i
        when '--transport'
                transport  = arg
      end
    end
 
i = 0
 
# Figure out how far back we will search in a mailbox
        day = Date.today-age
        imapFormatedDate = day.strftime("%d-%b-%Y")
 
if transport == "securePOP3"
        Net::POP3.enable_ssl(OpenSSL::SSL::VERIFY_NONE)
end
 
if ((transport == "POP3") or (transport == "securePOP3"))
        pop = Net::POP3.new(mailhost, port)
        pop.start( username,password)
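        unless pop.mails.empty?
                pop.each_mail do |m|
                        date = nil
                        # Pull the Date: header out of the raw header block
                        m.header.each_line do |h|
                                if h =~ /^Date:/i
                                        date = Date.parse(h.sub(/^Date:\s*/i, ''))
                                end
                        end
                        # Skip messages that have no Date header
                        if date && date >= day
                                m.pop.each_line do |f|
#                               puts "#{f}"
                                        if f =~ /#{searchString}/
                                                i += 1
                                        end
                                end
                        end
                end
        end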
 
        pop.finish
end
 
if transport == "IMAP"
        imap = Net::IMAP.new(mailhost)
        imap.login(username, password)
        messages = imap.status("inbox", ["MESSAGES"])
        if messages["MESSAGES"] >= 1
                imap.select('INBOX')
                imap.search(["SINCE", imapFormatedDate]).each do |message_id|
                        msg = imap.fetch(message_id, "(UID RFC822.SIZE ENVELOPE BODY[TEXT])")[0]
                        body = msg.attr["BODY[TEXT]"]
        #               puts "#{body}"
                        envelope  = imap.fetch(message_id, "ENVELOPE")[0].attr["ENVELOPE"]
                        if (envelope.subject =~ /#{searchString}/) or  (body =~ /#{searchString}/)
        #           puts "#{envelope.from[0].name}: \t#{envelope.subject}"
                                i += 1
                        end
                        imap.store(message_id, "+FLAGS", [:Deleted])
                end
                imap.logout
        end
end
 
if transport == "hotmail"
        hotty = Hotmailer.new(username, password) 
        hotty.login
 
        messages = hotty.messages
        messages.each do |f|
                tempdate = f.date+" 2007"
                tempdate.gsub!(/(.*)( )(.*)/,'\1 \3')
                date = Date.parse(tempdate)
                if date >= day
                        if (f.subject =~ /#{searchString}/) or  (f.body =~ /#{searchString}/)
                                i += 1
                        end
                end
        end
 
 
end
 
# Nagios exit codes: 0 = OK, 2 = CRITICAL
if i >= 1
        puts "Found #{i} Messages matching  \"#{searchString}\""
        exit(0)
else
        puts "No Messages matching \"#{searchString}\""
        exit(2)
end

The script should still be fairly useful today, with the exception of checking Hotmail: since I originally wrote this, Hotmail has redesigned their interface, breaking the hotmailer module I found, which screen-scraped the site.

So good luck, and I hope your properly opted-in mail is getting through.
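The last step of the check follows the usual Nagios plugin convention, where the exit code tells Nagios the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the first line of output is the status text. A minimal sketch of just that step, pulled into a function so it can be tried on its own; the name `check_result` and the sample search string are made up for illustration and are not part of checkmail.rb:

```ruby
# Standard Nagios plugin exit codes.
NAGIOS_OK       = 0
NAGIOS_WARNING  = 1
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN  = 3

# Returns [status line, exit code] for a given number of matching messages.
# check_result is a hypothetical helper, not something the script defines.
def check_result(match_count, search_string)
  if match_count >= 1
    ["OK - found #{match_count} messages matching \"#{search_string}\"", NAGIOS_OK]
  else
    ["CRITICAL - no messages matching \"#{search_string}\"", NAGIOS_CRITICAL]
  end
end

message, code = check_result(0, "Weekly Newsletter")
puts message
puts code
# A real plugin would finish with exit(code) instead of printing it.
```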

30
Jul/09
0

Defining a Solid Escalation Plan

TechOps Guy: Jason

Definition and adherence to an escalation plan can provide clarity in the event of a high priority issue within a Web 2.0 or other company.
In order to create a solid escalation plan you must possess a thorough understanding of the level of service you plan to deliver to your customers; note you will be staffing against this goal :) Once your SLA exists and you have a staff in place to support that commitment, it’s about defining the process by which your Technology teams respond to issues in their efforts to adhere to the SLA.

Priority Definition & Distinctions

I believe there are essentially 4 priorities and/or categories in classifying issues: Priority 0, 1, 2 & 3. Note: This will vary depending on your company’s service or application(s).

Priority 0

This is very bad. Something with Ping/Pipe/Power generally went bad; in short, Infrastructure is the culprit. This definition is reserved for the issues that prevent your Web application from functioning at all; usually this is an issue at the Network/Firewall/Router or Power level. These should be very rare occurrences and are likely due to an ISP failure or to not adequately preparing for a power failure. If your monitoring and alerting is set up correctly you should be notified of this issue before your customers discover it, but you will likely have downtime.

Priority 1

This is also very bad. This classification is usually reserved for Core Functionality of the Web application not working at all for any customers, or everything not working for a highly visible customer. The smoking gun in these situations is likely a configuration error during a deployment, a serious defect not caught before deployment, or a sudden rush in traffic/use of a feature that caused a massive performance problem, rendering the application almost useless. These occurrences can be due to any number of issues, but if you have intelligent application level monitoring and your alerting is set up correctly, you should be notified of this issue before your customers discover it. You may not have downtime associated with this issue, but you will likely have to deploy a fix to address these problems.

Priority 2

This can be more common than most Technical Operations teams of Web 2.0 companies would like to admit. Web applications and their specific feature sets can have issues; not working as intended, failing under load, generating server errors, etc.

Priority 3

These are very common for most Web 2.0 companies. These issues are of the non-critical bug variety: broken links, reporting failures, and the like.

Service Level Targets

Now that you have a good understanding of each priority definition it is important that we set some service targets that will help us achieve the SLA we’ve agreed to with our customers. Generally there are 4 areas that I feel are important to define for each priority categorization: Call Back Target, Start Work Target, After Hours, Deployment Target.

Call Back Target

This is the amount of time the person on call has to contact the person who escalated the issue; my guideline has always been 15 minutes.

Start Work Target

This is the amount of time the person on call has to get to a place to start work on the issue; my guideline varies depending on the priority of the issue–anywhere from 15 minutes to the next business day.

After Hours

This is a guideline for deciding how to react after core business hours. For example, are issues triaged until resolved regardless of time, or can they be tackled the next business day?

Deployment Target

This is a guideline for determining when we would deploy a fix once one was created; same business day, next business day, or the weekly deployment.

Notifications

So, after you’ve classified your issues and committed to your service level targets, you need to communicate the severity of the issues affecting your web application to your internal team. The easiest way to do this is to have an email list set up for each of P0, P1, P2 and P3 issues. When you first encounter one of these issues, the person on call responds within their start work target by sending an email to the appropriate list.
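One way to keep the priorities, targets and notification lists in one place is a small lookup table. The sketch below is only an illustration: the 15-minute call back guideline comes from the text above, while every other value (the start work and deployment choices per priority, and the example.com addresses) is an assumption that you would replace with whatever your own SLA dictates.

```ruby
# Hypothetical escalation table. Only the 15-minute call back guideline is
# taken from the article; all other values and addresses are assumptions.
Target = Struct.new(:call_back_minutes, :start_work, :after_hours,
                    :deployment, :mailing_list)

ESCALATION_TARGETS = {
  0 => Target.new(15, "within 15 minutes", "work until resolved",
                  "same business day", "p0@example.com"),
  1 => Target.new(15, "within 15 minutes", "work until resolved",
                  "same business day", "p1@example.com"),
  2 => Target.new(15, "within 2 hours", "next business day",
                  "next business day", "p2@example.com"),
  3 => Target.new(15, "by the next business day", "next business day",
                  "weekly deployment", "p3@example.com")
}

# Builds the one-line summary the on-call person could send to the list.
def escalation_summary(priority)
  t = ESCALATION_TARGETS.fetch(priority) do
    raise ArgumentError, "unknown priority #{priority}"
  end
  "P#{priority}: call back within #{t.call_back_minutes} min, " \
  "start work #{t.start_work}, after hours: #{t.after_hours}, " \
  "deploy: #{t.deployment}, notify #{t.mailing_list}"
end

puts escalation_summary(0)
```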

Ready to see it all put together? See below:

Escalation Path

22
Jul/09
1

The Release Management Process

TechOps Guy: Jason

Reliable, Repeatable, Results Over Time. One of my mentors over the years, David Gedye, pounded in my head early in my career that my goal as an Ops/IT expert was to achieve “Reliable, Repeatable, Results Over Time.” In everything I did, he would hammer this phrase home.

Every company that has a website or web application should have a disciplined Release Management Process. There are so many benefits from getting this right, and I’m sure everyone is familiar with not getting this one right. A poor process usually entails Development throwing code over the fence to Test, which subsequently throws it over to the Technical Operations team. The end result is usually configuration or integration issues, or last-minute bug fixes that do not get thoroughly tested for regressions. One of my strengths is being able to walk into an organization and either establish or help improve the current Release Management Process. To do this successfully I first examine the following things:

  • What environments are involved in the development, testing and production of your application?
  • How are these environments configured and who owns them?
  • How are the virtual teams communicating when a release is ready to move from one environment to the next?
  • Is there a tracking system for these releases?
  • What technologies are used to store and secure the source code?
  • What technologies are used to deploy the code from one environment to another?

Environments

If I could start from scratch and had the headcount and funds to do so, I would deploy the following 5 environments:

Development, Test, Sandbox, Staging & Production

While most organizations likely already take advantage of a Development and Production environment; the other 3 environments can provide great value. Let me explain.


Development

Description: Primary environment to perform feature development and unit
testing owned by the Development team

Support Policy: OPS supports the hardware and OS level support,
Development supports the Application

Deployments: Performed by Developers at their discretion

Version: Running R1.2 (or 2 versions ahead of Production)

Test

Description: Primary environment to perform initial integration testing,
basic performance/load testing owned by the Quality Assurance (Test) team

Support Policy: OPS supports the hardware and OS level support, Quality
Assurance (Test) supports the Application

Deployments: Performed by Software Test Engineers at their discretion

Version: Likely running R1.0, R1.1 and R1.2 - since this environment is
not a copy of Production there are likely multiple instances with
multiple versions

Sandbox

Description: Environment to test complete builds of the application(s);
full integration testing occurs here with recent backups of sanitized
Production data; usually running one version ahead of Production.

Support Policy: OPS supports hardware, OS and Application level support

Deployments: Performed by OPS once signed off on by Development and
Test; Release Form must be completed prior to deployment

Version: Running R1.1 (or 1 version ahead of Production)

Staging

Description: Environment used primarily for hotfixes, data-only changes
and small code changes. It is needed so as not to disrupt build testing
in Sandbox. This is definitely a luxury item, as it can be costly to
manage the additional burden of equipment/OS/application. Code base
should match Production.

Support Policy: OPS supports hardware, OS and Application level support

Deployments: Performed by OPS once signed off on by Development and
Test; Release Form must be completed prior to deployment

Version: Running R1.0 (Runs identical code to Production)

Production

Description: This is the LIVE environment where the customers use the
application.

Support Policy: OPS supports hardware, OS and Application level support

Deployments: Performed by OPS once signed off on by Development and
Test; Release Form must be completed prior to deployment

Version: Running R1.0
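
To make the version offsets above concrete, here is a minimal Python sketch that writes the five environments down as data. The class, the field names and the `expected_version` helper are my own illustration rather than part of any particular tool, and the sketch assumes versions follow the R<major>.<minor> pattern used in the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    name: str
    deployed_by: str               # who performs deployments
    app_support: str               # who supports the application
    versions_ahead: Optional[int]  # releases ahead of Production; None = mixed versions


# Values taken from the environment descriptions above.
ENVIRONMENTS = [
    Environment("Development", "Development", "Development", 2),
    Environment("Test", "Quality Assurance (Test)", "Quality Assurance (Test)", None),
    Environment("Sandbox", "OPS", "OPS", 1),
    Environment("Staging", "OPS", "OPS", 0),
    Environment("Production", "OPS", "OPS", 0),
]
BY_NAME = {env.name: env for env in ENVIRONMENTS}


def expected_version(env: Environment, production_minor: int, major: int = 1) -> Optional[str]:
    """Return the version an environment should be running, or None if it
    hosts several versions at once (as Test does).

    Assumption: a release is identified as R<major>.<minor> and each new
    release bumps the minor number by one.
    """
    if env.versions_ahead is None:
        return None
    return f"R{major}.{production_minor + env.versions_ahead}"


if __name__ == "__main__":
    # With Production on R1.0, print what every other environment should run.
    for env in ENVIRONMENTS:
        print(env.name, expected_version(env, production_minor=0))
```

A check like this is handy as a sanity test before a deployment: if Staging does not report the same version as Production, something was deployed out of order.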


Release Management

Types of Fixes & Proper Lead Time

One of the most difficult problems a team needs to resolve is setting appropriate expectations with internal customers, external customers and application users regarding the amount of time it takes to properly release items from Development to Production. Often these expectations are not documented, and this can only lead to disappointing response times on critical customer issues, too little time for quality testing and little or no practice in deploying these fixes/feature sets. So the main question is: how do we classify releases, and how much lead time does each classification require to ensure a high level of quality while still being very responsive to customer issues? Below I’ve outlined a strategy that helps tackle some of these difficult issues.


Data fix

Description: This often occurs within ASP applications where a sample of
data needs to be cleaned up, logically deleted or otherwise altered

Environments: Data fixes, depending on scale and risk, are often run and
verified in the Staging environment before moving on to Production

Version: 1.01, .01 indicates a revision to the application

Lead Time: Same day turn-around; if it is in by 3pm it can be deployed
later that evening during a scheduled deployment

Hotfix

Description: Generally a break/fix situation with an application

Environments: Hotfixes are generally run and verified in the Sandbox
environment before moving to Staging and ultimately onto Production

Version: 1.01, .01 indicates a revision to the application

Lead Time: Potentially same-day turn-around, but I would prefer a full
day’s notice. A full day’s notice would allow for a full day of testing
before the deployment date and allow an on and offshore test team to
verify a fix; again, if it is in by 3pm it can be deployed later that
evening during a scheduled deployment.

Incremental Release

Description: Incremental Releases include the introduction/modification
of new features, slight updates/tweaks to the UI, collection of bug
fixes that have been triaged into the release

Environments: Incremental Releases are generally run and verified in the
Sandbox environment before moving to Staging and ultimately onto
Production

Version: 1.2, .2 indicates a revision to the application

Lead Time: 2 full days’ notice. This will allow Operations to retrieve
an identical copy of data from Production to do a full and accurate
deployment in Sandbox. The restoration of Production data takes time to
complete so the additional day’s notice is needed. Again if it is in by
3pm it can be deployed 2 days later during a scheduled deployment.

Milestone Release

Description: These are the biggies! Large feature set deployments,
architecture changes, data model changes and additional bug fixes that
have been triaged into the release

Environments: Milestone Releases are generally run and verified in the
Sandbox environment before moving to Staging and ultimately onto
Production

Version: 2.0, 2.0 indicates a revision to the application

Lead Time: 5 full days’ notice. This will allow Operations to retrieve
an identical copy of data from Production to do a full and accurate
deployment in Sandbox. The restoration of Production data takes time to
complete so the additional days’ notice is needed. Again if it is in by
3pm it can be deployed 5 days later during a scheduled deployment.

* Lead time descriptions can always be a little fuzzy. Let’s just say that the lead times stated above encourage a rapid response to customer issues while still leaving enough time for Quality Assurance Testing…not just Black Box testing :)
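
The lead times above boil down to a little date arithmetic, sketched here in Python. The function and constant names are hypothetical. Two things are my assumptions, since the rules above do not spell them out: a request that arrives after 3pm is treated as if it arrived the next day, and lead time is counted in calendar days rather than business days. A hotfix uses the minimum (same-day) lead time here, even though a full day’s notice is preferred.

```python
from datetime import date, datetime, time, timedelta

CUTOFF = time(15, 0)  # "in by 3pm"

# Minimum lead time in days for each release classification described above.
LEAD_DAYS = {
    "data fix": 0,
    "hotfix": 0,  # same day is possible; a full day's notice is preferred
    "incremental release": 2,
    "milestone release": 5,
}


def earliest_deployment(release_type: str, submitted: datetime) -> date:
    """Return the earliest scheduled-deployment date for a request.

    Assumptions (not stated in the rules above): requests after the 3pm
    cutoff count from the following day, and days are calendar days.
    """
    lead = LEAD_DAYS[release_type.lower()]
    start = submitted.date()
    if submitted.time() > CUTOFF:
        start += timedelta(days=1)  # missed the cutoff: count from tomorrow
    return start + timedelta(days=lead)


if __name__ == "__main__":
    monday_afternoon = datetime(2010, 6, 7, 14, 30)
    for kind in LEAD_DAYS:
        print(kind, earliest_deployment(kind, monday_afternoon))
```

If your team only deploys on weekdays, the calendar-day assumption is the first thing to change.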

Hardware Selection

Hardware selection is driven completely by the budget you have to spend. If your application’s web farm is 10 web servers, you likely will not need more than 2 web servers to fully simulate the Production environment in Sandbox and Staging. Also, if you are not performance/load testing in Sandbox and/or Staging, you can get away with cheaper desktops or inexpensive rack-mounted servers rather than the beefier hardware you are likely running in Production.

Comments About Configuration

I’m not a big fan of consolidating multiple services on 1 server in Sandbox, unless it is similarly configured in Production. Whether it’s football, baseball or Operations, you need to practice like you’re gonna play, and that is the motivation for this.

Tracking a Release

It seems obvious that a release should be documented and recorded to review over time. The obvious solution is to create a database where you can store information on releases. The following represents what I would consider the ideal information you should capture in your database.


Variable | Selection Options | Description
Environment | Select Drop-down | Sandbox, Staging, Production
Release Type | Select Drop-down | Hot Fix, Service Release, Full Release, Configuration
Application | Select Drop-down | Jobster Service Corporate Website JIVE – Delight Highdeal Jobster Search Static Content Coffee Robot UJobs
Version | |
Release Instructions | Text area for sets of instructions |
Requested By | Select Drop-down |
Development Lead | Select Drop-down |
QA Lead | Select Drop-down |
Notify | Pre-selected Groups | Technical Operations, Quality Assurance/Test, Development, Program Management
Comments | Text area for comments about the release | Typically errors encountered upon deployment, etc.
Deployment Start Time | Entered by the Release Tracker |
Deployment End Time | Entered by the Release Tracker |
Submit to QA Time | Entered by the Release Tracker |
Delete Time | Entered by the Release Tracker |
Failed Time | Entered by the Release Tracker |
Passed QA Time | Entered by the Release Tracker |

Each time a Release Tracker Deployment Request Form is completed, the following individuals are notified: the Requester, the Development Lead for the application, the QA Lead for the application, anyone else selected under Notify Groups and the entire Technical Operations team.
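
As a rough illustration of the form and the notification rule, here is a minimal Python sketch. The `DeploymentRequest` class and the `recipients` function are hypothetical names of my own; the fields mirror the user-entered rows of the table above, and the timestamps the Release Tracker fills in are left out for brevity.

```python
from dataclasses import dataclass, field

TECH_OPS = "Technical Operations"  # always notified, per the rule above


@dataclass
class DeploymentRequest:
    environment: str           # Sandbox, Staging or Production
    release_type: str          # Hot Fix, Service Release, Full Release, Configuration
    application: str
    version: str               # label/version in the source code repository
    release_instructions: str
    requested_by: str
    development_lead: str
    qa_lead: str
    notify_groups: list[str] = field(default_factory=list)
    comments: str = ""


def recipients(req: DeploymentRequest) -> list[str]:
    """Everyone to notify for a request, in a stable order, without duplicates."""
    ordered = [
        req.requested_by,
        req.development_lead,
        req.qa_lead,
        *req.notify_groups,
        TECH_OPS,
    ]
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return list(dict.fromkeys(ordered))


if __name__ == "__main__":
    req = DeploymentRequest(
        environment="Sandbox",
        release_type="Hot Fix",
        application="Corporate Website",
        version="1.01",
        release_instructions="Deploy build and restart the web tier.",
        requested_by="alice",
        development_lead="bob",
        qa_lead="carol",
        notify_groups=["Program Management"],
    )
    print(recipients(req))
```

Removing duplicates matters in practice because the requester is often also the Development Lead, and Technical Operations is usually in the Notify list already.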

Source Code Repositories

Visual Source Safe, Subversion, CVS, Source Depot, etc.: whichever you use, your source code needs to be stored in a system that allows for labels, branching and versioning. The key to any Release Tracking process is to make sure you are able to roll back to a previous version of the source code should the need arise. All of the above source code repositories support a labeling or versioning system within the software. To make the Release Tracker process work, you must have a Label or Version correctly identified in your source code repository. It is this version that will be tracked from Deployment Request to Deploying Notification to Deployed and Ready for QA, to Passed/Failed QA; should the process fail anywhere along the way, there should be a clear version that the Technical Operations team can roll back to.
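
Here is a minimal Python sketch of tracking a label through those stages and picking a rollback target. Everything in it (`Status`, `ReleaseHistory`, `rollback_target`) is a hypothetical illustration, not the API of any source control system, and it assumes the "clear version to roll back to" is the most recent label that passed QA. The actual rollback would still be done with your repository’s own label or tag commands.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    REQUESTED = "Deployment Request"
    DEPLOYING = "Deploying Notification"
    READY_FOR_QA = "Deployed and Ready for QA"
    PASSED_QA = "Passed QA"
    FAILED_QA = "Failed QA"


# Allowed forward moves between the stages named above.
ALLOWED = {
    Status.REQUESTED: {Status.DEPLOYING},
    Status.DEPLOYING: {Status.READY_FOR_QA},
    Status.READY_FOR_QA: {Status.PASSED_QA, Status.FAILED_QA},
    Status.PASSED_QA: set(),
    Status.FAILED_QA: set(),
}


class ReleaseHistory:
    """Tracks repository labels through the release stages, oldest first."""

    def __init__(self) -> None:
        self._releases: list[list] = []  # each entry is [label, Status]

    def request(self, label: str) -> None:
        self._releases.append([label, Status.REQUESTED])

    def advance(self, label: str, new_status: Status) -> None:
        # Find the most recent request for this label and move it forward.
        for entry in reversed(self._releases):
            if entry[0] == label:
                if new_status not in ALLOWED[entry[1]]:
                    raise ValueError(f"{label}: cannot go from {entry[1].value} to {new_status.value}")
                entry[1] = new_status
                return
        raise KeyError(label)

    def rollback_target(self) -> Optional[str]:
        """Most recent label that passed QA (assumption: that is the safe version)."""
        for label, status in reversed(self._releases):
            if status is Status.PASSED_QA:
                return label
        return None


if __name__ == "__main__":
    history = ReleaseHistory()
    history.request("R1.0")
    for step in (Status.DEPLOYING, Status.READY_FOR_QA, Status.PASSED_QA):
        history.advance("R1.0", step)
    history.request("R1.1")
    history.advance("R1.1", Status.DEPLOYING)
    print(history.rollback_target())
```

Because the rollback target is derived from the history rather than stored separately, a release that stalls at any stage never changes which label Operations would fall back to.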