Jul/09
Defining a Solid Escalation Plan
TechOps Guy: Jason
Defining and adhering to an escalation plan can provide clarity when a high-priority issue arises at a Web 2.0 or other company.
To create a solid escalation plan you must thoroughly understand the level of service you plan to deliver to your customers. Note that you will be staffing against this goal.
Once your SLA exists and you have staff in place to support that commitment, the next step is to define the process by which your technology teams respond to issues in order to adhere to the SLA.
Priority Definition & Distinctions
I believe there are essentially four priorities, or categories, for classifying issues: Priority 0, 1, 2 and 3. Note: This will vary depending on your company’s service or application(s).
Priority 0
This is very bad. Something went wrong with Ping/Pipe/Power; in short, infrastructure is the culprit. This definition is reserved for issues that prevent your Web application from functioning at all; usually the problem is at the Network/Firewall/Router or Power level. These should be very rare occurrences and are likely due to an ISP failure or inadequate preparation for a power failure. If your monitoring and alerting are set up correctly you should be notified of this issue before your customers discover it, but you will likely have downtime.
Priority 1
This is also very bad. This classification is usually reserved for cases where Core Functionality of the Web application is not working at all for any customer, or where everything is not working for a highly visible customer. The smoking gun in these situations is likely a configuration error introduced by a deployment, a serious defect not caught before deployment, or a sudden rush in traffic or use of a feature that causes a massive performance problem, rendering the application almost useless. These occurrences can be due to any number of issues, but if you have intelligent application-level monitoring and your alerting is set up correctly you should be notified of the issue before your customers discover it. You may not have downtime associated with this issue, but you will likely have to deploy a fix to address the problem.
Priority 2
This can be more common than most Technical Operations teams at Web 2.0 companies would like to admit. Web applications and their specific feature sets can have issues: not working as intended, failing under load, generating server errors, etc.
Priority 3
These are very common for most Web 2.0 companies. These issues are of the non-critical bug variety: broken links, reporting failures, and the like.
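To make the distinctions concrete, here is a minimal sketch of how the four priorities could be expressed in code. The `Priority` enum and the `classify` function are hypothetical names, and the three yes/no triage questions are an assumed simplification of the definitions above, not a prescribed procedure.

```python
from enum import IntEnum


class Priority(IntEnum):
    """The four priorities described above; a lower number is more severe."""
    P0 = 0  # infrastructure failure: ping/pipe/power
    P1 = 1  # core functionality down, or everything down for a highly visible customer
    P2 = 2  # a specific feature misbehaving: not working as intended, failing under load
    P3 = 3  # non-critical bug: broken links, reporting failures


def classify(infrastructure_down: bool,
             core_functionality_down: bool,
             feature_degraded: bool) -> Priority:
    """Return the most severe priority whose condition holds.

    The three flags are hypothetical triage answers (an assumption for this
    sketch), not fields from any real ticketing system.
    """
    if infrastructure_down:
        return Priority.P0
    if core_functionality_down:
        return Priority.P1
    if feature_degraded:
        return Priority.P2
    return Priority.P3
```

Checking the most severe condition first means an outage that also breaks individual features is still filed as a P0 rather than a P2.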
Service Level Targets
Now that you have a good understanding of each priority definition it is important that we set some service targets that will help us achieve the SLA we’ve agreed to with our customers. Generally there are 4 areas that I feel are important to define for each priority categorization: Call Back Target, Start Work Target, After Hours, Deployment Target.
Call Back Target
This is the amount of time the person on call has to contact the person who escalated the issue; my guideline has always been 15 minutes.
Start Work Target
This is the amount of time the person on call has to get to a place to start work on the issue; my guideline varies depending on the priority of the issue, anywhere from 15 minutes to the next business day.
After Hours
This is a guideline for deciding how to react after core business hours. For example, are issues triaged until resolved regardless of time, or can they be tackled the next business day?
Deployment Target
This is a guideline for determining when we would deploy a fix once one was created; same business day, next business day, or the weekly deployment.
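The four targets lend themselves to a small lookup table. The sketch below is only an illustration: the 15 minute call back target and the 15 minute to next business day range for start work come from the guidelines above, but how those values are spread across the priorities (including the 30 minute figure for P1, the after hours flags, and the deployment choices) is assumed for the example and should be replaced with whatever supports your own SLA. The names `ServiceTargets`, `TARGETS` and `call_back_deadline` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ServiceTargets:
    call_back: timedelta             # time allowed to contact the person who escalated
    start_work: Optional[timedelta]  # None stands for "next business day"
    after_hours: bool                # True: triage until resolved regardless of time
    deployment: str                  # when a finished fix is deployed


# Placeholder values (assumptions for illustration, not recommendations).
TARGETS = {
    "P0": ServiceTargets(timedelta(minutes=15), timedelta(minutes=15), True, "same business day"),
    "P1": ServiceTargets(timedelta(minutes=15), timedelta(minutes=30), True, "same business day"),
    "P2": ServiceTargets(timedelta(minutes=15), None, False, "next business day"),
    "P3": ServiceTargets(timedelta(minutes=15), None, False, "weekly deployment"),
}


def call_back_deadline(priority: str, escalated_at: datetime) -> datetime:
    """Latest time by which the person on call should have called back."""
    return escalated_at + TARGETS[priority].call_back
```

For example, `call_back_deadline("P0", datetime(2009, 7, 1, 9, 0))` returns 09:15 on the same day.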
Notifications
So, after you’ve classified your issues and committed to your service level targets, you need to communicate the severity of the issues affecting your web application to your internal team. The easiest way to do this is to have an email list set up for each of P0, P1, P2 and P3 issues. When you first encounter one of these issues, the person on call responds within their start work target by sending an email to the appropriate list.
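As a rough sketch of the per-priority email lists, the snippet below builds (but does not send) a notification addressed to the list that matches the priority. The list addresses, the `PRIORITY_LISTS` mapping and the `build_notification` helper are all hypothetical; only Python's standard `email.message.EmailMessage` class is real.

```python
from email.message import EmailMessage

# Hypothetical list addresses, one per priority (assumed for illustration).
PRIORITY_LISTS = {
    "P0": "p0-alerts@example.com",
    "P1": "p1-alerts@example.com",
    "P2": "p2-alerts@example.com",
    "P3": "p3-alerts@example.com",
}


def build_notification(priority: str, summary: str, sender: str) -> EmailMessage:
    """Build the initial notification for the list matching the priority."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = PRIORITY_LISTS[priority]  # raises KeyError for an unknown priority
    msg["Subject"] = f"[{priority}] {summary}"
    msg.set_content(f"Priority: {priority}\nSummary: {summary}\n")
    return msg
```

The resulting message could then be handed to `smtplib.SMTP.send_message` against your own mail server. Putting the priority in the subject line lets recipients filter or page on it.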
Ready to see it all put together? See below:

