TechOpsGuys.com Diggin' technology every day

30 Jul 2009

Defining a Solid Escalation Plan

TechOps Guy: Jason

Defining an escalation plan, and adhering to it, provides clarity when a high priority issue hits a Web 2.0 (or any other) company.
To create a solid escalation plan you need a thorough understanding of the level of service you plan to deliver to your customers; note that you will be staffing against this goal :) Once your SLA exists and you have staff in place to support that commitment, the remaining work is defining the process by which your Technology teams respond to issues in order to meet the SLA.

Priority Definition & Distinctions

I believe there are essentially 4 priorities for classifying issues: Priority 0, 1, 2 and 3. Note: these will vary depending on your company's service or application(s).

Priority 0

This is very bad. Something with Ping/Pipe/Power went bad; in short, Infrastructure is the culprit. This definition is reserved for issues that prevent your Web application from functioning at all; usually this is an issue at the Network/Firewall/Router or Power level. These should be very rare occurrences and are likely due to an ISP failure or inadequate preparation for a power failure. If your monitoring and alerting are set up correctly you should be notified of the issue before your customers discover it, but you will likely have downtime.

Priority 1

This is also very bad. This classification is usually reserved for cases where Core Functionality of the Web application is not working at all for any customer, or where everything is broken for a highly visible customer. The smoking gun in these situations is likely a configuration error introduced by a deployment, a serious defect that was not caught before deployment, or a sudden rush in traffic/use of a feature that caused a massive performance problem, rendering the application almost useless. These occurrences can be due to any number of issues, but if you have intelligent application level monitoring and your alerting is set up correctly you should be notified of the issue before your customers discover it. You may not have downtime associated with this issue, but you will likely have to deploy a fix to address the problem.

Priority 2

This can be more common than most Technical Operations teams at Web 2.0 companies would like to admit. Web applications and their specific feature sets can have issues: not working as intended, failing under load, generating server errors, etc.

Priority 3

These are very common for most Web 2.0 companies. These issues are of the non-critical variety: minor bugs, broken links, reporting failures.
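
To make the four definitions above concrete, here is a minimal Python sketch of a triage helper. The Issue fields are hypothetical yes/no questions I made up for illustration; your own criteria will differ depending on your service.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # infrastructure outage: ping/pipe/power
    P1 = 1  # core functionality down for all customers or a highly visible one
    P2 = 2  # a feature is broken, failing under load, or throwing server errors
    P3 = 3  # non-critical bugs, broken links, reporting failures


@dataclass
class Issue:
    # Hypothetical triage questions, not part of the original definitions.
    infrastructure_down: bool = False
    core_functionality_down: bool = False
    highly_visible_customer_down: bool = False
    feature_degraded: bool = False


def classify(issue: Issue) -> Priority:
    """Return the most severe priority whose condition matches."""
    if issue.infrastructure_down:
        return Priority.P0
    if issue.core_functionality_down or issue.highly_visible_customer_down:
        return Priority.P1
    if issue.feature_degraded:
        return Priority.P2
    return Priority.P3
```

The checks run from most to least severe, so an issue that matches several conditions always gets the highest applicable priority.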

Service Level Targets

Now that you have a good understanding of each priority definition, it is important to set some service targets that will help achieve the SLA you've agreed to with your customers. Generally there are 4 areas that I feel are important to define for each priority: Call Back Target, Start Work Target, After Hours, and Deployment Target.

Call Back Target

This is the amount of time the person on call has to contact the person who escalated the issue; my guideline has always been 15 minutes.

Start Work Target

This is the amount of time the person on call has to get to a place to start work on the issue; my guideline varies depending on the priority of the issue--anywhere from 15 minutes to the next business day.

After Hours

This is a guideline for deciding how to react after core business hours. For example, are issues triaged until resolved regardless of time, or can they be tackled the next business day?

Deployment Target

This is a guideline for determining when we would deploy a fix once one was created; same business day, next business day, or the weekly deployment.
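
One way to keep these four targets in front of the team is a small lookup table per priority. The Python sketch below is only an illustration: the 15 minute call back guideline comes from above, but every other value in TARGETS is a placeholder assumption that you would replace with whatever your own SLA requires.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ServiceTargets:
    call_back: timedelta             # time to contact the person who escalated
    start_work: Optional[timedelta]  # None means "next business day"
    work_after_hours: bool           # keep working outside core business hours?
    deployment: str                  # when a finished fix ships


FIFTEEN_MIN = timedelta(minutes=15)

# Illustrative values only; tune them to your own SLA.
TARGETS = {
    "P0": ServiceTargets(FIFTEEN_MIN, FIFTEEN_MIN, True, "same business day"),
    "P1": ServiceTargets(FIFTEEN_MIN, FIFTEEN_MIN, True, "same business day"),
    "P2": ServiceTargets(FIFTEEN_MIN, timedelta(hours=2), False, "next business day"),
    "P3": ServiceTargets(FIFTEEN_MIN, None, False, "weekly deployment"),
}


def start_now(priority: str, in_business_hours: bool) -> bool:
    """Decide whether the on-call person should start work immediately."""
    targets = TARGETS[priority]
    if targets.start_work is None:
        return False  # deferred to the next business day
    return in_business_hours or targets.work_after_hours
```

With these placeholder values, start_now("P0", False) says to start work in the middle of the night, while start_now("P2", False) says the issue can wait until business hours.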

Notifications

So, after you've classified your issues and committed to your service level targets, you need to communicate the severity of the issues affecting your web application to your internal team. The easiest way to do this is to have an email list set up for each of P0, P1, P2 and P3 issues. When you first encounter an issue, the person on call responds within their start work target by sending an email to the matching list.
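
As a rough sketch of the per-priority email lists, the Python below builds the notification message with the standard library. The list addresses, the example.com domain and the build_notification name are all assumptions made for illustration.

```python
from email.message import EmailMessage

# Hypothetical list addresses; example.com is a placeholder domain.
PRIORITY_LISTS = {
    "P0": "p0-alerts@example.com",
    "P1": "p1-alerts@example.com",
    "P2": "p2-alerts@example.com",
    "P3": "p3-alerts@example.com",
}


def build_notification(priority: str, summary: str, on_call: str) -> EmailMessage:
    """Build the first notification email for a newly classified issue."""
    msg = EmailMessage()
    msg["From"] = on_call
    msg["To"] = PRIORITY_LISTS[priority]
    msg["Subject"] = f"[{priority}] {summary}"
    msg.set_content(
        f"Priority: {priority}\n"
        f"Summary: {summary}\n"
        f"On call: {on_call}\n"
    )
    return msg
```

Actually sending the message is left out on purpose; with a reachable mail relay it could be handed to smtplib.SMTP(...).send_message(msg).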

Ready to see it all put together? See below:

Escalation Path

30 Jul 2009

The Difficulty of Intelligent Application Monitoring

TechOps Guy: Jason

It has become increasingly important for companies with online applications to have detailed monitoring. Gone are the days when we could monitor drive space, services and ICMP responses and call that verifying availability. One of the biggest assets a company with an online application can have is the ability to understand how the application is behaving at any given time under any conditions. Traditionally monitoring breaks down into 4 areas:

1.) Network

2.) Server

3.) Application

4.) Integration

Most IT professionals are very familiar with the first two categories, as these monitoring capabilities are available in almost every off-the-shelf software solution. The old school of thought would be to set up ICMP monitors for the IPs of the application and make sure that you received a prompt reply. The second part of the old school strategy would be to monitor critical items on the server such as disk space, % processor time, status of services running, etc. While both of these strategies are still very much used today, what separates a more robust monitoring solution is the addition of application and integration level intelligence.

Building in Application/Integration Level Monitoring:

First, let's consider the following very basic web server configuration. You have an application named WebApplication1 running on a web server named WebServer1. WebApplication1 is a simple user community that allows users to register (Register.jsp), log in (Login.jsp), review postings by other users (ViewPage.jsp) and create/update/delete postings of their own (PageFunctions.jsp). The registration, login and password (Password.jsp) pages are protected by SSL encryption (port 443); the rest of the user community uses HTTP (port 80). In addition, WebServer1 runs another application called AppMonitor1 on port 7001 to report the health of WebApplication1 via Monitor.jsp.
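
The layout just described can be written down as a small Python map of pages to ports, which is a handy starting point for any check that needs to know what to hit. This is only a sketch: it assumes the pages live at the root path and that Monitor.jsp is served over plain HTTP, neither of which is stated above.

```python
# Pages and ports taken from the example configuration above.
SECURE_PAGES = ("Register.jsp", "Login.jsp", "Password.jsp")  # SSL, port 443
PLAIN_PAGES = ("ViewPage.jsp", "PageFunctions.jsp")           # HTTP, port 80
MONITOR_PAGE = "Monitor.jsp"                                  # AppMonitor1
MONITOR_PORT = 7001


def urls_for(host: str) -> list:
    """List every URL worth checking on one web server.

    Assumes pages are served from the root path and that the monitor
    page uses plain HTTP; adjust both to match your deployment.
    """
    urls = [f"https://{host}:443/{page}" for page in SECURE_PAGES]
    urls += [f"http://{host}:80/{page}" for page in PLAIN_PAGES]
    urls.append(f"http://{host}:{MONITOR_PORT}/{MONITOR_PAGE}")
    return urls
```

Calling urls_for("WebServer1") gives the three SSL pages, the two HTTP pages and the monitor page for that server.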

Sample Data Flow is given below:

user_exp

WebServer1's Web Configuration:

WebApplication1 is running on WebServer1/2, but could theoretically be scaled out to any number of servers. In this example, WebApplication1 is running version r1.0 on both ports 80 and 443 while AppMonitor1 is running version r1.1 on port 7001.

Server Configuration


For example let's take the following application under consideration:

Internet --> Router --> Firewall (Not pictured) --> NAT'd addresses --> Load Balancer --> Web Server(s) --> Application Server(s) --> Database Server(s) --> Fail-over Systems (Not all pictured)


Physical Configuration

The above architecture reflects one example of how application level monitoring could be implemented. In this example we have a shell script or compiled executable, AppMonitor1.sh, running as a service on Monitor1 that polls the WebServer1/2/XX pool at regular intervals to check the health of the application. If for any reason the AppMonitor1 script cannot get a response, it sends an email alert to the Technical Operations team. Under normal circumstances the AppMonitor1 service returns results showing the status/behavior of critical application indicators.
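
The polling loop described above could look something like the following Python sketch. It is not the actual AppMonitor1 script; the poll_pool and fetch names are made up, and the URL assumes Monitor.jsp answers over plain HTTP on port 7001 at the root path.

```python
import urllib.request
from typing import Callable, Iterable, List


def fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def poll_pool(hosts: Iterable[str],
              fetcher: Callable[[str], str] = fetch) -> List[str]:
    """Poll Monitor.jsp on every host and return one line per failure."""
    failures = []
    for host in hosts:
        url = f"http://{host}:7001/Monitor.jsp"  # assumed URL layout
        try:
            fetcher(url)
        except OSError as exc:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            failures.append(f"{host}: {exc}")
    return failures
```

A scheduler such as cron (or a simple loop with time.sleep) would call poll_pool(["WebServer1", "WebServer2"]) at each interval and email the Technical Operations team whenever the returned list is not empty. The fetcher argument exists so the polling logic can be exercised without touching the network.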

Ideally you need to understand specifically how your application is behaving, and not just see a web page with a returned dataset that says "OK/GREEN".
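
To illustrate the difference, here is a minimal Python sketch that checks individual indicators against limits instead of trusting a single status word. Everything in it is an assumption: the key=value response format, the indicator names and the threshold values are invented for the example, since the real output of Monitor.jsp is not shown here.

```python
# Hypothetical limits; names and values are placeholders.
THRESHOLDS = {
    "login_ms": 500.0,
    "db_query_ms": 200.0,
    "error_rate_pct": 1.0,
}


def parse_indicators(body: str) -> dict:
    """Parse 'name=value' lines (an assumed format) into a dict of floats."""
    indicators = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line
        try:
            indicators[key.strip()] = float(value)
        except ValueError:
            continue  # ignore values that are not numeric
    return indicators


def breaches(indicators: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a message for every indicator that is missing or over its limit."""
    problems = []
    for name, limit in thresholds.items():
        value = indicators.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value > limit:
            problems.append(f"{name}: {value} > {limit}")
    return problems
```

A missing indicator is treated as a problem on purpose: a monitor page that silently stops reporting a value is itself a sign that something in the application has changed.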
