31
Jul/09
0

It is System Administrator Appreciation Day

TechOps Guy:

The last Friday in July is System Administrator Appreciation Day, so don’t forget to shower your favorite System Administrator with praise and caffeine. Otherwise they might be sleepy when the Gremlins attack your servers.

http://www.sysadminday.com/index2009.html

30
Jul/09
0

Defining a Solid Escalation Plan

TechOps Guy: Jason

Defining and adhering to an escalation plan provides clarity when a high priority issue arises within a Web 2.0 or other company.
To create a solid escalation plan you must thoroughly understand the level of service you plan to deliver to your customers; note you will be staffing against this goal :) Once your SLA exists and you have staff in place to support that commitment, the remaining work is defining the process by which your Technology teams respond to issues in their efforts to adhere to the SLA.

Priority Definition & Distinctions

I believe there are essentially 4 priorities and/or categories in classifying issues: Priority 0, 1, 2 & 3. Note: This will vary depending on your company’s service or application(s).

Priority 0

This is very bad. Something with Ping/Pipe/Power generally went bad; in short, Infrastructure is the culprit. This definition is reserved for issues that prevent your Web application from functioning at all; usually this is an issue at the Network/Firewall/Router or Power level. These should be very rare occurrences and are likely due to an ISP failure or inadequate preparation for a power failure. If your monitoring and alerting are set up correctly you should be notified of this issue before your customers discover it, but you will likely have downtime.

Priority 1

This is also very bad. This classification is usually reserved for cases where Core Functionality of the Web application is not working at all for any customer, or where everything is broken for a highly visible customer. The smoking gun in these situations is likely a configuration error during a deployment, a serious defect not caught before deployment, or a sudden rush in traffic/use of a feature that caused a massive performance problem rendering the application almost useless. These occurrences can be due to any number of issues, but if you have intelligent application level monitoring and your alerting is set up correctly you should be notified of the issue before your customers discover it. You may not have downtime associated with this issue, but you will likely have to deploy a fix to address the problem.

Priority 2

This can be more common than most Technical Operations teams of Web 2.0 companies would like to admit. Web applications and their specific feature sets can have issues; not working as intended, failing under load, generating server errors, etc.

Priority 3

These are very common for most Web 2.0 companies. These issues are of the non-critical variety: minor bugs, broken links, reporting failures.

Service Level Targets

Now that you have a good understanding of each priority definition it is important that we set some service targets that will help us achieve the SLA we’ve agreed to with our customers. Generally there are 4 areas that I feel are important to define for each priority categorization: Call Back Target, Start Work Target, After Hours, Deployment Target.

Call Back Target

This is the amount of time the person on call has to contact the person who escalated the issue; my guideline has always been 15 minutes.

Start Work Target

This is the amount of time the person on call has to get to a place to start work on the issue; my guideline varies depending on the priority of the issue–anywhere from 15 minutes to the next business day.

After Hours

This is a guideline for deciding how to react after core business hours. For example, are issues triaged until resolved regardless of time, or can they be tackled the next business day?

Deployment Target

This is a guideline for determining when we would deploy a fix once one was created; same business day, next business day, or the weekly deployment.
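The four targets above can be written down as a small lookup table so that on-call staff and tooling read from the same place. The following is only a sketch in Python: the 15 minute call back guideline comes from the text, while every other per-priority value, along with the names `ServiceTargets`, `TARGETS` and `describe`, is a hypothetical placeholder to be replaced with whatever your SLA requires.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ServiceTargets:
    call_back: timedelta             # time allowed to contact the person who escalated
    start_work: Optional[timedelta]  # None stands for "next business day"
    after_hours: bool                # True: keep working until resolved, regardless of time
    deployment: str                  # when a finished fix gets deployed


# Hypothetical values for illustration only; only the 15 minute call back
# guideline is taken from the article.
TARGETS = {
    0: ServiceTargets(timedelta(minutes=15), timedelta(minutes=15), True, "same business day"),
    1: ServiceTargets(timedelta(minutes=15), timedelta(minutes=30), True, "same business day"),
    2: ServiceTargets(timedelta(minutes=15), timedelta(hours=4), False, "next business day"),
    3: ServiceTargets(timedelta(minutes=15), None, False, "weekly deployment"),
}


def describe(priority: int) -> str:
    """Return a one-line summary of the targets for a priority level."""
    t = TARGETS[priority]
    call_back = f"{int(t.call_back.total_seconds() // 60)} min"
    if t.start_work is None:
        start = "next business day"
    else:
        start = f"within {int(t.start_work.total_seconds() // 60)} min"
    after = "work until resolved" if t.after_hours else "resume next business day"
    return (f"P{priority}: call back within {call_back}; start work {start}; "
            f"after hours: {after}; deploy fix: {t.deployment}")


if __name__ == "__main__":
    for level in sorted(TARGETS):
        print(describe(level))
```

Keeping the targets in one table like this makes it harder for the escalation document and the paging configuration to drift apart.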

Notifications

So, after you’ve classified your issues and committed to your service level targets, you need to communicate the severity of issues with your web application to your internal team. The easiest way to do this is to set up an email list for each of P0, P1, P2 and P3. When an issue is first encountered, the person on call responds within the start work target by sending an email to the appropriate list.
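One way to keep these notifications consistent is to build them from the priority number. The sketch below uses Python's standard `email.message.EmailMessage`; the list addresses, the sender address and the function name `build_notification` are all made up for illustration.

```python
from email.message import EmailMessage

# Hypothetical list addresses; substitute the lists your team actually uses.
PRIORITY_LISTS = {
    0: "p0-issues@example.com",
    1: "p1-issues@example.com",
    2: "p2-issues@example.com",
    3: "p3-issues@example.com",
}


def build_notification(priority: int, summary: str, sender: str) -> EmailMessage:
    """Build (but do not send) the notification for an issue of the given priority."""
    if priority not in PRIORITY_LISTS:
        raise ValueError(f"unknown priority: {priority}")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = PRIORITY_LISTS[priority]
    msg["Subject"] = f"[P{priority}] {summary}"
    msg.set_content(f"Priority {priority} issue reported by the on-call engineer.\n\n{summary}\n")
    return msg


if __name__ == "__main__":
    print(build_notification(1, "Login page returning errors", "oncall@example.com"))
```

The message could then be handed to `smtplib.SMTP(...).send_message(msg)` against whatever mail relay you run; that step is left out here because the relay details are site specific.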

Ready to see it all put together? See below:

Escalation Path

Filed under: Uncategorized
30
Jul/09
0

The Difficulty of Intelligent Application Monitoring

TechOps Guy: Jason

It’s become increasingly important for companies with online applications to have detailed monitoring. Gone are the days when we could monitor drive space, services and ICMP responses to verify availability. One of the biggest assets a company with an online application can have is the ability to understand how the application is behaving at any given time under any conditions. Traditionally this breaks down into 4 areas:

1.) Network

2.) Server

3.) Application

4.) Integration

Most IT professionals are very familiar with the first two categories, as these monitoring capabilities are available in almost every off-the-shelf software solution. The old school of thought would be to set up ICMP monitors for the IPs of the application and make sure that you received a prompt reply. The second part of the old school strategy would be to monitor critical items on the server such as disk space, % processor time, status of running services, etc. While both of these strategies are still very much in use today, what sets a more robust monitoring solution apart is the addition of application and integration level intelligence.

Building in Application/Integration Level Monitoring:

First, let’s consider the following very basic web server configuration. You have an application named WebApplication1 running on a web server named WebServer1. WebApplication1 is a simple user community that allows users to register (Register.jsp), log in (Login.jsp), review postings by other users (ViewPage.jsp) and create/update/delete postings of their own (PageFunctions.jsp). The registration, login and password (Password.jsp) pages are protected by SSL encryption (port 443); the rest of the user community uses HTTP (port 80). In addition, WebServer1 runs another application called AppMonitor1 on port 7001, which reports the health of WebApplication1 via Monitor.jsp.

Sample Data Flow is given below:

user_exp

WebServer1’s Web Configuration:

WebApplication1 is running on WebServer1/2, but could theoretically be scaled out to XX number of servers. In this example, WebApplication1 is running version r1.0 on both ports 80 and 443 while AppMonitor1 is running version r1.1 on port 7001.

Server Configuration


For example, let’s take the following application under consideration:

Internet –> Router –> Firewall (Not pictured) –> NAT’d addresses –> Load Balancer –> Web Server(s) –> Application Server(s) –> Database Server(s) –> Fail-over Systems (Not all pictured)


Physical Configuration

The above architecture reflects one example of how application level monitoring could be implemented. In this example we have a shell script (AppMonitor1.sh) or compiled exe running as a service on Monitor1 that polls the WebServer1/2/XX pool at given intervals to check the health of the application. If for any reason the AppMonitor1 script cannot get a response, it sends an email alert to the Technical Operations team. Under normal circumstances the AppMonitor1 service will be able to return results showing the status/behavior of critical application indicators as shown in the example below. This service is then run at regular intervals.

Ideally you need to understand specifically how your application is behaving, not just look at a web page with a returned dataset that says “OK/GREEN”.
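A poller in this spirit could look roughly like the Python sketch below. It is not the AppMonitor1 script itself. It assumes, purely for illustration, that Monitor.jsp returns plain text with one `name=value` indicator per line; the indicator names, the thresholds and the function names are all hypothetical.

```python
import urllib.request

# Hypothetical indicators and limits: name -> maximum acceptable value.
THRESHOLDS = {"login_ms": 2000.0, "db_query_ms": 500.0, "error_rate_pct": 1.0}


def parse_indicators(body: str) -> dict:
    """Parse 'name=value' lines (an assumed response format) into a dict of floats."""
    indicators = {}
    for line in body.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # skip blank lines and lines without '='
        try:
            indicators[key.strip()] = float(value)
        except ValueError:
            continue  # this sketch ignores non-numeric values
    return indicators


def find_problems(indicators: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a human-readable problem for each missing or out-of-range indicator."""
    problems = []
    for name, limit in thresholds.items():
        if name not in indicators:
            problems.append(f"{name}: missing from monitor response")
        elif indicators[name] > limit:
            problems.append(f"{name}: {indicators[name]} exceeds {limit}")
    return problems


def poll(host: str, port: int = 7001, path: str = "/Monitor.jsp", timeout: float = 10) -> list:
    """Fetch the monitor page from one web server and return any problems found."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # covers connection errors, HTTP errors and timeouts
        return [f"{host}: no response from monitor ({exc})"]
    return find_problems(parse_indicators(body))
```

Running `poll("WebServer1")` for each server in the pool on a schedule, and emailing the Technical Operations team whenever the returned list is not empty, gives an alert that says which indicator misbehaved rather than only that the page loaded.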

Filed under: Monitoring
22
Jul/09
1

The Release Management Process

TechOps Guy: Jason

Reliable, Repeatable, Results Over Time. One of my mentors over the years, David Gedye, pounded in my head early in my career that my goal as an Ops/IT expert was to achieve “Reliable, Repeatable, Results Over Time.” In everything I did, he would hammer this phrase home.

Every company that has a website or web application should have a disciplined Release Management Process. There are many benefits to getting this right, and I’m sure everyone is familiar with what happens when it goes wrong. A poor process usually means Development throwing code over the fence to Test, who subsequently throw it over to the Technical Operations team. The end result is usually configuration issues, integration issues or last minute bug fixes that do not get thoroughly tested for regressions. One of my strengths is being able to walk into an organization and either establish or help improve the current Release Management Process. To do this successfully I first examine the following things:

  • What environments are involved in the development, testing and production of your application?
  • How are these environments configured and who owns them?
  • How are the virtual teams communicating when a release is ready to move from one environment to the next?
  • Is there a tracking system for these releases?
  • What technologies are used to store and secure the source code?
  • What technologies are used to deploy the code from one environment to another?

Environments

If I could start from scratch and had the headcount and funds to do so, I would deploy the following 5 environments:

Development, Test, Sandbox, Staging & Production

While most organizations likely already take advantage of a Development and a Production environment, the other 3 environments can provide great value. Let me explain.


Development

Description: Primary environment to perform feature development and unit
testing owned by the Development team

Support Policy: OPS supports the hardware and OS level support,
Development supports the Application

Deployments: Performed by Developers at their discretion

Version: Running R1.2 (or 2 versions ahead of Production)

Test

Description: Primary environment to perform initial integration testing, basic performance/load testing owned by the Quality Assurance (Test) team

Support Policy: OPS supports the hardware and OS level support, Quality Assurance (Test) supports the Application

Deployments: Performed by Software Test Engineers at their discretion

Version: Likely running R1.0, R1.1 and R1.2 - since this environment is not a copy of Production there are likely multiple instances with multiple versions

Sandbox

Description: Environment to test complete builds of the application(s);
full integration testing occurs here with recent backups of sanitized
Production data; usually working on v. + 1 of Production.

Support Policy: OPS supports hardware, OS and Application level support

Deployments: Performed by OPS once signed off on by Development and
Test; Release Form must be completed prior to deployment

Version: Running R1.1 (or 1 version ahead of Production)

Staging

Description: Environment used primarily for hotfixes, data-only and
small code changes. This environment is needed so as not to disrupt build
testing in SANDBOX. This is definitely a luxury item as it can be
costly to manage the additional burden of equipment/OS/application.
Code base should match Production.

Support Policy: OPS supports hardware, OS and Application level support

Deployments: Performed by OPS once signed off on by Development and
Test; Release Form must be completed prior to deployment

Version: Running R1.0 (Runs identical code to Production)

Production

Description: This is the LIVE environment where the customers use the
application.

Support Policy: OPS supports hardware, OS and Application level support

Deployments: Performed by OPS once signed off on by Development and
Test; Release Form must be completed prior to deployment

Version: Running R1.0
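
The deployment rules for the five environments can also be expressed as a simple gate that tooling checks before a deployment starts. This is a sketch only: the rules are taken from the descriptions above, while the names `Environment`, `ENVIRONMENTS` and `may_deploy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    deployed_by: str
    gated: bool  # True: needs Development and Test sign-off plus a completed Release Form


ENVIRONMENTS = {
    "Development": Environment("Development", "Developers", False),
    "Test": Environment("Test", "Software Test Engineers", False),
    "Sandbox": Environment("Sandbox", "OPS", True),
    "Staging": Environment("Staging", "OPS", True),
    "Production": Environment("Production", "OPS", True),
}

REQUIRED_SIGNOFFS = {"Development", "Test"}


def may_deploy(env_name: str, signed_off_by: list, release_form_completed: bool) -> bool:
    """Return True if a deployment to the named environment is allowed to start."""
    env = ENVIRONMENTS[env_name]
    if not env.gated:
        return True  # Development and Test deploy at their owners' discretion
    return REQUIRED_SIGNOFFS <= set(signed_off_by) and release_form_completed
```

For example, `may_deploy("Production", ["Development"], True)` is refused because Test has not signed off yet.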


Release Management

Types of Fixes & Proper Lead Time

One of the most difficult problems a team needs to resolve is setting appropriate expectations with internal customers, external customers and application users regarding the amount of time it takes to properly release items from Development to Production. Oftentimes these expectations are not documented, and this can only lead to disappointment in the response to critical customer issues, lack of proper time to ensure quality testing and little or no practice in deploying these fixes/feature sets. So the main question is: how do we classify releases, and how much lead time does each classification require to ensure a high level of quality while still being very responsive to customer issues? Below I’ve outlined a strategy that helps tackle some of these difficult issues.


Data fix

Description: This often occurs within ASP applications where a sample of
data needs to be cleaned up, logically deleted or otherwise altered

Environments: Data fixes depending on scale and risk are often run and
verified in the Staging environment before moving onto Production

Version: 1.01, .01 indicates a revision to the application

Lead Time: Same day turn-around; if it is in by 3pm it can be deployed
later that evening during a scheduled deployment

Hotfix

Description: Generally a break/fix situation with an application

Environments: Hotfixes are generally run and verified in the Sandbox
environment before moving to Staging and ultimately onto Production

Version: 1.01, .01 indicates a revision to the application

Lead Time: Potentially same day turn-around, but a full day’s notice is
preferred. A full day’s notice would allow for a full day of testing
before the deployment date and allow an on and offshore test team to
verify a fix; again if it is in by 3pm it can be deployed later that
evening during a scheduled deployment.

Incremental Release

Description: Incremental Releases include the introduction/modification
of new features, slight updates/tweaks to the UI, collection of bug
fixes that have been triaged into the release

Environments: Incremental Releases are generally run and verified in the
Sandbox environment before moving to Staging and ultimately onto
Production

Version: 1.2, .2 indicates a revision to the application

Lead Time: 2 full days’ notice. This will allow Operations to retrieve
an identical copy of data from Production to do a full and accurate
deployment in Sandbox. The restoration of Production data takes time to
complete so the additional day’s notice is needed. Again if it is in by
3pm it can be deployed 2 days later during a scheduled deployment.

Milestone Release

Description: These are the biggies! Large feature set deployments,
architecture changes, data model changes and additional bug fixes that
have been triaged into the release

Environments: Milestone Releases are generally run and verified in the
Sandbox environment before moving to Staging and ultimately onto
Production

Version: 2.0, 2.0 indicates a revision to the application

Lead Time: 5 full days’ notice. This will allow Operations to retrieve
an identical copy of data from Production to do a full and accurate
deployment in Sandbox. The restoration of Production data takes time to
complete so the additional notice is needed. Again if it is in by
3pm it can be deployed 5 days later during a scheduled deployment.

* Lead time descriptions can always be a little fuzzy. Let’s just state that the lead time stated above encourages a rapid response to customer issues while still maintaining enough time for Quality Assurance Testing…not just Black Box testing :)
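
The lead times above reduce to a little date arithmetic. The Python sketch below takes the minimum lead time for each release type and the 3pm cutoff from the descriptions above. Two things are assumptions of the sketch rather than statements from the text: a request arriving after 3pm is counted from the following day, and lead time is counted in calendar days. The names `LEAD_DAYS` and `earliest_deployment` are hypothetical.

```python
from datetime import date, datetime, time, timedelta

# Minimum lead time in days per release type, as described above.
# Hotfix uses the "potentially same day" minimum, not the preferred full day.
LEAD_DAYS = {"data fix": 0, "hotfix": 0, "incremental": 2, "milestone": 5}

CUTOFF = time(15, 0)  # "if it is in by 3pm"


def earliest_deployment(release_type: str, submitted: datetime) -> date:
    """Return the earliest date on which the release could be deployed."""
    days = LEAD_DAYS[release_type]
    if submitted.time() > CUTOFF:
        days += 1  # assumption: a late request starts counting the next day
    return submitted.date() + timedelta(days=days)


if __name__ == "__main__":
    request_time = datetime(2009, 7, 22, 14, 30)
    for kind in LEAD_DAYS:
        print(kind, earliest_deployment(kind, request_time))
```

If your deployments only happen on business days, the date arithmetic would need to skip weekends as well, which this sketch does not do.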

Hardware Selection

Hardware selection is driven completely by the budget you have to spend. If the web farm for your application is 10 web servers, you likely will not need more than 2 web servers to fully simulate the Production environment in Sandbox and Staging. Also, if you are not performance/load testing in Sandbox and/or Staging then you can get away with cheaper desktop or inexpensive rack mounted servers rather than the beefier hardware you are likely running in Production.

Comments About Configuration

I’m not a big fan of consolidating multiple services on 1 server in Sandbox, unless Production is configured the same way. Whether it’s football, baseball or Operations, you need to practice like you’re gonna play.

Tracking a Release

It seems obvious that a release should be documented and recorded to review over time. The obvious solution is to create a database where you can store information on releases. The following represents what I would consider the ideal information you should capture in your database.


Variable | Selection Options | Description
Environment | Select Drop-down | Sandbox, Staging, Production
Release Type | Select Drop-down | Hot Fix, Service Release, Full Release, Configuration
Application | Select Drop-down | Jobster Service Corporate Website JIVE – Delight Highdeal Jobster Search Static Content Coffee Robot UJobs
Version | |
Release Instructions | Text area for sets of instructions |
Requested By | Select Drop-down |
Development Lead | Select Drop-down |
QA Lead | Select Drop-down |
Notify | Pre-selected Groups | Technical Operations, Quality Assurance/Test, Development, Program Management
Comments | Text area for comments about the release | Typically errors encountered upon deployment, etc.
Deployment Start Time | Entered by the Release Tracker |
Deployment End Time | Entered by the Release Tracker |
Submit to QA Time | Entered by the Release Tracker |
Delete Time | Entered by the Release Tracker |
Failed Time | Entered by the Release Tracker |
Passed QA Time | Entered by the Release Tracker |

Each time a Release Tracker Deployment Request Form is completed, the following individuals are notified: the Requester, the Development Lead for the application, the QA Lead for the application, anyone else selected under Notify Groups and the entire Technical Operations team.
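
As a rough illustration, the request fields and the notification rule above could be modelled like this in Python. The field names mirror the form variables listed earlier; the class name `DeploymentRequest`, the function `recipients` and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeploymentRequest:
    environment: str            # Sandbox, Staging or Production
    release_type: str           # Hot Fix, Service Release, Full Release or Configuration
    application: str
    version: str
    release_instructions: str
    requested_by: str
    development_lead: str
    qa_lead: str
    notify_groups: List[str] = field(default_factory=list)
    comments: str = ""


def recipients(request: DeploymentRequest) -> List[str]:
    """Return everyone to notify for a request, in order and without duplicates."""
    ordered = [
        request.requested_by,
        request.development_lead,
        request.qa_lead,
        *request.notify_groups,
        "Technical Operations",  # always notified
    ]
    seen = set()
    unique = []
    for name in ordered:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


if __name__ == "__main__":
    req = DeploymentRequest(
        environment="Sandbox", release_type="Hot Fix", application="Corporate Website",
        version="1.01", release_instructions="Deploy the build and restart the web tier.",
        requested_by="alice", development_lead="bob", qa_lead="carol",
        notify_groups=["Program Management"],
    )
    print(recipients(req))
```

The timestamp fields entered by the Release Tracker are left out of the sketch because they are filled in by the tracker as the release moves along, not by the requester.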

Source Code Repositories

Visual SourceSafe, Subversion, CVS, Source Depot, etc…your source code needs to be stored in a system that allows for labels, branching and versioning. The key to any Release Tracking process is making sure you are able to roll back to a previous version of the source code should the need arise. All of the above source code repositories support a labeling or versioning system within the software. To make the Release Tracker process work, you must have a Label or Version correctly identified in your source code repository. It is this version that will be tracked from Deployment Request to Deploying Notification to Deployed and Ready for QA, to Passed/Failed QA. Should the process fail anywhere along the way, there should be a clear version that the Technical Operations team can roll back to.
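
The tracking flow described above can be sketched as a tiny state machine in Python. The step names come from the paragraph above; the class `TrackedRelease`, its methods and the sample labels are hypothetical, and the sketch only records which label to roll back to. It does not talk to any source code repository.

```python
STEPS = [
    "Deployment Request",
    "Deploying Notification",
    "Deployed and Ready for QA",
    "Passed QA",
]


class TrackedRelease:
    """Tracks one labeled version through the release steps."""

    def __init__(self, label: str, previous_label: str):
        self.label = label                    # label/version being released
        self.previous_label = previous_label  # last known good label to roll back to
        self.step = 0
        self.failed = False

    @property
    def state(self) -> str:
        return "Failed" if self.failed else STEPS[self.step]

    def advance(self) -> None:
        """Move to the next step; refuse if the release failed or already passed QA."""
        if self.failed or self.step == len(STEPS) - 1:
            raise ValueError(f"cannot advance from state {self.state!r}")
        self.step += 1

    def fail(self) -> str:
        """Mark the release as failed and return the label to roll back to."""
        self.failed = True
        return self.previous_label


if __name__ == "__main__":
    release = TrackedRelease(label="R1.1", previous_label="R1.0")
    release.advance()
    release.advance()
    print(release.state)
    print("roll back to", release.fail())
```

The point of carrying `previous_label` on the record is that the roll back target is decided when the request is filed, not worked out in the middle of a failed deployment.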

Filed under: Uncategorized