Jul/09
It is System Administrator Appreciation Day
TechOps Guy:
It falls on the last Friday in July, so don’t forget to shower your favorite System Administrator with praise and caffeine. Otherwise they might be sleepy when the Gremlins attack your servers.
http://www.sysadminday.com/index2009.html
Jul/09
Defining a Solid Escalation Plan
TechOps Guy: Jason
Defining and adhering to an escalation plan provides clarity when a high-priority issue hits a Web 2.0 company, or any other company.
To create a solid escalation plan, you need a thorough understanding of the level of service you plan to deliver to your customers, typically captured in a Service Level Agreement (SLA). Note that you will be staffing against this goal.
Once your SLA exists and you have staff in place to support that commitment, the remaining work is defining the process by which your technology teams respond to issues in order to meet the SLA.
Priority Definition & Distinctions
I believe there are essentially 4 priorities for classifying issues: Priority 0, 1, 2 and 3. Note: the details will vary depending on your company’s service or application(s).
Priority 0
This is very bad. Something with Ping/Pipe/Power went wrong; in short, infrastructure is the culprit. This definition is reserved for issues that prevent your web application from functioning at all, usually at the Network/Firewall/Router or power level. These should be very rare occurrences and are likely due to an ISP failure or to inadequate preparation for a power failure. If your monitoring and alerting are set up correctly you should be notified of the issue before your customers discover it, but you will likely have downtime.
Priority 1
This is also very bad. This classification is usually reserved for cases where the core functionality of the web application is not working at all for any customer, or where everything is broken for a highly visible customer. The smoking gun in these situations is likely to be a configuration error introduced during a deployment, a serious defect that was not caught before deployment, or a sudden rush in traffic or use of a feature that causes a massive performance problem and renders the application almost useless. These occurrences can be due to any number of issues, but if you have intelligent application-level monitoring and your alerting is set up correctly, you should be notified of the issue before your customers discover it. You may not have downtime associated with this issue, but you will likely have to deploy a fix to address it.
Priority 2
This can be more common than most Technical Operations teams at Web 2.0 companies would like to admit. Web applications and their specific feature sets can have issues: not working as intended, failing under load, generating server errors, etc.
Priority 3
These are very common for most Web 2.0 companies. These issues are of the non-critical variety: minor bugs, broken links, reporting failures.
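The four definitions above can be read as a simple decision rule. Here is a minimal sketch in Python, assuming hypothetical boolean flags that your own triage process would set; the field names and the order of the checks are illustrative, not a prescribed implementation:

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # All field names are hypothetical; adapt them to your own triage form.
    infrastructure_down: bool = False      # ping/pipe/power failure
    core_functionality_down: bool = False  # core features unusable for all customers
    key_customer_fully_down: bool = False  # everything broken for a highly visible customer
    feature_degraded: bool = False         # feature misbehaves, fails under load, or throws server errors


def classify(issue: Issue) -> int:
    """Return the priority (0-3) following the definitions in this section."""
    if issue.infrastructure_down:
        return 0
    if issue.core_functionality_down or issue.key_customer_fully_down:
        return 1
    if issue.feature_degraded:
        return 2
    return 3  # non-critical bugs, broken links, reporting failures
```

The most severe condition is checked first, so an issue that matches several descriptions always lands in the highest applicable priority.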
Service Level Targets
Now that you have a good understanding of each priority definition it is important that we set some service targets that will help us achieve the SLA we’ve agreed to with our customers. Generally there are 4 areas that I feel are important to define for each priority categorization: Call Back Target, Start Work Target, After Hours, Deployment Target.
Call Back Target
This is the amount of time the person on call has to contact the person who escalated the issue; my guideline has always been 15 minutes.
Start Work Target
This is the amount of time the person on call has to get to a place to start work on the issue; my guideline varies depending on the priority of the issue–anywhere from 15 minutes to the next business day.
After Hours
This is a guideline for deciding how to react after core business hours. For example, are issues worked until resolved regardless of time, or can they be tackled the next business day?
Deployment Target
This is a guideline for determining when we would deploy a fix once one was created; same business day, next business day, or the weekly deployment.
Notifications
So, after you’ve classified your issues and committed to your service level targets, you need to communicate the severity of issues with your web application to your internal team. The easiest way to do this is to have an email list set up for each of P0, P1, P2 and P3 issues. When an issue is first encountered, the person on call responds within their start work target by sending an email to the appropriate list.
Ready to see it all put together? See below:
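A minimal sketch of how the pieces could fit together, in Python. The 15-minute call back target and the range of start work and deployment options come from the text above; the assignment of those options to each priority, the `ServiceTargets` and `first_notification` names, and the example.com addresses are assumptions for illustration only:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServiceTargets:
    call_back_minutes: int  # time to contact the person who escalated
    start_work: str         # when hands-on work must begin
    after_hours: str        # how the issue is handled outside core hours
    deployment: str         # when a finished fix is deployed
    mailing_list: str       # hypothetical notification list address


# Illustrative values only; set these from your own SLA.
ESCALATION_PLAN: Dict[int, ServiceTargets] = {
    0: ServiceTargets(15, "15 minutes", "work until resolved", "same business day", "p0@example.com"),
    1: ServiceTargets(15, "15 minutes", "work until resolved", "same business day", "p1@example.com"),
    2: ServiceTargets(15, "next business day", "next business day", "next business day", "p2@example.com"),
    3: ServiceTargets(15, "next business day", "next business day", "weekly deployment", "p3@example.com"),
}


def first_notification(priority: int, summary: str) -> Dict[str, str]:
    """Build the first email the person on call sends for a new issue."""
    targets = ESCALATION_PLAN[priority]
    return {
        "to": targets.mailing_list,
        "subject": f"[P{priority}] {summary}",
        "body": (
            f"Call back within {targets.call_back_minutes} minutes. "
            f"Start work: {targets.start_work}. "
            f"After hours: {targets.after_hours}. "
            f"Deployment target: {targets.deployment}."
        ),
    }
```

Keeping the targets in one table means the on-call engineer and the people on each mailing list read the same commitments.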
Jul/09
The Difficulty of Intelligent Application Monitoring
TechOps Guy: Jason
It’s become increasingly important for companies with online applications to have detailed monitoring. Gone are the days when monitoring drive space, services and ICMP responses was enough to verify availability. One of the biggest assets a company with an online application can have is the ability to understand how the application is behaving at any given time under any conditions. Traditionally this breaks down into 4 areas:
1.) Network
2.) Server
3.) Application
4.) Integration
Most IT professionals are very familiar with the first two categories, as these monitoring capabilities are available in almost every off-the-shelf software solution. The old school of thought would be to set up ICMP monitors for the IPs of the application and make sure that you received a prompt reply. The second part of the old school strategy would be to monitor critical items on the server such as disk space, % processor time, status of running services, etc. While both of these strategies are still very much in use today, what distinguishes a more robust monitoring solution is the addition of application- and integration-level intelligence.
Building in Application/Integration Level Monitoring:
First, let’s consider the following very basic web server configuration. You have an application named WebApplication1 running on a web server named WebServer1. WebApplication1 is a simple user community that allows users to register (Register.jsp), log in (Login.jsp), review postings by other users (ViewPage.jsp) and create/update/delete postings of their own (PageFunctions.jsp). The registration, login and password (Password.jsp) pages are protected by SSL encryption (port 443); the rest of the user community uses HTTP (port 80). In addition, WebServer1 runs another application called AppMonitor1 on port 7001, which reports the health of WebApplication1 via Monitor.jsp.
Sample Data Flow is given below:
WebServer1’s Web Configuration:
WebApplication1 is running on WebServer1 and WebServer2, but could theoretically be scaled out to any number of servers. In this example, WebApplication1 is running version r1.0 on both ports 80 and 443, while AppMonitor1 is running version r1.1 on port 7001.
For example, let’s take the following application under consideration:
Internet –> Router –> Firewall (Not pictured) –> NAT’d addresses –> Load Balancer –> Web Server(s) –> Application Server(s) –> Database Server(s) –> Fail-over Systems (Not all pictured)
The above architecture reflects one example of how application-level monitoring could be implemented. In this example we have a shell script or compiled executable, AppMonitor1.sh, running as a service on Monitor1 that polls the WebServer1/2/XX pool to check the health of the application at regular intervals. If for any reason the AppMonitor1 script does not get a response, it sends an email alert to the Technical Operations team. Under normal circumstances the AppMonitor1 service returns results showing the status and behavior of critical application indicators, as shown in the example below.
Ideally you need to understand specifically how your application is behaving, not just see a web page with a returned dataset that says “OK/GREEN”.
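The polling approach described above could be sketched roughly as follows in Python. The post does not specify what Monitor.jsp returns, so this sketch assumes it returns a JSON object of numeric indicators; the indicator names, the limits, and the `evaluate`, `poll` and `http_fetch` helpers are all hypothetical:

```python
import json
import urllib.request
from typing import Callable, Dict, List


def http_fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch the monitor page; network errors surface as OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def evaluate(body: str, limits: Dict[str, float]) -> List[str]:
    """Compare reported indicators against limits and return alert messages."""
    try:
        indicators = json.loads(body)
    except json.JSONDecodeError:
        return ["unparseable monitor response"]
    if not isinstance(indicators, dict):
        return ["unexpected monitor response shape"]
    alerts = []
    for name, limit in limits.items():
        value = indicators.get(name)
        if not isinstance(value, (int, float)):
            alerts.append(f"{name}: missing or non-numeric")
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds limit {limit}")
    return alerts


def poll(servers: List[str], fetch: Callable[[str], str],
         limits: Dict[str, float]) -> Dict[str, List[str]]:
    """Check every server in the pool; an empty list means healthy."""
    results = {}
    for server in servers:
        url = f"http://{server}:7001/Monitor.jsp"
        try:
            results[server] = evaluate(fetch(url), limits)
        except OSError as exc:  # no response at all: always an alert
            results[server] = [f"no response: {exc}"]
    return results
```

A scheduler would call `poll(["WebServer1", "WebServer2"], http_fetch, limits)` at each interval and email any non-empty alert lists to the Technical Operations team. Passing `fetch` in as a parameter keeps the threshold logic testable without a live server.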
Jul/09
The Release Management Process
TechOps Guy: Jason
Reliable, Repeatable, Results Over Time. One of my mentors over the years, David Gedye, pounded into my head early in my career that my goal as an Ops/IT expert was to achieve “Reliable, Repeatable, Results Over Time.” In everything I did, he would hammer this phrase home.
Every company that has a website or web application should have a disciplined Release Management Process. There are many benefits to getting this right, and I’m sure everyone is familiar with what happens when it goes wrong. A poor process usually means Development throwing code over the fence to Test, which in turn throws it over to the Technical Operations team. The end result is usually configuration problems, integration issues or last-minute bug fixes that do not get thoroughly tested for regressions. One of my strengths is being able to walk into an organization and either establish or help improve the current Release Management Process. To do this successfully I first examine the following things:
- What environments are involved in the development, testing and production of your application?
- How are these environments configured and who owns them?
- How are the virtual teams communicating when a release is ready to move from one environment to the next?
- Is there a tracking system for these releases?
- What technologies are used to store and secure the source code?
- What technologies are used to deploy the code from one environment to another?
Environments
If I could start from scratch and had the headcount and funds to do so, I would deploy the following 5 environments:
Development, Test, Sandbox, Staging & Production
While most organizations likely already have a Development and a Production environment, the other 3 environments can provide great value. Let me explain.
Development
Description: Primary environment for feature development and unit testing, owned by the Development team
Support Policy: OPS supports the hardware and OS; Development supports the application
Deployments: Performed by Developers at their discretion
Version: Running R1.2 (or 2 versions ahead of Production)

Test
Description: Primary environment for initial integration testing and basic performance/load testing, owned by the Quality Assurance (Test) team
Support Policy: OPS supports the hardware and OS; Quality Assurance (Test) supports the application
Deployments: Performed by Software Test Engineers at their discretion
Version: Likely running R1.0, R1.1 and R1.2. Since this environment is not a copy of Production there are likely multiple instances with multiple versions

Sandbox
Description: Environment to test complete builds of the application(s); full integration testing occurs here with recent backups of sanitized Production data; usually working on Production version + 1
Support Policy: OPS supports the hardware, OS and application
Deployments: Performed by OPS once signed off on by Development and Test; Release Form must be completed prior to deployment
Version: Running R1.1 (or 1 version ahead of Production)

Staging
Description: Environment used primarily for hotfixes, data-only changes and small code changes. It is needed so as not to disrupt build testing in Sandbox. This is definitely a luxury item, as it can be costly to manage the additional burden of equipment/OS/application. Code base should match Production.
Support Policy: OPS supports the hardware, OS and application
Deployments: Performed by OPS once signed off on by Development and Test; Release Form must be completed prior to deployment
Version: Running R1.0 (runs identical code to Production)

Production
Description: This is the LIVE environment where the customers use the application.
Support Policy: OPS supports the hardware, OS and application
Deployments: Performed by OPS once signed off on by Development and Test; Release Form must be completed prior to deployment
Version: Running R1.0
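The deployment rules in these environment descriptions can be written down as a small gate check. This is a rough sketch in Python; the rules themselves come from the descriptions above, while the function and parameter names are hypothetical:

```python
from typing import Dict

# Environments where OPS performs the deployment, per the descriptions above.
OPS_DEPLOYED = {"Sandbox", "Staging", "Production"}


def can_deploy(environment: str, dev_signoff: bool, test_signoff: bool,
               release_form_completed: bool) -> bool:
    """Return True if a deployment to the environment may proceed."""
    if environment not in OPS_DEPLOYED:
        # Development and Test deploy at their owners' discretion.
        return True
    return dev_signoff and test_signoff and release_form_completed


def staging_matches_production(versions: Dict[str, str]) -> bool:
    """Staging is only useful for hotfix testing if it mirrors Production."""
    return versions["Staging"] == versions["Production"]
```

For example, `can_deploy("Sandbox", True, True, False)` is `False` because the Release Form is still missing.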
Types of Fixes & Proper Lead Time
One of the most difficult problems a team needs to resolve is setting appropriate expectations with internal customers, external customers and application users regarding the amount of time it takes to properly release items from Development to Production. Often these expectations are not documented, which can only lead to disappointing responses to critical customer issues, too little time for quality testing and little or no practice in deploying these fixes and feature sets. So the main question is how we classify releases and how much lead time each classification requires to ensure a high level of quality while still being very responsive to customer issues. Below I’ve outlined a strategy that helps tackle some of these difficult issues.
Data fix
Description: This often occurs within ASP applications where a sample of data needs to be cleaned up, logically deleted or otherwise altered
Environments: Data fixes, depending on scale and risk, are often run and verified in the Staging environment before moving on to Production
Version: 1.01 (the .01 indicates a revision to the application)
Lead Time: Same-day turnaround; if it is in by 3pm it can be deployed later that evening during a scheduled deployment

Hotfix
Description: Generally a break/fix situation with an application
Environments: Hotfixes are generally run and verified in the Sandbox environment before moving to Staging and ultimately on to Production
Version: 1.01 (the .01 indicates a revision to the application)
Lead Time: Potentially same-day turnaround, but a full day’s notice is preferred. A full day’s notice would allow for a full day of testing before the deployment date and allow an on- and offshore test team to verify a fix; again, if it is in by 3pm it can be deployed later that evening during a scheduled deployment.

Incremental Release
Description: Incremental Releases include the introduction/modification of features, slight updates/tweaks to the UI and a collection of bug fixes that have been triaged into the release
Environments: Incremental Releases are generally run and verified in the Sandbox environment before moving to Staging and ultimately on to Production
Version: 1.2 (the .2 indicates a revision to the application)
Lead Time: 2 full days’ notice. This will allow Operations to retrieve an identical copy of data from Production to do a full and accurate deployment in Sandbox. The restoration of Production data takes time to complete, so the additional day’s notice is needed. Again, if it is in by 3pm it can be deployed 2 days later during a scheduled deployment.

Milestone Release
Description: These are the biggies! Large feature set deployments, architecture changes, data model changes and additional bug fixes that have been triaged into the release
Environments: Milestone Releases are generally run and verified in the Sandbox environment before moving to Staging and ultimately on to Production
Version: 2.0 (the 2.0 indicates a revision to the application)
Lead Time: 5 full days’ notice. This will allow Operations to retrieve an identical copy of data from Production to do a full and accurate deployment in Sandbox. The restoration of Production data takes time to complete, so the additional notice is needed. Again, if it is in by 3pm it can be deployed 5 days later during a scheduled deployment.
* Lead time descriptions can always be a little fuzzy. Let’s just state that the lead time stated above encourages a rapid response to customer issues while still maintaining enough time for Quality Assurance Testing…not just Black Box testing
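As a worked example of the lead times above, the earliest deployment date can be computed from the release type and the submission time. This is a minimal sketch in Python under two assumptions that the text does not state: a request arriving after the 3pm cutoff is treated as if it arrived the next day, and lead time is counted in calendar days rather than business days. The dictionary keys and function name are hypothetical:

```python
from datetime import date, datetime, time, timedelta

# Minimum lead time in days for each release type, from the list above.
# Hotfix uses the same-day minimum; the preferred notice is one full day.
LEAD_DAYS = {"data fix": 0, "hotfix": 0, "incremental": 2, "milestone": 5}
CUTOFF = time(15, 0)  # requests in by 3pm count for that day


def earliest_deployment(release_type: str, submitted: datetime) -> date:
    """Return the earliest evening deployment date for a request."""
    days = LEAD_DAYS[release_type]
    if submitted.time() > CUTOFF:
        days += 1  # assumption: a late request counts as the next day's
    return submitted.date() + timedelta(days=days)
```

A hotfix submitted at 2pm can go out that evening, while an incremental release submitted at 4pm on a Monday would be scheduled for Thursday under these assumptions.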
Hardware Selection
Hardware selection is driven completely by the budget you have to spend. If your web farm for your application is 10 web servers, you likely will not need more than 2 web servers to fully simulate the Production environment in Sandbox and Staging. Also, if you are not performance/load testing in Sandbox and/or Staging then you can get away with cheaper desktop or inexpensive rack-mounted servers rather than the beefier hardware you are likely running in Production.
Comments About Configuration
I’m not a big fan of consolidating multiple services on 1 server in Sandbox, unless it is similarly configured in Production. Whether it’s football, baseball or Operations, you need to practice like you’re gonna play, which is the motivation for this.
Tracking a Release
It seems obvious that a release should be documented and recorded to review over time. The obvious solution is to create a database where you can store information on releases. The following represents what I would consider the ideal information you should capture in your database.
| Variable | Selection Options | Description |
| --- | --- | --- |
| Environment | Select Drop-down | Sandbox, Staging, Production |
| Release Type | Select Drop-down | Hot Fix, Service Release, Full Release, Configuration |
| Application | Select Drop-down | Jobster Service Corporate Website JIVE – Delight Highdeal Jobster Search Static Content Coffee Robot UJobs |
| Version | | |
| Release Instructions | Text area for sets of instructions | |
| Requested By | Select Drop-down | |
| Development Lead | Select Drop-down | |
| QA Lead | Select Drop-down | |
| Notify | Pre-selected Groups | Technical Operations, Quality Assurance/Test, Development, Program Management |
| Comments | Text area for comments about the release | Typically errors encountered upon deployment, etc. |
| Deployment Start Time | Entered by the Release Tracker | |
| Deployment End Time | Entered by the Release Tracker | |
| Submit to QA Time | Entered by the Release Tracker | |
| Delete Time | Entered by the Release Tracker | |
| Failed Time | Entered by the Release Tracker | |
| Passed QA Time | Entered by the Release Tracker | |
Each time a Release Tracker Deployment Request Form is completed, the following individuals are notified: the Requester, the Development Lead for the application, the QA Lead for the application, anyone else selected under Notify Groups and the entire Technical Operations team.
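The request form and its notification rule could be modeled along these lines. This is a sketch in Python, not the actual Release Tracker; the field names mirror the table above, while the `ReleaseRequest` class and `recipients` function are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseRequest:
    environment: str       # Sandbox, Staging or Production
    release_type: str      # e.g. Hot Fix, Service Release
    application: str
    version: str           # label in the source code repository
    requested_by: str
    development_lead: str
    qa_lead: str
    notify_groups: List[str] = field(default_factory=list)
    release_instructions: str = ""
    comments: str = ""


def recipients(request: ReleaseRequest) -> List[str]:
    """Everyone notified when the form is submitted, without duplicates."""
    names = [
        request.requested_by,
        request.development_lead,
        request.qa_lead,
        *request.notify_groups,
        "Technical Operations",  # always notified
    ]
    return list(dict.fromkeys(names))  # de-duplicate, keep order
```

The timestamp fields from the table are left out here because they are entered by the Release Tracker itself rather than by the requester.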
Source Code Repositories
Visual SourceSafe, Subversion, CVS, Source Depot, etc…your source code needs to be stored in a system that allows for labels, branching and versioning. The key to any Release Tracking process is to make sure you are able to roll back to a previous version of the source code should the need arise. All of the above source code repositories support a labeling or versioning system within the software. To make the Release Tracker process work, you must have a Label or Version correctly identified in your source code repository. It is this version that will be tracked from Deployment Request to Deploying Notification to Deployed and Ready for QA, to Passed/Failed QA; should the process fail anywhere along the way, there should be a clear version that the Technical Operations team can roll back to.
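The tracking flow described above can be sketched as a small state machine that always knows which label to roll back to. This Python sketch uses the state names from the text where possible; the separate "Failed" state for a deployment that breaks before QA, and the `ReleaseTracker` class itself, are assumptions for illustration:

```python
from typing import Optional

# Allowed moves between tracking states.
TRANSITIONS = {
    "Deployment Request": {"Deploying"},
    "Deploying": {"Deployed and Ready for QA", "Failed"},
    "Deployed and Ready for QA": {"Passed QA", "Failed QA"},
}


class ReleaseTracker:
    def __init__(self, known_good_label: str) -> None:
        self.known_good_label = known_good_label  # roll-back target
        self.candidate_label: Optional[str] = None
        self.state: Optional[str] = None

    def request(self, label: str) -> None:
        """Open a deployment request for a repository label."""
        self.candidate_label = label
        self.state = "Deployment Request"

    def advance(self, new_state: str) -> Optional[str]:
        """Move to new_state and return the label that should be running."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot move from {self.state!r} to {new_state!r}")
        self.state = new_state
        if new_state in ("Failed", "Failed QA"):
            return self.known_good_label  # roll back to the last good label
        if new_state == "Passed QA":
            self.known_good_label = self.candidate_label
        return self.candidate_label
```

Because the known-good label only changes when a release passes QA, a failure at any earlier step leaves a clear version to restore.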





