TechOpsGuys.com Diggin' technology every day

18Dec/12Off

Top 10 outages of the year

TechOps Guy: Nate

It's that time of the year again, top N lists are popping up everywhere, I found this list from Data Center Knowledge to be interesting.

Of note, two big cloud companies were on the list with multiple outages - Amazon having at least three outages and Azure right behind it at two. Outages have been a blight on both services for years.

I don't know about you, but short of a brief time at a poor hosting facility in Seattle (I joined a company in Spring of 2006 that was hosted there and we were moved out by Fall of 2006 - we did go through one power outage while I was there if I recall right), the number of infrastructure related outages I've been through over the past decade have been fairly minimal compared to the number experienced by these cloud companies. The number of application related outages (and total downtime minutes incurred by said applications) out numbers infrastructure related things for me I'd say by at least 1,000:1.

Amazon has had far more downtime for companies that I have worked for (either before or since I was there) than any infrastructure related outages at companies I was at where they hosted their own stuff. I'd say it's safe to say an order of magnitude more outages. Of course not all of these are called outages by Amazon, they leave themselves enough wiggle room to drive an aircraft carrier through in their SLAs. My favorite one was probably the forced reboot of their entire infrastructure.

Unlike infrastructure related outages at individual companies, obviously these large service provider outages have much larger consequences for very large numbers of customers.

Speaking of cloud, I heard that HP brought their own cloud platform out of beta recently. I am not a fan of this cloud either, basically they tried to clone what Amazon is doing in their cloud, which infrastructure wise is a totally 1990s way of doing things (with APIs on top to make it feel nice). Wake me up when these clouds get the ability to pool CPU/memory/storage and have the ability to dynamically configure systems without fixed configurations.

If the world happens to continue on after December 22nd @ 3:11AM Pacific time, and I don't happen to see you before Christmas - have a good holiday from all of us monkeys at Techopsguys.

Tagged as: , 1 Comment
18Dec/12Off

New Cloud provider Profitbricks

TechOps Guy: Nate

(originally I had this on the post above this but I thought it better to split it out since it morphed into something that suited a dedicated post)

Also on the topic of cloud, I came across this other post on Data Center Knowledge's site a few days ago talking about a new cloud provider called ProfitBricks.

I dug into their web site a bit and they really seem to have some interesting technology. They are based out of Europe, but also have a U.S. data center somewhere too. They claim more than 1,000 customers, and well over 100 engineers working on the software.

While Profitbricks does not offer pooling of resources they do have several key architectural advantages that other cloud offerings that I've come across lack:

They really did a good job at least on paper, I haven't used this service, though I did play around with their data center designer

ProfitBricks Data Center designer

Their load balancing offering appears to be quite weak (weaker than Amazon's own offering), but you can deploy a software load balancer like Riverbed Stingray (formerly Zeus). I emailed them about this and they are looking into Stingray, perhaps they can get a partnership going and have it be an offering with their service. Amazon has recently improved their load balancing partnerships and you can now run at least Citrix Netscaler as well as A10 Networks' SoftAX in EC2, in addition to Riverbed Stingray. Amazon's own Elastic Load Balancer is worse than useless in my experience. I'd rather rely on an external DNS-based load balancing from the likes of Dynect than use ELB. Even with Stingray it can take several seconds (up to about 30) for the system to fail over with Elastic IPs, vs normally sub second fail over when your operating your own infrastructure.

Anyway back to Proifitbricks, I was playing around with their designer tool and I was not sure how best to connect servers that would be running load balancers(assuming they don't provide the ability to do IP-takeover). I thought maybe have one LB in each zone, and advertise both data center IP addresses (this is a best practice in any case at least for larger providers). Though in the above I simplified it a bit to a single internet access point and using one of ProfitBricks round robin load balancers to distribute layer 4 traffic to the servers behind it(running Stingray). Some real testing would of course have to go into play and further discussions before I'd run production stuff on it obviously (and I have no need for IaaS cloud right now anyway).

So they have all this, and still their pricing is very competitive. They also claim very high level of support as well which is good to see.

I'll certainly keep them in mind in the event I need IaaS in the future, they seem to know the failings of first generation cloud companies and are doing good things to address them. Now if they could only address the point of lack of resource pooling I'd be really happy!

Tagged as: , 1 Comment
12Sep/12Off

Data Center reminder: deploy environmental sensors

TechOps Guy: Nate

I feel like I am almost alone in the world when it comes to deploying environmental sensors around my equipment. I first did it at home back around 2001 when I had a APC SmartUPS and put a fancy environmental monitoring card in it, which I then wrote some scripts for and tied it into MRTG.

A few years later I was part of a decently sized infrastructure build out that had a big budget so I got one of these, and 16 x environmental probes each with 200 foot cables (I think the probes+cables alone were about $5k(the longest cables they had at the time, which were much more expensive than the short ones, I wasn't sure what lengths I needed so I just went all out), ended up truncating the ~3200 feet of cables down to around ~800 feet I suspect). I focused more on cage environmental than per rack, I would of needed a ton more probes if I had per rack. Some of the sensors went into racks, and there was enough slack on the end of the probes to temporarily position them anywhere within say 10 feet of their otherwise fixed position very easily.

The Sensatronics device was real nifty, so small, and yet it supported both serial and ethernet, had a real basic web server, was easily integrated to nagios (though at the time I never had the time to integrate it so relied entirely on the web server). We were able to prove to the data center at the time their inadequate cooling and they corrected it by deploying more vented tiles. They were able to validate the temperature using one of those little laser gun things.

At the next couple of companies I changed PDU manufacturers and went to ServerTech instead, many (perhaps all?) of their intelligent PDUs come with ports for up to two environmental sensors. Some of their PDUs require an add-on to get the sensor integration.

The probes are about $50 a piece and have about a 10 foot cable on them. Typically I'd have two PDUs in a rack and I'd deploy four probes (2 per PDU). Even though environmental SLAs only apply to the front of the racks, I like information so I always put two sensors in front and two sensors in rear.

I wrote some scripts to tie this sensor data into cacti (the integration is ugly so I don't give it out), and later on I wrote some scripts to tie this sensor data into nagios (this part I did have time to do). So I could get alerts when the facility went out of SLA.

Until today the last time I was at a facility that was out of SLA was in 2009, when one of the sensors on the front of the rack was reporting 87 degrees. The company I was at during that point had some cheap crappy IDS systems deployed in each facility, and this particular facility had a high rate of failures for these IDSs. At first we didn't think *too* much of it, then I had the chance to hook up the sensors and wow, was I surprised. I looked at the temperatures inside the switches and compared it to other facilities (can't really extrapolate ambient temp from inside the switch), and confirmed it was much warmer there than at our other locations.

So I bitched to them and they said there was no problem, after going back and forth they did something to fix it - this was a remote facility - 5,000 miles away and we had no staff anywhere near it, they didn't tell us what they did but the temp dropped like a rock, and stayed within (barely) their SLA after that - it was stable after that.

Cabinet Ambient Temperature

There you have it, oh maybe you noticed there's only one sensor there, yeah the company was that cheap they didn't want to pay for a second sensor, can you believe that, so glad I'm not there anymore (and oh the horror stories I've heard about the place since! what a riot).

Anyways so fast forward to 2012.  Last Friday we had a storage controller fail (no not 3PAR, another lower end HP storage system), with a strange error message, oddly enough the system did not report there was a problem in the web UI (system health "OK"), but one of the controllers was down when you dug into the details.

So we had that controller replaced (yay 4 hour on site support), the next night the second controller failed with the same reason. HP came out again and poked at it, at one point there was a temperature alarm but the on site tech said he thought it was a false alarm, they restarted the controller again and it's been stable since.

So today I finally had some time to start hooking up the monitoring for the temperature sensors in that facility, it's a really small deployment, just 1 rack, so 4 sensors.

I was on site a couple of months ago and at the time I sent an email noting that none of the sensors were showing temperatures higher than 78 degrees (even in the rear of the rack).

So imagine my surprise when I looked at the first round of graphs that said 3 of the 4 sensors were now reporting 90 degrees or hotter temperature, and the 4th(near the floor) was reporting 78 degrees.

Wow, that is toasty, freakin hot more like it. So I figured maybe one of the sensors got moved to the rear of the rack, I looked at the switch temperatures and compared them with our other facility, the hotter facility was a few degrees hotter (4C), not a whole lot.

The servers told another story though.

Before I go on let me say that in all cases the hardware reports the systems are "within operating range", everything says "OK" for temperature - it's just way above my own comfort zone.

Here is a comparison of two servers at each facility, the server configuration hardware and software is identical, the load in both cases is really low, actually load at the hot facility would probably be less given the time of day (it's in Europe so after hours). Though in the grand scheme of things I think the load in both cases is so low that it wouldn't influence temperature much between the two. Ambient temperature is one of 23 temperature sensors on the system.

Data CenterDeviceLocationAmbient Temperature Fan Speeds (0-100%)
[6 fans per server]
Hot Data CenterServer XRoughly 1/3rd from bottom of rack89.6 F90 / 90 / 90 / 78 / 54 / 50
Normal Data CenterServer XRoughly 1/3rd from bottom of rack66.2 F60 / 60 / 57 / 57 / 43 / 40
Hot Data CenterServer YRoughly 1/3rd from bottom of rack87.8 F90 / 90 / 72 / 72 / 50 / 50
Normal Data CenterServer YBottom of Rack66.2 F59 / 59 / 57 / 57 / 43 / 40

That's a pretty stark contrast, now compare that to some of the external sensor data from the ServerTech PDU temperature probes:

LocationAmbient Temperature (one number per sensor)Relative Humidity (one number per sensor)
Hot Data Center - Rear of Rack95 / 8828 / 23
Normal Data Center - Rear of Rack84 / 84 / 76 / 8044 / 38 / 35 / 33
Hot Data Center - Front of Rack90 / 7942 / 31
Normal Data Center - Front of Rack75 / 70 / 70 / 7058 / 58 / 58 / 47

Again pretty stark contrast. Given that all equipment (even the storage equipment that had issues last week) is in "normal operating range" there would be no alerts or notification, but my own alerts go off when I see temperatures like this.

The on site personnel used a hand held meter and confirmed the inlet temperature on one of the servers was 30C (86 F), the server itself reports 89.6, I am unsure as to the physical location of the sensor in the server but it seems reasonable that an extra 3-4 degrees from the outside of the server to the inside is possible. The data center's own sensors report roughly 75 degrees in the room itself, though I'm sure that is due to poor sensor placement.

Temperature readout using a hand held meter

I went to the storage array, and looked at it's sensor readings - the caveat being I don't know where the sensors are located (trying to find that out now), in any case:

  • Sensor 1 = 111 F
  • Sensor 2 = 104 F
  • Sensor 3 = 100.4 F
  • Sensor 4 = 104 F

Again the array says everything is "OK",  I can't really compare to the other site since the storage is totally different(little 3PAR array), but I do know that the cooler data center has a temperature probe directly in front of the 3PAR controller air inlets, and that sensor is reading 70 F. The only temperature sensors I can find on the 3PAR itself are on the physical disks, which range from 91F to 98F, the disk specs say operating temperature from 5-55C (55C = 131F).

So the lesson here is, once again - invest in your own environmental monitoring equipment - don't rely on the data center to do it for you, and don't rely on the internal temperature sensors of the various pieces of equipment (because you can't extract the true ambient temperature and you really need that if your going to tell the facility they are running too hot).

The other lesson is, once you do have such sensors in place, hook them up to some sort of trending tool so you can see when stuff changes.

PDU Temperature Sensor data

The temperature changes in the image above was from when the on site engineer was poking around.

Some sort of irony here the facility that is running hot is a facility that has a high focus on hot/cold isle containment (though the row we are in is not complete so it is not contained right now), they even got upset when I told them to mount some equipment so the airflow would be reversed. They did it anyway of course, that equipment generates such little heat.

In any case there's tons of evidence that this other data center is operating too hot! Time to get that fixed..

9Jul/12Off

Amazon outages from a Datacenter Perspective

TechOps Guy: Nate

I just came across this blog post ("Cloud Infrastructure Might be Boring, but Data Center Infrastructure Is Hard"), and the author spent a decent amount of time ripping into Amazon from a data center operations perspective -

But on the facilities front, it’s hard to see how the month of June was anything short of a disaster for Amazon on the data center operations side.

Also covered are past outages and the author concludes that Amazon lacks discipline in operating their facilities as a chain of outages illustrates over the past few years

[..]since all of them can be traced back to a lack of discipline in the operation of the data centers in question.

[..]I wish they would just ditch the US East-1 data center that keeps giving them problems.  Of course the vast, vast majority of AWS instances are located there, so that may involve acquiring more floor space.

Sort of reminds me when Internap had their massive outage and then followed up by offering basically free migration to their new data center for any customer that wanted it - so many opted for it that they ran out of space pretty quick (though I'm sure they have since provisioned tons more space since the new facility had the physical capacity to handle everyone + lots more once fully equipped).

This goes back to my post where I ripped into them from a customer perspective, the whole built to fail model. For Amazon it doesn't matter of a data center goes offline, they have the capacity to take the hit elsewhere and global DNS will move the load over in a matter of seconds.  Most of their customers don't do that (because it's really expensive and complex mainly - did you happen to notice there's really no help for customers that want to replicate data or configuration between EC2 Regions?). As I tried to point out before, at anything other than massive scale it's far more cost effective(and orders of magnitude simpler) for the vast majority of the applications and workloads out there to have the redundancy in the infrastructure (and of course the operational ability to run the facilities properly) to handle those sorts of events.

Though I'd argue with the author on one point - cloud infrastructure is hard.  (Updated, since the author said it was boring rather than easy, my brain interpreted it as one is hard the other must not be, for whatever reason :) ) Utility infrastructure is easy but true cloud infrastructure is hard.  The main difference being the self service aspect of things. There are a lot of different software offerings trying to offer some sort of self service or another but for the most part they still seem pretty limited or lack maturity (and in some cases really costly). It's interesting to see the discussions about OpenStack for example - not a product I'd encourage anyone to use in house just yet unless you have developer resources that can help keep it running.

30Jun/12Off

Amazon Cloud: Two power outages in two weeks

TechOps Guy: Nate

By now you should know I'm no fan of Amazon's cloud, it makes me feel I'm stuck in the 90s when I use it. I've been using it quite a bit for the past two years(with two different companies) but finally about to get the hell out of there. The last set of systems is set to migrate before my trip to Seattle.

Last week they had one outage in one of their availability zones, though it took them well over an hour to admit it was a power outage, they first tried to say "oh some volumes are experiencing increased latency". What a load of crap. It should take all of 5 seconds to know there is a power outage. The stuff I manage had minor impact fortunately since we are down to just a few things left, we lost some stuff but none of it critical.

Then last night they have another one, which seems to have made some news too.

A slew of sites, including Netflix, Instagram and Pinterest, have gone down this evening, thanks to “power issues” at Amazon’s Elastic Compute Cloud data center in North Virginia. The websites rely on Amazon’s cloud services to power their services. Some pretty violent storms in the region are apparently causing the problems.

This had slightly more impact on stuff I'm responsible for, one of my co-workers handled the issues it wasn't much to do fortunately. I can only imagine the havok of a larger organization like one of the above that depend more heavily on their cloud.

What a lot of people don't realize though is these two outages aren't really considered outages in Amazon's mind, at least for that region, because only one data center or part of one data center went off line. Their SLA is worded so that they exempt themselves from the effects of such an outage and put the onerous on the customer to deal with it. I suspect these facilities aren't even Tier IV, because Tier IV is expensive and Amazon is about cheap. If they were Tier IV a simple storm wouldn't of caused equipment to lose power.

I remember a couple years ago the company I was at had some gear co-located near Chicago at an Equinix site, some big storms and flooding if I remember right rolled through. We didn't have redundant power of course(more on that below), but there was no impact to the equipment other than an email to us saying the site was on generator power for some time and then another email saying the site was back on utility power.

There are exceptions of course, poor design being one. I think back to what was once Internap's premier data center in Seattle Fisher Plaza which was once plagued by power issues and eventually resulted in more then 24 hours of downtime due to a fire knocking out many well known sites like Bing Travel as well as many others. It took them months to repair the facility, they had generator trucks sitting out front providing generator power 24/7 during the repairs. From a storage perspective I remember being told stories of at least one or two customers' NetApp equipment taking more then 24 hours to come back online (file system checks), I'm sure folks that had battery backed cache were in sort of a panic not knowing when or if power would be restored to the facility. Some of my friends were hosted there at another company with a really small 3PAR array and were not worried though, because 3PAR systems dump their cache to an internal disk on the controller when the power goes out, so batteries are not required past that point. Since cache is mirrored there is two copies of it stored on different disks. Some newer systems have fancy flash-backed cache that is even nicer.

Fisher Plaza for a while had about one power outage per year, every year for at least 3 years in a row. Including the somewhat famous EPO event where someone went out of their way to hit the Emergency Power Off switch (there was no emergency) and shut down the facility. After that all customers had to go through EPO Training, which was humorous.

Being the good operations person that I am, shortly after I started work at a company back in 2006 that was hosted at Fisher plaza I started working on plans to move them out - the power issues was too much to bear. We still had about nine months left in our contract and my boss was unsure how we could go before that was up given it would cost a lot. I had an awesome deal on the table from a local AT&T Facility which I had good experiences with (though density wise they are way out dated and after an AT&T re-organization in around 2008 I wouldn't even consider AT&T as a data center provider now). Anyways, I had this great deal and wanted to move but we had a hard time getting past the fact that we still owed a ton on the contract for Internap and we couldn't get out of it. Then Fisher Plaza had another power outage (this was in 2006, the fire was three years later). The VP said to us something along the lines of I don't care what it takes, I want to get out of there now. Music to my ears, things got moving quickly and we moved out within a month or so. I was hosted at that AT&T data center for a good 5 years personally and the companies I was at was hosted there for a good I want to say 8-9 years between the two without a single power event that I am aware of. I was there once when the facility lost power, but the data center floor was unaffected. I believe there was a few other power outages, but again nothing impacting customer equipment.

There are other bad designs out there too - personally I consider anything that relies on a flywheel UPS to be badly designed, because there isn't enough time for on site personnel to manually try to correct a situation before the UPS runs out of juice.  I want at least 10-15 minutes at full load.

Internap later opened a newer fancier data center down in Tukwila in facility owned by Sabey. That is a massive campus, they claim 1.2M square feet of data center space. There is a large Microsoft presence there as well. On one of my tours of the facility I asked their technical people whether or not they use real UPS or a Flywheel, and they said real UPS. They commented how Microsoft literally next door used Flywheel and they said how Microsoft is seemingly constantly running their generators(far more frequently than your typical routine load testing), they did not know specifically why but speculated maybe they don't trust the Fly wheels, and laughed with me. That same Internap facility had another power outage, shortly after it opened though that one was human error. Apparently there was some fault in a UPS, and some person did something bad, the only way to fix it was to shut everything down. Internap claimed they addressed that problem by having every on site action double checked and signed off. I know people that are hosted there and have not heard of issues since the new policies were put in place.

Another reason is being a cheap bastard. I think Amazon falls into this area - they address it for their own applications with application level availability, global load balancing and fancy Citrix load balancers.  I was at another company a few years ago that fell into this area too of being a cheap bastard and not wanting to invest in redundant power. People view power as a utility that won't ever go down, especially in a data center - and this view is reinforced the longer you go without having a power outage. I remember a couple outages at a real cheap co-location the company was using in Seattle, where some other customer plugged a piece of fancy Cisco gear in and for some reason it tripped the UPS which knocked out a half dozen of our racks, because they didn't have redundant power. So naturally we had an outage due to that.  The same thing happened again a few weeks later, after the customer replaced the Cisco gear with a newer Cisco thing and the UPS tripped again. Don't know why.

The back end infrastructure was poorly designed as well, they had literally roughly 2 dozen racks all running off the same UPS, none of them had redundant power(I thought how hard can it be to alternate between UPSs every other rack? Apparently they didn't think of that or didn't want to spend for it).  It was a disaster waiting to happen. They were lucky and they did not have such a disaster while I was there. It was like pulling teeth to get them to commit to redundant power for the new 3PAR system, and even then they'd only agree to one UPS feed and one non UPS feed. This had it's own issues on occasion.

One of my former co-workers told me a story about a data center he used to work at - the worst of both worlds - bad design AND cheap bastard. They bought these generators and enclosed them somewhat in some sort of structure outside. Due to environmental regulations they could not test them very often, only a couple minutes a month or something like that. Maybe the generators were cheap crappy ones that belched out more pollution than others, I don't know. But the point is they never could fully test them. They had a real power outage one day, and they went outside and watched as the generators kicked on, they were happy.

Then a few minutes later they shut down and the facility lost all power. WTF? They went and turned them on again, and a few minutes later they shut off again.  Apparently the structure they built around the generators did not leave enough space for cooling and the generators were overheating and shutting down.

Back to Amazon and their SLAs (or lack thereof). I'm torn between funny and sad when I see people attacking Amazon customers like Netflix or the other social things that are on their cloud when they go down as a result of an Amazon downtime. They rag on the customers for not making their software more resilient against such things. Amazon expects you to do this, they do it after all if Amazon can do it anyone can right?

Yeah, reality is different. Most companies do not do that and probably never will. At a certain scale it makes sense, for some applications it makes sense. For the vast majority it does not, and the proof is in the pudding - most companies don't do it. I've worked at two different companies that built their apps from the ground up in Amazon and neither made any considerations for this aspect of availability. I know there are folks out there that DO do this but they are in the small minority, who think they are hip because they can survive a data center going down without impacting things.

It's far simpler, and cheaper to address the problem in a more traditional way with highly available infrastructure for the vast majority of applications anyways. Disasters do happen and you should still be prepared, but that's far different from the Amazon model of "built to fail". These aren't the first power issues Amazon has had and certainly won't be the last.

The main point to this post is trying to illustrate the difference in how the SLAs are worded, how the particular service provider responds, and how customers respond to the event.

A counter example I have brought up many times, a combination of a power issue AND a fire over at a Terremark facility a few years ago, resulted in no customer impact. Good design and no cheap bastards there.

Some irony here is that Amazon tries to recruit me about once every six months. I politely tell them I'm not interested, unless it's a person I know then I tell them why I'm not interested, and believe me I'm being incredibly polite here.

The current state of Infrastructure as a Service cloud offerings is just a disaster in general (there are some exceptions to parts of the rules here and there). Really everything about it is flawed from the costs to the availability to the basic way you allocate resources. For those of you out there that use cloud offerings and feel like you've traveled back in time I feel your pain, it's been the most frustrating two years of my career by far. Fortunately that era is coming to a close in a couple of weeks and boy does it feel good.

This blog had a many hour outage recently, of course it's not powered by redundant systems though it does have redundant power supplies(I suspect the rack doesn't have true redundant power I don't know it's a managed co-location though I own the server). A few nights ago there was some networking issues, I don't know details I haven't tried to find out. But the  provider who gets me the service(I think they have a cage in the facility they are a computer reseller), had their website on the same subnet as mine and I saw that was unreachable as well.

Whatever it was it was not a power issue since the uptime of my systems was unchanged once things got fixed. Though my bridging OpenBSD VM running pf on my ESXi system crashed for some reason (internal VMware error - maybe too many network errors). So I had to manually fire up the VM again before my other VMs could get internet access.  Not the end of the world though it's just one small server running personal stuff. As you might know I ran my server in the Terremark cloud for about a year while I transitioned between physical server hosts (last server was built in 2004, this one about a year ago). When I started thinking about off site backups, I very quickly determined that cloud wasn't going to cut it for the costs and it was far cheaper to just buy a server with RAID and put it in a co-lo, with roughly 3.6TB of usable capacity protected by RAID-10 on enterprise nearline SAS drives and a hardware RAID controller with battery backed cache I'm happy.

4May/11Off

Microsoft Server Designs

TechOps Guy: Nate

I was out of town for most of last week so didn't happen to catch this bit of news that came out.

It seems shortly after Facebook released their server/data center designs Microsoft has done the same.

I have to admit when I first heard of the Facebook design I was interested, but once I saw the design I felt let down, I mean is that the best they could come up with? It seems there are market based solutions that are vastly superior to what Facebook designed themselves. Facebook did good by releasing in depth technical information but the reality is only a tiny number of organizations would ever think about attempting to replicate this kind of setup. So it's more for the press/geek factor than being something practical.

I attended a Datacenter Dynamics conference about a year ago, where the most interesting thing that I saw there was a talk by a Microsoft guy who spoke about their data center designs, and focused a lot on their new(ish) "IT PAC".  I was really blown away. Not much Microsoft does has blown me away but consider me blown away by this. It was (and still is) by far the most innovative data center design I have ever seen myself at least. Assuming it works of course, at the time the guy said there was still some kinks they were working out, and it wasn't on a wide scale deployment at all at that point. I've heard on the grape vine that Microsoft has been deploying them here and there in a couple facilities in the Seattle area. No idea how many though.

Anyways, back to the Microsoft server design, I commented last year on the concept of using rack level batteries and DC power distribution as another approach to server power requirements, rather than the approach that Google and some others have taken which involve server-based UPSs and server based power supplies (which seem much less efficient).

 

Google Server Design with server-based batteries and power supplies

Add to that rack-based cooling(or in Microsoft's case - container based cooling), ala SGI CloudRack C2/X2, and Microsoft's extremely innovative IT PAC containers, and you got yourself a really bad ass data center. Microsoft seems to borrow heavily from the CloudRack design, enhancing it even further. The biggest update would be the power system with the rack level UPS and 480V distribution.  I don't know of any commercial co-location data centers that offer 480V to the cabinets, but when your building your own facilities you can go to the ends of the earth to improve efficiency.

Microsoft's design permits up to 96 dual socket servers(2 per rack unit) each with 8 memory slots in a single 57U rack (the super tall rack is due to the height of the container). This compares to the CloudRack C2 which fits 76 dual socket servers in a 42U rack (38U of it used for servers).

SGI Cloudrack C2 tray with 2 servers, 8 disks (note no power supplies or fans, those are provided at the rack level )

My only question on Microsoft's design is their mention of "top of rack switches". I've never been a fan of top of rack switches myself. I always have preferred to have switches in the middle of the rack, better for cable management (half of the cables go up, the other half go down). Especially when we are talking about 96 servers in one rack. Maybe it's just a term they are using to describe what kind of switches, though there is a diagram which shows the switches positioned at the top of the rack.

SGI CloudRack C2 with top of rack switches positioned in the middle of the rack

I am also curious on their power usage, which they say they aim to have 40-60 watts/server, which seems impossibly low for a dual socket system, so they likely have done some work to figure out optimal performance based on system load and probably never have the systems run at anywhere near peak capacity.

Having 96 servers consume only 16kW of power is incredibly impressive though.

I have to give mad, mad, absolutely insanely mad props to Microsoft. Something I've never done before.

Facebook - 180 servers in 7 racks (6 server racks + 1 UPS rack)

Microsoft - 630 servers in 7 racks

Density is critical to any large scale deployment, there are limits to how dense you can practically go before the costs are too high to justify it. Microsoft has gone about as far as is achievable given current technology to accomplish this.

Here is another link where Microsoft provides a couple of interesting PDFs, the first one I believe is written by the same guy that gave the Microsoft briefing at the conference I was at last year.

(As a side note I have removed Scott from the blog since he doesn't have time to contribute any more)

21Mar/11Off

Please do not extend Data center tax breaks

TechOps Guy: Nate

This is just disgusting to me. It pissed me off when it passed the first time and it is even more stupid and crazy if it happens to pass again.

Just read on DataCenterKnowledge that Washington state (where I am) has someone(s) proposing a bill that would extend data center tax breaks for another 10+ years.

This, in a time where the state forecast just last week an even larger state budget deficit.

Key lawmakers now turn their full attention to writing budgets for the 2011-2013 cycle. Revenue is expected to be down for that budget by an additional $700 million, Thursday's forecast said. Now, the deficit is estimated to be about $5.1 billion, but that includes voter-approved mandates that lawmakers don't plan to fund.

The big issue I have with this data center tax break is these data centers really don't contribute much. They have a short term gain in construction jobs but operationally they employ hardly anyone and they consume an enormous amount of energy and water requirements for cooling.

Take a look at this $1 billion Apple data center for example -

Tax breaks could total $300 million for 50-employee server farm in North Carolina

If your going to give tax breaks, give them to businesses that actually generate jobs. There should be some sort of rule, # of jobs per square foot, or # of jobs per $ in tax break or something. Data centers are a waste for tax breaks, let them go somewhere else.

The original tax break to data centers was approved right after the state announced a $1 billion tax increase on the rest of the state.

21Oct/10Off

Red Hat wants to end “IT Suckage”

TechOps Guy: Nate

Read an interesting article over on The Register with a lot of comments by a Red Hat executive.

And I can't help but disagree on a bunch of stuff the executive says. But it could be because the executive is looking at and talking with big bloated slow moving organizations that have a lot of incompetent people in their ranks ("Never got fired for buying X" mantra), instead of smaller more nimble more leading edge organizations willing, ready and able to take some additional "risk" for a much bigger return (such as running virtualized production systems, seems like a common concept to many but I know there's a bunch of people out there that aren't convinced that it will work, btw I ran my first VMware in production in 2004, and saved my company BIG BUCKS with the customer (that's a long story, and an even longer weekend)).

OK so this executive says

After all, processor and storage capacity keep tracking along on their respective Moore's and Kryder's Laws, doubling every 18 months, and Gilder's Law says that networking capacity should double every six months. Those efficiencies should lead to comparable economies. But they're not.

I was just thinking this morning about the price and capacity of the latest systems(sorry keep going back to the BL685c G7 with 48 cores and 512GB of ram :) ).

I remember back in 2004/2005 time frame the company I was at paying well over $100,000 for a 8-way Itanium system with 128GB of memory to run Oracle databases. The systems of today whether it is the aforementioned blade or countless others can run circles around such hardware now at a tiny fraction of the price. It wasn't unreasonable just a few short years ago to pay more than $1M for a system that had 512GB of memory and 24-48 CPUs, and now you can get it for less than $50,000(in this case using HP web pricing). That big $1M system probably consumed at least 5-10kW of power and a full rack as well, vs now the same capacity can go for ~800W(100% load off the top of my head) and you can get at least 32 of them in a rack(barring power/cooling constraints).

Granted that big $1M system was far more redundant and available than the small blade or rack mount server, but at the time if you wanted so many CPU cores and memory in a single system you really had no choice but to go big, really big. And if I was paying $1M for a system I'd want it to be highly redundant anyways!

With networking, well 10GbE has gotten to be dirt cheap, just think back a few years ago if you wanted a switch with 48 x 10GbE ports you'd be looking at I'd say $300k+ and it'd take the better part of a rack. Now you can get such switches in a 1U form factor from some vendors(2U from others), for sub $40k?

With storage, well spinning rust hasn't evolved all that much over the past decade for performance unfortunately but technologies like distributed RAID have managed to extract an enormous amount of untapped capacity out of the spindles that older architectures are simply unable to exploit. More recently the introduction of SSDs and the sub LUN automagic storage tiering technology that is emerging (I think it's still a few years away from being really useful) you can really get a lot more bang out of your system. EMC's fast cache looks very cool too from a conceptual perspective at least I've never used it and don't know anyone who has but I do wish 3PAR had it! Assuming I understand the technology right, with the key being the SSDs are used for both read and write caching. Verses something like the NetApp PAM card which is only a read cache. Neither Fast cache nor PAM is enough to make we want to use those platforms for my own stuff.

The exec goes on to say

Simply put, Whitehurst's answer to his own question is that IT vendors suck, and that the old model of delivering products to customers is fundamentally broken.

I would tend to agree for the most part but there are those out there that really are awesome. I was lucky enough to find one such vendor, and a few such manufacturers. As one vendor I deal with says they work with the customer not with the manufacturer, they work to give the customer what is best for them. So many vendors I have dealt with over the years are really lazy when it comes down to it, they only know a few select solutions from a few big name organizations and give blank stares if you go outside their realm of comfort (random thought: I got the image of Speed Bump: The roadkill possum from a really old TV series called Liquid Television that I watched on MTV for a brief time in the 90s).

By the same token while most IT vendors suck, most IT managers suck too, for the same reason. Probably because most people suck that may be what it comes down to it at the end of the day. IT as you well know is still an emerging industry, still a baby really evolving very quickly, but has a ways to go. So like with anything the people out there that can best leverage IT are few and far between. Most of the rest are clueless -- like my first CEO about 10-11 years ago was convinced he could replace me with a tech head from Fry's Electronics (despite my 3 managers telling him he could not). About a year after I left the company he did in fact hire such a person -- only problem was that individual never showed up for work (maybe he forgot).

Exec goes on to say..

"Functionality should be exploding and costs should be plummeting — and being a CIO, you should be a rock star and out on the golf course by 3 pm," quipped Whitehurst to his Interop audience.

That is in fact what is happening -- provided your choosing the right solutions, and have the right people to manage them, the possibilities are there, just most people don't realize it or don't have the capacity to evolve into what could be called the next generation of IT, they have been doing the same thing for so long, it's hard to change.

Speaking of being a rock star and out on the golf course by 3pm, I recall two things I've heard in the past year or so-

The first one used the golf course analogy, from a local VMware consulting shop that has a bunch of smart folks working for them I thought this was a really funny strategy and can see it working quite well in many cases - the person took an industry average of say 2-3 days to provision a new physical system, and said in the virtual world -- don't tell your customers that you can provision that new system in ten minutes, tell them it will take you 2-3 days, spend the ten minutes doing what you need and spend the rest of the time on the golf course.

The second one was from a 3PAR user I believe. Who told one of their internal customers/co-workers something along the lines of "You know how I tell you it takes me a day to provision your 10TB of storage? Well I lied, it only takes me about a minute".

For me, I'm really too honest I think, I tell people how long I think it will really take and at least on big projects am often too optimistic on time lines. Maybe I should take up Scotty's strategy and take my time lines and multiply them by four to look like a miracle worker when it gets done early. It might help to work with a project manager as well, I haven't had one for any IT projects in more than five years now. They know how to manage time (if you have a good one, especially one experienced with IT not just a generic PM).

Lastly the exec says

The key to unlocking the value of clouds is open standards for cloud interoperability, says Whitehurst, as well as standardization up and down the stack to simplify how applications are deployed. Red Hat's research calculates that about two-thirds of a programmer's time is spent worrying about how the program will be deployed rather than on the actual coding of the program.

Worrying about how the program will be deployed is a good thing, an absolutely good thing. Rewinding again to 2004 I remember a company meeting where one of the heads of the company stood up and said something along the lines of 2004 was the year of operations, we worked hard to improve how the product operates, and the next phase is going back to feature work for customers. I couldn't believe my ears, that year was the worst for operations, filled with half implemented software solutions that actually made things worse instead of better, outages increased, stress increased, turnover increased.

The only thing I could do from an operations perspective and buy a crap load of hardware and partition the application to make it easier to manage. We ended up with tons of excess capacity but the development teams were obviouslly unable to make the design changes we needed to improve the operations of the application, but we at least had something that was more manageable, the deployment and troubleshooting teams were so happy when the new stuff was put into production, no longer did they have to try to parse gigabyte sized log files trying to find which errors belong to which transactions from which subsystem. Traffic for different subsystems was routed to different physical systems so you knew if there was an issue with one type of process you go to server farm X to look at it, problem resolution was significantly faster.

I remember having one conversation with a software architect in early 2005 about a particular subsystem that was very poorly implemented (or maybe even designed), it caused us massive headaches in operations, non stop problems really. His response was Well I invited you to a architecture meeting in January of 2004 to talk about this but you never showed up. I don't remember the invite but if I saw it I know why I didn't show up, it's because I was buried in production outages 24/7 and had no time to think more than 24 hours ahead yet alone think about a software feature that was months away from deployment. Just didn't have the capacity, was running on fumes for more than a year.

So yes, if you are a developer please do worry about how it is deployed, never stop worrying. Consult your operations team (assuming they are worth anything), and hopefully you can get a solid solution out the door. If you have a good experienced operations team then it's very likely they know a lot more about running production than you do and can provide some good insight into what would provide the best performance and uptime from an operations perspective. They may be simple changes, or not.

One such example, I was working at a now defunct company who had a hard on for Ruby on Rails. They were developing app after app on this shiny new platform. They were seemingly trying to follow Services Oriented Architecture (SOA), something I learned about ironically at a Red Hat conference a few years ago (didn't know there was a acronym for that sort of thing it seemed so obvious). I had a couple, really simple suggestions for them to take into account for how we would deploy these new apps. Their original intentions called for basically everything running under a single apache instance(across multiple systems), and for example if Service A wanted to talk to Service B then it would talk to that service on the same server. My suggestions which we went with involved two simple concepts:

  • Each application had it's own apache instance, listening on it's own port
  • Each application lived behind a load balancer virtual IP with associated health checking, with all application-to-application communication flowing through the load balancer

Towards the end we had upwards of I'd say 15 of these apps running on a small collection of servers.

The benefits are pretty obvious, but the developers weren't versed in operations -- which is totally fine they don't need to be (though it can be great when they are, I've worked with a few such people though they are VERY RARE) that's what operations people do and you should involve them in your development process.

As for cloud standards -- folks are busy building those as we speak and type. VMware seems to be the furthest along from an infrastructure cloud perspective I believe, I wouldn't expect them to lose their leadership position anytime soon they have an enormous amount of momentum behind them, and it takes a lot to counter that momentum.

About a year ago I was talking to some former co-workers who told me another funny story they were launching a new version of software to production, the software had been crashing their test environments daily for about a month. They had a go no-go meeting in which everyone involved with the product said NO GO. But management overrode them, and they deployed it anyways. The result? A roughly 14 hour production outage while they tried to roll the software back. I laughed and said, things really haven't changed since I left have they?

So the solutions are there, the software companies and hardware companies have been evolving their stuff for years, the problem is the concepts can become fairly complex when talking about things like capacity utilization and stranded resources, getting the right people in place to be able to not only find such solutions but deploy and manage them as well can really go a long ways, but those people are rare at this point.

I haven't been writing too much recently been really busy, Scott looks to be doing a good job so far though.

 

Tagged as: , Comments Off