TechOpsGuys.com Diggin' technology every day

30 Jul 2012

Super Micro out with mini UPSs for servers

TechOps Guy: Nate

It's been talked about for quite a while. I think Google was probably the first to widely deploy a battery with their servers, removing the need for larger batteries at the data center or rack level.

Next came Microsoft, in what I consider to be a better design (more efficient at least), with Google's apparently using AC power to the servers (though the pictures could well be outdated, who knows what they use now). Microsoft took the approach of rack-level UPSs and DC power to the servers.

I was at a Data Center Dynamics conference a couple of years back where a presenter talked about a similar topic, though he didn't use batteries; it was more along the lines of big capacitors (that had the risk of exploding, no less).

Anyways, I was wandering along and came across this, which seems really new. It goes beyond the notion that most power events last only two seconds and gives a server an internal battery capacity of anywhere from 30 seconds to 7 minutes depending on sizing and load.
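To get a feel for how "sizing and load" trade off against each other, here is a rough back-of-the-envelope sketch. It is not based on any published Super Micro spec: the function names, the 90% conversion efficiency, and the example capacity and load figures are all my own assumptions for illustration. Only the two-second event length comes from the paragraph above.

```python
def runtime_seconds(capacity_wh: float, load_w: float, efficiency: float = 0.9) -> float:
    """Estimate how long a battery module could carry a load.

    capacity_wh: usable battery energy in watt-hours (hypothetical figure).
    load_w:      server power draw in watts.
    efficiency:  assumed conversion efficiency, not a vendor number.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_wh = capacity_wh * efficiency
    # Wh divided by W gives hours; multiply by 3600 to get seconds.
    return usable_wh / load_w * 3600.0


def rides_through(capacity_wh: float, load_w: float,
                  event_seconds: float = 2.0, efficiency: float = 0.9) -> bool:
    """True if the estimated runtime covers a power event of the given length."""
    return runtime_seconds(capacity_wh, load_w, efficiency) >= event_seconds


if __name__ == "__main__":
    # Hypothetical module size and loads, purely for illustration.
    for load in (250, 500, 1000):
        secs = runtime_seconds(capacity_wh=30, load_w=load)
        print(f"{load} W load: roughly {secs:.0f} s of runtime")
```

The point is just that runtime scales inversely with load, so the same module that lasts minutes in a lightly loaded box may only last tens of seconds in a fully loaded one.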

It looks like a really innovative design and it's nice to see a commercial product in this space being brought to market. I'm sure you can get similar things from the depths of the bigger players if you're placing absolutely massive orders of servers, but for more normal folks I'm not aware of a similar technology being available.

These can be implemented in 1+1+1 (2 AC modules + 1 UPS Module), 1+2 (1 AC + 2 UPS @ 2000W) or 2+2 (2 AC + 2 UPS @ 2000W) configurations.

It does not appear to be an integrated PSU+battery, but rather a battery module that fits alongside a PSU, in place of what otherwise could be another PSU.

You may have issues running these battery units in 3rd party data centers. I don't see any integration for Emergency Power Off (EPO), and some facilities are picky about that kind of thing. I can imagine the look on some uninformed tech's face when they hit the EPO switch: the lights go out but hundreds or thousands of servers keep humming along. That would be a funny sight to see.

While I'm here I guess I should mention the FatTwin systems that they released a few weeks ago, equally innovative, at least compared to the competition in the space. Sort of puts the HP SL-series to shame, really. I don't think you'd want to run mission critical stuff on this gear, but for the market it's aimed at, which is HPC, web farms, Hadoop etc., they look efficient, flexible and very dense, quite a nice step up from their previous Twin systems.

It's been many years since I used Super Micro. I suppose the thing they have traditionally lacked more than anything else in my experience (which again isn't recent, maybe this is fixed) is better fault detection and reporting of memory errors, along the lines of HP's Advanced ECC or IBM's Chipkill (the damn thing was made for NASA, what more do you need!).

I recall some of the newer Intel chips have something similar in the newer chipsets, though the HP and IBM stuff is more CPU agnostic (e.g. supports AMD :) ). I don't know how the new Intel memory protection measures up to Advanced ECC / Chipkill. Note I didn't mention Dell - because Dell has no such technology either (they too rely on the newer Intel chips to provide that similar function, for their Intel boxes at least).

The other aspect is when a memory error is reported on an HP system for example (at least one of the better ones, 300-series and above) - typically a little LED lights up next to the socket having errors, along with perhaps even a more advanced diagnostics panel on the system to show which socket has issues before you even open it up. Since memory errors were far and away the #1 issue I had when I had Super Micro systems, these features became sorely missed very quickly. Another issue was remote management, but they have addressed this to some extent in their newer KVM management modules (now that I think about it, the server that powers this blog is a somewhat recent Supermicro with KVM management - but from a company/work/professional perspective it's been a while since I used them).

27 Jul 2012

Microsoft Licenses Linux to Amdocs

TechOps Guy: Nate

Microsoft has been fairly successful in strong-arming licensing fees from various Android makers, though less successful in getting fees directly from operators of Linux servers.

It seems one large company, Amdocs, has caved in though.

The patent agreement provides mutual access to each company’s patent portfolio, including a license under Microsoft’s patent portfolio covering Amdocs’ use of Linux-based servers in its data centers.

I almost worked for Amdocs way back in the day. A company I was at was acquired by them, I want to say less than two months after I left the company. Fortunately I still had the ability to go back and buy my remaining stock options and got a little payout from it. One of my former co-workers said that I walked away from a lot of money. I don't know how much he got but he assured me he spent it quickly and was broke once again! I don't know many folks at the company still, since I left it many years ago, but everything I heard sounds like the company turned out to be as bad as I expected. I don't think I would have been able to put up with the politics or red tape for the retention periods following the acquisition, as it was already bad enough to drive me away from the company before they were officially acquired.

I am not really surprised Amdocs licensed Linux from Microsoft. I was told an interesting story a few years ago about the same company. They were a customer of Red Hat for Enterprise Linux, and Oracle enticed them to switch to Oracle Enterprise Linux for half the cost they were paying Red Hat. So they opted to switch.

The approval process had to go through something like a dozen layers in order to get processed, and at one point it ended up on the desk of the head legal guy at Amdocs corporate. He quickly sent an email to the new company they had acquired about a year earlier, saying that the use of Linux or any open source software was forbidden and they had to immediately shut down any Linux systems they had. If I recall right this was on a day before a holiday weekend. My former company was sort of stunned and laughed a bit. They had to send another letter up the chain of command, which I think reached the CEO of the big parent or the person immediately below the CEO, who went to the lawyer and said they couldn't shut down their Linux systems because all of the business flowed through Linux, and they weren't about to shut down the business on a holiday weekend. Well, that and the thought of migrating to a new platform so quickly was sort of out of the question given all the other issues going on at the time.

So they got a special exclusion to run Linux and some other open source software, which I assume is still run to this day. It was the first of three companies (in a row no less) that I worked at that started out as Microsoft shops, then converted to Linux (in all three cases I was hired on a minimum of 6-12 months after they made the switch).

Another thing the big parent did when they came over to take over the corporate office was re-wire everything into secure and insecure networks. The local Linux systems were not allowed on the secure network, only the insecure one (and they couldn't do things like check email from the insecure network). They tried re-wiring it over a weekend and if I recall right they were still having problems a week later.

Fun times I had at that company. I like to tell people I took 15 years of experience and compressed it into three, which given some of the resumes I have come across recently, 15 years may not be long enough. It was a place of endless opportunity, and endless work hours. I'd do it again if I could go back, I don't regret it, though it came at a very high personal cost which took literally a good five years to recover from fully after I left (I'm sure some of you know the feeling).

I wouldn't repeat the experience again though - I'm no longer willing to put up with outages that last for 10+ hours (had a couple that lasted more than 24 hours) or work weeks that extend into the 100 hour range with no end in sight. If I could go back in time and tell myself whether or not to do it - I'd say do it, but having gone through that, I would not accept a position at a company today to repeat the experience - just not worth it. A few years ago some of the execs from that company started a new company in a similar market and tried to recruit a bunch of us former employees, pitching the idea "it'll be like the good ol' days". They didn't realize how much of a turn off that was to so many of us!

I'd be willing to bet the vast majority of Linux software at Amdocs is run by the company I was at. At last check I was told it was in the area of 2,000 systems (all of which ran in VMware) - and they had switched back to Red Hat Enterprise Linux again.

27 Jul 2012

FCoE: stillborn 3 years later

TechOps Guy: Nate

At least the hype for the most part has died off, as the market has not really transitioned much over to FCoE since its launch a few years ago. I mentioned it last year, and griped about it in one of my early posts in 2009 around when Cisco was launching their UCS, with Cisco and NetApp both proclaiming FCoE was going to take over.

Brocade has been saying for some time that FCoE adoption was lacking, a few short months ago Emulex came out and said about the same, and more recently Qlogic chimed in with another me-too story.

FCoE – the emulated version of Fibre Channel running over Ethernet – is not exactly selling like hot cakes and is not likely to do so anytime soon, so all that FCoE-flavoured Ethernet development is not paying off yet.

More and more switches out there are supporting the Data Center Bridging protocols but those die hard Fibre Channel users aren't showing much interest in it. I imagine the problem is more political than anything else at many larger organizations. The storage group doesn't trust the networking group and would rather have control over their own storage network, and not share anything with the network group. I've talked to several folks over recent years where storage divisions won't even consider something that is exclusively iSCSI, for example, because it means the networking folks have to get involved and that's not acceptable. Myself, I have had a rash of issues with certain Qlogic 10GbE network cards over the past 7 months, which makes me really glad I'm not reliant on Ethernet-based storage (there is some of it, but all of the critical stuff is good ol' Fibre Channel - on entirely Qlogic infrastructure again). The rash of issues finally resurrected a bad set of memories I had of trying to troubleshoot network issues on some Broadcom NICs a few years ago, with regards to something buggy called MSI-X. It took about six months to track that problem down, the symptoms were just so bizarre. My current issues with 10GbE NICs aren't all that critical because of the level of redundancy that I have and the fact that storage is run over regular ol' FC.

I know Qlogic is not alone in their issues with 10GbE. A little company by the name of Clearwire in Seattle, I know, had what amounted to something like a 14 hour outage a year or two ago on their Cisco UCS platform because of bugs in the Cisco stuff that they had (I think it was bugs around link flapping or something). I know others have had issues too. It sort of surprises me how long 10GbE has been around and we still seem to have quite a few issues with it, at least on the HBA side.

iSCSI has had its issues too over the years, at least iSCSI in the HBAs. I was talking to one storage company late last year who has an iSCSI-only product and they said how iSCSI is ready for prime time, but after further discussion they clarified: well, you really only should use it with offloading NIC X or Y or software stack Z. iSCSI was a weak point for a long time on the 3PAR platform. They've addressed it to some extent on the new V-series, but I wouldn't be surprised if they still don't support anything other than pure software initiators.

TCP is very forgiving of networking issues; storage of course is not. In the current world of virtualization, with people consolidating things on fewer, larger systems, the added cost of FC really isn't that much. I wouldn't be slapping FC cards in swaths of $3-5k servers, but most servers that run VMs have gobs of memory, which of course drives the price quite a bit higher than that.

Data center bridging really does nothing when your NIC decides to stop forwarding jumbo frame packets, or when the link starts flapping, or when the firmware crashes, or if the ASIC overheats. The amount of time it often takes for software to detect a problem with the link and fail over to a backup link is alone big enough to cause major issues with storage if it's a regular occurrence. All of the networks I've worked on, at least in the past decade or so, have always operated at a tiny fraction of their capacity. The bottlenecks are typically things like firewalls between zones (and whenever possible I prefer to rely on switch ACLs to handle that).

11 Jul 2012

Tree Hugging SFO stops buying Apple

TechOps Guy: Nate

I saw this headline over on Slashdot just now and couldn't help but laugh. Following Apple's withdrawal from an environmental group, the city of San Francisco - pretty much in Apple's back yard - is going to stop buying Macs because of it. I imagine they will have to not buy iPads or iPhones too (assuming they were buying any to begin with) since they are just as integrated as the latest Mac laptops.

Apparently the tightly integrated devices are too difficult to recycle to be compliant, so rather than make the devices compliant Apple goes their own way.

I don't care either way myself, but I can just see the conflict within the hardcore environmentalists who seem to, almost universally from what I've seen anyways, adopt Apple products across the board. For me it's really funny at least.

It is an interesting choice though, given Apple's recent move to make one of their new data centers much more green by installing tons of extra solar capacity. On the one hand the devices are not green, but on the other hand the cloud that powers them is. But you can't use the cloud unless you use the devices, what is an environmentalist to do?!

I suppose the question remains - given many organizations have bans on equipment that is not certified by this environmental standards body - once these bans become more widespread, how long is it until some of them cave to their own internal politics and the withdrawal some of their users go through from not being able to use Apple? I wonder if some may try to skirt the issue by implementing BYOD and allowing users to expense their devices.

Speaking of environmental stuff, I came across this interesting article on The Register a couple weeks ago, which talks about how futile it is to try to save power by unplugging your devices - the often talked about power drain as a result of standby mode. The key takeaway from that story for me was this:

Remember: skipping one bath or shower saves as much energy as switching off a typical gadget at the wall for a year.

In the comments of the story one person wrote how his girlfriend or wife would warm up the shower for 4-5 minutes before getting in. The same person wanted to unplug their gadgets to save power. But she didn't want to NOT warm up the shower. That obviously wasted a ton more energy than anything that could be saved by unplugging their gadgets. For me, the apartment I live in now has some sort of centralized water heater (first one I've ever seen in a multi-home complex). All of my previous places have had dedicated water heaters. So WHEN the hot water works (I've had more outages of hot water in the past year than I have in the previous 20), the shower warms up in about 30-45 seconds.

So if you want to save some energy, take a cold shower once in a while - or skip a shower once in a while. Or if you're like Kramer and take 60+ minute showers, cut it to less time (for him it seems even 27 minutes wasn't long enough). If you really want to save some energy, have fewer children.

I'm leaving on my road trip to Seattle tomorrow morning, going to drive the coast from the Bay Area to Crescent City, then cut across to Grants Pass, Oregon before stopping for the night. Then take I-5 up to Bellevue on Friday so I can make it in time for Cowgirls that night. Going to take a bunch of pictures with my new camera and test my car out on those roads. I made a quicker trip down south last Sunday - drove the coast to near LA and got some pretty neat pictures there too. I drove back on the 4th of July (started at around 5PM from Escondido, CA), and for the first time ever, for me at least, there was NO TRAFFIC. I drove all the way through LA and never really got below 50-60MPH. I was really shocked, even given the holiday. I drove through LA on Christmas Eve last year and still hit a ton of traffic then.

9 Jul 2012

Amazon outages from a Datacenter Perspective

TechOps Guy: Nate

I just came across this blog post ("Cloud Infrastructure Might be Boring, but Data Center Infrastructure Is Hard"), and the author spent a decent amount of time ripping into Amazon from a data center operations perspective -

But on the facilities front, it’s hard to see how the month of June was anything short of a disaster for Amazon on the data center operations side.

Also covered are past outages, and the author concludes that Amazon lacks discipline in operating their facilities, as a chain of outages over the past few years illustrates:

[..]since all of them can be traced back to a lack of discipline in the operation of the data centers in question.

[..]I wish they would just ditch the US East-1 data center that keeps giving them problems.  Of course the vast, vast majority of AWS instances are located there, so that may involve acquiring more floor space.

Sort of reminds me of when Internap had their massive outage and then followed up by offering basically free migration to their new data center for any customer that wanted it - so many opted for it that they ran out of space pretty quick (though I'm sure they have since provisioned tons more space, since the new facility had the physical capacity to handle everyone + lots more once fully equipped).

This goes back to my post where I ripped into them from a customer perspective, the whole built to fail model. For Amazon it doesn't matter if a data center goes offline; they have the capacity to take the hit elsewhere and global DNS will move the load over in a matter of seconds. Most of their customers don't do that (mainly because it's really expensive and complex - did you happen to notice there's really no help for customers that want to replicate data or configuration between EC2 Regions?). As I tried to point out before, at anything other than massive scale it's far more cost effective (and orders of magnitude simpler) for the vast majority of the applications and workloads out there to have the redundancy in the infrastructure (and of course the operational ability to run the facilities properly) to handle those sorts of events.

Though I'd argue with the author on one point - cloud infrastructure is hard. (Updated, since the author said it was boring rather than easy; my brain interpreted it as one is hard so the other must not be, for whatever reason :) ) Utility infrastructure is easy but true cloud infrastructure is hard. The main difference is the self service aspect of things. There are a lot of different software offerings trying to offer some sort of self service or another, but for the most part they still seem pretty limited or lack maturity (and in some cases are really costly). It's interesting to see the discussions about OpenStack for example - not a product I'd encourage anyone to use in house just yet unless you have developer resources that can help keep it running.