TechOpsGuys.com Diggin' technology every day

August 20, 2012

The Screwballs have Spoken

Filed under: Datacenter — Tags: — Nate @ 2:07 pm

Just got this link from Gabriel (thanks!) – it seems the screwball VMware community has spoken, VMware listened, and they are going to ditch the controversial vRAM licensing they introduced last year.

In its upcoming release of vSphere 5.1, VMware is getting rid of vRAM entitlements, which debuted with vSphere 5 and determine how much memory customers are permitted to allocate to virtual machines on the host, according to sources familiar with VMware’s plans.

I tried to be a vocal opponent of this strategy and firmly believed it was going to hurt VMware. I haven't seen any hard numbers on the uptake of vSphere 5, but there have been hints that it has not been as fast as VMware had hoped.

I had a meeting with a VMware rep about a year ago and complained about this very issue for at least 30 minutes but it was like talking to a brick wall. I was told recently that the rep in question isn’t with the company anymore.

I have little doubt that VMware was forced into this change because of slow uptake and customers outright switching to other platforms. They tried to see how much leverage they had over customers and realized they don't have as much as they thought.

Now the question is whether they will repeat the mistake again in the future. Myself, I am pretty excited to hear that Red Hat is productizing OpenStack along with RHEV – that really looks like it has a lot of potential (everything I see today about OpenStack says steer clear unless you have some decent in-house development resources). I don't have any spare gear to be able to play with this stuff on at the moment.

Thanks, VMware, for coming to your senses. The harsh feelings are still there though – can I trust you again after what you tried to pull? Time will tell I guess.

(In case you’re wondering where I got the title of this post from it’s from here.)

Marge gets to make her concluding statement, in which she asks all concerned parents to write to I&S and express their feelings. In his office, Mr. Meyers goes through the tons of angry mail he's received... "The screwballs have spoken..."

August 13, 2012

Freakish performance with Site to Site VPN

Filed under: Networking — Tags: — Nate @ 6:07 pm

UPDATED I'll be the first to admit – I'm not a network engineer. I do know networking, and can do the basics of switching, static routing, load balancing, firewalls, etc. But it's not my primary background. I suppose you could call me a network engineer if you base my talents off of some of the past network engineers I've worked with (which is kinda sad really).

I've used quite a few different VPNs over the years, all of them without any special WAN optimization, though the last "appliance" based VPN I was directly responsible for was a VPN between two sites connected by Cisco PIXs about 4-5 years ago. Since then my VPN experience has been limited to either using OpenVPN on my own personal stuff or relying on other dedicated network engineer(s) to manage it.

In general, my experience has been that site to site VPN performance roughly equates to Internet performance. You may get some benefit from TCP compression and the like, but without specialized WAN optimization / protocol optimization / caching etc., throughput is limited by latency.

I conferred with a colleague on this and his experience was similar – he expects site to site VPN performance to about match that of Internet site to site performance when no fancy WAN Opt is in use.

So imagine my surprise when, a few weeks ago, I hooked up a site to site VPN between Atlanta and Amsterdam (~95ms of latency between the two) and got a 10-30 fold improvement in throughput over the VPN compared to the raw Internet.

  • Internet performance = ~600-700 Kilobytes/second sustained using HTTPS
  • Site to site VPN performance = ~5 Megabytes/second using NFS, ~12 Megabytes/second sustained using SCP, and 20 Megabytes/second sustained using HTTP
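
A quick sanity check on those numbers (nothing fancy, just the measured figures above divided out in a few lines of Python) backs up the 10-30 fold claim:

# Ratio of VPN throughput to raw Internet throughput, using the numbers above.
internet_kb_s = 650.0                                # ~600-700 KB/sec over HTTPS
vpn_mb_s = {"NFS": 5.0, "SCP": 12.0, "HTTP": 20.0}   # MB/sec over the VPN

for proto, mb_s in vpn_mb_s.items():
    print(proto, round(mb_s * 1024 / internet_kb_s, 1), "times faster")
# NFS ~7.9x, SCP ~18.9x, HTTP ~31.5x - roughly the 10-30 fold range above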

The links on each end of the connection are 1Gbps – a tier 1 ISP on the Atlanta side and, I would guesstimate, a tier 2 ISP (with tons of peering connections) on the Amsterdam side.

It's possible the performance could be even higher – I noticed that speed continued to increase the longer the transfer was running. My initial tests were limited to ~700MB files – 46.6 seconds for a 697MB file with SCP, which averages out to roughly 15 Megabytes/second. Towards the end of the SCP it was running at ~17MB/sec (at the beginning only 2MB/sec).

A network engineer who I believe is probably quite a bit better than me told me:

By my calculation – the max for a non-jittered 95ms connection is about 690KB/s so it looks like you already have a clean socket.
Keep in mind that bandwidth does not matter at this point since latency dictates the round-trip-time.

I don’t know what sort of calculation was done, but the throughput matches what I see on the raw Internet.
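
My guess – and it is just a guess – is that the 690KB/s figure is the classic single-stream TCP limit of one window per round trip, assuming a default 64KB window with no window scaling. A rough check of that math, and of what kind of window the VPN numbers would imply:

# Single-stream TCP is limited to roughly one window per round trip.
rtt = 0.095                   # seconds, Atlanta <-> Amsterdam
window = 64 * 1024            # bytes, classic TCP window without window scaling

print(window / rtt)           # ~690,000 bytes/sec - matches the engineer's ~690KB/s
                              # and the ~600-700KB/s I see on the raw Internet

# To sustain the ~12 MB/sec I saw over the VPN at the same RTT, the effective
# window (bytes in flight) would need to be on the order of:
print(12 * 1000 * 1000 * rtt) # ~1,140,000 bytes - over a megabyte in flight

So whatever the Sonic Walls are doing inside the tunnel, the effect is the same as keeping well over a megabyte in flight at a time on a single stream.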

These are all single threaded transfers. Real basic tests. In all cases the files being copied are highly compressed (in my main test case the 697MB file uncompresses to 14GB), and in the case of the SCP test the data stream is encrypted as well. I've done multiple tests over several weeks and the data is consistent.

It really blew my mind – even with fancy WAN optimization I would not have expected this sort of performance from something like SCP. Obviously they are doing some really good TCP windowing and other optimizations; despite there still being ~95ms of latency between the two sites within the VPN itself, the throughput is just amazing.

I opened a support ticket to try to get support to explain to me what more was going on, but they couldn't answer the question. They said that because there are no hops in the VPN it's faster. There may be no hops, but there's still 95ms of latency between the systems even within the VPN.

I mean, just a few years ago I wrote a fancy distributed multi threaded file replication system for a company I was at, to try to get around the limits on throughput between our regional sites caused by latency. I could have saved myself a bunch of work had we known at the time (and we had a dedicated network engineer then) that this sort of performance was possible without really high end gear or specialized protocols (I was using rsync over HPN-SSH). I remember trying to set up OpenVPN between two sites at that company for a while to test throughput there, and performance was really terrible (much worse than the existing Cisco VPN we had on the same connection). For a while we had Cisco PIXs or ASAs – I don't recall which – that had a 100Mbit limit on throughput; we tapped them out pretty quickly and had to move on to something faster.

I ran a similar test between Atlanta and Los Angeles, where the VPN endpoint in Los Angeles was a Cisco ASA (vs the Sonic Walls at the other sites), and the performance was high there too – I'm not sure what the link speed is in Los Angeles, but throughput was around 8 Megabytes/second for a compressed/encrypted data stream, easily 8-10x faster than over the raw Internet. I tested another VPN link between a pair of Cisco firewalls and their performance was the same as the raw Internet (15ms of latency between the two); I think the link was saturated in those tests (not my link so I couldn't check it directly at the time).

I’m sure if I dug into the raw tcp packets the secrets would be there – but really even after doing all the networking stuff I have been doing for the past decade+ I still can’t make heads or tails of 90% of the stuff that is in a packet (I haven’t tried to either, hasn’t been a priority of mine, not something that really interests me).

But sustaining 10+ megabytes/second over a 95 millisecond link over 9 internet routers on a highly compressed and encrypted data stream without any special WAN optimization package is just amazing to me.

Maybe this is common technology now, I don't know – I'd sort of expect marketing information to advertise this kind of thing. If you can get 10-30x faster throughput over a VPN without high end WAN optimization vs the regular Internet, I'd be really interested in that technology. If you've seen similar massive increases in performance without special WAN optimization on a site to site VPN I'd be interested to hear about it.

In this particular case, the products I’m using are Sonic Wall NSA3500s. The only special feature licensed is high availability, other than that it’s a basic pair of units on each end of the connection. (WAN Optimization is a software option but is NOT licensed). These are my first Sonic Walls, I had some friends trying to push me to use Juniper (SRX I think) or in one case Palo Alto networks, but Juniper is far too complicated for my needs, and Palo Alto networks is not suitable for Site to Site VPNs with their cost structure (the quote I had for 4 devices was something like $60k). So I researched a few other players and met with Sonic Wall about a year ago and was satisfied with their pitch and verified some of their claims with some other folks, and settled on them. So far it’s been a good experience, very easy to manage, and I’m still just shocked by this throughput. I really had terrible experiences managing those Cisco PIXs a few years back by contrast. OpenVPN is a real pain as well (once it’s up and going it’s alright, configuring and troubleshooting are a bitch).

Sonic Wall claimed they were the only ones (second to Palo Alto Networks) who had true deep packet inspection in their firewalls (vs having other devices do the work). That claim interested me, as I am not well versed in the space. I bounced the claim off a friend that I trust (who knows Palo Alto inside and out) and he said it was probably true – Palo Alto's technology is better (fewer false positives), but nobody else offers that tech. Not that I need that tech – this is for a VPN – but it was nice to know we have the option to use it in the future. Sonic Wall's claims go beyond that as well, saying they are better than Palo Alto in some cases due to size limitations on Palo Alto's side (not sure if that is still true or not).

Going far beyond simple stateful inspection, the Dell® SonicWALL® Reassembly-Free Deep Packet Inspection™ (RFDPI) engine scans against multiple application types and protocols to ensure your network is protected from internal and external attacks as well as application vulnerabilities. Unlike other scanning engines, the RFDPI engine is not limited by file size or the amount of concurrent traffic it can scan, making our solutions second to none.

SonicWall Architecture - looked neat, adds some needed color to this page

The packet capture ability of the devices is really nice too – it makes it very easy to troubleshoot connections. In the past, on Cisco devices at least, I recall having to put the device in some sort of debug mode and it would spew stuff to the console (my Cisco experience is not current of course). With these Sonic Walls I can set up filters really easily to capture packets, and it shows them in a nice UI and I can export the data to Wireshark or plain text if needed.

My main complaint on these Sonic Walls, I guess, is they don't support link aggregation (some other models do though). Not that I need it for performance – I wanted it for reliability, so that if a switch fails the Sonic Wall can stay connected and not trigger a failover there as well. As-is I had to configure them so each Sonic Wall is logically connected to a single switch (though they have physical connections to both – I learned of the limitation after I wired them up). Not that failures happen often of course, but it's too bad this isn't supported in this model (which has 6x1Gbps ports on it).

The ONLY thing I've done on these Sonic Walls is VPN (site to site mainly, but I have done some SSL VPN stuff too), so beyond that I don't know how well they work. Sonic Wall traditionally has had a "SOHO" feel to it, though it seems in recent years they have tried to shake this off, with their high end reaching as high as 240 Gbps in an active-active cluster. Nothing to sneeze at.

UPDATE – I ran another test, and this time I captured a sample of the CPU usage on the Sonic Wall as well as the raw internet throughput as reported by my router, I mean switch, yeah switch.

A 2,784MB gzip'd file copied in 3 minutes 35 seconds using SCP. If my math is right that comes to an average of roughly 12.94 Megabytes/second? This is for a single stream, basic file copy.
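
For what it's worth, the math does check out – a quick check using the numbers from this test:

# Sanity check on the SCP transfer above.
size_mb = 2784.0                 # gzip'd file size in MB
secs = 3 * 60 + 35               # 3 minutes 35 seconds = 215 seconds

mb_per_sec = size_mb / secs
print(round(mb_per_sec, 2))      # ~12.95 MB/sec average for a single stream
print(round(mb_per_sec * 8, 1))  # ~103.6 Mbit/sec of payload

That ~104 Mbit/sec average is also in the right ballpark of the raw VPN throughput the switch reported (below), once you account for SSH/IPsec overhead and the slower ramp-up at the start of the transfer.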

The firewall has a quad core 550 MHz Mips64 Octeon Processor (I assume it’s quad core and not four individual processors). CPU usage snapshot here:

SonicWall CPU usage snapshot across cores during big file xfer

The highest I saw was CPU core #1 going to about 45% usage, with core #2 at maybe 35% and core #3 around 20%, with core #0 idle (maybe that one is reserved for management, given its low usage during the test... not sure).

Raw network throughput topped out at 135.6 Megabits/second (well, some of that was other traffic, so call it 130 Megabits for the VPN itself).

Raw internet throughput for VPN file transfer

Apparently this post found its way to Dell themselves and they were pretty happy to see it. I'm sorry, I just can't get over how bitchin' fast this thing is! I'd love for someone at Dell/SonicWALL who knows more than the tier 1 support person I talked with a few weeks ago to explain it better.

August 7, 2012

Adventures with vCenter, Windows and expired Oracle passwords

Filed under: General — Tags: — Nate @ 7:39 pm

Today is a day I wish I could have back – it was pretty much a waste/wash.

I’m not a windows person by trade of course, but I did have an interesting experience today. I write this in the hopes that perhaps it can save someone else the same pain.

Last night I kicked off some Windows updates on a vCenter server – I've done it a bunch of times before and never had an issue. There were only about 6-10 updates to install. It installed them, then rebooted, and was taking a really long time to complete the post-install stuff; after about 30 minutes I gave up and went home. It has always come back when it's done.

I forgot about it until this morning when I went to go do stuff with vCenter and could not connect. Then I tried to remote desktop into the system and could not (TCP port not listening). So I resorted to logging in via the VMware console. Tried resetting remote desktop to no avail. I went to the control panel to check on Windows Update, and the Windows Update control panel just hung. I went to the 'add/remove programs' thing to roll back some updates and it hung while looking for the updates.

I tried firing up IE9, and it didn't launch – it just spun an hourglass for a few seconds and stopped. I scoured the event logs and there was really nothing there – no errors. I was convinced at this point an OS update had gone wrong; I mean, why else would something like IE break? There was an IE update as part of the updates that were installed last night, after all.

After some searches I saw people commenting on how some new version of Flash was causing IE to break, so I went to remove Flash (I forget why it was installed but there was a reason at the time), and could not. In fact I could not uninstall anything – it just gave me a generic message saying something along the lines of "wait for the system to complete the process before uninstalling this".

I came across a Windows tool called the System Update Readiness Tool which sounded promising as well. I was unable to launch IE of course; I did have Firefox and could load the web page, but was unable to download the software without Firefox hanging (!?). I managed to download it on another computer and copy it over the network to the affected server's HD. But when I tried to launch it – sure enough, it hung too almost immediately.

Rebooting didn't help; shutting down completely and starting up again – no luck, same behavior. After consulting with the IT manager, who spends a lot more time in Windows than me, we booted to safe mode – it came right up. Windows Update is not available in safe mode and most services were not started, but I was able to get in and uninstall the hotfix for IE. I rebooted again.

At some point along the line I got the system to where I could remote desktop in, Windows Update looked OK, IE loaded, etc. I called the IT manager over to show him, and decided to reboot to make sure it was OK – only to have it break on me again.

I sat at the post-install screen for the patches (Stage 3 of 3, 0%) for about 30 minutes. At this point I figured I'd better start getting prepared to install another vCenter server, so I started that process in parallel, talked a bit with HP/VMware support, and shut off the VM again and rebooted – no difference, it just sat there. So I rebooted again into safe mode, removed the rest of the patches that were installed last night, rebooted again into normal mode, and must have waited 45 minutes or so for the system to boot – it did boot eventually and got past that updates screen. But the system was still not working right: vCenter was hanging and I could not remote desktop in.

About 30 minutes after the system booted I was able to remote desktop in again – not sure why. I kept poking around, not making much progress. I decided to take a VM snapshot (I had not taken one originally, though in the grand scheme of things it wouldn't have helped), re-install those patches again, and let the system work through whatever it had to work through.

So I did that, and the system was still wonky.

I looked and looked – vCenter still hanging, nothing in the event log and nothing in the vpx vCenter log other than stupid status messages like

2012-08-08T01:08:01.186+01:00 [04220 warning 'VpxProfiler' opID=SWI-a5fd1c93] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:12.535+01:00 [04220 warning 'VpxProfiler' opID=SWI-12d43ef2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:23.884+01:00 [04356 warning 'VpxProfiler' opID=SWI-f6f6f576] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:35.234+01:00 [04220 warning 'VpxProfiler' opID=SWI-a928e16] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:46.583+01:00 [04220 warning 'VpxProfiler' opID=SWI-729134b2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:57.932+01:00 [04328 warning 'VpxProfiler' opID=SWI-a395e0af] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:09.281+01:00 [04220 warning 'VpxProfiler' opID=SWI-928de6d2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:20.631+01:00 [04328 warning 'VpxProfiler' opID=SWI-7a5a8966] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:32.058+01:00 [04220 warning 'VpxProfiler' opID=SWI-524a7126] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:43.804+01:00 [04328 warning 'VpxProfiler' opID=SWI-140d23cf] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:55.551+01:00 [04356 warning 'VpxProfiler' opID=SWI-acadf68a] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:07.297+01:00 [04328 warning 'VpxProfiler' opID=SWI-e42316c] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:19.044+01:00 [04356 warning 'VpxProfiler' opID=SWI-3e976f5f] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:30.790+01:00 [04328 warning 'VpxProfiler' opID=SWI-2734f3ba] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms

No errors anywhere. I believe I looked at the Tomcat logs a few times and there were no logs for today.

Finally I dug into the Tomcat logs from last night and came across this –

Aug 6, 2012 11:27:30 PM com.vmware.vim.common.vdb.VdbODBCConfig isConnectableUrl
SEVERE: Unable to get a connection to: jdbc:oracle:thin:@//DB_SERVER:1521/DB_SERVER as username=VPXADMIN due to: ORA-28001: the password has expired

I had encountered a password expiry on my sys account a few weeks ago, but didn't really think much about it at the time. Anyways, I reset the password and vCenter was able to start. I disabled password expiry per this page, which says the defaults were changed in 11g and passwords do expire now (I have used Oracle 10g and a little 8i/9i and never recall having password expiry issues).
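
For anyone else hitting ORA-28001 under vCenter, the fix boils down to two statements against the database. Here is a minimal sketch using the cx_Oracle Python driver – the DSN comes straight from the JDBC URL in the log above, the passwords are obviously placeholders, and I'm assuming the vCenter repository user (VPXADMIN) sits on the DEFAULT profile; adjust for your setup, the underlying SQL is what matters:

import cx_Oracle

# Connect as a DBA; credentials/DSN are placeholders for your environment.
conn = cx_Oracle.connect(user="sys", password="changeme",
                         dsn="DB_SERVER:1521/DB_SERVER",
                         mode=cx_Oracle.SYSDBA)
cur = conn.cursor()

# See whether the vCenter repository account has expired.
cur.execute("SELECT username, account_status, expiry_date "
            "FROM dba_users WHERE username = 'VPXADMIN'")
print(cur.fetchall())

# Reset the expired password so vCenter can log in again...
cur.execute('ALTER USER vpxadmin IDENTIFIED BY "new_password_here"')

# ...and stop the 11g DEFAULT profile from expiring passwords going forward.
cur.execute("ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED")

conn.close()

The same two ALTER statements can of course be run from SQL*Plus; the profile change is what keeps this from happening again in another 180 days.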

I have had vCenter fail to start because of DB issues in the past – in fact, because vCenter does not properly release its locks on the Oracle DB when it shuts down, the easiest workaround is to restart Oracle whenever I reboot the vCenter server (since vCenter is the only thing on that Oracle DB it's just the simpler solution). When vCenter fails in this way it causes no issues for the rest of the OS – just an error message in the event log saying vCenter failed to start, with a helpful explanation as to why –

Unable to get exclusive access to vCenter repository.   Please check if another vCenter instance is running against the same database schema.

What gets me, even now, is how the hell did this expired password cascade into Internet Explorer breaking, remote desktop breaking, Windows Update breaking, etc.? My only guess is that vCenter was perhaps flooding the system with RPC messages, causing other things to break. Again – there was no evidence of any errors in the event log anywhere. I even called a friend who works at Microsoft and deploys hundreds of Windows servers for a living (he works as a lab manager), hoping he would have an idea. He said he had seen this behavior several times before but never tried to debug it – he just wiped the system out and reinstalled. I was close to doing that today, but fortunately I eventually found a solution, and I guess you could say I learned something in the process?

I don’t know.

I have not seriously used Windows since the NT4 days (I have used it casually on the desktop and in some server roles like this vCenter system). Why I stopped using it – well, there were many reasons; I suppose this was sort of a reminder. I'm not really up for moving to the Linux vCenter appliance yet, it seems beta-ish – if I ever get to move to that appliance before I upgrade to KVM (at some point, no rush). I have a very vague memory of experimenting one time on NT4, or maybe it was 3.51, where I decided to stop one or more of the RPC services to see what would happen. Havoc, of course. I noticed one of the services vCenter depends upon, the DCOM Server Process Launcher, seems of similar importance in Windows 2008, though 2008 smartly does not allow you to stop it. I chuckled when I saw that the recovery action for a failure of this service is Restart the Computer. But in this case the service was running… I looked for errors for it in the event log as well and there were none.

ESXi 5 Uptake still slow?

Filed under: General — Tags: — Nate @ 10:10 am

Just came across this article from our friends at The Register, and two things caught my eye –

HP is about to launch a new 2U quad socket system – the HP DL560 Gen8, which is what the article is about. I really can’t find any information on this server online, so it seems it is not yet officially announced. I came across this PDF from 2005, which says the 560 has existed in the past – though I never recall hearing about it and I’ve been using HP gear off and on since before that. Anyways, on the HP site the only 500-series systems I see are the 580 and 585, nothing new there.

HP has taken its sweet time joining the 4-socket 2U gang. I recall Sun was among the first, several years ago, with the Opteron; later Dell and others joined in, but HP was still bulky, with its only quad socket rack option being 4U.

The more interesting thing to me, though, was the lack of ESXi 5.0 results posted with VMware's own benchmark utilities. Of the 23 results posted since ESXi 5 was made generally available, only four are running on the newest hypervisor. I count six systems using ESX 4.1U2 and vCenter 5.0 (a combination I chose for my company's infrastructure). Note I said ESX – not ESXi. I looked at a couple of the disclosure documents and would expect them to specifically call out ESXi if that is in fact what was used.

So not only are they NOT using ESXi 5.0 but they aren’t even using ESXi period with these newest results (there is not a single ESXi 4.x system on the site as far as I can tell).

Myself, I find that fascinating. Why would they be testing with an older version of the hypervisor and not even using ESXi? I have my own reasons for preferring ESX over ESXi, but I'd really expect that for benchmark purposes they'd go with the lighter hypervisor. I mean, it takes significantly less time to install onto a system since it's so small.

I have to assume that they are using this configuration because it's what the bulk of their customers are still deploying today; otherwise it makes no sense to be testing the latest and greatest Intel processors on server hardware that's not even released yet, on an OS kernel that is going on three years old at this point. I thought there were supposed to be some decent performance boosts in ESXi 5?

I'm not really a fan of the VMmark benchmark itself – it seems rather confusing to interpret the results, there are no cost disclosures, and I suspect it only runs on VMware, making it difficult or impossible to compare with other hypervisors. Also the format of the results is not ideal; I'd like to see at least CPU/memory/storage benchmarks included so it's easier to tell how each subsystem performed. Testing brand X with processor Y and memory Z against brand W with processor Y and memory Z by itself doesn't seem very useful.

SPEC has another VM benchmark; it seems similarly confusing to interpret the results, though at least they have results for more than one hypervisor.

vSphere, aka ESX 4, really was revolutionary when it was released – it ditched the older 32-bit architecture for a more modern 64-bit one, and introduced a ton of new things as well.

I was totally underwhelmed by ESXi 5, even before the new licensing change was announced. I mean just compare What’s New between vSphere 4 and vSphere 5.

August 2, 2012

Losing $400M in a matter of minutes due to a software bug

Filed under: News — Tags: — Nate @ 9:52 am

This is pretty crazy. Yesterday morning a Wall St market maker had some bug(s) in their software platform that caused them to perform a ton of trades – as one reporter put it, around 300% of normal volume on the NYSE was being traded by this one company. As the story unfolded the company continued to say everything was normal, then they changed their story to "we're investigating", then they changed their story again to "a technology error occurred".

[..] Among other stocks, Protective Life (PL.N) had already traded more than 10 times its usual volume, and Juniper Networks JPNR.N has already seen six times its usual daily volume.

I bet they hoped the NYSE was going to reverse those trades that were triggered by whatever bug it was – but at the end of the day yesterday the NYSE opted not to reverse the vast majority of them.

The result is that the company – Knight Capital – lost more than $400 million on the trades and is now seeking alternative means to re-capitalize itself. Knight's stock has lost $600M in market cap (~67%) since the event; they had a billion dollar market cap as recently as 36 hours ago.

Yesterday the traders at Knight were obviously under a lot of stress, and they took great comfort in this video CNBC showed on air. It is quite a funny video – the folks at Knight kept asking to see it again and again.

I'm not sure if I've mentioned this here before, but this is as good a spot as any – a video from a couple months ago about how serious high frequency trading is, and the difference 11 milliseconds can make.

 

August 1, 2012

Oracle loses 2nd major recent legal battle

Filed under: General — Tags: — Nate @ 5:15 pm

Not long ago, Oracle lost the battle against Google’s Android, and now it seems they have lost the battle with HP on Itanium.

A California court has ruled that Oracle is contractually obligated to produce software for Hewlett-Packard’s Itanium-based servers and must continue to do so for as long as HP sells them.

That's quite a ruling – for as long as HP sells them. That could be a while! Though I think a lot of damage has already been done to Itanium; all of the uncertainty, I'm sure, prompted a bunch of customers to move to other platforms since they thought Oracle was gone. I suspect it won't stop either – I think customers will assume they'll get poor levels of support on Itanium because Oracle is being forced to do it kicking and screaming.

Couldn't have happened to a nicer company (even though I am a long time fan of Oracle DB itself...)

July 30, 2012

Super Micro out with mini UPSs for servers

Filed under: General — Tags: — Nate @ 9:56 pm

It's been talked about for quite a while – I think Google was probably the first to widely deploy a battery with their servers, removing the need for larger batteries at the data center or rack level.

Next came Microsoft with what I consider to be a better design (more efficient at least); Google's approach apparently uses AC power to the servers (though the pictures could well be outdated, who knows what they use now), while Microsoft took the approach of rack level UPSs and DC power to the servers.

I was at a Data Center Dynamics conference a couple years back where a presenter talked about a similar topic, though he didn't use batteries – it was more along the lines of big capacitors (that had the risk of exploding, no less).

Anyways, I was wandering along and came across this, which seems really new. It goes beyond the notion that most power events last only two seconds, and gives a server an internal battery capacity of anywhere from 30 seconds to 7 minutes depending on sizing and load.

It looks like a really innovative design and it's nice to see a commercial product in this space being brought to market. I'm sure you can get similar things from the depths of the bigger players if you're placing absolutely massive orders of servers, but for more normal folks I'm not aware of a similar technology being available.

These can be implemented in 1+1+1 (2 AC modules + 1 UPS Module), 1+2 (1 AC + 2 UPS @ 2000W) or 2+2 (2 AC + 2 UPS @ 2000W) configurations.

It does not appear to be an integrated PSU+battery, but rather a battery module that fits alongside a PSU, in place of what otherwise could be another PSU.

You may have issues running these battery units in 3rd party data centers – I don't see any integration for Emergency Power Off (EPO), and some facilities are picky about that kind of thing. I can imagine the look on some uninformed tech's face when they hit the EPO switch, the lights go out, but hundreds or thousands of servers keep humming along. That would be a funny sight to see.

While I'm here I guess I should mention the FatTwin systems that they released a few weeks ago – equally innovative compared to the competition in the space, at least. Sort of puts the HP SL-series to shame, really. I don't think you'd want to run mission critical stuff on this gear, but for the market it's aimed at – HPC, web farms, Hadoop, etc. – they look efficient, flexible and very dense; quite a nice step up from their previous Twin systems.

It's been many years since I used Super Micro. I suppose the thing they have traditionally lacked more than anything else, in my experience (which again isn't recent – maybe this is fixed), is better fault detection and reporting of memory errors, along the lines of HP's Advanced ECC or IBM's Chipkill (the damn thing was made for NASA, what more do you need!).

I recall some of the newer Intel chips have something similar in the newer chipsets, though the HP and IBM stuff is more CPU agnostic (e.g. supports AMD 🙂 ). I don't know how the new Intel memory protection measures up to Advanced ECC / Chipkill. Note I didn't mention Dell – because Dell has no such technology either (they too rely on the newer Intel chips to provide a similar function, for their Intel boxes at least).

The other aspect is that when a memory error is reported on an HP system, for example (at least one of the better ones, 300-series and above), typically a little LED lights up next to the socket having errors, along with perhaps even a more advanced diagnostics panel on the system to show which socket has issues before you even open it up. Since memory errors were far and away the #1 issue I had when I had Super Micro systems, these features became sorely missed very quickly. Another issue was remote management, but they have addressed this to some extent in their newer KVM management modules (now that I think about it, the server that powers this blog is a somewhat recent Supermicro with KVM management – but from a company/work/professional perspective it's been a while since I used them).

July 27, 2012

Microsoft Licenses Linux to Amdocs

Filed under: General — Tags: — Nate @ 3:00 pm

Microsoft has been fairly successful in strong-arming licensing fees out of various Android makers, though less successful in getting fees directly from operators of Linux servers.

It seems one large company, Amdocs, has caved in though.

The patent agreement provides mutual access to each company’s patent portfolio, including a license under Microsoft’s patent portfolio covering Amdocs’ use of Linux-based servers in its data centers.

I almost worked for Amdocs way back in the day. A company I was at was acquired by them, I want to say less than two months after I left the company. Fortunately I still had the ability to go back and buy my remaining stock options, and got a little payout from it. One of my former co-workers said that I walked away from a lot of money. I don't know how much he got, but he assured me he spent it quickly and was broke once again! I don't know many folks at the company still, since I left it many years ago, but everything I heard sounds like the company turned out to be as bad as I expected, and I don't think I would have been able to put up with the politics or red tape for the retention periods following the acquisition, as it was already bad enough to drive me away from the company before they were officially acquired.

I am not really surprised Amdocs licensed Linux from Microsoft. I was told an interesting story a few years ago about the same company. They were a customer of Red Hat for Enterprise Linux, and Oracle enticed them to switch to Oracle Enterprise Linux for half the cost they were paying Red Hat. So they opted to switch.

The approval process had to go through something like a dozen layers in order to get processed, and at one point it ended up on the desk of the head legal guy at Amdocs corporate. He quickly sent an email to the new company they had acquired about a year earlier saying that the use of Linux or any open source software was forbidden and they had to immediately shut down any Linux systems they had. If I recall right this was on the day before a holiday weekend. My former company was sort of stunned, laughed a bit, and had to send another letter up the chain of command – which I think reached the CEO of the big parent, or the person immediately below the CEO – who went to the lawyer and said they couldn't shut down their Linux systems because all of the business flowed through Linux, and they weren't about to shut down the business on a holiday weekend. Well, that and the thought of migrating to a new platform so quickly was sort of out of the question given all the other issues going on at the time.

So they got a special exclusion to run Linux and some other open source software, which I assume still runs to this day. It was the first of three companies (in a row, no less) that I worked at that started out as Microsoft shops, then converted to Linux (in all three cases I was hired a minimum of 6-12 months after they made the switch).

Another thing the big parent did when they came over to take over the corporate office was re-wire everything into secure and insecure networks. The local Linux systems were not allowed on the secure network, only the insecure one (and they couldn't do things like check email from the insecure network). They tried re-wiring it over a weekend, and if I recall right they were still having problems a week later.

Fun times I had at that company. I like to tell people I took 15 years of experience and compressed it into three – though given some of the resumes I have come across recently, 15 years may not be long enough. It was a place of endless opportunity, and endless work hours. I'd do it again if I could go back, I don't regret it, though it came at a very high personal cost which took literally a good five years to recover from fully after I left (I'm sure some of you know the feeling).

I wouldn't repeat the experience now though – I'm no longer willing to put up with outages that last 10+ hours (we had a couple that lasted more than 24 hours) or work weeks that extend into the 100 hour range with no end in sight. If I could go back in time and tell myself whether or not to do it, I'd say do it – but I would not accept a position at a company today, after having gone through that, to repeat the experience again. Just not worth it. A few years ago some of the execs from that company started a new company in a similar market and tried to recruit a bunch of us former employees by pitching the idea that "it'll be like the good 'ol days" – they didn't realize how much of a turn off that was to so many of us!

I'd be willing to bet the vast majority of Linux software at Amdocs is run by the company I was at – at last check I was told it was in the area of 2,000 systems (all of which ran in VMware), and they had switched back to Red Hat Enterprise again.

FCoE: stillborn 3 years later

Filed under: Networking — Tags: — Nate @ 9:44 am

At least the hype for the most part has died off, as the market has not really transitioned much over to FCoE since its launch a few years ago. I mentioned it last year, and griped about it in one of my early posts in 2009, around when Cisco was launching their UCS, with both Cisco and NetApp proclaiming FCoE was going to take over.

Brocade has been saying for some time that FCoE adoption was lacking, a few short months ago Emulex came out and said about the same, and more recently Qlogic chimed in with another me-too story.

FCoE – the emulated version of Fibre Channel running over Ethernet – is not exactly selling like hot cakes and is not likely to do so anytime soon, so all that FCoE-flavoured Ethernet development is not paying off yet.

More and more switches out there are supporting the Data Center Bridging protocols, but those die-hard Fibre Channel users aren't showing much interest in it. I imagine the problem is more political than anything else at many larger organizations. The storage group doesn't trust the networking group and would rather have control over their own storage network, and not share anything with the network group. I've talked to several folks over recent years where storage divisions won't even consider something that is exclusively iSCSI, for example, because it means the networking folks have to get involved and that's not acceptable. Myself, I have had a rash of issues with certain Qlogic 10GbE network cards over the past 7 months, which makes me really glad I'm not reliant on Ethernet-based storage (there is some of it, but all of the critical stuff is good 'ol Fibre Channel – on entirely Qlogic infrastructure again). The rash of issues finally resurrected a bad set of memories I had trying to troubleshoot network issues on some Broadcom NICs a few years ago with regards to something buggy called MSI-X. It took about six months to track that problem down – the symptoms were just so bizarre. My current issues with 10GbE NICs aren't all that critical because of the level of redundancy that I have and the fact that storage is run over regular 'ol FC.

I know Qlogic is not alone in their issues with 10GbE – a little company by the name of Clearwire in Seattle had what amounted to something like a 14 hour outage a year or two ago on their Cisco UCS platform because of bugs in the Cisco stuff that they had (I think it was bugs around link flapping or something). I know others have had issues too; it sort of surprises me how long 10GbE has been around and we still seem to have quite a few issues with it, at least on the HBA side.

iSCSI has had its issues too over the years, at least iSCSI in the HBAs. I was talking to one storage company late last year who has an iSCSI-only product, and they said iSCSI is ready for prime time – but after further discussion they clarified that you really should only use it with offloading NIC X or Y or software stack Z. iSCSI was a weak point for a long time on the 3PAR platform; they've addressed it to some extent on the new V-series, but I wouldn't be surprised if they still don't support anything other than pure software initiators.

TCP is very forgiving of networking issues; storage, of course, is not. In the current world of virtualization, with people consolidating things onto fewer, larger systems, the added cost of FC really isn't that much. I wouldn't be slapping FC cards into swaths of $3-5k servers, but most servers that run VMs have gobs of memory, which of course drives the price quite a bit higher than that.

Data Center Bridging really does nothing when your NIC decides to stop forwarding jumbo frame packets, or when the link starts flapping, or when the firmware crashes, or if the ASIC overheats. The amount of time it often takes for software to detect a problem with the link and fail over to a backup link is alone big enough to cause major issues with storage if it's a regular occurrence. All of the networks I've worked on, at least in the past decade or so, have operated at a tiny fraction of their capacity; the bottlenecks are typically things like firewalls between zones (and whenever possible I prefer to rely on switch ACLs to handle that).

July 11, 2012

Tree Hugging SFO stops buying Apple

Filed under: General — Tags: — Nate @ 8:31 am

I saw this headline over on Slashdot just now and couldn't help but laugh. Following Apple's withdrawal from an environmental group, the city of San Francisco – pretty much in Apple's back yard – is going to stop buying Macs because of it. I imagine they will have to not buy iPads or iPhones too (assuming they were buying any to begin with) since they are just as integrated as the latest Mac laptops.

Apparently the tightly integrated devices are too difficult to recycle to be compliant, so rather than make the devices compliant, Apple goes their own way.

I don't care either way myself, but I can just see the conflict within the hardcore environmentalists, who seem to – almost universally, from what I've seen anyways – adopt Apple products across the board. For me it's really funny at least.

It is an interesting choice though, given Apple’s recent move to make one of their new data centers much more green by installing tons of extra solar capacity. On the one hand the devices are not green, but on the other hand the cloud that powers them is. But you can’t use the cloud unless you use the devices, what is an environmentalist to do?!

I suppose the question remains – given that many organizations have bans on equipment that is not certified by this environmental standards body, once these bans become more widespread, how long will it be until some of them cave internally to their own politics and the withdrawal some of their users go through from not being able to use Apple? I wonder if some may try to skirt the issue by implementing BYOD and allowing users to expense their devices.

Speaking of environmental stuff, I came across this interesting article on The Register a couple weeks ago, which talks about how futile it is to try to save power by unplugging your devices – the often talked about power drain as a result of standby mode. The key takeaway from that story for me was this:

Remember: skipping one bath or shower saves as much energy as switching off a typical gadget at the wall for a year.

In the comments of the story one person wrote how his girlfriend or wife would warm up the shower for 4-5 minutes before getting in. The same person wanted to unplug their gadgets to save power, but she didn't want to NOT warm up the shower – thus obviously wasting a ton more energy than anything that could be saved by unplugging their gadgets. For me, the apartment I live in now has some sort of centralized water heater (the first one I've ever seen in a multi-home complex); all of my previous places have had dedicated water heaters. So WHEN the hot water works (I've had more hot water outages in the past year than in the previous 20), the shower warms up in about 30-45 seconds.

So if you want to save some energy, take a cold shower once in a while – or skip a shower once in a while. Or if you're like Kramer and take 60+ minute showers, cut them shorter (for him it seems even 27 minutes wasn't long enough). If you really want to save some energy, have fewer children.

I'm leaving on my road trip to Seattle tomorrow morning – going to drive the coast from the Bay Area to Crescent City, then cut across to Grants Pass, Oregon before stopping for the night, then take I-5 up to Bellevue on Friday so I can make it in time for Cowgirls that night. Going to take a bunch of pictures with my new camera and test my car out on those roads. I made a quicker trip down south last Sunday – drove the coast to near LA and got some pretty neat pictures there too. I drove back on the 4th of July (started at around 5PM from Escondido, CA), and for the first time ever for me at least there was NO TRAFFIC. I drove all the way through LA and never really got below 50-60MPH. I was really shocked, even given the holiday. I drove through LA on Christmas Eve last year and still hit a ton of traffic then.

