TechOpsGuys.com Diggin' technology every day

August 27, 2012

vSphere 5.1: what 5.0 should have been!

Filed under: General — Tags: — Nate @ 6:40 pm

I’m not going to bore you with all the mundane details about what is new, plenty of other folks are doing that, but here is a post from VMware with links to PDFs covering what has changed.

It looks pretty decent, the licensing change is welcome, though not everyone agrees with that. I find it interesting that the web console is going to be the main GUI to manage vSphere going forward. I found the web console in 5.0 severely lacking, but I’m sure it’s improved in 5.1. Anyone happen to know if the new console is backwards compatible with vCenter 5.0? Also I wonder if this web console applies to managing ESXi hosts directly (without vCenter)? I assume it doesn’t apply (yet)?

I don’t see myself upgrading anytime before the end of the year, but it strongly seems to me that this 5.1 release is what 5.0 should have been last year.

I find this race to a million IOPS quite silly. Whether it is VMware’s latest claim of 1 million IOPS in a VM, EMC’s latest claim, or HDS’s latest claim, everyone is trying to show they can do a million too, and the fine print always seems to point to a 100% read workload. Maybe customers will buy their arrays with their data pre-loaded, so they don’t have to do any writes to them.

 

August 24, 2012

3PAR: Helping and Hurting HP ?

Filed under: Storage — Tags: — Nate @ 8:59 am

Here I go, another blog post starting with a question. Yet another post sparking speculation on my part, due to an article by our friends at The Register who were kind enough to do some number crunching of HP’s latest quarterly numbers, where storage revenues were down about 5%.

Apparently a big chunk of the downward slide for revenues was declines in EVA and tape, offset to some degree by 3PAR and StoreOnce de-dupe products.

I suppose this thought could apply to both scenarios, but I’ll focus on the disk end, since I have no background on StoreOnce.

Before HP acquired 3PAR, EVA was obviously a juicy target to go after, replacing EVAs with 3PARs. The pitch was certainly that you can get a hell of a lot more done with less 3PAR than you can with more EVA, so you end up saving money. I’ve never used EVA myself; I’ve heard some good things about it and some really bad things, and I don’t think I’d ever want to use EVA regardless.

I am sure that 3PAR reps (those that haven’t left anyways – I’ve heard from numerous sources they outclass their HP counterparts by leagues and leagues), who are now responsible for pitching HP’s entire portfolio, obviously have a strong existing bias towards 3PAR and away from the other HP products. They try to keep a balanced viewpoint but I’m sure that’s hard to do, especially after they’ve spent so much time telling the world how bad these other products are and why the customer should use 3PAR instead. Can’t blame them, it’s a tough switch to make.

So, assuming you can get a hell of a lot more done on smaller/fewer 3PAR system(s) than EVA – which I think is totally true, with some caveat as to what sort of discounts some may be able to score on EVA (3PAR has traditionally had some strict margin rules and has no problem walking away from a deal if the margin is too low) – add to that the general bias of at least part of the sales force, as well as HP’s general promotion that 3PAR is the future, and you can quite possibly get lower overall revenue while the customers are saving money by having to buy fewer array resources to accomplish the same (or more) tasks.

3PAR revenue was up more than 60% apparently, on top of the previous gains made since the acquisition.

It would be very interesting to me to see how much consolidation some of these deals end up being – traditionally NetApp I think has been the easiest target for 3PAR, I’ve seen some absolutely massive consolidation done in the past with those products, it was almost comical in some cases. I bet EVA is similar.

Now the downside to the lower revenues – and I’ve seen this at both Dell and HP – both companies are feeling tremendous pressure to outperform; they haven’t been able to do it on the revenue side, so they’ve been squeezing on internal costs, which really can degrade services. The overall quality of the sales forces at the likes of HP and Dell has traditionally been terrible compared to their smaller company counterparts (at least in storage). Add to that the internal politics and region limitations that the companies place on their sales forces, which further complicate and frustrate the quality people internally as well as customers externally. Myself, I was unable to get anything out of a local HP/3PAR account team for months in the Bay Area, so I reached out to my friends in Seattle and they turned some stuff around for me in a matter of hours, no questions asked, and they didn’t get any credit (from HP) for it either. Really sad situation for both sides.

I don’t have much hope that HP will be willing or able to retain the top quality 3PAR folks, at least on the sales side, over the medium term. They, like Dell, seem focused on driving down costs rather than keeping quality high, which is a double edged sword. The back end folks will probably stick around for longer, given that 3PAR is one of the crown jewels in HP’s enterprise portfolio.

For some reason I’m immediately reminded of this quote from Office Space:

[..] "that is not right, Michael. For five years now, you've worked your ass  off at Initech, hoping for a promotion or some kind of profit sharing  or something. Five years of your mid-20s now, gone. And you're gonna go  in tomorrow and they're gonna throw you out into the street. You know  why? So Bill Lumbergh's stock will go up a quarter of a point."

 

One of 3PAR’s weak points has been the low end of the market, say sub-$100k deals, a space 3PAR has never tried to compete in. Apparently, according to The Register, the HP P4000/Lefthand side of things is not doing so hot either, and that seemed to be HP’s go-to product for this price range. This product range is what HP used to be excited about before 3PAR; I attended a storage briefing at a VMware User Group meeting just before HP bought 3PAR, expecting some sort of broad storage overview, but it was entirely Lefthand focused. While Lefthand has some interesting tech (the network RAID is pretty neat), for the most part I’d rather pay more and use 3PAR obviously.

I wonder what will happen to Lefthand in the future – will the best of its tech get rolled up into 3PAR, or vice versa? Or maybe it will just stay where it’s at. The one good thing Lefthand has is the VSA; it’s not as complete as I’d like to see it, but it’s one of the very few VSAs out there.

Dell has been busy trying to integrate their various storage acquisitions – Compellent, Ocarina, Exanet, and I think there were one or two more that I don’t remember. Storage revenues there are down as well. I’m not sure how much of the decline has to do with Dell terminating the EMC reselling arrangement at this point, but it seems like a likely contributor to the declines in their case.

August 23, 2012

Real time storage auto tiering?

Filed under: Storage — Tags: — Nate @ 9:42 am

I was reading this article by our friends at The Register, and apparently there is some new system from DotHill that claims to provide real time storage tiering – that is, moving data between tiers every 5 seconds.

The AssuredSAN Pro 5000 has a pretty bad name, and it borrows the autonomic term that 3PAR loves to use so much (I was never too fond of it, I preferred the more generic term automatic) in describing its real time tiering. According to their PDF, they move data around in 4MB increments, a far smaller granularity than most other array makers use.

From the PDF –

  • Scoring to maintain a current page ranking on each and every I/O using an efficient process that adds less than one microsecond of overhead. The algorithm takes into account both the frequency and recency of access. For example, a page that has been accessed 5 times in the last 100 seconds would get a high score.
  •  Scanning for all high-scoring pages occurs every 5 seconds, utilizing less than 1.0% of the system’s CPU. Those pages with the highest scores then become candidates for promotion to the higher-performing SSD tier.
  •  Sorting is the process that actually moves or migrates the pages: high scoring pages from HDD to SSD; low scoring pages from SSD back to HDD. Less than 80 MB of data are moved during any 5 second sort to have minimal impact on overall system performance.

80MB of data every 5 seconds is quite a bit for a small storage system. I have heard of situations where auto tiering had such an impact that it actually made things worse, due to so much data moving around internally on the system, and had to be disabled. I would hope they have some other safeguards in there, like watching spindle latency, scheduling the movements at low priority, and perhaps even canceling movements if the job can’t be completed in some period of time.
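To make that score/scan/sort description a bit more concrete, here is a rough sketch in Python of how such a loop could work. To be clear, this is purely my own illustration, not DotHill’s actual algorithm – the 4MB page size, 5 second interval and 80MB per-pass cap come from their PDF, while the scoring decay and promotion threshold are numbers I made up:

from collections import defaultdict

PAGE_SIZE_MB = 4        # DotHill pages are 4MB, per the PDF
SCAN_INTERVAL_SEC = 5   # candidate scan runs every 5 seconds
MOVE_CAP_MB = 80        # no more than 80MB migrated per sort pass

scores = defaultdict(float)   # page id -> hotness score
ssd_pages = set()             # pages currently living on the SSD tier

def record_io(page_id, decay=0.99):
    # Scoring: every I/O decays the old score (recency) and adds to it (frequency)
    scores[page_id] = scores[page_id] * decay + 1.0

def scan_and_sort(promote_threshold=5.0):
    # Scanning: runs every SCAN_INTERVAL_SEC, finds the hottest pages still on HDD
    budget = MOVE_CAP_MB // PAGE_SIZE_MB   # at most 20 pages moved per pass
    candidates = sorted((p for p in scores if p not in ssd_pages),
                        key=scores.get, reverse=True)[:budget]
    # Sorting: promote hot pages, demoting the coldest SSD page to make room
    for page in candidates:
        if scores[page] < promote_threshold:
            break
        coldest = min(ssd_pages, key=lambda p: scores.get(p, 0.0), default=None)
        if coldest is not None and scores.get(coldest, 0.0) < scores[page]:
            ssd_pages.discard(coldest)     # demote the cold page back to HDD
        ssd_pages.add(page)                # promote the hot page to SSD

The per-pass movement cap is effectively one of the safeguards I mention above – it bounds how much background I/O each sort pass can generate no matter how many pages look hot.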

The long delay time between data movements is perhaps the #1 reason why I have yet to buy into the whole automagic storage tiering concept. My primary source of storage bottlenecks over the years has been some out-of-nowhere job that generates a massive amount of writes and blows out the caches on the system. Sometimes the number of I/Os inbound is really small too, but the I/O size can be really big (5-10x normal) and is then split up into smaller I/Os on the back end, and the IOPS are multiplied as a result. The worst offender was a special application I supported a few years ago which would, as part of its daily process, dump tens of GB of data from dozens of servers in parallel to the storage system as fast as it could. This didn’t end well as you can probably imagine; at the peak we had roughly 60GB of RAM cache between the SAN and NAS controllers. I tried to get them to re-architect the app to use something like local Fusion IO storage but they did not – they said they needed shared NFS. I suspect this sort of process would not be helped too much by automatic storage tiering because I’m sure the blocks are changing each day to some degree. This is also why I have not been a fan of things like SSD read caches (hey there NetApp!), and of course having an SSD-accelerated write cache on the server end doesn’t make a lot of sense either, since you could lose data in the event the server fails. Unless you have some sort of fancy mirroring system to mirror the writes to another server, but that sounds complicated and problematic, I suspect.

Compellent does have one type of tiering that is real time though – all writes by default go to the highest tier and then are moved down to lower tiers later. This particular tiering is even included in the base software license. It’s a feature I wish 3PAR had.

This DotHill system also supports thin provisioning, though no mention of thin reclaiming, not too surprising given the market this is aimed at.

They also claim to have some sort of rapid rebuild, by striping volumes over multiple RAID sets. I suppose this is less common in the markets they serve (it certainly isn’t possible on some of the DotHill models), though it has of course been the norm for a decade or more on larger systems. Rapid rebuild to me obviously involves sub-disk distributed RAID.
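To illustrate why sub-disk distributed RAID rebuilds so much faster, here is some quick back-of-the-envelope math – the drive size and per-drive rebuild rate below are round numbers I picked for illustration, not figures from DotHill or anyone else:

def rebuild_hours(data_gb, per_drive_mb_s, participating_drives):
    # Total rebuild bandwidth scales with how many drives share the work
    total_mb_s = per_drive_mb_s * participating_drives
    return (data_gb * 1024.0) / total_mb_s / 3600

# Traditional RAID: a single hot spare absorbs the whole rebuild at ~50MB/s
print(rebuild_hours(2000, 50, 1))    # ~11.4 hours for a 2TB drive

# Distributed RAID: the failed drive's chunks are rebuilt across, say, 40 drives
print(rebuild_hours(2000, 50, 40))   # ~0.28 hours, roughly 17 minutes

The point being that once every drive in the system contributes a slice of the rebuild, the rebuild window shrinks from half a day to minutes, which is why the bigger arrays have done it this way for years.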

Given that DotHill is a brand that others frequently OEM I wonder if this particular tech will bubble up under another name, or will the bigger names pass on this in the fear it may eat into their own storage system sales.

August 22, 2012

Are people really waiting for Windows 8?

Filed under: General — Tags: — Nate @ 11:04 am

UPDATED – Dell came out with their quarterly results and their PC business wasn’t doing well, down 22% from a year ago.

A common theme I see popping up is people claiming folks are holding off until Windows 8 before buying their systems. Do you think this is true? I can imagine some folks holding off to buy the MS Tablet or something, but I can’t imagine many people (I’d bet it’s a rounding error) waiting for the desktop/laptop version of Windows 8 before buying their new PC. Especially given that apparently 70% of Dell’s PC sales go to commercial/government, with only 30% going to consumer.

UPDATE – HP released their results and their PC shipments were similarly poor with revenues down 10-12%.

They’d rather not admit that a weak economy, combined with the rise of tablets not running Microsoft software, is probably where most of the blame lies. That is a market Linux advocates long hoped to capture by being “good enough” to do things like browse the web, read email etc, but for many reasons Linux was never able to really establish any foothold. I remember my grandfather had such a Linux system – forgot who made it, but it ran entirely off of a CD, with perhaps only a few megs of flash storage (this was about 10-12 years ago). It was slow but it mostly worked, and he certainly got far fewer viruses on that than when he eventually went to Windows a few years later (Dell had to re-install his system several times a year due to malware infestations – he refused to stop going to dodgy porn sites).

What’s new and good in Windows 8 vs 7 that would make folks want to hold back on a new desktop or laptop? It’s pitched as the biggest product launch since Windows 95 and I believe that, though I don’t see anywhere near the leap in technology from Windows 7 to 8 that happened from Win3.x to 95 (sort of like my feelings on vSphere 3.5->4 vs 4->5).

I suspect it’s just an excuse, time will tell if there is a massive surge in PC buying shortly after Windows 8 launches but I don’t expect there to be. The new Metro (or whatever they are calling it) ecosystem is minuscule compared to the existing windows ecosystem. Hell, all other ecosystems pale in comparison to the current windows ecosystem.

Do you know if people are holding back waiting for Windows 8 on a desktop or laptop style device? If so I’d be curious to hear the reasons.

I fear for Microsoft that Windows 8 will hit the ground with a thud. It’ll sell millions just because there is such a massive install base of Windows folks (Vista sold quite a bit too, remember that!). Unlike some other players (*cough* HP WebOS *cough*), even if it is a thud initially Microsoft won’t give up. I recall similar hype around Windows Phone 7, and that hit with a thud as well and has gone nowhere. In short – MS is setting the expectations too high.

Something I did learn recently which I was not aware of before: one of my friends at Microsoft mentioned that the Windows Phone 7 platform was mostly acquired from another company (forgot which); Microsoft then went and gutted it and the result is what we have today. He had long since lost all faith in Microsoft and in their ability to execute, stifling projects that have good prospects while promoting others that have none. I suppose that sort of thing is typical for a really big company. I don’t know how he (or others) can put up with it without going crazy. He didn’t have many positive things to say about Windows Phone, nor did his girlfriend who also works at Microsoft. It was sort of “meh”.

They’ll keep trying though, Microsoft that is. Throw enough money at anything and you’ll eventually get it right; it may cost them a few more billion, and a few more years, but it’s a big, important market, so it’s certainly a worthwhile investment.

I do find it funny that while Ballmer was out trying to destroy Google, Apple came out of nowhere, took the torch and ran with it, and it took Microsoft many years to regroup and try to respond.

I don’t hate Microsoft, haven’t for a while; I do sort of feel sorry for them though, given their massive fall from the top of the world to where they are now. They still have tons of cash mind you… but from pretty much every other angle they aren’t doing so well.

August 21, 2012

When is a million a million?

Filed under: Storage — Tags: — Nate @ 7:24 pm

I was doing my usual reading of The Register, specifically this article, and something popped into my mind, so I wanted to write a brief note about it.

I came across the SPC-1 results for the Huawei OceanStor Dorado 5100 before I saw the article and didn’t think a whole lot about it.

I got curious when I read the news article though, so I did some quick math – the Dorado 5100 is powered by 96 x 200GB SSDs and 96GB of cache in a dual controller active-active configuration, putting out an impressive 600,000 IOPS with the lowest latency (by far) that I have seen anyways. They also had a somewhat reasonable unused storage ratio of 32.35% (I would have liked to have seen much better given the performance of the box, but I’ll take what I can get).

But the numbers aren’t too surprising – I mean SSDs are really fast, right? What got me curious though is the # of IOPS coming out of each SSD to the front end; in this case it comes to 6,250 IOPS/SSD. Compared to some of the fastest disk-based systems this is about 25x faster per disk than spinning rust. There is no indication that I can see, at least, of what specific sort of SSD technology they are using (other than SLC). But 6,250 per disk seems like a far cry from the tens of thousands of IOPS many SSDs claim to be able to do.

I’m not trying to say it’s bad or anything but I found the stat curious.

I went ahead and looked at another all-SSD solution, the IBM V7000. This time 18 x 200GB SSDs are providing roughly 120,000 IOPS, also with really good latency, with 16GB of data cache between the pair of controllers. Once again the numbers come to roughly 6,600 IOPS per SSD. IBM ran at an even better unused storage ratio of just under 15% – hard to get much better than that.

Texas Memory Systems (recently acquired by IBM) posted results for their RamSan-630 about a year ago, with 20 x 640GB SSDs pushing out roughly 400,000 IOPS with pretty good latency. This time however the numbers change – around 20,000 IOPS per SSD here, and as far as I can tell there is no RAM cache either. The TMS system came in at a 20% unused storage ratio.

While there are no official results, HP did announce not long ago an ‘all SSD’ variant of the P10000 (just realized it is kind of strange to have two sub models – V400 and V800, which were the original 3PAR models – of the larger P10000 model), which they said would get the same 450,000 IOPS on 512 x SSDs. The difference here is pretty stark, with each SSD theoretically putting out only 878 IOPS (so roughly 3.5x faster than spinning rust).

I know that originally 3PAR chose a slower STEC Mach8IOPS SSD primarily due to cost (it was something like 60% cheaper). STEC’s own website shows the same SSD getting 10,000 IOPS (on a read test – whereas the disk they compared it to seemed to give around 250 IOPS). Still though, you can tap out the 8 controllers with almost 1/4th the number of disks supported with these SSDs. I don’t know whether or not the current generation of systems uses the same SSD.

I’ll be the first to admit an all-SSD P10000 doesn’t make a lot of sense to me, though it’s nice that customers have that option if that’s what they want (I never understood why all-SSD was not available before; that didn’t make sense either). HP says it is 70% less expensive than an all-disk variant, though they are not specific about whether they are using 100GB SSDs (I assume they are) vs 200GB SSDs.

Both TMS and Huawei advertise their respective systems as being “1 million IOPS” – I suppose if you took one of each and striped them together that’s about what you’d get! Sort of reminds me of a slide show presentation I got from Hitachi right before their AMS2000 series launched; one of the slides showed the # of IOPS from cache (they did not have a number for IOPS from disk at the time), which didn’t seem like a terribly useful statistic.

So here you have individual SSDs providing anywhere from 900 to 20,000 IOPS per disk on the same test…
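For reference, those per-SSD figures are nothing fancier than the published front-end IOPS divided by the number of SSDs in each configuration (using the numbers as I read them above):

systems = {
    "Huawei Dorado 5100":  (600000, 96),
    "IBM V7000 (all SSD)": (120000, 18),
    "TMS RamSan-630":      (400000, 20),
    "HP P10000 (all SSD)": (450000, 512),
}

for name, (front_end_iops, ssd_count) in systems.items():
    print("%-22s ~%5.0f IOPS per SSD" % (name, float(front_end_iops) / ssd_count))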

I’d really love to see SPC-1 results for the likes of Pure Storage, Nimble Storage, Nimbus Storage, and perhaps even someone like Tintri, just to see how they measure up on a common playing field, with a non trivial utilization rate. Especially with claims like this from Nimbus saying they can do 20M IOPS/rack, does that mean at 10% of usable capacity or greater than 50%? I really can’t imagine what sort of workload would need that kind of I/O but there’s probably one or two folks out there that can leverage it.

We now take you back to your regularly scheduled programming..

August 20, 2012

The Screwballs have Spoken

Filed under: Datacenter — Tags: — Nate @ 2:07 pm

Just got this link from Gabriel (thanks!), it seems the screwball VMware community has spoken and VMware listened and is going to ditch their controversial vRAM licensing that they introduced last year.

In its upcoming release of vSphere 5.1, VMware is getting rid of vRAM entitlements, which debuted with vSphere 5 and determine how much memory customers are permitted to allocate to virtual machines on the host, according to sources familiar with VMware’s plans.

I tried to be a vocal opponent of this strategy and firmly believed it was going to hurt VMware. I haven’t seen any hard numbers as to the uptake of vSphere 5, but there have been hints that it has not been as fast as VMware had hoped.

I had a meeting with a VMware rep about a year ago and complained about this very issue for at least 30 minutes but it was like talking to a brick wall. I was told recently that the rep in question isn’t with the company anymore.

I have little doubt that VMware was forced into this change because of slow uptake and outright switching to other platforms. They tried to see how much leverage they had with customers and realized they don’t have as much as they thought.

Now the question is will they repeat the mistake again in the future. Myself, I am pretty excited to hear that Red Hat is productizing OpenStack, along with RHEV; that really looks like it has a lot of potential (everything I see today about OpenStack says steer clear unless you have some decent in-house development resources). I don’t have any spare gear to be able to play with this stuff on at the moment.

Thanks VMware for coming to your senses, the harsh feelings are still there though, can I trust you again after what you tried to pull? Time will tell I guess.

(In case you’re wondering where I got the title of this post from it’s from here.)

Marge gets to make her concluding statement, in which she asks all concerned parents to write to I&S and express their feelings. In his office, Mr. Meyers goes through the tons of angry mail he's received... ``The screwballs have spoken...''

August 13, 2012

Freakish performance with Site to Site VPN

Filed under: Networking — Tags: , — Nate @ 6:07 pm

UPDATED I’ll be the first to admit – I’m not a network engineer. I do know networking, and can do the basics of switching, static routing, load balancing, firewalls etc. But it’s not my primary background. I suppose you could call me a network engineer if you judge my talents against some past network engineers I’ve worked with (which is kinda sad really).

I’ve used quite a few different VPNs over the years, all of them without any special WAN optimization, though the last “appliance” based VPN I was directly responsible for was a VPN between two sites connected by Cisco PIXs about 4-5 years ago. Since then my VPN experience has been limited to either using OpenVPN on my own personal stuff or relying on other dedicated network engineer(s) to manage it.

In general, my experience has told me that site to site VPN performance generally equates to Internet performance. You may get some benefit from the TCP compression and such, but without specialized WAN optimization / protocol optimization / caching etc – throughput is limited by latency.

I conferred with a colleague on this and his experience was similar – he expects site to site VPN performance to about match that of Internet site to site performance when no fancy WAN Opt is in use.

So imagine my surprise when a few weeks ago I hooked up a site to site VPN between Atlanta and Amsterdam (~95ms of latency between the two), and got a 10-30 fold improvement in throughput over the VPN compared to over the raw Internet.

  • Internet performance = ~600-700 Kilobytes/second sustained using HTTPS
  • Site to site VPN performance = ~5 Megabytes/second using NFS, ~12 Megabytes/second sustained using SCP, and 20 Megabytes/second sustained using HTTP

The links on each end of the connection are 1Gbps, tier 1 ISP on the Atlanta side, I would guesstimate tier 2 ISP (with tons of peering connections) on the Amsterdam side.

It’s possible the performance could be even higher, I noticed that speed continued to increase the longer the transfer was running. My initial tests were limited to ~700MB files – 46.6 seconds for a 697MB file with SCP. Towards the end of the SCP it was running at ~17MB/sec (at the beginning only 2MB/sec).

A network engineer who I believe is probably quite a bit better than me told me:

By my calculation – the max for a non-jittered 95ms connection is about 690KB/s so it looks like you already have a clean socket.
Keep in mind that bandwidth does not matter at this point since latency dictates the round-trip-time.

I don’t know what sort of calculation was done, but the throughput matches what I see on the raw Internet.
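My best guess at the calculation – and this is my assumption, not the engineer’s actual math – is the classic single-stream TCP ceiling of window size divided by round trip time. A plain unscaled 64KB window over 95ms lands right around that ~690KB/s figure, and you can flip the formula around to see how large a window would be needed to sustain what I saw over the VPN:

RTT = 0.095  # seconds of round trip time, roughly Atlanta <-> Amsterdam

def max_throughput_kb(window_bytes, rtt=RTT):
    # Single stream TCP: at most one window's worth of data in flight per round trip
    return window_bytes / rtt / 1024.0

def window_needed_kb(throughput_mb_per_sec, rtt=RTT):
    # Bandwidth-delay product: the window required to keep the pipe full
    return throughput_mb_per_sec * 1024.0 * rtt

print(max_throughput_kb(64 * 1024))  # ~674 KB/s (~690 if you count 1000-byte KB) - unscaled 64KB window
print(window_needed_kb(13))          # ~1265 KB of window needed to sustain ~13 MB/s

If that is what is going on, the VPN endpoints are presumably negotiating large scaled windows (or outright proxying the TCP connections) inside the tunnel, while my raw Internet transfers were stuck near the unscaled ceiling for whatever reason.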

These are all single threaded transfers. Real basic tests. In all cases the files being copied are highly compressed (in my main test case the 697MB file uncompresses to 14GB), and in the case of the SCP test the data stream is encrypted as well. I’ve done multiple tests over several weeks and the data is consistent.

It really blew my mind; even with fancy WAN optimization I did not expect this sort of performance using something like SCP. Obviously they are doing some really good TCP windowing and other optimizations – despite there still being ~95ms of latency between the two sites within the VPN itself, the throughput is just amazing.

I opened a support ticket to try to get support to explain to me what more was going on but they couldn’t answer the question. They said that because there are no hops in the VPN it’s faster. There may be no hops, but there’s still 95ms of latency between the systems even within the VPN.

I mean, just a few years ago I wrote a fancy distributed multi threaded file replication system for a company I was at, to try to get around the limits of throughput between our regional sites because of the latency. I could have saved myself a bunch of work had we known at the time (and we had a dedicated network engineer then) that this sort of performance was possible without really high end gear or specialized protocols (I was using rsync over HPN-SSH). I remember trying to set up OpenVPN between two sites at that company for a while to test throughput there, and performance was really terrible (much worse than the existing Cisco VPN that we had on the same connection). For a while we had Cisco PIXs or ASAs, I don’t recall which, but they had a 100Mbit limit on throughput; we tapped them out pretty quickly and had to move on to something faster.

I ran a similar test between Atlanta and Los Angeles, where the VPN in Los Angeles was a Cisco ASA (vs the other sites, which are all SonicWall), and the performance was high there too – I’m not sure what the link speed is in Los Angeles but throughput was around 8 Megabytes/second for a compressed/encrypted data stream, easily 8-10x faster than over the raw Internet. I tested another VPN link between a pair of Cisco firewalls and their performance was the same as the raw Internet (15ms of latency between the two); I think the link was saturated in those tests (not my link so I couldn’t check it directly at the time).

I’m sure if I dug into the raw TCP packets the secrets would be there – but really, even after doing all the networking stuff I have been doing for the past decade+, I still can’t make heads or tails of 90% of the stuff that is in a packet (I haven’t tried to either, it hasn’t been a priority of mine, not something that really interests me).

But sustaining 10+ megabytes/second over a 95 millisecond link over 9 internet routers on a highly compressed and encrypted data stream without any special WAN optimization package is just amazing to me.

Maybe this is common technology now, I don’t know; I mean I’d sort of expect marketing information to advertise this kind of thing. If you can get 10-30x faster throughput over a VPN without high end WAN optimization vs regular Internet, I’d be really interested in that technology. If you’ve seen similar massive increases in performance without special WAN optimization on a site to site VPN I’d be interested to hear about it.

In this particular case, the products I’m using are SonicWall NSA3500s. The only special feature licensed is high availability; other than that it’s a basic pair of units on each end of the connection (WAN Optimization is a software option but is NOT licensed). These are my first SonicWalls. I had some friends trying to push me to use Juniper (SRX I think) or in one case Palo Alto Networks, but Juniper is far too complicated for my needs, and Palo Alto Networks is not suitable for site to site VPNs with their cost structure (the quote I had for 4 devices was something like $60k). So I researched a few other players, met with SonicWall about a year ago, was satisfied with their pitch, verified some of their claims with some other folks, and settled on them. So far it’s been a good experience, very easy to manage, and I’m still just shocked by this throughput. I really had terrible experiences managing those Cisco PIXs a few years back by contrast. OpenVPN is a real pain as well (once it’s up and going it’s alright, but configuring and troubleshooting are a bitch).

SonicWall claimed they were the only ones (well, second to Palo Alto Networks) who had true deep packet inspection in their firewalls (vs having other devices do the work). That claim interested me, as I am not well versed in the space. I bounced the claim off of a friend that I trust (who knows Palo Alto inside and out) and he said it was probably true; Palo Alto’s technology is better (fewer false positives) but nobody else offers that tech. Not that I need that tech – this is for a VPN – but it was nice to know that we get the option to use it in the future. SonicWall’s claims go beyond that as well, saying they are better than Palo Alto in some cases due to size limitations on Palo Alto’s side (not sure if that is still true or not).

Going far beyond simple stateful inspection, the Dell® SonicWALL® Reassembly-Free Deep Packet Inspection™ (RFDPI) engine scans against multiple application types and protocols to ensure your network is protected from internal and external attacks as well as application vulnerabilities. Unlike other scanning engines, the RFDPI engine is not limited by file size or the amount of concurrent traffic it can scan, making our solutions second to none.

SonicWall Architecture - looked neat, adds some needed color to this page

The packet capture ability of the devices is really nice too, makes it very easy to troubleshoot connections. In the past, on Cisco devices at least, I recall having to put the device in some sort of debug mode and it would spew stuff to the console (my Cisco experience is not current of course). With these SonicWalls I can set up filters really easily to capture packets, and it shows them in a nice UI and I can export the data to Wireshark or plain text if needed.

My main complaint on these SonicWalls, I guess, is that they don’t support link aggregation (some other models do though). Not that I need it for performance – I wanted it for reliability, so that if a switch fails the SonicWall can stay connected and not trigger a failover there as well. As-is I had to configure them so each SonicWall is logically connected to a single switch (though they have physical connections to both – I learned of the limitation after I wired them up). Not that failures happen often of course, but it’s too bad this isn’t supported in this model (which has 6x1Gbps ports on it).

The ONLY thing I’ve done on these Sonic Walls is VPN (site to site mainly, but have done some SSL VPN stuff too), so beyond that stuff I don’t know how well they work. Sonic wall traditionally has had a “SOHO” feel to it though it seems in recent years they have tried to shrug this off, with their high end reaching as high as 240 Gbps in an active-active cluster. Nothing to sneeze at.

UPDATE – I ran another test, and this time I captured a sample of the CPU usage on the Sonic Wall as well as the raw internet throughput as reported by my router, I mean switch, yeah switch.

2,784MB gzip’d file copied in 3 minutes 35 seconds using SCP. If my math is right that comes to an average of roughly 12.94 Megabytes/second ? This is for a single stream, basic file copy.

The firewall has a quad core 550 MHz Mips64 Octeon Processor (I assume it’s quad core and not four individual processors). CPU usage snapshot here:

SonicWall CPU usage snapshot across cores during big file xfer

The highest I saw was CPU core #1 going to about 45% usage, with core #2 at maybe 35%, and core #3 maybe around 20%, with core #0 being idle (maybe that one is reserved for management, given its low usage during the test… not sure).

Raw network throughput topped out at 135.6 Megabits/second (well some of that was other traffic, so wager 130 Megabits for the VPN itself).

Raw internet throughput for VPN file transfer

Apparently this post found its way to Dell themselves and they were pretty happy to see it. I’m sorry, I just can’t get over how bitchin’ fast this thing is! I’d love for someone at Dell/SonicWALL who knows more than the tier 1 support person I talked with a few weeks ago to explain it better.

August 7, 2012

Adventures with vCenter, Windows and expired Oracle passwords

Filed under: General — Tags: , , — Nate @ 7:39 pm

Today’s a day that I wish I could have back – it was pretty much a waste/wash.

I’m not a windows person by trade of course, but I did have an interesting experience today. I write this in the hopes that perhaps it can save someone else the same pain.

Last night I kicked off some Windows updates on a vCenter server; I’ve done it a bunch of times before and never had an issue. There were only about 6-10 updates to install. It installed them, then rebooted, and was taking a really long time to complete the post install stuff; after about 30 minutes I gave up and went home. It has always come back when it’s done.

I forgot about it until this morning when I went to go do stuff with vCenter and could not connect. Then I tried to remote desktop into the system and could not (TCP port not listening). So I resorted to logging in via the VMware console. Tried resetting remote desktop to no avail. I went to the control panel to check on Windows Update, and the Windows Update control panel just hung. I went to ‘add/remove programs’ to roll back some updates and it hung while looking for the updates.

I tried firing up IE9, and it didn’t fire, it just spun an hourglass for a few seconds and stopped. I scoured the event logs and there was really nothing there – no errors. I was convinced at this time an OS update went wrong, I mean why else would something like IE break ? There was an IE update as part of the updates that were installed last night after all.

After some searches I saw some people comment on how some new version of Flash was causing IE to break, so I went to remove flash (forgot why it was installed but there was a reason at the time), and could not. In fact I could not uninstall anything, it just gave me a generic message saying something along the lines of “wait for the system to complete the process before uninstalling this”.

I came across a Windows tool called the System Update Readiness Tool which sounded promising as well. I was unable to launch IE of course; I did have Firefox and could load the web page, but was unable to download the software without Firefox hanging (!?). I managed to download it on another computer and copy it over the network to the affected server’s HD. But when I tried to launch it – sure enough, it hung too almost immediately.

Rebooting didn’t help; shutting down completely and starting up again – no luck, same behavior. After consulting with the IT manager, who spends a lot more time in Windows than me, we booted into safe mode – it came right up. Windows Update is not available in safe mode, and most services were not started, but I was able to get in and uninstall the hotfix for IE. I rebooted again.

At some point along the line I got the system to where I could remote desktop in, windows update looked ok, IE loaded etc. I called the IT manager over to show him, and decided to reboot to make sure it was OK only to have it break on me again.

I sat at the post install screen for the patches (Stage 3 of 3, 0%) for about 30 minutes. At this point I figured I had better start getting prepared to install another vCenter server, so I started that process in parallel, talked a bit with HP/VMware support, and shut off the VM again and rebooted – no difference, it just sat there. So I rebooted again into safe mode, removed the rest of the patches that were installed last night, rebooted again into normal mode and must’ve waited 45 minutes or so for the system to boot – it did boot eventually, got past that updates screen. But the system was still not working right; vCenter was hanging and I could not remote desktop in.

About 30 minutes after the system booted I was able to remote desktop in again, not sure why. I kept poking around, not making much progress. I decided to take a VM snapshot (I had not taken one originally, but in the grand scheme of things it wouldn’t have helped), re-install those patches again, and let the system work through whatever it had to work through.

So I did that, and the system was still wonky.

I looked and looked – vCenter still hanging, nothing in the event log and nothing in the vpx vCenter log other than stupid status messages like:

2012-08-08T01:08:01.186+01:00 [04220 warning 'VpxProfiler' opID=SWI-a5fd1c93] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:12.535+01:00 [04220 warning 'VpxProfiler' opID=SWI-12d43ef2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:23.884+01:00 [04356 warning 'VpxProfiler' opID=SWI-f6f6f576] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:35.234+01:00 [04220 warning 'VpxProfiler' opID=SWI-a928e16] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:46.583+01:00 [04220 warning 'VpxProfiler' opID=SWI-729134b2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:57.932+01:00 [04328 warning 'VpxProfiler' opID=SWI-a395e0af] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:09.281+01:00 [04220 warning 'VpxProfiler' opID=SWI-928de6d2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:20.631+01:00 [04328 warning 'VpxProfiler' opID=SWI-7a5a8966] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:32.058+01:00 [04220 warning 'VpxProfiler' opID=SWI-524a7126] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:43.804+01:00 [04328 warning 'VpxProfiler' opID=SWI-140d23cf] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:55.551+01:00 [04356 warning 'VpxProfiler' opID=SWI-acadf68a] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:07.297+01:00 [04328 warning 'VpxProfiler' opID=SWI-e42316c] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:19.044+01:00 [04356 warning 'VpxProfiler' opID=SWI-3e976f5f] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:30.790+01:00 [04328 warning 'VpxProfiler' opID=SWI-2734f3ba] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms

No errors anywhere; I believe I looked at the tomcat logs a few times and there were no logs for today.

Finally I dug into the tomcat logs from last night and came across this –

Aug 6, 2012 11:27:30 PM com.vmware.vim.common.vdb.VdbODBCConfig isConnectableUrl
SEVERE: Unable to get a connection to: jdbc:oracle:thin:@//DB_SERVER:1521/DB_SERVER as username=VPXADMIN due to: ORA-28001: the password has expired

I had encountered a password expiry on my sys account a few weeks ago, but didn’t really think much about it at the time. Anyways, I reset the password and vCenter was able to start. I then disabled password expiry per this page, which says the defaults were changed in 11g and passwords now expire by default (I have used Oracle 10g and a little 8i/9i and never recall having password expiry issues).
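The lesson I am taking away is to grep the vCenter/tomcat logs for Oracle errors before assuming the OS itself is broken. Even a dumb little script like the one below would have surfaced the ORA-28001 in seconds – note the log path is just my guess at a typical vCenter-on-Windows install, adjust as needed:

import re
from pathlib import Path

# My guess at a typical vCenter-on-Windows log location - adjust to your install
LOG_DIR = Path(r"C:\ProgramData\VMware\VMware VirtualCenter\Logs")

# Catch Oracle error codes plus anything the JDBC layer flags as severe
PATTERN = re.compile(r"ORA-\d+|SEVERE|Unable to get a connection")

for log_file in LOG_DIR.rglob("*.log"):
    try:
        text = log_file.read_text(errors="ignore")
    except OSError:
        continue  # skip files the running services hold open exclusively
    for line in text.splitlines():
        if PATTERN.search(line):
            print("%s: %s" % (log_file.name, line.strip()))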

I have had vCenter fail to start because of DB issues in the past – in fact, because vCenter does not properly release locks on the Oracle DB when it shuts down, the easiest workaround is to restart Oracle whenever I reboot the vCenter server (because vCenter is the only thing on that Oracle DB it’s just the simpler solution). When vCenter fails in this way it causes no issues for the rest of the OS – just an error message in the event log saying vCenter failed to start, and a helpful explanation as to why –

Unable to get exclusive access to vCenter repository.   Please check if another vCenter instance is running against the same database schema.

What gets me, even now, is how the hell this expired password cascaded into Internet Explorer breaking, remote desktop breaking, Windows Update breaking, etc. My only guess is that vCenter was perhaps flooding the system with RPC messages, causing other things to break. Again – there was no evidence of any errors in the event log anywhere. I even called a friend who works at Microsoft and deploys hundreds of Windows servers for a living (he works as a lab manager), hoping he would have an idea. He said he had seen this behavior several times before but never tried to debug it; he just wiped the system out and reinstalled. I was close to doing that today, but fortunately eventually found a solution, and I guess you could say I learned something in the process?

I don’t know.

I have not seriously used Windows since the NT4 days (I have used it casually on the desktop and in some server roles like this vCenter system). Why I stopped using it – well, there were many reasons, and I suppose this was sort of a reminder. I’m not really up for moving to the Linux vCenter appliance yet – it seems beta-ish – if I ever get to move to that appliance before I upgrade to KVM (at some point, no rush). I have a very vague memory of experimenting one time on NT4, or maybe it was 3.51, where I decided to stop one or more of the RPC services to see what would happen. Havoc, of course. I noticed one of the services vCenter depends upon, the DCOM Server Process Launcher, seems of similar importance in Windows 2008, though 2008 smartly does not allow you to stop it; I chuckled when I saw the Recovery Action for a failure of this service is Restart the Computer. But in this case the service was running… I looked for errors for it in the event log as well and there were none.

ESXi 5 Uptake still slow?

Filed under: General — Tags: — Nate @ 10:10 am

Just came across this article from our friends at The Register, and two things caught my eye –

HP is about to launch a new 2U quad socket system – the HP DL560 Gen8, which is what the article is about. I really can’t find any information on this server online, so it seems it is not yet officially announced. I came across this PDF from 2005, which says the 560 has existed in the past – though I never recall hearing about it and I’ve been using HP gear off and on since before that. Anyways, on the HP site the only 500-series systems I see are the 580 and 585, nothing new there.

HP has taken its sweet time joining the 4-socket 2U gang. I recall Sun was among the first, several years ago with the Opteron; later Dell and others joined in, but HP remained bulky, with its only quad socket rack option being 4U.

The more interesting thing to me though was the lack of ESXi 5.0 results posted with VMware’s own benchmark utilities. Of the 23 results posted since ESXi 5 was made generally available, only four are running on the newest hypervisor. I count six systems using ESX 4.1U2 and vCenter 5.0 (a combination I chose for my company’s infrastructure). Note I said ESX – not ESXi. I looked at a couple of the disclosure documents and would expect them to specifically call out ESXi if that is in fact what was used.

So not only are they NOT using ESXi 5.0 but they aren’t even using ESXi period with these newest results (there is not a single ESXi 4.x system on the site as far as I can tell).

Myself I find that fascinating. Why would they be testing with an older version of the hypervisor and not even using ESXi? I have my own reasons for preferring ESX over ESXi, but I’d really expect for benchmark purposes they’d go with the lighter hypervisor. I mean it consumes significantly less time to install onto a system since it’s so small.

I have to assume that they are using this configuration because it’s what the bulk of their customers are still deploying today, otherwise it makes no sense to be testing the latest and greatest Intel processors on server hardware that’s not even released yet on an OS kernel that is going on three years old at this point. I thought there was supposed to be some decent performance boosts in ESXi 5?

I’m not really a fan of the VMmark benchmark itself; it seems rather confusing to interpret the results, there are no cost disclosures, and I suspect it only runs on VMware, making it difficult or impossible to compare with other hypervisors. Also the format of the results is not ideal – I’d like to see at least CPU/memory/storage benchmarks included so it’s easier to tell how each subsystem performed. Testing brand X with processor Y and memory Z against brand W with processor Y and memory Z by itself doesn’t seem very useful.

SPEC has another VM benchmark, though it seems similarly confusing to interpret results, though at least they have results for more than one hypervisor.

vSphere, aka ESX 4, really was revolutionary when it was released: it ditched the older 32-bit system for a more modern 64-bit one, and introduced a ton of new things as well.

I was totally underwhelmed by ESXi 5, even before the new licensing change was announced. I mean just compare What’s New between vSphere 4 and vSphere 5.

August 2, 2012

Losing $400M in a matter of minutes due to a software bug

Filed under: News — Tags: , — Nate @ 9:52 am

This is pretty crazy: yesterday morning a Wall St market maker had some bug(s) in their software platform that caused them to perform a ton of trades – as one reporter put it, around 300% of normal volume on the NYSE was being traded by this one company. As the story unfolded the company continued to say everything was normal, then they changed their story to “we’re investigating”, then they changed their story to “a technology error occurred”.

[..] Among other stocks, Protective Life (PL.N) had already traded more than 10 times its usual volume, and Juniper Networks JPNR.N has already seen six times its usual daily volume.

I bet they hoped the NYSE was going to reverse those trades that were triggered by whatever bug it was – but at the end of the day yesterday the NYSE opted not to reverse the vast majority of them.

The result is that the company – Knight Capital – lost more than $400 million on the trades and is now seeking alternative means to re-capitalize itself. Knight’s stock has lost $600M in market cap (~67%) since the event; they had a billion dollar market cap as recently as 36 hours ago.

Yesterday the traders at Knight were obviously under a lot of stress and they took great comfort in this video CNBC showed on air. It is quite a funny video, the folks at Knight kept asking to see it again and again.

I’m not sure if I’ve mentioned this here before, but this is as good a spot as any: a video from a couple months ago about how serious high frequency trading is – and the difference 11 milliseconds can make.

 
