A Terabit of application switching throughput

April 19, 2012


Filed under: Networking — Tags: netscaler — Nate @ 6:02 pm

That’s a pretty staggering number to me. I had some friends that worked at a company that is now defunct (acquired by F5) called Crescendo Networks.

One of their claims to fame was the ability to “cluster” their load balancers so that you could add more boxes on the fly and it would just go faster, instead of having to rip and replace, or add more boxes and do funky things with DNS load balancing to balance between multiple groups of load balancers.

Crescendo's Scale out design - too bad the company didn't last long enough to see anyone leverage a 24-month expansion

Another company, A10 Networks (who is still around, though I think Brocade and F5 are trying to make them go away), introduced similar technology about a year ago called virtual chassis (details are light on their site). There may be other companies that have similar things too – they all seem to be gunning for the F5 VIPRION, which is a monster system. F5 took a chassis approach and supports up to 4 blades of processing power, then load balances across the blades themselves to distribute the load. I have a long history with F5 products and know them pretty well, going back to their 4.x code base, which was housed in (among other things) generic 4U servers with BSDI and generic motherboards.

I believe Zeus does it as well. I have used Zeus but have not gone beyond a 2 node cluster. I forgot, Riverbed bought them and changed the name to Stingray. I think Zeus sounds cooler.

With Crescendo, the way they implemented their cluster was quite basic. It was very similar to how other load balancing companies improved their throughput for streaming media applications – some form of direct response from the server to the client, instead of having the response go back through the load balancer a second time. Here is a page from a long time ago on some reasons why you may not want to do this. I’m not sure how A10 or Zeus do it.

I am a Citrix customer now, having heard some good things about them over the years, but never having tried the product. I found it curious why the likes of Amazon and Google gobble up Netscaler appliances like M&Ms when for everything else they seem to go out of their way to build things themselves. I know Facebook is a big user of the F5 VIPRION system as well.

You’d think (or at least I think) that if companies like this could leverage some sort of open source product and augment it with their own developer resources, they would – I’m sure they’ve tried – and maybe they are using such products in certain areas. My information about who is using what could be out of date. I’ve used haproxy (briefly) and nginx (more) as load balancers and wasn’t happy with either product. Give me a real load balancer please! Zeus seems to be a pretty nice platform – and open enough that you can run it on regular server hardware, rather than being forced into buying fixed appliances.

Anyways, I had a ticket open with Citrix today about a particular TLS issue regarding SSL re-negotiation, after a co-worker brought it to my attention that our system was reported as vulnerable by her browser / plugins. During my research I came across this excellent site which shows a ton of useful info about a particular SSL site.

I asked Citrix how I could resolve the issues the site was reporting and they said the only way to do it was to upgrade to the latest major release of code (10.x). I don’t plan to do that; resolving this particular issue doesn’t seem like a big deal (it would be nice, but it’s not worth the risk of using this latest code so early after its release for this one reason alone). Add to that, our site is fronted by Akamai (which actually posted poorer results on the SSL check than our own load balancers). We even had a “security scan” run against our servers for PCI compliance and it didn’t pick up anything related to SSL.

Anyways, back on topic. I was browsing through the release notes for the 10.x code branch and saw that Netscaler now supports clustering as well:

You can now create a cluster of nCore NetScaler appliances and make them work together as a single system image. The traffic is distributed among the cluster nodes to provide high availability, high throughput, and scalability. A NetScaler cluster can include as few as 2 or as many as 32 NetScaler nCore hardware or virtual appliances.

With their top end load balancers tapping out at 50Gbps, that comes to 1.6Tbps with 32 appliances. Of course you won’t reach top throughput depending on your traffic patterns so taking off 600Gbps seems reasonable, still 1Tbps of throughput! I really can’t imagine what kind of service could use that sort of throughput at one physical site.
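As a rough sketch of that arithmetic: the 50Gbps and 32-node figures come from the text above, the 600Gbps allowance is just the guess made above rather than anything measured, and the constant and function names are made up for illustration. It also assumes throughput scales perfectly linearly with node count, which real clusters are unlikely to manage.

```python
# Back-of-the-envelope cluster throughput, using the figures quoted in the post.
PER_APPLIANCE_GBPS = 50    # top end appliance throughput
MAX_NODES = 32             # cluster size limit from the release notes
ASSUMED_LOSS_GBPS = 600    # rough allowance for real-world traffic patterns (a guess)


def cluster_throughput_gbps(nodes, per_node_gbps=PER_APPLIANCE_GBPS):
    """Theoretical aggregate throughput, assuming perfectly linear scaling."""
    return nodes * per_node_gbps


theoretical = cluster_throughput_gbps(MAX_NODES)
realistic = theoretical - ASSUMED_LOSS_GBPS

print(f"theoretical: {theoretical / 1000:.1f} Tbps")      # 1.6 Tbps
print(f"after allowance: {realistic / 1000:.1f} Tbps")    # 1.0 Tbps
```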

It seems that, at least compared to the Crescendo model, the Citrix model is a lot more like a traditional cluster, probably a lot more like a VIPRION design:

The NetScaler cluster uses Equal Cost Multiple Path (ECMP), Linksets (LS), or Cluster Link Aggregation Group (CLAG) traffic distribution mechanisms to determine the node that receives the traffic (the flow receiver) from the external connecting device. Each of these mechanisms uses a different algorithm to determine the flow receiver.

Citrix Netscaler Traffic Flow

The flow reminds me a lot of the 3PAR cluster design actually.
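To make the "flow receiver" idea a bit more concrete, here is a minimal sketch of hash-based flow distribution, which is the general idea behind ECMP-style balancing: the upstream device hashes fields of each packet's header so that every packet in a flow lands on the same node. This is not Citrix's actual algorithm. The function name, the choice of SHA-256, and the node names are all assumptions for illustration.

```python
import hashlib


def pick_flow_receiver(src_ip, src_port, dst_ip, dst_port, proto, nodes):
    """Hash the 5-tuple so every packet of a flow maps to the same node."""
    key = f"{src_ip}:{src_port}|{dst_ip}:{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and map it onto the node list.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical 4-node cluster and a single client flow.
nodes = [f"node-{i}" for i in range(4)]
flow = ("203.0.113.10", 51514, "198.51.100.5", 443, "tcp")

first = pick_flow_receiver(*flow, nodes)
second = pick_flow_receiver(*flow, nodes)
print(first, first == second)  # same node both times
```

A plain modulo like this reshuffles most flows whenever a node is added or removed. That is one reason real implementations tend to use something more elaborate.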

My Thoughts on Netscaler

My experience so far with the Netscalers is mixed. Some things I really like, such as an integrated mature SSL VPN (note I said mature! well at least for Windows – nothing for Linux, and their Mac client is buggy and incomplete), application aware MySQL and DNS load balancing, and a true 64-bit multithreaded, shared memory design. I also really like their capacity on demand offering. These boxes are always CPU bound, so to have the option to buy a technically lower end box with the same exact CPU setup as a higher end box (that is rated for 2x the throughput) is really nice. It means I can turn on more of those CPU heavy features without having to fork over the cash for a bigger box.

Citrix nCore

Meanwhile, at least last I checked, F5 was for the most part still operating on 32-bit TMOS (on top of 64-bit Linux kernels), leveraging a multi process design instead of a multi threaded design. So they were forced to add some hacks to load balance across multiple CPUs in the same physical load balancer in order to get the system to scale more (and there have been limitations over the years as to what could actually be distributed over multiple cores and what features were locked to a single core – as time has gone on they have addressed most of those that I am aware of). One in particular I remember (which may be fixed now, I’m not sure – if it was, I would be curious to know how they fixed it) was that each CPU core had its own local memory with no knowledge of other CPUs, which means when doing HTTP caching each CPU had to cache the content individually – massively duplicating the cache and slashing the effectiveness of the memory you had on the box. This was further compounded by the 32-bitness of TMM itself and its limited ability to address larger amounts of memory.
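A simplified model shows how much that duplication can cost. The numbers are hypothetical, and it assumes the worst case where every worker process ends up caching the same hot objects, so it is an upper bound on the waste rather than a description of any real box.

```python
def effective_cache_gb(total_cache_gb, workers, shared):
    """Unique content that fits in cache under a simplified worst-case model."""
    if shared:
        # One cache visible to all workers: every byte holds unique content.
        return total_cache_gb
    # Private caches: each worker gets a slice and fills it with its own copy
    # of the same hot objects, so only one slice's worth of content is unique.
    return total_cache_gb / workers


# Hypothetical box: 16GB set aside for HTTP caching, 8 worker processes.
print(effective_cache_gb(16, 8, shared=False))  # 2.0
print(effective_cache_gb(16, 8, shared=True))   # 16
```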

In any case the F5 design is somewhat arcane; they chose to bolt on software features early on instead of re-building the core. The strategy seems to have paid off from a market share and profits standpoint, though; it’s just that from a technical standpoint it’s kinda lame 🙂

To be fair, there are some features available in the older legacy code that are not available in the multi threaded Citrix Netscaler.

Things I don’t like about the Netscaler include their Java GUI, which is slow as crap (they are working on an HTML5 GUI – maybe that is in v10?). I mean it can literally take about 4 minutes to load up all of my service groups (Citrix term for F5 Pools). On F5 I can load them in about half a second. I think the separation of services with regards to content switching on Citrix is, well, pretty weird to say the least. If I want to do content filtering I have to have an internal virtual server and an external virtual server; the external one does the content filtering and forwards to the internal one. With F5 it was all in one (same for Zeus too). The terminology has been difficult to adjust to vs my F5 (and some Zeus) background.

I do miss the Priority Activation feature F5 has, there is no similar feature on Citrix as far as I know (well I think you can technically do it but the architecture of the Netscaler makes it a lot more complex). This feature allows you to have multiple pools of servers within a single pool at different priorities. By default the load balancer sends to the highest (or lowest? I forgot, it’s been almost 2 years) group of servers, if that group fails then it goes to the next, and the next. I think you can even specify the minimum number of nodes to have in a group before it fails over entirely to the next group.
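The behaviour described above can be sketched roughly as follows. This is only an illustration of the idea as recalled here, not F5's implementation: it assumes a higher number means higher priority, and that a group falling below its minimum hands traffic over entirely to the next group. The `Member` class, `active_members` function and server names are all made up.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    priority: int        # assumption: higher number = preferred group
    healthy: bool = True


def active_members(members, min_active=1):
    """Return the healthy members of the best group that still meets the minimum."""
    for priority in sorted({m.priority for m in members}, reverse=True):
        group = [m for m in members if m.priority == priority and m.healthy]
        if len(group) >= min_active:
            return group
    return []  # no group has enough healthy members


pool = [
    Member("web1", 10), Member("web2", 10), Member("web3", 10),
    Member("standby1", 5), Member("standby2", 5),
]

print([m.name for m in active_members(pool, min_active=2)])  # ['web1', 'web2', 'web3']

# Two primaries fail, so the primary group drops below the minimum of 2.
pool[0].healthy = False
pool[1].healthy = False
print([m.name for m in active_members(pool, min_active=2)])  # ['standby1', 'standby2']
```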

Not being able to re-use objects with the default scripting language just seems brain dead to me, so I am using the legacy scripting language.

So I do still miss F5 for some things, Zeus for some other things, though Netscaler is pretty neat in its own respects. F5 obviously has a strong presence where I spent the last decade of my life in and around Seattle, being that it was founded and has its HQ in Seattle. Still have a buncha friends over there. Some pretty amazing stories I’ve heard come out of that place; they grew so fast, it’s hard to believe they are still in one piece after all they’ve been through, what a mess!

If you want to futz around with a Netscaler you have the option of downloading their virtual appliance (VPX) for free – I believe it has a default throughput limit of 1Mbit, with upgrades to as high as 3Gbps. The VPX is limited to two CPU cores though, last I recall. F5 and A10 have virtual appliances as well.

Crescendo did not have a virtual appliance, which is one of the reasons I wasn’t particularly interested in pursuing their offering back when they were around. The inside story of the collapse of Crescendo is the stuff geek movies are made out of. I won’t talk about it here but it was just amazing to hear what happened.

The load balancing market is pretty interesting when you look at the different regions and where various players are stronger vs weaker. Radware for example is apparently strong over on the east coast but has much less presence in the west. Citrix did a terrible job marketing the Netscaler for many years (a point they acknowledged to me). Then there are those folks out there that still use Cisco (?!), which just confuses me. Then there are the smaller guys like A10, Zeus, Brocade – Foundry Networks (acquired by Brocade, of course) really did themselves a disservice when they let their load balancing technology sit for a good five years between hardware refreshes, and they haven’t been able to recover from that from what I’ve seen/heard. They tried to pitch me their latest iron a couple of years ago after it came out – only for me to find out that it didn’t support SSL at the time – I mean come on. Of course they later fixed that lack of a feature but it was too late for my projects.

And in case you didn’t know – Extreme used to have a load balancer WAY BACK WHEN. I never used it. I forget what it’s called off the top of my head. Extreme also partnered with F5 in the early days and integrated F5 code into their network chipsets so their switches could do load balancing too (the last switch that had this was released almost a decade ago – nothing since). Though the code in the chipsets was very minimal and not useful for anything serious.

Comments (11)


You know what’s retarded about Netscaler VPX? They limit it to only 1 vCPU! See their “Citrix NetScaler VPX Getting Started Guide” and “NETSCALER VPX FAQ” PDFs for confirmation.
This is both for Xen and ESXi, so it’s not like they’d want to cripple competing virtualization platform.
How idiotic can they be?! People are buying multi-cpu-multi-core servers and you can only use 1 vCPU?!

Comment by Mxx — July 5, 2012 @ 8:46 am
Two vCPUs is the limit, but I totally agree it is a stupid limitation, they should be unlimited CPUs and just charge you for the bandwidth like they do already. I hardly think 2vCPUs is enough to fully drive a VPX-3000.

thanks for the comment!

Comment by Nate — July 5, 2012 @ 10:18 am
Actually, that’s not the case. NetScaler v10 allows for multiple CPUs, up to 4 packet engines and 1 management CPU. This is more than enough to drive a VPX 3000 (ie as many CPUs as are in the hardware appliances) Of course, if you want to do SSL, it’s not enough, but that is what hardware is for.

Comment by Kit — October 11, 2012 @ 9:03 am
Even Netscaler v9 can drive multiple CPUs – I believe at the high end they go to 12 cores on the 17500 and above.

But the VPX is handicapped with support for only 2 CPUs. I have an email from Citrix themselves in June of 2011 that says that, perhaps things have changed since? You can certainly put more vCPUs on the VM but it won’t use them.

If you have some evidence that shows otherwise I’d love to see it.

thanks for the comment!

Comment by Nate — October 11, 2012 @ 9:36 am
My thoughts.
VPX:
VPX-3000 can push 3G of traffic. Of course depends on what you are doing. SSL of course would be out of the question and I imagine compression and app firewall would be as well since these amounts also change on the hardware. I know 2vCPUs is the minimum but I dont think with V10 it is the maximum any more.
Honestly, why in the world would you not go to a hardware device if you are looking at that much throughput? Seriously?
When you compare this to other virtual load balancers check out the things they can/can’t do. I did and was surprised at the differences in features and configurations

Clustering:
I have done a lot of research into this and have found that they do offer true clustering. Meaning active-active-active-active etc. So the maximum (TB+) is actually to a single IP address. F5 calls what they do clustering but the devil is in the details. 1 IP can only be active on 1 system at a time. so they are for lack of a better term active-passive-standby.
Who needs that much traffic outside of the big .coms? I dont know, I dont. Regardless it is very cool and does point out to me that they are staying on top of the game.

Viprion:
I saw an article from some marketing director at F5 saying it really isnt virtualization. I need to look into that some more to make sure it meets our security needs before really looking at it.

Obviously I am biased. I dumped F5 for NetScaler when they went to v9. v9 truly and severely blew duck. We were an F5 shop for years but that ruined them for me. Just do your research and dont just look at marketing bs or opinion articles. Interesting stuff but I am not going to stake my career on anything before I really research it and find the facts. Not just max traffic because I will never get near there. I am more concerned about bugs, new features, support (can you believe F5 doesnt support irules other than a syntax check), etc. All of the things that can affect me. I am selfish that way.

Comment by Oolankaloophid — October 11, 2012 @ 9:52 am
I think with F5 it changes a bit with their “Super VIP” concept at least on the VIPRION (I don’t think Super VIPs are on their other platforms). You can still do real active-active on F5 but you need some DNS load balancing love to do multiple IPs, or you can just advertise both IPs via DNS (advertising more than 1 IP is a “best practice” from what I’ve heard, not specifically for this reason but because of the built in poor man’s failover in web browsers (something that happens within the browser, doesn’t apply to other applications)).

I agree if you’re doing 3gigs of traffic it’s best to be in hardware – but for me the point isn’t the throughput, it’s the CPU cycles. If I can get more CPU cycles to do more CPU intensive things (application firewall, whatever), that of course lowers the capacity of the box from a throughput perspective, so it’d be nice to have extra CPUs to do that even knowing you won’t get near 3Gbps of traffic.

For the same reason I really like the Netscaler appliances that can be upgraded with a software key because they come with all of the CPUs unlocked.

I was a F5 user for about 7 years, deployed my first Netscaler almost a year ago. They are both good platforms for different reasons. There’s a lot of cool stuff in Netscaler, though there’s also a lot of annoying things too (e.g. unable to use a policy with more than one resource when using the new scripting language (workaround is to use the old so I do that)). The whole web UI on Netscaler is clunky and very slow (even for my small site if I try to view service groups over the WAN it can take it about 3 minutes to load them). I do miss F5’s “priority activation”.

Netscaler is nice in having real mysql load balancing, real DNS load balancing, the integrated VPN is nice, the architecture of nCore is nice. I was going to say native 64-bit as well but it appears the NSPPE processes are 32-bit (same as F5), even though it runs on a 64-bit OS (same as F5). Though I do like A10’s architecture(on paper at least) better than Citrix or F5. I know F5 has had issues in the past especially with HTTP caching where the “dedicated memory per process” has hurt them (causes objects to be cached multiple times since processes don’t talk to each other), not sure if they found a way around that.

I have exploited Netscalers in more ways than I ever did on F5, though at the same time that resulted in many more support tickets for broken things than I ever had on F5 (more tickets in 1 year than in 10 years on F5). But I suppose if I tried the same with F5 I would have similar problems.

I’m sort of conflicted which I like more, from a end user standpoint F5 is leagues ahead so it’s less frustrating to interact with, from a traffic management standpoint it seems like Citrix is ahead (with the caveat it’s been 2 years since I worked on a F5 box). From a cost standpoint Citrix can be more cost effective, e.g. we bought 7500s for the small company I am at, again for the CPU cores – the closest comparison to F5 is the 3900 which is not in the same league as the 7500 (which has the same processing power as the 9500).

Oh one more thing I do miss on the F5 that Citrix doesn’t have – true out of band management. Most people did not even know it existed on F5 because there was a special process you had to go through to enable it, but there are two computers (which has been the bane of support on occasion for friends that worked there though I never had an issue) in the F5 box, and the management port has two MAC addresses on it, one for in band management the other for out of band, when configured you can get full access even down to hard reset/power cycle/watch POST screen etc. Citrix has dual management ports but unfortunately no way to configure them in a active/passive way (with a shared IP), which would be really nice for higher availability when connecting to multiple switches.

The next time I’m in the market for a load balancer at an organization that doesn’t have an existing investment in either platform – it will be a difficult choice to make between the two.

thanks for the comment!

Comment by Nate — October 11, 2012 @ 10:25 am
Also, a couple of more comments! Great article and sorry I missed it until now.

– V10 indeed does away almost entirely with Java and almost everything is HTML5. It’s much faster and more responsive. Give it a shot on a VPX?
– Content filtering can be done on an LB VIP, but you can also bind it to a CS VIP directly, or you can do the same functions with responder, and bind it either to the LB VIP or the CS VIP.
– NetScaler has a spillover functionality on the Vserver, such that if you’ve reached a particular connection count, you can “spillover” to a backup vserver. There is also the concept of a traditional backup vserver, where the netscaler can redirect to a different vserver when the primary vserver is down. You can do multiple levels of cascading failures by having a backup vserver on the backup vserver.

Comment by Kit — October 11, 2012 @ 3:03 pm
Also, the NetScaler 11500-20500, 17500-21500, and 17550-21550 (as well as the new 8200-8600) NetScalers have a full Lights Out Management card installed.

NetScaler nCore now allows for a shared HTTP cache, which is pretty cool, being that it can be extremely large. (up to 48GB of ram cache)

Comment by Kit — October 11, 2012 @ 3:09 pm
Thanks for the info! Yeah I have seen the backup VSERVER which is kinda similar to priority group though a bit more complex/limited (as in there are additional layers on Netscaler to handle it vs single layer in F5). I haven’t had a chance to play with it yet.

The Citrix SE that we have did mention that the UI was revamped in v10, though he said there’s still a bunch of stuff that uses Java, I am interested to check it out though at some point soon.

I do use CS VIPs, and responders (on the CS VIPs, haven’t tried responders on regular HTTP VIPs). The extra layer of the CS VIP is not something I’m used to vs F5 of course and it is sort of strange to work with, though it has some useful advantages as well, where I can re-route traffic with a policy in a CS VIP to another service group without having to adjust the policy itself (we do this for switching between banks of active vs inactive servers, I just update the service group on the back end Vserver which the CS VIP points to). Before the Netscalers we were using Zeus (in Amazon cloud ugh), and on Zeus from what I recall you basically had to re-write the policy each time you wanted to re-route traffic (if you were using the UI it was just a drop down box to change the service group, from the CLI you had to re-do the policy). I think F5 is similar in that regard though for most of the years I used F5 I made it a point to keep most stuff in layer 4 mode for performance. It wasn’t until probably mid 2009 that I got more in depth with full layer 7, mainly for oneconnect: the servers were able to push about 3,000 requests/second, and without oneconnect the TCP stack on the server side would max out at around 1200/sec (30k+ connections in TIME_WAIT). But with full layer 7 and our application at least, the F5 box performed at about 15-20% of its “on paper” capacity, which was sort of surprising at first (I think most customers would be surprised by that). After some discussions with them and clarifications on request sizes they said that was the “expected” results based on the benchmarks they use. Though of course the numbers they publish are slightly different workloads 🙂

Good to know about the full out of band management, did not know that! thanks for all the useful info

Comment by Nate — October 11, 2012 @ 3:22 pm
it has been so long since I wrote this article that I didn’t realize some of my comments were redundant as they were already in the original post! DOH!

Comment by Nate — October 11, 2012 @ 4:26 pm
[…] I do like the option of being able to have virtual network services, whether it is a load balancer or firewall or something. But those do have limitations that need to be accounted for. Whether it is performance, flexibility (# of VLANs etc), as well as dependency (you may not want to have your VPN device in a VM if your storage shits itself you may lose VPN too!). Managing 30 different load balancers may in fact be significantly more work(I'd wager it is- the one exception is service provider model where you are delegating administrative control to others – which still means more work is involved it is just being handled by more staff) than managing a single load balancer that supports 30 applications. […]

Pingback by Nth Symposium 2013 Keynote: SDN « TechOpsGuys.com — August 7, 2013 @ 5:02 pm