TechOpsGuys.com Diggin' technology every day

December 2, 2009

Extremely Simple Redundancy Protocol

Filed under: Networking — Tags: , , , — Nate @ 7:31 am

"Extremely Simple Redundancy Protocol": that is what I have started calling ESRP, at least. The official designation is Extreme Standby Router Protocol. It's one of the main reasons, if not the main reason, I prefer Extreme switches at the core of any Layer 3 network. I'll try to explain why here, because Extreme really doesn't spend any time promoting this protocol (I'm still pushing them to change that).

I've deployed ESRP at two different companies in the past five years.

What are two basic needs of any modern network?

  1. Layer 2 loop prevention
  2. Layer 3 fault tolerance

Traditionally these are handled by separate protocols that are completely oblivious to one another, usually some form of STP/RSTP plus VRRP (or maybe HSRP if you're crazy). There have also been long-standing interoperability issues between the various implementations of STP over the years, which further complicates things because STP often needs to run on every network device for it to work right.

With ESRP life is simpler.

Advantages of ESRP include:

  • Collapsing of layer 2 loop prevention and layer 3 fault tolerance(with IP/MAC takeover) into a single protocol
  • Can run in either layer 2 only mode, layer 3 only mode or in combination mode(default).
  • Sub second convergence/recovery times.
  • Eliminates the need to run protocols of any sort on downstream network equipment
  • Virtually all downstream devices supported. Does not require an Extreme-only network; fully interoperable with other vendors like Cisco, HP, Foundry, Linksys, Netgear, etc.
  • Supports both managed and unmanaged downstream switches
  • Able to override loop prevention on a per-port basis (e.g. hook a firewall or load balancer directly to the core switches and trust that they handle loop prevention themselves in active/failover mode)
  • The “who is master?” question is determined by setting an ESRP priority, a number from 0-254; 255 is reserved for forcing a switch into standby state.
  • Set up from scratch in as little as three commands(for each core switch)
  • Protect a new vlan with as little as 1 command (for each core switch)
  • Only one IP address per vlan needed for layer 3 fault tolerance(IP-based management provided by dedicated out of band management port)
  • Supports protecting up to 3000 vlans per ESRP instance
  • Optional “load balancing” by running the core switches in active-active mode, with some vlans mastered on one switch and the rest on the other.
  • Additional failover based on tracking of pings, route table entries or vlans.
  • For small to medium sized networks you can use a pair of X450A(48x1GbE) or X650(24x10GbE) switches as your core for a very low priced entry level solution.
  • Mature protocol. I don’t know exactly how old it is, but doing some searches indicates at least 10 years old at this point
  • Can provide significantly higher overall throughput vs ring based protocols(depending on the size of the ring), as every edge switch is directly connected to the core.
  • Nobody else in the industry has a protocol that can do this. If you know of another protocol that combines layer 2 and layer 3 into a single protocol, let me know. For a while I thought Foundry's VSRP was it, but it turns out that is mainly layer 2 only. I swear I read a PDF that talked about limited layer 3 support in VSRP back in the 2004/2005 time frame, but not anymore. I haven't spent the time to work out the use cases separating VSRP from Foundry's MRP, which sounds similar to Extreme's EAPS, a layer 2 ring protocol heavily promoted by Extreme.

Downsides to ESRP:

  • Extreme Proprietary protocol. To me this is not a big deal as you only run this protocol at the core. Downstream switches can be any vendor.
  • Perceived complexity due to the wide variety of options; but they are optional, and a basic configuration is simple to set up and should work fine for most people.
  • The default election algorithm includes port weighting, which can be good or bad depending on your point of view. Port weighting means that if you have an equal number of active links of the same speed on each core switch and the master switch has a link go down, the network will fail over. If you have non-switches connected directly to the core (e.g. a firewall) I will usually disable port weighting on those specific ports so I can reboot the firewall without causing the core network to fail over. I like port weighting myself, viewing it as the network trying to maintain its highest level of performance/availability. That is, who knows why that port was disconnected: bad cable? bad ASIC? bad port? Fail over to the other switch that has all of its links in a healthy state.
  • Not applicable to all network designs(is anything?)

The optimal network configuration for ESRP is very simple: two core switches cross connected to each other (with at least two links), plus a number of edge switches, each with at least one link to each core switch. You can have as few as three switches in your network, or you can have several hundred; the practical maximum is however many you can connect to your core switches, which today is something like 760 edge switches using high density 1GbE ports on a Black Diamond 8900, plus 8x1Gbps ports for the cross connect.

ESRP Mesh Network Design
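
That 760 figure is just port arithmetic. Here is a quick sanity check, a sketch based on my own assumptions (eight usable I/O slots per chassis and the 96-port 1GbE modules mentioned later in this post), not anything out of Extreme's documentation:

#!/bin/bash
# Rough check of the "~760 edge switches" figure above.
# Assumptions (mine): 8 usable I/O slots per Black Diamond 8900-class chassis,
# the 96-port 1GbE modules mentioned later in this post, and 8x1Gbps ports
# reserved for the cross connect between the two cores.
SLOTS=8
PORTS_PER_SLOT=96
CROSS_CONNECT=8

TOTAL_PORTS=$((SLOTS * PORTS_PER_SLOT))          # 768 x 1GbE per core switch
EDGE_SWITCHES=$((TOTAL_PORTS - CROSS_CONNECT))   # one uplink per edge switch
echo "Ports per core switch: ${TOTAL_PORTS}"
echo "Edge switches supported (one uplink each): ${EDGE_SWITCHES}"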

ESRP Domains

ESRP uses a concept of domains to scale itself. A single switch is master of a particular domain, which can include any number of vlans up to 3,000. Health packets are sent for the domain itself rather than for the individual vlans, which dramatically simplifies things and makes them more scalable at the same time.

This does mean that if there is a failure in one vlan, all of the vlans for that domain will fail over, not just that one specific vlan. You can configure multiple domains if you want; I configure my networks with one domain per ESRP instance. Multiple domains can come in handy if you want to distribute the load between the core switches. A vlan can be a member of only one ESRP domain (I expect; I haven't tried to verify).

Layer 2 loop prevention

The way ESRP loop prevention works is the links going to the slave switch are placed in a blocking state, which eliminates the need for downstream protocols and allows you to provide support for even unmanaged switches transparently.

Layer 3 fault tolerance

Layer 3 fault tolerance in ESRP operates in two different modes depending on whether or not the downstream switches are Extreme. By default it assumes they are; you can override this behavior on a per-port basis. In an all-Extreme network ESRP uses EDP [Extreme Discovery Protocol] (similar to Cisco's CDP) to inform downstream switches that the core has failed over and to flush their forwarding entries for the core switch.

If the downstream switches are not Extreme switches and you decide to leave the core switch in the default configuration, it will likely take some time (seconds to minutes) for those switches to expire their forwarding table entries and discover that the network has changed.

Port Restart

If you know you have downstream switches that are not Extreme, I suggest for best availability configuring the core switches to restart the ports those switches are on. Port restart is an ESRP feature that causes the core switch to reset the links on the ports you configure, to force those switches to flush their forwarding tables. This takes more time than fail over in an all-Extreme network, but not much: in my own tests, specifically with older Cisco layer 2 switches, F5 BigIP v9 and Cisco PIX, the process takes less than one second (if you have a ping session going when you trigger a fail over event, rarely is a single ping lost).

Host attached ports

If you are connecting devices like a load balancer, or a firewall directly to the switch, you typically want to hand off loop prevention to those devices, so that the slave core switch will allow traffic to traverse those specific ports regardless of the state of the network. Host attached mode is an ESRP feature that is enabled on a per-port basis.

Integration with ELRP

ESRP does not protect you from every type of loop in the network; by design it is intended to prevent a loop from occurring between the edge switches and the two core switches. If someone plugs an edge switch back into itself, for example, that will still cause a loop.

ESRP integrates with another Extreme specific protocol named ELRP or Extreme Loop Recovery Protocol. Again I know of no other protocol in the industry that is similar, if you do let me know.

What ELRP does is send packets out on the ports you configure and count the responses; if there are more than it expects, it treats that as a loop. There are three modes to ELRP (this is getting a bit off topic but is still related). The simplest is one-shot mode, where you have ELRP send its packets once and report. The second is periodic mode, where you configure the switch to send packets periodically (I usually use a 10 second interval or so) and it logs any loops it detects, telling you specifically which ports the loops originate on.

The third mode is integrated mode, which is how ELRP relates to ESRP. Myself, I don't use integrated mode and suggest you don't either, at least if you follow an architecture like mine. In integrated mode, if a loop is detected ELRP tells ESRP to fail over, hoping that the standby switch has no such loop. In my setups the entire network is flat, so if a loop is detected on one core switch, chances are extremely (no pun intended) high that the same loop exists on the other switch, and there is no point in failing over. But I still configure all of my Extreme switches (both edge and core) with ELRP in periodic mode, so if a loop occurs I can track it down more easily.

Example of an ESRP configuration

We will start with this configuration:

  • A pair of Summit X450A-48T switches as our core
  • 4x1Gbps trunked cross connects between the switches (on ports 1-4)
  • Two downstream switches, each with 2x1Gbps uplinks on ports 5,6 and 7,8 respectively which are trunked as well.
  • One VLAN named “webservers” with a tag of 3500 and an IP address of 10.60.1.1
  • An ESRP domain named esrp-prod

The non ESRP portion of this configuration is:

enable sharing 1 grouping 1-4 address-based L3_L4
enable sharing 5 grouping 5-6 address-based L3_L4
enable sharing 7 grouping 7-8 address-based L3_L4
create vlan webservers
config webservers tag 3500
config webservers ipaddress 10.60.1.1 255.255.255.0
config webservers add ports 1,5,7 tagged

What this configuration does

  • Creates a port sharing group(802.3ad) grouping ports 1-4 into a virtual port 1.
  • Creates a port sharing group(802.3ad) grouping ports 5-6 into a virtual port 5.
  • Creates a port sharing group(802.3ad) grouping ports 7-8 into a virtual port 7.
  • Creates a vlan named webservers
  • Assigns tag 3500 to the vlan webservers
  • Assigns the IP 10.60.1.1 with the netmask 255.255.255.0 to the vlan webservers
  • Adds the virtual ports 1,5,7 in a tagged mode to the vlan webservers

The ESRP portion of this configuration is:

create esrp esrp-prod
config esrp-prod add master webservers
config esrp-prod priority 100
config esrp-prod ports mode 1 host
enable esrp

The only difference between the master and the slave is the priority. Within the 0-254 range, higher numbers mean higher priority; 255 is reserved for putting the switch in standby state. (A short sketch showing the slave-side difference follows the list below.)

What this configuration does

  • Creates an ESRP domain named esrp-prod.
  • Adds a master vlan to the domain (I believe the master vlan carries the control traffic)
  • Configures the switch for a specific priority [optional – I highly recommend doing it]
  • Enables host attach mode for port 1, which is a virtual trunk for ports 1-4. This allows traffic for potentially other host attached ports on the slave switch to traverse to the master to reach other hosts on the network. [optional – I highly recommend doing it]
  • Enables ESRP itself (you can use the command show esrp at this point to view the status)
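
To make that master/slave difference concrete, here is a minimal sketch that prints the ESRP commands from the example for both core switches, with only the priority changing (the slave priority of 50 is just an arbitrary lower number I picked for illustration):

#!/bin/bash
# Emit the ESRP configuration from the example above for both core switches.
# Only the priority differs; 100 and 50 are example values (higher wins).
for role in master:100 slave:50; do
        NAME=${role%%:*}
        PRIORITY=${role##*:}
        echo "### ${NAME} core switch ###"
        echo "create esrp esrp-prod"
        echo "config esrp-prod add master webservers"
        echo "config esrp-prod priority ${PRIORITY}"
        echo "config esrp-prod ports mode 1 host"
        echo "enable esrp"
        echo
done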

Protecting additional vlans with ESRP

It is a simple one-line command on each core switch. Extending the example above, say you added a vlan named appservers with its associated parameters and wanted to protect it; the command is:

config esrp-prod add member appservers

That’s it.

Gotchas with ESRP

There is only one gotcha specific to ESRP that I can think of off hand. I believe it is a bug; I reported it a couple of years ago (code rev 11.6.3.3 and earlier, current code rev is 12.3.x) and I don't know if it is fixed yet. If you have port restart configured on ports on your switches and you add a vlan to your ESRP domain, those links get restarted (as expected). What is not expected is that this causes the network to fail over: for a moment the port weighting kicks in, detects link failure and forces the switch to a slave state. I think the software could be aware of why the ports are going down and not drop to a slave state.

Somewhat related, again involving port weighting: if you are connecting a new switch to the network and happen to connect it to the slave switch first, port weighting will kick in, since the slave switch now has more active ports than the master, and will trigger ESRP to fail over.

The workaround to this, and in general a good practice with ESRP anyway, is to put the slave switch in a standby state when you are doing maintenance on it; this prevents any unintentional network fail overs from occurring while you're messing with ports/vlans etc. You can do this by setting the ESRP priority to 255. Just remember to put it back to a normal priority after you are done. Even in a standby state, ports that are in host attached mode (again, e.g. firewalls or load balancers) are not impacted by any ESRP state changes.
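
Written out as a throwaway helper, the routine I mean looks something like this (a sketch using only the priority command shown earlier; adjust the domain name and the normal priority to match your own setup):

#!/bin/bash
# Print the ESRP commands for pulling the slave core out of the election
# before maintenance and putting it back afterwards. The domain name and
# the "normal" priority are taken from the example earlier in this post.
DOMAIN="esrp-prod"
NORMAL_PRIORITY=100

case "$1" in
  start)
        echo "config ${DOMAIN} priority 255"                 # 255 = standby, never elected master
        ;;
  finish)
        echo "config ${DOMAIN} priority ${NORMAL_PRIORITY}"  # back to the normal election priority
        ;;
  *)
        echo "usage: $0 start|finish" >&2
        exit 1
        ;;
esac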

Sample Modern Network design with ESRP

Switches:

  • 2 x Extreme Networks Summit X650-24t with 10GbaseT for the core
  • 22 x Extreme Networks Summit X450A-48T, each with an XGM2-2xn expansion module providing 2x10GbaseT uplinks, for 1,056 ports of highest-performance edge connectivity (optionally select the X450e for lower cost, or the X350 for lowest cost edge connectivity; feel free to mix and match, all of them use the same 10GbaseT uplink module).

Cross connect the X650 switches to each other using 2x10GbE links with CAT6A UTP cable. Connect each of the edge switches to each of the core switches with CAT5e/CAT6/CAT6a UTP cable. Since we are working at 10Gbps speeds there is no link aggregation/trunking needed for the edge (there is still aggregation between the core switches), simplifying configuration even further.

Is a thousand ports not enough? Break out the 512Gbps stacking on the X650 and add another pair of X650s; your configuration changes to include:

  • Two core stacks, each of 2 x Extreme Networks Summit X650-24t switches, with a 512Gbps stacking interconnect (which exceeds the backplane performance of many chassis switches).
  • 46 x 48-port edge switches providing 2,208 ports of edge connectivity.

Two thousand ports not enough, really? You can go further, though the stacking interconnect performance drops in half: add another pair of X650s and your configuration changes to include:

  • Two core stacks, each of 3 x Extreme Networks Summit X650-24t switches, with a 256Gbps stacking interconnect (which still exceeds the backplane performance of many chassis switches).
  • 70 x 48-port edge switches providing 3,360 ports of edge connectivity.

The maximum number of switches in an X650 stack is eight, but my personal preference with this sort of setup is not to go beyond three per stack. There is only so much horsepower available for routing and management, and when you're talking about more than three thousand ports connected to a core I just feel more comfortable recommending a bigger switch beyond that point.
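
The edge counts in the configurations above are straight port arithmetic: each edge switch burns one 10GbE port on each core stack, and a couple of ports per stack are kept for the cross connect. A quick sketch of that math, using the same numbers as the examples:

#!/bin/bash
# Edge capacity for a pair of X650 core stacks: each edge switch uses one
# 10GbE port per stack, and 2 ports per stack are reserved for the cross
# connect between the cores.
PORTS_PER_SWITCH=24
CROSS_CONNECT_PER_STACK=2
EDGE_PORTS_PER_SWITCH=48

for MEMBERS in 1 2 3; do
        PER_STACK=$((MEMBERS * PORTS_PER_SWITCH - CROSS_CONNECT_PER_STACK))
        echo "${MEMBERS} X650(s) per stack: ${PER_STACK} edge switches," \
             "$((PER_STACK * EDGE_PORTS_PER_SWITCH)) edge ports"
done

That reproduces the 22/1,056, 46/2,208 and 70/3,360 figures above.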

Take a look at the Black Diamond 8900 series switch modules on the 8800 series chassis. It is a more traditional, chassis based core switch. The 8900 series modules are new, providing high density 10GbE and even high density 1GbE (96 ports per slot). It does not support 10GbaseT at the moment, but I'm sure that support isn't far off; it does offer a 24-port 10GbE line card with SFP+ ports (there is an SFP+ variant of the X650 as well). I believe the 512Gbps stacking between a pair of X650s is faster than the backplane interconnect on the Black Diamond 8900, which is between 80-128Gbps per slot depending on the size of the chassis (this performance is expected to double in 2010). While the backplane is not as fast, the CPUs are much faster and there is a lot more memory for routing/management tasks than is available on the X650.

The upgrade process for going from an X650-based stack to a Black Diamond based infrastructure is fairly straightforward. They run the same operating system and use the same configuration files. You can take down your slave ESRP switch, copy the configuration to the Black Diamond, re-establish all of the links and then repeat the process with the master ESRP switch. You can do all of this with approximately one second of combined downtime.

So I hope this posting shows, in part, what draws me to the Extreme portfolio of products. It's not just the hardware or the lower cost, but the unique software components that tie it all together. In fact, as far as I know Extreme doesn't even make their own network chipsets anymore; I think the last one was in the Black Diamond 10808 released in 2003, which is a high end FPGA-based architecture (they call it programmable ASICs, which I suspect means high end FPGAs, but I'm not certain). They primarily (if not exclusively) use Broadcom chipsets now. They've used Broadcom in their Summit series for many years, and their decision to stop making their own chips is interesting in that it lowers their costs quite a bit. Their software is also modular enough to adapt to many configurations (e.g. the Black Diamond 10808 uses dual Pentium III CPUs, while the Summit X450 series uses ARM-based CPUs, I think).

November 24, 2009

Legacy CLI

Filed under: Networking — Tags: , — Nate @ 5:05 pm

One of the bigger barriers to adoption of new equipment often revolves around the user interface: if people have to adapt to something radically different, some of them naturally will resist. In the networking world, switches in particular, Extreme Networks has been brave enough to go against the grain, toss out the legacy UI and start from scratch (they did this more than a decade ago), while most other companies out there tried to make their systems look and feel like Cisco, for somewhat obvious reasons.

Anyways, I've always thought highly of them for doing that: don't do what everyone else is doing just because they are doing it that way, do it better (if you can). I think they have accomplished that. Their configuration is almost readable in plain English, and the top level commands are somewhat similar to 3PAR's in some respects:

  • create
  • delete
  • configure
  • unconfigure
  • enable
  • disable

Want to add a vlan? create vlan. Want to configure that vlan? configure vlan (or config vlan for short, or config <vlan name> for shorter). Want to turn on sFlow? enable sflow. You get the idea. There are of course many other commands, but the bulk of your work is spent with these. You can actually log in to an Extreme XOS-based switch that is on the internet; instructions are here. It seems to be a terminal server connecting you to the serial port, since you can do things like reboot the switch and wipe out the configuration and you don't lose connectivity or anything. If you want a more advanced online lab they have them, but they are not freely accessible.

Anyways, back on topic: the legacy CLI. I first heard rumors of this about five years ago when I was looking at getting (and eventually did get) a pair of Black Diamond 10808 switches, which at the time was the first and only switch that ran Extremeware XOS. Something interesting I learned recently, which I had no idea was the case, is that Extremeware XOS is entirely XML based. I knew the configuration file was XML based, but they take it even further than that: commands issued on the CLI are translated into XML objects and submitted to the system transparently. Which I thought was pretty cool.

About three years ago I asked them about it again, and the legacy CLI project had been shelved, they said, due to lack of customer interest. But now it's back, and it's available.

Now, really back on topic. The reason for this legacy CLI is so that people who are used to the 30+ year old broken UI that others like Cisco use can have something similar on Extreme if they really want it. At the least it should smooth out a migration to the more modern UI and concepts associated with Extremeware XOS (and Extremeware before it), an operating system that was built from the ground up with layer 3 services in mind (and the UI experience shows it). XOS was also built from the ground up (first released to production in December 2003) to support IPv6. I'm not a fan of IPv6 myself, but that's another blog entry.

It's not complete yet; right now it's limited to most of the layer 2 functions of the switch, and the layer 3 stuff is not implemented at this point. I don't know if it ever will be implemented; I suppose it depends on customer feedback. But anyways, if you have a hard time adjusting to a more modern world, this is available for use. The user guide is here.

If you are like me and like reading technical docs, I highly recommend the Extremeware XOS Concepts Guide. There's so much cool stuff in there I don't know where to begin, and it's organized so well! They really did an outstanding job on their docs.

81,000 RAID arrays

Filed under: Storage,Virtualization — Tags: , , — Nate @ 2:56 pm

I keep forgetting to post about this; I find this number interesting myself. It is the number of mini RAID arrays on my storage system, which has 200 spindles, so it comes out to about 400 RAID arrays per disk! Why so many? It allows for maximum distribution of storage space and I/O across the system, as well as massively parallel RAID rebuilds, since every disk in the system participates when a disk fails; that leads to faster rebuild times and much better service times during rebuild.

While 3PAR talks a lot about their mini RAID arrays (composed of virtual 256MB disks), it turns out there really isn't an easy way to query how many there are; I suppose because they expect it to be so abstracted that you should not care. But I like to poke around, if you haven't noticed already!

The little script to determine this number is:

#!/bin/bash
# Count the micro RAID arrays on a 3PAR box by walking every logical disk
# (showld -d over ssh) and dividing each LD's raw size by the 256MB
# chunklet size and its RAID set size.

export ARRAYS_TOTAL=0
export ARRAY="mrt"                  # array hostname (the ssh target)
echo "showld -d" | ssh $ARRAY | grep cage | while read line;
do
        export RAWSIZE=`echo $line | awk '{print $7}'`;   # raw LD size in MB
        export LD_NAME=`echo $line | awk '{print $2}'`;   # logical disk name
        export SETSIZE=`echo $line | awk '{print $10}'`;  # RAID set size (256MB chunklets per micro array)
        export ARRAYS=`echo "${RAWSIZE}/256/${SETSIZE}" | bc`;
        export ALL_ARRAYS=`echo "${ARRAYS_TOTAL}+${ARRAYS}" | bc `;
        export ARRAYS_TOTAL="$ALL_ARRAYS"; echo "NAME:${LD_NAME} Raw Size:${RAWSIZE}  Set Size:${SETSIZE} Micro arrays in LD:${ARRAYS}  Total Micro arrays so far:${ALL_ARRAYS}";
done

Hopefully my math is right..

Output looks like:

NAME:log2.0 Raw Size:40960  Set Size:2 Micro arrays in LD:80  Total Micro arrays so far:80
NAME:log3.0 Raw Size:40960  Set Size:2 Micro arrays in LD:80  Total Micro arrays so far:160
NAME:pdsld0.0 Raw Size:49152  Set Size:2 Micro arrays in LD:96  Total Micro arrays so far:256
[..]
NAME:tp-7-sd-0.242 Raw Size:19968  Set Size:6 Micro arrays in LD:13  Total Micro arrays so far:81351
NAME:tp-7-sd-0.243 Raw Size:19968  Set Size:6 Micro arrays in LD:13  Total Micro arrays so far:81364

Like the mini RAID arrays the logical disks (the command above is showld or show logical disks) are created/maintained/deleted automatically by the system, another layer of abstraction that you really never need to concern yourself with.

The exact number is 81,364 which is up from about 79,000 in June of this year. To me at least it’s a pretty amazing number when I think about it, 80,000+ little arrays working in parallel, how does the system keep track of it all?

3PAR isn't unique in this technology, though I think maybe they were first. I believe Compellent has something similar, and Xiotech constructs RAID arrays at the disk head level, which I find very interesting; I didn't know it was possible to "target" areas of the disk as specifically as the head. Of these three implementations, though, I think the 3PAR one is the most advanced because it's implemented in hardware (Compellent's is software) and it's much more granular (400 per disk in this example; Xiotech would have up to 8 per disk I think).

The disks are not full yet either, they are running at about ~87% of capacity, so maybe room for another 10,000 RAID arrays on them or something..

I learned pretty quick there’s a lot more to storage than just the number/type of disks..

(Filed under virtualization as well since this is a virtualized storage post)

November 20, 2009

Enterprise SATA disk reliability

Filed under: Storage — Tags: , — Nate @ 7:55 am

..even I was skeptical, though I knew that with support it probably wouldn't be a big deal: a disk fails and it gets replaced in a few hours. When we were looking to do a storage refresh last year I proposed going entirely SATA for our main storage array, because we had a large amount of inactive data: stuff we write to disk, read back that same day, then keep around for a month or so before deleting it. So in theory it sounded like a good option; we get lots of disks to give us the capacity, and those same lots of disks give us enough I/O to do real work too.

I don't think you can do this with most storage systems; the architectures don't support it nearly as well. On that point, the competition tried to call me out on my SATA proposal last year, citing reliability and performance concerns. They later backtracked on their statements after I pointed them to documentation their own storage architects wrote which said the exact opposite.

It’s been just over a year since we had our 3PAR T400 installed with 200x750GB SATA disks, which are Seagate ST3750640NS if you are curious.

Our disks are hit hard, very hard. On almost a daily basis we exceed 90 IOPS/disk at some point during the day, and with large I/O sizes this drives the disks' response time way up (I have another blog entry on that). Fortunately the controller cache is able to absorb that hit. But the point is our disks are not idle; they get slammed 24/7.
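
To put that per-disk number in context across the whole box, some back of the envelope math using the spindle count above:

#!/bin/bash
# Back-of-the-envelope aggregate load: 200 SATA spindles hitting the
# 90 IOPS/disk peak mentioned above.
SPINDLES=200
IOPS_PER_DISK=90
echo "Aggregate backend load at peak: $((SPINDLES * IOPS_PER_DISK)) IOPS"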


Average Service time across all spindles on my T400

How many disk failures have we had in the past year? One? two? three?

Zero.

For SATA drives, even enterprise SATA drives, this is to me a shocking number given the load these disks are put under on a daily basis. Why is it zero? I think a good part of it has to do with the advanced design of the 3PAR disk chassis, something they don't really talk about outside of their architecture documentation. It is quite a unique design in their enterprise S and T-class systems (not available in their E or F-class systems). The biggest advantages these chassis have, I believe, are two fold:

  • Vibration absorbing drive sleds – I’ve read in several places that vibration is the #1 cause of disk failure
  • Switched design – no loops; each drive chassis is directly connected to the controllers, and each disk has two independent switched connections to a midplane in the drive chassis. Last year we had two separate incidents on our previous storage array where, due to the loop design, a single disk failure took down an entire loop, causing the array to go partially off line (an outage) despite there being redundant loops on the system. I have heard stories more recently of other similar arrays doing the same thing.

There are other cool things, but my thought is those are the two main ones that drive the improvement in reliability. They have further features like fast RAID rebuild, which was a big factor in deciding to go with SATA on their system; but even if the RAID rebuilds in 5 seconds that doesn't make the physical disks more reliable, and this post is specifically about physical disk reliability rather than recovering from failure. As a note, though, I did measure the rebuild rate: for a fully loaded 750GB disk we can rebuild a degraded RAID array in about three hours, with no impact to array performance.
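
For what it's worth, three hours for a full 750GB works out to a pretty healthy effective rebuild rate. Rough math (ignoring that the reconstructed data is actually spread across many target disks):

#!/bin/bash
# Effective rebuild rate implied by "750GB in about three hours".
DISK_GB=750
HOURS=3
# decimal GB to MB, divided by the seconds in the rebuild window
echo "scale=1; ${DISK_GB} * 1000 / (${HOURS} * 3600)" | bc   # ~69.4 MB/sec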

My biggest complaint about 3PAR at this point is their stupid naming convention for their PDFs. STUPID! FIX IT! I’ve been complaining off and on for years. But in the grand scheme of things…

Not shocked? Well, I don't know what to say. Even my co-worker who managed our previous storage system is continually amazed that we haven't had a disk die.

Now I’ve jinxed it I’m sure and I’ll get an alert saying a disk has died.

November 18, 2009

Xiotech goes SSD

Filed under: Storage — Tags: , , — Nate @ 8:00 am

Just thought the timing was kind of funny. Xiotech came to my company a few weeks ago touting their ISE systems and the raw IOPS they can deliver (apparently they do something special with the SCSI protocol that gets them 25% more IOPS than you can normally get). I asked them about SSD and they knocked it, saying it wasn't reliable enough for them (compared to their self-healing ISEs).

Well, apparently that wasn't the case, because it seems they might be using STEC SSDs in the near future, according to The Register. What? No Seagate? As you may or may not be aware, Xiotech's unique features come from extremely tight integration with the disk drives, something they can only achieve by using a single brand, which is Seagate (who helped create the technology and later spun it out into Xiotech). Again, The Register has a great background on Xiotech and their technology.

My own take on their technology is that it certainly looks interesting; the Emprise 5000 looks like a great little box as a standalone unit, and it scales down extremely well. I'm not as convinced about how well it can scale up with the Emprise 7000 controllers, though. They tried to extrapolate SPC-1 numbers from a single ISE 5000 to the same number of drives as a 3PAR T800, which I believe still holds the SPC-1 record, at least for spinning disks. Myself, I'd like to see them actually test a high end 64-node ISE 7000 system for SPC-1 and show the results.

If you're an MS shop you might appreciate Xiotech's ability to integrate with MS Excel; as a Linux user myself I did not, of course. I prefer something like Perl. Funny that they said their first generation products integrated with Perl, but their current ones do not at the moment.

This sort of about-face with regard to SSD in such a short time frame reminds me of when NetApp put out a press release touting their de-duplication technology as being the best for customers, only to come out a week later and say they were trying to buy Data Domain because Data Domain has better de-duplication technology. I would have expected Xiotech to say something along the lines of "we're working on it" or similar. Perhaps the STEC news was an unintentional leak, or maybe their regional sales staff here was just not informed.

November 17, 2009

Affordable 10GbE has arrived

Filed under: Networking — Tags: , — Nate @ 6:00 pm

10 Gigabit Ethernet has been around for many years; for much of that time it has been, for the most part (and with most vendors still is), restricted to more expensive chassis switches. For most of these switches the available 10GbE port density is quite low as well, often maxing out at less than 10 ports per slot.

Within the past year Extreme Networks launched their X650 series of 1U switches, which currently consists of 3 models:

  • 24-port 10GbE SFP+
  • 24-port 10GbaseT first generation
  • 24-port 10GbaseT second generation (I added a link to the press release; I didn't even know they announced the product yesterday, it's been available for a little while at least)

For those that aren’t into networking too much, 10GbaseT is an ethernet standard that provides 10 Gigabit speeds over standard CAT5e/CAT6/CAT6a cable.

All three of them are line rate, full layer 3 capable, and even have high speed stacking(ranging from 40Gbps to 512Gbps depending on configuration). Really nobody else in the industry has this ability at this time at least among:

  • Brocade (Foundry Networks) – Layer 2 only (L3 coming at some point via software update), no stacking, no 10GbaseT
  • Force10 Networks – Layer 2 only, no stacking, no 10GbaseT
  • Juniper Networks – Layer 2 only, no stacking, no 10GbaseT. An interesting tidbit here is the Juniper 1U 10GbE switch is an OEM’d product, does not run their “JunOS” operating system, and will never have Layer 3 support. They will at some point I’m sure have a proper 10GbE switch but they don’t at the moment.
  • Arista Networks – Partial Layer 3(more coming in software update at some point), no stacking, they do have 10GbaseT and offer a 48-port version of the switch.
  • Brocade 8000 – Layer 2 only, no stacking, no 10GbaseT (This is a FCoE switch but you can run 10GbE on it as well)
  • Cisco Nexus 5000 – Layer 2 only, no stacking, no 10GbaseT (This is a FCoE switch but you can run 10GbE on it as well)
  • Fulcrum Micro Monte Carlo – I had not heard of these guys until 30 seconds ago, found them just now. I'm not sure if this is a real product; it says reference design. I think you can get it, but it seems targeted at OEMs rather than end users. Perhaps this is what Juniper OEMs for their stuff (the Fulcrum Monaco looks the same as the Juniper switch). Anyways, they do have 10GbaseT, no mention of Layer 3 that I can find beyond basic IP routing, and no stacking. Probably not something you want to use in your data center directly, given its reference design intentions.

The biggest complaints against 10GbaseT have been that it was late to market (the first switches appeared fairly recently) and that it is more power hungry. Fortunately for it, the adoption rate of 10GbE in general has been pretty lackluster over the past few years, with few deployments outside of really high end networks, because the cost was prohibitive.

As for the power usage, the earlier 10GbaseT switches did use more power because, well, it usually takes more power to drive a signal over copper than over fiber. But the second generation X650-24T from Extreme has lowered the power requirements by ~30% (a reduction of 200W per switch), making it draw less power than the SFP+ version of the product! All models have an expansion slot on the rear for stacking and additional 10GbE ports. For example, if you wanted all copper ports on the front but needed a few optical, you could get an expansion module for the back that provides 8x10GbE SFP+ ports on the rear. As standard it comes with a module that has 4x1GbE SFP ports and 40Gbps stacking ports.

So what does it really cost? I poked around some sites trying to find some of the “better” fully layer 3 1U switches out there from various vendors to show how cost effective 10GbE can be; at least on a per-gigabit basis it is cheaper than 1GbE is today. This is street pricing, not list pricing, and not “back room” discount pricing. YMMV.

Vendor                      Model              Ports (front)  Bandwidth (Full Duplex)  Priced From  Street Price  Cost per Gigabit  Support Costs?
Extreme Networks            X650-24t           24 x 10GbE     480 Gbps                 CDW          $19,755 *     $41.16            Yes
Force10 Networks            S50N               48 x 1GbE      96 Gbps                  Insight      $5,078        $52.90            Yes
Extreme Networks            X450e-48p          48 x 1GbE      96 Gbps                  Dell         $5,479        $57.07            Optional
Extreme Networks            X450a-48t          48 x 1GbE      96 Gbps                  Dell         $6,210        $64.69            Yes
Juniper Networks            EX4200             48 x 1GbE      96 Gbps                  CDW          $8,323        $86.69            Yes
Brocade (Foundry Networks)  NetIron CES 2048C  48 x 1GbE      96 Gbps                  Pending      Pending       Pending           Yes
Cisco Systems               3750E-48TD         48 x 1GbE      96 Gbps                  CDW          $13,500       $140.63           Yes

* The Extreme X650 switch by default does not include a power supply(it has two internal power supply bays for AC or DC PSUs). So the price includes the cost of a single AC power supply.
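
The cost per gigabit column is simply the street price divided by the full duplex front-panel bandwidth. Here is the arithmetic for a couple of the rows, so you can re-run it with your own pricing:

#!/bin/bash
# Cost per gigabit = street price / (front ports * port speed * 2 for full duplex).
# Prices are the street prices from the table above.
calc() { # args: name, price in dollars, port count, Gbps per port
        awk -v n="$1" -v p="$2" -v c="$3" -v s="$4" \
            'BEGIN { printf "%s: $%.2f per Gbps\n", n, p / (c * s * 2) }'
}
calc "Extreme X650-24t" 19755 24 10    # -> $41.16
calc "Force10 S50N"      5078 48  1    # -> $52.90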

HP VirtualConnect for Dummies

Filed under: Networking,Storage,Virtualization — Tags: , , , — Nate @ 5:27 pm

Don't know what VirtualConnect is? Check this e-book out. Available to the first 2,500 people that register. I just browsed over it myself and it seems pretty good.

I am looking forward to using the technology sometime next year(trying to wait for the 12-core Opterons before getting another blade system). Certainly looks really nice on paper, and the price is quite good as well compared to the competition. It was first introduced I believe in 2006 so it’s fairly mature technology.

Three thousand drives in the palm of your hand

Filed under: Storage — Tags: , , — Nate @ 2:59 pm

I was poking around again and came across a new product from Fusion IO which looked really cool: their new ioDrive Octal, which packs 800,000 IOPS onto a single card with 6 Gigabytes/second of sustained bandwidth. To put this in perspective, a typical high end 15,000 RPM SAS/Fibre Channel disk drive can do about 250 IOPS, so as far as I/O goes this one card is roughly the same as 3,200 drives. The densest high performance storage I know of is 3PAR, who can pack 320 15,000 RPM drives in a rack in their S-class and T-class systems (others can do high density SATA; I'm not personally aware of others that can do high density 15,000 RPM drives for online data processing).

But anyways, in 3PAR's case that is 10 racks of drives, plus three more racks for disk controllers (24 controllers): roughly 25,000 pounds of equipment (performance wise) in the palm of your hand with the ioDrive Octal. Most other storage arrays top out at between 200 and 225 disks per rack.
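
The drive and rack equivalence is simple division on the numbers above:

#!/bin/bash
# Rough spindle equivalence of one ioDrive Octal, using the figures above:
# 800,000 IOPS per card, ~250 IOPS per 15K RPM drive, 320 drives per 3PAR rack.
CARD_IOPS=800000
DRIVE_IOPS=250
DRIVES_PER_RACK=320

DRIVES=$((CARD_IOPS / DRIVE_IOPS))
echo "15K RPM spindle equivalent: ${DRIVES} drives"             # 3200
echo "Racks of 15K disks: $((DRIVES / DRIVES_PER_RACK)) racks"  # 10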

The Fusion IO solutions aren't for everyone, of course; they are targeted mostly at specialized applications with smaller data sets that require massive amounts of I/O, or at those able to distribute their applications amongst several systems using these PCIe cards.

November 6, 2009

Thin Provisioning strategy with VMware

Filed under: Storage,Virtualization — Tags: , , , — Nate @ 3:44 pm

Since the announcement of thin provisioning built into vSphere I have seen quite a few blog posts on how to take advantage of it but haven’t seen anything that matches my strategy which has served me well utilizing array-based thin provisioning technology. I think it’s pretty foolproof..

The main caveat is that I assume you have a decent amount of storage available on your system; that is, your VMFS volumes aren't the only thing residing on your storage. On my current storage array, written VMFS data accounts for maybe 2-3% of my storage. On the storage array I had at my last company it was probably 10-15%. I don't believe in dedicated storage arrays myself. I prefer nice shared storage systems that can sustain random and sequential I/O from any number of hosts and distribute that I/O across all of the resources for maximum efficiency. So my current array has most of its space set aside for an NFS cluster, and then there are a couple dozen terabytes set aside for SQL servers and VMware. The main key is being able to share the same spindles across dozens or even hundreds of LUNs.

There has been a lot of debate over the recent years about how best to size your VMFS volumes. The most recent data I have seen suggests somewhere between 250GB and 500GB. There seems to be unanimous opinion out there not to do something crazy and use 2TB volumes. The exact size depends on your setup. How many VMs, how many hosts, how often you use snapshots, how often you do vMotion, as well as the amount of I/O that goes on. The less of all of those the larger the volume can potentially be.

My method is so simple. I chose 1TB as my volume size, thin provisioned of course. I utilize the default lazy zeroed VMFS mode and do not explicitly turn on thin provisioning on any VMDK files; there's no real point if you already have it in the array. So I create 1TB volumes and begin creating VMs on them, and I try to stop when I get to around 500GB of allocated (but not written) space. That is, VMware thinks it is using 500GB, but it may only be using 30GB. This way I know the system will never use more than 500GB. Pretty simple. Of course I have enough space in reserve that if something crazy were to happen the volume could grow to 500GB and not cause any problems. Even with my current storage array operating in the neighborhood of 89% of total capacity, that still leaves me with several terabytes of space I can use in an emergency.

If I so desire I can go beyond the 500GB at any time without an issue. If I choose not to, then I haven't wasted any space, because nothing is written to those blocks. My thin provisioning system is licensed based on written data, so if I have a 10TB thin provisioning license on my system I can, if I want, create 100TB of thin provisioned volumes, provided I don't write more than 10TB to them. So you see there really is no loss in making a larger volume when the data is thin provisioned on the array. Why not make it 2TB or even bigger? Well, really, I can't see a time when I would EVER want a 2TB VMFS volume, which is why I picked 1TB.
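
Putting rough numbers on that strategy (a sketch reusing the example figures above, nothing more):

#!/bin/bash
# Headroom math for the strategy above: 1TB VMFS volumes, allocation capped
# at ~500GB per volume, array licensed on *written* data.
VOLUME_SIZE_GB=1024
ALLOC_CAP_GB=500     # self-imposed cap on allocated (not necessarily written) space
LICENSE_TB=10        # example written-data license size
OVERCOMMIT=10        # the 100TB-of-volumes-on-a-10TB-license example

VOLUMES=$(( LICENSE_TB * OVERCOMMIT * 1024 / VOLUME_SIZE_GB ))
echo "Worst case any single volume can reach: ${ALLOC_CAP_GB}GB of ${VOLUME_SIZE_GB}GB"
echo "1TB volumes you could carve from a ${LICENSE_TB}TB license at ${OVERCOMMIT}x: ${VOLUMES}"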

I took the time in my early days working with thin provisioning to learn the growth trends of various applications and how best to utilize them to get maximum gain out of thin provisioning. With VMs that means having a small dedicated disk for OS and swap, with any data residing on other VMDKs, or preferably on a NAS, or for databases on raw devices (for snapshot purposes). Given that core OSs don't grow much, there isn't much space needed for the OS (I default to 8GB), and I give the OS a 1GB swap partition. For additional VMDKs or raw devices I always use LVM. I use it to help me automatically detect which devices a particular volume is on, I use it for naming purposes, and I use it to forcefully contain growth. Some applications are not thin provisioning friendly, but I'd like to be able to expand the volume on demand without an outage; online LVM resize and file system resize allow this without touching the array. It really doesn't take much work.
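
The on-demand growth piece is just standard LVM plus an online filesystem resize. A minimal sketch of what I mean, assuming an LVM2 volume group named vg01, a logical volume named appdata and an ext3 filesystem that supports online growth (the names and the 10GB increment are made up for the example):

#!/bin/bash
# Grow a volume online: extend the logical volume, then grow the filesystem
# in place. vg01/appdata and the 10GB increment are example values only.
set -e
LV=/dev/vg01/appdata

lvextend -L +10G "$LV"   # take another 10GB from the volume group
resize2fs "$LV"          # grow the ext3 filesystem to fill the LV, online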

On my systems I don't really do vMotion (not licensed), I very rarely use VMFS snapshots (a few times a year), and the I/O on my VMFS volumes is tiny despite having 300+ VMs running on them. So in theory I probably could get away with filling those 1TB volumes, or even going to 2TB VMFS volumes, but why lock myself into that if I don't have to? So I don't.

I also use dedicated swap VMFS volumes so I can monitor the amount of swap I/O going on from an array perspective. Currently I have 21 VMware hosts connected to our array, totaling 168 CPU cores and 795GB of memory. I'm working to retire our main production VMware hosts, many of which are several years old (re-purposed from other applications). Now that I've proven how well it can work on existing hardware and the low cost version of VMware, the company is ready to gear up a bit more and commit more resources to a more formalized deployment utilizing the latest hardware and software technology. You won't catch me using the Enterprise Plus or even the Enterprise version of VMware though; the cost/benefit isn't there.

November 3, 2009

The new Cisco/EMC/Vmware alliance – the vBlock

Filed under: Storage,Virtualization — Tags: , , , , , , , — Nate @ 6:04 pm

Details were released a short time ago thanks to The Register on the vBlock systems coming from the new alliance of Cisco and EMC, who dragged along Vmware(kicking and screaming I’m sure). The basic gist of it is to be able to order a vBlock and have it be a completely integrated set of infrastructure ready to go, servers and networking from Cisco, storage from EMC, and Hypervisor from VMware.

vBlock0 consists of rack mount servers from Cisco and unknown EMC storage; price not determined yet.

vBlock1 consists of 16-32 blade servers from Cisco and an EMC CX4-480 storage system. Price ranges from $1M-$2.8M.

vBlock2 consists of 32-64 blade servers from Cisco and an EMC V-MAX. Starting price $6M.

Sort of like FCoE, sounds nice in concept but the details fall flat on their face.

First off is the lack of choice. Cisco's blades are based entirely on the Xeon 5500s, which are, you guessed it, limited to two sockets, and at least at the moment limited to four cores. I haven't seen word yet on compatibility with the upcoming 8-core CPUs, whether they are socket/chipset compatible with existing systems or not (if so, wonderful for them). Myself, I prefer more raw cores, and AMD is the one that has them today (Istanbul with 6 cores, Q1 2010 with 12 cores). But maybe not everyone wants that, so it's nice to have choice. In my view HP blades win out here for having the broadest selection of offerings from both Intel and AMD. Combine that with their dense memory capacity (16 or 18 DIMM slots on a half height blade), which allows up to 1TB of memory in a blade chassis in an affordable configuration using 4GB DIMMs. Yes, Cisco has their memory extender technology, but again, IMO, with the dual socket Xeon 5500 it is linked to, the CPU core to memory density is way out of whack. It may make more sense when we have 16, 24, or even 32 cores on a system using this technology. I'm sure there are niche applications that can take advantage of it in a dual socket/quad core configuration, but the current Xeon 5500 is really holding them back with this technology.
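
The 1TB per chassis figure is simple multiplication. A quick sketch, assuming 16 half height blades per chassis (HP c7000 style) with 16 DIMM slots each; double check the slot counts for whatever blade you actually pick:

#!/bin/bash
# Memory density math behind the "1TB per blade chassis" claim above.
# Assumes 16 half-height blades per chassis, 16 DIMM slots each, 4GB DIMMs.
BLADES=16
DIMMS_PER_BLADE=16
DIMM_GB=4
echo "Memory per chassis: $((BLADES * DIMMS_PER_BLADE * DIMM_GB)) GB"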

Networking, it’s all FCoE based, I’ve already written a blog entry on that, you can read about my thoughts on FCoE here.

Storage: you can see how, even with the V-MAX, EMC hasn't been able to come up with a storage system that can start at the smaller end of the scale, something that is not insanely unaffordable for 90%+ of the organizations out there. So on the more affordable end they offer you a CX4. If you are an organization that is growing you may find yourself outgrowing this array pretty quickly. You can add another vBlock, or you can rip it out and replace it with a V-MAX, which will scale much better; but of course the entry level pricing of such a system makes it unsuitable for almost everyone to start out with, even on the low end.

I am biased towards 3PAR of course, as both of the readers of this blog know, so do yourself a favor and check out their F and T series systems. If you really think you want to scale high, go for a 2-node T800; the price isn't that huge, and the only difference between a T400 and a T800 is the backplane. They use "blocks" to some extent too, the blocks being controllers (in pairs, up to four pairs) and disk chassis (40 disks per chassis, up to 8 per controller pair I think). Certainly you can't go on forever, or can you? If you don't imagine you will scale to really massive levels, go for a T400 or even an F400. In all cases you can start out with only two controllers; the additional cost to give you the option of an online upgrade to four controllers is really trivial and offers nice peace of mind. You can even go from a T400 to a T800 if you wanted, you just need to switch out the backplane (downtime involved); the parts are the same, the OS is the same! How much does it cost? Not as much as you would expect: when 3PAR announced their first generation 8-node system 7 years ago, entry level pricing started at $100k. You also get nice things like their thin built in technology, which will allow you to run those eager zeroed VMs for fault tolerance and not consume any disk space or I/O for the zeros. You can also get multi level synchronous/asynchronous replication for a fraction of the cost of others. I could go on all day but you get the idea. There are so many fiber ports on the 3PAR arrays that you don't need a big SAN infrastructure; just hook your blade enclosures directly to the array.

And as for networking, hook the 10GbE Virtual Connect switches on your c-Class enclosures to your existing infrastructure. I am hoping/expecting HP to support 10GbaseT soon and drop the CX4 passive copper cabling. The Extreme Networks Summit X650 stands alone as the best 1U 10GbE (10GbaseT or SFP+) switch on the market: whether it is line rate, full layer 3, high speed stacking, lower power consumption with 10GbaseT vs fiber optics, advanced layer 3 protocols to simplify management, price or ease of use, nobody else comes close. If you want bigger, check out the Black Diamond 8900 series.

Second, you can see with their designs that after the first block or two the whole idea of a vBlock sort of falls apart; that is, pretty quickly you're likely to just be adding more blades (especially if you have a V-MAX), rather than adding more storage and more blades.

Third, you get the sense that these aren't really blocks at all. The first tier is composed of rack mount systems, the second tier is blade systems with a CX4, and the third tier is blade systems with a V-MAX. Each tier has something unique, which hardly makes it a solution you can build up as a “block” the way you might expect from something called a vBlock. Given the prices here I am honestly shocked that the first tier is using rack mount systems; blade chassis do not cost much, and I would have expected them to simply use a blade chassis with just one or two blades in it. It really shows that they didn't spend much time thinking about this.

I suppose if you treated these as blocks in the strictest sense and said, yes, we won't add more than 64 blades per V-MAX, and expanded like that, you could get true blocks; but I can imagine the amount of waste in doing something like that would be astronomical.

I didn't touch on VMware at all; I think their solution is solid, and they offer quite a few choices. I'm certain that with this vBlock they will pimp the Enterprise Plus version of the software, but I really don't see a big advantage to that version with such a small number of physical systems (a good chunk of the reason to go to it is improved management with things like host profiles and distributed switches). As another blogger recently noted, VMware has everything to lose from this alliance; I'm sure they have been fighting hard to maintain their independence and openness, and this reeks of the opposite. They will have to stay on their toes for a while when dealing with their other partners like HP, IBM, NetApp and others..

