Diggin' technology every day


Take a number: how to fix

TechOps Guy: Nate

Sorry for slackin off recently, there just hasn't been a whole lot out there that has gotten me fired up.

Not too long ago I ranted a bit about outages, basically saying that if your site is down for a few hours, big whoop. It happens to everyone. The world is not going to end, and you're not going to go out of business.

Now if your website is down for a week or multiple weeks the situation is a bit different. I saw on a news broadcast that experts had warned the White House that the new $600M+ web site was not ready. But the people leading the project, as seems so typical, probably figured the claims were overblown (are they ever? In my experience they have not been, though I've never been involved in a $600M project before, or anywhere close to it) and decided to press onwards regardless.

So they had some architecture issues, some load issues, capacity problems etc. I just thought to myself - this problem really sounds easy to solve from a technical standpoint. They apparently tried to do this to some extent (and failed) with various waiting screens. There are some recent reports that longer term fixes may take weeks to months.

I've been on the receiving end of some pretty poorly written/designed applications where it didn't really matter how much hardware you had, they flat out wouldn't scale. I remember one situation in particular, during an outage of some kind, when the VP of Engineering interrupted us on the conference call and said "Guys - is there anything I can buy that would make this problem go away?" The answer back to him was "No". At this same company we had Oracle, obviously a big company in the database space, come to our company and tell us they had no other customers in the world doing what we were doing, and they could not guarantee results. Storage companies were telling us the same thing. Our OLTP database at the time was roughly 8 times the size of the next largest Oracle OLTP database in the world (which was Amazon). That was, by far, the most over designed application I've ever supported. It was an interesting experience, and I learned a lot. Most other applications that I have supported suffered pretty serious design issues, though none were quite as bad as this one company in particular.

My solution is simple - go old school, take a number and notify people when they can use the website.

Write a little basic app, point to it, and allow people to register with really basic info like name and email address (or phone# if they prefer to use SMS). This would be an entirely separate application, not part of the regular web site. This is a really lightweight application, perhaps even storing its data in some NoSQL solution (for speed), because worst case if you lose the data they'll just have to come back and register again.

As part of the registration the site would say we'll send you an email or SMS when your turn is up, with a code, and you'll have a 24 hour window in which to use the site (past that and you have to register for a new number). If they can get the infrastructure done, perhaps they could even have an automated phone system give people a call as well.
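To make the registration idea a bit more concrete, here is a rough sketch in Python. Everything in it is my own assumption for illustration: the names (TakeANumber, Ticket, register, notify, is_valid), the random code format, and the in-memory list standing in for whatever NoSQL store you would pick. It doesn't actually send any email or SMS, it just hands back the code that would be sent.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACCESS_WINDOW = timedelta(hours=24)  # the 24 hour window described above


@dataclass
class Ticket:
    number: int                           # position in line
    contact: str                          # email address, or phone number for SMS
    code: Optional[str] = None            # set once the person's turn comes up
    expires_at: Optional[datetime] = None


class TakeANumber:
    """Hypothetical registration app, kept separate from the real site."""

    def __init__(self):
        self._tickets = []  # in-memory stand-in for a NoSQL store

    def register(self, contact):
        # Hand out the next number in line; no code is issued yet.
        ticket = Ticket(number=len(self._tickets) + 1, contact=contact)
        self._tickets.append(ticket)
        return ticket

    def notify(self, ticket, now):
        # Issue a code and start the 24 hour clock. A real system would
        # send the code by email or SMS here instead of returning it.
        ticket.code = secrets.token_urlsafe(8)
        ticket.expires_at = now + ACCESS_WINDOW
        return ticket.code

    def is_valid(self, ticket, code, now):
        # The code must have been issued, must match, and must not be expired.
        if ticket.code is None or ticket.expires_at is None:
            return False
        matches = secrets.compare_digest(ticket.code.encode(), code.encode())
        return matches and now < ticket.expires_at
```

Losing this data is cheap by design: if the store goes away, people just register again and get a new number.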

Then simply allow only a fraction of the number of people the system can handle onto the website at a time. If they built it for 50,000 people at a time, I would probably start with 20,000 the first day or two and see how it goes (20,000 people per day, not 20,000 simultaneous). Then ramp it up if the application is scaling OK. As users register successfully, the other application sees this and the next wave of notifications is sent. Recently I heard that officials were recommending people sign up through the call center(s), which I suppose is an OK stop gap, but I can't imagine the throughput is very high there either.
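The ramp-up could look something like the sketch below. The function names, the 1.5x growth factor, the "halve it on a bad day" rule and the ceiling are all numbers and names I made up for illustration; the only figure taken from above is the 20,000 per day starting point.

```python
from collections import deque


def next_daily_quota(current, healthy, ceiling, growth=1.5):
    """Pick tomorrow's quota: grow it after a good day, halve it after a bad one."""
    if healthy:
        return min(ceiling, int(current * growth))
    return max(1, current // 2)


def next_wave(waiting, quota, already_notified):
    """Pop people off the front of the line until today's quota is used up."""
    remaining = max(0, quota - already_notified)
    wave = []
    while waiting and len(wave) < remaining:
        wave.append(waiting.popleft())
    return wave


# Usage: start at 20,000 per day and only raise it while the site holds up.
quota = 20000
quota = next_daily_quota(quota, healthy=True, ceiling=50000)
line = deque(["ticket-1", "ticket-2", "ticket-3"])
todays_wave = next_wave(line, quota=2, already_notified=0)
```

The point of keeping the quota as a single knob is that operations can turn it down the moment the expensive application starts struggling, without touching the application itself.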

I figure it may take a team of developers a few days to come up with such an app.

Shift the load of people trying to hit an expensive application over and over again to a really basic high performance registration application, and put the expensive application behind a barrier requiring an authentication code.
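As a sketch of what that barrier might look like, here is a tiny WSGI middleware in Python. The class name CodeGate, the "code" query parameter and the idea of keeping the currently valid codes in a set are all my assumptions; a real deployment would more likely do this check at the load balancer or in front of the application tier.

```python
from urllib.parse import parse_qs


class CodeGate:
    """Hypothetical WSGI middleware: only requests with a known code get through."""

    def __init__(self, app, valid_codes):
        self.app = app                  # the expensive application
        self.valid_codes = valid_codes  # codes currently inside their window

    def __call__(self, environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        code = query.get("code", [""])[0]
        if code not in self.valid_codes:
            # Cheap rejection: the expensive application is never touched.
            body = b"Please take a number first.\n"
            headers = [("Content-Type", "text/plain"),
                       ("Content-Length", str(len(body)))]
            start_response("403 Forbidden", headers)
            return [body]
        return self.app(environ, start_response)


def expensive_app(environ, start_response):
    # Stand-in for the real transactional site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"welcome\n"]


gated_app = CodeGate(expensive_app, valid_codes={"abc123"})
```

Anyone hammering reload without a code only ever hits the cheap 403 path.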

IMO they should have done this from the beginning, perhaps even generating times in advance based on social security numbers or something.

All of this is really designed to manage the flood of initial registrations. Once the tidal wave is handled, open the web site up without requiring the authentication code anymore.

There should be a separate, static, high speed site (on many CDNs) that has all of the information people would need to know when signing up, again something that is not directly connected to the transactional system. People can review this info in advance, and that would make sign ups faster.


Verizon looks to Seamicro for next gen cloud

TechOps Guy: Nate

Last week Verizon made big news in the cloud industry that they were shifting gears significantly and were not going to have their clouds built on top of traditional enterprise equipment from the likes of HP, Cisco, EMC etc.

I can't find an article on it, but I recall hearing on CNBC that AT&T announced something similar that was going to result in them saving $2 billion over some period of time that I can't remember.

Today our friends at The Register reveal that this design win actually comes from AMD's Seamicro unit. AMD says they have been working closely with Verizon for two years on designs for a highly flexible and efficient platform to scale with.

Seamicro has a web page dedicated to this announcement.

Some of the capabilities include:

  • Fine-grained server configuration options that match real life requirements, not just small, medium, large sizing, including processor speed (500 MHz to 2,000 MHz) and DRAM (.5 GB increments) options
  • Shared disks across multiple server instances versus requiring each virtual machine to have its own dedicated drive
  • Defined Storage quality of service by specifying performance up to 5,000 IOPS to meet the demands of the application being deployed, compared to best-effort performance
  • Strict traffic isolation, data encryption, and data inspection with full featured firewalls that achieve Department of Defense and PCI compliance levels
  • Reserved network performance for every virtual machine up to 500 Mbps

I don't see much more info than that. Questions that remain with me are what level of SMP they will support, and what processor(s) they are using (specifically, are they using AMD procs or Intel procs, since Seamicro can use both? Intel has obviously been dominating the cloud landscape, so it would be nice to see a new large scale deployment of AMD).

I have written about SeaMicro a couple times in the past, most recently comparing HP's Moonshot to the AMD platform. In those posts I mentioned how I felt that Moonshot fell far short of what Seamicro seems to be capable of offering. Given Verizon's long history as a customer of HP, I can't help but assume that HP tried hard to get them to consider Moonshot but fell short on the technology (or timing, or both).

Seamicro, to my knowledge (I don't follow micro servers too closely), is the only micro server platform that offers fully virtualized storage, both inside the chassis as well as more than 300TB of external storage. One of the unique abilities that sounds nice for larger scale deployments is the ability to export essentially read only snapshots of base operating systems to many micro servers for easier management (and you could argue more secure, given they are read only), without needing fancy SAN storage. It's also fairly mature (relative to the competition) given it's been on the market for several years now.

Verizon/Terremark obviously had some trouble competing with the more commodity players with their enterprise platform, both on cost and on capabilities. I was a vCloudExpress user for about a year, and worked through an RFP with them at one of my former companies for a disaster recovery project. Their cost model, like most cloud providers, was pretty insane. The assumption we had at the time was that we were a small company without much purchasing leverage, so we expected their cost to be pretty decent given the volumes a cloud provider can command. Reality set in quick, though, when their cost was at least 5-6 fold what our cost was for the same capabilities from similar enterprise vendors.

Other providers had similar pricing models, and I continue to hear stories to this day about various providers costing too much relative to doing things in house (there really is no exception), with ROIs really never exceeding 12 months. I think I've said many times but I'll say it again - I'll be the first one to be willing to pay a premium for something that gives premium abilities. None of them come close to meeting that though. Not even in the same solar system at this point.

This new platform will certainly make Verizon's cloud offering more competitive. They are having to build an entirely new control platform for it though - not much off the shelf software here, simply because none of it is built to that level of scale. Such problems are difficult to address, and until you encounter them you probably won't anticipate what is required to solve them.

I am mainly curious whether or not these custom things that AMD built for Verizon will be available to other cloud players. I assume they will.