TechOpsGuys.com Diggin' technology every day

12Aug/14Off

Some internet routers ran out of memory today

TechOps Guy: Nate

(here is a link to in depth analysis on the issue)

Fortunately I didn't notice any direct impact to anything I personally use. But I first got notification from one of the data center providers we use that they were having network problems they traced it down to memory errors and they frantically started planning for emergency memory upgrades across their facilities. My company does not and has never relied upon this data center for network connectivity so it never impacted us.

A short time later I noticed a new monitoring service that I am using sent out an outage email saying their service providers were having problems early this morning and they had migrated customers away from the affected data center(s).

Then I contacted one of the readers of my blog whom I met a few months ago and told him the story of my data center that is having this issue which sounded similar to a story he told me at the time about his data center provider. He replied with a link to this Reddit article which talks about how the internet routing table exceeded 512,000 routes for the first time today, and that is a hard limit in some older equipment which causes them to either fail, or to perform really slowly as some routes have to be processed in software instead of hardware.

I also came across this article (which I commented on) which mentions similar problems but no reference to BGP or routing tables (outside my comments at the bottom).

[..]as part of a widespread issue impacting major network providers including Comcast, AT&T, Time Warner and Verizon.

One of my co-workers said he was just poking around and could find no references to what has been going on today other than the aforementioned Reddit article. I too am surprised if so many providers are having issues that this hasn't made more news.

(UPDATE - here is another article from zdnet)

I looked at the BGP routing capacity of some core switches I had literally a decade ago and they could scale up to 1 million unique routes of BGP4 routes in hardware, and 2 million non unique (not quite sure what the difference is anything beyond static routing has never been my thing). I recall seeing routers again many years ago that could hold probably 10 times that (I think the main distinction between a switch and a router is the CPU and memory capacity ? at least for the bigger boxes with dozens to hundreds of ports?)

So it's honestly puzzling to me how any service provider could be impacted by this today. How any equipment not capable of handling 512k routes is still in use in 2014 (I can understand for smaller orgs but not service providers). I suppose this also goes to show that there is wide spread lack of monitoring of these sorts of metrics. In the Reddit article there is mention of talks going on for months people knew this was coming -- well apparently not everyone obviously.

Someone wasn't watching the graphs.

I'm planning on writing a blog post on the aforementioned monitoring service I recently started using soon too, I've literally spent probably five thousand hours over the past 15 years doing custom monitoring stuff and this thing just makes me want to cry it's so amazingly powerful and easy to use. In fact just yesterday I had someone email me about a MRTG document I wrote 12 years ago and how it's still listed on the MRTG site even today (I asked the author to remove the link more than a year ago that was the last time someone asked me about it, that site has been offline for 10 years but is still available in the internet archive).

This post was just a quickie inspired by my co-worker who said he couldn't find any info on this topic, so hey maybe I'm among the first to write about it.

13Oct/13Off

Take a number: how to fix healthcare.gov

TechOps Guy: Nate

Sorry for slackin off recently, there just hasn't been a whole lot out there that has gotten me fired up.

Not too long ago I ranted a bit about outages. Basically saying if your site is down for a few hours, big whoop. It happens to everyone. The world is not going to end, your not going to go out of business.

Now if your website is down for a week or multiple weeks the situation is a bit different. I saw on a news broadcast that experts had warned the White House that the new $600M+ healthcare.gov web site was not ready. But the people leading the project, as it seems so typical probably figured the claims were overblown (are they ever? in my experience they have not been - though I've never been involved in a $600M project before, or anywhere close to it) and decided to press onwards regardless.

So they had some architecture issues, some load issues, capacity problems etc. I just thought to myself - this problem really sounds easy to solve from a technical standpoint. They tried to do this to some extent(and failed) apparently with various waiting screens. There are some recent reports that longer term fixes may take weeks to months.

I've been on the receiving end of some pretty poorly written/designed applications that it didn't really matter how much hardware you had it flat out wouldn't scale. I remember one situation in particular during an outage of some kind and the VP of Engineering interrupted us on the conference call and said Guys - is there anything I can buy that would make this problem go away?  The answer back to him was No. At this same company we had Oracle - obviously a big company in the database space come to our company and tell us they had no other customers in the world doing what we were doing, and they could not guarantee results. Storage companies were telling us the same thing. Our OLTP database at the time was roughly 8 times the next largest Oracle OLTP database in the world (which was Amazon). That was, by far the most over designed application I've ever supported. It was an interesting experience, I learned a lot. Most other applications that I have supported suffered pretty serious design issues, though none were quite as bad as this one company in particular.

My solution is simple - go old school, take a number and notify people when they can use the website.

Write a little basic app, point healthcare.gov to it, allow people to register with really basic info like name and email address (or phone# if they prefer to use SMS). This would be an entirely separate application not part of the regular web site. This is really light weight application, perhaps even store it in some noSQL solution(for speed) because worst case if you lose the data they'll just have to come back and register again.

So part of the registration the site would say we'll send you an email or SMS when your turn is up, with a code,  and you'll have a 24 hour window in which to use the site (past that and you have to register for a new number). If they can get the infrastructure done perhaps they could even have an automated phone system give them a call as well.

Then simply only allow a fraction of the # of people at a time on the website that the system can handle, if they built it for 50,000 people at a time I would probably start with 20,000 the first day or two and see how it goes(20,000 people per day not 20,000 simultaneous). Then ramp it up, if the application is scaling ok. As users register successfully the other application sees this and the next wave of notifications is sent. Recently I heard that officials were recommending people sign up through the call center(s), which I suppose is an OK stop gap but can't imagine the throughput is very high there either.

I figure it may take a team of developers a few days to come up with such an app.

Shift the load of people trying to hit an expensive application over and over again to a really basic high performance registration application, and put the expensive application behind a barrier requiring an authentication code.

IMO they should of done this from the beginning, perhaps even in advance generating times based on social security numbers or something.

All of this is really designed to manage the flood of initial registrations, once the tidal wave is handled then open the web site up w/o authentication anymore.

There should be a separate, static, high speed site(on many CDNs) that has all of the information people would need to know when signing up, again something that is not directly connected to the transactional system. People can review this info in advance and that would make sign ups faster.

Tagged as: Comments Off