Google, the company that brought the MapReduce algorithm to a much broader audience (despite the underlying concepts being decades old), has now outgrown it and is moving on, according to our friends at The Register.
The main reason: MapReduce was hindering their ability to provide near-real-time updates to their index. So they migrated their search infrastructure to a BigTable distributed database, and optimized the next-generation Google File System specifically for this database, making it inappropriate for more general uses.
MapReduce is a sequence of batch operations, and generally, Lipkovitz explains, you can’t start your next phase of operations until you finish the first. It suffers from “stragglers,” he says. If you want to build a system that’s based on a series of map-reduces, there’s a certain probability that something will go wrong, and this gets larger as you increase the number of operations. “You can’t do anything that takes a relatively short amount of time,” Lipkovitz says, “so we got rid of it.”
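To make that “this gets larger” point concrete, here’s a back-of-the-envelope sketch (my own numbers, not Google’s math): if each batch phase finishes cleanly with probability p, a chain of n phases only completes without incident with probability p^n, so the odds of hitting a straggler or failure somewhere in the pipeline climb quickly as you add phases.

```python
# Back-of-the-envelope sketch (my own illustration, not Google's math):
# if each batch phase independently succeeds without a straggler with
# probability p, a pipeline of n chained phases only finishes cleanly
# with probability p**n, so the chance of *something* going wrong grows
# with every phase you add.

def chance_of_trouble(p_phase_ok: float, num_phases: int) -> float:
    """Probability that at least one phase in the chain hits a problem."""
    return 1.0 - p_phase_ok ** num_phases

for n in (1, 10, 50, 100):
    print(f"{n:>3} phases -> {chance_of_trouble(0.99, n):.1%} chance of trouble")

# With a 99% per-phase success rate:
#   1 phases -> 1.0% chance of trouble
#  10 phases -> 9.6% chance of trouble
#  50 phases -> 39.5% chance of trouble
# 100 phases -> 63.4% chance of trouble
```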
I have to wonder how much this new distributed-database-backed index was responsible for Google being able to absorb the up-to-sevenfold increase in search traffic brought on by the launch of Google Instant.
I had an interview at a company a couple of months ago that was trying to use Hadoop + MapReduce for near-real-time operations (the product had not launched yet), and I thought that wasn’t a very good use of the technology. It’s a batch processing system. Google, of course, realized this and ditched it when it could no longer scale to the levels of performance they needed (despite having an estimated 1.8 million servers at their disposal).
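To illustrate the distinction, here is a toy sketch (purely my own, not a description of Google’s or that company’s actual pipeline): a batch rebuild has to sweep the whole corpus before any change becomes visible, while an incremental system folds each changed document into the live index as it arrives.

```python
# Toy illustration (my own, not Google's or Hadoop's actual design) of why
# a batch system struggles with freshness: the batch path reprocesses the
# whole corpus to produce a new index, while the incremental path updates
# only the entries for the one document that changed.

from collections import defaultdict

def batch_rebuild(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Rebuild the entire inverted index from scratch (MapReduce-style pass)."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():          # touches every document
        for word in text.split():
            index[word].add(doc_id)
    return index

def incremental_update(index: dict[str, set[str]], doc_id: str, text: str) -> None:
    """Fold a single changed document into the live index (BigTable-style row update)."""
    for word in text.split():                    # touches only the changed document
        index.setdefault(word, set()).add(doc_id)

corpus = {"doc1": "map reduce is batch", "doc2": "search wants fresh results"}
index = batch_rebuild(corpus)                               # latency grows with corpus size
incremental_update(index, "doc3", "instant search results") # latency is per document
print(sorted(index["search"]))                              # ['doc2', 'doc3']
```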
As more things move closer to real time, I can’t help but wonder about all those other companies out there that have hopped on the Hadoop/MapReduce bandwagon: when will they realize this and try, once again, to follow the bread crumbs that Google is dropping?
I just hope, for those organizations’ sake, that they don’t compete with Google in any way, because they would be at a severe disadvantage from a few angles:
- Google has near-infinite developer resources internally, and as one Yahoo! person said, “[Google] clearly demonstrated that it still has the organizational courage to challenge its own preconceptions.”
- Google has near-infinite hardware capacity and economies of scale. What another company may pay $3,000–5,000 for, Google probably pays less than $1,000. They are the largest single organization using computers in the world. They are known for getting custom CPUs, and everyone operating at cloud scale uses specialized motherboard designs. They build their own switches and routers (maybe). Though last I heard, they are still a massive user of Citrix NetScaler load balancers.
- Google, of course, operates its own high-end, power-efficient data centers, which means they get many more servers per kilowatt than you can in a typical data center. I wrote earlier in the year about a new container design that supports 45 kilowatts per rack, more than ten times the density of your average data center.
- Google is the world’s third-largest internet carrier and, thanks to peering agreements, pays almost nothing for bandwidth.
Google will be releasing more information about their new system soon, and I can already see the army of minions out there gearing up to duplicate the work and try to remain competitive. Ha! I wouldn’t want to be them, that’s all I can say 🙂
[Google’s] Lipkovitz stresses that he is “not claiming that the rest of the world is behind us.”
Got to admire the modesty!