September 18, 2010
MapReduce: Learning MapReduce in 5 minutes with actual code
Even though the major buzz around MapReduce has tapered off, it still represents an important technique for all software engineers to understand. And I'm not just saying this because Google relies heavily on it, I'm saying it because it allows vast amounts of data to be processed in parallel.
If you learn technology best by working with code, you may find reading Google's MapReduce paper a little theoretical, whereas setting up a popular MapReduce implementation like Hadoop can be somewhat daunting and time-consuming.
Don't get me wrong, MapReduce is a dense topic because of its distributed nature, but learning the fundamentals with some actual code doesn't have to take hours or days. You can do it in minutes.
First off, let's get the basic theory out of the way: MapReduce is not a database, it's a framework for processing data in parallel. Data can even be placed in ordinary text files, although many MapReduce frameworks operate on the premise of having data stored in a distributed database.
Mapper functions are charged with turning the input data into key-value pairs, think of the entries in a Map or HashMap. The framework then groups these pairs by key and passes each key, together with all of its values, into a Reduce function, which applies logic to those values, in effect reducing the original data into a result.
The takeaway is that the Mapper and Reduce functions can be executed on data spread across multiple machines. This allows you to distribute the processing power required to crunch massive amounts of data. A calculation that could take hours or days on a single machine could take minutes using MapReduce. This is the primary reason Google uses it to process data from millions of web sites and build their search results.
So now onto the actual code. How would you obtain the number of word occurrences in a file containing 10,000,000 lines? In MapReduce this is a trivial task. The following listing shows the Mapper and Reduce functions, written in Python, needed to obtain such a result.
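Mapper and Reduce functions for counting words:

    def mapfunction(k, line):
        # k is the key of the input record, line is its text
        for word in line.split():
            # Emit one (word, 1) pair per word found
            yield word, 1

    def reducefunction(k, vs):
        # k is a word, vs holds every count emitted for that word
        result = 0
        for v in vs:
            result += v
        return result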
The Mapper function in this case splits each line of the file into words and emits a key-value pair for every word with a count of 1 (e.g. ("the", 1), ("framework", 1)). The framework groups these pairs by word and passes each word, along with its list of counts, to the Reduce function, which adds the counts up to get the total number of occurrences for that word. The final result generated by MapReduce takes the following form: {"the": 375437, "framework": 423634, "MapReduce": 153568}.
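To see the whole flow end to end without installing anything, here is a minimal single-process sketch. The `run_mapreduce` driver is a hypothetical helper written for this illustration, not part of Hadoop, Mincemeat or any other framework, and it assumes each input line is keyed by its line number. It only mimics the map, group-by-key and reduce steps on one machine.

```python
from collections import defaultdict


def mapfunction(k, line):
    # Emit a (word, 1) pair for every word on the line.
    for word in line.split():
        yield word, 1


def reducefunction(k, vs):
    # Add up every count collected for one word.
    result = 0
    for v in vs:
        result += v
    return result


def run_mapreduce(lines, mapfn, reducefn):
    # Hypothetical driver: a real framework would spread these
    # steps over many machines instead of running them in a loop.
    grouped = defaultdict(list)
    for line_number, line in enumerate(lines):
        # Map step: collect the emitted values under their key.
        for key, value in mapfn(line_number, line):
            grouped[key].append(value)
    # Reduce step: one call per distinct key.
    return {key: reducefn(key, values) for key, values in grouped.items()}


lines = ["the framework runs the job", "the job ends"]
counts = run_mapreduce(lines, mapfunction, reducefunction)
print(counts)
```

The grouping dictionary in the middle is the part a real implementation has to do across the network, which is where most of the hard work hides.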
Pretty simple, right? This is all there is to understanding MapReduce. The complexity in MapReduce stems not from the functions in and of themselves, but from how MapReduce handles their workload across several nodes: Where is the data located? What happens if a node suddenly fails? These are all issues resolved by the MapReduce implementation, which is why setting up a robust implementation like Hadoop can be daunting and time-consuming.
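To get a rough feel for what splitting the workload looks like, the sketch below divides the input lines into chunks, counts each chunk in a separate worker and merges the partial results. Everything here is an assumption made for illustration: the function names are hypothetical, and threads from Python's standard library stand in for nodes. It does nothing about data location or node failure, which is exactly what a real implementation adds, and threads in CPython will not make this particular job faster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    # Work done by one "node": count the words in its own chunk.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def split_into_chunks(lines, n):
    # Split the input into at most n chunks of roughly equal size.
    size = max(1, -(-len(lines) // n))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_word_count(lines, workers=2):
    chunks = split_into_chunks(lines, workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is counted independently, then merged here.
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return dict(total)


lines = ["the framework runs the job", "the job ends", "the end"]
totals = parallel_word_count(lines, workers=2)
print(totals)
```

Because each chunk is counted without looking at any other chunk, the same idea carries over to separate machines, as long as something collects and merges the partial counts.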
Nevertheless, there are several MapReduce implementations you can choose from to get a better feel for the framework. One of these is Mincemeat for Python which, even though it does not have the same feature set as an implementation like Hadoop, is still capable of executing basic MapReduce tasks. Not to mention it will allow you to work with some actual MapReduce code in a matter of minutes.
Posted by Daniel at September 18, 2010 5:58 PM





