Python + Hadoop = Flying Circus Elephant

Thursday, 29 May 2008
by klaas
filed under Code
Comments: 19


As a research intern here at Last.fm, dealing with huge datasets has become my daily bread. Having a herd of yellow elephants at my disposal makes this a lot easier, but the conventional way of writing Hadoop programs can be rather cumbersome. It generally involves lots of typing, compiling, building, and moving files around, which is especially annoying for the “write once, run never again” programs that are often required for research purposes. After fixing one too many stupid bugs caused by this copy/paste-encouraging work flow, I finally decided to look for an alternative.

The sluggishness of the conventional Hadoop work flow is mainly caused by the fact that Java is a very verbose and compiled programming language. Hence, finding a solution basically boils down to replacing Java. Since simplicity and programmer productivity were my main goals, it didn’t take me too long to decide that I wanted to use Python instead. The approach described here is the most convenient way of writing Hadoop programs in Python that I could find on the web, but it still wasn’t pythonic enough for my taste. The mapper and the reducer shouldn’t have to reside in separate files, and having to write boilerplate code should be avoided as much as possible. To get rid of these issues, I wrote a simple Python module called Dumbo. The word count example can be written as follows using this module:

def mapper(key,value):
   for word in value.split(): yield word,1
def reducer(key,values):
   yield key,sum(values)
if __name__ == "__main__":
   import dumbo
   dumbo.run(mapper,reducer)

Compare this to the 60 lines of code required to do exactly the same thing in Java! On a UNIX machine, this program can be run locally by executing the command

python wordcount.py map < wc_input.txt | sort | \
python wordcount.py red > wc_output.txt

and running it on a Hadoop cluster is as simple as typing

python -m dumbo wordcount.py \
-hadoop /path/to/hadoop \
-input wc_input.txt -output wc_output

To put some icing on the cake, Dumbo also offers a few handy extras. You’ll have to read the documentation or even the code to discover them all, but one worth mentioning is that you can implement combiner-like functionality by passing a third parameter to “dumbo.run”. For instance, the amount of network traffic can be reduced substantially in the example above by adding “reducer” as a third parameter to “dumbo.run”.

Obviously, the combination of Hadoop Streaming and Dumbo is not a one-size-fits-all solution. Java code is still faster and provides the programmer more flexibility and control. Even with several liters of the finest Belgian beer, there is no way I could convince Johan or Martin to port their heavy regularly-running Hadoop programs to Python. In many cases, however, it makes perfect sense to trade some speed and flexibility for programmer productivity. It definitely allows me to get more work done in less time.