Python + Hadoop = Flying Circus Elephant

Thursday, 29 May 2008
by klaas
filed under Code
Comments: 18


As a research intern here at Last.fm, dealing with huge datasets has become my daily bread. Having a herd of yellow elephants at my disposal makes this a lot easier, but the conventional way of writing Hadoop programs can be rather cumbersome. It generally involves lots of typing, compiling, building, and moving files around, which is especially annoying for the “write once, run never again” programs that are often required for research purposes. After fixing one too many stupid bugs caused by this copy/paste-encouraging workflow, I finally decided to look for an alternative.

The sluggishness of the conventional Hadoop workflow is mainly caused by the fact that Java is a verbose, compiled language, so finding a solution basically boils down to replacing Java. Since simplicity and programmer productivity were my main goals, it didn’t take me long to decide that I wanted to use Python instead. Hadoop Streaming, which runs any executable that reads records from stdin and writes them to stdout as a mapper or reducer, is the most convenient way of writing Hadoop programs in Python that I could find on the web, but it still wasn’t pythonic enough for my taste: the mapper and the reducer shouldn’t have to reside in separate files, and having to write boilerplate code should be avoided as much as possible. To get rid of these issues, I wrote a simple Python module called Dumbo. Using this module, the word count example can be written as follows:

def mapper(key, value):
    # emit a count of 1 for every word in the input line
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # sum all the counts emitted for a given word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

Compare this to the 60 lines of code required to do exactly the same thing in Java! On a UNIX machine, this program can be run locally by executing the command

python wordcount.py map < wc_input.txt | sort | \
python wordcount.py red > wc_output.txt

and running it on a Hadoop cluster is as simple as typing

python -m dumbo wordcount.py \
-hadoop /path/to/hadoop \
-input wc_input.txt -output wc_output

To put some icing on the cake, Dumbo also offers a few handy extras. You’ll have to read the documentation or even the code to discover them all, but one worth mentioning is that you can implement combiner-like functionality by passing a third parameter to “dumbo.run”. For instance, the amount of network traffic can be reduced substantially in the example above by adding “reducer” as a third parameter to “dumbo.run”.
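
Concretely, the word count example then becomes the following (a minimal sketch; since summing is associative and commutative, the reducer itself can safely double as the combiner):

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # the third parameter acts as a combiner: partial sums are computed
    # on the map side, so far fewer records have to cross the network
    dumbo.run(mapper, reducer, reducer)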

Obviously, the combination of Hadoop Streaming and Dumbo is not a one-size-fits-all solution. Java code is still faster and gives the programmer more flexibility and control. Even with several liters of the finest Belgian beer, there is no way I could convince Johan or Martin to port their heavy, regularly-running Hadoop programs to Python. In many cases, however, it makes perfect sense to trade some speed and flexibility for programmer productivity. It definitely allows me to get more work done in less time.

Comments

  1. Jelle
    29 May, 15:39

    Looks interesting enough for people who understand what you’re talking about. Anyway, if it’s helping Last.fm, it’s helping me!

  2. gpberlin
    29 May, 17:56

    Hey, thanks for sharing this! I would like to know more about your architecture and programming… having a lot of data to process is a dream and a nightmare for any programmer :)

  3. Lukas Vlcek
    29 May, 20:24

    Thanks for sharing some useful links!

  4. caxqueiroz
    29 May, 23:55

    Well, it looks like you are using a regular text editor with no templates, macros, or code completion.
    Using an IDE like Eclipse, those 60 lines of code could be cut down to 20 max.

  5. Todd Troxell
    30 May, 00:57

    This is a nice bit of glue!

  6. Klaas
    30 May, 10:02

    @caxqueiroz: We do use Eclipse actually, and that sure helps, but it’s still a lot less convenient than Dumbo in many cases…

  7. casey
    30 May, 16:36

    Could this be used with HBase? I’m still pretty new to Hadoop/HBase and was about to write a batch job in Java. Python would be much nicer to prototype with!

  8. Joshua Volz
    30 May, 17:46

    Could your Python implementation be ported to Jython? That way you get to use the JVM (which you presumably already have installed) and Python. I tried to find out to what degree this would slow your programs, but a quick Google search didn’t turn up the results I was looking for.

  9. Colin Evans
    31 May, 01:19

    At Metaweb, we’ve implemented a similar framework using Jython. The API looks about the same as above, but with tighter integration with DFS and the Hadoop APIs, and native Java libraries available. Performance is comparable to a CPython approach. The big downside is that Jython is stuck at Python 2.2.

  10. StefanG
    31 May, 01:44

    Check out Cascading.org and Apache Pig, both pretty interesting solutions to your problem.

  11. Klaas
    31 May, 08:02

    @Joshua: As Colin indicated already, Jython lags behind quite a bit and we do rely on some fairly recent CPython features for Dumbo. So no, right now it would not be possible to port it to Jython.

    @StefanG: Those approaches are indeed interesting, but they both abstract away the map-reduce paradigm, while we just wanted a very easy way to write map-reduce programs.

  12. StefanG
    31 May, 08:15

    @Klaas: Well, so why are you using Python when you could use straight assembly? Right! :) So why would you want to write map-reduce if your Pig Latin script compiles down to map-reduce as well? Cascading even allows Groovy scripting now:
    http://www.cascading.org/documentation/groovy.html

    Give them a try. I’m sure you will solve your problems even faster, and performance-wise both tools will be more efficient than the streaming API.

  13. Klaas
    31 May, 10:15

    @StefanG: Let’s not turn this into a Pig/Cascading versus Dumbo discussion, but to answer your question: I want to use map-reduce directly because I think it’s a very convenient programming model. I’m used to thinking in map-reduce and I’m perfectly happy with that. Hence, it wasn’t the map-reduce programming model that was slowing me down, but rather the tools I was using to write map-reduce programs. Dumbo fixes this without requiring me to spend time on getting familiar with a different model, which makes me a very happy programmer :)

  14. ibmetom
    31 May, 17:27

    I may be dyslexic; I read: Python + Monty = Flying Circus.

  15. StefanG
    31 May, 18:48

    @Klaas: sorry, it wasn’t meant to be a this-vs-that, just a hint, since I’m very excited about these tools. I think in general we all agree Hadoop makes those of us who have to deal with web-scale data very happy programmers. :)

  16. Harish Mallipeddi
    1 June, 16:45

    @Klaas

    How do you create more complex keys in Dumbo? If you’re using Java, you just define a new key class, add members to it, implement a custom compareTo() method on the key class if you wanted a custom sort order, etc. The last time I tried hadoop-streaming, I couldn’t figure out a way to do this.

  17. Elias Pampalk
    3 June, 09:07

    @Harish: Dumbo uses text representations of keys. However, Klaas has built in some support for Python’s basic data structures; for example, there is some support for sets, lists, dictionaries and tuples.
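
    So a mapper can emit a tuple as its key (a hypothetical sketch, assuming tab-separated artist/track input lines; Dumbo handles converting such keys to and from their text representation):

    def mapper(key, value):
        # composite (artist, track) key, counted just like single words
        artist, track = value.split("\t")[:2]
        yield (artist, track), 1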

    Btw, I’ve been using Hadoop a lot for almost a year now. Dumbo doesn’t replace implementing programs in Java, but for many of the small jobs that just need to run once, it’s a very useful tool, and I find myself using Dumbo all the time now.

  18. curious
    24 June, 17:45

    A more general streaming question: can streaming handle a directory of gz’d data? Seems like an easy thing to do but it isn’t obvious from a cursory search of the mailing list and documentation.

Comments are closed for this entry.