MapReduce Bash Script

Monday, 6 April 2009
by erikf

One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of MapReduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. This turned into a weekend project dubbed bashreduce.

To be fair, Hadoop probably does a few more things than bashreduce. But we’ve managed to cover a few key concepts in our script:

  • Task coordination (kind of! sort of!)
  • Mapping/Partitioning
  • Reducing
  • Merging
  • Distributed file system (sort of! if you squint just right)
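To make the mapping/partitioning, reducing, and merging steps a bit more concrete, here is a rough single-machine sketch of the idea. It is not the bashreduce code: the function name word_count_partitioned is made up for illustration, the "hash" is just the key length modulo the number of buckets, and everything runs locally in temp files instead of being shipped to other hosts over netcat.

```shell
#!/usr/bin/env bash
# Local, single-machine sketch of partition -> reduce -> merge.
# word_count_partitioned is a hypothetical helper, not bashreduce's code.
word_count_partitioned() {
  local input=$1 parts=$2 workdir f
  workdir=$(mktemp -d)

  # Map/partition: send each key (first field) to a bucket. Key length modulo
  # the bucket count is a crude stand-in for a real hash; all that matters is
  # that identical keys always land in the same bucket.
  awk -v n="$parts" -v dir="$workdir" '
    NF { print $1 > (dir "/part." (length($1) % n)) }
  ' "$input"

  # Reduce: count occurrences per bucket, one background job per bucket.
  for f in "$workdir"/part.*; do
    LC_ALL=C sort "$f" | uniq -c | awk '{ print $2 "\t" $1 }' > "$f.out" &
  done
  wait

  # Merge: each bucket's output is already sorted by key, so sort -m
  # only has to interleave the files rather than re-sort them.
  LC_ALL=C sort -m "$workdir"/part.*.out
  rm -rf "$workdir"
}
```

Something like word_count_partitioned words.txt 4 (words.txt being a placeholder file with one word per line) prints one tab-separated word and count per line. Because every copy of a key lands in the same bucket, each reducer can count on its own and the final step is a plain merge.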

More than just a toy project, bashreduce lets us address a common scenario around these parts: we have a few analysis machines lying around, and we have data from various systems that are not in Hadoop. Rather than go through the rigmarole of sending it to our Hadoop cluster and writing yet another one-off Java or Dumbo program, we instead fire off a one-liner bashreduce using tools we already know in our reducer: sort, awk, grep, join, and so on.

I think it’s a neat idea! If you think it’s a neat idea, and you look at this gnarly bash code and think of ways to improve it, to make it more useful or more elegant, you would enjoy working for us. We’re looking for a clever C++ developer to help us tackle data mining and scale problems. My favorite line in the job posting is “Interested in”: we do all those things save one, which you can probably guess.

We’ve collected a few of our developers’ blogs here as well – more fodder for those of you interested in what we do.