Mapreduce Bash Script

Monday, 6 April 2009
by Erik Frey
filed under Code and Lunch Table
Comments: 51

One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of Mapreduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. This turned into a weekend project dubbed bashreduce.

To be fair, Hadoop probably does a few more things than bashreduce. But we’ve managed to cover a few key concepts in our script:

  • Task coordination (kind of! sort of!)
  • Mapping/Partitioning
  • Reducing
  • Merging
  • Distributed file system (sort of! if you squint just right)
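The stages in the list above can be tried out on a single machine. What follows is a minimal sketch, not the bashreduce script itself: the function names (partition, reduce_all, bashreduce_sketch) are made up for illustration, temporary files stand in for worker machines, and no netcat or ssh is involved. It assumes one key per line on standard input and uses word counting as the reducer.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, NOT the bashreduce source. Temp files stand in
# for workers; all function names here are hypothetical.

# Partition: hash each key into one of n buckets so that identical keys
# always land in the same bucket. Characters outside a-z0-9 hash as 0.
partition() {
  local n=$1 dir=$2
  awk -v n="$n" -v dir="$dir" '
    {
      h = 0
      for (i = 1; i <= length($1); i++)
        h = (h * 31 + index("abcdefghijklmnopqrstuvwxyz0123456789", substr($1, i, 1))) % n
      print > (dir "/part." h)
    }'
}

# Reduce: one background job per bucket plays the role of one worker.
# Each job counts its keys and writes "key count" lines, sorted by key.
reduce_all() {
  local dir=$1 f
  for f in "$dir"/part.*; do
    [ -e "$f" ] || continue
    LC_ALL=C sort "$f" | uniq -c | awk '{ print $2, $1 }' > "$f.out" &
  done
  wait
}

# Merge: the per-bucket outputs are already sorted, so sort -m only has
# to interleave them. Expects non-empty input.
bashreduce_sketch() {
  local n=${1:-3} dir
  dir=$(mktemp -d) || return 1
  partition "$n" "$dir"
  reduce_all "$dir"
  LC_ALL=C sort -m -k1,1 "$dir"/part.*.out
  rm -rf "$dir"
}
```

With those functions loaded, printf 'b\na\nb\nc\na\nb\n' | bashreduce_sketch 3 prints a 2, b 3 and c 1 on separate lines. Hashing on the key rather than picking a bucket at random is what lets each bucket be reduced independently.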

More than just a toy project, bashreduce lets us address a common scenario around these parts: we have a few analysis machines lying around, and we have data from various systems that are not in Hadoop. Rather than go through the rigmarole of sending it to our Hadoop cluster and writing yet another one-off Java or Dumbo program, we instead fire off a one-liner bashreduce using tools we already know in our reducer: sort, awk, grep, join, and so on.
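As an illustration of the kind of reducer meant here, the sketch below sums a value per key using only sort and awk. The function name sum_by_key and the two-column "key value" input format are assumptions made for the example, and it shows only the reducer itself, not how it is handed to bashreduce.

```shell
#!/usr/bin/env bash
# Hypothetical reducer: reads "key value" lines on stdin and prints one
# "key total" line per key. Not taken from the bashreduce source.
sum_by_key() {
  LC_ALL=C sort -k1,1 | awk '
    # When the key changes, emit the finished group and start a new one.
    $1 != key { if (NR > 1) print key, total; key = $1; total = 0 }
    { total += $2 }
    # Emit the last group, unless there was no input at all.
    END { if (NR > 0) print key, total }'
}
```

For example, printf 'x 1\ny 2\nx 3\n' | sum_by_key prints x 4 and then y 2. Sorting first means awk only ever has to remember one key at a time.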

I think it’s a neat idea! If you think it’s a neat idea, and you look at this gnarly bash code and think of ways to improve it, to make it more useful or more elegant, you would enjoy working for us. We’re looking for a clever C++ developer to help us tackle data mining and scale problems. My favorite line in the job posting is “Interested in”: we do all those things save one, which you can probably guess.

We’ve collected a few of our developers’ blogs here as well – more fodder for those of you interested in what we do.

Comments

  1. Nicolas Maia
    6 April, 14:40

    Whatever, you’re going to charge users. BOO.

  2. Sakura
    6 April, 15:47

    Congrats guys…congrats xD
    You deserve to fail xDDD

    Hope see u closing the site! :D
    Go to hell damn racists xD

    SaKuRa

    PD: NON-FREE MUSIC FOR EVERYONE, NOT FOR WHO DON’T KISS YOU ON YOUR ASSES

  3. forever B
    6 April, 16:58

    FREE MUSIC FOR EVERYONE
    DEAD TO LAST FM

  4. the big boss
    6 April, 17:21

    lol @ the comments

  5. le_dormeur
    6 April, 17:22

    seriously guys stop it! last.fm fo what it has to do !

  6. anon
    6 April, 17:49

    FWIW, bashreduce is a good name.

  7. Dan Mayer
    6 April, 17:56

    Wow confused about the comments… Off topic at the very least. Anyways that is really cool. I am always interested in new ways to distribute work. Nice job on hacking something like that together, although I feel like that much bash stuff would eventually turn into spaghetti code.

  8. Adam Fisk
    6 April, 18:27

    This is sweet. I love anything that pushes the boundaries of what’s possible, and I love bash scripting (who knows why), especially bash scripts that take a few minutes for me to understand like this one! Nice work guys.

  9. Paul
    6 April, 19:10

    Yeah whatever…after 2 years, i won’t be renewing my subscription anymore; this site no longer merits my support.

    Do you remember “Listen free and discover music at Last.fm”?

  10. João Pinheiro
    6 April, 19:20

    I wish the script could also reduce all the bashing that’s being done about the radio changes. :p

    It’s a pity the position has to be London based. :/

    And heeey, don’t be evil to bogosort just because it happens to be effectiveness-impaired. :p

  11. librelover
    6 April, 19:47

    Note to all the disillusioned fans of Last.fm -> check out libre.fm

    A totally free (GNU AGPL) software replacement!

    Free software, free society!

  12. epriest
    6 April, 20:29

    This is probably the most literate bash program I’ve ever seen.

  13. Piete
    6 April, 20:39

    lol @ the butthurt cheapskate kids who think they’re going to change anything.

  14. troels
    6 April, 21:54

    See also: http://wiki.apache.org/hadoop/HadoopStreaming

  15. Mr. Man
    6 April, 21:54

    Erik Frey you’re so clever! Really!

  16. Craig
    6 April, 22:10

    Is this available in Canada or do we have to pay?

  17. Ivan
    6 April, 22:30

    You pretend to be like hackers… but charge some users for others enjoy the same for free.

    What you really are is assholes. Oh yes, and YOU SUCK.

  18. Brett Bavar
    6 April, 22:40

    Your usage info and help are inconsistent. In the usage info, you list -m for both hosts and map, and then in the help you do not list the second -m for map. What’s up there?

  19. Mark Wotton
    7 April, 02:22

    http://wiki.github.com/mfisk/filemap

    is another hack at this sort of idea. I’ve found this sort of thinking really helpful – quite often you’ve got a long-running job that’s not worth writing a whole Hadoop app for, but where a 2x or 3x speedup would be very welcome.

    (also, not quite sure what’s going on with the comments. At what point does last.fm owe you anything, people? If you don’t like it, no-one’s got a gun to your head to use it.)

  20. faizal
    7 April, 02:58

    after you get all musics database, statistic from your user, now you want to charge your user. shame to you..

  21. Steve
    7 April, 10:19

    As an ex-last.fm employee it’s extremely sad to see you people commenting about completely unrelated crap here. Shame on you commenters.

    Great work Erik! I remember you talking about this a while back, and now I can see how it works! Mad hacks.

    Keep up the good stuff!

  22. Alex Angas
    7 April, 10:27

    Hi Erik, That sounds good but I have no idea what this means! What do these components do and how do they benefit me as a last.fm user?

    Thanks!

  23. MJ
    7 April, 14:03

    I’ve been a computer programmer professionally for 13 years. I code fluently in two languages and non-fluently in several more. I even solved my Rubik’s Cube three times on the bus on the way in to work today. And I have NO IDEA what you’re talking about. ;)

  24. Erik Frey
    7 April, 17:16

    @Brett Bavar, Thanks — fixed!

    @Mark Wotton, I gave filemap a try some time ago. It looks like excellent python code, but the program gave me lots of errors with rsync and was never able to run an actual job. Very promising, though.

    @Steve, Thanks :)

    @Alex Angas, this code is a tool for doing very simple data mining operations quickly and easily. It benefits you because we use it to improve the recommendations technology for the web site. It also benefits our fellow nerds because we share the code!

  25. Daniel Einspanjer
    7 April, 20:26

    This thing sounds pretty neat, and I definitely wanted to take a look since I’m frequently wanting to do some sed/awk/grep/sort/uniq type commands on massive log files without having to resort to writing a more fleshed out program to do so. When it comes to that, I’d love to be able to easily distribute the work out to some spare machines.

    Unfortunately, I wasn’t able to get it to work after a bit of playing. I’ve got two different Linux distros, RHEL5 servers, and Ubuntu slaves. They use different versions of netcat and they have different usernames (even though I have pubkey authentication to them all). I poked around for a while, but I couldn’t find a version of netcat similar to the ubuntu build (which you appear to be using based on the arguments you use) that would compile on RHEL. I thought about making a config layer that would allow me to specify username and nc args for each host, but I then looked at the clock and realized how much time I had wasted. :/

    I’ll try to take a look again later, but if you might have a good solution up your sleeve for that, I’d love to hear it!

  26. ball valve
    8 April, 06:54

    Task coordination (kind of! sort of!)
    Mapping/Partitioning
    Reducing
    Merging
    Distributed file system (sort of! if you squint just right)
    that’s it

  27. twice
    8 April, 08:41

    I hope LAST FM dies.

  28. Erik Frey
    8 April, 09:43

    @Daniel Einspanjer, that’s interesting! I would have expected the netcat settings I chose to be pretty universal. I’ve got redhat lying around so I’ll put that on the todo.

  29. Erik Frey
    8 April, 09:47

    @Daniel Einspanjer, also, as a hack (and let's face it, the whole script is a hack), you should be able to specify username@host to -m (or /etc/br.hosts) if you have different usernames on different machines.

  30. Oleksandr
    8 April, 10:03

    A great feature, still missing at LAST.FM, could be a list of new albums and singles produced by artists from the personal library, with a possibility to buy them, of course! Are you going to implement smth like that? It shouldn’t be too difficult, and can be very useful!

  31. cadencecity
    8 April, 15:53

    Why dont you work on getting it back to the way it was before you do anything else…

    free music. whatever happend to listen FREE on last fm?

  32. kaiyi li
    8 April, 17:42

    Last.fm is really a big player in open source community. Thanks for sharing that interesting piece of code.

  33. vutterfly
    10 April, 06:06

    Thanks for not deleting negative and even unrelated comments, last.fm people. I don’t know anything about coding, but I still think it’s retarded you want to start charging the lower class.

  34. delusionbeta
    12 April, 18:26

    Let me get this straight: it’s a script that spreads the workload of Linux machines over multiple cores/machines?

    That might be useful in the PC gaming industry, specifically servers for online PC games…

    But alas, this is a music site, and so there’s not much profit in something that can share the workload around.

    Or is there?

  35. Jaime
    12 April, 22:54

    are u goin to charge for the code?

  36. podlak
    13 April, 15:04

    @Jaime: not immediately, first they’ll wait, so community could help with this code, afterwards charges come.

  37. Dazall
    13 April, 18:59

    @podlak…that’s not entirely true. IF you live in the US, UK or Germany, you don’t have to pay anything! Also, if you’re not from those 3 countries, you can purchase certificates to get the code for your friends!

  38. Ramakrishna Reddy yekulla
    14 April, 05:39

    Wouldn’t it be easier to have a planet for all the developer blogs at last.fm? It would be easier to aggregate into other planets.

  39. not-american
    14 April, 11:49

    How do you have to pay if your not American?

  40. delusionbeta
    14 April, 17:47

    @Dazall: that may not be exactly true either. I can see the Last.fm demand dying a death outside of the UK, US and Germany come the change, and I can see them implementing charges in the attempt to recover lost advertising costs, thus killing off the site wholesale.

    Of course, this is off-topic speculation… And it might not be right either…

  41. notsomuchofanidiot
    16 April, 05:39

    Hey idiots, last.fm is a business. They have to play ball with the copyright holders who could sue them into non-existence.

  42. delusionbeta
    16 April, 11:16

    And the copyright holders could play ball with last.fm (i.e. “reduce the demands or we won’t play your music”).

  43. Vide
    17 April, 15:15

    Sorry for my dumbness Erik but can you give a little practical example of usage? I mean, I know I can pass a program as reduce function which will be piped after input to get out.
    I can put a “-r grep xyz”, type some string and see how it only prints the string containing xyz.
    But how am I supposed to do a real reduction, so splitting the initial input in little pieces that then a node can crunch?

    Thanks :)

  44. Twice
    19 April, 18:21

    Last FM hugs nuts.

    Commercial shitfaces.

  45. game
    26 April, 15:30

    hello

  46. shinobi
    28 April, 01:36

    C++ for data mining????

    You are wasting your time. Better use Perl 5.10

  47. Marcus Herou
    29 April, 18:30

    Cool man.

    Bashreduce – I give you a ten for the name haha.

  48. Marcin Mańk
    9 May, 18:48

    Can’t the two lines with ssh to the worker be merged into one big pipe-sequence, without using a temporary file? Is there any advantage to using a temporary file?

    Also, I think the srand() should be moved to a BEGIN{} block. As it stands, all lines will go to one worker.

    Thanks for this, I finally got around to figuring out the map-reduce thing :)

  49. Marcin Mańk
    9 May, 19:04

    Also, (OK, maybe I don’t understand map-reduce YET) shouldn’t the final sort be run through reduce too?

  50. wtf
    11 May, 14:55

    Will this be available in my country, too?

  51. Jack Broun
    15 May, 10:29

    Erik Frey
    How do you have to pay if your not American?

