Mapreduce Bash Script

Monday, 6 April 2009
by erikf
filed under Code and Lunch Table
Comments: 51

One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of Mapreduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. This turned into a weekend project dubbed bashreduce.

To be fair, Hadoop probably does a few more things than bashreduce. But we’ve managed to cover a few key concepts in our script:

  • Task coordination (kind of! sort of!)
  • Mapping/Partitioning
  • Reducing
  • Merging
  • Distributed file system (sort of! if you squint just right)
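
To make those concepts a little more concrete, here is a rough single-process sketch in Python. It is not the bashreduce script itself: the function names, the CRC-based partitioning and the "count lines per key" reducer are all assumptions chosen for illustration, with in-memory lists standing in for remote hosts.

```python
import heapq
import zlib
from itertools import groupby


def partition(lines, n_workers):
    """Route each non-blank line to a bucket chosen by a stable hash of its first field."""
    buckets = [[] for _ in range(n_workers)]
    for line in lines:
        if not line.strip():
            continue
        key = line.split()[0]
        buckets[zlib.crc32(key.encode()) % n_workers].append(line)
    return buckets


def reduce_bucket(bucket):
    """Sort one bucket by key, then count lines per key (roughly `sort | uniq -c`)."""
    keys = sorted(line.split()[0] for line in bucket)
    return [(key, sum(1 for _ in group)) for key, group in groupby(keys)]


def bash_reduce_like(lines, n_workers=3):
    """Partition, reduce every bucket independently, then merge the sorted outputs."""
    partial = [reduce_bucket(bucket) for bucket in partition(lines, n_workers)]
    return list(heapq.merge(*partial))


if __name__ == "__main__":
    log = ["alice play", "bob play", "alice skip", "carol play", "alice play"]
    for key, count in bash_reduce_like(log):
        print(key, count)
```

Because every key lands in exactly one bucket, each bucket can be reduced without talking to the others, and the final merge only has to interleave outputs that are already sorted.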

More than just a toy project, bashreduce lets us address a common scenario around these parts: we have a few analysis machines lying around, and we have data from various systems that are not in Hadoop. Rather than go through the rigmarole of sending it to our Hadoop cluster and writing yet another one-off Java or Dumbo program, we instead fire off a one-liner bashreduce using tools we already know in our reducer: sort, awk, grep, join, and so on.

I think it’s a neat idea! If you think it’s a neat idea, and you look at this gnarly bash code and think of ways to improve it, to make it more useful or more elegant, you would enjoy working for us. We’re looking for a clever C++ developer to help us tackle data mining and scale problems. My favorite line in the job posting is the “Interested in” list: we do all those things save one, which you can probably guess.

We’ve collected a few of our developers’ blogs here as well – more fodder for those of you interested in what we do.

Introducing Boffin: Last.fm's music knowledge meets your local mp3 collection

Wednesday, 11 March 2009
by mxcl
filed under Announcements and Code
Comments: 81

Sometimes the endless rows of music in my media player leave me at a loss. I have a music collection that I’ve spent years lovingly crafting; all my favourite bands. Yet as I spin my mouse wheel its full length, nothing springs out. I scroll up, scouring the rows for something fresh. I scroll down, searching for some long forgotten treasure. After a few minutes I select “shuffle” and go make a cup of tea.

But maybe it doesn’t have to be like that. Wouldn’t it be great if you could tune into your local music like you do with radio?

That last sentence sounded very much like a product announcement, did it not? Well, it was!

Pick a tag, maybe another related tag, and click play. Boffin. Strictly a tech demo. Let us know how you like it, and we’ll roll it into the next major release of our desktop software :)

Forum announcement and download links

Flickr group
Wordle gallery

Hadoop User Group UK - 14th of April

Monday, 2 March 2009
by johan
filed under Code
Comments: 0

In August last year we organized the first Hadoop User Group in the UK. We liked it so much we’re doing another one on the 14th of April.

Quite a few of you probably haven’t heard about Hadoop. In short, it’s an awesome piece of software that is used to process large datasets on multiple machines. If that’s your kind of thing, read more about it here.

So far the event schedule looks like this:
10.00 – 10.15: Arriving and chatting

10.15 – 11.15: Practical MapReduce (Tom White, Cloudera)

11.15 – 12.15: Introducing Apache Mahout (Isabel Drost, ASF)

12.15 – 13.15: Lunch (three kinds of pizza, sponsored by Sun)

13.15 – 14.15: Terrier (Iadh Ounis and Craig Macdonald, University of Glasgow)

14.15 – 15.15: Having Fun with PageRank and MapReduce (Paolo Castagna, HP)

15.15 – 16.15: Apache HBase (Michael Stack, Powerset)

16.15 – 17.00: General chat, perhaps lightning talks (powered by Sun beer)

17.00 – 00.00: Discussion continues at a nearby pub

The meetup is held at Sun’s office near Monument station in London. It’s free, but we ask that you register if you want to come. For more up-to-date news, keep an eye on the blog.

A big thanks to Sun for sponsoring the event with a venue, food and beer!

Hack Day 2008

Monday, 22 December 2008
by james
filed under Code and Stuff Other People Made
Comments: 18

A week or so ago, on Sunday 14th December, we held our first open Hack Day, giving developers a chance to show off what they could build in a day with nothing but their wits and the API.

At around 10:30, the hungry and cold developers started pouring into Corbet Place, behind Brick Lane in the heart of East London. With free food and drink behind the bar, plenty of comfy sofas to drape themselves over, and a Surface table with which to amuse themselves, the hackers dug in. (Sorry you guys had to wait in the cold for longer than we’d hoped, that sucked.)

By 6:30, and in spite of wifi woes throughout the day, we had 30 quick fire demos lined up to wow the assembled crowd of geeks. As anyone who’s run an event like this will attest, getting 30 odd laptops hooked up to 2 projector adaptors on rotation with a 2 minute turnaround is no mean feat, but no one was trampled underfoot and only one person outright gave up (apologies Steve!).

Amongst our favourite hacks were Bret Ehlert’s, an app that creates playlists based on your historical charts so you can relive your headier musical days; Your Next Favourite Band by Utku Can and Phil Nash, which finds the band everyone’s listening to but you; and Neil Crosby’s Last Genius, a bookmarklet that builds a playlist using any track on as the starting point.

We were extremely impressed with everyone’s work but after careful deliberation, we had to select three winners for the awesome prizes provided by Codeplex:

Rob Mckinnon walked away with a shiny XBox 360 for his work building Gig notifications with Growl. In his own words:
“Need help remembering to get gig tickets? This hack gives you local event notifications via for the band you’re now playing. Implemented as Ruby script that uses api and growl notifications. Could be wrapped up as a Mac dashboard widget with a bit of work.”

Cameron Ross also snagged an XBox for the awesome Universal Scrobbler:
“This suite of tools allow you to scrobble songs from previously unscrobblable sources. There is a FireFox extension to allow scrobbling of songs listened on MySpace, a tool to browse MusicBrainz for albums and tracks to scrobble (for example for if you listen to an album in the car or CD player), a tool to scrobble songs retrospectively that you listened to on BBC Radio, and a tool to scrobble a custom song.”

David Padbury and Jamie Hollingworth stormed to win the grand prize of £1000 with StaffWars:
“StaffWars works by playing a user’s personal station communally to the office. When someone becomes offended by their colleague’s poor taste in music they initiate a challenge to take control of the office stereo from the current user. At this point StaffWars analyses the profiles of the competing users and looks for similar tastes in music. It will generate a small music quiz based on these similar tastes in music and ask it to both users. If the challenging user wins they will take control of the stereo, otherwise the existing station will carry on playing.”

The day was rounded off with an excellent set from Hexstatic while the last free drinks were squeezed out of the bar.

Thanks again to those who came and made this event a success. I can’t wait for the next one!

We’ll try to keep this list updated with all the other hacks as info comes in, let us know if we’re missing you.

More photos from the day, courtesy Russ and Dimi.

Hack Day

Wednesday, 26 November 2008
by anil
filed under Announcements and Code
Comments: 13

Back in June when we announced our new API, we were bowled over with the positive response. Hundreds of developers have been engaged in discussions over in our Web Services Group, building API bindings for Python, PHP, Actionscript, Java and other languages along the way.

Some of my favourite apps built on the platform include Andrew Godwin’s Lastgraph, Chris Mear and James Darling’s Vinyl Scrobbler and Jorge Diaz’s One Hit Wonders. That’s just a small sample of the apps available at our gallery.

We thought it was about time to bring some of the dev community together in East London for a hack day. So here it is: The first Hack Day, on Sunday 14th December. There are only 150 places available and it’s first come, first served. We kick off at 10AM and we have a great live act playing last thing in the evening to wash down the code. Last.fm developers will be available on the day to answer API questions and build custom data APIs on the fly should they be needed. We will be at your service throughout.

Oh, I almost forgot to mention – top prize on the day is £1000 (yes, that’s plummeting sterling, not stagnant dollar), and the runners-up prize is pretty juicy too. I hope to see you all there. If you spot a yellow square with a line running down it, say hi.

Hadoop User Group UK

Thursday, 28 August 2008
filed under Code
Comments: 5

At Last.fm we’re fond of elephants. A few months ago Martin and I went to a gathering of elephant enthusiasts and liked what we saw. In fact, we liked it so much that we decided to host a similar event in London. Some 50-60 herders turned up to enjoy the presentations as well as the free beer and food kindly supplied by Yahoo! and Skills Matter.

If you are wondering what on earth I’m talking about, the event was focused on Hadoop – “a software platform that lets one easily write and run applications that process vast amounts of data.” We use Hadoop extensively at Last.fm and, judging by the number of people who came to the event, we’re not alone.

To make sure you don’t miss the next event you can subscribe to the Hadoop User Group UK mailing list (send an e-mail to the list address, then reply to the confirmation e-mail) or to the RSS feed of the HUGUK blog. You can also simulate the experience by getting a beer from your local shop and watching the videos and presentations below. Sadly you will have to talk to yourself after you’ve watched them as we won’t be there.

A big thank you to the presenters!

My only regret is that we didn’t have time to finish all the beer. Better luck next time.


Developers, developers, developers

Friday, 27 June 2008
filed under Code
Comments: 30

I’m proud to announce our new public API, which allows any application or device deeper integration with the platform than ever before. Our vision is the most comprehensive social music API on the web, and today marks a big step forward in that direction.

Spiral by Sha Hwang, built with the API.

The new API introduces a user authentication protocol which for the first time allows applications to create user sessions, bringing both read and write services to web apps, desktop apps and mobile devices.

Take our new tagging APIs. Developers can now both pull and apply tags to music content from any application on any platform. The same goes for sharing – developers can build sharing support into any app.

There are also new search, playlist, event and geo APIs being rolled out today, with lots more stuff planned in the coming weeks and months.

If you’ve been working with our existing services, bear in mind scrobbling is also integrated with the new API, so there’s just one session key required to use any service now.

If you want to work with our services or have done so in the past, don’t forget to join our new web services group to provide feedback & suggestions as well as discuss your application ideas. From tour planners to batch tag editors, we can’t wait to see what you come up with – you’ve consistently surprised us with imaginative ideas so far and we have no doubt in your ability to get on your feet and make it happen.

Happy hacking:

Searching with my co-monkey

Thursday, 5 June 2008
by jonty
filed under Code and Announcements
Comments: 10

Some of you may have been following the news around a new Yahoo! service named SearchMonkey, a platform that opens Yahoo! search to external developers.

In layman’s terms, it allows an application developer to inject extra information into specific results, delivering a richer search experience.

This probably makes more sense with a few examples:

  • What if Facebook profile results had the photo of the person in question?
  • What if IMDB results had the movie rating right there in the search result?
  • What if Flickr results had a selection of pictures from the photostream?
  • What if Last.fm results had useful information about the artist in them?

Something like this perhaps?

SearchMonkey launched to the public just a few hours ago, and we’ve been playing with it from the beginning, testing the platform, making suggestions and ultimately producing a SearchMonkey application for you all to use.

The first version of our application deals with artist, album and track pages giving you a useful extract of the biography, links to listen to the artist if we have them available, tags, similar artists and the best picture we can muster for the page in question.

If you’d like to try it out, you can find it over here in the Yahoo! search applications gallery.

So how does it work?

Anyone can develop an application for the SearchMonkey platform that works for any URL. However you need to add individual applications to your Yahoo! search preferences for them to take effect; this opt-in helps ensure that a malicious application developer can’t affect everyone.

When you search on Yahoo! after adding an application, SearchMonkey scans through the list of results testing if the application is set to trigger on their URLs. Once a qualifying URL has been spotted, the app either (a) automatically uses embedded microformats, or (b) uses a dedicated webservice (built by the application developer) to extract information about the page.

Finally, it uses an XSLT transformation to translate the extracted data into the DataRSS format, which is then parsed (via a custom PHP class) into a format suitable for the search results. Ace.
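
The trigger-and-extract part of that flow can be pictured with a small Python sketch. Everything in it is hypothetical: the `APPS` registry, the example.com URL pattern and the `extract_artist` helper are made up for illustration and are not SearchMonkey's actual interfaces, and the XSLT and DataRSS steps are left out entirely.

```python
from fnmatch import fnmatchcase


def extract_artist(url):
    """Hypothetical extractor: derive a display title from the last URL segment."""
    name = url.rsplit("/", 1)[-1].replace("+", " ")
    return {"kind": "artist", "title": name}


# Hypothetical registry of applications the user has opted into.
APPS = [
    {"name": "music-app", "pattern": "*example.com/music/*", "extract": extract_artist},
]


def enhance(result_urls, enabled_apps):
    """Attach extra data to each result whose URL triggers an enabled application."""
    enhanced = []
    for url in result_urls:
        extra = None
        for app in enabled_apps:
            if fnmatchcase(url, app["pattern"]):
                extra = app["extract"](url)
                break  # first matching application wins in this sketch
        enhanced.append((url, extra))
    return enhanced


if __name__ == "__main__":
    urls = ["http://example.com/music/Some+Band", "http://example.org/other"]
    for url, extra in enhance(urls, APPS):
        print(url, extra)
```

Results whose URL matches no enabled application pass through untouched, which mirrors the opt-in behaviour described above.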

It’s been fun working with Yahoo! on this small but useful application. Give it a try and let us know what you think!

Python + Hadoop = Flying Circus Elephant

Thursday, 29 May 2008
by klaas
filed under Code
Comments: 19

As a research intern here at Last.fm, dealing with huge datasets has become my daily bread. Having a herd of yellow elephants at my disposal makes this a lot easier, but the conventional way of writing Hadoop programs can be rather cumbersome. It generally involves lots of typing, compiling, building, and moving files around, which is especially annoying for the “write once, run never again” programs that are often required for research purposes. After fixing one too many stupid bugs caused by this copy/paste-encouraging work flow, I finally decided to look for an alternative.

The sluggishness of the conventional Hadoop work flow is mainly caused by the fact that Java is a very verbose and compiled programming language. Hence, finding a solution basically boils down to replacing Java. Since simplicity and programmer productivity were my main goals, it didn’t take me too long to decide that I wanted to use Python instead. The approach described here is the most convenient way of writing Hadoop programs in Python that I could find on the web, but it still wasn’t pythonic enough for my taste. The mapper and the reducer shouldn’t have to reside in separate files, and having to write boilerplate code should be avoided as much as possible. To get rid of these issues, I wrote a simple Python module called Dumbo. The word count example can be written as follows using this module:

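def mapper(key, value):
    # key is ignored; value is one line of input text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over every count emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)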

Compare this to the 60 lines of code required to do exactly the same thing in Java! On a UNIX machine, assuming the program is saved as wordcount.py, it can be run locally by executing the command

python wordcount.py map < wc_input.txt | sort | \
python wordcount.py red > wc_output.txt

and running it on a Hadoop cluster is as simple as typing

python -m dumbo wordcount.py \
-hadoop /path/to/hadoop \
-input wc_input.txt -output wc_output
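
For readers without Dumbo installed, the local map, sort, reduce pipeline can be mimicked in plain Python. This is only a sketch of the idea, not how Dumbo works internally: `run_locally` is a made-up helper, and the line index is used as a stand-in for the mapper's key.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_locally(lines, mapper, reducer):
    """Imitate `map | sort | reduce` on an in-memory list of input lines."""
    # Map phase: feed every line to the mapper and collect the emitted pairs.
    mapped = [pair for index, line in enumerate(lines) for pair in mapper(index, line)]
    # Sort phase: bring equal keys next to each other, as `sort` does in the shell.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: hand each key and its values to the reducer.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


if __name__ == "__main__":
    for word, count in run_locally(["a b a", "b c"], mapper, reducer):
        print(word, count)
```

The sort step is what lets the reducer see all values for a key in one go, which is why the shell version pipes through `sort` between the two Python invocations.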

To put some icing on the cake, Dumbo also offers a few handy extras. You’ll have to read the documentation or even the code to discover them all, but one worth mentioning is that you can implement combiner-like functionality by passing a third parameter to “dumbo.run”. For instance, the amount of network traffic can be reduced substantially in the example above by adding “reducer” as a third parameter to “dumbo.run”.
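
Why a combiner cuts traffic is easiest to see by counting the pairs that leave each mapper. The sketch below is a plain-Python simulation under my own assumptions (the `simulate` and `group_and_apply` helpers are invented for this example and are not part of Dumbo): each inner list plays the role of one input split, and the second return value is the number of pairs that would cross the network.

```python
from collections import defaultdict


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def group_and_apply(pairs, func):
    """Group pairs by key, then apply func(key, values) to each group in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(func(key, grouped[key]))
    return output


def simulate(splits, mapper, reducer, combiner=None):
    """Return (final result, number of pairs shipped from mappers to reducers)."""
    shipped = []
    for split in splits:
        mapped = [pair for index, line in enumerate(split) for pair in mapper(index, line)]
        if combiner is not None:
            # Pre-aggregate on the mapper side before anything is shipped.
            mapped = group_and_apply(mapped, combiner)
        shipped.extend(mapped)
    return group_and_apply(shipped, reducer), len(shipped)


if __name__ == "__main__":
    splits = [["a a a b"], ["a b b"]]
    print(simulate(splits, mapper, reducer))
    print(simulate(splits, mapper, reducer, combiner=reducer))
```

Reusing the reducer as the combiner is only safe here because summing is associative and commutative; a reducer that computes, say, an average could not be dropped in the same way.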

Obviously, the combination of Hadoop Streaming and Dumbo is not a one-size-fits-all solution. Java code is still faster and provides the programmer more flexibility and control. Even with several liters of the finest Belgian beer, there is no way I could convince Johan or Martin to port their heavy regularly-running Hadoop programs to Python. In many cases, however, it makes perfect sense to trade some speed and flexibility for programmer productivity. It definitely allows me to get more work done in less time.

Java Summer Interns

Wednesday, 9 April 2008
filed under Code and About Us
Comments: 24

We have now filled all the slots for the internship. Thanks to everyone who applied!

Our lovely back end team is looking for fresh meat. Specifically Java programming students that want to work with huge datasets and clusters of yellow elephants. So if you are interested in hanging out in East London and getting your hands dirty hacking some of the most exciting music-related software on the ‘net, then read on for more information on our summer internship programme.

You’ll be spending your days bathing in the ball pit and getting back rubs from our support department. Occasionally we’ll ask you to do some work too, for example developing new features for the open source projects below or improving our own internal systems. You will be mentored by experienced Last.fm developers and stand to learn loads about making the software that is used daily by millions of people across the globe.

Potential areas to work in:

  • The open source distributed data processing project Apache Hadoop.
  • The distributed data storage project Apache HBase.
  • Processing our listening data into interesting funky stuff we can use on the site.
  • Improving the Java support in (soon to be Apache) Thrift.
  • Using X-Trace to find and resolve bottlenecks in our distributed systems.
  • We’re also open for student suggestions! If it’s useful and/or cool, include any ideas you may have in your application.

If you’re allowed to work in the UK, please e-mail us your application, including a CV, and put “java summer intern” in the subject line. The internship will last for the coming summer (we are quite flexible with the exact dates). If things go really well, you might even end up joining our team when your studies are over.

More information on our jobs page.

The Last.fm intern experience