Hadoop Summit 2008: Creating new Infrastructures for Big Data

Tuesday, 1 April 2008
by martind
filed under About Us and Code
Comments: 15

Last week Johan and I were fortunate to be sent to the Hadoop Summit 2008, the first conference focused on the Hadoop open source project and its surrounding ecosystem. Great talks, conversations with a lot of interesting and talented people, and loads of food for thought.

Big Data Startups

Last.fm is a prominent representative of a growing class of Internet startups: smaller companies whose business entails storing and processing huge amounts of data with very little overhead. Our project teams are comparatively tiny and we rely mostly on open source infrastructure; and while the Big Data problems of big corporations have been catered to for decades (in the form of an entire industry of integrated infrastructures, expensive hardware, consulting fees, training seminars, …), we’re not really in that market. As a result we often have to create our own low-overhead solutions.

Throughout the Hadoop Summit it was very apparent that we’re seeing the dawn of a new culture of data teams inside Internet startups and corporations: people who manage larger and larger data sets and want better mechanisms for storage, offline processing and analysis. Many are unhappy with existing solutions because they solve the wrong problems, are too expensive, or are based on unsuitable infrastructure designs.

The ideal computing model in this context is a distributed architecture: if your current system is at its limits, you can just add more machines. Google was one of the first Internet companies not only to recognise that, but to find great solutions; their papers on MapReduce (for distributed data processing) and the Google File System (for distributed storage) have had a great impact on how modern data processing systems are built.

Doug Cutting’s Hadoop project incorporates core principles from both of them and, on top of that, is very easy to develop for. It’s a system for the offline processing of massive amounts of data: you write data once and never change it, you don’t care about transactions or latency, but you want to be able to scale up easily and cheaply.
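To give a flavour of the programming model, here’s the canonical word-count job written against Hadoop’s Java MapReduce API – a minimal sketch rather than production code:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts collected for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

The framework takes care of splitting the input, shuffling intermediate pairs between machines and retrying failed tasks; all you write are the two small functions above, which is exactly why it’s so easy to develop for.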

Distributed Computing, Structured Data and Higher Levels of Abstraction

One current trend we’re seeing within the Hadoop community is the emergence of MapReduce processing frameworks at a higher level of abstraction; these usually incorporate a unified model to manage schemas and data structures, and data flow query languages that often bear a striking resemblance to SQL. But they’re not trying to imitate relational databases; as mentioned above, nobody here is interested in transactions or low latency. These are offline processing systems.

My personal favourite among these projects is Facebook’s Hive, which could be described as their approach to a data warehousing model on top of MapReduce; according to the developers it will see an open source release this year. Then there’s Pig, Jaql, and others; and Microsoft’s research project Dryad, which implements an execution engine for distributed processing systems that self-optimises by transforming a data flow graph, and that integrates nicely with their existing (commercial and closed-source) development infrastructure.

Another increasingly prominent project is HBase, a distributed cell store for structured data, which in turn implements another Google paper ("Bigtable"). HBase uses Hadoop’s distributed file system for storage, and we’re already evaluating it for use inside the Last.fm infrastructure.
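To give a feel for the Bigtable-style data model (rows, column families, cells), here’s a minimal sketch of writing and reading a single cell through HBase’s Java client. The table and column names are made up for illustration, and the exact client API differs between HBase releases:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellStoreSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical table of per-track metadata.
            HTable table = new HTable(HBaseConfiguration.create(), "tracks");

            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("track-12345"));
            put.add(Bytes.toBytes("meta"), Bytes.toBytes("artist"),
                    Bytes.toBytes("Guns N' Roses"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("track-12345")));
            byte[] artist = result.getValue(Bytes.toBytes("meta"),
                                            Bytes.toBytes("artist"));
            System.out.println(Bytes.toString(artist));

            table.close();
        }
    }

Rows are kept sorted by key and distributed across the cluster, so lookups by key stay cheap as the table grows – a good fit for key-addressed structured data.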

But despite all this activity it’s still very apparent that this is a young field, and there are at least as many unsolved problems as there are solved ones. This is only a faint indicator of what’s to come…

If You Just Got Interested

Are you a software developer with a background in distributed computing, large databases, statistics, or data warehousing? Want to gain first-hand experience inside an emerging industry? Then apply for our Java Developer and Data Warehouse Developer positions!

Fingerprinting and Metadata Progress Report

Tuesday, 25 March 2008
by RJ
filed under Announcements and Code
Comments: 147

Those of you who’ve been keeping an eye on the blog will have noticed we are working on some audio fingerprinting technology to assist us in cleaning up the mess that is music metadata.

Fingerprint Metrics

So far our fingerprint server has identified 23 million unique tracks from the 650 million fingerprint requests you’ve thrown at it. Who knows how many unique tracks there are out there… We have a couple of hundred million tracks based on spelling alone – but not all of them are spelt correctly. We’re getting closer to an answer, though.

Order from Chaos

Erik has developed an ingenious method of extracting the correct names from the millions of fingerprinted tracks. This turns out to be a decidedly tricky problem: the most popular spelling is not necessarily the correct one. There is also the issue of foreign-language artist names, and of popular non-misspellings such as name changes, abbreviations or acronyms (think RATM, TAFKAP, GNR, ATWKUBTTOD etc.). Oh, and there are plenty of interesting corner cases, like our old friend “Various Artists” – yes, I’m talking about Torsten Pröfrock.
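To make the problem concrete, here’s a toy sketch of the obvious baseline – a straight popularity vote over the spellings submitted for one fingerprint. This is emphatically not Erik’s method (as just noted, the most popular spelling isn’t necessarily the right one); it’s the naive approach a real solution has to beat:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy illustration only: pick the most frequently submitted spelling
    // for one fingerprint id. A real method also has to handle
    // foreign-language names, aliases and systematic misspellings, and
    // check candidates against external catalogues.
    public class NaiveSpellingVote {

        public static String mostFrequentSpelling(List<String> submittedNames) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String name : submittedNames) {
                String key = name.trim().toLowerCase(); // crude normalisation
                Integer seen = counts.get(key);
                counts.put(key, seen == null ? 1 : seen + 1);
            }
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    best = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            return best;
        }
    }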

We’d also like to take this opportunity to big up the Musicbrainz and Discogs massive – without the free availability of high-quality datasets such as these, we’d be having a much harder time checking our results. We are planning to feed some useful data back to them in the form of artists we think exist, but don’t show up in their respective catalogues.

Based on all this, we have an initial data set which maps fingerprints to what we think the correct metadata might be. We are using this data in two ways right now: a tech-demo fingerprint client, and artist alias voting on the site.

Command-line fingerprinter demo

The tech-demo for this is a command-line fingerprint client that can convert an mp3 file into a fingerprint id and look up the metadata.

Grab the command-line fingerprint tool here:

  • Source code (GPL2): svn://svn.audioscrobbler.net/recommendation/MusicID/lastfm_fplib

We’re interested in any feedback on the accuracy of the metadata lookup feature, especially if you find things that are horribly wrong/broken.

Making sense of it all with artist corrections

So based on all this fingerprint data, we have a new artist alias list. We’re not doing any auto-corrections until we’re sure how sure we are, but in the meantime if you stumble across an artist page that has potential corrections, you are given a chance to vote on the correct one:


You will only see this if you are logged in.

This basic voting system will help us evaluate how good the current data set is. We have a track dataset too, but let’s start with artists and see how it goes.

The Guns N’ Roses Issue

Back in December I used Guns N’ Roses to illustrate the metadata problem by asking:

  • Just how many ways to write “Guns N’ Roses – Knockin’ on Heaven’s Door” are there?

We still don’t have a concrete answer for that question, but here are the Top 100 ways to write: Guns N’ Roses – Knockin’ on Heaven’s Door

And just for fun, here are some of the artist names used on Guns N’ Roses MP3s in descending order of popularity – yes, people really do mistag things this badly :)

What the future holds

Fixing up music metadata remains a hot topic for us this year; there is plenty more to come:

  • Disambiguation to properly display the nine different artists that all share the name Mono.
  • Correct data displayed when scrobbling, even if your tags suck.
  • Maybe the ability to correct your crappy tags, if the data is accurate enough and there is demand for it.
  • Return other identifiers from fingerprints, such as Musicbrainz IDs and Discogs URLs.
  • Consolidate the stats on Last.fm so that Artist top tracks, and ultimately user charts and all other charts, are corrected.
  • Implicitly improve Last.fm recommendations and radio due to less noise in the data.
  • More formal collaboration with other music metadata crusaders.

Conclusion

If you are using the official Last.fm software, every time you play a track you are contributing to the metadata cleanup effort by scrobbling and fingerprinting what you play.

In the meantime, we are working to refine the data, and will publish updated artist corrections, track corrections and some analysis of the feedback we receive from this round of voting soon.

Listeners, Leopard, Lookups, and Lots More

Wednesday, 31 October 2007
by RJ
filed under Announcements and Code
Comments: 44

Although I feel bad pushing the Shaggy Bigs Up Last.fm video further down the page (sorry Muz), here is an update on recent releases in case you missed them:

Last.fm on OS X Leopard (beta)

Max has posted a build of the Last.fm software which supposedly plays nice with Leopard; see this forum thread for the download link. Feedback from Leopard users is appreciated (in that thread, please!) so we can ‘officially’ release a Leopard build.

Listeners update: Just Listened

The ‘listeners’ section on artist/album/track pages now includes a healthy mix of people who are currently online and have just listened to the item in question. This makes for some interesting exploration.

Just Listened Example

Privacy

If you’re worried about your ex-girlfriend stalking your profile to see when you’re online (you know who you are), then you can opt out of realtime features from your privacy settings.

Fingerprinting

Norman has published a command-line fingerprint client which also returns some basic metadata, if we can identify the song.

The next fingerprint milestone will be releasing a build of the Last.fm software that can fingerprint songs (expected in a couple of weeks’ time). Soon after that we’ll have enough data to publish the proper webservice, so it will be possible to give the fingerprint service a good work-out.

Needless to say we are all extremely excited at the prospect of finally fixing up the plethora of badly spelt artist and track titles on Last.fm.

Wiki Fact Tags

Artist wikis now support a few special ‘fact tags’ which allow you to mark up certain important facts in artist biographies. This makes the facts easily extractable, hopefully without spoiling the readability whilst editing a wiki.

Example: Radiohead were formed in [placeformed]Oxfordshire, UK[/placeformed] in [yearformed]1986[/yearformed]… etc.

Fact Tags Example

See more Artist Wiki Fact-tag examples
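Since the whole point is that these facts are machine-extractable, here’s a rough sketch of how tags in this format could be pulled out of wiki text with a regular expression (illustrative only – the tag names come from the example above):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FactTagSketch {

        // Matches [factname]value[/factname] pairs, e.g. [yearformed]1986[/yearformed].
        private static final Pattern FACT_TAG =
                Pattern.compile("\\[(\\w+)\\](.*?)\\[/\\1\\]");

        public static Map<String, String> extract(String wikiText) {
            Map<String, String> facts = new HashMap<String, String>();
            Matcher m = FACT_TAG.matcher(wikiText);
            while (m.find()) {
                facts.put(m.group(1), m.group(2));
            }
            return facts;
        }

        public static void main(String[] args) {
            String bio = "Radiohead were formed in [placeformed]Oxfordshire, UK[/placeformed]"
                       + " in [yearformed]1986[/yearformed].";
            // Prints something like {yearformed=1986, placeformed=Oxfordshire, UK}
            System.out.println(extract(bio));
        }
    }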

Admittedly we’re not doing anything especially interesting with this data yet apart from displaying it on the biography page – but the mad-scientist department at Last.fm is itching to start adding this data into the mix to improve recommendations and radio.

One more thing…

We had a flood in the office :(

Another one more thing…

Jonty & Steve have started giving their brains a bit of extra exercise each day, and getting some much-needed time away from their monitors, by playing relaxing games of Go. You can see the outcome of this on the live GoCam in the right sidebar here on the blog.

The GoCam

Ivy Subversion Resolver code set free

Wednesday, 24 October 2007
by adrian
filed under Code and About Us
Comments: 6

This one is for all the Java geeks out there…

You may or may not know that a lot of the backend data processing at Last.fm (e.g. chart generation, listening pattern analysis) is done by applications we write in Java. Like many other Java users we also use Ant to build our software. Recently we incorporated Ivy into our build process to help us manage the dependencies between our projects and the third-party libraries that we use (incidentally, Ivy has just graduated to an official Ant subproject, so it is going from strength to strength). This went smoothly until we ran into a limitation of Ivy: using Subversion as a dependency repository isn’t supported out of the box. Fortunately, Ivy has been architected in such a way that it is very extensible, so we coded up our own Ivy Subversion Resolver and gave it the oh-so-catchy name “Ivy-svn” (a name only a geek could love).

Anyway, the point of all this is that we found it really useful and decided not to keep it to ourselves, but to share the fruits of our labour (and the code) under the Apache license. If this sounds like something for you, head on over to the Audioscrobbler development site at:

http://www.audioscrobbler.net/development/ivysvn/

to find out more. If you do end up using it and have feedback or issues, we’d love to hear from you.