Those of you who’ve been keeping an eye on the blog will have noticed we are working on some audio fingerprinting technology to assist us in cleaning up the mess that is music metadata.
So far our fingerprint server has identified 23 million unique tracks from the 650 million fingerprint requests you’ve thrown at it. Who knows how many unique tracks there are out there? We have a couple of hundred million tracks based on spelling alone – but not all of them are spelt correctly. We’re getting closer to an answer though.
Order from Chaos
Erik has developed an ingenious method of extracting the correct names from the millions of fingerprinted tracks. This turns out to be a decidedly tricky problem – the most popular spelling is not necessarily the correct one. There is also the issue of foreign language artist names, and popular non-misspellings such as name-changes, abbreviations or acronyms (think RATM, TAFKAP, GNR, ATWKUBTTOD etc). Oh, and there are plenty of interesting corner cases, like our old friend “Various Artists” – yes, I’m talking about Torsten Pröfrock.
We’d also like to take this opportunity to big up the MusicBrainz and Discogs massive – without the free availability of high-quality datasets such as these, we’d be having a much harder time checking our results. We are planning to feed some useful data back to them in the form of artists we think exist, but don’t show up in their respective catalogues.
Based on all this, we have an initial data set which maps fingerprints to what we think the correct metadata might be. We are using this data in two ways right now: a tech-demo fingerprint client, and artist alias voting on the site.
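To make the "most popular spelling" problem a bit more concrete, here is a minimal sketch of the naive baseline: group submitted tags by fingerprint id and pick the most frequent spelling. This is not Erik's method, which isn't described here. The fingerprint ids, the sample tags and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical submissions: (fingerprint id, artist tag as submitted).
submissions = [
    (101, "Guns N' Roses"),
    (101, "Guns and Roses"),
    (101, "Guns and Roses"),
    (101, "GNR"),
    (202, "Mono"),
]


def spellings_by_fingerprint(rows):
    """Count how often each artist spelling was submitted per fingerprint id."""
    counts = defaultdict(Counter)
    for fingerprint_id, artist in rows:
        counts[fingerprint_id][artist] += 1
    return counts


def most_popular_spelling(counts):
    """Naive baseline: the most frequent spelling wins for each fingerprint."""
    return {fp: c.most_common(1)[0][0] for fp, c in counts.items()}


counts = spellings_by_fingerprint(submissions)
baseline = most_popular_spelling(counts)
```

With this toy data the baseline settles on "Guns and Roses" for fingerprint 101, which is exactly the kind of popular-but-wrong answer that makes the problem tricky.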
Command-line fingerprinter demo
The tech-demo for this is a command-line fingerprint client that can convert an mp3 file into a fingerprint id, and look up the metadata.
Grab the command-line fingerprint tool here:
- Source code (GPL2): svn://svn.audioscrobbler.net/recommendation/MusicID/lastfm_fplib
We’re interested in any feedback on the accuracy of the metadata lookup feature, especially if you find things that are horribly wrong/broken.
Making sense of it all with artist corrections
So based on all this fingerprint data, we have a new artist alias list. We’re not doing any auto-corrections until we’re sure how sure we are, but in the meantime if you stumble across an artist page that has potential corrections, you are given a chance to vote on the correct one:
You will only see this if you are logged in.
This basic voting system will help us evaluate how good the current data set is. We have a track dataset too, but let’s start with artists and see how it goes.
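As a rough illustration of why voting helps before switching on any auto-corrections, here is a sketch of a vote tally that only accepts a correction once enough people agree. The thresholds (`min_votes`, `min_share`) and the function name are assumptions for the example, not how the site actually decides.

```python
from collections import Counter


def tally_votes(votes, min_votes=10, min_share=0.8):
    """Return the winning spelling, or None if the vote is not yet decisive.

    min_votes and min_share are made-up thresholds for illustration.
    """
    if len(votes) < min_votes:
        return None  # too few votes to trust the result
    winner, count = Counter(votes).most_common(1)[0]
    # Only accept the winner if it has a clear majority of all votes cast.
    return winner if count / len(votes) >= min_share else None
```

For example, nine votes for "Guns N' Roses" and one for "GNR" would be accepted under these thresholds, while a 6 to 4 split would be left undecided.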
The Guns N’ Roses Issue
Back in December I used Guns N’ Roses to illustrate the metadata problem by asking:
- Just how many ways to write “Guns N’ Roses – Knockin’ on Heaven’s Door” are there?
We still don’t have a concrete answer for that question, but here are the Top 100 ways to write: Guns N’ Roses – Knockin’ on Heaven’s Door
And just for fun, here are some of the artist names used on Guns N’ Roses MP3s in descending order of popularity – yes, people really do mistag things this badly :)
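Many of these variants differ only in case, punctuation or how the "N" is written, so a simple normalisation step can collapse a good chunk of them. The sketch below is one hypothetical way to do that in Python. It only handles Latin-script names, the rules are assumptions chosen for the example, and it does nothing for acronyms like GNR, which need a proper alias list.

```python
import re
import unicodedata


def normalize_artist(name):
    """Collapse case, accents, punctuation and 'n'/'&'/'and' variants."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.casefold().replace("&", " and ")
    # Drop apostrophes (straight, curly, backtick) so "n'" becomes "n".
    text = re.sub(r"['’`]", "", text)
    # Any other run of non-alphanumeric characters becomes a single space.
    # Note: this wipes non-Latin scripts, so it is no good for those names.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Treat a standalone "n" as "and".
    tokens = ["and" if t == "n" else t for t in text.split()]
    return " ".join(tokens)
```

Under these rules "Guns N’ Roses", "GUNS & ROSES", "Guns 'n' Roses" and "guns_and_roses" all normalise to "guns and roses", while "GNR" stays as "gnr".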
What the future holds
Fixing up music metadata remains a hot topic for us this year, and there is plenty more to come:
- Disambiguation to properly display the nine different artists that all have the name: Mono.
- Correct data displayed when scrobbling, even if your tags suck.
- Maybe the ability to correct your crappy tags, if it’s accurate enough, and there is demand for it.
- Return other identifiers from fingerprints, such as MusicBrainz IDs and Discogs URLs.
- Consolidate the stats on Last.fm so that Artist top tracks, and ultimately user charts and all other charts are corrected.
- Implicitly improve Last.fm recommendations and radio due to less noise in the data.
- More formal collaboration with other music metadata crusaders.
If you are using the official Last.fm software, every time you play a track you are contributing to the metadata cleanup effort by scrobbling and fingerprinting what you play.
In the meantime, we are working to refine the data, and will publish updated artist corrections, track corrections and some analysis of the feedback we receive from this round of voting soon.