The previous post about beta testing our new fingerprint technology generated quite a bit of feedback – thanks to everyone involved.
- 10 million fingerprint submissions received so far (keep ’em coming)
- Lots of useful feedback and thoughtful questions
- We are frantically working on the server architecture in order to get a public-facing lookup service ready as soon as possible.
- Dedicated hardware for this is scheduled to arrive on Friday afternoon. Laurie will probably post pictures once all the LEDs are working.
This weekend we started feeding the fingerprints en masse into the fingerprint indexing service, which will tell us how many duplicate tracks there are and so on. This is in full swing, but we’ve only imported about 3m so far. A graph showing total fingerprints received and total unique tracks is available here (updates twice an hour) – bear in mind that we have not imported all the fingerprints yet, so expect the ratio of ‘unique tracks’ / ‘fingerprints received’ to increase over the course of the week.
Although far from complete, we have the capability to query the currently imported data and return an internal unique ID, and the list of spellings used for that fingerprint:
    $ ./query ../mp3/Pink\ Floyd/The\ wall\ \(disc\ 1\)/03\ Another\ brick\ in\ the\ wall\ \(part\ 1\).mp3
    Fingerprint ID: 527153
    --
    Pink Floyd - Another Brick in the Wall (Part 1)
    Pink Floyd - Another Brick In The Wall (Part I)
    Pink Floyd - Another Brick In The Wall Part 1
    Pink Floyd - Another Brick in the Wall, Par
    Pink Floyd - Another Brick in the Wall, Pt. 1
    Pink Floyd - The Wall (PartI)
    Pink Floyd - Another brick in the wall (p..
    Unknown - unknown
Or another example:
    $ ./query ../mp3/Guns\ N\'\ Roses/Use\ Your\ Illusion\ II/04\ -\ Knockin\'\ On\ Heaven\'s\ Door.mp3
    Fingerprint ID: 1211395
    --
    Guns N' Roses - Knockin' on Heaven's Door
    Guns N' Roses - Knockin´ On Heaven´s Door
    Guns N' Roses - Knockin On Heavens Door
I’m expecting a lot more versions of this one once we’ve cleared the fingerprint backlog… all those apostrophes ;)
What this shows is simply the list of known spellings for songs that sound exactly the same as the MP3s used as a test. The list of known spellings will increase as we import the rest of the data. It’s going to be trivial for us to order this list by popularity, and assume that the most popular spelling is usually correct. We are, however, painfully aware that this is not going to be reliable enough in many cases.
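To make the "order by popularity" idea concrete, here is a minimal Python sketch. It is not our actual indexing code: the function names and the submission counts are made up for illustration, and it assumes we simply have one spelling string per submission for a given fingerprint ID.

```python
from collections import Counter


def rank_spellings(submissions):
    """Return (spelling, count) pairs, most frequently submitted first."""
    return Counter(submissions).most_common()


def most_popular_spelling(submissions):
    """Return the most frequently submitted spelling, or None if there are none."""
    ranked = rank_spellings(submissions)
    return ranked[0][0] if ranked else None


# Hypothetical submissions for one fingerprint ID (the counts are invented).
submissions = (
    ["Guns N' Roses - Knockin' on Heaven's Door"] * 5
    + ["Guns N' Roses - Knockin On Heavens Door"] * 2
    + ["Guns N' Roses - Knockin´ On Heaven´s Door"]
)

best = most_popular_spelling(submissions)
print(best)
```

A plain count like this only picks the majority spelling. If most submitters share the same mistake, the mistake wins, which is exactly why it will not be reliable enough on its own.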
The crux of the moderation task will be taking these sets of spellings (once album and popularity data are added) and distilling each set into a single correct version.
Once we have processed the fingerprint backlog (and hopefully updated the client to address some known issues) we will be in a better position to figure out how we’re going to do this.
Initial questions we hope to answer:
- How accurate will the ‘most popular spelling is correct’ assumption be?
- How many unique recordings make up the ~10m fingerprint submissions?
- Can we work with MusicBrainz for voting, and to publish a “common misspellings” dataset?
- Just how many ways to write “Guns N’ Roses – Knockin’ on Heaven’s Door” are there?