Advanced Robotics

Tuesday, 24 July 2012
by christopher
filed under Announcements
Comments: 32

Do you have Robot Ears?

That’s the question we asked a few weeks ago. As we explained at the time, we’ve been training an army of music-listening robots (or “audio analysis algorithms” if you want to get technical!) to try to better understand the music you scrobble.

The idea is that by automatically analysing tracks we’ll be able to add helpful tags, improve recommendations, and provide novel ways to explore your collection and discover new music.

We asked for your help to evaluate our robots. We thought they were doing a pretty good job in most cases, but there was definite room for improvement, and like good scientists we were looking for large-scale evidence (i.e. lots of feedback from real people) rather than just going on our own impressions. So we built the Robot Ears app, which asks humans to classify tracks and then compares their answers with what our “robots” said about the same tracks.

Click to try the latest Robot Ears

Now, six weeks later, we’ve gathered over 30,000 judgements on 600 or so music tracks and we’re ready to share some initial results.

*Spoiler alert*
The robots did pretty well – but we’re not satisfied yet!

We’re kicking off another round of experiments, to learn even more about a wider variety of music tracks. The more people we can get to take part the better, so whether you’ve tried it already or not, please visit Robot Ears - and help the robots to keep improving!

Want to know what we (and the robots) have learned so far? Read on for the details…

The results so far

We were aiming to answer two different questions with this experiment:

  1. Are the labels we’re trying to apply to tracks meaningful?
  2. Do our robots reliably apply the right labels to a track?

The first question is the more fundamental – if we’re using labels that don’t mean anything to humans, it doesn’t much matter what our robots say! To answer this question we looked at the average agreement between humans for each track. If humans reliably agree with each other we can conclude the label has a clear meaning, and it’s worth trying to get our robots to replicate those judgements.
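One simple way to measure "average agreement between humans for each track" is the fraction of pairs of judgements that match. The post doesn't say exactly which metric was used, so this is a hypothetical sketch of the pairwise-agreement approach, with made-up judgement data:

```python
from itertools import combinations

def track_agreement(labels):
    """Fraction of pairs of human judgements that agree for one track.

    `labels` is the list of category labels humans assigned to the track
    (hypothetical data; the exact metric used in the experiment isn't
    published).
    """
    pairs = list(combinations(labels, 2))
    if not pairs:
        return None  # need at least two judgements to measure agreement
    return sum(a == b for a, b in pairs) / len(pairs)

# Five hypothetical judgements for one track's "Speed" feature:
print(track_agreement(["fast", "fast", "fast", "midtempo", "fast"]))  # 0.6
```

A high score across many tracks would suggest the label has a clear, shared meaning; a score near chance level would suggest the wording needs rethinking.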

We were looking at 15 different audio “features”. Each feature describes a particular aspect of music, such as:

  • Speed
  • Rhythmic Regularity
  • Noisiness
  • “Punchiness”

The features have a number of categories, for example “Speed” can be fast, slow or midtempo. Each time a human used the Robot Ears app, they were asked to sort tracks into the appropriate categories for a particular feature. Meanwhile our robots were asked to do the same. At the end of a turn, we showed you how your answers matched up with the robot’s:
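Scoring the robots against the humans then boils down to asking, per feature, how often the two picked the same category. Here's a minimal sketch of that comparison; the tuple format and category names are assumptions for illustration, not the app's real data model:

```python
from collections import defaultdict

def robot_accuracy(judgements):
    """Per-feature fraction of judgements where human and robot agree.

    `judgements` is a list of (feature, human_label, robot_label) tuples
    (a hypothetical format for illustration).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for feature, human, robot in judgements:
        totals[feature] += 1
        hits[feature] += (human == robot)
    return {feature: hits[feature] / totals[feature] for feature in totals}

# Hypothetical judgements across two features:
data = [
    ("Speed", "fast", "fast"),
    ("Speed", "slow", "midtempo"),
    ("Noisiness", "noisy", "noisy"),
    ("Noisiness", "clean", "noisy"),
    ("Noisiness", "noisy", "noisy"),
]
print(robot_accuracy(data))
```

Run over all 30,000 judgements, a table like this is what separates the features the robots are "particularly good at" from the ones that need work.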

After we’d gathered about 16,000 human judgements we took a look at the results so far. There were some interesting findings about which features were doing better or worse. Based on these we adjusted some of the labels, threw some out completely, tweaked our robot algorithms and started a new experiment. Another 14,000 judgements later we reached the following results:

We can see that the levels of human agreement vary quite a lot across the features, with activity, percussiveness, smoothness and energy seeming to be the most reliable. By the end of the second round experiment there were just a handful of features (rhythmic regularity, sound creativity and harmonic creativity) we still weren’t convinced by. We aren’t giving up on these, but it seems like we don’t quite have the right words to describe them yet!

Speaking of which – we had a side question in each test: “would you change any of these labels?”

We got some interesting suggestions. Some were helpful. Some… less so! For example:

  • Noisiness: noisy → distorted
  • Energy: soft → calm
  • Energy: energetic → powerful, emotional, EXTREME HIGH
  • Danceability: dance → strong beat, rhythmic
  • Danceability: atmospheric → ambient, spacious
  • Harmonic Creativity: little harmonic interest → boring
  • Tempo: steady → great workout stuff
  • Punchiness: punchy → wide dynamic range
  • Sound Creativity: consistent → simple
  • Sound Creativity: varied → upfront texture
  • Smoothness: uneven → turbulent

One user also suggested renaming the Not Sure box “I’m an idiot”!

So what about the second question: “How did our robots do?” Well, again, there was quite a range of performance across the different types of feature:

As you can see there are a few features our robots are particularly good at, and a few where their ears definitely need to be cleaned out!

What’s next?

Doing these first two experiments allowed us to refine our terminology and the way our robots classify tracks. Naturally, being built in London, our robots are currently very excited about the Olympics. In that spirit we’re going to award them a bronze medal for progress so far:

We’ve already started to work on some new functionality based on the more reliable features. Here’s a sneak preview of what Mark and Graham came up with at a recent internal hack day:

There’s a lot more work still to do though, and so we’re kicking off a third round of experiments. The key difference is we’ll be using a lot more music tracks this time, and hopefully getting a lot more user feedback.

Whether you’ve taken part already or not, we’d love it if you’d come visit Robot Ears and help our robots go for the Gold!