Hello people. I’m cms, and my job here at Last.fm is looking after the databases. Much of the time I’m involved with operational running of database servers, designing and optimising SQL queries, and scaling work on our relational database clusters. Every now and then though, I do get an opportunity to poke around in the Last.fm dataset and explore some of the interesting relations.
I recently re-discovered the seminal album ‘Spirit Of Eden’ by ‘Talk Talk’ (haven’t tried it? You really should, it’s magical), and I’d been giving it quite heavy rotation. This prompted a comment on my profile by one of our lovely users, who suggested making a playlist from artists whose names consisted of repeating word patterns. This idea appealed to me, but off the top of my head I could only come up with a paltry half-dozen candidates. Surely there were many, many more. If only there was some kind of database nearby I could query…
We keep our main catalogue data in a PostgreSQL database. PostgreSQL has a nice set of extended string operators, including quite comprehensive regular expressions support, which would be useful for an ad-hoc query like this.
Here’s what I came up with initially off the top of my head
select name from artist where name ~* E'^(\\w+\\M)\\s+\\y\\1$' ;
Using the case insensitive regular expression match operator ~* and matching against a string that begins with a sequence of word characters leading up to a word boundary, which I’m capturing as a group, then a sequence of whitespace, then the start of a word boundary followed by the original captured match.
This query worked really well at defining the pattern for repeating names. I was matching well over 10,000 distinct strings. The problem was that we store all the submitted data for artists, and this includes data from a broad range of unverifiable sources. I was getting lots of great artist names in my set, but many of them were bogus; typos, mis-taggings, spelling corrections, and that was just the obvious mistakes.
I needed to come up with a way of filtering the set further. My first iteration was to use track information. Incorrect artist attributions seemed unlikely to have relations over tracks in the catalogue, and I could extend my query relatively easily to take account of prolificness like so.
select count(1), a.name from artist a, track t where a.name ~* E'^(\\w+\\M)\\s+\\y\\1$' and t.artist = a.id group by 2 order by 1 desc;
This got me a shorter set of artists (8000 odd), with some ordering. I could see that recognisable artist names (hello Duran Duran !) were sorting towards the top. However, ordering by catalogue volume still wasn’t quite right. Ideally I needed some kind of popularity weighting. Unfortunately we don’t store any scrobble data in the PostgreSQL catalogue schemas.
However we do store scrobbles, alongside exported catalogue information in our Hadoop cluster. Although I have been known to write Java code in the past, I’m mildly allergic to it. Luckily for me we have a Hive interface to Hadoop. Hive offers an interactive query language over Hadoop that is closely modelled on SQL. The only stumbling block remaining was porting my regular expression over to use Java syntax.
Here’s what I ended up with as a hive query:
from meta_artist join
overallplayreach_artist on meta_artist.id = overallplayreach_artist.id
where meta_artist.name RLIKE '^(.+?\\b)\\s+\\b\\1$' and
meta_artist.correctid IS NULL
and overallplayreach_artist.reach > 50
order by overallplayreach_artist.reach desc ;
Joining against some “playreach” data to give a weighting according to rough popularity. My original SQL query took 17 minutes to run, on a fairly beefy database server. The hive query took less than 100 seconds to return, running across the entire Hadoop cluster. Awesome.
Without any further ado, here’s the top 10 results, roughly ordered by artist popularity.
|Artists with repeating name patterns|
I’ve created a tag artistartist, and tagged some of the entries already.
The full list is available here. There might well still be some rough data in there, I haven’t particularly sanity checked it by eye.