Hello people. I’m cms, and my job here at Last.fm is looking after the databases. Much of the time I’m involved with operational running of database servers, designing and optimising SQL queries, and scaling work on our relational database clusters. Every now and then though, I do get an opportunity to poke around in the Last.fm dataset and explore some of the interesting relations.
I recently re-discovered the seminal album ‘Spirit Of Eden’ by ‘Talk Talk’ (haven’t tried it? You really should, it’s magical), and I’d been giving it quite heavy rotation. This prompted a comment on my profile by one of our lovely users, who suggested making a playlist from artists whose names consisted of repeating word patterns. This idea appealed to me, but off the top of my head I could only come up with a paltry half-dozen candidates. Surely there were many, many more. If only there was some kind of database nearby I could query…
We keep our main catalogue data in a PostgreSQL database. PostgreSQL has a nice set of extended string operators, including quite comprehensive regular expressions support, which would be useful for an ad-hoc query like this.
Here’s what I came up with initially off the top of my head
select name from artist where name ~* E'^(\\w+\\M)\\s+\\y\\1$' ;
Using the case insensitive regular expression match operator ~* and matching against a string that begins with a sequence of word characters leading up to a word boundary, which I’m capturing as a group, then a sequence of whitespace, then the start of a word boundary followed by the original captured match.
This query worked really well at defining the pattern for repeating names. I was matching well over 10,000 distinct strings. The problem was that we store all the submitted data for artists, and this includes data from a broad range of unverifiable sources. I was getting lots of great artist names in my set, but many of them were bogus; typos, mis-taggings, spelling corrections, and that was just the obvious mistakes.
I needed to come up with a way of filtering the set further. My first iteration was to use track information. Incorrect artist attributions seemed unlikely to have relations over tracks in the catalogue, and I could extend my query relatively easily to take account of prolificness like so.
select count(1), a.name from artist a, track t where a.name ~* E'^(\\w+\\M)\\s+\\y\\1$' and t.artist = a.id group by 2 order by 1 desc;
This got me a shorter set of artists (8000 odd), with some ordering. I could see that recognisable artist names (hello Duran Duran !) were sorting towards the top. However, ordering by catalogue volume still wasn’t quite right. Ideally I needed some kind of popularity weighting. Unfortunately we don’t store any scrobble data in the PostgreSQL catalogue schemas.
However we do store scrobbles, alongside exported catalogue information in our Hadoop cluster. Although I have been known to write Java code in the past, I’m mildly allergic to it. Luckily for me we have a Hive interface to Hadoop. Hive offers an interactive query language over Hadoop that is closely modelled on SQL. The only stumbling block remaining was porting my regular expression over to use Java syntax.
Here’s what I ended up with as a hive query:
select meta_artist.name,
overallplayreach_artist.reach
from meta_artist join
overallplayreach_artist on meta_artist.id = overallplayreach_artist.id
where meta_artist.name RLIKE '^(.+?\\b)\\s+\\b\\1$' and
meta_artist.correctid IS NULL
and overallplayreach_artist.reach > 50
order by overallplayreach_artist.reach desc ;
Joining against some “playreach” data to give a weighting according to rough popularity. My original SQL query took 17 minutes to run, on a fairly beefy database server. The hive query took less than 100 seconds to return, running across the entire Hadoop cluster. Awesome.
Without any further ado, here’s the top 10 results, roughly ordered by artist popularity.
Artists with repeating name patterns |
---|
Duran Duran |
Frou Frou |
Gus Gus |
Talk Talk |
Xiu Xiu |
The The |
Man Man |
Cash Cash |
Danger Danger |
Gudda Gudda |
I’ve created a tag artistartist, and tagged some of the entries already.
The full list is available here. There might well still be some rough data in there, I haven’t particularly sanity checked it by eye.
If you too would like the chance to play with Last.fm’s vast amounts of data and join our team, check out our job openings.
Comments
Chris Crum
12 October, 17:46
I realize it’s different, but it seems like Mr. Mister belongs on the list. Talk Talk rules, by the way.
Tecfan
13 October, 09:49
Really cool experiment :)
How about names like Cats and Cats and Cats
xerode
13 October, 09:53
Duran Duran but no Duran Duran Duran? Nooo
;_;
Also would !!! count?
James Grant
13 October, 10:42
Looking at Colin’s list got me thinking “What about artists who names are palindromes?”. Since I happened to be testing a new version of Hive I thought I might as well find out. The top 20 (in order of the number of listeners):
1. ABBA
2. múm
3. B.o.B
4. Eve
5. MUM
6. 2002
7. PMMP
8. Ada
9. Anúna
10. M2M
11. D-A-D
12. Asa
13. AYA
14. OOIOO
15. Otto
16. BOB
17. EverEve
18. TNT
19. W&W
20. Ana
Lawrie
13 October, 11:02
See also –
Attack! Attack! (US)
Attack Attack (UK)
And my personal favorites – Giraffes? Giraffes!
kulpdogg
13 October, 14:02
Everything Everything
they’re AMAZING, PLEASE look them up!
Jonathan Norman
13 October, 14:04
Wakey
James
13 October, 14:05
This is actually useful for my studies as I’m doing SQL server and database management lol.
Jonathan Norman
13 October, 14:05
Wakey! Wakey!
Gerrit Fries
13 October, 14:18
gobble gobble
Me
13 October, 14:25
How about The The?
Al_
13 October, 14:29
What about artist names like
Hey Hey My My
Pony Pony Run Run
? :)
Lion_ditchie
13 October, 14:31
Nice Nice
passianotto
13 October, 15:06
The Mighty Mighty Bosstones may have been excluded from your query because of the last word, but I love them.
=)
eric casteleijn
13 October, 16:00
Why ‘artistartist’ as the tag though, and not ‘artist artist’, since your regular expression enforces whitespace between the two identical parts, thus excluding artists like Wakey!Wakey! or Duvelduvel. ;)
Steven Gravell
13 October, 16:26
Spirit of Eden is one of the best
Anonymous
13 October, 17:02
obligatory:
http://xkcd.com/208/
Kale Booth
13 October, 17:11
@James I’m glad you thought of the palindrome idea, too. it’s the first thing that popped into my head. Thanks!
Eliseo Soto
13 October, 18:19
Nice! José José and Polo Polo made it to the list. They’re really popular in México.
Mike Hates Music
13 October, 21:40
wet wet wet
cms
14 October, 10:48
Thanks for all the comments and suggestions. All the people suggesting artist names that aren’t covered; this is just down to the precise regular expression used, and this is somewhat arbitrary on my part. In this sort of exercise you must define some rules for the pattern match, and I just picked them quite whimsically. In fact the earlier versions of the regular expression ignored punctuation, I must have removed this while I was refining the sets.
@eric casteleijn I just liked the way ‘artistartist’ looked I guess.
Clyde Machine
15 October, 01:23
Danger Danger! Rock ‘n roll!
Interesting read, by the way.
Clyde Machine
15 October, 01:40
“obligatory:
http://xkcd.com/208/” -Anonymous
That’s what I thought of when I saw “regular expressions”!
nandosweettokin
15 October, 13:09
hot hot heat
Yeah yeah yeahs
chk chk chk (aka !!!)
cfcf
hh
15 October, 15:27
So why did you use the postgres E’‘ syntax (for a regexp, of all things!) when you could’ve used the regular ‘’ syntax and halved the backslashes?
cms
15 October, 15:57
@hh force of habit I suppose. Many of our postgreSQL servers run with standards_conforming_strings = off for legacy reasons, and I’m used to expressing strings in that fashion.
You’re quite right that it looks much noisier, and I could have cleaned it up before publication, I’ll bear that in mind for the future.
Chrise
15 October, 16:34
Xiu Xiu. China is taking over the world anyway so you might as well be the first in line to acknowledge this!
RompeRatones
17 October, 15:42
How about Controller.Controller
JustSomeOldJoe
17 October, 20:33
The problem was that we store all the submitted data for artists, and this includes data from a broad range of unverifiable sources.
. . .
Unfortunately we don’t store any scrobble data in the PostgreSQL catalogue schemas.
So, all individual scrobbles are maintained on the Hadoop cluster, but unique scrobbled artist/album/track metadata is propogated from Hadoop over to the DB (via trigger/nightly batch/divine intervention) directly into the base catalog? Is it Hadoop Scrobbled -> Hadoop Catalog -> DB Catalog?
The name pattern matching is interesting, but a description of the specific methods/architecture used to wrangle such a large amount of volume (flow direction, replication points, etc.) would be very, very cool!
The hive query took less than 100 seconds to return, running across the entire Hadoop cluster.
Quite a diff from 17 minutes. Not surprising considering the pattern match op up against a DB and the size of the catalog (especially if it includes scrobbled metadata).
Curious whether or not your search software supports pattern matching, and, if so, what the results would look like.
Thanks for sharing.
spudart
27 October, 22:29
Will there be a blog post about the Station Changes?
Kath
28 October, 17:50
Atl Atl didn’t make the cut, either due to improper capitalization in the Last.fm database, or a cutoff on the # of listeners.
Lisa
4 November, 00:45
Interesting experiment!
Thanks for all the hard work put into keeping last.fm on the air…absolutely love itXD
However, I have one request that, if fulfilled, will make so many people including myself even happier with last.fm…
The thing is, I happen to like Japanese rock music, and listen to artists like L’Arc~en~Ciel and GACKT.
The problem is, when the solo acts of members of L’Arc~en~Ciel, like HYDE and Ken, come up, they ARE artists called HYDE and Ken, but not the HYDE and Ken that are related to L’Arc~en~Ciel (I’ve never had the correct HYDE or Ken be played).
The same problem seems to pop up pretty frequently, especially with Japanese artists like D, Flow, Nightmare, Kagrra,, to list a few.
I’m sorry if this isn’t the right place to discuss this issue, but I could not find a contact e-mail that looked sufficient:(
It would be great if somebody at last.fm could look into it.
Thanks!
Comments are closed for this entry.