Artist Artist

Wednesday, 13 October 2010
by cms
filed under Found On Last.fm and Code
Comments: 32

Hello people. I’m cms, and my job here at Last.fm is looking after the databases. Much of the time I’m involved with operational running of database servers, designing and optimising SQL queries, and scaling work on our relational database clusters. Every now and then though, I do get an opportunity to poke around in the Last.fm dataset and explore some of the interesting relations.

I recently re-discovered the seminal album ‘Spirit Of Eden’ by ‘Talk Talk’ (haven’t tried it? You really should, it’s magical), and I’d been giving it quite heavy rotation. This prompted a comment on my profile by one of our lovely users, who suggested making a playlist from artists whose names consisted of repeating word patterns. This idea appealed to me, but off the top of my head I could only come up with a paltry half-dozen candidates. Surely there were many, many more. If only there was some kind of database nearby I could query…

We keep our main catalogue data in a PostgreSQL database. PostgreSQL has a nice set of extended string operators, including quite comprehensive regular expressions support, which would be useful for an ad-hoc query like this.

Here’s what I came up with initially off the top of my head

select name from artist where name ~* E'^(\\w+\\M)\\s+\\y\\1$' ;

Using the case insensitive regular expression match operator ~* and matching against a string that begins with a sequence of word characters leading up to a word boundary, which I’m capturing as a group, then a sequence of whitespace, then the start of a word boundary followed by the original captured match.

This query worked really well at defining the pattern for repeating names. I was matching well over 10,000 distinct strings. The problem was that we store all the submitted data for artists, and this includes data from a broad range of unverifiable sources. I was getting lots of great artist names in my set, but many of them were bogus; typos, mis-taggings, spelling corrections, and that was just the obvious mistakes.

I needed to come up with a way of filtering the set further. My first iteration was to use track information. Incorrect artist attributions seemed unlikely to have relations over tracks in the catalogue, and I could extend my query relatively easily to take account of prolificness like so.

select count(1), a.name from artist a, track t where a.name ~* E'^(\\w+\\M)\\s+\\y\\1$' and t.artist = a.id group by 2 order by 1 desc;

This got me a shorter set of artists (8000 odd), with some ordering. I could see that recognisable artist names (hello Duran Duran !) were sorting towards the top. However, ordering by catalogue volume still wasn’t quite right. Ideally I needed some kind of popularity weighting. Unfortunately we don’t store any scrobble data in the PostgreSQL catalogue schemas.

However we do store scrobbles, alongside exported catalogue information in our Hadoop cluster. Although I have been known to write Java code in the past, I’m mildly allergic to it. Luckily for me we have a Hive interface to Hadoop. Hive offers an interactive query language over Hadoop that is closely modelled on SQL. The only stumbling block remaining was porting my regular expression over to use Java syntax.

Here’s what I ended up with as a hive query:

select meta_artist.name, overallplayreach_artist.reach from meta_artist join overallplayreach_artist on meta_artist.id = overallplayreach_artist.id where meta_artist.name RLIKE '^(.+?\\b)\\s+\\b\\1$' and meta_artist.correctid IS NULL and overallplayreach_artist.reach > 50 order by overallplayreach_artist.reach desc ;

Joining against some “playreach” data to give a weighting according to rough popularity. My original SQL query took 17 minutes to run, on a fairly beefy database server. The hive query took less than 100 seconds to return, running across the entire Hadoop cluster. Awesome.

Without any further ado, here’s the top 10 results, roughly ordered by artist popularity.

Artists with repeating name patterns
Duran Duran
Frou Frou
Gus Gus
Talk Talk
Xiu Xiu
The The
Man Man
Cash Cash
Danger Danger
Gudda Gudda

I’ve created a tag artistartist, and tagged some of the entries already.

The full list is available here. There might well still be some rough data in there, I haven’t particularly sanity checked it by eye.

If you too would like the chance to play with Last.fm’s vast amounts of data and join our team, check out our job openings.

Comments

  1. Chris Crum
    12 October, 17:46

    I realize it’s different, but it seems like Mr. Mister belongs on the list. Talk Talk rules, by the way.

    Chris Crum – 12 October, 17:46
  2. Tecfan
    13 October, 09:49

    Really cool experiment :)

    How about names like Cats and Cats and Cats

    Tecfan – 13 October, 09:49
  3. xerode
    13 October, 09:53

    Duran Duran but no Duran Duran Duran? Nooo
    ;_;

    Also would !!! count?

    xerode – 13 October, 09:53
  4. James Grant
    13 October, 10:42

    Looking at Colin’s list got me thinking “What about artists who names are palindromes?”. Since I happened to be testing a new version of Hive I thought I might as well find out. The top 20 (in order of the number of listeners):

    1. ABBA
    2. múm
    3. B.o.B
    4. Eve
    5. MUM
    6. 2002
    7. PMMP
    8. Ada
    9. Anúna
    10. M2M
    11. D-A-D
    12. Asa
    13. AYA
    14. OOIOO
    15. Otto
    16. BOB
    17. EverEve
    18. TNT
    19. W&W
    20. Ana

    James Grant – 13 October, 10:42
  5. Lawrie
    13 October, 11:02

    See also –
    Attack! Attack! (US)
    Attack Attack (UK)

    And my personal favorites – Giraffes? Giraffes!

    Lawrie – 13 October, 11:02
  6. kulpdogg
    13 October, 14:02

    Everything Everything

    they’re AMAZING, PLEASE look them up!

    kulpdogg – 13 October, 14:02
  7. Jonathan Norman
    13 October, 14:04

    Wakey

    Jonathan Norman – 13 October, 14:04
  8. James
    13 October, 14:05

    This is actually useful for my studies as I’m doing SQL server and database management lol.

    James – 13 October, 14:05
  9. Jonathan Norman
    13 October, 14:05

    Wakey! Wakey!

    Jonathan Norman – 13 October, 14:05
  10. Gerrit Fries
    13 October, 14:18

    gobble gobble

    Gerrit Fries – 13 October, 14:18
  11. Me
    13 October, 14:25

    How about The The?

    Me – 13 October, 14:25
  12. Al_
    13 October, 14:29

    What about artist names like
    Hey Hey My My
    Pony Pony Run Run

    ? :)

    Al_ – 13 October, 14:29
  13. Lion_ditchie
    13 October, 14:31

    Nice Nice

    Lion_ditchie – 13 October, 14:31
  14. passianotto
    13 October, 15:06

    The Mighty Mighty Bosstones may have been excluded from your query because of the last word, but I love them.
    =)

    passianotto – 13 October, 15:06
  15. eric casteleijn
    13 October, 16:00

    Why ‘artistartist’ as the tag though, and not ‘artist artist’, since your regular expression enforces whitespace between the two identical parts, thus excluding artists like Wakey!Wakey! or Duvelduvel. ;)

    eric casteleijn – 13 October, 16:00
  16. Steven Gravell
    13 October, 16:26

    Spirit of Eden is one of the best

    Steven Gravell – 13 October, 16:26
  17. Anonymous
    13 October, 17:02

    obligatory:
    http://xkcd.com/208/

    Anonymous – 13 October, 17:02
  18. Kale Booth
    13 October, 17:11

    @James I’m glad you thought of the palindrome idea, too. it’s the first thing that popped into my head. Thanks!

    Kale Booth – 13 October, 17:11
  19. Eliseo Soto
    13 October, 18:19

    Nice! José José and Polo Polo made it to the list. They’re really popular in México.

    Eliseo Soto – 13 October, 18:19
  20. Mike Hates Music
    13 October, 21:40

    wet wet wet

    Mike Hates Music – 13 October, 21:40
  21. cms
    14 October, 10:48

    Thanks for all the comments and suggestions. All the people suggesting artist names that aren’t covered; this is just down to the precise regular expression used, and this is somewhat arbitrary on my part. In this sort of exercise you must define some rules for the pattern match, and I just picked them quite whimsically. In fact the earlier versions of the regular expression ignored punctuation, I must have removed this while I was refining the sets.

    @eric casteleijn I just liked the way ‘artistartist’ looked I guess.

    cms – 14 October, 10:48
  22. Clyde Machine
    15 October, 01:23

    Danger Danger! Rock ‘n roll!

    Interesting read, by the way.

    Clyde Machine – 15 October, 01:23
  23. Clyde Machine
    15 October, 01:40

    “obligatory:
    http://xkcd.com/208/” -Anonymous

    That’s what I thought of when I saw “regular expressions”!

    Clyde Machine – 15 October, 01:40
  24. nandosweettokin
    15 October, 13:09

    hot hot heat
    Yeah yeah yeahs
    chk chk chk (aka !!!)
    cfcf

    nandosweettokin – 15 October, 13:09
  25. hh
    15 October, 15:27

    So why did you use the postgres E’‘ syntax (for a regexp, of all things!) when you could’ve used the regular ‘’ syntax and halved the backslashes?

    hh – 15 October, 15:27
  26. cms
    15 October, 15:57

    @hh force of habit I suppose. Many of our postgreSQL servers run with standards_conforming_strings = off for legacy reasons, and I’m used to expressing strings in that fashion.

    You’re quite right that it looks much noisier, and I could have cleaned it up before publication, I’ll bear that in mind for the future.

    cms – 15 October, 15:57
  27. Chrise
    15 October, 16:34

    Xiu Xiu. China is taking over the world anyway so you might as well be the first in line to acknowledge this!

    Chrise – 15 October, 16:34
  28. RompeRatones
    17 October, 15:42

    How about Controller.Controller

    RompeRatones – 17 October, 15:42
  29. JustSomeOldJoe
    17 October, 20:33

    The problem was that we store all the submitted data for artists, and this includes data from a broad range of unverifiable sources.
    . . .
    Unfortunately we don’t store any scrobble data in the PostgreSQL catalogue schemas.

    So, all individual scrobbles are maintained on the Hadoop cluster, but unique scrobbled artist/album/track metadata is propogated from Hadoop over to the DB (via trigger/nightly batch/divine intervention) directly into the base catalog? Is it Hadoop Scrobbled -> Hadoop Catalog -> DB Catalog?

    The name pattern matching is interesting, but a description of the specific methods/architecture used to wrangle such a large amount of volume (flow direction, replication points, etc.) would be very, very cool!

    The hive query took less than 100 seconds to return, running across the entire Hadoop cluster.

    Quite a diff from 17 minutes. Not surprising considering the pattern match op up against a DB and the size of the catalog (especially if it includes scrobbled metadata).

    Curious whether or not your search software supports pattern matching, and, if so, what the results would look like.

    Thanks for sharing.

    JustSomeOldJoe – 17 October, 20:33
  30. spudart
    27 October, 22:29

    Will there be a blog post about the Station Changes?

    spudart – 27 October, 22:29
  31. Kath
    28 October, 17:50

    Atl Atl didn’t make the cut, either due to improper capitalization in the Last.fm database, or a cutoff on the # of listeners.

    Kath – 28 October, 17:50
  32. Lisa
    4 November, 00:45

    Interesting experiment!
    Thanks for all the hard work put into keeping last.fm on the air…absolutely love itXD

    However, I have one request that, if fulfilled, will make so many people including myself even happier with last.fm…
    The thing is, I happen to like Japanese rock music, and listen to artists like L’Arc~en~Ciel and GACKT.
    The problem is, when the solo acts of members of L’Arc~en~Ciel, like HYDE and Ken, come up, they ARE artists called HYDE and Ken, but not the HYDE and Ken that are related to L’Arc~en~Ciel (I’ve never had the correct HYDE or Ken be played).
    The same problem seems to pop up pretty frequently, especially with Japanese artists like D, Flow, Nightmare, Kagrra,, to list a few.
    I’m sorry if this isn’t the right place to discuss this issue, but I could not find a contact e-mail that looked sufficient:(
    It would be great if somebody at last.fm could look into it.

    Thanks!

    Lisa – 4 November, 00:45

Comments are closed for this entry.