Lyric clouds, genre maps and distinctive words

Wednesday, 22 June 2011
by andrew
filed under Trends and Data
Comments: 20

One of the interesting things that sets even superficially similar genres of music apart is their lyrical content. Tags can overlap to a great degree, but we were interested to see what the words can tell you about the subtler shades of meaning that go along with those tags. As usual around here, the best way to answer questions like these is by asking the data.

So I downloaded the musiXmatch dataset, a collection of lyric tables for nearly 240,000 songs from all around the world (and the musical universe). They are tables in the sense that they don’t contain the intact lyrics of each song, but rather a list of words present in each song, along with the number of times that word occurs. No use for karaoke, but perfect for investigating the overall properties of a genre. I then matched up the songs in the dataset with tracks in our own catalogue, and correlated this with tag data, in order to count the number of times a given word appeared in each of several prominent genres.
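As a rough illustration of that counting step, here is a minimal Python sketch. It is not the code used for this post: it assumes each song arrives as a dictionary of word counts keyed by a track ID, and that tags come as a separate mapping from track ID to genre names. The names `lyric_tables`, `track_tags` and `genre_word_counts` are made up for the example, and the real musiXmatch file format is not reproduced.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: track ID -> {word: occurrences in that song}.
lyric_tables = {
    "track1": {"love": 4, "time": 2},
    "track2": {"love": 1, "road": 3},
    "track3": {"time": 5, "night": 2},
}

# Hypothetical tag data: track ID -> set of genre tags.
track_tags = {
    "track1": {"soul"},
    "track2": {"country", "folk"},
    "track3": {"soul", "folk"},
}


def genre_word_counts(lyric_tables, track_tags):
    """Sum per-song word counts into one Counter per genre."""
    totals = defaultdict(Counter)
    for track_id, word_counts in lyric_tables.items():
        # Tracks without any matching tag are simply skipped.
        for genre in track_tags.get(track_id, ()):
            totals[genre].update(word_counts)
    return dict(totals)


counts = genre_word_counts(lyric_tables, track_tags)
print(counts["soul"].most_common(2))  # [('time', 7), ('love', 4)]
```

A track with several tags contributes its words to every one of those genres, which is one plausible reading of "correlated this with tag data" rather than a confirmed detail.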

Lyric clouds

Of course, lists of words and frequencies are a little dry, but thankfully IBM have released a Word-Cloud Generator which can take a weighted list of words and display it graphically, as seen on the Wordle website. The more often a word appears, the bigger it will be rendered. Here’s what it came up with for the genres I tried — the software did the layout, but you can blame me for the font selection.

[Images: lyric clouds for each genre, with words sized by frequency.]

Warning: they contain lyrics you may find offensive. Not safe for work.

I did a bit of pre-processing to remove common ‘stopwords’ that don’t really hold any information about the topics of the lyrics (and, for, I, you, the, plus many more), but this only took into account English words — and if you look closely, you’ll see a few common words from German, French and Spanish (and probably others) that are from foreign-language songs in the dataset. But what’s most striking for me about these is not how much they differ, but in fact how often some of the words appear prominently across genres. Almost everyone sings about love, for example, with the exception of Rap and Hip-Hop, and time comes up… time and time again.
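The stopword filtering described above can be sketched in a few lines of Python. This is an assumption-laden toy, not the actual pre-processing: the stopword set holds only the five examples named in the text, and `cloud_weights` is a hypothetical helper that returns the weighted word list a cloud generator would need.

```python
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer
# and would need entries for every language in the dataset.
STOPWORDS = {"and", "for", "i", "you", "the"}


def cloud_weights(word_counts, stopwords=STOPWORDS, top_n=3):
    """Drop stopwords, then return the top_n (word, count) pairs."""
    kept = Counter({
        word: count
        for word, count in word_counts.items()
        if word.lower() not in stopwords
    })
    return kept.most_common(top_n)


# Made-up counts for one genre, including the German stopword "und".
rock = Counter({"the": 120, "love": 40, "you": 95, "time": 30, "und": 12, "night": 8})
print(cloud_weights(rock))  # [('love', 40), ('time', 30), ('und', 12)]
```

Because the list is English-only, "und" survives the filter and lands in the cloud, which is the effect visible in the images.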

Genre maps

A limitation of word clouds is that while they’re great for showing the comparative popularity of words within a genre, they’re not so good for looking at the overall similarities or differences of several genres at once. To do that, you need some measure of similarity which can be rendered graphically as a kind of ‘genre neighbourhood map’. So I measured the similarities between the word lists for each genre, ranked by popularity, using a method which was developed to compare the result rankings from different search engines. This gives a single value for how similar the lyric choices are between each pair of genres, where differences towards the top of the lists (the most popular words) are considered more important than differences further down. A bit of extra number crunching in R can convert these similarity scores into a 2D map, which I imported into OpenOffice to render:
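The post does not name the ranking measure. One measure from the search-engine literature that matches the description is rank-biased overlap, so the Python sketch below assumes something of that kind. It uses a simplified, normalised variant: at each depth it takes the fraction of words shared by the two top-d lists, then averages those fractions with geometrically decaying weights. The function name `rank_similarity` and the persistence parameter `p=0.9` are illustrative choices, not values from the original analysis.

```python
def rank_similarity(ranking_a, ranking_b, p=0.9):
    """Top-weighted similarity between two rankings of unique items.

    Returns the weighted average, over depths d = 1..k, of the fraction of
    items shared by the two top-d prefixes, with weight p ** (d - 1).
    """
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    overlap = 0  # size of the intersection of the two prefixes so far
    weighted_sum = 0.0
    weight_total = 0.0
    for d in range(1, depth + 1):
        a, b = ranking_a[d - 1], ranking_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if the other list has already shown it.
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        weight = p ** (d - 1)
        weighted_sum += weight * (overlap / d)
        weight_total += weight
    return weighted_sum / weight_total


base = ["love", "time", "heart", "night"]
swap_top = ["time", "love", "heart", "night"]
swap_bottom = ["love", "time", "night", "heart"]

print(rank_similarity(base, swap_top))     # disagreement at the top
print(rank_similarity(base, swap_bottom))  # disagreement at the bottom
```

The second score comes out higher than the first, since a swap near the bottom of the list costs less than a swap at the top. To get from pairwise scores to a map, one option would be to treat 1 minus the similarity as a distance and feed the matrix to multidimensional scaling (for instance `cmdscale` in R); whether that is the "extra number crunching" used here is an assumption.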

[Image: 2D map of the genres, positioned by lyric similarity.]

This map is really interesting for its combination of expected and unexpected neighbours, and also for the way it clearly shows Rap and Hip-Hop as outliers from the main axis on the left. Goth and Metal, which may appear similar to the untrained ear (and eye!), are considerably separated, while Metal and Folk are, surprisingly, much closer. Electronic (a very broad tag) is clustered together with Soul and Blues, presumably because of the soulful origins of house music, which is one of the more lyrical electronic sub-genres. And Rap and Hip-Hop, which might be considered synonymous by the layman, are about as different as Indie and Country in terms of lyric ranking.

Distinctive words

The word clouds as shown draw the viewer’s attention to the very frequent words, but these also tend to be the ones like love and time which are popular across genres. This is a problem if you want to find out which words are most distinctive or characteristic of a given genre: the words which, if used as search terms for example, would be best at selecting songs from that genre correctly (true positives), while minimizing the number of songs retrieved from other genres (false positives). Once again, information retrieval (the science behind search engines) can help us. The F measure, or F score, is specifically designed to balance precision (the share of retrieved results that are relevant) against recall (the share of relevant items that are retrieved). It’s a score between 0 and 1, where 0 means “no relevant documents retrieved”, and 1 means “all relevant documents retrieved” and “no additional spurious documents retrieved”.

So I calculated the F score that each word would have as a search term for each genre in some notional lyric-based search engine: “how relevant would the results be if I searched for Indie tracks with the search term friend?”, for example. This doesn’t take into account the number of times each word occurs within a song, just the fact that it occurs at all, but it does let us redraw the lyric clouds with each word’s size determined by its F score for that genre. As you can see, this brings out the words that are characteristic of each genre, rather than emphasizing those that are globally popular:
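To make the calculation concrete, here is a minimal Python sketch of a per-word, per-genre F score. It assumes the balanced form (F1, the harmonic mean of precision and recall), which the post does not actually specify, and it uses a made-up four-song dataset. The function name `f_score` and the data layout are illustrative only.

```python
def f_score(word, genre, songs):
    """Balanced F score (F1) of `word` as a search term for `genre`.

    `songs` is a list of (set_of_words, set_of_genres) pairs, so only the
    presence of a word in a song matters, not how often it occurs.
    """
    # Genre sets of every song the "search" for `word` would retrieve.
    retrieved = [genres for words, genres in songs if word in words]
    relevant_total = sum(1 for _, genres in songs if genre in genres)
    true_positives = sum(1 for genres in retrieved if genre in genres)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / relevant_total
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy catalogue: (words present in the song, genre tags).
songs = [
    ({"love", "friend"}, {"indie"}),
    ({"friend", "road"}, {"indie"}),
    ({"love", "truck"}, {"country"}),
    ({"friend", "truck"}, {"country"}),
]

print(round(f_score("friend", "indie", songs), 3))   # 0.8
print(round(f_score("truck", "country", songs), 3))  # 1.0
```

In the toy data, "friend" finds both Indie songs but also drags in one Country song, so its precision is 2/3, its recall is 1 and its F score is 0.8. "truck" appears in every Country song and nowhere else, so it scores a perfect 1.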

[Images: lyric clouds for each genre, with words sized by F score.]

Warning: they contain lyrics you may find offensive. Not safe for work.

I think they bring out the unique character of each genre much more effectively, and the variation in size between the words is much smaller, so the less prominent words are easier to see. There are some interesting quirks visible too. For example, many German words are much more clearly visible in the Goth cloud than they were before, reflecting both the comparatively large number of songs in German in that genre, and the lack of German lyrics in most other genres. Country, for example, is entirely English.

Finally, a little extra present from the data. The word with the highest F score in the whole dataset is Christmas, with an F score of 0.3892 for the tag… Christmas. So, unseasonal greetings from the data crunchers here at Last.HQ!

Thanks to musiXmatch for making the lyric database available, and Thierry Bertin-Mahieux for helping me to reconstruct the full words from the stems in the database.


  1. d34
    22 June, 11:03


  2. Ru
    22 June, 12:13

    ‘Cause’ really is kind of a stopword.. it doesn’t mean anything!

  3. E.
    22 June, 12:44

    I have my doubts about removing the words ‘you’ and ‘I’ from the lyrics; I think they give a lot of information about the topic of a song. Oh, and the distinctive words for hip-hop are hilarious.

  4. Adrian Rosebrock
    22 June, 12:58

    The problem with this statistical analysis is that the dataset is too much of a “toy dataset” and not a “realistic dataset”. Assume for a minute that we can separate all these classifiers with a polynomial kernel (or some other machine learning technique) and would like to use the classifier to classify the genre based entirely on lyrics. We now have to consider what happens if a blues band covers a song by a rock band, or a hip-hop band covers a rock song. There will be minor deviations in the lyrics, but the overall content will be the same, which will drastically disrupt the classifier and likely cause the centroids to be inseparable.

    Instead of actually throwing out the stopwords, I would be interested in actually using them by themselves. The distribution of stopwords can be linked to authorship attribution/disambiguation and there might be some sort of correlation for stopwords when used within a genre. However, that system would break down as well, for example, if Elton John were to write a rock song and then, for whatever reason, write a goth song — the stopword distributions would likely be similar.

  5. max ciociola
    22 June, 13:18

    I’m super happy to see any of the tests you guys wanna do around musiXmatch lyrics data.
    Just ping me max at

  6. Dan Ellis
    22 June, 14:00

    Adrian Rosebrock says:

    The problem with this statistical analysis is that the dataset is too much of a “toy dataset” and not a “realistic dataset”.

    To be fair, 240,000 tracks with counts for the top 5,000 words doesn’t strike me as a “toy”. That must cover more than 95% of the “real” music “real” people actually listen to, at least within the Million Song Dataset that the musiXmatch dataset is defined over.

  7. Andrew Clegg
    22 June, 16:23

    Ru: yeah, there are quite a few things in the finished clouds which are really stopwords — all of the foreign-language words are as well.

    Adrian, E.: Averaged out over a whole genre, the stopwords wouldn’t really tell us much. I’m not really interested in individual writing styles of singers/songwriters, more about the themes that bubble to the surface when you look at genres as a whole.

  8. Andrew Clegg
    22 June, 16:25

    …sorry, I should have said “most of the foreign language words which are big enough to see” in the last comment.

  9. Ross Collins
    22 June, 16:50

    Would be interesting to see the results of more genres, like Reggae, New Age, Swing, Easy Listening, Disco…

  10. Nathan Chase
    22 June, 18:25

    Some really amazing stuff here, Andrew. Keep it coming! The more cool information you can discern from the data, the better.

  11. closedmouth
    22 June, 18:35


  12. E-Clect-Eddy
    22 June, 21:34

    Anglophile & time-framed! I can’t believe that these are the words 95% of real people listen to. Spanish is the most commonly spoken language on this planet, and only in the last 10-20 years has English-sung music overtaken the local language; that is my belief, at least for Latin America. I would even go so far as to say that most music in most countries until the 70s was in their own local language. If this mapping had been done in 1965 it would look much different, for non-English words at least. I do like this exercise; most songs are about “you” & “I” & “love” & “baby” & “feel” in any genre except hip-hop/rap :-)

    No German words in Country; well, maybe in other languages they have a local name for that style, say Ranchera (Mexico) or Sertaneja (Brazil), or it is plainly called traditional in their language. If those songs were to be translated and thrown into the mix they might be neighbours in your graph.

    Christmas a high scorer; well, Bing made that song famous 70 years ago and since then it has gained popularity all over the planet, beginning in a time when Catholicism / Christianity was the no. 1 religion on the planet, also due to Latin America. So it would tempt many to also write something about this still popular and most recognizable holiday. Since WW II English has been growing as an alternative language, so I guess the word Christmas will still be scoring high in years to come.

  13. HodgeStar
    23 June, 10:21

    @closedmouth: comment ça va ?!

  14. Jeff Kingfisher
    24 June, 15:31

    This is an interesting study to be sure in that it highlights cultural differences. It’s unfortunate, though, that it’s dependent on “genre” when so many of us – songwriters and listeners alike – experience a real disconnect between “genre” and the music we create and/or love. It would be interesting to explore clouds based on topics like “death” or “dreams” or “hope” … I wonder if that’s possible. I’ve never been to musiXmatch and have no idea how it works, but I’m gonna go see.

  15. Jessica
    24 June, 19:35

    I’m wondering what the artists or songs “most typical” of each genre would be, i.e., ones that use the most of the words common to that genre. Related to this, which artists or songs are least typical of their genre, or which show inconsistency in genre?

  16. Andrew Clegg
    25 June, 11:17

    Jessica — I’d actually thought about that myself. It’s a lot of number-crunching to do, but I might give it a go for a future post.

  17. TuiDragon
    25 June, 18:15

    Actually, “cause” is not necessarily a stopword.

    Depends upon the usage. “Cause” is typically used to signify some type of “social justice” movement, which I wouldn’t be surprised to see highly ranked in rap music.

    Of course, “cause” being used as a shortened version of “because” creates issues.

  18. Senny
    25 June, 18:56

    Interesting, but this really should have been separated into different languages. The presence of foreign words (especially ca, ein, and la) skews the distribution.

  19. Simba
    30 June, 18:36

    Very interesting, thanks. I would appreciate language-specific versions, too. I did something similar for tango a while back, basically all in Spanish.

  20. plx
    4 July, 18:00

    leave. stay.


Comments are closed for this entry.