monster mashup 2014

finding the devil with open refine

The devil has been a popular theme of modern music since Robert Johnson sold his soul to him in the 1930s. In the late 1980s and early 1990s, a moral panic concerning Satanism and alleged Satanism-related crimes swept the United States. In this poorly-methodologized example purely for entertainment purposes which in no way bears resemblance to actual rigorous research, I will get a data set into shape using Open Refine to see whether an increase in Satanic Panic correlates with an increase in music about the devil.

get the stuff

the tools

Using OpenRefine is a good way to find commonalities in a large data set by using facets and filters to act upon different parts of it. This is much faster than loading it into Excel or another software program, and you can load files directly from the web. Get OpenRefine.

the data

I have to find a data set that will be suitable for analysis. The Million Song Dataset contains a million songs from 1930 to 2011. You can read more about the methodology of gathering these songs and their metadata at the data set's FAQ page.

the data?

One key bit of information I found when looking through the MSD documentation: they only have year metadata for about half the tracks, and they make this available in an additional data file that’s just a text file. This is bad because it’s fewer data points to work with for my analysis, but good because I don’t have to learn a new data format and processing method to be able to use the data.

what now?

The rest of this tutorial will walk you through important steps to process this huge data set and get it ready for analysis:

  • ingest the data set
  • separate the columns
  • merge band names
  • deduplicate records
  • filter on keywords
  • export the selected data

discern the transmundane (or: transform the data)

ingest the data

We can load the dataset into Open Refine straight from the URL we found, just click on "Web Addresses (URLs), and paste it into the box!

Curiously, from this screen, Refine only wants to let you have one character as a separator. This dataset uses a strange separator, so let's just leave everything in one column for now by clicking "custom" but leaving the box blank.

Make sure to check the options for ingest thoroughly -- for this data set, we don't want to split columns that contain quotation marks(that will break everything!) and we have no headers.

separate the columns

Now everything is ingested, but it's one big string of text in just one column. We want to isolate year, identifier, band name, and song title. This is very easy with Open Refine.

Click the little arrow pointing down next to the column name (that's how you do operations on a single column). From there, choose Edit column --> split into several columns. HERE we have the option to put in more than one character to split the columns, so go ahead and put in SEP in brackets.

merge band names

Open Refine has a useful clustering tool that will look for items that are similar to each other and standardize their names. In this dataset, the same band is represented in a number of ways, which could impact our results. To merge artist names, select the down arrow on the column that contains band names, and select Edit cells --> Cluster and edit.

See how many different representations of the same thing there are? I just chose "select all" because I'm pretty confident the clustering will be accurate enough for my purposes, but you can also select only a few items if you want to be more discerning. Click "Merge Selected & Recluster" to finish the operation.

transform to lowercase

Since my analysis of this data doesn't care whether the devil represented is the Capital-D-Devil or not, it's easier to transform all the song titles to lowercase so we can ignore case sensitivity when selecting the records we want later. You can do a lot of other useful transformations as well. Click the column's down arrow as usual, then select Edit cells --> Common transforms --> To lowercase.

deduplicate records

This part is tricky. The “duplicates” facet that comes built in with Open Refine will let you filter out duplicates, but won’t let you leave one instance of a record in. I definitely want one version of a track in my data set, but not two or three. To remove subsequent instances of duplicates, do this:

  1. Use the trusty down arrow on the column that has duplicates, and select "sort." Sorting by text is fine for this.
  2. Now there appears a NEW "sort" option at the top of the main window.
  3. Use the new sort option at the top and choose "reorder rows permanently." This may take a while. Go get another coffee.
  4. When the rows are reordered, go back to using the down arrow on the column of interest, and choose Edit cells --> Blank down. This will turn every subsequent duplicate into a blank cell.
  5. Now we can use facets! Click the column's down arrow --> Facet --> Customized facets --> Facet by blank.
  6. Facets basically let you organize the whole dataset into "true for this facet" and "false for this facet." Select the facet that says true, which will show us everything that has a blank cell for our selected column.
  7. If you want to permanently get rid of all the blank cells (duplicates), choose the down arrow that says "All" --> Edit rows --> remove all matching rows. You can use this same command to get rid of any rows you don't want. We're just using it this time to get rid of rows that used to be duplicates so we only have one copy of a song in our data set.
  8. It may seem scary to remove data, but the undo/redo function is very useful and will let you back up as many steps as you want if you make a mistake.

filter on keywords

We can now use regular expressions to find tracks that have words that match the words we're interested in. Open Refine does automatic stemming for this function, which can be good and bad. Here's my list of words:

  • Satan (gets satanic, satanas, etc)
  • Lucifer
  • Beelzebub
  • Mephisto (gets mephistopheles)
  • Demon (gets demons, demonic, demonstration)
  • Devil (gets devils, devilish ... vaudeville)

To start finding songs of interest, click the down arrow on the column, and go to Facet --> Custom text facet. You can learn more about the Google Refine Expression Language, but we'll be using the simple function: value.contains("wordofinterest"), and just repeating it as many times as necessary. It saves all of your expressions in the "history" tab!

Since this is a facet function, it'll divide the dataset into "true" and false. The image here is showing that 585 songs returned as "true" for their titles containing "devil." If I click to see only these records that returned true, I can then go to down arrow next to "All" --> Edit rows --> Star rows. This will tell Refine that I think these rows are important, and they will stay starred unless I unstar them.

Run the regular expression for every term you want to find, and star all the rows along the way just like in the step above. When you've got everything you want, go to All --> Facet by star. The rows that return as starred are your carefully curated subset of data! Congratulations!

export the data

Exporting data is simple. Make sure you only have the rows that you're interested in selected (in this case, the starred ones), and go to "Export" at the top of the page. Download it in any format you want to work with it in.

finish up!

Now that we have just ~900 rows instead of over 500,000, Excel easily handles the data set.

findings

I made a spreadsheet and broke down the songs per year and percentage of those songs per year that included my criteria for the devil.

As you can see, the percentages over time are fairly stable (discounting 1931, which has a sample size too small to be valid), the years in which songs that have titles referencing the devil make up more than 0.50% of total songs in the dataset for that year are:

  • 1972 (.55%)
  • 2010 (.55%)
  • 1956 (.61%)
  • 1960 (.74%)
  • 1955 (1.19%)

One particularly salient finding is that the Swedish black metal band Dark Funeral, noted practitioners of LaVeyan Satanism, have the most songs in the dataset, at 9.

celebrate!

I wouldn’t lead you through all this great music without leaving you with a playlist of my favorite songs about ((satan OR lucifer) OR (demon OR demons) OR (hell OR evil)). You can play the playlist on Spotify.

I Think of Demons
  1. I Think of Demons - Roky Erickson
  2. Sympathy for the Devil - The Rolling Stones
  3. Satan Rejected My Soul - Morrissey
  4. Lord of this World - Black Sabbath
  5. Satan Is Real - The Louvin Brothers
  6. Pentagram - Cake
  7. Stand for the Fire Demon - Roky Erickson
  8. In League with Satan - Venom
  9. The Devil Went Down to Georgia - The Charlie Daniels Band
  10. Hell Awaits - Slayer
  11. The Mephisto Waltz - Franz Liszt
  12. Me and the Devil Blues - Robert Johnson
  13. Pact with Lucifer - Coven
  14. Evil - Mercyful Fate
  15. Unholy - KISS
  16. The Number of the Beast - Iron Maiden
  17. Red Right Hand - Nick Cave & The Bad Seeds
  18. My Demon Brother - Blood Ceremony
  19. Shout at the Devil - Motley Crue
  20. Long Way Back From Hell - Danzig
  21. Am I Evil? - Diamond Head
  22. Devil's Bride - Electric Wizard
  23. O Father O Satan O Sun! - Behemoth
  24. To Hell with the Devil - Stryper
  25. See You In Hell - Grim Reaper