Monster Mashup 2015: Horror Business

This presentation will guide you through the basics of taking a data set, cleaning it up, and adding new variables using OpenRefine.

First, get OpenRefine!

If you don't have OpenRefine, you'll have to download it. It's a desktop application available for Mac, Linux, and Windows, and you can get it here: openrefine.org.

Get your data.

In my case, I'm using data supplied by Yale University Library about their VHS collection of horror and exploitation films. This data takes the form of a spreadsheet of titles and some other metadata about the objects.

To read more about Yale's VHS collection and its significance, check out Saving the Scream Queens: Why Yale University Library decided to preserve nearly 3,000 horror & exploitation movies on VHS in The Atlantic.

Create a Project in OpenRefine

Import the first file, the original data.

Parse the files the way you want them: which worksheets do you want to import? Are column headers actual headers?

Clean the data

This data set needs to be cleaned up a little to get the most out of the following steps.

You can hide rows you don't want to concentrate on without deleting them entirely.

Split the TITLE column: some titles have additional information in parentheses that we don't want.

The way to split a column is: TITLE column down arrow --> Edit column --> Split into several columns.

In the dialog box that follows, note that you can choose the separator (in our case, "("), how many columns at most, and whether to remove the original column.

This is what it will look like when it's finished. We can delete unwanted columns after this.
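If it helps to see the split outside the dialog box, here is a minimal Python sketch of the same idea: cut each title at the first "(" and cap the number of resulting columns. The function name split_title and the sample titles are made up for illustration; they are not rows from the Yale spreadsheet, and trimming the leftover whitespace is an extra convenience in this sketch rather than something the dialog is known to do.

```python
def split_title(title, separator="(", max_columns=2):
    """Split a title on the separator, keeping at most max_columns pieces."""
    parts = title.split(separator, max_columns - 1)
    # Trim the space left behind in front of the separator (extra step in this sketch).
    return [part.strip() for part in parts]


# Hypothetical titles for illustration only.
titles = ["Night of the Demon (VHS, 1983)", "Basket Case"]
split_rows = [split_title(title) for title in titles]
print(split_rows)
```

The first piece is the clean title; anything after the "(" lands in a second column that can be deleted later.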

Now is also a good time to check for duplicates by choosing TITLE 1 down arrow --> Facet --> Customized facets --> Duplicates facet. In this case, we don't have any accidental duplicates -- the collection just holds two copies of a few items. That's fine.
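The duplicates check boils down to counting how often each value appears in the column and flagging anything that shows up more than once. A rough Python sketch of that idea, using exact matching and made-up titles (the name find_duplicates is hypothetical):

```python
from collections import Counter


def find_duplicates(values):
    """Return each value that appears more than once, with its count."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}


# Hypothetical titles for illustration only.
titles = ["Basket Case", "Suspiria", "Basket Case", "Maniac"]
duplicates = find_duplicates(titles)
print(duplicates)
```

Whether a repeated title is a mistake or a genuine second copy is still a judgment call you make by looking at the rows.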

Reconcile

Reconciliation Resources

This example is horror film metadata, but you can reconcile against several data sources that may be more relevant to your projects: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources

Reconcile with Freebase

Reconciling the data against Freebase will allow us to pull in additional information later. To do this, use the new Freebase reconciliation service described by the OpenRefine wiki.

To reconcile my titles, I'm going to choose TITLE 1 --> Reconcile --> Start reconciling...

In these options, I'm going to choose "Add Standard Service..." and use the URL http://reconcile.freebaseapps.com/reconcile as advised by the OpenRefine wiki.

You can see it lets me choose a type of entry to reconcile to. We're only dealing with film titles, so that's easy.

Start the reconciliation. This will take a long time. Go get a coffee.
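For the curious, here is a sketch of roughly what gets sent to a reconciliation service while you wait: a batch of queries, one per cell, each carrying the cell text and the type chosen above. This only builds the JSON and does not contact any service. The helper name build_reconcile_queries, the limit of 3, the sample titles, and the type identifier /film/film are assumptions for illustration, so check the service's own documentation before relying on the exact shape.

```python
import json


def build_reconcile_queries(titles, entity_type="/film/film", limit=3):
    """Build one reconciliation query per title, keyed q0, q1, ..."""
    queries = {
        f"q{index}": {"query": title, "type": entity_type, "limit": limit}
        for index, title in enumerate(titles)
    }
    return json.dumps(queries)


# Hypothetical titles for illustration only; nothing is sent over the network.
payload = build_reconcile_queries(["Basket Case", "Suspiria"])
print(payload)
```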

Matching the entries

See how it estimates how closely each cell matches what exists on Freebase? The candidates mostly look correct and this is just a demo, so we'll auto-match everything. TITLE 1 --> Reconcile --> Actions --> Match each cell to its best candidate.
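"Match each cell to its best candidate" can be pictured as picking the top-scoring suggestion for every cell. The sketch below assumes each candidate is a small record with a name and a numeric score; the identifiers, scores, and the function name match_best_candidate are invented for illustration and are not real Freebase data.

```python
def match_best_candidate(candidates):
    """Return the highest-scoring candidate, or None when there are no candidates."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate["score"])


# Hypothetical candidates for one cell; ids and scores are made up.
candidates = [
    {"id": "hypothetical-1", "name": "Basket Case", "score": 98.5},
    {"id": "hypothetical-2", "name": "Basket Case 2", "score": 71.0},
]
best = match_best_candidate(candidates)
print(best["name"])
```

For real work you would probably review the low-scoring matches by hand instead of accepting every top candidate.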

Add information from Freebase

Let's add release dates and directors for the TITLE column that has been reconciled with Freebase. Since Freebase now recognizes our titles as a piece of metadata that exists in their system, we can pull from it the directors and release dates for those films that it recognizes.

Add directors

TITLE 1 --> Edit column --> Add columns from Freebase... and then add a property -- just searching for "director" will pop up a list of suggestions, and we can choose the correct one: "directed by" in /film/film/directed_by.

Add release dates

TITLE 1 --> Edit column --> Add columns from Freebase... and then add a property -- searching for dates will get you to the one we want -- "initial release date."

This adds the full date. If you want to split the column so you have a column of just release years, do it the same as before!
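As a sketch of the year-only column: assuming the dates come back in a year-first form such as 1982-04-07 (an assumption here, so check your own column), splitting on "-" and keeping the first piece gives the year. The function name release_year and the sample dates are hypothetical.

```python
def release_year(date_text):
    """Return the year from a year-first date string, or None if it is missing."""
    if not date_text:
        return None
    year = date_text.split("-", 1)[0].strip()
    # Leave anything that does not look like a year blank instead of guessing.
    return int(year) if year.isdigit() else None


# Hypothetical dates for illustration only.
dates = ["1982-04-07", "1977", ""]
years = [release_year(date) for date in dates]
print(years)
```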

Export!

Export as a project, as a spreadsheet, as anything you like for further analysis.

Congratulations, and Happy Halloween!