For the past few months, I have been working on a cool project for my school. The goal was to create a graph allowing people to navigate through the IMDb database.
The class this project was made for is managed by Franck Ghitalla. You can read his blog, on which he talks about other projects dealing with graphs.
The class studies graphs from various domains (politics, books, science...) in order to reveal patterns that are hard to spot in classic data representations.
For that purpose, we are using an open source tool called Gephi, which allows us to create, visualize and manipulate graphs. Gephi was originally developed by former students of my school.
Gephi relies on the GEXF format, an easy-to-use XML-based format. Anyone who wants to build a graph from a database can just write a script that outputs a GEXF file and open it in Gephi. It's as simple as that.
After spending a few hours manipulating Gephi, we were asked to find a project, and to submit it at the end of the semester in order to pass the class. As I am pretty interested in movies, I decided I would be making something based on IMDb. The database is solid, and seemed easy to access.
My initial idea was to create a graph with movies and actors. Every time an actor played in a movie, I would create an edge between that actor and the movie. After some experimentation I quickly realized I would need to find another project, simply because the database would be too big to manipulate, and because it wouldn't be very interesting in the end.
I then decided to start another project: analyzing the movie plots, and seeing which themes would emerge from there. Of course, the database would still be pretty huge, so I decided to only analyze American action movies from 1950 to 2009, which is already a lot.
The final project would be a graph with keywords linked to each other when they're found within the same movie plot.
I initially thought the project would only take me a few weeks. I ended up finishing it after three months...
The final project
Before I dive into the technical details, I would like to present the actual output.
So here are the graphs I made:
You can navigate through the graph with your mouse. If you click on a node, you will get more information about it, like the movies it is related to, and the other keywords it was found with.
How I did it
Getting the data
The first step for the project was to figure out how to get the data.
My first idea was to find an API. Unfortunately IMDb does not provide one. I considered crawling the website and getting the data by analyzing the HTML pages, but it sounded like a lot of work that would not even be that reliable.
I then came across IMDbPy, an unofficial Python API that allows you to get data from IMDb's servers. After some experimentation, I decided it would be too slow to retrieve all the data I needed.
However, IMDbPy provides one cool feature: it can convert IMDb's plain text data files into a MySQL database. All I had to do was download IMDb's text files and follow the documentation. It then takes a few hours to export the whole thing to your database (or more than a week if you forget to set the right engine for your tables).
Once I had the whole IMDb in my MySQL database, I started writing scripts to get the data and export what I needed to a graph file. I soon realized that it was too slow, because the original database was made of many different tables that I had to join with each other to find the information I needed.
In order to fix that, I created a materialized view of what I needed. Basically, I just created a new table with exactly the columns I needed. That way I no longer had to join tables, and the faster data access let me iterate on my scripts a lot more.
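The idea can be sketched in plain SQL (shown here through Python's sqlite3 for illustration; the table and column names below are simplified stand-ins, not IMDb's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Simplified stand-ins for IMDb's normalized tables (real names differ).
CREATE TABLE title (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
CREATE TABLE movie_info (movie_id INTEGER, info_type TEXT, info TEXT);

INSERT INTO title VALUES (1, 'Die Hard', 1988);
INSERT INTO movie_info VALUES (1, 'plot', 'A cop fights terrorists in a tower.');

-- The "materialized view": one flat table with exactly the columns needed,
-- so later scripts never have to repeat the join.
CREATE TABLE movie_plots AS
SELECT t.name, t.year, mi.info AS plot
FROM title t
JOIN movie_info mi ON mi.movie_id = t.id
WHERE mi.info_type = 'plot';
""")

for row in cur.execute("SELECT name, year, plot FROM movie_plots"):
    print(row)
```

From then on, every script reads from the one flat table instead of re-running the join.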
Once I had the data, my algorithm to analyze movie plots was pretty simple:
- For each movie, I take every single word from the plot
- Then, for each of these words, I add them as nodes in my graph (if they don't already exist)
- Finally, I create edges between words of the same plot.
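The three steps above can be sketched like this (a Python sketch for illustration; the original was a PHP console script):

```python
from collections import Counter
from itertools import combinations

def build_graph(plots):
    """Build a word co-occurrence graph from a list of plot strings."""
    nodes = set()        # one node per distinct word
    edges = Counter()    # (word_a, word_b) -> co-occurrence weight
    for plot in plots:
        words = set(plot.lower().split())  # distinct words of this plot
        nodes.update(words)
        # connect every pair of words that appear in the same plot
        for a, b in combinations(sorted(words), 2):
            edges[(a, b)] += 1
    return nodes, edges

nodes, edges = build_graph([
    "the spy chases the assassin",
    "the assassin escapes the city",
])
print(len(nodes), edges[("assassin", "the")])
```

Re-seeing a pair in another plot just increments the edge's weight, which is what makes the edge-weight filter possible later on.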
After I have all this data, all I need to do is to export it as a GEXF file, which is basically just a simple XML language.
Here is an example of a GEXF file taken from the official website:
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <meta lastmodifieddate="2009-03-20">
    <creator>Gexf.net</creator>
    <description>A hello world! file</description>
  </meta>
  <graph mode="static" defaultedgetype="directed">
    <nodes>
      <node id="0" label="Hello" />
      <node id="1" label="Word" />
    </nodes>
    <edges>
      <edge id="0" source="0" target="1" />
    </edges>
  </graph>
</gexf>
As you can see, it's pretty simple to export a file that you can then manipulate using Gephi.
I decided to implement this algorithm as a PHP console script, but any language would have worked. I am just used to working with PHP, and it provides an easy way to connect to a MySQL database.
Although the algorithm sounds super simple, there was one huge difficulty: running the script produced a huge volume of data to handle. Having so much data brings two problems:
- First, my computer wouldn't be able to handle it, both on the export script side and on Gephi's side. Displaying 10k nodes and 50k edges in Gephi is fine, but running a layout and computing statistics on them would be way slower and make my poor laptop crash, especially with that many edges.
- Second, there would be a huge quality issue. I wanted to isolate what movies were about during a certain period, but if I took too many words, it wouldn't make any sense. I did not want words like "have", "be", or "within" on my graph.
So this is why the project took me much longer than expected: I needed to filter the data.
My first filter was to only take words with a minimum length. I considered that words that contain fewer than four letters would be useless. This helped me filter out a lot of words pretty easily.
Another filter was to exclude common words from a list I compiled: the most common English verbs, adverbs, conjunctions, adjectives...
My third filter was to only take words that appear at least a certain number of times. Indeed, if a word appears in only one movie, it is too specific. What is too specific is not relevant. It doesn't show a pattern.
My fourth filter was to exclude words that appear in too many movies. Indeed, what is too common is not relevant either.
My fifth and last filter was to exclude edges that are not heavy enough. Indeed, when an edge is not strong, it means that the two words were not related very often, and therefore I don't need that edge. If I didn't do that, I would have pretty much every single word connected to every single other word. That wouldn't help visualize patterns.
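Taken together, the five filters boil down to a handful of thresholds. A Python sketch (every cutoff value below is a made-up placeholder, not the setting actually used, and the settings varied per decade):

```python
def filter_graph(word_counts, edges, stopwords,
                 min_len=4, min_count=5, max_count=500, min_weight=3):
    """Apply the five filters:
    1. drop words shorter than min_len letters,
    2. drop common words from the stopword list,
    3. drop words appearing in fewer than min_count movies,
    4. drop words appearing in more than max_count movies,
    5. drop edges with a weight below min_weight.
    word_counts maps word -> number of movies it appears in;
    edges maps (word_a, word_b) -> co-occurrence weight."""
    keep = {
        w for w, c in word_counts.items()
        if len(w) >= min_len
        and w not in stopwords
        and min_count <= c <= max_count
    }
    kept_edges = {
        (a, b): w for (a, b), w in edges.items()
        if a in keep and b in keep and w >= min_weight
    }
    return keep, kept_edges

words, links = filter_graph(
    {"kill": 40, "the": 900, "guns": 12, "xy": 50},
    {("guns", "kill"): 7, ("kill", "the"): 20, ("guns", "xy"): 9},
    stopwords={"have", "within"},
)
print(words, links)
```

Note that the edge filter runs last: an edge only survives if both of its endpoints survived the word filters and its weight clears the threshold.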
Fine tuning these filters took a lot of experimentation. I had to find just the right settings to be able to output an interesting graph.
Also, I couldn't use the same filters for all time periods. The filters I used for 2000-2009 don't work on 1950-1959 at all, because I do not have the same amount of data, and because that data is very different.
In order to export the graphs to the HTML pages above, I used a Gephi plugin that allowed me to create sigma.js exports.
Unfortunately, the export still wasn't perfect: the information for each node was not very relevant, or displayed poorly. Therefore I decided to create scripts that would customize the exports. They add my logo, customize the CSS, remove uninteresting data, fix settings, and add the ability to see all movies related to every keyword.
Although I am done with this project for now, I think it could be enhanced a lot.
Indeed, the graphs do not share the same quality. I feel like 2000-2009 is pretty good, but 1950-1959 is too simple, for example. The only way to fix that would be to adjust the filtering, and that takes a lot more time than it seems. I think the key to significantly improve the results would be to have a complete list of all words that do not really carry any meaning, and exclude all of them. Making such a list is, unfortunately, extremely time consuming.
Another problem is that similar words should sometimes be merged. For instance, I wish I could merge "kill" and "killed". Unfortunately I wasn't able to find a reliable way to do it.
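This is usually tackled with stemming. A crude suffix-stripping sketch (a real solution would use something like the Porter stemmer; this toy version illustrates the idea and is itself unreliable, as the text notes):

```python
def crude_stem(word):
    """Strip a few common English suffixes so inflected forms collapse
    onto one node. Deliberately naive: 'killed' and 'killing' both
    become 'kill', but many other words will be stemmed wrongly."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("killed"), crude_stem("killing"), crude_stem("kill"))
```

Applied before building the graph, this would merge "kill" and "killed" into a single node, at the cost of occasionally conflating unrelated words.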
Also, because of the amount of data and time, I had to stick to American action movies since 1950. I think it would be interesting to apply the same approach to comedy movies, sci-fi movies... or even to all movies.
Finally, I wish I was able to use more data from IMDb, such as user ratings, in order to make the graphs more relevant. Indeed, a lot of the movies are pretty unknown.
In the end, I am still pretty happy with the result. I think the final output shows a different way to navigate through movies, and if taken further, could be a pretty nice way to discover new movies. Maybe I will take the project further one day.