Using linear regression to predict movie box office gross

The second project in Metis’ Data Science program is an exercise in using Linear Regression to either interpret or make predictions with data. I chose to use movie data to predict the world-wide box office gross using only information available before a movie is released.

Sidebar: Actually, I originally wanted to predict which movies would be profitable, but after struggling with that problem for a few days, the Metis TA shared that it was known to be intractable. Many former Metis students had tired and failed to do the same. Beware.

I split the project into two parts, as reflected by the two notebooks in my git repository: web scraping, and modeling.

Web scraping

After exploring several available data sources, I settled on using boxofficemojo.com as my main source for information on movies released in the past five years.

I used Beautiful Soup to get the data, and it was straight-forward. To avoid downloading the same pages multiple times, I wrote a simple web page cache that I shared with my classmates and was the subject of another blog entry.

More problematic than the actual scraping, was filling holes in the data. I learned fairly quickly that boxofficemojo.com lacks budget numbers for most of the movies in their database. So, I scraped the-numbers.com and IMDB.com to supplement that information and was eventually able to get budget estimates for one-third of the movies.

Another interesting aspect of the scraping process was the logic I used to get the average box office of a movie’s actors, directors, producers, writers and composers to generate a ‘score’ I could use as a feature in the model. The alternative would have been to add hundreds, if not thousands, of dummy variables and I didn’t think that would be practical.

It worked best for me to cleanly separate scraping from modeling so I could make adjustments to one without affecting the other, so at the end of the scraping notebook, I save the data to a CSV file that the other notebook can import.

Modeling

As the goal of this project was to using linear regression, and as I’m a person who likes to understand how things work, I used statsmodels.OLS to build my model. I could have used the formula API, but I felt it too much of a black box that prevented me from seeing “inside” the model.

Surprisingly, my model worked quite well on the first try. The first time I fit the model to my data, I got an adjusted R-squared close to 0.8. My first thought was that my model was over-fit, but that turned not to be the case.

It’s obvious in retrospect, but the best indicators of how well a movie will do in the box office are the budget, and the box office appeal of the people involved in the project. The model I created used just seven features: budget, MPAA rating, genre, and the five box office ‘scores’ I assigned to the actors, directors, producers and writers.

In the model there’s one cumulative score for the actors, director(s), etc., that is the sum of each person’s average box office gross. In the model Ryan Gosling, for example, gets a higher score as an actor than as a director, because the films he directed did not do as well as the films in which he acted.

This score effectively captured how well each person’s movies had done in the past, and also the case where the same people reappear together in a franchise.

Though I’m pleased with how well the scoring system worked, I realize it has a major flaw. The simple model I built assigns each person a single score, where a more accurate model would assign different scores at different times in a person’s career.

Specifically, the score for an appearance in a movie should be the average gross before it was released. Cult films aside, it’s nonsensical to say a movie does well because of another movie a cast member worked on after.

That aside, the project served well as a way to study web scraping and linear regression.

Written on January 30, 2017