I'm rather fascinated by the potential of smart use of data for prediction (notwithstanding some of the issues that need to be ironed out with so-called big data) and particularly when applied to something as tricky to forecast as the box office takings of big budget movies.
Data is already being used fairly extensively in relation to film content, of course. Netflix and Amazon make extensive use of algorithms that analyze our previous selections to create recommendations for customers. Film studios might use interaction metrics and online trailer views to help shape marketing spend.
Epagogix, a consulting firm that works with the entertainment industry, employs analysts to read the scripts of films in production and attribute scores to a complex series of plot points (established from analysing large numbers of hit movies). These scores are fed into an algorithm, which then calculates how much (within a range) the movie will make at the box office, and even makes recommendations for script changes that might make a movie more marketable.
I'm less sure about using algorithms to make creative changes to a script, but the idea of being able to remove at least some of the risk from big budget launches is rather intriguing. In 2012 Google released the results of a study called "Quantifying Movie Magic with Google Search", which showed how combining different data points, including search volume, could predict the success of upcoming movies with an apparently high level of accuracy. Using data it collected on the 99 top films of 2012, Google looked at search volume for a film's trailer and factored in other information such as franchise status and seasonality, and was seemingly able to predict opening weekend box office revenue with 94 percent accuracy.
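To make the idea of "combining different data points" a little more concrete, here is a minimal sketch of a least-squares regression that predicts opening weekend revenue from trailer search volume and franchise status. This is not Google's model, which isn't described here. The figures are invented toy data, the function names are hypothetical, and R squared is just one common way to express how well a model fits, not necessarily the measure behind the 94 percent figure.

```python
# Hypothetical toy data, invented for illustration only:
# (trailer searches in thousands, is_franchise 0/1, opening weekend in $M)
FILMS = [
    (120, 1, 58.0),
    (45, 0, 17.0),
    (200, 1, 92.0),
    (80, 0, 30.0),
    (150, 0, 55.0),
    (60, 1, 33.0),
]


def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    rows = [list(row) + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def fit(films):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    X = [[1.0, searches, franchise] for searches, franchise, _ in films]
    y = [revenue for _, _, revenue in films]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, searches, franchise):
    """Apply the fitted intercept and the two feature weights."""
    intercept, per_search, franchise_bonus = coeffs
    return intercept + per_search * searches + franchise_bonus * franchise


def r_squared(coeffs, films):
    """Share of the variance in revenue that the fitted model explains."""
    y = [revenue for _, _, revenue in films]
    mean = sum(y) / len(y)
    residual = sum((rev - predict(coeffs, s, f)) ** 2 for s, f, rev in films)
    total = sum((rev - mean) ** 2 for rev in y)
    return 1.0 - residual / total


coeffs = fit(FILMS)
r2 = r_squared(coeffs, FILMS)
print("coefficients:", [round(c, 3) for c in coeffs])
print("R^2 on the toy data:", round(r2, 3))
```

A real model would be fitted on one set of films and judged on another, since a score measured on the same films used for fitting flatters the model.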
We have to be careful here, of course, but my favourite example comes from an application that arose out of a hack day. Mid last year, predictive analytics company Levers ran a hack event at which its developers created a tool and an algorithm that was able to predict the opening weekend box office revenues of a succession of big blockbuster films to a high degree of accuracy.
What I like about it is how they did it - using a series of key creative inputs and then applying a series of constraints. You can read more about it here, but essentially they quantified the value of a film’s cast and crew using graph theory applied to the entire IMDb database. That's 3.4 million people, who were then connected via 2.6 million films, creating in all 28 million connections between all of the actors, directors and writers in Hollywood. They then applied Google’s PageRank algorithm to the set of data, assigning each film’s opening weekend box office revenue as the value of each connection. This meant they could compute a score for each cast and crew member and approximate their relative contribution to the Hollywood economy.
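As a rough illustration of that scoring step, here is a minimal sketch of a revenue-weighted PageRank over a tiny graph of people linked by shared films. It is not Levers' actual code: the films, names and revenues are invented, the function names are hypothetical, and the damping factor of 0.85 is simply the conventional PageRank default rather than anything taken from their write-up.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data, invented for illustration only:
# (film, opening weekend revenue in $M, cast and crew)
FILMS = [
    ("Film A", 100.0, ["Ana", "Ben", "Cal"]),
    ("Film B", 20.0, ["Ben", "Dee"]),
    ("Film C", 5.0, ["Dee", "Eli"]),
]


def build_graph(films):
    """Link every pair of people who share a film, weighted by its revenue."""
    graph = defaultdict(lambda: defaultdict(float))
    for _title, revenue, people in films:
        for a, b in combinations(people, 2):
            graph[a][b] += revenue
            graph[b][a] += revenue
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=100):
    """Power iteration: each person passes on rank in proportion to edge weight."""
    nodes = list(graph)
    n = len(nodes)
    rank = {person: 1.0 / n for person in nodes}
    out_weight = {person: sum(graph[person].values()) for person in nodes}
    for _ in range(iterations):
        # Every person starts each round with the same small baseline share.
        new_rank = {person: (1.0 - damping) / n for person in nodes}
        for person in nodes:
            for neighbour, weight in graph[person].items():
                share = weight / out_weight[person]
                new_rank[neighbour] += damping * rank[person] * share
        rank = new_rank
    return rank


scores = weighted_pagerank(build_graph(FILMS))
for person, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{person}: {score:.3f}")
```

In this toy graph the people attached to the high-grossing film end up with the largest scores, which is the intuition behind using revenue as the weight on each connection.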
They weighed the impact of the cast and crew, broken down by role (actor, director or writer), on opening weekend revenue, and then looked at other variables that might affect success, including the MPAA rating, genre, language, social media, studio and release date. They separated the factors that were statistically significant in predicting revenue from those that had little impact (it turns out that MPAA rating and release weekend date had the most impact on projected revenue forecasts) and then quantified those effects by (amongst other things) looking at opening weekends over the past 30 years (it turns out that, major holidays excluded, release date can affect a film’s opening weekend revenue by up to 15%). The resultant application was tested against some existing popular movies, and the margin of error between its predictions and the actual box office revenues (top) was surprisingly small.
It's easy to get carried away with big numbers, and to confuse correlation with causation. But I think the art in a lot of this kind of stuff (as we're finding out with Fraggl) is in selecting the different data points that might have meaning and in how you then put them together. This seems to me to be an example of where that has been done in a smart way to create interesting results.