Ginger Wang
Graduate Business Analyst | Project Management | Coordinator | Problem Solver
Screen Australia
Screen Australia is the Australian Federal Government's key funding body for the Australian screen production industry, created under the Screen Australia Act 2008. This project supported Screen Australia to analyse its data and provide strategies to improve revenues based on model training results.


Data Preparation
1. Data Understanding
The data consists of 9304 rows and 39 features.
2. Missing Data
28.8% of the data was missing, which would affect the reliability and accuracy of the report's findings if left unaddressed.
In addition, a significant portion of the categorical data was missing, especially for features recorded as primary, secondary, and tertiary values, such as ‘actor’, ‘director’ and ‘country’.
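A minimal sketch of how the overall and per-column missing shares could be measured with pandas. The demo frame and its column names are placeholders chosen for illustration, not the real Screen Australia schema.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame):
    """Return the overall share of missing cells and the share per column."""
    overall = float(df.isna().to_numpy().mean())
    per_column = df.isna().mean().sort_values(ascending=False)
    return overall, per_column


# Tiny illustrative frame; the column names are assumptions.
demo = pd.DataFrame(
    {
        "genre": ["Drama", None, "Comedy", None],
        "running_time": [101.0, np.nan, 95.0, 120.0],
        "director_secondary": [None, None, None, "A. Person"],
    }
)
overall_share, column_share = missing_summary(demo)
print(f"{overall_share:.1%} of cells are missing")
print(column_share)
```

Sorting the per-column shares makes it easy to see which features, such as secondary and tertiary fields, carry most of the missing values.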
3. Data Cleaning
1. Rows with more than 70% of their data missing were dropped, as they would offer little predictive or analytical value to the dataset.
2. The remaining data was divided into numerical and categorical features, based on data type, and each group was cleaned separately.
- Numerical
  - Machine-learning-based imputation
  - Multivariate imputer that estimates each feature from all other features
  - Extreme Gradient Boosting (XGBoost)
- Categorical
  - Features with no predictive capability were dropped
  - Important missing categorical data:
    - Imputed using the mode (a measure of central tendency): Genre, Rating
    - Placed into an ‘Unknown’ category where a high share of the column was missing: sequel and source
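The cleaning steps above can be sketched roughly as follows, assuming pandas and scikit-learn. The function name `clean`, the demo columns and the demo values are hypothetical. The estimator defaults to scikit-learn's built-in BayesianRidge so the sketch runs without XGBoost installed; passing an `xgboost.XGBRegressor()` as `estimator` would be closer to the approach described.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer


def clean(df, mode_cols, unknown_cols, max_missing=0.70, estimator=None):
    """Drop sparse rows, then impute numerical and categorical columns separately."""
    # 1. Drop rows with more than `max_missing` of their values missing.
    out = df.loc[df.isna().mean(axis=1) <= max_missing].copy()

    # 2a. Numerical: multivariate imputation, each feature modelled from the others.
    #     estimator=None uses scikit-learn's default (BayesianRidge), an assumption
    #     made here for portability; a gradient-boosting regressor can be passed in.
    num_cols = out.select_dtypes(include="number").columns
    imputer = IterativeImputer(estimator=estimator, random_state=0)
    out[num_cols] = imputer.fit_transform(out[num_cols])

    # 2b. Categorical: mode for some columns, an explicit 'Unknown' level for others.
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in unknown_cols:
        out[col] = out[col].fillna("Unknown")
    return out


# Illustrative data only; the last row is entirely missing and gets dropped.
demo = pd.DataFrame(
    {
        "production_budget": [10.0, 20.0, np.nan, 40.0, np.nan],
        "max_screens": [100.0, 200.0, 300.0, np.nan, np.nan],
        "genre": ["Drama", "Drama", None, "Comedy", None],
        "source": ["Novel", None, "Sequel", "Original", None],
    }
)
cleaned = clean(demo, mode_cols=["genre"], unknown_cols=["source"])
print(cleaned)
```

Keeping ‘Unknown’ as its own level, rather than imputing it, avoids inventing values for columns where most entries are missing.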
Feature Engineering
- Yeo-Johnson transformation
  - Reduces skewness and brings distributions closer to normal
- Categorical
  - Inspected and engineered so that the features could be used in modelling
- Numerical
  - Generation of Movie Size based on opening day screens
  - Conversion of production budget to Australian dollars
- Dummy Encoding
  - Avoids the dummy variable trap and provides a reference group
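A rough sketch of these feature engineering steps, assuming pandas and scikit-learn. The exchange rate, the screen-count thresholds for Movie Size, and the column names are all hypothetical values chosen for illustration; the text does not state the ones actually used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

RATE_TO_AUD = 1.5  # hypothetical exchange rate, for illustration only
SIZE_BINS = [0, 50, 250, np.inf]  # hypothetical opening-day screen thresholds
SIZE_LABELS = ["small", "medium", "large"]


def engineer(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Express the production budget in Australian dollars.
    out["production_budget_aud"] = out["production_budget"] * RATE_TO_AUD
    # Derive a movie-size category from the opening-day screen count.
    out["movie_size"] = pd.cut(
        out["opening_day_screens"], bins=SIZE_BINS, labels=SIZE_LABELS, include_lowest=True
    )
    # Yeo-Johnson reduces skewness; unlike Box-Cox it accepts zero and negative values.
    transformer = PowerTransformer(method="yeo-johnson", standardize=True)
    out["production_budget_yj"] = transformer.fit_transform(
        out[["production_budget_aud"]]
    ).ravel()
    # drop_first=True drops one level per feature; that level becomes the reference
    # group, which avoids the dummy variable trap (perfect collinearity).
    return pd.get_dummies(out, columns=["movie_size", "genre"], drop_first=True)


# Illustrative data; budgets are in millions and deliberately right-skewed.
demo = pd.DataFrame(
    {
        "production_budget": [1.0, 2.0, 3.0, 5.0, 8.0, 200.0],
        "opening_day_screens": [10, 40, 120, 200, 300, 600],
        "genre": ["Drama", "Comedy", "Drama", "Action", "Comedy", "Action"],
    }
)
features = engineer(demo)
print(features.filter(like="movie_size"))
print(features[["production_budget_aud", "production_budget_yj"]].skew())
```

In this demo the dropped levels, ‘small’ and ‘Action’, act as the reference groups against which the remaining dummy coefficients would be read.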

Correlation Heatmap
Analysing the correlation among variables can provide insight into the strength of the relationship between variables.
- Strong Relationship:
  - opening_day_screens
  - max_screens
- Weak Relationship:
  - running_time
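A minimal sketch of how such a correlation matrix could be computed and ranked with pandas. It assumes the relationships listed are measured against lifetime gross, which the text does not state explicitly; the `lifetime_gross` column name and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str):
    """Return the full correlation matrix and features ranked by |correlation| with target."""
    corr = df.corr(numeric_only=True)
    ranking = corr[target].drop(target).abs().sort_values(ascending=False)
    return corr, ranking


# Synthetic data built so that screen counts drive the target and running time does not.
rng = np.random.default_rng(0)
n = 200
opening = rng.uniform(10, 500, n)
demo = pd.DataFrame(
    {
        "opening_day_screens": opening,
        "max_screens": opening * 1.2 + rng.normal(0, 20, n),
        "running_time": rng.normal(105, 15, n),
    }
)
demo["lifetime_gross"] = 0.05 * demo["max_screens"] + rng.normal(0, 3, n)

corr, ranking = correlation_with_target(demo, "lifetime_gross")
print(ranking)
```

The matrix `corr` is what a heatmap would display; two strongly related predictors, such as the two screen counts here, also show up as a high off-diagonal value, which is worth checking before fitting a linear model.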

Feature Selection
Exploratory Data Analysis

Declining Box Office Trend
The graph below demonstrates a clear declining trend in box office revenues, driven primarily by more movies being available now than ever before, which dilutes the median and IQR.
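As a sketch of how the median and IQR behind such a trend could be tabulated per year, assuming pandas; the column names `release_year` and `lifetime_gross` and the demo figures are hypothetical.

```python
import pandas as pd


def yearly_spread(df, year_col="release_year", value_col="lifetime_gross"):
    """Count, median and interquartile range of revenue for each release year."""
    grouped = df.groupby(year_col)[value_col]
    summary = pd.DataFrame(
        {
            "n_movies": grouped.count(),
            "median": grouped.median(),
            "q1": grouped.quantile(0.25),
            "q3": grouped.quantile(0.75),
        }
    )
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary


# Illustrative data: more titles in the later year, most of them small earners.
demo = pd.DataFrame(
    {
        "release_year": [2000, 2000, 2000, 2020, 2020, 2020, 2020, 2020],
        "lifetime_gross": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0, 4.0, 100.0],
    }
)
spread = yearly_spread(demo)
print(spread)
```

In the demo, the later year has more titles and one large hit, yet its median and IQR are lower, which is the dilution effect described.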
How Do Box Office Revenues Vary Amongst Regions?
Movies from North America perform the best at the box office, whereas African movies perform the worst.
The rise of Hollywood has driven growth in both the number of movies produced and the spending on them, so the dominance of North American movies is no surprise.
A surprising finding is that the Australian market does not appear to prefer movies of Australian origin over those from other regions. Additionally, there are more negatively performing outliers than positive ones, further demonstrating the difficulty of producing a box office hit.


Distributors Key in Generating Higher Returns
The impact of a distributor should not be underestimated: the chart clearly demonstrates the effect large distributors have on box office revenues. The six biggest distributors, not including ‘Other’, account for almost all blockbuster movies, and Walt Disney’s blockbusters alone account for more box office revenue than the entire box office history of 8 other distributors.
Model Building
- Linear Regression (Ordinary Least Squares) Model
- Tree-Based Methods:
  - Regression Tree
  - Random Forest
- Gradient Boosting:
  - XGBoost
  - LightGBM (performed the best)
  - CatBoost
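A comparison of this kind could be sketched as below, assuming scikit-learn and cross-validated RMSE as the metric (the text does not say which metric was used). The data is synthetic, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in for XGBoost, LightGBM and CatBoost so the sketch has no extra dependencies.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem; it stands in for the prepared feature matrix.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

candidates = {
    "ols": LinearRegression(),
    "regression_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Stand-in for the three boosting libraries, which offer similar fit/predict APIs.
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in candidates.items():
    # The scorer returns negative RMSE, so flip the sign before averaging the folds.
    rmse = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    scores[name] = rmse.mean()

for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>18}: RMSE {rmse:.2f}")
```

Because the synthetic data is linear by construction, OLS wins here; the ranking says nothing about the real dataset, where LightGBM performed best.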


Once the final model was created, feature importance analysis was conducted to understand what drives lifetime gross revenue. SHAP values were used to show the direction in which each feature moved the model's output. In the final model, max screens, previous gross and production budget were the three most important features. In the waterfall chart on the left-hand side, previous gross and production budget made a net positive contribution to the response variable, while max screens had a negative effect. Furthermore, in the beeswarm plot on the left-hand side, high values of max screens and previous gross made significantly positive contributions to lifetime gross, while the same cannot be said for production budget.
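The text does not say which tooling produced the SHAP values. To illustrate the quantity itself, the sketch below computes exact Shapley values for one prediction by enumerating feature subsets, with absent features averaged over a background sample. The linear model, its coefficients and the data are hypothetical; the approach is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, background):
    """Exact Shapley values for one row `x`; absent features use background values."""
    n = x.shape[0]

    def value(subset):
        # Features in `subset` take their value from x; the rest stay as in background.
        data = background.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Hypothetical linear model over three features named after those in the text.
names = ["max_screens", "previous_gross", "production_budget"]
weights = np.array([3.0, 2.0, -1.0])
predict = lambda data: data @ weights + 5.0

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, -0.5, 2.0])

phi = shapley_values(predict, x, background)
for name, contribution in zip(names, phi):
    print(f"{name:>18}: {contribution:+.3f}")
```

The sign of each value gives the direction of that feature's contribution for this one observation, which is what a waterfall chart displays; a beeswarm plot instead shows these values across many observations. The values sum to the difference between this prediction and the average prediction over the background sample.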
Recommendations
- For domestic titles, reduce spending on low-budget productions and shift those resources towards higher-budget productions.
- Improve Australian audiences' access to domestic titles by increasing the number of screens available to them.
- Increase the number of cinemas offering preview screenings and give audiences a platform to provide feedback on the movie.
- Emphasise the need to decrease the duration of films, or prioritise films of shorter duration.