One of the main assumptions of linear regression is that the predictor variables are not strongly correlated with each other. Achieving zero multicollinearity is unrealistic with real-life data because multiple variables are often correlated. For example, a student with high grades is likely to study more and to do well on a test, so a model predicting test scores will have a hard time separating these related factors.
In marketing, we often have this problem because we tend to spend more money in peak periods, and spending across channels often moves up and down simultaneously. So, for example, if you cut spending on both Facebook and TikTok ads at the same time, how can you tell which one caused the drop in total sales?
There are multiple methods for dealing with multicollinearity, for example, combining related variables or dropping one. However, in a marketing context, we want every channel to stay in the model, so we know how much to spend on each.
That can make multicollinearity a tough problem to solve: even if the model's accuracy isn't affected much by the issue, it leads to 'implausible' results, like a negative coefficient implying that you make less revenue when you spend more on Facebook ads - see below.
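To make the 'implausible results' point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: two hypothetical channels in standardized units, a true effect of 1.0 for each, and a noise level of 3.0. It refits an ordinary least squares model on many simulated datasets and compares how much the Facebook coefficient moves around when the channels are uncorrelated versus highly correlated.

```python
import numpy as np

def simulate_facebook_coefficients(correlation, n_weeks=104, n_trials=500, seed=0):
    """Fit OLS on many simulated datasets; return the Facebook coefficient from each."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, correlation], [correlation, 1.0]]
    true_effects = np.array([1.0, 1.0])  # assumed: both channels truly help sales
    estimates = np.empty(n_trials)
    for trial in range(n_trials):
        # Weekly spend on two channels (standardized) with the requested correlation.
        spend = rng.multivariate_normal([0.0, 0.0], cov, size=n_weeks)
        sales = spend @ true_effects + rng.normal(0.0, 3.0, size=n_weeks)
        design = np.column_stack([np.ones(n_weeks), spend])  # intercept + channels
        beta, *_ = np.linalg.lstsq(design, sales, rcond=None)
        estimates[trial] = beta[1]  # coefficient on the first channel
    return estimates

independent = simulate_facebook_coefficients(correlation=0.0)
collinear = simulate_facebook_coefficients(correlation=0.95)

for label, est in [("r = 0.00", independent), ("r = 0.95", collinear)]:
    print(f"{label}: std of estimate = {est.std():.2f}, "
          f"share negative = {(est < 0).mean():.1%}")
```

Under these assumptions, the collinear case should show a much wider spread of estimates, including some negative ones, even though the true effect is positive in every simulated dataset.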
What is multicollinearity?
Multicollinearity occurs when independent variables are highly correlated (as a rule of thumb, a correlation of 0.8 or greater in absolute value), meaning that you cannot separate the effects of the independent variables on the outcome variable. In other words, predictor variables can be near-perfectly predicted by other predictor variables.
When independent variables in a regression model are correlated, they stop being independent, which violates one of the assumptions of regression analysis, alongside linearity, homoscedasticity, independent errors, etc.
Regression analysis seeks to isolate the relationships between dependent and independent variables, which multicollinearity hinders. You don’t want to create a situation where changing one independent variable changes several others, as this will make your model results tough to interpret.
Another way to describe multicollinearity is as a linear association between predictors, i.e., X1 correlates with X2. When the Xi are associated with each other in this way, the estimated “linear relationship” between each Xi and Y should be analyzed with care, and the association between predictors should be addressed.
At its most extreme, independent variables are perfectly correlated with each other, which might happen when the same information is included twice (e.g., height in inches and height in centimeters).
Collinear independent variables can appear in regression models when two or more features are related, are timed together, or otherwise coincide in some way.
Some simple examples of collinear independent variables include:
- Height and weight; taller people are likely to weigh more.
- Household income and electricity usage; richer households are likely to use more.
- Car cost and fuel expense; more expensive cars are likely to use more fuel.
- Temperature and ice cream sales; hot days are likely correlated with more ice cream sales.
- Earnings and work hours; the more someone works, the more they’re likely to earn.
In any of these situations, the presence of collinearities makes linear relationships hard to estimate.
The importance of multicollinearity
Multicollinearity is a statistical issue that arises when two or more predictors are highly correlated. It can be caused by many factors, including measurement error, redundant variables, and confounding variables.
Multicollinearity makes it difficult to determine the relative importance of each predictor because they are correlated with each other. As a result, a model built with collinear independent variables will be unstable when introduced to new data and will likely overfit.
This presents a particular challenge for marketing mix modeling, which purposefully models a ‘mix’ of independent variables. The result is data multicollinearity, where the multicollinearity is present in the data and observations themselves, rather than structural multicollinearity, which is an artifact of how the model is specified.
How to check multicollinearity
There are two main ways to detect and check multicollinearity: creating a correlation matrix and checking the Variance Inflation Factor (VIF). But first, simply look at your dataset and the kind of variables you’re including.
Often, you’ll be able to identify variables likely to be collinear, such as house size and the number of rooms, height and weight, etc. You might be able to drop an unnecessary variable immediately.
Once you've checked over your variables for signs of obvious multicollinearity, you can quantitatively check your data using two methods:
1) Correlation matrix: You can first create a correlation matrix for Pearson’s correlation coefficient. This will detect the correlations between pairs of variables, expressed as a single number ranging from -1 to +1. The values of -1 and 1 indicate a perfect linear relationship between two variables, where a change in one is matched by an exactly proportional change in the other.
The minus or plus indicates the direction of change, with positive coefficients indicating that an increase in one variable tends to accompany an increase in the other, and the opposite for negative coefficients. As a rule of thumb, correlation coefficients over 0.5 in absolute value raise suspicion of multicollinearity. Correlation coefficients exceeding 0.8 indicate a strong correlation.
Example: The dataset below contains various housing data, such as the year built, basement square feet, first-floor square feet, etc. Here, we can see a few pairs of variables that approach 0.8, and one pair that exceeds it: total basement square feet (TotalBsmtSF) and first-floor square feet (1stFlrSF). This pair produces a correlation coefficient of 0.81953. This makes sense - the larger the basement, the larger the first floor is likely to be.
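A correlation check like this can be sketched in a few lines of pandas. The data here is synthetic stand-in data generated for illustration, not the housing dataset described above, so the numbers will not match it. The column names are borrowed from the example, and `flag_pairs` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in data (assumption): first-floor area is built to track basement area.
basement = rng.normal(1000, 300, size=n)
df = pd.DataFrame({
    "TotalBsmtSF": basement,
    "1stFlrSF": basement + rng.normal(100, 150, size=n),
    "YearBuilt": rng.integers(1900, 2010, size=n).astype(float),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr(method="pearson")

def flag_pairs(corr, threshold=0.8):
    """Return (col_a, col_b, r) for each pair whose |r| meets the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

flagged = flag_pairs(corr)
print(corr.round(2))
print(flagged)
```

The 0.8 threshold is the rule of thumb from the text. Lowering it to 0.5 would list the pairs that merely raise suspicion.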
2) Variance Inflation Factor (VIF)
The second test is the variance inflation factor (VIF). This measures how much the variance of a variable's estimated coefficient is inflated by its correlation with the other independent variables, expressed as a single number starting at 1. A VIF of 1 indicates no correlation, a VIF between 1 and 5 indicates low to moderate correlation, a VIF of over 5 may be a problem, and a VIF of over 10 is almost definitely a problem. However, there are at least two situations where high VIFs can be safely ignored:
- High VIFs appear only in control variables, not in the variables that matter to the model’s overall results. Here, the variables of interest are not collinear with each other or with the control variables, and their regression coefficients remain unaffected.
- High VIFs may be produced by the deliberate inclusions of products or powers of other variables.
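For reference, the VIF of predictor j is 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below computes that directly with NumPy on made-up data. The function name and the example variables are assumptions for illustration. In practice, statsmodels also provides a `variance_inflation_factor` function.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (rows = observations, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_pred = X.shape
    vifs = []
    for j in range(n_pred):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n_obs), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Made-up data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + 0.3 * rng.normal(size=300)
x3 = rng.normal(size=300)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(2))))
```

With this setup, x1 and x2 should land above the "over 5" threshold mentioned above, while x3 stays close to 1.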
Dealing with multicollinearity
There are a few things you can do when you encounter multicollinearity. Your choices depend on what you’re trying to do and the impact of multicollinearity on your model.
- Drop a variable
The simplest solution is to drop highly correlated variables. Dropping the least important highly correlated variables first may help reduce multicollinearity. This is sensible if collinear variables are redundant. For example, you might have ‘number of rooms’, ‘number of bedrooms’, and ‘floor area’ in the same set. Removing ‘number of rooms’ and ‘number of bedrooms’ while leaving ‘floor area’ can reduce multicollinearity and data redundancy.
- Combine or transform variables
If you have two similar variables, such as ‘number of baths’ and ‘number of bedrooms’, these can be combined into one variable. Similarly, variables can be transformed to combine their information while removing one from the set. For example, rather than including variables like GDP and population in a model, include GDP/population (i.e., GDP per capita) instead.
Here’s another example: suppose your dataset includes both ‘age’ and ‘years in employment’. These are multicollinear in that older individuals are more likely to have been employed for longer. These variables can be combined into ‘age at joining’ by subtracting ‘years in employment’ from ‘age’. You could also drop one or the other.
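The ‘age at joining’ transformation can be sketched in pandas as follows. The six rows of data and the column names are invented for illustration.

```python
import pandas as pd

# Invented example data (assumption): age and tenure rise together.
people = pd.DataFrame({
    "age": [25, 32, 41, 48, 56, 63],
    "years_in_employment": [2, 10, 15, 27, 30, 41],
})

# How strongly the two original variables move together.
r_original = people["age"].corr(people["years_in_employment"])

# Combine both into a single derived variable, then drop the originals.
people["age_at_joining"] = people["age"] - people["years_in_employment"]
model_inputs = people.drop(columns=["age", "years_in_employment"])

print(f"correlation between age and years in employment: {r_original:.2f}")
print(model_inputs)
```

The same pattern applies to the GDP per capita example: divide one column by the other and keep only the ratio.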
In these situations, be aware of omitted variable bias, which occurs when a variable that genuinely drives the outcome is left out of the model, leading to invalid conclusions. For example, in marketing, omitting key sales drivers from a model may lead to their effects being improperly attributed to other channels.
A change in sales might be improperly attributed to variables left in the model when in reality, it’s more strongly associated with a variable omitted from the model to lower multicollinearity. Sometimes, distilling marketing models into smaller models can help delineate collinearities.
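A small simulation can illustrate this misattribution. The setup is entirely assumed: two hypothetical channels whose spend is correlated, with true effects of 2.0 (Facebook) and 3.0 (TikTok). Fitting the model with and without TikTok shows what happens to the Facebook coefficient when a correlated driver is dropped.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Assumed data-generating process: Facebook spend tracks TikTok spend.
tiktok = rng.normal(size=n)
facebook = 0.8 * tiktok + 0.6 * rng.normal(size=n)
sales = 2.0 * facebook + 3.0 * tiktok + rng.normal(size=n)

def ols_slopes(predictors, y):
    """OLS with an intercept; returns the slope coefficients only."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

full = ols_slopes([facebook, tiktok], sales)   # both channels included
reduced = ols_slopes([facebook], sales)        # TikTok omitted

print(f"Facebook coefficient, full model:    {full[0]:.2f}")
print(f"Facebook coefficient, TikTok dropped: {reduced[0]:.2f}")
```

In the reduced model, Facebook absorbs part of TikTok's effect, so its coefficient should come out well above the true value of 2.0.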
- Keep the predictors in the model
If you need to keep the predictors in the model, you can use a different statistical method designed to handle highly correlated variables. Examples include ridge regression, lasso regression, or partial least squares regression.
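As one illustration of this option, here is a minimal closed-form ridge regression sketch in NumPy. The data is simulated, the penalty `alpha=10.0` is an arbitrary choice for demonstration (it would normally be tuned, for example by cross-validation), and the coefficients are on the standardized scale. Libraries such as scikit-learn offer a `Ridge` estimator for real use.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on standardized predictors; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each predictor
    yc = y - y.mean()                          # centering removes the intercept
    n_pred = Xs.shape[1]
    # Solve (Xs'Xs + alpha * I) beta = Xs'y
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(n_pred), Xs.T @ yc)

# Simulated data (assumption): two highly correlated channels, one year of weeks.
rng = np.random.default_rng(1)
n_weeks = 52
facebook = rng.normal(size=n_weeks)
tiktok = 0.95 * facebook + np.sqrt(1 - 0.95**2) * rng.normal(size=n_weeks)
sales = 1.0 * facebook + 1.0 * tiktok + rng.normal(0.0, 3.0, size=n_weeks)
X = np.column_stack([facebook, tiktok])

ols_coefs = ridge_fit(X, sales, alpha=0.0)     # alpha = 0 reduces to plain OLS
ridge_coefs = ridge_fit(X, sales, alpha=10.0)

print("OLS   coefficients:", ols_coefs.round(2))
print("Ridge coefficients:", ridge_coefs.round(2))
```

The penalty shrinks the coefficients and pulls the two collinear channels toward each other, trading a little bias for more stable estimates.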
- Do nothing
Multicollinearity can be tolerated in some situations, especially when the results aren’t guiding serious or expensive strategies and are being used for research or learning purposes. Moreover, a high VIF does not always mean the model is negatively affected.
Overall, it’s best to select the correct variables first and perform transformations and feature engineering to ensure the most important independent variables are high-quality.
Multicollinearity is somewhat inevitable when dealing with real-world data that clashes and interacts throughout the observation period. While it’s not always an issue, some forms of multicollinearity can really impact a model’s predictive capabilities.
When multicollinearity is discovered through a correlation matrix or VIF, it should be investigated with a view to dropping or transforming the collinear variables. This is sometimes very simple when both variables describe the same phenomenon. Still, it isn’t always possible when each variable is crucial to the overall model, and dropping or transforming might cause omitted variable bias or otherwise invalidate the model.