Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated.

Linear regression assumes that the independent variables are not perfectly correlated with each other, and it works best when they are close to independent. Achieving zero multicollinearity is unrealistic with real-life data because multiple variables are often correlated. For example, a student with high grades is likely to study more and to do well on a test, so a model predicting test scores will have a hard time separating these related factors.

In marketing, we often have this problem because we tend to spend more money in peak periods, and spending across channels often moves up and down simultaneously. So, for example, if you drop spending on Facebook and TikTok ads at the same time, how can you find out which one caused a drop in total sales?

There are multiple methods for dealing with multicollinearity, for example, combining related variables or dropping one. However, when we discuss multicollinearity in a marketing context, we want to ensure all channels are included, so we know what to spend.

That can make multicollinearity a tough problem to solve: even if the model's accuracy isn't affected much by the issue, it can lead to 'implausible' results, like a negative coefficient implying that you make less revenue when you spend more on Facebook ads.
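To see how such implausible coefficients can arise, here is a minimal simulation sketch in Python (using NumPy). The channel names, the true effect of 1.0 per channel, the noise level, and the 0.95 correlation are all assumptions made up for illustration, not estimates from real campaigns. The sketch refits an ordinary least squares model many times and compares how much the Facebook coefficient moves around when the two channels are independent versus highly correlated.

```python
import numpy as np

def simulate_coef_spread(rho, n_weeks=104, n_sims=500, seed=0):
    """Fit OLS on simulated data n_sims times.

    Returns (std. dev. of the Facebook coefficient,
             share of fits where that coefficient is negative).
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = []
    for _ in range(n_sims):
        # Standardized weekly spend for two hypothetical channels
        # (column 0 = Facebook, column 1 = TikTok) with correlation rho.
        spend = rng.multivariate_normal([0.0, 0.0], cov, size=n_weeks)
        # Assumed ground truth: each channel adds 1.0 to sales per unit of spend.
        sales = spend @ np.array([1.0, 1.0]) + rng.normal(0.0, 3.0, size=n_weeks)
        X = np.column_stack([np.ones(n_weeks), spend])  # intercept + channels
        beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
        coefs.append(beta[1])  # fitted Facebook coefficient
    coefs = np.array(coefs)
    return coefs.std(), (coefs < 0).mean()

spread_indep, neg_indep = simulate_coef_spread(rho=0.0)
spread_coll, neg_coll = simulate_coef_spread(rho=0.95)

print(f"independent channels: spread={spread_indep:.2f}, negative share={neg_indep:.1%}")
print(f"collinear channels:   spread={spread_coll:.2f}, negative share={neg_coll:.1%}")
```

Under these assumptions, the collinear setup should show a clearly wider spread and a noticeable share of negative Facebook coefficients, even though the true effect is positive in every simulation.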

What is multicollinearity? 

Multicollinearity occurs when independent variables are highly correlated (as a rule of thumb, a correlation of 0.8 or greater in absolute value), meaning that you cannot cleanly separate the effects of the independent variables on the outcome variable. In other words, predictor variables can be near-perfectly predicted by other predictor variables.

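When independent variables in a regression model are correlated, they stop being independent. Perfect correlation violates one of the assumptions of regression analysis (alongside linearity, homoscedasticity, independent errors, etc.), and strong correlation makes the coefficient estimates unreliable even though the model can still be fitted.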

Regression analysis seeks to isolate the relationships between dependent and independent variables, which multicollinearity hinders. You don’t want to create a situation where changing one independent variable changes several others, as this will make your model results tough to interpret. 

Another way to describe multicollinearity is as a linear relationship among the predictors themselves, e.g., X1 correlates with X2. When the Xi are related in this way, the estimated “linear relationship” between each Xi and Y should be analyzed with care.

At its most extreme, independent variables are perfectly correlated with each other, which might happen when the same information is included twice (e.g., height in inches and height in centimeters).
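The extreme case can be checked numerically. The sketch below uses made-up data (the same height recorded in inches and in centimeters) and NumPy's `matrix_rank` to show that the duplicated column adds no new information: the design matrix has three columns but only rank 2, so ordinary least squares has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 50 heights in inches, and the same heights in centimeters.
height_in = rng.normal(67.0, 3.0, size=50)
height_cm = height_in * 2.54

# Design matrix: intercept, height (in), height (cm).
X = np.column_stack([np.ones(50), height_in, height_cm])

rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)
```

A rank lower than the number of columns is the signature of perfect multicollinearity; dropping one of the two height columns restores full rank.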

Multicollinearity analogies

Collinear independent variables can appear in regression models when two or more features are related, are timed together, or otherwise coincide in some way.

Some simple examples of collinear independent variables include: 

  • Height and weight; taller people are likely to weigh more.
  • Household income and electricity usage; richer households are likely to use more.
  • Car cost and fuel expense; more expensive cars are likely to use more fuel.
  • Temperature and ice cream sales; hot days are likely correlated with more ice cream sales. 
  • Earnings and work hours; the more someone works, the more they’re likely to earn.

In any of these situations, the presence of collinearities makes linear relationships hard to estimate.

The importance of multicollinearity

Multicollinearity is a statistical issue that arises when two or more predictors are highly correlated. It can be caused by many factors, including measurement error, redundant variables, and confounding variables. 

Multicollinearity makes it difficult to determine the relative importance of each predictor because they are correlated with each other. As a result, the coefficients of a model built with collinear independent variables can be unstable: they may change substantially when the model is introduced to new data, and the model is more prone to overfitting.

This presents a particular challenge for marketing mix modeling, which purposefully models a ‘mix’ of independent variables. The resulting problem is known as data multicollinearity, where the correlation is present in the data and observations themselves, as opposed to structural multicollinearity, which is an artifact of how the model is specified.

How to check multicollinearity 

There are two main ways to detect and check multicollinearity: creating a correlation matrix and checking the Variance Inflation Factor (VIF). But first, simply look at your dataset and the kind of variables you’re including.

Often, you’ll be able to identify variables likely to be collinear, such as house size and the number of rooms, height and weight, etc. You might be able to drop an unnecessary variable immediately.

Once you've checked over your variables for signs of obvious multicollinearity, you can quantitatively check your data using two methods: 

1) Correlation matrix: You can first create a correlation matrix of Pearson’s correlation coefficients. This shows the correlation between each pair of variables, expressed as a single number ranging from -1 to +1. The values of -1 and +1 indicate a perfect linear relationship between two independent variables, where a change in one variable is accompanied by an exactly proportional change in the other.

The minus or plus indicates the direction of change, with positive coefficients indicating that an increase in one variable is accompanied by an increase in the other, and the opposite for negative coefficients. As a rule of thumb, correlation coefficients over 0.5 in absolute value raise suspicion of multicollinearity, and coefficients exceeding 0.8 indicate a strong correlation.

Example: Consider a dataset of housing data, with variables such as the year built, basement square feet, first-floor square feet, etc. In its correlation matrix, a few pairs of variables approach 0.8, and one pair exceeds it: total basement square feet (TotalBsmtSF) and first-floor square feet (1stFlrSF), with a correlation coefficient of 0.81953. This makes sense: the larger the basement, the larger the first floor is likely to be.
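A correlation check like this can be sketched in a few lines of Python. The code below does not use the housing dataset from the example; it generates synthetic data and merely borrows the column names for readability, so the numbers it prints are not the ones quoted above. The helper name `high_corr_pairs` and the 0.8 threshold are illustrative choices.

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return pairs

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins: first-floor area is built to track basement area,
# while year built is generated independently of both.
basement = rng.normal(1000.0, 300.0, n)
first_floor = basement + rng.normal(0.0, 120.0, n)
year_built = rng.normal(1975.0, 25.0, n)

X = np.column_stack([basement, first_floor, year_built])
names = ["TotalBsmtSF", "1stFlrSF", "YearBuilt"]

flagged = high_corr_pairs(X, names)
print(flagged)
```

With real data you would typically load the columns from a file and inspect the full matrix as well, since a pairwise check only catches relationships between two variables at a time.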

2) Variance Inflation Factor (VIF)

The second test is the variance inflation factor (VIF). For each independent variable, it measures how much the variance of its estimated coefficient is inflated by correlation with the other independent variables, expressed as a single number starting at 1. A VIF of 1 indicates no correlation, a VIF between 1 and 5 indicates low to moderate correlation, a VIF of over 5 may be a problem, and a VIF of over 10 is almost definitely a problem. However, there are at least two situations where high VIFs can be safely ignored:

  • High VIFs appear only in control variables, not in the variables that matter to the model’s overall results. Here, the variables of interest are not collinear with each other or with the control variables, and their regression coefficients remain unaffected. 
  • High VIFs are produced by the deliberate inclusion of products or powers of other variables. 
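For reference, the VIF of variable j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing variable j on all the other independent variables. The sketch below computes it directly with NumPy on synthetic data; the function name `vif` and the three made-up variables are assumptions for illustration, and libraries such as statsmodels offer their own implementation.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]                                # treat column j as the target
        others = np.delete(X, j, axis=1)           # all remaining predictors
        A = np.column_stack([np.ones(n), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()           # R^2 of the auxiliary regression
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)  # strongly tied to x1
x3 = rng.normal(size=n)             # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

In this setup the first two variables should land well above the threshold of 5, while the unrelated third variable should stay close to 1.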

Dealing with multicollinearity 

There are a few things you can do when you encounter multicollinearity. Your choices depend on what you’re trying to do and the impact of multicollinearity on your model. 

  1. Drop a variable 

The simplest solution is to drop highly correlated variables. Dropping the least important highly correlated variables first may help reduce multicollinearity. This is sensible if collinear variables are redundant. For example, you might have ‘number of rooms’, ‘number of bedrooms’, and ‘floor area’ in the same set. Removing ‘number of rooms’ and ‘number of bedrooms’ while leaving ‘floor area’ can reduce multicollinearity and data redundancy.

  2. Combine or transform variables 

If you have two similar variables, such as ‘number of baths’ and ‘number of bedrooms’, these can be combined into one variable. Similarly, variables can be transformed to combine their information while removing one from the set. For example, rather than including variables like GDP and population in a model, include GDP/population (i.e., GDP per capita) instead. 

Here’s another example: suppose your dataset includes both ‘age’ and ‘years in employment’. These are multicollinear in that older individuals are more likely to have been employed for longer. These variables can be combined into ‘age at joining’ by subtracting ‘years in employment’ from ‘age’. You could also drop one or the other. 
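The age example can be illustrated with a small sketch on synthetic data. It assumes, purely for illustration, that people join between ages 20 and 30 and that joining age is unrelated to how long they have been employed; with real data the transformed variable may still be correlated with tenure.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Made-up workforce: joining age and tenure are drawn independently.
joining_age = rng.uniform(20.0, 30.0, n)
years_in_employment = rng.uniform(0.0, 30.0, n)
age = joining_age + years_in_employment

# Original pair of predictors: age vs. years in employment.
r_before = np.corrcoef(age, years_in_employment)[0, 1]

# Transformed predictor: age at joining = age - years in employment.
age_at_joining = age - years_in_employment
r_after = np.corrcoef(age_at_joining, years_in_employment)[0, 1]

print(f"corr(age, years in employment)            = {r_before:.2f}")
print(f"corr(age at joining, years in employment) = {r_after:.2f}")
```

Under these assumptions the first correlation is high and the second is close to zero, which is the effect the transformation is meant to achieve.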

In these situations, be aware of omitted variable bias, which occurs when a relevant variable that is correlated with the remaining predictors is left out of the model, leading to invalid conclusions. For example, in marketing, omitting key sales drivers from a model may lead to their effects being improperly attributed to other channels.

A change in sales might be improperly attributed to variables left in the model when in reality, it’s more strongly associated with a variable omitted from the model to lower multicollinearity. Sometimes, distilling marketing models into smaller models can help delineate collinearities. 
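The misattribution described above can be reproduced in a short simulation. In this sketch the channel names, the true effect of 1.0 per channel, and the 0.9 correlation are invented for illustration. The model is fitted once with both channels and once with TikTok dropped.

```python
import numpy as np

def ols_slopes(X, y):
    """Ordinary least squares with an intercept; returns the slope coefficients only."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

rng = np.random.default_rng(3)
n = 2000
rho = 0.9

# Two hypothetical channels whose standardized spend is correlated at about 0.9.
facebook = rng.normal(size=n)
tiktok = rho * facebook + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Assumed ground truth: each channel contributes 1.0 per unit of spend.
sales = 1.0 * facebook + 1.0 * tiktok + rng.normal(size=n)

full = ols_slopes(np.column_stack([facebook, tiktok]), sales)
reduced = ols_slopes(facebook, sales)  # TikTok omitted

print("both channels:", np.round(full, 2))
print("Facebook only:", np.round(reduced, 2))
```

With TikTok omitted, the Facebook coefficient should come out near 1.9 rather than 1.0, because it absorbs most of the effect of the channel it is correlated with.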

  3. Keep the predictors in the model

If you need to keep the predictors in the model, you can use a different statistical method designed to handle highly correlated variables. Examples include ridge regression, lasso regression, or partial least squares regression. 
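As one illustration, ridge regression adds a penalty on the size of the coefficients, which stabilizes them when predictors are highly correlated. The sketch below implements the closed-form ridge solution with NumPy on synthetic data; the helper `ridge_fit`, the channel names, and the penalty value of 20 are arbitrary assumptions, and in practice the penalty is usually tuned by cross-validation with a library such as scikit-learn.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge slopes on centered data (the intercept is not penalized).

    With lam=0 this reduces to ordinary least squares.
    """
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(X_c.T @ X_c + lam * np.eye(k), X_c.T @ y_c)

rng = np.random.default_rng(1)
n = 104
rho = 0.95

# Two hypothetical, highly correlated channels with a true effect of 1.0 each.
facebook = rng.normal(size=n)
tiktok = rho * facebook + np.sqrt(1 - rho**2) * rng.normal(size=n)
sales = 1.0 * facebook + 1.0 * tiktok + rng.normal(0.0, 3.0, size=n)
X = np.column_stack([facebook, tiktok])

beta_ols = ridge_fit(X, sales, lam=0.0)
beta_ridge = ridge_fit(X, sales, lam=20.0)

print("OLS coefficients:  ", np.round(beta_ols, 2))
print("ridge coefficients:", np.round(beta_ridge, 2))
```

The ridge coefficients are always smaller in overall magnitude than the least squares ones. The trade-off is that they are biased toward zero, so they are harder to read as exact per-channel effects.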

  4. Do nothing

Multicollinearity can be tolerated in some situations, especially when the results aren’t guiding serious or expensive strategies and are being used for research or learning purposes. Moreover, high VIFs do not always mean that a model is negatively affected.

Overall, it’s best to select the correct variables first and perform transformations and feature engineering to ensure the most important independent variables are high-quality. 

Summary: Multicollinearity

Multicollinearity is somewhat inevitable when dealing with real-world data that clashes and interacts throughout the observation period. While it’s not always an issue, some forms of multicollinearity can really impact a model’s predictive capabilities. 

When multicollinearity is discovered through a correlation matrix or VIF, it should be investigated with a view to dropping or transforming multicollinear variables. This is sometimes very simple when both variables describe the same phenomenon. Still, it isn’t always possible when each variable is crucial to the overall model, and dropping or transforming might cause omitted variable bias or otherwise invalidate the model.

Frequently Asked Questions

What exactly is multicollinearity?

Multicollinearity is a statistical phenomenon that occurs when two or more predictor variables in a regression model are highly correlated.

What is multicollinearity and why is it a problem?

Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. Multicollinearity is a problem because it inflates the standard errors of the coefficients, which undermines the statistical significance of an independent variable.

What is multicollinearity problem in regression?

Multicollinearity happens when independent variables in the regression model are highly correlated with each other, so that predictor variables can be near-perfectly predicted by other predictor variables. It makes the model hard to interpret and may create an overfitting problem.

How do you fix multicollinearity?

Remove some of the highly correlated independent variables. Linearly combine the independent variables, such as adding them together. Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.

What is multicollinearity?

Multicollinearity occurs when two or more independent variables (X) in a regression model are correlated with each other. The correlation between two independent variables indicates that the change in one variable's value (e.g., X1) will be reflected in the change in another variable's value (e.g., X2). If the correlation between independent variables is high, then it may not be possible to determine which independent variable has an effect on the dependent variable.