Linear regression is used to predict the value of one variable from the value of another. This fundamental statistical method forms the basis of numerous more advanced models.
Ordinary Least Squares (OLS) is the simplest method for linear regression: it fits a best-fit line through the points by minimizing the sum of squared differences between the observed values and the regression line.
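As a minimal sketch of what OLS does, the snippet below fits a line to toy data (the data, coefficients, and noise level are all illustrative assumptions) by solving the least-squares problem directly:

```python
import numpy as np

# Toy data with a roughly linear relationship (illustrative assumption)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)

# OLS: choose the intercept and slope that minimize the sum of squared residuals
X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least squares
intercept, slope = beta

residuals = y - X @ beta
print(f"intercept={intercept:.2f}, slope={slope:.2f}, SSR={residuals @ residuals:.2f}")
```

With moderate noise, the recovered intercept and slope land close to the true values of 2 and 3.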
OLS is simple, especially now that we can employ code and software to perform it even on large datasets. But, as is often the case, the devil is in the details. In his book Naked Statistics, Charles Wheelan observes that linear regression is operationally easy to perform; even a child can do it with today's software, but performing it correctly and well is another story.
To do linear regression well, you need to meet its assumptions. In reality, meeting all assumptions perfectly is nearly impossible. Instead, the goal is to transform data, perform feature engineering, and otherwise refine the model to make the best of what you have. Ensuring you have quality, clean data in the first place is also essential.
The Gauss-Markov Theorem and BLUE
The acronym BLUE frequently appears in the context of linear regression. According to the Gauss-Markov theorem, linear regression with OLS provides the Best Linear Unbiased Estimator (BLUE) of the coefficients if the following are met:
- The expectation of errors (residuals) is 0
- Errors have equal variance, i.e., homoscedasticity of errors
- Errors are uncorrelated
The ‘best’ component of the BLUE acronym means the estimator has the lowest variance among all linear unbiased estimators.
Here, the errors do not need to be normal or independent and identically distributed (but they do need to be uncorrelated with mean zero and homoscedastic).
Linear regression assumptions
The Gauss-Markov theorem’s assumptions are fundamental, but they are too weak for most applications of linear regression outside educational contexts; satisfying them alone does not guarantee a useful model.
Linear regression is more typically defined by five or six assumptions. In reality, linear regression assumptions vary considerably with the context, including the dataset and the problem space (i.e. what you’re trying to achieve).
For non-experimental science, such as marketing mix modeling and econometrics, the tradition is to define at least five assumptions of linear regression. These go by various names but are definable as:
- Linearity: The dependent variable is a linear function of the independent variables.
- No or low multicollinearity: Independent variables cannot be predicted with other independent variables.
- Gaussian errors: Errors are normally distributed with mean = 0.
- Homoscedasticity: Errors have equal or similar variance.
- Independent errors: The residuals of our model (the errors) are independent of one another; there is no autocorrelation between successive errors.
Each of these assumptions is testable using a number of statistical tests.
While it may not be necessary to run every statistical test each time you build a regression model, it’s essential to get into the habit of running them whenever you suspect your data is weak, messy, or otherwise unlikely to suit linear regression without cleaning, feature engineering, or transformation.
Statistical tests for linear regression
Firstly, the data requires a linear relationship to be suited to linear regression. The dependent variable (y) should be a linear function of the independent variables (x), which are specified in the model. You cannot fit a linear model to non-linear data without risking serious errors.
Linearity can be inspected using plots of observed vs. predicted values or residuals vs. predicted values. Points should be distributed along the diagonal line (or, for residuals, around zero) with relatively constant variance. Bends or bows in the pattern indicate that the linearity assumption is not satisfied.
You can also use the Harvey-Collier test and Rainbow test. The Harvey-Collier test indicates whether the residuals are linear, while the Rainbow test discerns whether a linear model is appropriate even if some underlying relationships are not linear.
The Rainbow test is useful here; the basic idea is that even if the true relationship is non-linear, a good linear fit can be achieved on a subsample in the "middle" of the data. The test checks whether the fit improves when outer data points are removed, under the null hypothesis that the true model is linear. If the true model is not linear, the improvement gained by removing data will be greater than expected.
Regression analysis needs to isolate the relationship between each independent variable and the dependent variable. Multicollinearity, where independent variables correlate with one another (e.g., X1 correlates with X2), makes this difficult. It is one of the more flexible assumptions, as it can be present in the model with varying tolerance, as long as it’s not ‘perfect’. Multicollinearity leaves coefficient estimates unbiased but less efficient, inflating their variance, and can contribute to overfitting.
VIF and correlation matrices are used to detect multicollinearity. VIF measures how strongly each independent variable correlates with the others and is expressed as a single number starting at 1. A VIF of 1 indicates no correlation with the other features, a VIF between 1 and 5 indicates low to moderate correlation, a VIF over 5 may be a problem, and a VIF over 10 is almost definitely indicative of multicollinearity.
Removing features with high VIF is typically the simplest way to deal with multicollinearity.
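A minimal sketch of computing VIFs with statsmodels is shown below; the feature matrix is an illustrative assumption in which one column is deliberately built as a near copy of another, so it should show a high VIF:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy feature matrix (illustrative assumption): x3 is nearly a copy of x1
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.1, size=200)  # strongly collinear with x1
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF for feature i comes from regressing it on all the other features
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```

Here x1 and x3 produce large VIFs while the independent x2 stays near 1, so dropping one of the collinear pair would be the simple fix.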
Gaussian errors/Normality of residuals
In linear regression, errors are assumed to follow a normal distribution with a mean of zero.
If this assumption is violated, there are numerous issues with calculating confidence intervals (e.g., intervals may be too wide or too narrow), and it complicates other statistical tests, such as the t-test.
This assumption can be tested with the Jarque–Bera test, a goodness-of-fit test that determines whether the skewness and kurtosis of the residuals match a normal distribution. Q-Q plots of residuals are also helpful, as a bow-shaped pattern of deviation implies excessive skewness, indicating that there are too many large residuals skewed in one direction.
An S-shaped pattern instead implies excess kurtosis, indicating too many or too few large errors in both directions.
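The Jarque–Bera test is available in scipy. The sketch below compares well-behaved residuals against deliberately skewed ones (both samples are illustrative assumptions):

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=1000)       # well-behaved residuals
skewed_resid = rng.exponential(size=1000)  # heavily skewed residuals

# H0: the sample's skewness and kurtosis match a normal distribution
stat_n, p_n = jarque_bera(normal_resid)
stat_s, p_s = jarque_bera(skewed_resid)

print(f"normal residuals: JB p-value = {p_n:.3f}")
print(f"skewed residuals: JB p-value = {p_s:.3g}")
```

For the skewed sample the p-value collapses toward zero, rejecting normality; the normal sample typically yields a large p-value.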
Homoscedasticity
Homoscedasticity, meaning ‘same variance’, describes errors whose variance is constant across the values of the independent variables.
Simply put, as the values of the independent variables change, the variance of the error term does not vary considerably or at all. Heteroscedasticity occurs when the size of the error term differs across the independent variables’ values.
Heteroscedasticity does not bias the OLS estimator, but it makes the results hard to trust: it increases the variance of the coefficient estimates, and the model’s standard errors fail to reflect this.
To detect heteroscedasticity, you can plot residuals vs. predicted (fitted) values. If the residuals fan out in a cone shape as fitted values grow, this indicates potential heteroscedasticity.
The Breusch-Pagan and Goldfeld-Quandt tests can also be used. Both produce a p-value; if it falls below a chosen significance level (typically 0.05), you reject the null hypothesis of homoscedasticity, indicating heteroscedasticity. If the p-value is not below 0.05, you fail to reject the null, meaning there is insufficient evidence of heteroscedasticity.
Independent errors/No autocorrelation of residuals
It’s crucial for errors to be independent in time series models. Serial correlation in the residuals could be caused by other violations (e.g., in linearity), or due to bias caused by omitted variables (e.g. to reduce multicollinearity). Moreover, if residuals have the same patterns under particular conditions, the model will likely under or over-predict when predictors are configured in a certain way.
This is testable with autocorrelation plots, the Durbin–Watson test, and the Ljung–Box test. Ideally, for autocorrelation plots, most residual autocorrelations should fall within the 95% confidence band around zero, roughly ±2/√n, where n is the sample size. So, for a sample size of 50, autocorrelations should fall within about ±0.3; for a sample size of 100, within about ±0.2, and so on.
The Durbin-Watson test produces a value between 0 and 4, with values near 2 indicating no autocorrelation. If the result is well below 2, there is evidence of positive autocorrelation; if well above 2, there is evidence of negative autocorrelation.
Using statistical tests
There are a number of statistical tests you can run for linear regression, but there are three we need to focus on the most.
The variance inflation factor, or VIF, helps us identify multicollinearity: variables that are correlated with other variables in the dataset. We're looking for a low value (ideally below 5) for each variable; values above 10 indicate definite multicollinearity.
The Breusch-Pagan test identifies heteroscedasticity, where we see patterns in the errors. This can indicate that a variable is missing from our model.
A Q-Q plot enables you to visually inspect the non-normal errors flagged by the Jarque–Bera and Omnibus tests, which can indicate non-linearities in the model.
The benefits of statistical tests
Statistical tests in the context of linear regression help practitioners build reliable, accurate, and efficient models. While linear regression is relatively simple compared to other algorithms, especially when employing the OLS method, it’s still not guaranteed to produce accurate results even when you believe your data will suit linear regression.
As is often the case, the devil is in the details, and it’s essential to objectively identify potential problems in your data. Specifically, there are five assumptions for linear regression that should be met to ensure LR is appropriate and trustworthy. All assumptions can be validated using statistical tests.
It’s important to highlight that statistical tests for linear regression should inform decisions. The goal is to highlight potentially catastrophic issues and find opportunities for improving the model. By fixing issues now, you’ll avoid having to backtrack and troubleshoot errors in retrospect. It’s best to be proactive about confronting issues with data - it’s quicker to fix issues before you dive into building your models!
The disadvantages of statistical tests
Conducting all of these statistical tests is long-winded and cumbersome. Of course, Python expedites the process of conducting statistical tests, and once you know how to interpret the results, the process is not particularly time-consuming.
In general, interpreting statistical tests for linear regression still relies on intuition. Depending on the model's intent, the data, and the problem space, you can apply varying degrees of leniency when interpreting results.
Summary: Statistical tests for linear regression
There are numerous statistical tests for linear regression assumptions; whether you can or want to perform every single one depends on your data, the intent of the model, and the problem space.
If your model is bound to inform serious decisions, it’s best to put as much time into testing these assumptions as you can. The chances are that you’ll find at least one or two avenues to explore for greatly enhancing your model while also transforming and enriching your data for future models.