Homoscedasticity essentially means ‘same variance’ and is an important concept in linear regression.
Homoscedasticity describes a situation in which the variance of the error term (the noise or disturbance in the relationship between the independent and dependent variables) is the same across all values of the independent variables. So, under homoscedasticity, the spread of the residuals is constant across observations, i.e., the variance is constant. In simple terms, as the value of the independent variable changes, the size of the error term does not vary much.
In contrast, heteroscedasticity occurs when the size of the error term differs across values of the independent variable. Heteroscedasticity may lead to inaccurate inferences and occurs when the standard deviations of a predicted variable, monitored across different values of an independent variable, are non-constant.
Heteroscedasticity is an issue for linear regression because ordinary least squares (OLS) regression assumes that residuals have constant variance (homoscedasticity).
Heteroscedasticity doesn’t bias the coefficient estimates, but it makes the results of a regression analysis harder to trust. More specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the usual OLS standard errors fail to reflect this, so confidence intervals and p-values can be misleading. Homoscedasticity and heteroscedasticity sit on a spectrum: the more the error variance changes across observations, the further the data are from homoscedasticity.
Detecting homoscedasticity and heteroscedasticity
If you see patterns in the errors (for example, if they fan out over time and create a classic cone shape), that's an indication of heteroscedasticity. Seeing patterns in the errors is an indication there's something missing from your model that is generating those patterns.
There are many tests for heteroscedasticity, but you can also just plot the errors against predicted values and visually detect the hallmark pattern of heteroscedasticity. The Breusch-Pagan and Goldfeld-Quandt tests are two formal tests for heteroscedasticity.
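As a rough illustration of how such a test works, here is a minimal sketch of the studentized (Koenker) form of the Breusch-Pagan test for a single predictor, using only the Python standard library. It regresses the squared residuals on the predictor and uses n times the R-squared of that regression as the test statistic. The simulated data, the helper names (fit_line, r_squared, breusch_pagan) and the noise levels are assumptions made for this example, and a statistics package would normally be used in practice.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x. Returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Share of the variance of y explained by a straight-line fit on x."""
    intercept, slope = fit_line(x, y)
    mean_y = statistics.fmean(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for a single predictor.

    Returns (LM statistic, p-value). Small p-values suggest heteroscedasticity.
    """
    intercept, slope = fit_line(x, y)
    squared_residuals = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    # LM = n * R^2 of the auxiliary regression; clamp tiny negative rounding error.
    lm = max(0.0, len(x) * r_squared(x, squared_residuals))
    # With one predictor, LM is compared to a chi-square with 1 degree of
    # freedom, whose upper tail probability is erfc(sqrt(LM / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))


# Hypothetical data: one series whose noise grows with x, one with constant noise.
rng = random.Random(0)
x = [1 + 99 * i / 399 for i in range(400)]
y_fanning = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]
y_constant = [2 + 3 * xi + rng.gauss(0, 10) for xi in x]

print("fanning noise :", breusch_pagan(x, y_fanning))
print("constant noise:", breusch_pagan(x, y_constant))
```

A small p-value for the fanning series is evidence against constant variance. A large p-value does not prove homoscedasticity; it only means this particular test found no linear relationship between the squared residuals and the predictor.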
A simple method to detect heteroscedasticity is to create a fitted value vs residual plot. Once you fit your regression line to your dataset, you can create a scatterplot that shows the model's fitted values against the residuals of those fitted values.
The example plot below indicates heteroscedasticity, with its classic cone or fan shape.
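The same check can be done numerically when a plot is not convenient. The sketch below fits a line, orders the observations by fitted value and reports the residual standard deviation in each of several equal-sized bins. The data are simulated, and the function names and the choice of four bins are assumptions for illustration only.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x. Returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_spread_by_bin(x, y, n_bins=4):
    """Residual standard deviation within equal-sized bins of fitted values."""
    intercept, slope = fit_line(x, y)
    # Pair each fitted value with its residual and order by fitted value,
    # which mirrors reading a fitted-vs-residual plot from left to right.
    pairs = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    size = len(pairs) // n_bins
    spreads = []
    for i in range(n_bins):
        residuals = [resid for _, resid in pairs[i * size:(i + 1) * size]]
        spreads.append(statistics.pstdev(residuals))
    return spreads


# Hypothetical data whose noise standard deviation is proportional to x.
rng = random.Random(0)
x = [1 + 99 * i / 399 for i in range(400)]
y = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

spreads = residual_spread_by_bin(x, y)
print([round(s, 1) for s in spreads])
```

If the spreads climb steadily from the first bin to the last, the residuals are fanning out, which is the numeric counterpart of the cone shape in the plot.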
The importance of homoscedasticity
Homoscedasticity occurs when the variance in a dataset is constant, making it easier to estimate the standard deviation and variance of the dataset.
This means that when you measure the variation in a dataset, there is no difference between the variation in one part of the data and another. It also means there is no difference in variation between different samples from the same population.
Homoscedasticity is a key assumption for employing linear regression analysis. To validate the appropriateness of a linear regression analysis, homoscedasticity must hold within a certain tolerance.
That said, it’s also important to note that OLS regression can tolerate some heteroscedasticity. One rule of thumb suggests that “the highest variability shouldn’t be greater than four times that of the smallest.”
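One way to apply this rule of thumb is to split the residuals into groups (for example, by ranges of the fitted values) and compare the largest group variance with the smallest. The sketch below assumes that "variability" means variance, which the rule as quoted does not specify. The residual values and function names are made up for illustration.

```python
import statistics


def variance_ratio(groups):
    """Largest sample variance divided by the smallest, across residual groups."""
    variances = [statistics.variance(group) for group in groups]
    return max(variances) / min(variances)


def within_rule_of_thumb(groups, limit=4.0):
    """True if the largest group variance is at most `limit` times the smallest."""
    return variance_ratio(groups) <= limit


# Hypothetical residuals from the low and high ends of the fitted values.
low_end = [-1.0, 0.5, 1.0, -0.5, 0.0]
high_end = [-3.0, 1.5, 3.0, -1.5, 0.0]

ratio = variance_ratio([low_end, high_end])
print(ratio, within_rule_of_thumb([low_end, high_end]))
```

Here the high-end residuals are three times as spread out as the low-end ones, so the variance ratio is nine and the rule of thumb flags the data.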
Why does heteroscedasticity occur?
Heteroscedasticity occurs for many reasons, but the cause often lies in the dataset itself.
Models that utilize a wider range of observed values are more prone to heteroscedasticity. This is generally because the difference between the smallest and largest values is more significant in these datasets, thus increasing the chance of heteroscedasticity.
For example, consider one dataset with values ranging from 1,000 to 1,000,000 and another with values ranging from 100 to 1,000. In the latter dataset, a 10% change at the top of the range is 100, whereas in the former it is 100,000. Therefore, larger residuals are more likely to occur in the wider range and cause heteroscedasticity. This can be applied to a range of scenarios where a wide range of values are present, especially when those values change considerably over time.
Suppose you’re analyzing ecommerce sales over 30 years. Sales in the past ten or so years would be considerably higher than in earlier years, creating a much greater range, from a small number of sales per day to millions every day.
Such a dataset would likely skew residuals towards heteroscedasticity, as the errors expand as the range increases. You’d expect this to create the classic cone shape plot of heteroscedasticity.
Another generic example is a cross-sectional dataset. For example, if you compared the salaries of all UberEats drivers, there wouldn’t be a significant deviation as they all earn similar salaries. However, if you expand this to salaries across all occupations, the spread of values becomes much more uneven, risking heteroscedasticity and other issues besides.
It’s also important to note that issues in a model can masquerade as each other, so you can’t assume heteroscedasticity based on cursory expectation alone. For example, nonlinearity, multicollinearity, outliers, and non-normality can all produce similar symptoms.
Examples of heteroscedasticity
Here are three classic examples of heteroscedasticity:
- Suppose you have a dataset that includes the annual income and expenditure of 100,000 individuals. You want to see how annual income interacts with expenditure. Very generally, you’d expect those with more money to spend more money. In reality, however, those with lower incomes have lower variability in their expenditure. This is because the closer someone’s income comes to the cost of living, the less disposable income they have. As income increases, it eventually reaches a point where individuals have higher disposable income and, therefore, a greater choice on what they can spend their money on. But some high-income individuals choose not to spend their money, which means expenditure variability increases as income increases.
- Another example is a dataset that includes the populations of cities and how many bakeries they have. For a small city, a smaller number of bakeries is likely, with low variability. As city size increases, the variability in bakeries will increase. Some large cities may have many more bakeries than others, for example. A larger city might have anywhere between 100 and 5,000 bakeries, compared to just 1 to 25 in a smaller city.
- A third example is plotting income vs. age. Younger people generally have access to a lower range of jobs, and most will sit close to minimum wage. As the population ages, the variability in job access will only expand. Some will stick closer to minimum wage, others will become very successful, etc. This is demonstrated in the plot below, which shows the classic fan-like cone of heteroscedasticity.
In any of these cases, the standard errors from an OLS linear regression will be unreliable because of the greater variability in values across the higher ranges.
Heteroscedasticity is chiefly an issue with ordinary least squares (OLS) regression, which chooses coefficients to minimize the sum of squared residuals. Since OLS regression always gives equal weight to observations, when heteroscedasticity is present, observations with larger disturbances have more influence or pull than other observations.
In other words, OLS does not discriminate between the quality of the observations and weights each one equally, irrespective of whether they have a favorable or non-favorable impact on the line's location.
Firstly, as noted, heteroscedasticity and other linear regression problems can masquerade as each other. As such, it’s crucial first to check all five core assumptions of a regression model:
- Linearity: There needs to be a linear relationship between features and responses. The Harvey-Collier test and the Rainbow test can check this.
- No multicollinearity: Features must not be highly correlated with each other (e.g., house size and number of bedrooms). If this is not satisfied, coefficient estimates become unstable and hard to interpret.
- Gaussian errors: Errors are normally distributed with mean = 0. This is necessary for a range of statistical tests, e.g., the t-test. We can relax this assumption in large samples due to the central limit theorem. Testable with the Jarque-Bera test, which is based on skewness and kurtosis.
- Homoscedasticity: Errors have equal variance, and there is no pattern in the residuals (error). Testable with Breusch-Pagan and Goldfeld-Quandt.
- Independent errors: Errors are independent, which means there is no relationship between the residual for one observation and the residuals for the others. For example, in daily data, the error on one day should not help predict the error on the next day. Testable with Durbin–Watson and Ljung-Box.
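Some of these checks are simple enough to compute by hand. As one example, the Durbin–Watson statistic mentioned in the last item is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it for two made-up residual series; the values are purely illustrative.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order.

    Values near 2 suggest no first-order autocorrelation, values toward 0
    suggest positive autocorrelation, and values toward 4 suggest negative
    autocorrelation.
    """
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(r * r for r in residuals)
    return numerator / denominator


# Hypothetical residuals: one series drifts steadily, the other flips sign each step.
trending = [-2.0, -1.0, 0.0, 1.0, 2.0]
alternating = [1.0, -1.0, 1.0, -1.0]

print(durbin_watson(trending))
print(durbin_watson(alternating))
```

The drifting series gives a value well below 2 and the alternating series a value well above 2, so both would raise doubts about independent errors.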
Three common ways to fix heteroscedasticity
In some cases, making minor transformations to input data suffices - maybe some outliers can be removed. Otherwise, try the following:
1: Weighted ordinary least squares
Given the evident issues with OLS and heteroscedasticity, weighted ordinary least squares (WOLS), more commonly called weighted least squares (WLS), could be a solution. Here, more weight is assigned to observations with smaller error variance to obtain a better fit, so the coefficient estimators become more efficient. WOLS works by attaching an extra nonnegative constant (a weight) to each data point.
When implemented properly, weighted regression minimizes the sum of the weighted squared residuals. With weights inversely proportional to each observation's error variance, the weighted errors become homoscedastic. Finding the theoretically correct weights can be difficult, however.
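For a single predictor, the weighted fit has a simple closed form, sketched below with the standard library only. The example assumes the noise standard deviation is proportional to x, so the weights are set to 1 / x squared. In real data the variance structure is rarely known this precisely, and both the weights and the simulated data here are illustrative assumptions.

```python
import random


def weighted_fit_line(x, y, weights):
    """Minimize sum(w * (y - a - b*x)**2). Returns (intercept, slope)."""
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: true line y = 2 + 3x, noise standard deviation 0.5 * x.
rng = random.Random(0)
x = [1 + 99 * i / 399 for i in range(400)]
y = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

# Equal weights reproduce ordinary least squares.
ols = weighted_fit_line(x, y, [1.0] * len(x))
# Inverse-variance weights: variance is proportional to x**2 by assumption.
wls = weighted_fit_line(x, y, [1.0 / xi ** 2 for xi in x])

print("OLS intercept, slope:", ols)
print("WLS intercept, slope:", wls)
```

Both fits target the same line. The difference is that the weighted fit leans on the low-noise observations, so its estimates tend to vary less from sample to sample when the weights match the true variance pattern.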
2: Transform the dependent variable
Another way to fix heteroscedasticity is to transform the dependent variable. Take the bakery example above. Here, population size (the independent variable) is used to predict the number of bakeries in the city (the dependent variable). Instead, the population size could be used to predict the log of the number of bakeries in the city. Other transformations of the dependent variable, such as the square root, may also help.
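The effect of a log transformation can be seen on simulated data. The sketch below assumes the number of bakeries is proportional to population with multiplicative noise, and it compares the residual spread in the upper half of the data with the spread in the lower half, before and after taking logs. To keep the relationship linear after the transformation, this example also logs the population, which goes one step beyond the text above. All numbers and names are hypothetical.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x. Returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def spread_ratio(x, y):
    """Residual SD in the upper half of x divided by that in the lower half."""
    intercept, slope = fit_line(x, y)
    ordered = sorted(zip(x, y))
    residuals = [yi - intercept - slope * xi for xi, yi in ordered]
    half = len(residuals) // 2
    return statistics.pstdev(residuals[half:]) / statistics.pstdev(residuals[:half])


# Hypothetical cities: about one bakery per 1,000 people, with multiplicative noise.
rng = random.Random(1)
population = [10_000 + 990_000 * i / 399 for i in range(400)]
bakeries = [0.001 * p * math.exp(rng.gauss(0, 0.3)) for p in population]

raw_ratio = spread_ratio(population, bakeries)
log_ratio = spread_ratio(
    [math.log(p) for p in population],
    [math.log(b) for b in bakeries],
)

print("raw scale spread ratio:", round(raw_ratio, 2))
print("log scale spread ratio:", round(log_ratio, 2))
```

A ratio well above 1 on the raw scale reflects the fan shape, while a ratio close to 1 on the log scale indicates that the transformation has evened out the spread for this kind of noise.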
3: Redefine the dependent variable
Another method is to redefine the dependent variable. One possibility is to use the rate of the dependent variable rather than the raw value.
So, instead of using population size to predict the number of bakeries in a city, we use population size to predict the number of bakeries per capita. Here, we’re measuring the number of bakeries per individual rather than the raw number of bakeries.
However, this does pivot the question somewhat and might not always be the best option for a dataset and problem.
Summary: Homoscedasticity and heteroscedasticity
One assumption of linear regression is homoscedasticity. Homoscedasticity means the variance of the error is constant across the values of the independent variables. The easiest way to check homoscedasticity is to make a scatterplot of the residuals against the fitted values.
If a model violates homoscedasticity, it will exhibit heteroscedasticity. Heteroscedasticity isn’t always a fatal issue for OLS, as it can tolerate some heteroscedasticity. Each dataset and problem is different. Remember, heteroscedasticity does not necessarily occur in the absence of other issues - so don’t forget to check all linear regression assumptions before addressing heteroscedasticity.