The bias-variance tradeoff is fundamental in machine learning, data science, and statistical modeling.
Data and algorithms work together to produce a model, but numerous interactions ‘under the hood’ affect the model's performance and accuracy.
Predictive algorithms are designed to estimate the mapping function from input data to an output variable, but the accuracy of that mapping depends on the algorithm itself, its hyperparameter tuning, and the data.
Put simply, the bias-variance trade-off is the tension between bias error and variance error. A good model will generalize and predict well without too much bias or too much variance.
Bias is the difference between your model's average prediction and the true values. You of course expect your predictions to be close to the true values, but that isn't guaranteed: with the wrong model you won't learn the complex signals in the dataset, for example if you fit a linear regression to a variable with a non-linear effect, such as ad spend. High bias is a sign your model is too simple and is underfitting the data.
Variance refers to your algorithm's sensitivity to the specific set of data it was trained on, which determines how well it generalizes to data it hasn't seen before. High-variance models will produce dramatically different predictions depending on the dataset they saw during training. For example, if you have 50 data points and 50 parameters, your model could learn the data by assigning one parameter to each data point. The training accuracy would look good, but the model would be useless at predicting any new data, because it's overfitting the training data.
This is demonstrated with the regression model shown below.
Overview of Bias and Variance Error
1: Bias error
Bias error is the error between the average model prediction and the ground truth. Incorrect assumptions made during model training can produce a biased model whose mapping function is too simple to describe the data. You can see this from the above diagram - the high-bias, underfitted model is highly generalized and fails to describe the relationships between variables.
Very biased models are overly simple and fail to capture the relationships between variables - you might as well just draw a straight line through the middle!
- If a model has high bias, it'll typically feature low variance. High bias also implies underfitting, an intrinsically related concept (more on that shortly).
- If a model has low bias, it'll typically feature high variance, as it makes fewer assumptions about the target function.
- Biased models usually generalize poorly on both training and real-world data, as they're incapable of locating the relationship between variables in either.
High-bias algorithms are simpler, with key examples being logistic regression, linear regression, and linear discriminant analysis. These simple models are likely to miss complex structure in the data in favor of a simpler, biased fit.
In the above example, the Degree 1 fit (a polynomial of degree 1) produces high bias and underfitting. Increasing the degree of the polynomial lowers bias initially: Degree 4 represents an ideal fit, after which the model moves into overfitting territory; Degree 15 shows extremely high variance and overfitting.
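The degree comparison above can be sketched in a few lines. This is a minimal illustration on synthetic data (not the article's dataset): a smooth nonlinear trend plus noise, fitted with polynomials of degree 1, 4, and 15.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth nonlinear trend (synthetic, purely illustrative).
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)

train_mse = {}
for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x_train)
    train_mse[degree] = float(np.mean((y_hat - y_train) ** 2))

# Training error can only shrink as the degree grows: degree 1 underfits
# (high bias), while degree 15 chases the noise (high variance).
print(train_mse)
```

Training error falling with degree is exactly why training accuracy alone can't reveal overfitting; you need held-out data to see the degree-15 fit break down.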
2: Variance error
Variance error is pretty much the opposite of bias error and is intrinsically linked to the concept of overfitting. High variance models fit extremely well to the data, following variables from point to point, as we can see above.
When training a model, a high-variance model will initially appear to be accurate, but it fails to capture a relationship when exposed to new data. This is because high-variance models are trained to overfit the training data - they learn the specifics and noise in addition to the trend.
- Low variance algorithms influence the target function minimally when changes to the data are made.
- High variance algorithms over-exaggerate the mapping function when changes to the data are made.
- Complex algorithms typically start by overfitting (if the data is sufficiently complex) and then need to be tuned to generalize more efficiently. The reverse is true with simple algorithms.
Algorithms prone to overfitting out-of-the-box include k-nearest neighbors, decision trees, and support vector machines. These need to be tuned (e.g., by pruning the decision tree) to increase generalization.
Bias and variance are linked and bridged by the concepts of underfitting and overfitting - the bias-variance trade-off.
If two models achieve similar accuracy but one has higher variance, the one with lower variance is generally preferable, unless its bias is so high that it badly underfits. Higher bias means predictions are consistently wrong in the same direction, while higher variance means predictions swing unpredictably from one dataset to the next, making them unreliable for new data points.
The goal is to find a balance between these two extremes so that decisions can be made on new data points without being too confident or too uncertain.
How to solve underfitting
At its simplest, underfitting occurs when there are too few features or data points in the dataset. If the data is thin, the model risks bias and underfitting because there's simply insufficient data for the algorithm to find the trend.
The simplest way to fix a biased model is to add more data, adding variance in the process, assuming you have the data to add. But that's often an issue in itself, as complex models such as marketing mix models (MMMs) require long-term data spanning roughly three years.
Feature engineering is an excellent way to tackle the type of bias and underfitting created by sparse data. You can also consider splitting by geography to introduce more variance, i.e., geo-level modeling.
For example, in marketing mix modeling it's often suggested to have at least three years of historical data. Three years of observations across all the variables you need makes for a large dataset, and obtaining that much clean data is tricky. If you train models on insufficient data, bias is likely, resulting in poor generalization and predictions.
Therefore, building variance into the model is imperative, which often requires significant feature engineering. Geo-level modeling is one way to enrich datasets with variance, which involves taking different data (e.g., sales data) from different regions (e.g., if you have sales data for each state in the USA). If unavailable, that data can be engineered, e.g., through interpolation from existing figures.
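Engineering missing figures by interpolation, as mentioned above, can be sketched simply. This is a minimal illustration with hypothetical monthly sales for one region; it assumes gaps sit between known values (not at the ends of the series).

```python
# Hypothetical monthly sales for one region, with a missing observation (None).
monthly_sales = [120.0, 130.0, None, 150.0, 155.0]

def interpolate_gaps(series):
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            lo = max(j for j in range(i) if filled[j] is not None)
            hi = min(j for j in range(i + 1, len(series)) if series[j] is not None)
            frac = (i - lo) / (hi - lo)
            filled[i] = filled[lo] + frac * (series[hi] - filled[lo])
    return filled

# The gap is filled with the value halfway between its neighbors.
print(interpolate_gaps(monthly_sales))
```

Interpolated values add observations, not information, so they should be used to fill small gaps rather than to manufacture large stretches of data.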
Here are ways to tackle underfitting:
- Choose a more complex model or algorithm: First, tune your algorithm or switch to a more complex one to get the most out of the data you have. Sometimes model optimization and hyperparameter tuning are all you need to fit your model better; other times you'll need to look at the data itself, or both.
- Feature engineering: Adding engineered features to the data is a great way to increase variance while also modeling useful variables that don't already exist in your data. In machine learning, when working with unstructured data, you can also generate new training examples from data you already own (e.g., via image transformations).
- Decrease regularization: Techniques that reduce variance in noisy datasets can push the model too far towards bias. Be cautious not to over-regularize and tune out real signal along with the noise.
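The first two remedies above can be seen in one small example: a straight line underfits data generated by a quadratic trend, and engineering an x² feature fixes it. The data is synthetic and noiseless, purely to make the effect obvious.

```python
import numpy as np

# Synthetic data generated by a quadratic trend; a straight line can't capture it.
x = np.linspace(-3, 3, 30)
y = 2.0 * x**2 + 1.0   # noiseless, purely for illustration

def fit_mse(features, y):
    """Least-squares fit on the given feature matrix, returning training MSE."""
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    return float(np.mean((features @ coeffs - y) ** 2))

linear = np.column_stack([np.ones_like(x), x])            # intercept + x
engineered = np.column_stack([np.ones_like(x), x, x**2])  # add an x^2 feature

# The engineered model can represent the true curve, so its error collapses;
# the linear model is left with large, irreducible-looking bias error.
print(fit_mse(linear, y), fit_mse(engineered, y))
```

The algorithm is identical in both fits; only the feature set changed, which is the point: underfitting is often a data-representation problem, not an algorithm problem.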
In the above example of nonlinear regression, the green line represents an overfitted trend that rigidly follows variables. The black line represents the ideal trend that generalizes well to the data.
How to solve overfitting
Overfitting typically occurs when datasets are noisy, dirty, or highly fluctuating. An excessively complex model will also overfit if the data is too simple for it.
Overfitted models usually exhibit high accuracy during training but low accuracy when exposed to new data.
The end result is a model which cannot generalize to data. Ideally, you’d detect overfitting in the training phase using k-fold cross-validation. This will show you how well your model generalizes to different folds of your data.
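The k-fold splitting behind cross-validation is simple to sketch. This is a minimal, unshuffled version for clarity (in practice you'd shuffle or stratify first): every point lands in exactly one validation fold, so each score is computed on data held out of that fold's training set.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (no shuffling, for clarity)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    folds = k_fold_indices(n, k)
    for i, val in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

# With n=10 and k=5, each split trains on 8 points and validates on 2.
splits = list(cross_validate(10, 5))
print([(len(tr), len(va)) for tr, va in splits])
```

A large gap between training scores and the cross-validated scores across these folds is the classic signature of overfitting.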
Here are ways to tackle overfitting:
- Add data: While this seems contradictory, adding more data may help reinforce the underlying trends, helping the model generalize. The opposite may occur if the additional data doesn't 'iron out' noise and outliers but reinforces them instead. Ultimately, if your data isn't clean, adding new data won't help. Obtaining a sufficient quantity of clean data with adequate, but not excessive, variance is essential for avoiding overfitting from the start.
- Feature engineering: If your data has too many features, you can remove some. If you've added features 'just because you can' and they made your model excessively complex, consider dropping them. You can also engineer features that reduce variance, e.g., discretization, which transforms continuous variables into discrete categories.
- Regularization: Regularization uses various methods to reduce variance. Some examples include pruning a decision tree, using dropouts on a neural network, or adding penalties to functions in regression.
- Ensemble training: In ML, bagging trains multiple strong learners - models prone to overfitting - on resampled data and averages their predictions, smoothing out the complexity of any single model. Boosting instead combines weak learners into a strong learner, which helps combat underfitting.
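The regularization bullet above can be made concrete with ridge regression, whose closed form adds a penalty term directly to the least-squares solution. This is a minimal numpy sketch on synthetic data; the coefficient vector shrinks toward zero as the penalty grows.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic regression problem: 30 observations, 5 features, known true weights.
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 0.1, 30)

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Increasing the penalty shrinks the coefficients toward zero,
# trading a little bias for a reduction in variance.
norms = {lam: float(np.linalg.norm(ridge(X, y, lam))) for lam in (0.0, 1.0, 100.0)}
print(norms)
```

With lam=0 this is ordinary least squares; the penalty strength is the dial that moves the model along the bias-variance trade-off.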
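Bagging, the last bullet above, can be sketched without any library: fit the same simple model on many bootstrap resamples and average the results. The slope-through-origin "model" and the noisy data here are invented for illustration.

```python
import random

random.seed(42)

# Hypothetical noisy 1-D regression data: y = 3x plus Gaussian noise.
data = [(x / 10, 3 * x / 10 + random.gauss(0, 0.5)) for x in range(20)]

def fit_slope_through_origin(sample):
    """Least-squares slope for y = w * x (a deliberately simple 'model')."""
    sxx = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    return sxy / sxx

def bagged_slope(data, n_models):
    """Train each model on a bootstrap resample, then average their slopes."""
    slopes = []
    for _ in range(n_models):
        boot = [random.choice(data) for _ in data]  # sample with replacement
        slopes.append(fit_slope_through_origin(boot))
    return sum(slopes) / len(slopes), slopes

avg, slopes = bagged_slope(data, 50)
# The bagged estimate is the mean of the individual fits, so it sits inside
# their range and is more stable than any single bootstrap fit.
print(round(avg, 3))
```

Averaging leaves each model's bias intact but shrinks the variance of the combined prediction, which is why bagging targets overfitting rather than underfitting.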
Summary: Bias-variance trade-off
The bias-variance trade-off is traditionally discussed in machine learning but is relevant when building any predictive model.
Bias and variance error exist in a trade-off and are linked to the concepts of overfitting and underfitting. A biased model will tend to underfit the data, generalizing poorly on training and test data. Biased models fail to find the trend in the data.
Conversely, high-variance models tend to overfit the data, which results in high accuracy on training data but poor accuracy on unseen data. This is because high-variance models learn the fluctuations, outliers, and noise in the dataset - patterns that don't carry over to unseen datasets.
It’s possible to reduce both bias and variance using a combination of model optimization, hyperparameter tuning, feature engineering, regularization, and ensemble training. Obtaining more real data or engineering new data from existing data is one of the best ways to improve overall model accuracy, adding just enough variance without risking overfitting.