Linear Regression

Linear regression is a simple, dependable technique for analyzing the relationship between linearly related variables.

Linear regression is a statistical modeling technique with roots in the 19th century, when Francis Galton described the concept of regression towards the mean. Statisticians Udny Yule and Karl Pearson later formalized regression analysis, of which linear regression is the most common form.

Linear regression (or linear regression analysis) is used to predict a variable's value from another variable's value. The variable linear regression attempts to predict is called the dependent variable; the variables used to predict it are called the independent variables.

There are two broad forms of linear regression: simple linear regression and multiple linear regression.

  • When there is a single input variable, the analysis is described as simple linear regression. 
  • When there are multiple input variables, the method is called multiple linear regression.

Linear regression is a simple and robust way to predict future variables and is used in a huge range of fields, such as the physical and biological sciences, economics, psychology, business and marketing. 

Linear regression in marketing

Linear regression is one of the most-used techniques in marketing, as it enables marketers to model the relationship between inputs and outputs. Since linear regression is clean and simple to calculate, it’s pretty straightforward to mobilize in one’s marketing strategy, even without a background in statistics or math.

For example, we know that Facebook ad spend is correlated with organic installs for an app, but do we know how much a change in one affects the other?

For that, we can use linear regression, which lets us model the relationship between spend and installs. The output is an equation we can plug numbers into to predict how many organic installs we'll get for a given budget.
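As an illustrative sketch (the spend and install figures below are made up, not real data), NumPy's polyfit can produce exactly that equation:

```python
import numpy as np

# Hypothetical weekly figures (illustrative only, not real data)
spend = np.array([100, 200, 300, 400, 500], dtype=float)      # ad spend in $
installs = np.array([220, 410, 630, 790, 1010], dtype=float)  # organic installs

# Fit installs = m * spend + c by least squares
m, c = np.polyfit(spend, installs, deg=1)

# Plug a planned budget into the fitted equation
predicted = m * 600 + c
print(f"installs = {m:.2f} * spend + {c:.2f}")
print(f"predicted installs at $600 spend: {predicted:.0f}")
```

The fitted slope tells us how many extra installs each additional dollar of spend is associated with, and plugging any budget into the equation yields a prediction.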

In a real marketing situation, multiple independent variables would likely affect sales, so multiple linear regression would be the better fit. In that case it's also essential to be aware of multicollinearity, which occurs when different campaigns run at or around the same time.

Linear regression in machine learning

Linear regression is also one of the most basic machine learning algorithms. While linear regression is fundamentally a statistical technique or method, machine learning ‘borrows’ it as an algorithm. 

Developing a predictive model for output variables using input variables helps marketers ‘predict the future’ and make data-based decisions. However, linear regression has its fair share of limitations too, which we’ll investigate shortly. 

Breakdown of linear regression 

(Figure: a linear regression line fitted through a scatter of data points)

In the above example, suppose the y-axis shows sales and the x-axis shows marketing budget.

The standard equation of the linear regression line is y = mx + c.

  • y is the dependent variable (usually measured on an interval or ratio scale).
  • x is the independent variable (usually measured on an interval or ratio scale).
  • m is the slope or gradient, which is positive if the line goes up as x increases and negative if it goes down.
  • c is the intercept (the point where the line crosses the y-axis).

Finding the best-fit line involves the ordinary least squares method, which chooses the line that minimizes the sum of squared vertical distances from the data points to the line. Performing linear regression tasks is simple in Python, thanks to libraries such as NumPy.
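The ordinary least squares solution can be written in closed form, and a short sketch (with made-up budget and sales figures) shows it directly in NumPy:

```python
import numpy as np

# Illustrative data (x = marketing budget, y = sales; made up for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary least squares in closed form:
#   m = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),   c = ȳ - m·x̄
x_bar, y_bar = x.mean(), y.mean()
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
c = y_bar - m * x_bar

# The residuals are the vertical distances whose squares OLS minimizes
residuals = y - (m * x + c)
print(m, c, np.sum(residuals ** 2))
```

Any other choice of slope and intercept would produce a larger sum of squared residuals for these points.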

Assumptions of linear regression

For the straight-line model of linear regression to be appropriate, certain assumptions must hold for the data, such as:

  • Error terms are independent of each other. 
  • Error terms are normally distributed or form a bell-shaped curve. 
  • Error terms exhibit constant variance (homoscedasticity).
  • There is a linear relationship between the independent (x) and dependent (y) variables.

You can check for these assumptions as follows:

  1. Variables in the dataset should be measured at a continuous level, e.g., time, sales, weight, and test scores.
  2. Use a scatter plot to discover whether there is a linear relationship between the two variables.
(Figure: discovering linear relationships with scatter plots)
  3. Observations should be independent of each other.
  4. There should be no significant outliers.
  5. Check for homoscedasticity, meaning the variance around the best-fit regression line remains similar along the line.
(Figure: homoscedasticity implies constant variance along the line)
  6. The residuals, or errors, of the best-fit regression line should follow a normal distribution.
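Two of these checks can be sketched numerically in NumPy (with illustrative data): after fitting, the residuals should center on zero, and their spread should be similar across the range of x.

```python
import numpy as np

# Fit a line to illustrative, made-up data, then inspect the residuals
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.2, 3.9, 6.1, 8.2, 9.8, 12.1, 14.0, 16.2])
m, c = np.polyfit(x, y, 1)
residuals = y - (m * x + c)

# With an intercept term, OLS residuals always average to zero
print(residuals.mean())

# Similar spread in the left and right halves suggests homoscedasticity
left, right = residuals[:4], residuals[4:]
print(left.std(), right.std())
```

In practice a residual plot (residuals against x) makes these patterns visible at a glance; the numeric comparison above is just a rough stand-in.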

The benefits of linear regression

Linear regression is a statistical technique that allows for the study of the relationship between one or more independent variables and a dependent variable. The independent variables are also known as "predictors," while the dependent variable is also known as "the outcome."

Linear regression is reliable, but requires the data to fit certain assumptions. The benefits of linear regression include:

  • It can be used to make predictions about future values of the outcome.
  • It can be used to identify which predictor(s) significantly affect the outcome.
  • It can be used to quantify how much each predictor affects the outcome.
  • It provides an estimate of how much change in one variable is associated with change in another, known as the regression coefficient.

The disadvantages of linear regression

Linear regression fits a linear equation that can be used to predict the relationship between two variables. It has many advantages, such as being easy to understand and use. However, it also has some disadvantages.

  • One disadvantage is that it doesn't account for non-linear relationships between two variables. In this situation, non-linear regression is appropriate. 
  • Another disadvantage of linear regression is that the slope and intercept depend on the units of measurement and can be difficult to interpret without scaling these variables first.
  • Multicollinearity occurs when two or more independent variables are strongly correlated in a regression model, making it tough or impossible to discern cause and effect in the model. This is typical in marketing because campaigns usually trigger at the same time or in close sequence, making it hard to discern what inputs are producing what outputs. 
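A short sketch with synthetic data illustrates the multicollinearity problem: when two "campaigns" move almost in lockstep, the individual coefficients become unreliable even though their combined effect is recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
tv = rng.normal(100, 10, n)
# A second campaign launched alongside the first: almost a copy of it
social = tv + rng.normal(0, 0.01, n)
sales = 2.0 * tv + 1.0 * social + rng.normal(0, 1, n)

# Fit sales on both predictors (plus an intercept column)
X = np.column_stack([np.ones(n), tv, social])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef)  # individual slopes are unstable, but their sum stays near 3
```

The model can tell you what the two channels do together, but not how to split credit between them, which is exactly the attribution problem marketers face with overlapping campaigns.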

The importance of linear regression

Linear regression is a statistical technique that allows the analyst to draw conclusions about the relationship between two variables. Its importance lies in prediction: it can be used to forecast future values of one variable based on past observations of the other.

Three non-marketing examples of linear-regression success

Price elasticity 

Prices affect consumer behavior. For example, if you change the price of a particular product, regression shows you what happens to sales as you drop or increase the price. 

Of course, you’d expect sales to drop as you increase the price, but regression analysis can tell you how steeply sales drop, and what the maximum price is that maintains a specific revenue. Regression is often used in retail to analyze price changes and other inputs alongside consumer outputs, e.g., sales. 
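As a hedged sketch with made-up price and sales figures: a fitted demand line quantifies how steeply sales drop, and since revenue = price × units, a linear demand line implies revenue peaks at p = -c / (2m).

```python
import numpy as np

# Illustrative price/sales observations (made up): sales fall as price rises
price = np.array([5., 6., 7., 8., 9., 10.])
units = np.array([980., 890., 810., 700., 620., 500.])

# Fit units = m * price + c to quantify how steeply sales drop
m, c = np.polyfit(price, units, 1)
print(f"each $1 price increase loses about {-m:.0f} units")

# Revenue R(p) = p * (m*p + c) is maximized where dR/dp = 2*m*p + c = 0
best_price = -c / (2 * m)
print(f"revenue-maximizing price is roughly ${best_price:.2f}")
```

This is only a sketch: real demand curves are rarely exactly linear over a wide price range, so the peak should be read as a local estimate near the observed prices.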

Risk analysis

Linear regression is frequently deployed in insurance and other risk-based businesses and business models. For example, insurers use regression to investigate insurance claims and build models estimating claim costs. This enables the insurers to make data-backed decisions about what risks to take on by predicting the cost of potential claims. 

Sports analysis

Linear regression is used to analyze sports. For example, you could find out if the number of games won by a sports team is related to the average number of points or goals they score, discovering whether there’s a linear relationship or a threshold where winning is more likely or not. Regression is used by sports betting companies to set their odds and analyze risk. 

Summary: Linear Regression

Linear regression (or linear regression analysis) is used to predict the value of a variable from the value of another variable. 

This simple yet robust statistical technique is excellent at predicting future values, provided the dataset adheres to assumptions that can be checked by plotting the data as a scatter plot.

Performing regression analysis in Python is simple with NumPy and Scikit-learn, especially when handling many different variables in a multiple linear regression analysis.
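As a sketch with Scikit-learn (using made-up multi-channel data), multiple linear regression simply takes a matrix of inputs instead of a single column:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative synthetic data: two ad channels driving sales
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(50, 2))  # columns: e.g. TV spend, search spend
sales = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 20 + rng.normal(0, 5, 50)

# Fit the multiple linear regression and inspect what it recovered
model = LinearRegression().fit(X, sales)
print(model.coef_, model.intercept_)  # roughly [3.0, 1.5] and roughly 20
print(model.predict([[50, 40]]))      # predicted sales for a new budget mix
```

Because the data were generated from a known linear rule, the fitted coefficients land close to the true values, which is the behavior you rely on when the assumptions hold.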

 

Frequently Asked Questions

What is linear regression and how does it work?

Linear regression is the process of finding the line that best fits the data points on a plot, so that it can be used to predict output values for inputs not present in the dataset.

What is an example of linear regression?

Suppose you find that Instagram ad spend results in increased organic sales. Linear regression would enable you to plot ad spend and organic sales to predict how much sales would increase or decrease with higher investment in that channel.

When is linear regression a bad choice?

It is sensitive to outliers and poor-quality data, which real-world datasets often contain. If the number of outliers is substantial relative to the well-behaved data points, the fitted model will be skewed away from the true underlying relationship.
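A small sketch of that sensitivity, using made-up data: a single corrupted point is enough to pull the fitted slope well away from the true value of 2.

```python
import numpy as np

# Clean, perfectly linear data (illustrative): y = 2x + 1
x = np.arange(10, dtype=float)
y = 2 * x + 1
m_clean, _ = np.polyfit(x, y, 1)

# Corrupt one observation and refit
y_dirty = y.copy()
y_dirty[9] = 100.0  # should have been 19
m_dirty, _ = np.polyfit(x, y_dirty, 1)
print(m_clean, m_dirty)  # the slope shifts noticeably away from 2
```

Because OLS minimizes squared errors, one large deviation dominates the objective, which is why outlier screening is part of the assumption checks above.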

What is logistic regression?

Logistic regression is a statistical method used in machine learning, pattern recognition, and predictive analytics. It is used for binary classification, that is, to predict an outcome that can take on the values of 0 or 1.

What is the difference between logistic and linear regression?

In linear regression we fit a straight line to the points, whereas in logistic regression we fit a sigmoid curve.
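A minimal sketch of the two shapes (with illustrative parameters, not fitted models) makes the difference concrete:

```python
import numpy as np

def line(x, m=1.0, c=0.0):
    # Linear regression fits this shape: an unbounded straight line
    return m * x + c

def sigmoid(x):
    # Logistic regression fits this shape: outputs squashed into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 5)
print(line(x))     # ranges freely over the reals
print(sigmoid(x))  # always between 0 and 1, usable as a probability
```

The bounded output is why logistic regression suits binary classification: its predictions can be read as probabilities of the outcome being 1.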

How does linear regression work?

In statistics and machine learning, linear regression is a method of estimating the relationship between a dependent variable (y) and one or more independent variables (x). We can estimate this relationship by finding a "best fit" line for our data points.