Introduction to Linear Regression
The equation y = b0 + b1x seems familiar, doesn't it? We have all seen it in school while studying geometry or statistics. Its technical definition: linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.
In simple words, it means we can use linear regression to predict y using x (or multiple x's). For example, we can predict marks (y) using hours studied (x), or house prices (y) using the area of the house (x).
For those who do not know this equation (and who may also find this article quite advanced): y stands for the dependent variable (which we predict), x stands for the independent variable (which helps in predicting y), b0 is the constant/intercept and b1 is the coefficient/slope. If all this sounds new to you, please visit Khan Academy (highly recommended) or YouTube (Krish Naik's playlist and others).
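To make b0 and b1 a little more concrete, here is a minimal sketch that fits the line by least squares using only the Python standard library. The hours/marks numbers and the helper name `fit_line` are made up for illustration; they are not from any real dataset.

```python
from statistics import mean


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: hours studied (x) and marks scored (y).
hours = [1, 2, 3, 4, 5]
marks = [38, 44, 56, 64, 73]

b0, b1 = fit_line(hours, marks)
predicted_marks = b0 + b1 * 6  # prediction for 6 hours of study
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, prediction = {predicted_marks:.1f}")
```

Here b1 says how many extra marks go with one extra hour of study, and b0 is the predicted score at zero hours.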
This is the first blog of a two-part series. This one is theoretical and explains the regression assumptions in detail; the second will focus on the coding part, i.e. how we check these assumptions while doing statistical analysis with Python. Linear regression rests on 5 main assumptions: Linearity, No Endogeneity, Normality & Homoscedasticity, No Autocorrelation, and No Multicollinearity. We'll discuss each of them in brief in a minute; let's first discuss why we even need to analyze regression.
Why do we need Regression Analysis?
Regression is not only used for prediction. With the rise of machine learning, a lot of folks first studying these concepts think that we just fit a best-fitting line and start predicting! We can obviously do that, but we'd be highly underestimating the power of regression. Fitting a good line is indeed a very important part, but only a part of the process. Regression can tell us about the whole data.
Regression analysis tells 2 very important things -
Another very useful application of regression is that it can help us spot all sorts of patterns in the data. Apart from prediction and forecasting, these new insights can be extremely valuable in understanding what can make a difference in your business.
Now, one point where a lot of people get confused is correlation versus causation. If you find that a variable y is highly correlated with x, it does NOT mean that one is causing the other.
For example, your marks in Mathematics over the months can be positively correlated with the sales of Puma shoes over the same months. Does this mean one is causing the other? Absolutely not!
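As a small illustration of the point above, the sketch below computes the correlation between two series that simply both rise over six months. The numbers are invented and `pearson_r` is a hand-written helper, so treat it as a toy demonstration rather than an analysis.

```python
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


# Made-up monthly figures: both series just happen to go up over time.
maths_marks = [50, 55, 61, 64, 70, 76]
shoe_sales = [200, 230, 250, 290, 310, 340]

r = pearson_r(maths_marks, shoe_sales)
print(f"correlation = {r:.2f}")  # close to 1, yet neither causes the other
```

The correlation comes out very high only because both series trend upwards with time, not because marks drive shoe sales or the other way round.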
So we have tools like hypothesis testing / A/B testing, the ANOVA F-test on regression, regression analysis etc. to test whether an observed relationship is statistically significant rather than the result of random chance. However, we are going to discuss only regression analysis here.
Assumptions of Linear Regression
Linearity
This assumption states that the relationship between the dependent variable and the predictors can be described by a straight line.
To check this assumption, plot the dependent variable (y) against each predictor variable (x); the data points should fall in a pattern that a straight line can fit.
If you see some kind of curve in the data points, like this (please ignore the axis labels; the picture is only for reference),
then you can apply a transformation such as exponential or log. With logs, you can use a semi-log transformation (take the log of only one axis) or a log-log transformation, where both axes are converted to logs, i.e. x --> log(x) and y --> log(y); in the log-log case the slope is interpreted as an elasticity. After the transformation you can fit a straight line.
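Here is a rough sketch of the log-log idea, assuming made-up data that follows a power law y = 2 * x^1.5. The data and the helper `fit_line` are hypothetical; the point is only that the curve becomes a straight line once both axes are logged.

```python
from math import exp, log
from statistics import mean


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up curved data: y = 2 * x**1.5, which is not a straight line in x.
x = [1, 2, 4, 8, 16]
y = [2 * xi ** 1.5 for xi in x]

# Log-log transformation: x --> log(x) and y --> log(y).
log_x = [log(xi) for xi in x]
log_y = [log(yi) for yi in y]

b0, b1 = fit_line(log_x, log_y)
# After the transformation, log(y) = log(2) + 1.5 * log(x) is exactly linear,
# so the slope recovers the exponent and exp(b0) recovers the multiplier.
print(f"slope = {b1:.2f}, multiplier = {exp(b0):.2f}")
```

In the log-log model the slope reads as "a 1% change in x goes with roughly a b1% change in y", which is the elasticity interpretation.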
We often skip this check, since OLS fits a straight line to the data points regardless; a quick plot is still the easiest way to see whether that line is appropriate.
No Endogeneity
This assumption checks that the errors and the independent variables are not correlated.
If you have endogeneity in your model, it means that the errors and the regressors (x's) are correlated. A common cause of this is omitted variable bias (OVB).
For example, suppose our equation is --> Y = A + Bx (meaning y is explained by x)
Imagine there is a feature z such that Y = A + Bx + Cz (C is the coefficient of z here)
This means that y is also explained by the feature z, but z is not included in the model; this is OVB. When we leave out a relevant variable, its effect is absorbed into the error term, and if z is correlated with x, the error term and the independent variable x become correlated. (This can also be proved mathematically.)
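A quick simulation can make this tangible. The sketch below assumes a made-up world where Y = A + Bx + Cz + noise with A = 1, B = 2, C = 3, and where z is correlated with x; every number and helper name is hypothetical. It then leaves z out and looks at what happens.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sum((xi - x_bar) ** 2 for xi in x)


random.seed(0)
n = 2000
A, B, C = 1.0, 2.0, 3.0  # assumed "true" values, chosen for illustration

x = [random.gauss(0, 1) for _ in range(n)]
z = [0.8 * xi + random.gauss(0, 0.5) for xi in x]  # z is correlated with x
noise = [random.gauss(0, 1) for _ in range(n)]
y = [A + B * xi + C * zi + ei for xi, zi, ei in zip(x, z, noise)]

# If we model Y = A + Bx only, the C*z part ends up inside the error term.
error_without_z = [C * zi + ei for zi, ei in zip(z, noise)]
corr_x_error = pearson_r(x, error_without_z)

# The slope estimated without z drifts away from the true B = 2.
b_without_z = slope(x, y)
print(f"corr(x, error) = {corr_x_error:.2f}, estimated B = {b_without_z:.2f}")
```

Note that the correlation shows up in the true error term (which we can only see because we simulated the data); the fitted OLS residuals are uncorrelated with x by construction, so OVB usually reveals itself through a wrong or surprising coefficient instead.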
For example, in real life we might build a model predicting house prices using the area of the house and see smaller areas fetching higher prices. We're perplexed, but maybe we didn't include an important feature like the location of the apartment (Mumbai and Kanpur would be a good example here). If you can think of some other feature, please leave it as a comment below!
Fixing OVB requires a combination of skills: logical/critical thinking, domain knowledge and data collection.
Normality & Homoscedasticity
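For normality, we want to make sure that the error terms (residuals) are normally distributed.
For large samples this is usually not a concern, thanks to the Central Limit Theorem: the coefficient estimates are approximately normal even when the errors themselves are not exactly normal.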
Homoscedasticity means that the error terms have equal variance.
To check for homoscedasticity, plot a scatterplot of the standardized absolute residuals against the standardized predicted values. If there's a pattern (for example, the spread grows with the predicted value), heteroscedasticity is present and the assumption is violated; if there's no pattern, homoscedasticity holds.
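As a numeric stand-in for eyeballing that scatterplot, the sketch below computes the correlation between the absolute residuals and the fitted values for two simulated datasets, one with constant error spread and one where the spread grows with x. All data and helper names are invented, and a single correlation is only a crude summary of "is there a pattern".

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


def abs_residual_vs_fitted(x, y):
    """Correlation between |residual| and fitted value after a simple OLS fit.

    Standardizing both quantities would not change this correlation,
    so the raw values are used directly.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    abs_resid = [abs(yi - fi) for yi, fi in zip(y, fitted)]
    return pearson_r(abs_resid, fitted)


random.seed(1)
x = [random.uniform(1, 10) for _ in range(1000)]
# Equal variance: the noise has the same spread everywhere.
y_equal = [2 + 3 * xi + random.gauss(0, 2) for xi in x]
# Unequal variance: the noise spread grows with x.
y_growing = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]

r_equal = abs_residual_vs_fitted(x, y_equal)
r_growing = abs_residual_vs_fitted(x, y_growing)
print(f"equal spread: {r_equal:.2f}, growing spread: {r_growing:.2f}")
```

A value near zero is what we hope for; a clearly positive value suggests the residuals fan out as the prediction grows.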
There are also some tests which are more accurate in checking for homo/heteroscedasticity; we'll discuss them in the coding part.
No Autocorrelation
Autocorrelation (aka serial correlation) is a condition in which the error terms are themselves correlated or follow a trend.
Technical definition: the correlation of a signal with a delayed copy of itself, as a function of the delay.
This mostly happens in time series data, like predicting stock prices, where prices may be higher on Friday and lower on Monday (aka the day-of-the-week effect); the series follows a pattern, and hence the residuals are somewhat correlated.
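Following the technical definition above, one simple way to look for autocorrelation is to correlate the residuals with a copy of themselves delayed by one step. The sketch below does this on simulated data; the series, the 0.8 carry-over factor and the helper names are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


def ols_residuals(x, y):
    """Residuals of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def lag1_autocorr(series):
    """Correlation of a series with a copy of itself delayed by one step."""
    return pearson_r(series[:-1], series[1:])


random.seed(2)
n = 1000
t = list(range(n))  # time index used as the predictor

# Errors where each value carries over 80% of the previous one.
carried_errors = [0.0]
for _ in range(n - 1):
    carried_errors.append(0.8 * carried_errors[-1] + random.gauss(0, 1))

y_correlated = [1 + 0.05 * ti + ei for ti, ei in zip(t, carried_errors)]
y_independent = [1 + 0.05 * ti + random.gauss(0, 1) for ti in t]

r_correlated = lag1_autocorr(ols_residuals(t, y_correlated))
r_independent = lag1_autocorr(ols_residuals(t, y_independent))
print(f"correlated errors: {r_correlated:.2f}, independent: {r_independent:.2f}")
```

A lag-1 value near zero is consistent with the assumption; a value far from zero means each residual tells you something about the next one.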
The no-autocorrelation assumption is important for regression, and unfortunately there is no simple way to fix it within OLS if it occurs. For time series data, instead of OLS/linear regression we can use something like an autoregressive integrated moving average (ARIMA) model.
No Multicollinearity
OK, now we're onto the last assumption. As is evident from its name, multicollinearity exists when there is some correlation between 2 or more independent variables.
If one variable is some function of the other, there's no point in keeping both. Even if there is no exact function but the two are highly correlated, one can almost be represented by the other, hence we don't need both. Multicollinearity is a problem because it undermines the statistical significance of an independent variable.
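A first, rough way to spot this is to compute the correlation between every pair of predictors and flag the pairs that are very high. In the sketch below the house features are made up, and the 0.8 cut-off is an arbitrary illustrative threshold, not a rule.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


area_sqft = [500, 750, 1000, 1250, 1500, 2000]

# Hypothetical predictors for a house-price model.
predictors = {
    "area_sqft": area_sqft,
    # Exact function of area_sqft (unit conversion), so it adds nothing new.
    "area_sqm": [a * 0.092903 for a in area_sqft],
    # Not an exact function of area, but it moves closely with it.
    "rooms": [1, 2, 2, 3, 3, 4],
    "age_years": [12, 3, 25, 8, 30, 15],
}

THRESHOLD = 0.8  # illustrative cut-off for "highly correlated"

flagged = []
for name_a, name_b in combinations(predictors, 2):
    r = pearson_r(predictors[name_a], predictors[name_b])
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b, r))

for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

Pairwise correlation only catches relationships between two variables at a time, so treat it as a quick screen rather than a complete check.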
Fixes for Multicollinearity--
Now, I know that as far as the assumptions of OLS are concerned, there can be more of them (even 7 or more!) according to theoretical statistics, but for practicality and real-world business scenarios these 5 should be enough.
So this was it for the 'Theoretical part of Regression analysis'. I know this article has been really theoretical, without any implementation, but having a sound theoretical foundation is important in statistics. In the next part I am going to code and check every assumption in Python on a real-world dataset.
Thank you for reading and please leave your valuable comments and feedback!