So first of all I would like to say a big thanks to the 'Sigma Coding' YouTube channel, whose videos helped me with the coding part of testing the assumptions of linear regression in Python. If you're someone who understands better through video lectures, you can watch all the implementations in video HERE after reading my blog. Also, if you haven't read the previous article, where I explained the theory behind each and every assumption of linear regression, you can find it HERE.
We have a dataset which contains some economic info about Korea over the years. We're gonna build an OLS model predicting GDP growth (y = gdp_growth) using the other independent variables in the data (like birth_rate, gross_capital, etc.). We'll first check the assumptions and look for outliers, then select variables for model building, and finally interpret the model after building it.
The notebook and data are available here. I suggest you run Jupyter notebook along with the blog to best understand it.
If you are running the code alongside, you'll need these libraries: NumPy, Pandas, SciPy, Statsmodels, Matplotlib, Seaborn & Pylab. Make sure you take care of these dependencies beforehand.
We'll first import the data and do a bit of data cleaning (changing the column names). There was only one Missing value which I imputed manually after a simple google search.
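The import-and-clean step can be sketched like this. The file name, the raw column labels and the imputed value below are all hypothetical stand-ins, since the exact dataset isn't reproduced here:

```python
import pandas as pd

# Hypothetical: with the real data you would do something like
# df = pd.read_csv("korea_economic_data.csv")
# For illustration, a tiny stand-in frame with messy column names and one gap:
df = pd.DataFrame({
    "GDP Growth (%)": [5.9, -5.5, 11.5],
    "Birth Rate": [13.8, None, 13.2],
})

# Normalise column names: lowercase, underscores instead of spaces/symbols
df.columns = (df.columns.str.lower()
                        .str.replace(r"[^a-z0-9]+", "_", regex=True)
                        .str.strip("_"))

# Impute the single missing value manually (value found via a simple search;
# 13.6 here is a made-up placeholder)
df.loc[df["birth_rate"].isna(), "birth_rate"] = 13.6
print(df.columns.tolist())
```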
Looking for Outliers
In a normal distribution, about 99.7% of the data lies within 3 standard deviations of the mean. So I looked for values more extreme than that, as they would be outliers.
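A minimal sketch of this check, on synthetic stand-in data (one deliberately extreme "crisis year" injected), using z-scores from SciPy:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in series: 50 ordinary years plus one artificial crash (hypothetical values)
rng = np.random.default_rng(0)
gdp = pd.Series(rng.normal(5, 1, 50), name="gdp_growth")
gdp.iloc[10] = -10.0  # injected outlier

# |z| > 3 flags values lying more than 3 standard deviations from the mean
z = np.abs(stats.zscore(gdp))
outliers = gdp[z > 3]
print(outliers)  # only the injected crash year is flagged
```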
Looks like only 2 rows contain outliers, the years 1998 & 2001.
Now we could remove these outliers, but that would amount to assuming that financial crises don't happen, when in reality they happen all the time. So I'm keeping them.
As we'll be fitting an Ordinary Least Squares linear regression (aka a straight line) to the data, we are assuming that there is a linear relationship between the dependent and independent variables.
We see high multicollinearity between many variables. For example, the correlation between birth_rate & pop_growth is almost 1.0, which means that one is almost a function of the other. The usual advice is to drop one of the two correlated features. However, I'm gonna take a more technical approach and decide on multicollinearity using the VIF (Variance Inflation Factor). The rule of thumb is that a VIF > 5 means the variable (one of the correlated pair, really) should be removed.
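The correlation matrix referred to above can be visualised with a Seaborn heatmap. This is a sketch on synthetic stand-in data in which birth_rate is deliberately constructed from pop_growth, mimicking the near-1.0 correlation in the real data:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in: birth_rate built to be nearly a function of pop_growth
rng = np.random.default_rng(1)
pop_growth = rng.normal(1.0, 0.2, 40)
df = pd.DataFrame({
    "pop_growth": pop_growth,
    "birth_rate": 10 * pop_growth + rng.normal(0, 0.1, 40),
    "gdp_growth": rng.normal(5, 2, 40),
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix")
# The birth_rate/pop_growth cell will sit very close to 1.0
```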
VIFs are calculated by taking a predictor and regressing it against every other predictor in the model. This gives an R-squared value, which is then plugged into the VIF formula:

VIF_i = 1 / (1 - R_i^2)

where i indexes the predictor variable.
Note: gdp_growth is the dependent variable (y), so don't worry about its VIF; also ignore the 'const' column's VIF, as it's just a constant.
After removing one of each pair of correlated variables/columns with VIF > 5 from the data, we see much more stable VIFs.
We are not going to test for this assumption. I forgot to mention in the previous, theoretical blog that testing for endogeneity is optional. It's mostly used in regression for econometric studies and requires good knowledge of the field. Moreover, the tests are advanced, using methods like panel-data regression, two-stage least squares, GMM, etc., which are well outside the scope of simple regression analysis.
However, for our day-to-day analysis, the main assumptions (linearity, no multicollinearity, normality & homoscedasticity of residuals, and no autocorrelation) will be enough.
Normality & Homoscedasticity
After creating the model, we will now test the rest of the assumptions (as they require the residuals/prediction errors).
First we check that the residuals are normally distributed and that they have a mean of 0.
A simple Q-Q plot can tell us about normality: if the data points tightly hug the line, normality can be assumed.
So we can assume a normal distribution; the mean of the residuals is also very close to zero.
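A sketch of the Q-Q plot and mean check, using stand-in residuals (with a real fitted model you'd use `model.resid` instead):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from scipy import stats

# Stand-in residuals; in practice, take these from the fitted OLS model
rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, 50)

# Q-Q plot: normally distributed residuals hug the 45-degree reference line
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")

print(round(float(residuals.mean()), 3))  # close to 0 for well-behaved residuals
```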
For homoscedasticity we run the Breusch-Pagan test, where the null hypothesis is that the variance of the error terms is constant (homoscedasticity) and the alternative is that it is not. We want to fail to reject the null hypothesis, so that homoscedasticity holds.
As you can see, the p-value is much larger than the significance level, so we fail to reject the null hypothesis: the variance is equal at all levels. Therefore homoscedasticity & normality hold (along with the zero mean).
No Autocorrelation
We use the Ljung-Box test for no autocorrelation of the residuals.
Here the hypotheses are:
H0: The residuals are independently distributed (no autocorrelation).
Ha: The residuals exhibit serial correlation.
(By 'data' in the test's usual statement we mean the residuals here.) We want to fail to reject the null hypothesis, which means the residuals are random and uncorrelated with each other. A large enough p-value will mean no autocorrelation.
As even the minimum p-value is greater than 0.05, we fail to reject the null hypothesis; that is, the residuals are random, i.e., there is no autocorrelation between them.
Also, graphically, the vertical lines shouldn't cross the blue shaded region if the no-autocorrelation condition is to be satisfied.
Finally, we are done with all the assumptions. Now we'll look at the confidence intervals & p-values of the slopes/coefficients of the independent variables and select the features that are statistically significant for the model.
This is the Ordinary Least Squares model summary (fitted using the Statsmodels package).
The columns labelled 0.025 and 0.975 are the lower and upper bounds of the 95% confidence interval (CI).
P>|t| is the p-value for that specific variable; t is the test statistic; coef and std err are the coefficient and its standard error, respectively.
Ignore the constant variable 'const'.
The variables whose CI contains 0 are insignificant, because at the 95% confidence level we cannot rule out that the slope/coefficient of that variable is 0, which means the independent variable may well have no effect on the dependent variable.
Similarly, for variables with p-value > 0.05 we fail to reject the null hypothesis (the null hypothesis being that the slope/coefficient of the variable is 0), so those variables are statistically insignificant. A p-value < 0.05, on the other hand, tells us: 'Assuming the null hypothesis is true, getting a coefficient this extreme has a probability of less than 5%.' Such a value is highly unlikely under the null hypothesis, so we reject it in favour of the alternative hypothesis (that the slope/coefficient is not 0), and the variable is statistically significant.
One such example of an insignificant variable is the 'broad_money_growth' column: its CI is (-0.035, 0.040), which clearly includes 0, and its p-value is 0.894 >> 0.05, which is enormous. So we remove this variable.
After removing all the insignificant variables and rebuilding the model, we get this OLS summary.
Now looking at C.I.'s and P-values of variables, we can see that all the variables are statistically significant, even the constant is LOL.
Don't worry about the R-squared value; the only purpose of this article was to do regression analysis to learn more about the data. After all, this data has only around 50 rows, so it can't be used for prediction anyway.
We can now safely interpret the model (we are able to do this only because all our assumptions hold): holding all the other variables constant, a unit increase in pop_growth leads to a 1.93% increase in gdp_growth (y / the dependent variable).
Thank you! This was it for regression analysis from my side, hope you liked the blog. If you have any questions regarding the assumptions, python code OR if you think I went wrong somewhere, feel free to comment down below and I'll be more than happy to answer all your queries.