# Understanding Regression Models

Maitry Sinha · Oct 06, 2020 · 7 min read

### What is a Regression Model?

Regression models are used in Machine Learning for continuous datasets. Unlike classification, where the output is limited to classes such as 0 or 1 (True or False), the output of a regression model can be any continuous value. In a regression model we learn the relationship between two or more variables and compute one variable with respect to the others: we have some independent variables, from which we find the dependent (output) variable by evaluating the relationship between them. For example, in the case of a housing dataset, we have many independent variables from which we estimate the output (dependent) variable by establishing the relationship between them through a regression model.
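To make the idea concrete, here is a minimal sketch (on made-up synthetic data, not the housing dataset discussed below) of a regression model recovering a known relationship between independent variables and a continuous output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two independent variables, one continuous output
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 5.0   # true relationship (no noise)

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)       # recovers ~[3.0, -1.5] and ~5.0
```

Because the synthetic data is noiseless, the fitted coefficients match the true relationship almost exactly; on real data they would only approximate it.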

In this article, I am going to walk through an end-to-end regression model on the famous Boston Housing dataset.

Importing the necessary libraries...

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
```

```python
from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data)
bs = boston['data']
```

The features are..

```python
feature = boston['feature_names']
feature
```

Feature description..

### EDA & Feature Engineering

```python
feature_dict = {'CRIM': ['per capita crime rate by town'],
                'ZN': ['proportion of residential land zoned for lots over 25,000 sq.ft.'],
                'INDUS': ['proportion of non-retail business acres per town'],
                'CHAS': ['Charles River dummy variable'],
                'NOX': ['nitric oxides concentration (parts per 10 million)'],
                'RM': ['average number of rooms per dwelling'],
                'AGE': ['proportion of owner-occupied units built prior to 1940'],
                'DIS': ['weighted distances to five Boston employment centres'],
                'RAD': ['index of accessibility to radial highways'],
                'TAX': ['full-value property-tax rate per $10,000'],
                'PTRATIO': ['pupil-teacher ratio by town'],
                'B': ['1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town'],
                'LSTAT': ['% lower status of the population']}

feature_dict
```
```python
des = pd.DataFrame(feature_dict, index=['Description'])
print('\nDescription of the features:')
des.T
```

Building a DataFrame of the independent columns for the further operations..

```python
df = pd.DataFrame(bs, columns=feature)
target = boston['target']
y = pd.DataFrame(target, columns=['SalePrice'])
```

Pearson correlation across the whole dataset:

```python
data = pd.concat([y, df], axis=1).copy()
corr = data.corr()
plt.figure(figsize=(18, 12))
sns.heatmap(corr, annot=True, fmt='.2g', linewidths=0.5, cmap='viridis')
plt.show()
```

Pearson correlation of all the columns with respect to the output ('SalePrice') column.

```python
imp = corr[['SalePrice']]
imp = imp.sort_values(by='SalePrice', ascending=False)
print("\n   Feature Importance:")
plt.figure(figsize=(18, 2))
sns.heatmap(imp.T, annot=True, fmt='.3g', linewidths=0.5, cbar=False, cmap='viridis')
plt.show()
```

#### Making scatter, box & distribution plots side by side for each feature, using a for loop to show all the plots together.

```python
for i in range(len(data.columns)):
    a = data[data.columns[i]]
    b = data['SalePrice']
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 4))
    sns.scatterplot(a, b, ax=ax1, color='g')
    sns.boxplot(a, ax=ax2, color='m')
    sns.distplot(a, ax=ax3, color='r')
```

Here we can understand from the (marked) graphs that some features have outliers, and the 'LSTAT' feature also shows polynomial behaviour. So, in the 'col' list I have put those features together along with 'SalePrice'.

``col = ['SalePrice','RM','AGE','LSTAT']``

To remove some outliers using the Z-score transformation method, I have made a function 'Z_score()'.

```python
def Z_score(data, left, right):
    index = []
    mean = data.mean()
    std = data.std()
    for i in range(len(data)):
        z = (data[i] - mean) / std
        if (z >= -left) and (z <= right):
            index.append(i)
    return index
```
```python
clean_index = Z_score(y.SalePrice, 2.5, 2.4)
print("Data to be cleaned:{}".format(len(y) - len(clean_index)))
```

Data to be cleaned : 23

Cleaning the dataset.

```python
Df = data.iloc[clean_index].reset_index(drop=True).copy()
Df.shape
```

Now the data has been cleaned and the shape is (483, 14).

[Note: since all the cleaned data is now in 'Df', it is good practice to delete the extra DataFrames ('y', 'df', 'data') to free up memory.]

``del y,df,data``

Separating X-data and Y-data.

```python
y = Df['SalePrice'].copy()
X = Df.drop(['SalePrice'], axis=1).copy()
```

Feature selection with Lasso regularisation.

```python
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.50, max_iter=2000)
lasso.fit(X, y)
```
```python
coeff_values = pd.DataFrame({'coeff': lasso.coef_}, index=X.columns).sort_values(by='coeff')
c = (abs(coeff_values.coeff) == 0)
col_non_imp = list(X.columns[c])
print("Number of non-important columns:", len(col_non_imp))
print('\n')
print('Non-important columns are:', col_non_imp)
```

We find that the non-important columns are 'DIS' & 'RAD'. Drop those columns.

``X.drop(col_non_imp, axis=1, inplace=True)``
``y = pd.DataFrame(y)``
```python
# Making a log transformation of the features, except 'CHAS'
feat = [col for col in X.columns if col != 'CHAS']
feat
```

Replacing zero with one, since the logarithm of zero would give an error. 'CHAS' is excluded because it is a 0/1 dummy variable.

```python
for col in feat:
    X[col] = X[col].replace({0: 1})
```
```python
for col in feat:
    X[col] = X[col].apply(lambda x: np.log(x))
```

From the above distribution plots I can say that the 'CHAS' & 'B' features have many outliers, so for better performance we can remove those features.

``X.drop(['CHAS','B'], axis=1, inplace=True)``

Scaling the dataset to a standard normal distribution.

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = scaler.fit_transform(X).copy()
x.shape, y.shape
x = pd.DataFrame(x, columns=X.columns)
```
```python
x.to_csv('X.csv', index=False)
y.to_csv('y.csv', index=False)
```
``del x,y``

[Deleted 'x' and 'y' to free up memory.]

Importing scaled CSV files.

```python
X = pd.read_csv('X.csv')
y = pd.read_csv('y.csv')
```

The shape of the X CSV file is (483, 9) and of the y CSV file is (483, 1).

### Making the Regression Model

It is found that if a dataset has some curvature in it, a linear model (a perfect straight line) will not give the best result. In that case polynomial regression of degree two, three or more will give a better result, together with a regularisation factor lambda, which can regulate or shrink the coefficients of the higher-degree features that are generated. In other words, through the regularisation factor (lambda) we can control overfitting or under-fitting of the model.
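The effect of lambda can be sketched on synthetic data. Below, sklearn's `Ridge` plays the role of the regularised polynomial fit, with its `alpha` parameter acting as lambda; this is an illustration on made-up data, not part of the Boston pipeline:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x = np.linspace(0, 3, 40).reshape(-1, 1)
# a curved (quadratic) relationship with a little noise
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(0, 0.1, 40)

# larger lambda -> smaller (shrunk) coefficients on the degree-5 features
for lam in (0.001, 1.0, 100.0):
    model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=lam))
    model.fit(x, y)
    print(lam, np.abs(model.named_steps['ridge'].coef_).sum())
```

The printed total coefficient magnitude shrinks as lambda grows, which is exactly the mechanism that keeps a high-degree polynomial from overfitting.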

I have seen from the EDA that this dataset has some curvature, so I will go for the polynomial regression method of degree 2.

Importing Polynomial-features.

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False)
x = poly.fit_transform(X)
x = pd.DataFrame(x)
x.shape
```

The shape of 'x' is (483, 55).

So, after adding terms up to degree 2, the number of features has increased to 55.
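The count of 55 can be checked combinatorially: with n = 9 input features and degree d = 2, `PolynomialFeatures` generates C(n + d, d) columns (including the bias column):

```python
from math import comb

n, d = 9, 2
print(comb(n + d, d))   # 55 = 1 bias + 9 linear + 9 squares + 36 pairwise products
```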

The equation should be like this:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θ₉x₉ + θ₁₀x₁² + θ₁₁x₁x₂ + … + θ₅₄x₉²

Splitting the datasets into train sets and test sets.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
xtrain = np.array(X_train)
xtest = np.array(X_test)
ytrain = y_train
ytest = y_test
xtrain.shape, xtest.shape
```

The shape of xtrain is (386, 55) and of xtest is (97, 55).

### Method-1.

In this case I have written the equations for gradient descent, where the cost (MSE) on the training data gradually decreases with the number of iterations.

Here is my cost function without any regularisation factor:

J(θ) = (1/2m) · Σᵢ₌₁..ₘ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

However, the cost function with a regularisation factor would be like this:

J(θ) = (1/2m) · [ Σᵢ₌₁..ₘ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )² + λ · Σⱼ₌₁..ₙ θⱼ² ]

Simultaneously update theta at each iteration:

θⱼ := θⱼ − (α/m) · Σᵢ₌₁..ₘ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾   (for every j simultaneously)

In this equation the learning rate (alpha) is given.

I have used the first equation to calculate the cost, to see how the model behaves without a regularisation factor.

If you look at my code, you will see that I have also combined a momentum factor with the learning rate (alpha is multiplied by the momentum on each iteration).

```python
def GradDiscent(xtrain, ytrain, xtest, ytest):
    # flatten the targets to 1-D arrays so the vector arithmetic below works
    ytrain = np.ravel(ytrain)
    ytest = np.ravel(ytest)
    m = len(xtrain)
    n = len(xtrain.T)

    p = int(input('Enter no. of Iteration:'))
    alpha = float(input('Enter Learning rate(alpha):'))
    mom = float(input('Enter momentum :'))
    theta = pd.DataFrame(np.zeros([n, p]))
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 7))

    # here is the equation
    J = np.zeros(p + 1)
    J1 = np.zeros(p + 1)
    for i in range(1, p + 1):
        # momentum is combined with the learning rate alpha
        alpha = alpha * mom
        for j in range(m):
            # calculating thetas
            theta[i] = theta[i - 1] - (sum(np.dot(xtrain, theta[i - 1]) - ytrain) / m) * alpha * xtrain[j]
        J[i] = (1 / (2 * m)) * sum((ytrain - np.dot(xtrain, theta[i])) ** 2)   # cost for train data
        J1[i] = (1 / (2 * m)) * sum((ytest - np.dot(xtest, theta[i])) ** 2)    # cost for test data
    Theta = pd.DataFrame(np.array(theta[p]), columns=['Theta'], index=x.columns)  # DataFrame of final thetas

    # plotting cost vs. iteration
    ax1.plot(np.arange(1, p), J[2:], color='r')
    ax2.plot(np.arange(1, p), J1[2:], color='g')
    ax1.set_xlabel('No. of Iteration')
    ax1.set_ylabel('MSE, Cost for Train Data')
    ax1.set_title('Gradient Descent for Train Data Set')
    ax2.set_xlabel('No. of Iteration')
    ax2.set_ylabel('MSE, Cost for Test Data')
    ax2.set_title('Gradient Descent for Test Data Set')
    sns.despine(left=True)

    # calculating MSE and RMSE loss
    print('\nMSE  from GD: ', J1[p])
    print('RMSE from GD: ', np.sqrt(J1[p]))
    print('\n')
    return Theta


def Prediction(theta, xtest, y):
    thet = theta['Theta']
    PtestGD = np.dot(xtest, thet)
    lim = int(round(float(y.max())))
    p4 = np.arange(lim)
    y4 = np.arange(lim)

    # plotting the prediction plot
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

    ax1.scatter(PtestGD, ytest, color='g')
    ax1.plot(p4, y4, color='y')
    ax1.set_xlabel('Predicted Data Set')
    ax1.set_ylabel('Original Data Set')
    ax1.set_title('Comparison Graph')
    sns.despine(left=True)

    sns.distplot((PtestGD - np.ravel(ytest)), color='red', bins=10)
    ax2.set_xlabel('[VALUE - PREDICTED VALUE]')
    ax2.set_title('Distribution Plot for Predicted & Test Data')
    ax2.grid()
    sns.despine(left=True)
```
``theta = GradDiscent(xtrain,ytrain,xtest,ytest)``

Result...

``Prediction(theta,xtest,y)  ``

Conclusion:

We can see that the result is not up to the mark.

The only reason is that I have not added a regularisation factor to the cost function, even though my regression equation is a polynomial of degree 2.

It's not the fault of the equation; the equation I applied here is for an absolutely linear (straight-line) model, while my model is a polynomial of degree 2.

So in this case the model has an overfitting problem. To balance it, a regularisation factor is necessary to shrink the higher-degree coefficients, but I have not used one.

For the next method I must use the regularisation factor to balance this bias-variance problem.

### NORMAL EQUATION MODEL

With matrix multiplication, together with the transpose and inverse of a matrix, we can also calculate the thetas (intercept & coefficients) in a single step.

The equation (for an absolutely linear model) would be like this:

θ = (XᵀX)⁻¹ Xᵀy

In the above equation the regularisation factor is not included.

As my regression model is a polynomial of degree 2, I have to apply a regularisation factor, which will automatically shrink the coefficients of the higher-degree features to control overfitting of the model.

The normal equation with a regularisation factor is like this:

θ = (XᵀX + λ·M)⁻¹ Xᵀy

where M is the identity matrix with its top-left entry (the intercept position) set to 0.

In most cases the above two equations are used to directly calculate the intercept as well as the coefficients of the features, which I learnt from Prof. Andrew Ng.

So, I used the second equation with regularisation factor lambda.

The first advantage of this equation is that it works very well for up to around 10,000 features.

The second advantage is that it is very fast: in a single step it calculates the intercept as well as the thetas.
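A compact sketch of this regularised normal equation on synthetic data (illustrative only, with made-up numbers; the article's own function wraps the same formula with metrics and plots):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.c_[np.ones(50), rng.uniform(0, 1, size=(50, 2))]  # column of ones for the intercept
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true

lam = 0.1
M = np.eye(X.shape[1])
M[0, 0] = 0.0                                   # do not penalise the intercept
theta = np.linalg.solve(X.T @ X + lam * M, X.T @ y)
print(theta)                                    # close to [1, 2, -3], slightly shrunk
```

Using `np.linalg.solve` instead of explicitly inverting the matrix is numerically safer and gives the same theta.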

The Normal Equation model is implemented using the 'normalEq()' function.

```python
def normalEq(xtrain, ytrain, xtest, ytest):
    m = len(xtrain)
    n = len(xtrain.T)
    lam = float(input('Enter Regularisation Factor(Lambda): '))
    # making the regularisation matrix: identity with 0 at the intercept position
    rg = np.eye(n)
    rg[0, 0] = 0
    # normal equation, directly calculating thetas (intercept & coefficients)
    Theta = np.dot(np.dot(np.linalg.inv(np.dot(xtrain.T, xtrain) + rg * lam), xtrain.T), ytrain)

    Ptest = np.dot(xtest, Theta)    # predicting for test data
    Ptrain = np.dot(xtrain, Theta)  # predicting for train data

    from sklearn.metrics import r2_score, mean_squared_error
    mse = mean_squared_error(Ptest, ytest)
    score = r2_score(Ptest, ytest)
    mse1 = mean_squared_error(Ptrain, ytrain)
    score1 = r2_score(Ptrain, ytrain)

    print('\n')
    print('*' * 45)
    print('MSE, for test data     : ', mse)
    print('RMSE, for test data    : ', np.sqrt(mse))
    print('R2-score for test data : ', round(score * 100, 2), '%')
    print('*' * 45)

    print('\n')
    print('*' * 45)
    print('MSE, for train data     : ', mse1)
    print('RMSE, for train data    : ', np.sqrt(mse1))
    print('R2-score for train data : ', round(score1 * 100, 2), '%')
    print('*' * 45)

    # plotting the test data results
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

    ax1.scatter(Ptest, ytest, color='g')
    ax1.plot([0, Ptest.max()], [0, ytest.max()], color='y')
    ax1.set_xlabel('Predicted Data Set')
    ax1.set_ylabel('Original Data Set')
    ax1.set_title('Comparison Graph')
    sns.despine(left=True)
    ax1.grid()

    sns.distplot((Ptest - ytest), color='r', bins=10)
    ax2.set_title('Distribution Plot for Predicted & Test Data Set')
    ax2.grid()
    sns.despine(left=True)

    # plotting the train data results
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

    ax1.scatter(Ptrain, ytrain, color='g')
    ax1.plot([0, Ptrain.max()], [0, ytrain.max()], color='y')
    ax1.set_xlabel('Predicted Data Set')
    ax1.set_ylabel('Original Data Set')
    ax1.set_title('Comparison Graph')
    sns.despine(left=True)
    ax1.grid()

    sns.distplot((Ptrain - ytrain), color='r', bins=10)
    ax2.set_title('Distribution Plot for Predicted & Train Data Set')
    ax2.grid()
    sns.despine(left=True)
    return pd.DataFrame(Theta).T
```
``theta = normalEq(xtrain,ytrain,xtest,ytest)``

So here we can see from the Normal Equation model that the accuracy is much better now: the R2-score is 87.49% for the test data and 86.25% for the train data.

Let's see it more clearly through the visualisation graphs...

So, from these graphs we can see that the accuracy is really good.

Output:

```python
print('Values of Intercept(col=0) and coefficients :-')
theta
```