Understanding the Regression Model

## #regression #model #normal #equation

Maitry Sinha Oct 06 2020 · 7 min read

What is a Regression Model?

Regression models are used in Machine Learning for continuous datasets. Unlike classification, where the output is limited to values such as 0 or 1 (True or False), the output of a regression model can be any real value. In a regression model we estimate the relationship between two or more variables and compute one variable with respect to the others: we have some independent variables from which we find the dependent (output) variable by evaluating the relationship between them. For example, in the case of a housing dataset, we have many independent variables from which we predict the output (dependent) variable by establishing the relationship between them through a regression model.
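For intuition, here is a minimal sketch (with made-up numbers, not the Boston data) of fitting a regression model with scikit-learn, where the number of rooms and the floor area are the independent variables and the price is the dependent (output) variable:

# minimal illustrative sketch with hypothetical data, not the Boston dataset
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[3, 60], [4, 85], [5, 110], [6, 140]])   # independent variables: [rooms, area]
y_toy = np.array([200.0, 260.0, 330.0, 410.0])             # dependent variable: price (continuous)

model = LinearRegression().fit(X_toy, y_toy)
print(model.intercept_, model.coef_)        # the learned relationship
print(model.predict([[4, 90]]))             # the output can be any real value, not just 0/1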

In this article, I am going to walk through an end-to-end regression model built on the famous Boston Housing dataset.

You can load the dataset directly from sklearn with this function...

 load_boston()

Importing necessary Libraries....

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

Loading the Datasets...

from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data)
print("shape is:",bos.shape)
bos.head(2)
bs = boston['data']

The features are..

feature = boston['feature_names']
feature

Feature description..

EDA & FEATURE ENGINEERING

feature_desc = {'CRIM':['per capita crime rate by town'],
       'ZN':['proportion of residential land over 25,000 sq.ft.'],
       'INDUS':['proportion of non-retail business acres per town'],
       'CHAS':['Charles River dummy variable'],
       'NOX':['nitric oxides concentration per 10 million'],
       'RM':['average number of rooms per dwelling'],
       'AGE':['proportion of owner-occupied prior to 1940'],
       'DIS':['distances to five Boston employment centres'],
       'RAD':['index of accessibility to radial highways'],
       'TAX':['full-value property-tax rate per $10,000'],
       'PTRATIO':['pupil-teacher ratio by town'],
       'B':['1000(Bk - 0.63)^2,Bk is the proportion of blacks'],
       'LSTAT':['% lower status of the population']}

feature_desc
des = pd.DataFrame(feature_desc,index=['Description'])
print('\nDescription of the features:')
des.T

Building a data frame of the independent columns for the further operations.

df = pd.DataFrame(bs,columns=feature)
df.head(2)

target = boston['target']
y = pd.DataFrame(target,columns=['SalePrice'])

Pearson correlation matrix of the whole dataset.

data = pd.concat([y,df],axis=1).copy()
corr = data.corr()
plt.figure(figsize=(18,12))
sns.heatmap(corr,annot=True,fmt='.2g',linewidths=0.5,cmap='viridis')
plt.show()

Pearson correlation of all the columns with respect to the output ('SalePrice') column.

imp = corr[['SalePrice']]
imp = imp.sort_values(by='SalePrice',ascending=False)
print("\n   Feature Importancee:")
plt.figure(figsize=(18,2))
sns.heatmap(imp.T,annot=True,fmt='.3g',linewidths=0.5,cbar=False,cmap='viridis')
plt.show()

Making a scatter plot, box plot & distribution plot simultaneously for comparison, using a for loop to show all the plots together.

for i in range(len(data.columns)):
    a = data[data.columns[i]]
    b = data['SalePrice']
    f,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(20,4))
    sns.scatterplot(x=a,y=b,ax=ax1,color='g')
    sns.boxplot(x=a,ax=ax2,color='m')
    sns.distplot(a,ax=ax3,color='r')

From the graphs we can see that some features have outliers, and that 'LSTAT' also shows polynomial behaviour. So, in the 'col' list I have put those features together along with 'SalePrice'.

col = ['SalePrice','RM','AGE','LSTAT']

To remove some of the outliers with the Z-score method, I have written a function 'Z_score()'.
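For reference, the z-score computed inside the function is the usual standardised value,

z_i = \frac{x_i - \mu}{\sigma}

where \mu and \sigma are the mean and standard deviation of the column; an index is kept only when -\text{left} \le z_i \le \text{right}.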

def Z_score(data,left,right):
    # keep only the indices whose z-score lies within [-left, right]
    index = []
    mean = data.mean()
    std = data.std()
    for i in range(len(data)):
        z = (data[i]-mean)/std
        if (z >= -left) and (z <= right):
            index.append(i)
    return index
clean_index = Z_score(y.SalePrice,2.5,2.4)
print("Data to be cleaned:{}".format(len(y) - len(clean_index)))

Data to be cleaned : 23

Cleaning the dataset.

Df = data.iloc[clean_index].reset_index(drop=True).copy()
Df.shape

Now the data has been cleaned and the shape is (483, 14).

[Note: since all the cleaned data is now in 'Df', it is good practice to delete the extra data frames ('y', 'df', 'data') to free up memory.]

del y,df,data

Separating X-data and Y-data.

y = Df['SalePrice'].copy()
X = Df.drop(['SalePrice'],axis=1).copy()

Feature selection with Lasso (L1) regularisation.

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.50,max_iter=2000)
lasso.fit(X,y)
coeff_values = pd.DataFrame({'coeff':lasso.coef_},index=X.columns).sort_values(by='coeff')
c = (abs(coeff_values.coeff) == 0)
col_non_imp = list(X.columns[c])
print("Length of the non-important columns are :",len(col_non_imp))
print('\n')
print('Non-important columns are :',col_non_imp) 

The non-important columns turn out to be 'DIS' & 'RAD'. Drop those columns.

X.drop(col_non_imp,axis=1,inplace=True)
y = pd.DataFrame(y)
#Making log transformation of the features, Except 'CHAS'
feat = [col for col in X.columns if col != 'CHAS']
feat

Replacing zeros with ones, otherwise the logarithm would raise an error; this is also why I did not take 'CHAS' (it is a 0/1 dummy variable).

for col in feat:
    X[col] = X[col].replace({0:1})           # avoid taking log(0)
    X[col] = X[col].apply(lambda x : np.log(x))

From the distribution plots above, the 'CHAS' & 'B' features have many outliers. So, for better performance, we can remove those features.

X.drop(['CHAS','B'],axis=1,inplace=True)

Scaling the features to a standard normal distribution (zero mean, unit variance).

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = scaler.fit_transform(X).copy()
x.shape,y.shape
x = pd.DataFrame(x,columns=X.columns)
x.to_csv('X.csv',index=False)
y.to_csv('y.csv',index=False)
del x,y

[Deleted 'x' and 'y' to free up memory.]

Importing scaled CSV files.

X = pd.read_csv('X.csv')
y = pd.read_csv('y.csv')

The shape of the X CSV file is (483, 9) and of the y CSV file is (483, 1).

Making the Regression Model.

It is found that if a dataset has some curvature, a linear model that fits a perfect straight line will not give the best result. In that case, polynomial regression of degree two, three or more gives a better result, together with a regularisation factor lambda that can shrink the coefficients of the higher-degree features that are generated. In other words, with the regularisation factor (lambda) we can control overfitting or under-fitting of the model.

I have seen from the EDA that this dataset has some curvature, so I will go for the polynomial regression method of degree 2.

Importing Polynomial-features.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2,interaction_only=False)
x = poly.fit_transform(X)
x = pd.DataFrame(x)
x.shape

The shape of 'x' is (483, 55).

So, after adding the degree-2 terms, the number of features has increased to 55 (1 bias column + 9 linear terms + 9 squared terms + 36 pairwise products).

The equation should be like this :-
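In the usual notation, a degree-2 polynomial hypothesis over the nine scaled features has the general form

h_\theta(x) = \theta_0 + \sum_{j=1}^{9} \theta_j x_j + \sum_{j=1}^{9} \sum_{k=j}^{9} \theta_{jk}\, x_j x_k

which is exactly the set of columns that PolynomialFeatures(degree=2) generates.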

Splitting the datasets into train sets and test sets.

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
xtrain=np.array(X_train)
xtest=np.array(X_test)
ytrain=y_train
ytest=y_test
xtrain.shape,xtest.shape

shape of the xtrain  is (386 , 55)  and xtest  is (97,55).

Method-1.

Gradient Descent

In this case I have written the gradient-descent equations myself; the cost (MSE) on the training data gradually decreases with the number of iterations.

Gradient descent graph....

Here is my cost function without any regularisation factor :-
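In the standard notation, with m training examples, this cost is

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2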

However, the cost function with regularisation factor would be like this:-
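With an L2 penalty \lambda on every coefficient except the intercept \theta_0 (as taught in Andrew Ng's course), it becomes

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]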

Equation for Gradient Descent :-
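A standard form of the update, repeated for the chosen number of iterations, is

\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) \, x_j^{(i)}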

All the thetas are updated simultaneously at every iteration.

In this equation, alpha is the learning rate.

I have used the first equation for calculating the cost, to see how the model behaves without a regularisation factor.

In my code I also multiply the learning rate by a 'momentum' factor at every iteration, which effectively decays the learning rate as training proceeds.

def GradDiscent(xtrain,ytrain,xtest,ytest):
    m = len(xtrain)                     # number of training examples
    n = xtrain.shape[1]                 # number of features (incl. the bias column)
    yt = np.ravel(ytrain)               # flatten the targets to 1-D arrays
    yts = np.ravel(ytest)

    p = int(input('Enter no. of Iteration:'))
    alpha = float(input('Enter Learning rate(alpha):'))
    mom = float(input('Enter momentum :'))
    theta = pd.DataFrame(np.zeros([n,p+1]))   # one column of thetas per iteration
    f,(ax1,ax2) = plt.subplots(1,2,figsize=(20,7))

    # cost histories
    J = np.zeros(p+1)    # train cost per iteration
    J1 = np.zeros(p+1)   # test cost per iteration
    for i in range(1,p+1):
        # the learning rate is scaled by the 'momentum' factor at every iteration
        alpha = alpha*mom
        error = np.dot(xtrain,theta[i-1]) - yt                     # residuals on the train data
        theta[i] = theta[i-1] - alpha*np.dot(xtrain.T,error)/m     # batch gradient step
        J[i] = (1/(2*m))*np.sum((yt - np.dot(xtrain,theta[i]))**2)              # cost for train data
        J1[i] = (1/(2*len(xtest)))*np.sum((yts - np.dot(xtest,theta[i]))**2)    # cost for test data
    Theta = pd.DataFrame(np.array(theta[p]),columns=['Theta'],index=np.arange(n))  # final thetas

    # plotting cost-vs-iteration
    ax1.plot(np.arange(1,p),J[2:],color='r')
    ax2.plot(np.arange(1,p),J1[2:],color='g')
    ax1.set_xlabel('No. of Iteration')
    ax1.set_ylabel('MSE, Cost for Train Data')
    ax1.set_title('Gradient descent For Train Data Set')
    ax2.set_xlabel('No. of Iteration')
    ax2.set_ylabel('MSE, Cost for Test Data')
    ax2.set_title('Gradient descent for Test Data Set')
    sns.despine(left=True)

    # reporting MSE and RMSE loss on the test data
    print('\nMSE  from GD: ',J1[p])
    print('RMSE from GD: ',np.sqrt(J1[p]))
    print('\n')
    return Theta

def Prediction(theta,xtest,ytest):
    thet = theta['Theta']
    PtestGD = np.dot(xtest,thet)           # predictions for the test data
    yts = np.ravel(ytest)                  # flatten the targets to a 1-D array
    lim = int(round(float(yts.max())))
    p4 = np.arange(lim)                    # reference line: predicted == original
    y4 = np.arange(lim)

    # plotting predicted-vs-original and the residual distribution
    f,(ax1,ax2) = plt.subplots(1,2,figsize=(20,8))

    ax1.scatter(PtestGD,yts,color='g')
    ax1.plot(p4,y4,color='y')
    ax1.set_xlabel('Predicted Data Set')
    ax1.set_ylabel('Original Data Set')
    ax1.set_title('Comparison Graph')
    sns.despine(left=True)

    sns.distplot((PtestGD-yts),color='red',bins=10,ax=ax2)
    ax2.set_xlabel('[VALUE - PREDICTED VALUE]')
    ax2.set_title('Distribution Plot for Predicted & Test Data')
    ax2.grid()
    sns.despine(left=True)
theta = GradDiscent(xtrain,ytrain,xtest,ytest)

Result...

Prediction(theta,xtest,ytest)

Conclusion:

We can see that the result is not up to the mark.

The main reason is that I have not added a regularisation factor to the cost function, even though my regression equation is a polynomial of degree 2.

It is not the fault of the equation itself: the cost I applied here is the one for a purely linear (straight-line) model, while my model is a polynomial of degree 2.

So the model has an overfitting problem. To balance it, a regularisation factor is needed to shrink the higher-degree coefficients, but I have not used one.

For the next method I will use the regularisation factor to balance this bias-variance problem.

Method-2.

NORMAL EQUATION MODEL

With the normal equation we can also calculate the thetas (intercept & coefficients) in a single step, using matrix multiplication together with the transpose and inverse of a matrix.

The equation would be like this (for absolute linear model) : 
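In the standard notation, with design matrix X (including the leading column of ones) and target vector y, it is

\theta = (X^{T} X)^{-1} X^{T} y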

In the above equation no regularisation factor is included.

As my regression model is a polynomial of degree 2, I have to apply a regularisation factor, which shrinks the coefficients of the higher-degree features and so controls overfitting of the model.

The normal equation with regularisation factor is like this :
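In the standard regularised form it is

\theta = \bigl( X^{T} X + \lambda L \bigr)^{-1} X^{T} y

where L is the identity matrix with its top-left entry set to zero, so that the intercept is not regularised (this is the 'rg' matrix built in the code below).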

In most cases these two equations are used to calculate the intercept and the feature coefficients directly, as I learnt from Prof. Andrew Ng.

So, I used the second equation with regularisation factor lambda.

The first advantage of this equation is that it works very well for up to about 10,000 features.

The second advantage is that it is very fast: in a single step it computes the intercept as well as the other thetas.

The Normal-Equation model is implemented in the 'normalEq()' function.

def normalEq(xtrain,ytrain,xtest,ytest):
    m = len(xtrain)
    n = len(xtrain.T)
    lam = float(input('Enter Regularisation Factor(Lambda): '))
    # regularisation matrix: identity with the [0,0] entry zeroed,
    # so the intercept (bias column) is not regularised
    rg = np.eye(n)
    rg[0,0] = 0
    # normal equation, directly calculating thetas (intercept & coefficients)
    Theta = np.dot(np.dot(np.linalg.inv(np.dot(xtrain.T,xtrain) + rg*lam),xtrain.T),ytrain)

    Ptest = np.dot(xtest,Theta)   # predicting for test data
    Ptrain = np.dot(xtrain,Theta) # predicting for train data
                
    from sklearn.metrics import r2_score,mean_squared_error
    mse = mean_squared_error(Ptest,ytest) 
    score = r2_score(Ptest,ytest)
    mse1 = mean_squared_error(Ptrain,ytrain)
    score1 = r2_score(Ptrain,ytrain)
    
    print('\n')
    print('*'*45)
    print('MSE,for test data     : ',mse)
    print('RMSE,for test data    : ',np.sqrt(mse))
    print('R2-score for test data: ',round(score*100,2),'%')
    print('*'*45)
    
    print('\n')
    print('*'*45)
    print('MSE,for train data     : ',mse1)
    print('RMSE,for train data    : ',np.sqrt(mse1))
    print('R2-score for train data: ',round(score1*100,2),'%')
    print('*'*45)
    
    # plotting the test-data results
    f,(ax1,ax2) = plt.subplots(1,2,figsize=(20,10))

    ax1.scatter(Ptest,ytest,color='g')
    ax1.plot([0,Ptest.max()],[0,np.ravel(ytest).max()],color='y')
    ax1.set_xlabel('Predicted Data Set')
    ax1.set_ylabel('Original Data Set')
    ax1.set_title('Comparison Graph')
    sns.despine(left=True)
    ax1.grid()

    sns.distplot(np.ravel(Ptest)-np.ravel(ytest),color='r',bins=10,ax=ax2)
    ax2.set_title('Distribution Plot for Predicted & Testing Data Set')
    ax2.grid()
    sns.despine(left=True)
    
    # plotting the train-data results
    f,(ax1,ax2) = plt.subplots(1,2,figsize=(20,10))

    ax1.scatter(Ptrain,ytrain,color='g')
    ax1.plot([0,Ptrain.max()],[0,np.ravel(ytrain).max()],color='y')
    ax1.set_xlabel('Predicted Data Set')
    ax1.set_ylabel('Original Data Set')
    ax1.set_title('Comparison Graph')
    sns.despine(left=True)
    ax1.grid()

    sns.distplot(np.ravel(Ptrain)-np.ravel(ytrain),color='r',bins=10,ax=ax2)
    ax2.set_title('Distribution Plot for Predicted & Training Data Set')
    ax2.grid()
    sns.despine(left=True)
    return pd.DataFrame(Theta).T
theta = normalEq(xtrain,ytrain,xtest,ytest)

So, here we can see from the NORMAL EQUATION MODEL that the accuracy is much better now: the R2-score is 87.49% for the test data and 86.25% for the train data.

Let's see it more clearly by visualisation graphs...

So, from these graphs we can see that the accuracy is really good.

Output

print('Values of Intercept(col=0) and coefficients :-') 
theta

So, by adding the regularisation factor I got a very good result in just a single step.

I also get the value of the intercept and the coefficient for each of the features.

Thank you so much for reading this..... 
