Linear Regression: A Beginner's Guide

#linearregression #datascience #machinelearning

Praneeth Kumar Jan 08 2021 · 2 min read

What is Linear Regression?

Linear regression is one of the fundamental machine learning algorithms, and it is usually the gateway through which beginners enter the field. It is a statistical technique used to model the relationship between an input (the independent variable) and an output (the dependent variable). The simplest form of the regression equation, with one dependent and one independent variable, is y = mx + c, where y is the estimated dependent variable score, c is the constant (intercept), m is the regression coefficient (slope), and x is the score on the independent variable.
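To make the formula concrete, here is a minimal sketch that estimates m and c from noisy synthetic data with NumPy (the data and the true coefficients are invented for illustration):

```python
# A minimal sketch: estimating m and c in y = mx + c with ordinary
# least squares on synthetic data. Names (x, y, m, c) mirror the formula.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)              # independent variable scores
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)     # true m = 3, c = 2, plus noise

# np.polyfit with degree 1 returns the slope (m) and intercept (c)
m, c = np.polyfit(x, y, 1)
print(f"estimated m = {m:.2f}, c = {c:.2f}")

# Estimate y for a new x using the fitted line
x_new = 4.0
print(f"predicted y at x = {x_new}: {m * x_new + c:.2f}")
```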

Figure: the best-fit line through the data points.

Three major uses for regression analysis are (1) determining the strength of predictors, (2) forecasting an effect, and (3) trend forecasting.

First, regression might be used to identify the strength of the effect that the independent variable(s) have on the dependent variable. Typical questions are: what is the strength of the relationship between dose and effect, sales and marketing spending, age and income, or area and profit?

Second, it can be used to forecast effects or impact of changes.  That is, the regression analysis helps us to understand how much the dependent variable changes with a change in one or more independent variables.  A typical question is, “how much additional sales income do I get for each additional $1000 spent on marketing?”

Third, regression analysis predicts trends and future values.  The regression analysis can be used to get point estimates.  A typical question is, “what will the price of gold be in 6 months?”
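As a hedged illustration of uses (2) and (3), the sketch below regresses sales income on marketing spend and reads the answer off the fitted coefficient. The figures are invented, and scikit-learn is assumed to be available:

```python
# Interpreting the coefficient as "extra sales per extra $1000 of
# marketing spend", then using the model for a point estimate.
import numpy as np
from sklearn.linear_model import LinearRegression

marketing = np.array([[10], [20], [30], [40], [50]])  # spend, in $1000s
sales = np.array([120, 190, 250, 330, 390])           # income, in $1000s

model = LinearRegression().fit(marketing, sales)
# coef_[0]: how much sales change per additional $1000 spent on marketing
print(f"each extra $1000 of marketing adds ~${model.coef_[0]:.1f}k in sales")
# Point estimate: forecast sales income at a $60k marketing spend
print(f"forecast at $60k spend: ${model.predict([[60]])[0]:.1f}k")
```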

Types of Regression

Simple linear regression

1 dependent variable (interval or ratio), 1 independent variable (interval or ratio, or dichotomous)

Multiple linear regression

1 dependent variable (interval or ratio), 2+ independent variables (interval or ratio, or dichotomous)

Logistic regression

1 dependent variable (dichotomous), 2+ independent variables (interval or ratio, or dichotomous)

Ordinal regression

1 dependent variable (ordinal), 1+ independent variables (nominal or dichotomous)

Multinomial regression

1 dependent variable (nominal), 1+ independent variables (interval or ratio, or dichotomous)

Discriminant analysis

1 dependent variable (nominal), 1+ independent variables (interval or ratio)

General Questions

Is scaling required for linear regression?

It depends on how you're solving for the optimal solution. Let's say you're performing an OLS (ordinary least squares) regression. There are two ways to solve for the optimal solution: the analytical (closed-form) solution, or an iterative algorithm such as gradient descent.

If you're using the analytical solution, feature scaling won't be of much use. In fact, you may want to refrain from feature scaling so that the coefficients remain interpretable in the original units. However, if you are using gradient descent, feature scaling helps the solution converge in a shorter period of time.

Note: both methods arrive at the same solution, since the problem is convex. In higher dimensions (when you have a lot of independent variables), using unscaled features with gradient descent is like rolling a ball down a taco shell: it bounces around inefficiently instead of heading straight for the bottom.
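A small sketch of this point, assuming scikit-learn: the analytical solver handles wildly different feature scales directly, while gradient descent (SGDRegressor here) is run on standardized features so it converges quickly. The data are synthetic:

```python
# Scaling matters for gradient descent but not for the analytical solution.
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200),         # feature on [0, 1]
                     rng.uniform(0, 10_000, 200)])   # feature on [0, 10000]
y = 5 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.1, 200)

# Analytical solution: no scaling needed, coefficients in original units
print("analytical:", LinearRegression().fit(X, y).coef_)

# Gradient descent on unscaled features needs a tiny learning rate or
# diverges; on standardized features it converges quickly.
X_scaled = StandardScaler().fit_transform(X)
sgd = SGDRegressor(max_iter=1000, tol=1e-6, random_state=0).fit(X_scaled, y)
print("SGD on scaled features:", sgd.coef_)  # coefficients in scaled units
```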

Is linear regression sensitive to outliers?

First, linear regression requires the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to them: a single outlier can change the slope of the fitted line and lead to errors in predictions. Removing or otherwise handling outliers is the best approach to getting good accuracy.

Figure: an example of outliers pulling the fitted line away from the trend.
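The sketch below makes the sensitivity concrete: adding a single extreme point noticeably pulls the fitted slope. The data are synthetic:

```python
# One outlier visibly changes the slope of the least-squares line.
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                    # clean data: slope 2, intercept 1
m_clean, _ = np.polyfit(x, y, 1)

x_out = np.append(x, 9.0)            # add a single outlier...
y_out = np.append(y, 100.0)          # ...far above the line
m_out, _ = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {m_clean:.2f}")   # ~2.00
print(f"slope with outlier:    {m_out:.2f}")     # pulled upward
```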

Assumptions

  • Linearity.
  • Homoscedasticity.
  • Independence.

Data Preparation

  • Transform data to obtain a linear relationship (e.g., a log transform for an exponential relationship).
  • Remove noise such as outliers.
  • Rescale inputs using standardization or normalization (see the sketch after this list).
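A minimal sketch of these preparation steps, assuming scikit-learn; the exponentially growing target is synthetic:

```python
# Log-transform an exponential target so it becomes roughly linear in x,
# then standardize the inputs to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(100, 1))
y = np.exp(0.5 * x[:, 0]) * rng.lognormal(0, 0.05, 100)  # exponential growth

y_log = np.log(y)                            # now roughly linear in x
x_std = StandardScaler().fit_transform(x)    # standardized inputs

print("corr(x, log y):", np.corrcoef(x[:, 0], y_log)[0, 1].round(3))
print("input mean/std:", x_std.mean().round(3), x_std.std().round(3))
```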
Use Cases

  • Sales forecasting.
  • Trend forecasting.
  • House price prediction.
  • Predicting salaries.
  • Automobiles.
  • Price of a product.

Advantages

  • Linear regression is simple to implement, and the output coefficients are easy to interpret.
  • When you know the independent and dependent variables have a linear relationship, this algorithm is the best choice because it is less complex than other algorithms.
  • Overfitting can be reduced with regularization (L1, L2); see the sketch below.
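As a minimal sketch of that last point, here is L2 (Ridge) and L1 (Lasso) regularization in scikit-learn; the data, with only two informative features out of ten, are synthetic:

```python
# Regularization shrinks coefficients to reduce overfitting; alpha
# controls the strength of the penalty.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))              # 10 features, only 2 matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 50)

ridge = Ridge(alpha=1.0).fit(X, y)         # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)         # L1: can zero out irrelevant ones

print("ridge:", ridge.coef_.round(2))
print("lasso:", lasso.coef_.round(2))      # most entries at or near 0
```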
Disadvantages

  • Prone to underfitting: the situation in which a machine learning model fails to capture the data properly. This happens when the hypothesis function cannot fit the data well.
  • Sensitive to outliers.
  • Very often the inputs aren't independent of each other, so any multicollinearity must be removed before applying linear regression; the sketch below shows one common check.
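One common multicollinearity check, sketched below, is the variance inflation factor (VIF) from statsmodels; the nearly duplicated feature is synthetic, and as a rule of thumb VIF values well above 5-10 flag a problematic input:

```python
# Computing VIFs: a high value means a feature is largely predictable
# from the other features (multicollinearity).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(0, 0.1, size=100)   # nearly a copy of x1
x3 = rng.normal(size=100)                      # independent feature

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # VIF expects an intercept
for i in range(1, X.shape[1]):                      # skip the constant column
    print(f"VIF for feature {i}: {variance_inflation_factor(X, i):.1f}")
```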