# L1 & L2 Regularization

Ashwini Kumar · Apr 03, 2021

When fitting a regression model to training data, there is a fair chance that the model will overfit (high variance). Regularization techniques address this problem by restricting the model's degrees of freedom, i.e., by shrinking the model weights.

General equation of a regression model:

y = m1·x1 + m2·x2 + m3·x3 + ... + c

where m1, m2, m3, … are the model coefficients (weights) and c is the intercept.

To train the model, the cost function used (for OLS) is:

RSS (Residual Sum of Squares) = Σ(y − ŷ)^2

where ŷ is the model's prediction.
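As a quick illustration of this cost function, here is a minimal sketch computing RSS with NumPy; the target and prediction values are made up for the example:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # observed targets y
y_pred = np.array([2.8, 5.1, 6.7, 9.4])   # model predictions ŷ

# RSS = Σ (y − ŷ)^2
rss = np.sum((y_true - y_pred) ** 2)
print(round(rss, 2))   # → 0.3
```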

Now, to regularize the model, a shrinkage penalty term is added to this cost function. It penalizes the model for any feature that contributes little and only makes the model unnecessarily complex.

Let’s look at two popular regularization techniques and try to understand them better.

• LASSO (L1 regularization): Here, the model is penalized on the basis of the sum of the absolute values of the model coefficients (|m1|, |m2|, |m3|, …).
  Regularization term: λ*Σ|m|

So, the cost function to be minimized now becomes,

C = Σ(y − ŷ)^2 + λ*Σ|m|,

where λ is the shrinkage factor, to be determined through cross-validation.
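The following sketch shows both ideas at once using scikit-learn, where `LassoCV` picks the shrinkage factor by cross-validation (sklearn calls it `alpha`, and scales the RSS term by 1/(2n), so it is not numerically identical to the λ above, but it plays the same role). The data here is synthetic: only the first two of five features carry signal.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the other three are pure noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation over a grid of candidate λ values
lasso = LassoCV(cv=5).fit(X, y)
print(lasso.alpha_)   # the λ chosen by cross-validation
print(lasso.coef_)    # coefficients of the noise features are pushed toward 0
```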

• Ridge (L2 regularization): Here, the model is penalized on the basis of the sum of the squares of the model coefficients (m1^2, m2^2, m3^2, …).
  Regularization term: λ*Σ(m^2)

So, the cost function to be minimized now becomes,

C = Σ(y − ŷ)^2 + λ*Σ(m^2),

where λ is the shrinkage factor, to be determined through cross-validation.
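A minimal sketch of this cost function with scikit-learn's `Ridge`, on the same kind of synthetic data as before (`alpha` again stands in for λ, and the value 10.0 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the other three are pure noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
print(ridge.coef_)   # every coefficient shrinks toward 0, none becomes exactly 0
```

Note how the L2 penalty shrinks even the true coefficients (3.0 and 2.0) slightly below their OLS values, since larger weights are always charged by the λ*Σ(m^2) term.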

### Difference between L1 and L2 Regularization:

Clearly, the difference between the two lies in the penalty term: in L1 we take the sum of the absolute values of the coefficients, whereas in L2 we take the sum of their squares.

Now, both methods deal with overfitting and reduce model variance, but how do the results of the two methods differ?

The difference is that in L2, the weights (or coefficients) of the less significant features are shrunk very close to zero, but never exactly to zero. So, however insignificant a feature might be, it stays in the model as a predictor.

On the other hand, in L1, the weights of non-significant features actually become zero and hence, along with regularizing the model, it also does the job of feature selection.
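This contrast can be seen side by side on synthetic data where only the first three of ten features matter (the `alpha` values here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three features carry signal; the other seven are noise.
y = X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0.0))   # several exact zeros
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0.0))   # none
```

The L1 model drops the noise features entirely (their coefficients are exactly 0.0), while the L2 model keeps all ten features with small but nonzero weights.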