L1 & L2 Regularization

#regularization #lasso #ridge #regression

Ashwini Kumar · Apr 03, 2021 · 2 min read

While fitting a regression model to training data, there is a fair chance that the model becomes overfitted (high variance). Regularization techniques help address this problem by restricting the degrees of freedom of the model, i.e., by shrinking the model weights. 

General equation of regression model:  

                                 y = m1.x1 + m2.x2 + m3.x3 + ... + c    (c is the intercept)

For training the model, the cost function (for OLS) used is: 

                                 RSS (Residual Sum of Squares) = Σ(y - ŷ)²
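As a quick illustration of this cost, here is a minimal sketch (the values of y and the predictions ŷ are made up for this example) computing the RSS with NumPy:

```python
import numpy as np

# Hypothetical toy values: observed targets y and model predictions y_hat (ŷ).
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# RSS = Σ(y - ŷ)²: the sum of squared residuals.
rss = np.sum((y - y_hat) ** 2)
print(rss)  # 1.75
```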

Now, to regularize a model, a shrinkage penalty term is added to this cost function, which penalizes the model for any feature that does not contribute much and only makes it unnecessarily complex.

Let’s look at two popular regularization techniques and try to understand them better. 

  • LASSO (L1 regularization): Here, the model is penalized on the basis of the sum of the absolute values of its coefficients (|m1|, |m2|, |m3|, …). 

                                    Regularization term:   λ·Σ|m| 

                        So, the cost function to be minimized now becomes,

                                     C = Σ(y - ŷ)² + λ·Σ|m|, 

    where λ is the shrinkage factor, to be determined through cross-validation. 
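To make the pieces of this cost function concrete, here is a small sketch (all numbers are invented for illustration) that computes C = RSS + λ·Σ|m| by hand:

```python
import numpy as np

# Hypothetical coefficients m and shrinkage factor λ (lam).
m = np.array([2.0, -3.0, 0.5])
lam = 0.1

# Hypothetical residuals (y - ŷ) from some fitted model.
residuals = np.array([0.5, -1.0, 0.25, 0.75])

rss = np.sum(residuals ** 2)          # Σ(y - ŷ)²
l1_penalty = lam * np.sum(np.abs(m))  # λ·Σ|m|
cost = rss + l1_penalty               # C = RSS + L1 penalty
print(cost)
```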

  • Ridge (L2 regularization): Here, the model is penalized on the basis of the sum of the squares of its coefficients (m1², m2², m3², …). 

                                    Regularization term:   λ·Σm² 

                         So, the cost function to be minimized now becomes,  

                                      C = Σ(y - ŷ)² + λ·Σm², 

    where λ is the shrinkage factor, to be determined through cross-validation. 
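In practice one rarely minimizes this by hand; for example, scikit-learn's Ridge estimator applies an L2 penalty of this form, with its alpha parameter playing the role of λ. A minimal sketch on synthetic data (invented for this example):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: y depends only on the first two of three features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha plays the role of the shrinkage factor λ.
ridge = Ridge(alpha=10.0).fit(X, y)

# All coefficients are shrunk toward zero, but none becomes exactly zero.
print(ridge.coef_)
```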

Difference between L1 and L2 Regularization: 

Clearly, the difference between the two lies in the penalty term. In L1, we take the sum of the absolute values of the coefficients, whereas in L2, we take the sum of their squares. 

Now, both methods deal with overfitting and reduce model variance, but how do the results of the two methods differ? 

The difference is that in L2, the weights (coefficients) of the less significant features are shrunk very close to zero, but never exactly to zero. So, however insignificant a feature might be, it remains in the model as a predictor. 

In L1, on the other hand, the weights of non-significant features actually become zero; hence, along with regularizing the model, it also performs feature selection. 
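This contrast is easy to see empirically. The sketch below (synthetic data invented for illustration; scikit-learn's alpha parameter corresponds to λ) fits both models to data where only the first two of six features matter, and counts exact zeros among the fitted coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of six features are truly informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 drives the noise features to exactly zero (feature selection);
# L2 only shrinks them, leaving every coefficient nonzero.
print(np.sum(lasso.coef_ == 0))
print(np.sum(ridge.coef_ == 0))
```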

‘λ’ and the bias-variance tradeoff: 

Regularization helps to reduce the variance of the model without a substantial increase in bias. If the model has high variance, it fits the training data too closely and will not generalize to other datasets, i.e., it overfits. If the bias is too high, the model will not learn the training data well in the first place, leading to underfitting. 

The tuning parameter λ controls this bias-variance tradeoff. Increasing λ up to a certain limit reduces variance without losing any important properties of the data (i.e., no significant increase in bias). Beyond that limit, however, the model starts losing important properties, which increases the bias. Thus, selecting a good value of λ is important. 

The value of λ is selected using cross-validation: a set of candidate λ values is chosen, the cross-validation error is computed for each, and the λ that minimizes this error is selected. 
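This selection procedure is built into scikit-learn as LassoCV (RidgeCV works analogously for ridge; alpha again corresponds to λ). A minimal sketch on invented data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (invented for this sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=150)

# Try a grid of candidate λ (alpha) values; 5-fold cross-validation
# picks the one with the lowest average validation error.
model = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)
print(model.alpha_)  # the λ with minimum cross-validation error
```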
