The Elastic Net


Suggested Prerequisites


Notes:

  • Compromise between Ridge Regression and the Lasso

  • Helps to reduce overfitting by shrinking model coefficients towards zero and setting some of them exactly to zero

  • Arguably the most robust of the regularized linear regression methods, since it combines the strengths of both penalties

Loss Function and Optimization Problem

The associated loss function for the Elastic Net modifies the OLS loss function through the addition of both \(L_1\) and \(L_2\) penalties, whose overall intensity is controlled by the tuning parameter \(\lambda\) and whose mix is controlled by \(\alpha\):

\[ L(\mathbf{\beta}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{\beta}\|_2^2 + \lambda \left[ (1-\alpha)\frac{1}{2} \|\mathbf{\beta}\|_2^2 + \alpha \|\mathbf{\beta}\|_1 \right] \: \: \: \text{ with tuning parameters $\lambda \geq 0, \; 0 \leq \alpha \leq 1$ } \]

In this context, \(\alpha\) can be considered the parameter controlling the mix of the \(L_1\) and \(L_2\) penalties (with \(\alpha = 1\) recovering the Lasso and \(\alpha = 0\) recovering Ridge Regression), while \(\lambda\) controls the intensity of regularization to apply.
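
For concreteness, a minimal NumPy sketch of evaluating this loss might look as follows (function and argument names are illustrative, and the \(\frac{1}{2n}\) scaling of the fit term matches the optimization problem below):

```python
import numpy as np

def elastic_net_loss(X, y, beta, lam, alpha):
    """Evaluate the Elastic Net loss (illustrative helper, not a reference implementation).

    lam   -- overall regularization intensity, lambda >= 0
    alpha -- mix between the L1 and L2 penalties, 0 <= alpha <= 1
    """
    n = X.shape[0]
    residuals = y - X @ beta
    fit = np.sum(residuals ** 2) / (2 * n)          # squared-error term
    l2 = (1 - alpha) * 0.5 * np.sum(beta ** 2)      # Ridge-style penalty
    l1 = alpha * np.sum(np.abs(beta))               # Lasso-style penalty
    return fit + lam * (l2 + l1)
```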

Formulating the loss function as a least-squares optimization problem yields:

\[ \hat{\mathbf{\beta}} = \arg\min_{\mathbf{\beta}} L(\mathbf{\beta}) = \arg\min_{\mathbf{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{\beta}\|_2^2 + \lambda \left[ (1-\alpha)\frac{1}{2} \|\mathbf{\beta}\|_2^2 + \alpha \|\mathbf{\beta}\|_1 \right] \]

As with the Lasso, the \(L_1\) term makes the loss non-differentiable at zero, so there is no closed-form solution and an iterative optimization technique must be applied to obtain the coefficient estimates.

Pathwise Coordinate Descent

The algorithm is similar to that of the Lasso. Features should be standardized to have zero mean and unit variance. Each coefficient is then updated as:

\[ \beta_j = \frac{\mathbf{S}(\beta_j^*, \lambda\alpha)}{1 + \lambda(1-\alpha)} \]

where \(\beta_j^*\) denotes the simple (unpenalized) least-squares coefficient of feature \(j\) computed on the current partial residual, and \(\mathbf{S}\) is the same soft-thresholding operator applied in the case of the Lasso:

\[ \mathbf{S}(\beta_j^*, \lambda\alpha) = \text{sign}(\beta_j^*)\left(\left|\beta_j^*\right| - \lambda\alpha\right)_+ \]
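
As a sketch, the soft-thresholding operator and the resulting coordinate update translate directly to NumPy (here \(\beta_j^*\) is assumed to be passed in precomputed; the helper names are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_update(beta_j_star, lam, alpha):
    """Elastic Net update for one coefficient of a standardized feature."""
    return soft_threshold(beta_j_star, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```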

Furthermore, if warm starts are to be utilized, then \(\lambda_{\text{max}}\) (the smallest \(\lambda\) for which all coefficient estimates are zero) can be found as:

\[ \lambda_{\text{max}} = \frac{\max_l \left|\langle x_l, y \rangle \right|}{n\alpha} \]
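
For example, \(\lambda_{\text{max}}\) and a decreasing, log-spaced sequence of \(\lambda\) values (the usual choice for warm starts) could be generated as follows; `n_lambdas` and `eps` are illustrative defaults, features are assumed standardized, and \(\alpha > 0\) is assumed:

```python
import numpy as np

def lambda_sequence(X, y, alpha, n_lambdas=100, eps=1e-3):
    """Log-spaced path from lambda_max down to eps * lambda_max (alpha > 0 assumed)."""
    n = X.shape[0]
    lambda_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    return np.logspace(np.log10(lambda_max), np.log10(eps * lambda_max), n_lambdas)
```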

Implementation in Python Using NumPy

Warning

In practice, it is recommended to use a cross-validation technique such as K-fold cross-validation to choose the tuning parameter \(\lambda\) (and, when it is not fixed in advance, \(\alpha\) as well).

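A minimal, self-contained sketch of pathwise coordinate descent for the Elastic Net is given below. It assumes the columns of `X` are standardized, `y` is centered, and \(\alpha > 0\); all names, defaults, and the convergence check are illustrative rather than a reference implementation.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_coordinate_descent(X, y, lam, alpha, beta_init=None,
                                   tol=1e-6, max_iter=1000):
    """Cyclic coordinate descent for a single (lambda, alpha) pair.

    Assumes each column of X has zero mean and unit variance and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    residual = y - X @ beta                      # full residual, kept in sync with beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            x_j = X[:, j]
            # beta_j^*: least-squares coefficient of feature j on the partial residual
            beta_j_star = x_j @ residual / n + beta[j]
            new_beta_j = soft_threshold(beta_j_star, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            if new_beta_j != beta[j]:
                residual -= x_j * (new_beta_j - beta[j])
                max_change = max(max_change, abs(new_beta_j - beta[j]))
                beta[j] = new_beta_j
        if max_change < tol:
            break
    return beta

def elastic_net_path(X, y, alpha, n_lambdas=100, eps=1e-3):
    """Fit the full regularization path with warm starts (largest lambda first)."""
    n, p = X.shape
    lambda_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    lambdas = np.logspace(np.log10(lambda_max), np.log10(eps * lambda_max), n_lambdas)
    betas = np.empty((n_lambdas, p))
    beta = np.zeros(p)
    for k, lam in enumerate(lambdas):
        beta = elastic_net_coordinate_descent(X, y, lam, alpha, beta_init=beta)
        betas[k] = beta
    return lambdas, betas
```

A K-fold cross-validation wrapper would then refit the path on each training fold and score the held-out fold in order to pick \(\lambda\) (and, if desired, \(\alpha\)).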



Contributions made by our wonderful GitHub Contributors: @wyattowalsh