Ridge Regression



Notes

  • Also known as Tikhonov Regularization

  • Helps reduce overfitting by lowering model variance: the penalty shrinks all coefficients towards zero.

  • Can be useful when there is high multicollinearity among the predictors

Loss Function and Optimization Problem

For the case of Ridge Regression, the OLS loss function is modified by the addition of an \(\mathbf{L}_2\) penalty with an associated tuning parameter, \(\lambda\):

\[ L(\mathbf{\beta}) = \|\mathbf{y} - \mathbf{X}\mathbf{\beta}\|_2^2 + \lambda\|\mathbf{\beta}\|_2^2 \: \: \: \text{ with tuning parameter $\lambda \geq 0$} \]

Using this function to formulate a least-squares optimization problem yields:

\[ \hat{\mathbf{\beta}} = \arg\min_{\mathbf{\beta}} L(\mathbf{\beta}) = \arg\min_{\mathbf{\beta}} \frac{1}{2n} \left( \|\mathbf{y}-\mathbf{X}\mathbf{\beta} \|_{2}^{2} + \lambda\|\mathbf{\beta}\|_2^2 \right) \]

Just like OLS, the \(\frac{1}{2n}\) factor is included to simplify the gradient (the \(\frac{1}{2}\)) and to let the objective converge to the expected model error by the Law of Large Numbers (the \(\frac{1}{n}\)). Since it is a positive constant, it does not change the minimizer, so the estimator below can be derived from \(L(\mathbf{\beta})\) directly.
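
As a small illustration, here is a minimal sketch of this scaled objective in Python with NumPy (the names ridge_objective, X, y, beta, and lam are illustrative, not from any particular library):

import numpy as np

def ridge_objective(X, y, beta, lam):
    """Scaled Ridge objective: (1/(2n)) * (||y - X beta||^2 + lam * ||beta||^2)."""
    n = X.shape[0]
    residual = y - X @ beta                          # residual vector
    return (residual @ residual + lam * beta @ beta) / (2 * n)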

Model Estimator

Setting the gradient of the loss function to zero and solving for the coefficient vector \(\hat{\mathbf{\beta}}\) yields the Ridge estimator:

\[ \hat{\mathbf{\beta}} = (\mathbf{X}^\mathbf{T}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\mathbf{T}\mathbf{y} \]
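
In more detail, the first-order condition behind this estimator is

\[ \nabla L(\mathbf{\beta}) = -2\mathbf{X}^\mathbf{T}(\mathbf{y} - \mathbf{X}\mathbf{\beta}) + 2\lambda\mathbf{\beta} = \mathbf{0} \implies (\mathbf{X}^\mathbf{T}\mathbf{X} + \lambda\mathbf{I})\mathbf{\beta} = \mathbf{X}^\mathbf{T}\mathbf{y}, \]

from which the estimator follows whenever \(\mathbf{X}^\mathbf{T}\mathbf{X} + \lambda\mathbf{I}\) is invertible (guaranteed for \(\lambda > 0\), as shown below).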

Proving Uniqueness of the Estimator

For \(\lambda > 0\), the Ridge problem is strongly convex, since its associated Hessian matrix is positive definite. This Hessian is:

\[ \mathbf{H} = 2\mathbf{X}^\mathbf{T}\mathbf{X} + 2 \lambda \mathbf {I} \]

Positive definiteness follows because, for any \(\mathbf{\beta} \neq \mathbf{0}\),

\[ \mathbf{\beta}^\mathbf{T} (\mathbf{X}^\mathbf{T}\mathbf{X} + \lambda \mathbf {I})\mathbf{\beta} = (\mathbf{X}\mathbf{\beta})^\mathbf{T}(\mathbf{X}\mathbf{\beta}) + \lambda \mathbf{\beta}^\mathbf{T}\mathbf{\beta} = \|\mathbf{X}\mathbf{\beta}\|_2^2 + \lambda \|\mathbf{\beta}\|_2^2 > 0 \]

Thus, the Ridge estimator is the unique global minimizer of the Ridge Regression problem. [1][2]
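
As a quick numerical sanity check of the argument above (a sketch only; the design matrix X and the value of \(\lambda\) are arbitrary choices), the eigenvalues of the Hessian can be computed directly and are all strictly positive when \(\lambda > 0\):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # arbitrary design matrix
lam = 0.1                                        # any positive tuning parameter
H = 2 * X.T @ X + 2 * lam * np.eye(X.shape[1])   # Hessian of the Ridge loss
print(np.linalg.eigvalsh(H))                     # every eigenvalue is > 0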


Implementation
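
A minimal, self-contained sketch of one possible implementation in Python with NumPy (the function name fit_ridge and all variable names are illustrative; the intercept is left unpenalized by centering the data, a common convention):

import numpy as np

def fit_ridge(X, y, lam):
    """Fit Ridge Regression with an unpenalized intercept.

    Solves (Xc^T Xc + lam * I) beta = Xc^T yc on centered data, then
    recovers the intercept from the column means.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean              # center so the intercept is not shrunk
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - X_mean @ beta
    return beta, intercept

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)
beta, intercept = fit_ridge(X, y, lam=1.0)
print(beta, intercept)

An equivalent estimator is also available off the shelf as sklearn.linear_model.Ridge (with its alpha parameter playing the role of \(\lambda\)), which is generally preferable in practice.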


Sources

1

UC Berkeley Fall 2020 CS 189 (Introduction to Machine Learning), Note 2. Sep 2020. URL: https://www.eecs189.org/static/notes/n2.pdf.

2

Anil Aswani. IEOR 165 – Engineering Statistics, Quality Control, and Forecasting, Lecture Notes 8. Jan 2021. URL: http://courses.ieor.berkeley.edu/ieor165/lecture_notes/ieor165_lec8.pdf.


Contributions made by our wonderful GitHub Contributors: @wyattowalsh