BUA 302 · Predictive & Prescriptive Business Analytics | Interactive reference
Same squared loss, but the penalty is λβ² (L2 norm squared) instead of λ|β|. The squared penalty is smooth everywhere — no kinks — which makes the calculus straightforward.
Both the loss and the penalty are smooth parabolas, so the whole objective can be minimized by completing the square.
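A minimal sketch of that algebra. It assumes the single-predictor objective is written as L(β) = ½(β − β̃)² + λβ², where β̃ is the OLS estimate; the ½ on the loss is the scaling convention under which the factor below comes out as 1/(1+2λ).

```latex
\begin{aligned}
L(\beta) &= \tfrac{1}{2}\bigl(\beta - \tilde{\beta}\bigr)^{2} + \lambda\beta^{2} \\
         &= \tfrac{1}{2}(1 + 2\lambda)\,\beta^{2} - \tilde{\beta}\,\beta + \tfrac{1}{2}\tilde{\beta}^{2} \\
         &= \tfrac{1}{2}(1 + 2\lambda)\Bigl(\beta - \tfrac{\tilde{\beta}}{1 + 2\lambda}\Bigr)^{2} + C
\end{aligned}
```

Here C collects the terms that do not involve β.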
L(β) is a single smooth parabola centered at β̃/(1+2λ) — one minimum, findable with one derivative.
No subgradients. No case analysis. Four lines of algebra give the exact answer for every value of λ.
The factor 1/(1+2λ) is always in (0, 1) for λ > 0 — Ridge multiplies the OLS estimate by this fraction. As λ → 0 the factor → 1 (recovers OLS). As λ → ∞ the factor → 0 (β̂ → 0). But it never reaches exactly zero for any finite λ.
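A quick worked example of that behaviour, taking β̃ = 4 purely for illustration:

```latex
\lambda = 0.5:\ \hat{\beta} = \tfrac{4}{2} = 2, \qquad
\lambda = 2:\ \hat{\beta} = \tfrac{4}{5} = 0.8, \qquad
\lambda = 50:\ \hat{\beta} = \tfrac{4}{101} \approx 0.04
```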
In matrix form for p predictors: β̂Ridge = (XᵀX + λI)⁻¹Xᵀy. The λI ensures the matrix is always invertible — a key advantage over OLS when predictors are collinear.
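A minimal NumPy sketch of that closed form; the toy dataset and the function name ridge_closed_form are illustrative, not from the course materials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two nearly collinear predictors (purely illustrative)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X'X + lam*I)^(-1) X'y, computed via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(ridge_closed_form(X, y, lam=1.0))  # well-behaved despite the collinearity
```

Solving the linear system directly (rather than forming the inverse) is the numerically stable way to evaluate the same formula.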
Combines RSS (squared loss) with an L1 penalty on the coefficients. Unlike Ridge, the absolute value |βj| forces some coefficients to exactly zero — enabling variable selection.
Completing the square absorbs every term not involving β into a constant C.
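Under the same scaling assumption as the Ridge sketch (½ on the loss, one standardized predictor with Σxᵢ² = 1), the completed square is:

```latex
\begin{aligned}
L(\beta) &= \tfrac{1}{2}\sum_{i}\bigl(y_{i} - x_{i}\beta\bigr)^{2} + \lambda\lvert\beta\rvert \\
         &= \tfrac{1}{2}\bigl(\beta - \tilde{\beta}_{\mathrm{OLS}}\bigr)^{2} + \lambda\lvert\beta\rvert + C,
\qquad \tilde{\beta}_{\mathrm{OLS}} = \sum_{i} x_{i} y_{i}
\end{aligned}
```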
Two shapes are competing: a parabola centered at β̃OLS (loss) and a V-shape centered at 0 (penalty).
The derivative of L(β) exists for all β ≠ 0, so split the problem into the regions β > 0 and β < 0 and differentiate each piece.
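With the single-predictor objective assumed above, the two pieces are:

```latex
\begin{aligned}
\beta > 0 &: \quad L'(\beta) = \bigl(\beta - \tilde{\beta}_{\mathrm{OLS}}\bigr) + \lambda \\
\beta < 0 &: \quad L'(\beta) = \bigl(\beta - \tilde{\beta}_{\mathrm{OLS}}\bigr) - \lambda
\end{aligned}
```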
Setting each piece to zero gives a candidate minimizer in that region. But at β = 0 the standard derivative does not exist — the absolute value creates a kink. We need the subgradient of |β| instead.
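For the absolute value, the subgradient is the set of all valid tangent slopes at each point:

```latex
\partial\lvert\beta\rvert =
\begin{cases}
\{+1\} & \beta > 0 \\
[-1,\, 1] & \beta = 0 \\
\{-1\} & \beta < 0
\end{cases}
```

At the kink, every slope between −1 and 1 lies below the V, which is why the subgradient there is a whole interval rather than a single number.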
The optimality condition for L(β) at any point β̂ is that zero must belong to the subgradient of L evaluated at β̂.
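Spelling that out for the objective assumed above:

```latex
0 \in \partial L(\hat{\beta}) = \bigl(\hat{\beta} - \tilde{\beta}_{\mathrm{OLS}}\bigr) + \lambda\,\partial\lvert\hat{\beta}\rvert
```

Checking the condition region by region gives three cases:

```latex
\hat{\beta} =
\begin{cases}
\tilde{\beta}_{\mathrm{OLS}} - \lambda & \text{if } \tilde{\beta}_{\mathrm{OLS}} > \lambda \\
0 & \text{if } \lvert\tilde{\beta}_{\mathrm{OLS}}\rvert \le \lambda \\
\tilde{\beta}_{\mathrm{OLS}} + \lambda & \text{if } \tilde{\beta}_{\mathrm{OLS}} < -\lambda
\end{cases}
```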
These three cases collapse into the soft-thresholding operator S(β̃, λ) = sign(β̃)·max(|β̃| − λ, 0). It shrinks the OLS estimate toward zero by λ and sets it exactly to zero when |β̃| ≤ λ. This is what produces sparse models.
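A small Python sketch of the operator; the name soft_threshold is just a label used here:

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """S(beta_ols, lam): shrink toward zero by lam, and clip to exactly zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(soft_threshold(np.array([3.0, 0.4, -2.5]), lam=1.0))  # [ 2.   0.  -1.5]
```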
Ridge: penalty λβ² is a smooth parabola — differentiable everywhere. Standard derivative gives a one-step closed form. Coefficients shrink proportionally toward zero but never reach it.
Lasso: penalty λ|β| has a kink at 0 — non-differentiable there. Requires subgradient analysis and three cases. Coefficients shrink by a fixed amount and can reach exactly zero.
| Property | Ridge | Lasso |
|---|---|---|
| Penalty term | λΣβ² (L2) | λΣ\|β\| (L1) |
| Differentiable? | Yes — everywhere | No — kink at β = 0 |
| Calculus tool | Standard derivative | Subgradient + 3 cases |
| Shrinkage type | Multiplicative: β̂ = β̃ · 1/(1+2λ) | Translational: β̂ = sign(β̃)·max(\|β̃\|−λ, 0) |
| Exact zeros? | Never (finite λ) | Yes, when \|β̃\| ≤ λ |
| Variable selection? | No | Yes |
| Matrix form | (XᵀX + λI)⁻¹Xᵀy | No general closed form |
| Best used when | Many small effects; collinear predictors | Sparse true model; variable selection needed |
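A short scikit-learn sketch of the last two rows; the simulated data and the alpha values are illustrative choices, not course defaults (note that scikit-learn's internal scaling of λ differs slightly from the single-predictor convention above, but the qualitative contrast is the same):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)

# 10 predictors, but only the first two matter (a sparse true model)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))  # typically ~8
```

Ridge shrinks every coefficient but keeps all ten predictors; Lasso drives the irrelevant ones to exactly zero, which is the variable-selection behaviour in the table.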