BUA 302 · Predictive & Prescriptive Business Analytics | Interactive reference
Same squared loss, but the penalty is λβ² (L2 norm squared) instead of λ|β|. The squared penalty is smooth everywhere — no kinks — which makes the calculus straightforward.
Both the loss and the penalty are smooth parabolas, so the whole objective can be minimized by completing the square.
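A minimal sketch of that algebra. It assumes the single-predictor objective is written as L(β) = ½(β − β̃)² + λβ², where β̃ is the OLS estimate; the ½ on the loss is the scaling convention under which the factor below comes out as 1/(1+2λ).

```latex
\begin{aligned}
L(\beta) &= \tfrac{1}{2}\bigl(\beta - \tilde{\beta}\bigr)^{2} + \lambda\beta^{2} \\
         &= \tfrac{1}{2}(1 + 2\lambda)\,\beta^{2} - \tilde{\beta}\,\beta + \tfrac{1}{2}\tilde{\beta}^{2} \\
         &= \tfrac{1}{2}(1 + 2\lambda)\Bigl(\beta - \tfrac{\tilde{\beta}}{1 + 2\lambda}\Bigr)^{2} + C
\end{aligned}
```

Here C collects the terms that do not involve β.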
L(β) is a single smooth parabola centered at β̃/(1+2λ) — one minimum, findable with one derivative.
No subgradients. No case analysis. Four lines of algebra give the exact answer for every value of λ.
The factor 1/(1+2λ) is always in (0, 1) for λ > 0 — Ridge multiplies the OLS estimate by this fraction. As λ → 0 the factor → 1 (recovers OLS). As λ → ∞ the factor → 0 (β̂ → 0). But it never reaches exactly zero for any finite λ.
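A quick worked example of that behaviour, taking β̃ = 4 purely for illustration:

```latex
\lambda = 0.5:\ \hat{\beta} = \tfrac{4}{2} = 2, \qquad
\lambda = 2:\ \hat{\beta} = \tfrac{4}{5} = 0.8, \qquad
\lambda = 50:\ \hat{\beta} = \tfrac{4}{101} \approx 0.04
```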
In matrix form for p predictors: β̂Ridge = (XᵀX + λI)⁻¹Xᵀy. The λI ensures the matrix is always invertible — a key advantage over OLS when predictors are collinear.
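A minimal NumPy sketch of that closed form; the toy dataset and the function name ridge_closed_form are illustrative, not from the course materials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two nearly collinear predictors (purely illustrative)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X'X + lam*I)^(-1) X'y, computed via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(ridge_closed_form(X, y, lam=1.0))  # well-behaved despite the collinearity
```

Solving the linear system directly (rather than forming the inverse) is the numerically stable way to evaluate the same formula.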
Combines RSS (squared loss) with an L1 penalty on the coefficients. Unlike Ridge, the absolute value |βj| forces some coefficients to exactly zero — enabling variable selection.
Completing the square absorbs every term not involving β into a constant C.
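Under the same scaling assumption as the Ridge sketch (½ on the loss, one standardized predictor with Σxᵢ² = 1), the completed square is:

```latex
\begin{aligned}
L(\beta) &= \tfrac{1}{2}\sum_{i}\bigl(y_{i} - x_{i}\beta\bigr)^{2} + \lambda\lvert\beta\rvert \\
         &= \tfrac{1}{2}\bigl(\beta - \tilde{\beta}_{\mathrm{OLS}}\bigr)^{2} + \lambda\lvert\beta\rvert + C,
\qquad \tilde{\beta}_{\mathrm{OLS}} = \sum_{i} x_{i} y_{i}
\end{aligned}
```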
Two shapes are competing: a parabola centered at β̃OLS (loss) and a V-shape centered at 0 (penalty).
The derivative of L(β) exists for all β ≠ 0, so split the problem into the regions β > 0 and β < 0 and differentiate each piece.
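With the single-predictor objective assumed above, the two pieces are:

```latex
\begin{aligned}
\beta > 0 &: \quad L'(\beta) = \bigl(\beta - \tilde{\beta}_{\mathrm{OLS}}\bigr) + \lambda \\
\beta < 0 &: \quad L'(\beta) = \bigl(\beta - \tilde{\beta}_{\mathrm{OLS}}\bigr) - \lambda
\end{aligned}
```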
Setting each piece to zero gives a candidate minimizer in that region. But at β = 0 the standard derivative does not exist — the absolute value creates a kink. We need the subgradient of |β| instead.
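For the absolute value, the subgradient is the set of all valid tangent slopes at each point:

```latex
\partial\lvert\beta\rvert =
\begin{cases}
\{+1\} & \beta > 0 \\
[-1,\, 1] & \beta = 0 \\
\{-1\} & \beta < 0
\end{cases}
```

At the kink, every slope between −1 and 1 lies below the V, which is why the subgradient there is a whole interval rather than a single number.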
The optimality condition for L(β) at any point β̂ is that zero must belong to the subgradient of L evaluated at β̂.
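Spelling that out for the objective assumed above:

```latex
0 \in \partial L(\hat{\beta}) = \bigl(\hat{\beta} - \tilde{\beta}_{\mathrm{OLS}}\bigr) + \lambda\,\partial\lvert\hat{\beta}\rvert
```

Checking the condition region by region gives three cases:

```latex
\hat{\beta} =
\begin{cases}
\tilde{\beta}_{\mathrm{OLS}} - \lambda & \text{if } \tilde{\beta}_{\mathrm{OLS}} > \lambda \\
0 & \text{if } \lvert\tilde{\beta}_{\mathrm{OLS}}\rvert \le \lambda \\
\tilde{\beta}_{\mathrm{OLS}} + \lambda & \text{if } \tilde{\beta}_{\mathrm{OLS}} < -\lambda
\end{cases}
```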
These three cases collapse into the soft-thresholding operator S(β̃, λ) = sign(β̃)·max(|β̃| − λ, 0). It shrinks the OLS estimate toward zero by λ and sets it exactly to zero when |β̃| ≤ λ. This is what produces sparse models.
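A small Python sketch of the operator; the name soft_threshold is just a label used here:

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """S(beta_ols, lam): shrink toward zero by lam, and clip to exactly zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(soft_threshold(np.array([3.0, 0.4, -2.5]), lam=1.0))  # [ 2.   0.  -1.5]
```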
Ridge: penalty λβ² is a smooth parabola — differentiable everywhere. Standard derivative gives a one-step closed form. Coefficients shrink proportionally toward zero but never reach it.
Lasso: penalty λ|β| has a kink at 0 — non-differentiable there. Requires subgradient analysis and three cases. Coefficients shrink by a fixed amount and can reach exactly zero.
| Property | Ridge | Lasso |
|---|---|---|
| Penalty term | λΣβ² (L2) | λΣ\|β\| (L1) |
| Differentiable? | Yes — everywhere | No — kink at β = 0 |
| Calculus tool | Standard derivative | Subgradient + 3 cases |
| Shrinkage type | Multiplicative: β̂ = β̃ · 1/(1+2λ) | Translational: β̂ = sign(β̃)·max(\|β̃\|−λ, 0) |
| Exact zeros? | Never (finite λ) | Yes, when \|β̃\| ≤ λ |
| Variable selection? | No | Yes |
| Matrix form | (XᵀX + λI)⁻¹Xᵀy | No general closed form |
| Best used when | Many small effects; collinear predictors | Sparse true model; variable selection needed |
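A short scikit-learn sketch of the last two rows; the simulated data and the alpha values are illustrative choices, not course defaults (note that scikit-learn's internal scaling of λ differs slightly from the single-predictor convention above, but the qualitative contrast is the same):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)

# 10 predictors, but only the first two matter (a sparse true model)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))  # typically ~8
```

Ridge shrinks every coefficient but keeps all ten predictors; Lasso drives the irrelevant ones to exactly zero, which is the variable-selection behaviour in the table.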