Step 1 — The objective
Ridge minimization problem

Same squared loss, but the penalty is λΣβⱼ² (the squared L2 norm) instead of the L1 penalty λΣ|βⱼ| used by the Lasso. The squared penalty is smooth everywhere — no kinks — which makes the calculus straightforward.
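Written out, a sketch of the objective (intercept omitted; the ½ scaling on the loss is an assumption, chosen so the single-predictor algebra below matches the stated shrinkage factor):

L(β) = ½ Σᵢ (yᵢ − Σⱼ xᵢⱼβⱼ)² + λ Σⱼ βⱼ²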

Step 2 — Simplify to one predictor
With a single standardized predictor the objective reduces to a function of one coefficient β.

Both the loss and the penalty are smooth parabolas, and completing the square shows that L(β) is itself a single smooth parabola centered at β̃/(1+2λ) — one minimum, findable with one derivative.
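Concretely, assuming that ½ scaling and a predictor standardized so its OLS coefficient is β̃, the reduced objective and its completed square are:

L(β) = ½(β − β̃)² + λβ² + C
     = (½ + λ)·(β − β̃/(1 + 2λ))² + C′

where C and C′ collect all terms that do not involve β.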

Steps 3–6 — Standard calculus, no cases needed
Because L(β) is differentiable everywhere, set dL/dβ = 0 and solve
Step 3 — differentiate
Step 4 — set equal to zero
Step 5 — collect β terms
Step 6 — solve for β̂

No subgradients. No case analysis. Four lines of algebra give the exact answer for every value of λ.
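Under the same ½ scaling assumption, those four lines read:

Step 3:  dL/dβ = (β − β̃) + 2λβ
Step 4:  (β − β̃) + 2λβ = 0
Step 5:  β(1 + 2λ) = β̃
Step 6:  β̂ = β̃ / (1 + 2λ)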

Step 7 — Interpreting the solution

The factor 1/(1+2λ) is always in (0, 1) for λ > 0 — Ridge multiplies the OLS estimate by this fraction. As λ → 0 the factor → 1 (recovers OLS). As λ → ∞ the factor → 0 (β̂ → 0). But it never reaches exactly zero for any finite λ.
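As a worked example at the slider defaults shown below (β̃ = 1.00, λ = 0.50):

β̂ = 1.00 / (1 + 2·0.50) = 1.00 / 2 = 0.50

so exactly half of the OLS estimate survives; a larger λ shrinks it further without ever reaching zero.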

Interactive — see proportional shrinkage live: sliders for the OLS estimate β̃ (default 1.00) and the penalty λ (default 0.50) display the shrinkage factor, the Ridge solution β̂, and the applied formula.
Closed-form solution

In matrix form for p predictors: β̂_Ridge = (XᵀX + λI)⁻¹Xᵀy. The λI ensures the matrix is always invertible — a key advantage over OLS when predictors are collinear.
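A minimal numerical sketch of this closed form with NumPy; the data here is synthetic and illustrative, not from the page:

```python
import numpy as np

# Illustrative data: 50 observations, 2 nearly collinear predictors (assumed)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=50)])
y = x1 + rng.normal(size=50)

lam = 0.5
p = X.shape[1]

# Ridge closed form: (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)  # finite, stable estimates even though X^T X is near-singular
```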

Step 1 — The objective
Lasso minimization problem

Combines the RSS (squared loss) with an L1 penalty on the coefficients. Unlike Ridge, the absolute-value penalty λΣ|βⱼ| can force some coefficients to exactly zero — enabling variable selection.
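Written out with the same ½ scaling assumed in the Ridge derivation (so the thresholds below come out at exactly λ):

L(β) = ½ Σᵢ (yᵢ − Σⱼ xᵢⱼβⱼ)² + λ Σⱼ |βⱼ|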

Step 2 — Simplify to one predictor
With a single standardized predictor the objective reduces to a function of one coefficient β.

Completing the square (all terms not involving β absorbed into a constant C) leaves two shapes competing: a parabola centered at the OLS estimate β̃ (the loss) and a V-shape centered at 0 (the penalty).
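Under that scaling the reduced objective is:

L(β) = ½(β − β̃)² + λ|β| + C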

Step 3 — Differentiate in the two smooth regions
L(β) is smooth everywhere except at β = 0 — differentiate each piece

The derivative of L(β) exists for all β ≠ 0, so split into the regions β > 0 and β < 0 and differentiate each piece separately.

Setting each piece to zero gives a candidate minimizer in that region. But at β = 0 the standard derivative does not exist — the absolute value creates a kink. We use the subgradient of |β| instead: {sign(β)} for β ≠ 0, and the whole interval [−1, 1] at β = 0.

The optimality condition for L(β) at any point β̂ is that zero must belong to the subgradient of L evaluated at β̂: 0 ∈ ∂L(β̂).
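Spelled out under the same ½ scaling assumption:

dL/dβ = (β − β̃) + λ   for β > 0
dL/dβ = (β − β̃) − λ   for β < 0

∂|β| = {sign(β)} for β ≠ 0,   ∂|β| = [−1, 1] at β = 0

0 ∈ ∂L(β̂) = { (β̂ − β̃) + λs : s ∈ ∂|β̂| }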

Step 4 — Three cases from the optimality condition
Solve the optimality condition in each region — setting dL/dβ = 0 for β ≠ 0, and checking the subgradient condition at β = 0
Case 1 — β̃ > λ: the stationary point of the β > 0 piece is β̂ = β̃ − λ, which is indeed positive, so it is the minimizer.

Case 2 — β̃ < −λ: the stationary point of the β < 0 piece is β̂ = β̃ + λ, which is indeed negative, so it is the minimizer.

Case 3 — |β̃| ≤ λ: neither smooth piece has a stationary point inside its own region; the subgradient condition holds at β = 0, so β̂ = 0.
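Checking the boundary case explicitly, under the same scaling:

at β̂ = 0:   0 ∈ (0 − β̃) + λ·[−1, 1]   ⟺   β̃ ∈ [−λ, λ]   ⟺   |β̃| ≤ λ

so β̂ = 0 is optimal exactly when the OLS estimate is no larger in magnitude than the penalty.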

Interactive — drag the sliders: the OLS estimate β̃ (default 1.00) and the penalty λ (default 0.50) determine the active case, the Lasso solution β̂, and the applied formula.
Step 5 — Unified closed form

All three cases combine into the soft-thresholding operator S(β̃, λ) = sign(β̃)·max(|β̃| − λ, 0). It shrinks the OLS estimate toward zero by λ and sets it exactly to zero when |β̃| ≤ λ. This is what produces sparse models.
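A minimal sketch of the operator in code (the function name soft_threshold is illustrative, not from the page):

```python
import numpy as np

def soft_threshold(beta_ols: float, lam: float) -> float:
    """Soft-thresholding: sign(beta_ols) * max(|beta_ols| - lam, 0)."""
    return float(np.sign(beta_ols) * max(abs(beta_ols) - lam, 0.0))

print(soft_threshold(1.0, 0.5))   # Case 1: beta_ols >  lam   ->  0.5
print(soft_threshold(-1.0, 0.5))  # Case 2: beta_ols < -lam   -> -0.5
print(soft_threshold(0.3, 0.5))   # Case 3: |beta_ols| <= lam ->  0.0
```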

The core contrast
Same structure, different penalty shape — completely different behavior
Ridge (L2)

Penalty λβ² is a smooth parabola — differentiable everywhere. Standard derivative gives a one-step closed form. Coefficients shrink proportionally toward zero but never reach it.

Lasso (L1)

Penalty λ|β| has a kink at 0 — non-differentiable there. Requires subgradient analysis and three cases. Coefficients shrink by a fixed amount and can reach exactly zero.

Property | Ridge | Lasso
Penalty term | λΣβ² (L2) | λΣ|β| (L1)
Differentiable? | Yes — everywhere | No — kink at β = 0
Calculus tool | Standard derivative | Subgradient + 3 cases
Shrinkage type | Multiplicative: β̂ = β̃ · 1/(1+2λ) | Translational: β̂ = sign(β̃)·max(|β̃|−λ, 0)
Exact zeros? | Never (finite λ) | Yes, when |β̃| ≤ λ
Variable selection? | No | Yes
Matrix form | (XᵀX + λI)⁻¹Xᵀy | No general closed form
Best used when | Many small effects; collinear predictors | Sparse true model; variable selection needed
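The contrast is easy to see numerically. A short sketch, evaluating the two closed forms above on an illustrative grid of OLS estimates:

```python
import numpy as np

lam = 0.5
beta_ols = np.linspace(-1.0, 1.0, 9)   # illustrative grid of OLS estimates

ridge = beta_ols / (1 + 2 * lam)                                    # multiplicative shrinkage
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0)   # translational shrinkage

for b, r, l in zip(beta_ols, ridge, lasso):
    print(f"beta_ols={b:+.2f}  ridge={r:+.3f}  lasso={l:+.3f}")
# Ridge is never exactly zero for beta_ols != 0; Lasso is zero whenever |beta_ols| <= lam.
```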
Interactive comparison — same β̃, same λ, different penalties: sliders for β̃ (default 1.00) and λ (default 0.50) show Ridge β̂ = β̃ / (1+2λ) next to Lasso β̂ = sign(β̃)·max(|β̃|−λ, 0).