Optimization in Action

The test surface
$$\mathcal{L}(\theta_1,\,\theta_2) \;=\; \theta_1^2 \;+\; 10\,\theta_2^2$$

This is an elongated bowl: the curvature along $\theta_2$ is 10 times larger than along $\theta_1$. The unique minimum is $(\theta_1^*,\theta_2^*)=(0,0)$. The gradient is $\nabla\mathcal{L} = [2\theta_1,\; 20\theta_2]$, so the $\theta_2$ direction generates gradients 10 times larger than $\theta_1$. This ill-conditioning (condition number = 10) is exactly what breaks plain SGD and motivates adaptive optimizers: a single learning rate cannot simultaneously be “right” for a steep axis and a shallow one.
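The surface and its gradient are easy to check numerically. A minimal Python sketch (the names `loss` and `grad` are my own):

```python
# The test surface L(theta) = theta_1^2 + 10 * theta_2^2 and its gradient.
def loss(theta):
    t1, t2 = theta
    return t1 ** 2 + 10.0 * t2 ** 2

def grad(theta):
    t1, t2 = theta
    return [2.0 * t1, 20.0 * t2]

# At the demo's start point (-5, 3) the theta_2 component dominates:
g = grad([-5.0, 3.0])   # [-10.0, 60.0]
```

The 6:1 ratio between the gradient components at the start point is what every optimizer below has to cope with.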

[Interactive demo: contour plot over $(\theta_1,\,\theta_2)$ with playback-speed control, learning-rate slider $\eta = 0.10$, momentum slider $\beta = 0.90$, and a step counter.]
Press Play to start. All optimizers begin at $(-5,\;3)$ with the same learning rate. Watch how each one handles the steep $\theta_2$ direction differently.

The Algorithms

SGD
$$\theta_{t+1} = \theta_t - \eta \cdot \nabla\mathcal{L}(\theta_t)$$

The simplest rule: step in the direction of steepest descent, scaled by the learning rate $\eta$, with no memory of past gradients. On our elongated surface the $\theta_2$ gradient ($20\theta_2$) is so large that any $\eta$ big enough to make real progress along $\theta_1$ makes SGD overshoot and oscillate along $\theta_2$: the $\theta_2$ update is stable only when $\eta < 2/20 = 0.1$. Watch the red trail bounce vertically.
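A minimal sketch of the rule (the start point and $\eta = 0.1$ follow the demo's defaults; function names are my own):

```python
# Plain SGD on the bowl, using the gradient [2*theta_1, 20*theta_2].
def grad(theta):
    return [2.0 * theta[0], 20.0 * theta[1]]

def sgd_step(theta, eta):
    g = grad(theta)
    return [theta[i] - eta * g[i] for i in range(2)]

theta = [-5.0, 3.0]
for _ in range(2):
    theta = sgd_step(theta, eta=0.1)
# theta_1 decays by a factor of 0.8 per step (-5 -> -4 -> -3.2), while
# theta_2 sits exactly at the stability limit eta = 2/20 and just flips
# sign (3 -> -3 -> 3): the vertical bounce in the demo.
```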

SGD + Momentum
$$v_t = \beta\, v_{t-1} + (1-\beta)\,\nabla\mathcal{L}(\theta_t)$$ $$\theta_{t+1} = \theta_t - \eta\, v_t$$

Maintains a velocity $v_t$: a running average of past gradients weighted by $\beta$. Along $\theta_2$, consecutive gradients alternate sign and cancel in the average. Along $\theta_1$ they always agree and compound, accelerating progress. Watch the blue velocity arrow straighten out over time.
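The same loop with a velocity term, following the article's EMA form ($\eta = 0.1$ and $\beta = 0.9$ are the demo defaults, assumed here):

```python
# SGD with momentum: the velocity is an exponential moving average of
# past gradients, so sign-alternating theta_2 gradients cancel while
# consistent theta_1 gradients accumulate.
def grad(theta):
    return [2.0 * theta[0], 20.0 * theta[1]]

def sgd_momentum(theta, eta, beta, steps):
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [beta * v[i] + (1 - beta) * g[i] for i in range(2)]  # velocity EMA
        theta = [theta[i] - eta * v[i] for i in range(2)]
    return theta

theta = sgd_momentum([-5.0, 3.0], eta=0.1, beta=0.9, steps=200)
```

Unlike plain SGD at the same $\eta$, this converges on both axes: the oscillation on $\theta_2$ is damped rather than sustained.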

RMSProp
$$s_t = \beta\, s_{t-1} + (1-\beta)\,(\nabla\mathcal{L})^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \varepsilon}}\,\nabla\mathcal{L}(\theta_t)$$

Instead of tracking the gradient itself, tracks its squared running average $s_t$. Dividing by $\sqrt{s_t}$ normalises each dimension independently. $\varepsilon$ prevents division by zero when $s_t \approx 0$ in early steps. Both axes receive an appropriately sized update regardless of curvature.
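A sketch of the update ($\eta = 0.1$ follows the demo; $\beta = 0.9$ and $\varepsilon = 10^{-8}$ are common defaults, assumed here):

```python
import math

# RMSProp: track the EMA of squared gradients per coordinate and divide
# each update by its square root, so both axes take similar-sized steps.
def grad(theta):
    return [2.0 * theta[0], 20.0 * theta[1]]

def rmsprop(theta, eta, beta, eps, steps):
    s = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        s = [beta * s[i] + (1 - beta) * g[i] ** 2 for i in range(2)]
        # each coordinate is normalised by its own gradient scale
        theta = [theta[i] - eta * g[i] / math.sqrt(s[i] + eps) for i in range(2)]
    return theta

theta = rmsprop([-5.0, 3.0], eta=0.1, beta=0.9, eps=1e-8, steps=300)
```

Note the trade-off: with a fixed $\eta$, the normalised steps never shrink to zero, so the iterate chatters around the minimum at a scale set by $\eta$ rather than settling exactly.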

Adam
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\nabla\mathcal{L}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,(\nabla\mathcal{L})^2$$ $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t$$

Combines momentum (first moment $m_t$) with per-parameter scaling (second moment $v_t$). The bias-corrected $\hat{m}_t, \hat{v}_t$ offset the zero-initialisation artifact: in early steps the raw averages underestimate the true moments, and dividing by $1-\beta^t$ compensates. The defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ work well across most deep learning tasks.
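Putting the two moments and the bias correction together ($\eta = 0.1$ follows the demo; $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ are the usual defaults, assumed here):

```python
import math

# Adam: momentum (m) plus per-coordinate scaling (v), with bias
# correction so early steps are not shrunk by the zero initialisation.
def grad(theta):
    return [2.0 * theta[0], 20.0 * theta[1]]

def adam(theta, eta, beta1, beta2, eps, steps):
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * m[i] + (1 - beta1) * g[i] for i in range(2)]       # first moment
        v = [beta2 * v[i] + (1 - beta2) * g[i] ** 2 for i in range(2)]  # second moment
        m_hat = [m[i] / (1 - beta1 ** t) for i in range(2)]             # bias-corrected
        v_hat = [v[i] / (1 - beta2 ** t) for i in range(2)]
        theta = [theta[i] - eta * m_hat[i] / (math.sqrt(v_hat[i]) + eps)
                 for i in range(2)]
    return theta

theta = adam([-5.0, 3.0], eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500)
```

Without the `1 - beta ** t` division, the first updates would be roughly $(1-\beta_1)$ times too small in $m$ and far too small in $v$, since both averages start from zero.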