Optimization in Action
The loss surface is $\mathcal{L}(\theta_1,\theta_2) = \theta_1^2 + 10\,\theta_2^2$, an elongated bowl: the curvature along $\theta_2$ is 10 times larger than along $\theta_1$. The unique minimum is $(\theta_1^*,\theta_2^*)=(0,0)$, and the gradient is $\nabla\mathcal{L} = [2\theta_1,\; 20\theta_2]$, so at equal distances from the minimum the $\theta_2$ component of the gradient is 10 times larger than the $\theta_1$ component. This ill-conditioning (condition number $20/2 = 10$) is exactly what breaks plain SGD and motivates adaptive optimizers: a single learning rate cannot simultaneously be “right” for a steep axis and a shallow one.
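As a concrete reference for the sketches below, here is this surface and its gradient in NumPy (a minimal sketch; the evaluation point $(1,1)$ is an arbitrary choice for illustration):

```python
import numpy as np

def loss(theta):
    # elongated bowl: curvature 2 along theta_1, 20 along theta_2
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad(theta):
    # analytic gradient [2*theta_1, 20*theta_2]
    return np.array([2 * theta[0], 20 * theta[1]])

print(grad(np.array([1.0, 1.0])))  # [ 2. 20.] -- the theta_2 component is 10x larger
```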
The Algorithms
SGD
The simplest rule: step in the direction of steepest descent, scaled by the learning rate $\eta$, with no memory of past gradients: $\theta_{t+1} = \theta_t - \eta\,\nabla\mathcal{L}(\theta_t)$. On our elongated surface the $\theta_2$ gradient ($20\theta_2$) is so large that any $\eta$ big enough to make real progress along $\theta_1$ makes SGD overshoot and oscillate along $\theta_2$: each step multiplies $\theta_2$ by $(1 - 20\eta)$, so $\eta$ must stay below $0.1$, while $\theta_1$ shrinks by only $(1 - 2\eta)$ per step. Watch the red trail bounce vertically.
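A minimal sketch of this failure mode, reusing the gradient of the bowl above; $\eta = 0.09$ is an illustrative choice just below the $2/20 = 0.1$ divergence threshold of the steep axis:

```python
import numpy as np

def grad(theta):
    # gradient of L = theta_1^2 + 10*theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

eta = 0.09  # illustrative: just under the 0.1 stability limit for theta_2
theta = np.array([1.0, 1.0])
for t in range(5):
    theta = theta - eta * grad(theta)  # plain SGD: no memory of past steps
    print(theta)
# theta_2 is multiplied by (1 - 20*eta) = -0.8 each step, so it flips sign
# and bounces, while theta_1 shrinks by only (1 - 2*eta) = 0.82 per step
```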
Momentum
Maintains a velocity $v_t = \beta v_{t-1} + \nabla\mathcal{L}(\theta_t)$, an exponentially weighted memory of past gradients, and updates $\theta_{t+1} = \theta_t - \eta v_t$. Along $\theta_2$, consecutive gradients alternate in sign and cancel inside the velocity. Along $\theta_1$ they always agree and compound (for $\beta = 0.9$ the steady-state velocity is $10\times$ a single gradient), accelerating progress. Watch the blue velocity arrow straighten out over time.
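A sketch of the same loop with a velocity term, assuming the common heavy-ball form $v \leftarrow \beta v + g$; the hyperparameter values are illustrative:

```python
import numpy as np

def grad(theta):
    # gradient of L = theta_1^2 + 10*theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

eta, beta = 0.09, 0.9  # illustrative values
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for t in range(100):
    v = beta * v + grad(theta)  # exponentially weighted memory of past gradients
    theta = theta - eta * v
# along theta_2 the sign-flipping gradients largely cancel inside v, damping
# the bounce; along theta_1 they agree, so v grows toward grad / (1 - beta)
```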
RMSProp
Instead of tracking the gradient itself, RMSProp tracks a running average of its square, $s_t = \beta_2 s_{t-1} + (1-\beta_2)\,g_t^2$, and updates $\theta_{t+1} = \theta_t - \eta\, g_t / (\sqrt{s_t} + \varepsilon)$. Dividing by $\sqrt{s_t}$ normalises each dimension independently; $\varepsilon$ prevents division by zero when $s_t \approx 0$ in early steps. Both axes receive an appropriately sized update regardless of curvature.
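A sketch of the normalised update on the same surface; the hyperparameter values are illustrative, not the demo's settings:

```python
import numpy as np

def grad(theta):
    # gradient of L = theta_1^2 + 10*theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

eta, beta2, eps = 0.05, 0.9, 1e-8  # illustrative values
theta, s = np.array([1.0, 1.0]), np.zeros(2)
for t in range(100):
    g = grad(theta)
    s = beta2 * s + (1 - beta2) * g ** 2          # running average of squared gradients
    theta = theta - eta * g / (np.sqrt(s) + eps)  # per-dimension normalisation
# g / sqrt(s) has magnitude close to 1 on both axes, so each coordinate moves
# at roughly the same speed despite the 10x curvature gap
```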
Adam
Combines momentum (first moment $m_t$) and per-parameter scaling (second moment $v_t$), with bias correction $\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$ to offset the zero-initialisation artifact in early steps; the update is $\theta_{t+1} = \theta_t - \eta\,\hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$. The defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ work well across most deep learning tasks.
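And the full Adam update as a sketch on the same surface ($\eta = 0.05$ is an illustrative choice; the other constants are the standard defaults):

```python
import numpy as np

def grad(theta):
    # gradient of L = theta_1^2 + 10*theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 201):                    # t starts at 1 for bias correction
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g        # first moment: smoothed direction
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: per-parameter scale
    m_hat = m / (1 - beta1 ** t)           # undo the zero-initialisation bias,
    v_hat = v / (1 - beta2 ** t)           # which is largest in early steps
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
```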