Sophia: Why Curvature-Aware Optimizers Matter for LLM Training
Training large language models is expensive. A single GPT-3-scale run costs millions in compute. So when a paper claims to halve the number of steps needed to reach the same loss, with only 5% overhead per step, it's worth paying attention.
Sophia, introduced by Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma at Stanford, is a second-order optimizer that achieves exactly this. It reaches the same pre-training loss as Adam in 50% fewer steps on GPT-2 models ranging from 125M to 770M parameters, translating to a roughly 2x wall-clock speedup.
The Problem: Anisotropic Loss Surfaces
To understand why Sophia works, you need to understand where Adam falls short.
The loss landscape of a large neural network is not a smooth bowl. It's highly anisotropic: steep in some directions and flat in others. The Hessian (the matrix of second derivatives) has eigenvalues spanning many orders of magnitude.
Adam partially addresses this by maintaining per-parameter adaptive learning rates based on the history of squared gradients. But squared gradients are a noisy, indirect proxy for curvature. In flat directions where the gradient happens to be large (e.g., near a saddle point), Adam can take steps that are too aggressive. In steep directions where the gradient is small, it can be too conservative.
The fundamental issue is that Adam's second moment estimate conflates gradient magnitude with curvature. They're related, but they're not the same thing.
Sophia's Approach: Diagonal Hessian Preconditioning
Sophia takes a more principled approach. Instead of using squared gradients as a curvature proxy, it directly estimates the diagonal of the Hessian (the actual second-order curvature information) and uses it as a preconditioner.
The update rule is deceptively simple. Sophia maintains an exponential moving average of the gradient (like Adam's first moment) and divides the update by the diagonal Hessian estimate. The per-parameter update becomes:
$$\theta_t = \theta_{t-1} - \eta \cdot \text{clip}\left(\frac{m_t}{h_t}, \rho\right)$$
where $m_t$ is the EMA of gradients and $h_t$ is the EMA of the diagonal Hessian estimate.
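The update above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the hyperparameter names and defaults (`lr`, `beta1`, `beta2`, `rho`, `eps`) are chosen for clarity, and a small `eps` guard is added so the division is safe when the Hessian estimate is near zero.

```python
def sophia_step(theta, grad, hess_diag, m, h,
                lr=1e-4, beta1=0.96, beta2=0.99, rho=0.04, eps=1e-12):
    """One Sophia-style update over parallel lists of floats.

    Illustrative sketch: hyperparameter names/defaults are not the
    paper's exact settings. m and h are updated in place.
    """
    new_theta = []
    for i in range(len(theta)):
        # EMAs of the gradient and of the diagonal Hessian estimate
        m[i] = beta1 * m[i] + (1 - beta1) * grad[i]
        h[i] = beta2 * h[i] + (1 - beta2) * hess_diag[i]
        # Precondition by curvature, then clip element-wise to [-rho, rho]
        ratio = m[i] / max(h[i], eps)
        update = max(-rho, min(rho, ratio))
        new_theta.append(theta[i] - lr * update)
    return new_theta
```

Note how the learning rate multiplies a clipped quantity: the effective step per parameter can never exceed `lr * rho`, regardless of how the gradient and Hessian estimates behave.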
The key insight is that the diagonal Hessian tells you the actual curvature along each parameter direction. In steep directions (large Hessian), Sophia takes smaller steps. In flat directions (small Hessian), it takes larger steps. This is precisely what you want for navigating anisotropic landscapes efficiently.
The Clipping Mechanism
There's a subtlety that makes Sophia practical: the element-wise clipping by $\rho$. The ratio $m_t / h_t$ can blow up when the Hessian estimate is near zero (flat directions), potentially causing catastrophic updates.
Sophia clips this ratio to $[-\rho, \rho]$, bounding the maximum per-parameter update size. This is more than just a safety measure: the authors show it provides a worst-case guarantee on the loss decrease per step, something Adam lacks. The clipping acts as an implicit trust region, ensuring stability without the complexity of full trust-region methods.
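A toy numeric example makes the point concrete. In a flat direction the raw Newton-style ratio explodes, but the clipped version is bounded by $\rho$ no matter how small the curvature estimate gets (the function name here is illustrative):

```python
def preconditioned_update(m, h, rho, eps=1e-12):
    # Raw Newton-style ratio; can blow up when curvature h is ~0
    ratio = m / max(h, eps)
    # Element-wise clip bounds the step to [-rho, rho]
    return max(-rho, min(rho, ratio))

# Flat direction: m/h would be 10 million without clipping,
# but the update is capped at rho = 0.05.
step = preconditioned_update(0.01, 1e-9, rho=0.05)
```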
Two Lightweight Hessian Estimators
Computing the full Hessian is out of the question for models with hundreds of millions of parameters. Sophia offers two practical estimators for the diagonal:
Hutchinson's estimator uses random vector products with the Hessian. Draw a random vector $u$ from a Rademacher distribution (each entry $\pm 1$ with equal probability), compute the Hessian-vector product $Hu$ via automatic differentiation, then $u \odot (Hu)$ gives an unbiased estimate of the diagonal. This requires one extra backward pass.
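As a sanity check, here is Hutchinson's trick on a toy quadratic whose Hessian-vector product is available in closed form; in real training, `hvp` would come from an extra automatic-differentiation pass. Because this toy Hessian is exactly diagonal, $u \odot (Hu) = d \odot u^2 = d$ every draw; the estimator's variance comes from off-diagonal terms, absent here. The function and argument names are illustrative.

```python
import random

def hutchinson_diag(hvp, dim, num_samples=200, seed=0):
    """Estimate the Hessian diagonal from Hessian-vector products.

    hvp(u) must return H @ u. Averaging u * (H u) over Rademacher
    draws of u gives an unbiased estimate of diag(H).
    """
    rng = random.Random(seed)
    est = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hu = hvp(u)
        for i in range(dim):
            est[i] += u[i] * hu[i]
    return [e / num_samples for e in est]

# Toy quadratic f(x) = 0.5 * x^T diag(d) x, so H @ u = d * u exactly.
d = [1.0, 10.0, 0.1]
diag_est = hutchinson_diag(lambda u: [di * ui for di, ui in zip(d, u)], dim=3)
```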
Gauss-Newton-Bartlett (GNB) estimator leverages the structure of language modeling losses. It samples a label from the model's predicted distribution and computes the gradient of the loss with respect to that sample. The squared gradient gives an estimate of the Gauss-Newton matrix diagonal, a positive semidefinite approximation to the Hessian that's often more stable.
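A small sketch of the GNB idea for a single softmax output, in plain Python rather than an autodiff framework (function names are illustrative). For cross-entropy, the gradient with respect to the logits is $p - e_{\hat{y}}$ when the label $\hat{y}$ is sampled from the model's own distribution $p$; averaging its element-wise square converges to $p_i(1 - p_i)$, the Gauss-Newton diagonal for the logits.

```python
import math
import random

def gnb_diag_logits(logits, num_samples=5000, seed=0):
    """GNB-style diagonal estimate for one softmax's logits.

    Sample y_hat from the model's own distribution p, form the
    cross-entropy gradient (p - one_hot(y_hat)), and average its
    square. Converges to p_i * (1 - p_i).
    """
    rng = random.Random(seed)
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    p = [e / total for e in exps]
    est = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(range(len(p)), weights=p)[0]  # y_hat ~ p
        for i in range(len(p)):
            g = p[i] - (1.0 if i == y else 0.0)
            est[i] += g * g
    return [e / num_samples for e in est]
```

In a real model the squared gradient is taken with respect to the parameters, not just the logits, but the sampling trick is the same: the labels come from the model, not the data.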
Both estimators add only about 5% overhead per step. Crucially, Sophia doesn't need to recompute the Hessian every step. Updating the estimate every 10 steps works well in practice, amortizing the cost further.
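The amortization pattern is just a scheduled refresh inside the training loop, roughly like this (a structural sketch; the function names and callback shapes are invented for illustration):

```python
def train_loop(steps, step_fn, hess_fn, k=10):
    """Refresh the expensive Hessian estimate only every k steps,
    reusing the cached value in between. k=10 is the cadence
    described above; step_fn and hess_fn are illustrative callbacks."""
    h = None
    for t in range(steps):
        if t % k == 0:
            h = hess_fn(t)   # the extra backward pass happens only here
        step_fn(t, h)        # every step still uses the cached estimate
```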
Scaling Properties
Perhaps the most exciting finding is how Sophia's advantage scales. The speedup over Adam grows as model size increases, from around 1.6x at 125M parameters to 2x at 770M. The authors attribute this to larger models having more anisotropic loss surfaces, where accurate curvature information matters more.
This scaling trend suggests that the benefits of curvature-aware optimization become increasingly important as we push toward larger models. If the trend holds, second-order methods like Sophia could become essential infrastructure for frontier model training.
Looking Forward
Sophia demonstrates that lightweight second-order information can meaningfully accelerate LLM training without exotic hardware or algorithmic complexity. The code is available on GitHub, and the method is straightforward to implement on top of existing training pipelines.
The broader lesson is that we may be leaving significant compute savings on the table by defaulting to Adam. As training runs grow more expensive, the optimization algorithm itself becomes a lever worth pulling.