Sophia: Why Curvature-Aware Optimizers Matter for LLM Training

By Prahlad Menon · 4 min read

Training large language models is expensive. A single GPT-3 scale run costs millions in compute. So when a paper claims to halve the number of steps needed to reach the same loss—with only 5% overhead per step—it’s worth paying attention.

Sophia, introduced by Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma at Stanford, is a second-order optimizer that achieves exactly this. It reaches the same pre-training loss as Adam in 50% fewer steps on GPT-2 models ranging from 125M to 770M parameters, translating to a roughly 2x wall-clock speedup.

The Problem: Anisotropic Loss Surfaces

To understand why Sophia works, you need to understand where Adam falls short.

The loss landscape of a large neural network is not a smooth bowl. It’s highly anisotropic—steep in some directions and flat in others. The Hessian (the matrix of second derivatives) has eigenvalues spanning many orders of magnitude.

Adam partially addresses this by maintaining per-parameter adaptive learning rates based on the history of squared gradients. But squared gradients are a noisy, indirect proxy for curvature. In flat directions where the gradient happens to be large (e.g., near a saddle point), Adam can take steps that are too aggressive. In steep directions where the gradient is small, it can be too conservative.

The fundamental issue is that Adam’s second moment estimate conflates gradient magnitude with curvature. They’re related, but they’re not the same thing.
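
For reference, Adam’s per-parameter step divides the gradient EMA by the square root of the squared-gradient EMA, not by any direct measure of curvature:

$$\theta_t = \theta_{t-1} - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$

where $m_t$ and $v_t$ are exponential moving averages of the gradient and the squared gradient. Keep this form in mind: Sophia’s update below differs mainly in what sits in the denominator.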

Sophia’s Approach: Diagonal Hessian Preconditioning

Sophia takes a more principled approach. Instead of using squared gradients as a curvature proxy, it directly estimates the diagonal of the Hessian—the actual second-order curvature information—and uses it as a preconditioner.

The update rule is deceptively simple. Sophia maintains an exponential moving average of the gradient (like Adam’s first moment) and divides the update by the diagonal Hessian estimate. The per-parameter update becomes:

$$\theta_t = \theta_{t-1} - \eta \cdot \text{clip}\left(\frac{m_t}{h_t}, \rho\right)$$

where $m_t$ is the EMA of gradients and $h_t$ is the EMA of the diagonal Hessian estimate.

The key insight is that the diagonal Hessian tells you the actual curvature along each parameter direction. In steep directions (large Hessian), Sophia takes smaller steps. In flat directions (small Hessian), it takes larger steps. This is precisely what you want for navigating anisotropic landscapes efficiently.
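
To make the update concrete, here is a minimal PyTorch-style sketch of that per-parameter step. The function name, default hyperparameters, and the `eps` guard are illustrative, not the authors’ reference implementation:

```python
import torch

def sophia_style_step(param, grad, m, h, lr=1e-4, beta1=0.96, rho=0.04, eps=1e-12):
    """One illustrative Sophia-style update for a single parameter tensor.
    `h` is the (previously refreshed) EMA of the diagonal Hessian estimate."""
    # EMA of gradients, analogous to Adam's first moment
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Precondition by the diagonal Hessian, then clip element-wise to [-rho, rho]
    update = torch.clamp(m / (h + eps), -rho, rho)
    # Descend along the clipped, preconditioned direction
    param.add_(update, alpha=-lr)
```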

The Clipping Mechanism

There’s a subtlety that makes Sophia practical: the element-wise clipping by $\rho$. The ratio $m_t / h_t$ can blow up when the Hessian estimate is near zero (flat directions), potentially causing catastrophic updates.

Sophia clips this ratio to $[-\rho, \rho]$, bounding the maximum per-parameter update size. This is more than just a safety measure—the authors show it provides a worst-case guarantee on the loss decrease per step, something Adam lacks. The clipping acts as an implicit trust region, ensuring stability without the complexity of full trust-region methods.
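
A quick numerical illustration (the values are made up) shows the effect: without clipping, a near-zero curvature estimate produces an enormous step, while a reasonable one passes through untouched.

```python
import torch

m = torch.tensor([0.02, 0.5])   # gradient EMA: a flat direction and a steep one
h = torch.tensor([1e-6, 20.0])  # near-zero curvature vs. large curvature
rho = 0.04

raw = m / h                         # tensor([2.0000e+04, 2.5000e-02]) -- the flat direction explodes
safe = torch.clamp(raw, -rho, rho)  # tensor([0.0400, 0.0250]) -- only the runaway entry is capped
```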

Two Lightweight Hessian Estimators

Computing the full Hessian is out of the question for models with hundreds of millions of parameters. Sophia offers two practical estimators for the diagonal:

Hutchinson’s estimator uses Hessian-vector products with random probe vectors. Draw a random vector $u$ from a Rademacher distribution (entries are $\pm 1$ with equal probability), compute the Hessian-vector product $Hu$ via automatic differentiation, and then $u \odot (Hu)$ gives an unbiased estimate of the diagonal. This requires one extra backward pass.
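
In PyTorch-style code this is only a few lines; the helper name is made up, and it assumes the loss graph supports a second backward pass:

```python
import torch

def hutchinson_diag(loss, params):
    """Unbiased Hessian-diagonal estimate via Hutchinson's trick (illustrative sketch)."""
    # First backward pass, keeping the graph so we can differentiate again
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Rademacher probe vectors: entries are +1 or -1 with equal probability
    us = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
    # Hessian-vector product: differentiate <grad, u> with respect to the parameters
    grad_dot_u = sum((g * u).sum() for g, u in zip(grads, us))
    hvps = torch.autograd.grad(grad_dot_u, params)
    # u * (Hu) is, in expectation, the diagonal of H
    return [u * hvp for u, hvp in zip(us, hvps)]
```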

Gauss-Newton-Bartlett (GNB) estimator leverages the structure of language modeling losses. It samples a label from the model’s predicted distribution and computes the gradient of the loss on that sampled label with respect to the parameters. The squared gradient then gives an estimate of the diagonal of the Gauss-Newton matrix, a positive semidefinite approximation to the Hessian that’s often more stable.
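
A hedged sketch of the same idea for a classification-style cross-entropy loss; `model`, the logits shape, and the batch-size scaling follow the paper’s description, but the code itself is illustrative:

```python
import torch
import torch.nn.functional as F

def gnb_diag(model, inputs):
    """Gauss-Newton-Bartlett diagonal estimate (illustrative sketch).
    Assumes model(inputs) returns logits of shape (batch, num_classes)."""
    logits = model(inputs)
    # Sample labels from the model's own predicted distribution (not the true labels)
    with torch.no_grad():
        sampled = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).squeeze(-1)
    # Gradient of the loss on the sampled labels, with respect to the parameters
    loss = F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    # Batch-size-scaled squared gradients estimate the Gauss-Newton diagonal
    batch_size = logits.shape[0]
    return [batch_size * g * g for g in grads]
```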

Both estimators add only about 5% overhead per step. Crucially, Sophia doesn’t need to recompute the Hessian every step. Updating the estimate every 10 steps works well in practice, amortizing the cost further.
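
In a training loop, the amortization might look roughly like this; `compute_loss` and the `update_hessian()` hook are assumed names for whatever refresh routine (Hutchinson or GNB) the optimizer uses:

```python
hessian_refresh_every = 10  # the paper reports that every 10 steps works well

for step, batch in enumerate(data_loader):
    loss = compute_loss(model, batch)   # assumed helper returning a scalar loss
    loss.backward()
    if step % hessian_refresh_every == 0:
        optimizer.update_hessian()      # assumed hook: re-estimate the diagonal Hessian EMA
    optimizer.step()
    optimizer.zero_grad()
```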

Scaling Properties

Perhaps the most exciting finding is how Sophia’s advantage scales. The speedup over Adam grows as model size increases—from around 1.6x at 125M parameters to 2x at 770M. The authors attribute this to larger models having more anisotropic loss surfaces, where accurate curvature information matters more.

This scaling trend suggests that the benefits of curvature-aware optimization become increasingly important as we push toward larger models. If the trend holds, second-order methods like Sophia could become essential infrastructure for frontier model training.

Looking Forward

Sophia demonstrates that lightweight second-order information can meaningfully accelerate LLM training without exotic hardware or algorithmic complexity. The code is available on GitHub, and the method is straightforward to implement on top of existing training pipelines.

The broader lesson is that we may be leaving significant compute savings on the table by defaulting to Adam. As training runs grow more expensive, the optimization algorithm itself becomes a lever worth pulling.