From Gradient Descent to Langevin Dynamics

By Prahlad Menon

Every deep learning practitioner knows stochastic gradient descent. You compute a gradient on a mini-batch, multiply by a learning rate, and step downhill. Repeat until converged.

But what if the noise in that process isn’t a nuisance to be eliminated — but a feature to be designed?

That’s the core insight behind Langevin dynamics, and it transforms SGD from a pure optimizer into something far more powerful: a sampler from a probability distribution over parameters.

SGD: Optimization with Accidental Noise

The standard SGD update is straightforward:

θ_{t+1} = θ_t − η_t · ∇L̃(θ_t)

where ∇L̃ is the gradient estimated from a mini-batch. The noise here is incidental — it comes from the variance between mini-batch gradients and the true gradient. As the learning rate η_t decays toward zero, both the signal and the noise shrink. The algorithm converges to a single point estimate: one minimum, no uncertainty quantification, no sense of what other good solutions might exist nearby.
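In code, the update is a single line. Here is a minimal sketch in NumPy, where grad_minibatch is a hypothetical helper that returns the mini-batch gradient of the loss at theta (a NumPy array):

    import numpy as np

    def sgd_step(theta, grad_minibatch, batch, lr):
        # Step downhill along the mini-batch gradient. The only
        # randomness comes from how the mini-batch was sampled.
        return theta - lr * grad_minibatch(theta, batch)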

This is fine for point estimation. But it leaves information on the table. The loss landscape of a neural network contains a rich geometry of minima, saddle points, and flat regions. SGD, by design, collapses all of that into one answer.

Langevin Dynamics: Noise by Design

Langevin dynamics starts from physics — specifically, the motion of a particle in a potential field subject to friction and thermal fluctuations. Adapted to optimization, the update becomes:

θ_{t+1} = θ_t − η_t · ∇L(θ_t) + √(2η_t) · ε_t, where ε_t ∼ N(0, I)

The critical difference is the last term. Instead of relying on mini-batch variance for randomness, Langevin dynamics injects Gaussian noise at every step, with standard deviation √(2η_t). This scaling is not arbitrary: it matches the detailed balance condition of the underlying Langevin diffusion, which guarantees, under mild assumptions and as the step size decays, that the iterates converge to the Gibbs (Boltzmann) distribution:

p(θ) ∝ exp(−L(θ))

This distribution concentrates mass on low-loss regions while still assigning nonzero probability to higher-loss areas. The algorithm doesn’t just find a minimum — it explores the landscape proportionally to how good each region is.

The consequences are significant. Langevin dynamics can escape shallow local minima that trap SGD. It naturally quantifies uncertainty by producing a distribution rather than a point. And it bridges two worlds: when the noise dominates, it behaves like a sampler; when the gradient dominates, it behaves like an optimizer.
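To make this concrete, here is a toy sketch that runs the Langevin update with a full gradient on a one-dimensional double-well loss, L(θ) = (θ² − 1)². The loss and the hyperparameters are assumptions chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_L(theta):
        # Gradient of L(θ) = (θ² − 1)²: minima at θ = ±1,
        # separated by a barrier at θ = 0.
        return 4.0 * theta * (theta**2 - 1.0)

    eta = 1e-3
    theta = -1.0          # start in the left well
    samples = []
    for _ in range(200_000):
        noise = np.sqrt(2.0 * eta) * rng.standard_normal()
        theta = theta - eta * grad_L(theta) + noise
        samples.append(theta)

    # Fraction of time spent right of the barrier: roughly 0.5
    # once the chain has mixed, as p(θ) ∝ exp(−L(θ)) predicts.
    print(np.mean(np.array(samples) > 0))

Plain gradient descent started at θ = −1 would stay in the left well forever; the Langevin iterates cross the barrier and visit both wells in proportion to their probability under the Gibbs distribution.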

SGLD: Making It Scale

Pure Langevin dynamics requires computing the full gradient ∇L(θ) over the entire dataset at each step — impractical for modern-scale problems. Welling and Teh (2011) resolved this with stochastic gradient Langevin dynamics (SGLD), which simply replaces the true gradient with a mini-batch estimate:

θ_{t+1} = θ_t − η_t · ∇L̃(θ_t) + √(2η_t) · ε_t

This looks almost identical to SGD. The only addition is the explicit noise term. But that small modification changes the algorithm’s asymptotic behavior entirely.
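Concretely, the SGD sketch from above gains exactly one line (grad_minibatch is the same hypothetical helper):

    import numpy as np

    rng = np.random.default_rng(0)

    def sgld_step(theta, grad_minibatch, batch, lr):
        # Identical to sgd_step except for the final noise term.
        noise = np.sqrt(2.0 * lr) * rng.standard_normal(theta.shape)
        return theta - lr * grad_minibatch(theta, batch) + noise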

In SGD, as η_t → 0, the mini-batch noise vanishes and the iterates freeze at a point. In SGLD, the injected noise also shrinks, but at a rate (√η_t) that dominates the mini-batch noise (which scales as η_t). The system keeps exploring even as the step size decays, and the iterates converge to samples from p(θ) ∝ exp(−L(θ)), which is the Bayesian posterior when L is chosen as the negative log-posterior, rather than collapsing to a mode.

Welling and Teh made another elegant observation: as the step size decreases, the discretization error vanishes and the Metropolis-Hastings acceptance probability approaches one, so you can skip the acceptance step entirely. No accept/reject ratio, no second forward pass to evaluate the proposal. The algorithm is as cheap as SGD per iteration, but produces approximate posterior samples.
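For contrast, here is a sketch of the accept/reject step that exact Langevin samplers such as MALA perform and that SGLD omits. It assumes full-data L and grad_L functions, which is exactly the per-step cost SGLD avoids:

    import numpy as np

    rng = np.random.default_rng(0)

    def log_q(x_to, x_from, grad_L, eta):
        # Log-density (up to a constant) of the Langevin proposal
        # N(x_from − η·∇L(x_from), 2η·I) evaluated at x_to.
        mean = x_from - eta * grad_L(x_from)
        return -np.sum((x_to - mean) ** 2) / (4.0 * eta)

    def mala_step(theta, L, grad_L, eta):
        # Propose a Langevin step, then correct it with a
        # Metropolis-Hastings test against the target exp(−L).
        prop = (theta - eta * grad_L(theta)
                + np.sqrt(2.0 * eta) * rng.standard_normal(theta.shape))
        log_alpha = (L(theta) - L(prop)
                     + log_q(theta, prop, grad_L, eta)
                     - log_q(prop, theta, grad_L, eta))
        return prop if np.log(rng.random()) < log_alpha else theta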

Why This Matters

SGLD and its descendants (SGHMC, preconditioned variants, cyclical schedules) sit at the intersection of optimization and Bayesian inference. They give you:

  • Uncertainty estimates from the spread of sampled parameters, useful for active learning, calibration, and safety-critical systems.
  • Better exploration of flat, wide minima — which tend to generalize better than sharp ones.
  • A principled framework for tempering: raise the “temperature” to explore more, lower it to exploit. The Gibbs distribution makes this knob explicit (see the sketch after this list).
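As noted in the last bullet, the temperature knob is also a one-line change: scaling the injected noise by √T targets the tempered distribution p(θ) ∝ exp(−L(θ)/T). A minimal sketch, reusing the hypothetical grad_minibatch helper:

    import numpy as np

    rng = np.random.default_rng(0)

    def tempered_sgld_step(theta, grad_minibatch, batch, lr, T=1.0):
        # T > 1 flattens the landscape (more exploration);
        # T < 1 sharpens it (more exploitation); T = 1 recovers SGLD.
        noise = np.sqrt(2.0 * lr * T) * rng.standard_normal(theta.shape)
        return theta - lr * grad_minibatch(theta, batch) + noise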

The deeper lesson is that noise in optimization is not just tolerable — it’s informative. The difference between SGD and SGLD is a single line of code. But that line encodes a fundamentally different relationship with the loss landscape: not “find the bottom” but “map the terrain.”


References:

  • Welling, M. & Teh, Y.W. (2011). Bayesian Learning via Stochastic Gradient Langevin Dynamics. ICML 2011.
  • Chen, T., Fox, E., & Guestrin, C. (2014). Stochastic Gradient Hamiltonian Monte Carlo. ICML 2014.
  • Zhang, R., Li, C., Zhang, J., Chen, C., & Wilson, A.G. (2020). Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. ICLR 2020.