6.5 Estimating Gradients

Generative Modeling by Estimating Gradients of the Data Distribution

Created Date: 2025-05-24

This blog post focuses on a promising new direction for generative modeling. We can learn score functions (gradients of log probability density functions) on a large number of noise-perturbed data distributions, then generate samples with Langevin-type sampling.
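The Langevin-type sampling mentioned above can be sketched in a few lines. This is a minimal illustration, not the post's method: the "learned" score model is replaced by the analytically known score of a standard normal, and the step size and step count are arbitrary choices.

```python
import numpy as np

def score(x):
    # Stand-in for a learned score model: the score of a standard normal
    # is grad_x log p(x) = -x. (Hypothetical toy target, for illustration.)
    return -x

def langevin_sample(score_fn, x0, step_size=0.01, n_steps=1000, seed=None):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2 eps) * z."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2 * step_size) * z
    return x

# Run many independent chains; their endpoints approximate p(x).
samples = np.stack([langevin_sample(score, np.zeros(1), seed=i) for i in range(500)])
print(samples.mean(), samples.std())  # should be close to 0 and 1
```

Note that sampling only ever queries `score_fn`; the density itself is never evaluated, which is exactly why a score model suffices for generation.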

The resulting generative models, often called score-based generative models, have several important advantages over existing model families: GAN-level sample quality without adversarial training, flexible model architectures, exact log-likelihood computation, and inverse problem solving without re-training models.


In this blog post, we will show you in more detail the intuition, basic concepts, and potential applications of score-based generative models.

6.5.1 Introduction

Existing generative modeling techniques can largely be grouped into two categories based on how they represent probability distributions.

  1. likelihood-based models - which directly learn the distribution’s probability density (or mass) function via (approximate) maximum likelihood. Typical likelihood-based models include autoregressive models, normalizing flow models, energy-based models (EBMs), and variational auto-encoders (VAEs).

  2. implicit generative models - where the probability distribution is implicitly represented by a model of its sampling process. The most prominent example is generative adversarial networks (GANs), where new samples from the data distribution are synthesized by transforming a random Gaussian vector with a neural network.
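The second category can be illustrated with a toy implicit model. Here a fixed affine map stands in for a trained neural network (the weights below are made up for illustration); the point is that the model's distribution is defined only through its sampling process, never through an explicit density.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "generator" parameters standing in for a trained network.
W = np.array([[2.0, 0.0],
              [0.3, 0.5]])
b = np.array([1.0, -1.0])

def generator(z):
    # Transform a random Gaussian vector into a sample. The induced
    # distribution over x is represented implicitly by this procedure.
    return z @ W.T + b

z = rng.standard_normal((1000, 2))  # latent Gaussian vectors
x = generator(z)                    # samples from the implicit distribution
print(x.mean(axis=0))               # close to b, since E[z] = 0
```

A real GAN replaces the affine map with a deep network and fits it adversarially; evaluating the probability of a given sample under such a model is generally intractable.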

Likelihood-based models and implicit generative models, however, both have significant limitations. Likelihood-based models either require strong restrictions on the model architecture to ensure a tractable normalizing constant for likelihood computation, or must rely on surrogate objectives to approximate maximum likelihood training. Implicit generative models, on the other hand, often require adversarial training, which is notoriously unstable and can lead to mode collapse.

In this blog post, I will introduce another way to represent probability distributions that may circumvent several of these limitations. The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function. Such score-based models are not required to have a tractable normalizing constant, and can be directly learned by score matching.
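The key property, that the score needs no normalizing constant, is easy to verify numerically. Below is a sketch with a made-up unnormalized density p(x) ∝ exp(-E(x)) whose constant Z we never compute: since log p(x) = -E(x) - log Z and log Z does not depend on x, the score is just -dE/dx.

```python
import numpy as np

def energy(x):
    # Hypothetical double-well energy; Z = ∫ exp(-E(x)) dx is left uncomputed.
    return 0.25 * x**4 - 0.5 * x**2

def score(x):
    # score(x) = d/dx log p(x) = -dE/dx; the log Z term differentiates to zero.
    return -(x**3 - x)

# Finite-difference check of d/dx log p(x): the unknown log Z cancels
# in the difference, so only the energy is needed.
x0, h = 0.7, 1e-5
fd = (-energy(x0 + h) + energy(x0 - h)) / (2 * h)
print(fd, score(x0))  # both approximately 0.357
```

Score matching trains a neural network s_theta(x) to approximate this quantity directly from data, sidestepping Z entirely.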