6.3 Mathematical Foundation

Mathematical Foundation of Diffusion Generative Models

Created Date: 2025-05-24

In this tutorial, we cover the mathematical foundation of diffusion generative models. We aim to give you a solid understanding of:

  • The score function as the gradient of the log data density.

  • How the score function enables the reversal of the forward diffusion process.

  • Learning the score function by denoising score matching (and its equivalence to explicit score matching).

  • Approximating the score function with a neural network.

  • Sampling from diffusion models.

with concrete examples on low-dimensional (2D) data and applications to high-dimensional data (point clouds or images).

6.3.1 Mathematical Foundation

This session is on diffusion generative models, an approach to generating samples from distributions; it differs substantially from other approaches, like generative adversarial networks (GANs) and variational autoencoders (VAEs). Notably, it was one of the core ingredients of OpenAI's recent DALL·E 2 system for converting natural language descriptions to detailed images.

Diffusion generative models use the following idea. Samples from some distribution (e.g. of images) are gradually corrupted with more and more noise until they become unrecognizable static—this is the 'diffusion'. We then learn to reverse this mapping, so that we can turn unrecognizable static into a sample from our distribution of interest. Because it's easy to sample unrecognizable static, the mapping allows us to easily sample from our target distribution.

We will focus on the absolute basics of diffusion generative modeling, and restrict ourselves mostly to toy examples involving analytically tractable distributions.

6.3.1.1 Motivation

Generative modeling is an area of statistics and machine learning that involves approximating the probability distribution from which a given set of data is sampled. By learning such a distribution (e.g. of pictures of faces), we can sample new data from it—e.g. a new image of a face.

There are different approaches to generative modeling, including using generative adversarial networks (GANs) and variational autoencoders (VAEs). One somewhat newer approach involves adding a large amount of noise to samples, and then learning to undo the noise.

In this way, we can learn to convert pure noise into samples from some distribution we're interested in. Because it's easy to generate all sorts of samples of pure noise, and each gets transformed by our noise-undoing-mapping to samples from our target distribution, we have a generative model.

There are various appealing features to this approach to generative modeling. One notable one is that we do not have to directly learn a probability distribution, but can instead learn a related function sometimes called the score. This is the gradient of the log of the probability density function.

\(\mathbf{s}(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x})\)

This allows us to avoid having to learn a normalization constant for our probability density, which is an often troublesome task.
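To make this concrete, here is a minimal sketch (in Python with numpy; the Gaussian parameters are purely illustrative) that compares the analytic score of a 1D Gaussian, \(-(x - \mu)/\sigma^2\), against a numerical gradient of \(\log p\):

```python
import numpy as np

MU, SIGMA = 1.0, 2.0   # parameters of an illustrative 1D Gaussian

def log_p(x):
    # Log-density of N(MU, SIGMA^2). The normalization constant is included
    # here, but note that it drops out of the gradient below.
    return -0.5 * np.log(2 * np.pi * SIGMA**2) - (x - MU)**2 / (2 * SIGMA**2)

def score(x):
    # Analytic score: d/dx log p(x) = -(x - MU) / SIGMA^2
    return -(x - MU) / SIGMA**2

# Check against a central finite difference of log p
x, h = 0.5, 1e-5
numeric = (log_p(x + h) - log_p(x - h)) / (2 * h)
print(numeric, score(x))
```

Because the normalization constant only shifts \(\log p\) by an additive constant, it vanishes under the gradient; any unnormalized density would yield the same score.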

6.3.1.2 Forward Diffusion

How do we add noise to distribution samples 'properly'? Although views on this have varied, recent work has unified different approaches. By and large, various ideas about how to do this amount to subjecting a given distribution sample to a kind of diffusion process. More generally, we can imagine applying some kind of stochastic process (defined by a stochastic differential equation, or SDE) to a distribution sample.

Basics of diffusion

Let's review the basics of diffusion. A one-dimensional diffusion process (on the real line) can be defined by the SDE

\(\dot{x} = \sigma \cdot \eta(t)\)

where \(\sigma > 0\) is a constant, and \(\eta(t)\) denotes a Gaussian white noise term.

What does this even mean? We can make the above expression concrete by defining it to be the small \(\Delta t\) limit of the discrete process for which:

\(x(t + \Delta t) = x(t) + \sigma \cdot \sqrt{\Delta t} \cdot r\)

where \(r \sim \mathcal{N}(0, 1)\) is a sample from a standard normal distribution. This is an example of an Euler-Maruyama time step.

The above description of diffusion is useful for doing simulations. Given a number \(x\), we can use the above update rule to subject it to one time step of diffusion. But a complementary view describes how the probability density \(p(x, t)\), the probability of the system being at position \(x\) at time \(t\), changes over time.

In particular, \(p(x, t)\) changes according to the well-known diffusion equation:

\(\frac{\partial p(x, t)}{\partial t} = \frac{\sigma^2}{2} \cdot \frac{\partial^2 p(x, t)}{\partial x^2}\)

You may be used to seeing this equation with a diffusion constant \(D\) instead of a parameter \(\sigma\). The conversion between them is that \(D = \frac{\sigma^2}{2}\).

The solution to the diffusion equation, given a delta function initial condition \(p(x, 0) = \delta(x - x_0)\) (that is, given that we are initially certain that the system is at location \(x_0\)) is the well-known heat kernel:

\(p(x, t | x_0, 0) = \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left\{ - \frac{(x - x_0)^2}{2 \sigma^2 t} \right\}\)

Given an arbitrary initial condition \(p(x, 0) = p_0(x)\), the general solution of the diffusion equation is then:

\(p(x, t) = \int_{-\infty}^{\infty} p(x, t | x_0, 0) \cdot p_0(x_0) \cdot dx_0\)
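As a quick numerical check of this integral (a numpy sketch; all parameter values are illustrative), we can take a Gaussian initial condition \(p_0 = \mathcal{N}(m_0, s_0^2)\), evaluate the convolution with the heat kernel by quadrature, and compare it with the closed form: convolving two Gaussians adds their variances, so \(p(x, t) = \mathcal{N}(m_0, s_0^2 + \sigma^2 t)\).

```python
import numpy as np

sigma, t = 1.0, 0.5
m0, s0 = -1.0, 0.8          # Gaussian initial condition p0 = N(m0, s0^2)

def gauss(x, m, var):
    return np.exp(-(x - m)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x0 = np.linspace(-10, 10, 4001)   # quadrature grid over x0
x = 0.3                           # evaluation point

# p(x, t) = integral of heat_kernel(x, t | x0) * p0(x0) dx0
integrand = gauss(x, x0, sigma**2 * t) * gauss(x0, m0, s0**2)
numeric = np.trapz(integrand, x0)

# Closed form: Gaussian with the variances added
exact = gauss(x, m0, s0**2 + sigma**2 * t)
print(numeric, exact)
```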

Basics of stochastic differential equations (SDEs)

Diffusion is just one particularly simple example of a stochastic process governed by an SDE. The most general one-dimensional SDE looks like

\(\dot{x} = f(x, t) + g(x, t) \cdot \eta(t)\)

Where \(f(x, t)\) and \(g(x, t)\) are functions, and \(\eta(t)\) is again a Gaussian white noise term. The function \(f(x, t)\) is the drift term, and describes how the average value of \(x(t)\) evolves with time. The term involving \(g(x, t)\) is the noise or diffusion term, and describes stochastic fluctuations in the value of \(x(t)\).

What does this even mean? Unlike in the case of diffusion, for which \(g(x, t)\) is a constant, there are additional mathematical subtleties associated with making the above expression well-defined. For our purposes, we can choose the convention known as the Ito interpretation, which defines the above expression to be the small \(\Delta t\) limit of the discrete process:

\(x(t + \Delta t) = x(t) + f(x(t), t) \cdot \Delta t + g(x(t), t) \cdot \sqrt{\Delta t} \cdot r\)

Where \(r \sim \mathcal{N}(0, 1)\) is a sample from a standard normal distribution. This is a more general Euler-Maruyama time step (although it is still not the most general form of it, given that we are only working in one dimension for now).
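The update rule above can be packaged as a small helper (a numpy sketch; the drift and noise callables are placeholders you supply):

```python
import numpy as np

def em_step(x, t, f, g, dt, rng):
    # One Euler-Maruyama step of dx = f(x, t) dt + g(x, t) dW,
    # applied elementwise to an array of sample paths x.
    r = rng.standard_normal(np.shape(x))      # r ~ N(0, 1)
    return x + f(x, t) * dt + g(x, t) * np.sqrt(dt) * r

# Pure diffusion (f = 0, g = sigma = 1) as a sanity check: after one step
# of size dt from x = 0, the samples should have variance sigma^2 * dt.
rng = np.random.default_rng(0)
x = em_step(np.zeros(100_000), 0.0,
            f=lambda x, t: 0.0, g=lambda x, t: 1.0,
            dt=0.01, rng=rng)
print(x.var())
```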

The analogue to the diffusion equation is the Fokker-Planck equation (FPE), which describes how \(p(x, t)\) evolves in time for a system described by an SDE. For the one-dimensional Ito-interpreted system written above, it is given by:

\(\frac{\partial p(x, t)}{\partial t} = - \frac{\partial}{\partial x} \left[ f(x, t) \cdot p(x, t) \right] + \frac{1}{2} \cdot \frac{\partial^2}{\partial x^2} \left[ g^2(x, t) \cdot p(x, t) \right]\)

Most FPEs are difficult or impossible to solve analytically, so we may not be able to get exact solutions for \(p(x, t)\). Notable cases where we can solve for \(p(x, t)\) exactly include diffusion and Ornstein-Uhlenbeck-like processes (OU processes). OU processes are mean-reverting stochastic processes; a one-dimensional example is the SDE:

\(\dot{x} = \frac{1}{\tau} \left[ \mu - x \right] + \sigma \cdot \sqrt{\frac{2}{\tau}} \cdot \eta(t)\)

Its solution (for an initial condition \(p(x, 0) = \delta(x - x_0)\)) is:

\(p(x, t | x_0, 0) = \frac{1}{\sqrt{2 \pi {s(t)}^2}} \exp\left\{ - \frac{[ x - \mu(t) ]^2}{2 {s(t)}^2} \right\}\)

Where

\(\mu(t) = x_0 \cdot e^{-t/\tau} + \mu \cdot (1 - e^{-t/\tau})\)

\(s(t) = \sigma \cdot \sqrt{1 - e^{-2t/\tau}}\)

In the long-time limit \((t \rightarrow \infty)\), we have:

\(p(x, t | x_0, 0) \xrightarrow{t \to \infty} p_{ss}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left\{ - \frac{(x - \mu)^2}{2 \sigma^2} \right\}\)

i.e. the steady-state solution \(p_{ss}(x)\) is a Gaussian with mean \(\mu\) and variance \(\sigma^2\).

The general solution to this OU process (given an initial distribution \(p(x, 0) = p_0(x)\)) is:

\(p(x, t) = \int_{-\infty}^{\infty} p(x, t | x_0, 0) \cdot p_0(x_0) \cdot dx_0\)

One thing that makes OU processes somewhat different from a strict diffusion process is that their long-time behavior loses all memory of their initial condition. That is, as \(t\) increases, \(p(x, t)\) depends less and less on \(p_0(x)\).
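To check the mean reversion and the loss of memory numerically, here is a sketch (numpy, with illustrative parameter values) that simulates the OU process above with Euler-Maruyama steps out to \(t \gg \tau\), starting far from \(\mu\), and compares the samples against the steady state \(\mathcal{N}(\mu, \sigma^2)\):

```python
import numpy as np

rng = np.random.default_rng(1)
tau, mu, sigma = 0.5, 1.0, 0.7        # illustrative OU parameters
x0, T = -2.0, 3.5                     # T = 7 * tau, deep in the steady state
n_steps, n_samples = 2000, 40_000
dt = T / n_steps

x = np.full(n_samples, x0)            # delta-function initial condition
for _ in range(n_steps):
    drift = (mu - x) / tau
    noise = sigma * np.sqrt(2 / tau)
    x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal(n_samples)

# Regardless of x0, the samples should now look like N(mu, sigma^2)
print(f"mean = {x.mean():.3f}, std = {x.std():.3f}")
```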

Basic Example Code

Let's see what it might look like to simulate 1D diffusion, a particularly simple stochastic process. You can also play with the code yourself to simulate other SDEs, like an OU process.
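A minimal stand-alone sketch (parameter values are illustrative): simulate many independent walkers with Euler-Maruyama steps and compare the result with the heat kernel. Swapping in a nonzero drift \(f\) turns this into a simulator for other SDEs, such as an OU process.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T = 1.0, 1.0
n_steps, n_samples = 1000, 10_000
dt = T / n_steps

f = lambda x, t: 0.0      # drift; set to e.g. (mu - x) / tau for an OU process
g = lambda x, t: sigma    # noise amplitude

x = np.zeros(n_samples)   # all walkers start at the origin
for k in range(n_steps):
    t = k * dt
    x += f(x, t) * dt + g(x, t) * np.sqrt(dt) * rng.standard_normal(n_samples)

# For pure diffusion, the heat kernel predicts x(T) ~ N(0, sigma^2 * T)
print(f"mean = {x.mean():.3f}, variance = {x.var():.3f}")
```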

6.3.1.3 Reverse Diffusion

How do we 'undo' the noise that we added to our samples from the forward diffusion process? Although a bit counter to intuition—for example, we expect to observe cream and coffee to mix via diffusion, but not for them to spontaneously separate—it turns out that it is mathematically straightforward to define the time-reverse of a stochastic process governed by an SDE. The time-reversed process is itself a stochastic process governed by an SDE, whose probability density evolves in time according to a Fokker-Planck equation.

Motivating Example: Reversing 1D diffusion

Recall that 1D diffusion (for an amount of time \(T\)) can be described by the SDE:

\(\dot{x} = \sigma \cdot \eta(t)\)

Intuitively, diffusion causes things to spread out. We would expect the time-reverse of this process to cause things to come back together.

One stochastic process which concentrates things, rather than spreads them out, is an OU process. Might some kind of OU process be the time-reverse of 1D diffusion? It turns out that the OU process described by:

\(\dot{x} = \frac{x_0 - x}{T - t} + \sigma \cdot \eta(t)\)

is the time-reverse of a diffusion process that begins concentrated at the point \(x_0\), and spreads out for an amount of time \(T\). In other words, this process compresses a Gaussian until it becomes a delta function centered at \(x_0\) at time \(T\).

The corresponding transition density mirrors the forward one, with the elapsed time \(t\) replaced by the remaining time \(T - t\):

\(q(x, t | x_0, 0) = \frac{1}{\sqrt{2 \pi \sigma^2 (T - t)}} \exp\left\{ - \frac{(x - x_0)^2}{2 \sigma^2 (T - t)} \right\}\)

To avoid confusing these probabilities with those of the forward process, we denote them by \(q\) instead of \(p\).

Since the reverse process is also governed by an SDE, we can use Euler-Maruyama time steps to simulate it; here, we would have:

\(x(t + \Delta t) = x(t) + \frac{x_0 - x(t)}{T - t} \Delta t + \sigma \cdot \sqrt{\Delta t} \cdot r\)

Note that we're using the convention that time runs forward, from \(t = 0\) to \(t = T\), here. People also use the convention that time runs backward (from \(t = T\) to \(t = 0\)), in which case we would instead have:

\(x(t - \Delta t) = x(t) + \frac{x_0 - x(t)}{t} \Delta t + \sigma \cdot \sqrt{\Delta t} \cdot r\)

They are mathematically equivalent, so you can use either one when you're doing simulations.
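To see the reversal in action, here is a numpy sketch (illustrative values; time runs forward, per the first convention): start from samples of the spread-out endpoint distribution \(\mathcal{N}(x_0, \sigma^2 T)\) and step the reverse SDE toward \(t = T\). The drift coefficient diverges at \(t = T\) exactly, so we stop one step short.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, x0, T = 1.0, 2.0, 1.0          # illustrative values
n_steps, n_samples = 1000, 10_000
dt = T / n_steps

# Start from the end state of forward diffusion: x ~ N(x0, sigma^2 * T)
x = x0 + sigma * np.sqrt(T) * rng.standard_normal(n_samples)

# Step the reverse SDE dx = (x0 - x)/(T - t) dt + sigma dW, stopping one
# step short of t = T, where the drift coefficient diverges.
for k in range(n_steps - 1):
    t = k * dt
    x += (x0 - x) / (T - t) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_samples)

# The samples should have collapsed onto x0
print(f"mean = {x.mean():.3f}, std = {x.std():.3f}")
```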

6.3.1.4 Learning the Score Function

6.3.2 Day 1 Coding Tutorial

In this tutorial you will gain more intuition about score functions by examining the analytical score function of a general class of distributions: the Gaussian mixture model.

You will empirically validate that the exact score function enables the reversal of the diffusion process and recovers the original data distribution.
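As a preview of the analytical score, here is a numpy sketch for a 1D Gaussian mixture (the mixture parameters are purely illustrative, not the ones used in the notebook). The score of a mixture is a responsibility-weighted average of the component scores, \(\mathbf{s}(x) = \sum_i \gamma_i(x) \, (\mu_i - x)/\sigma_i^2\), which we verify against a numerical derivative of \(\log p\):

```python
import numpy as np

# Illustrative 2-component 1D Gaussian mixture
WEIGHTS = np.array([0.3, 0.7])
MEANS = np.array([-2.0, 1.5])
SIGMAS = np.array([0.5, 1.0])

def gmm_components(x):
    # Weighted component densities w_i * N(x; mu_i, sigma_i^2), shape (2, len(x))
    x = np.atleast_1d(x)
    return WEIGHTS[:, None] * np.exp(
        -(x - MEANS[:, None])**2 / (2 * SIGMAS[:, None]**2)
    ) / (np.sqrt(2 * np.pi) * SIGMAS[:, None])

def gmm_score(x):
    x = np.atleast_1d(x)
    comps = gmm_components(x)
    gamma = comps / comps.sum(axis=0)      # posterior responsibilities
    return (gamma * (MEANS[:, None] - x) / SIGMAS[:, None]**2).sum(axis=0)

# Check against a central finite difference of log p at a few points
xs = np.array([-3.0, -1.0, 0.0, 2.0])
h = 1e-5
numeric = (np.log(gmm_components(xs + h).sum(axis=0))
           - np.log(gmm_components(xs - h).sum(axis=0))) / (2 * h)
print(np.abs(numeric - gmm_score(xs)).max())
```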

6.3.3 Day 2 Coding Tutorial

In this notebook, you will build a toy neural network model to learn the score function in a few different ways: by supervised learning on the exact score, and by denoising score matching from data samples. You will empirically validate that both methods can approximate the score of the data and be used to recover the original data distribution via reverse diffusion.
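As a preview of the denoising-score-matching idea, here is a minimal numpy sketch with a deliberately tiny model (a linear score fit by least squares rather than a neural network; the data distribution and noise level are illustrative). Denoising score matching regresses the model onto \(\nabla_{\tilde{x}} \log q(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2\). For standard-normal data perturbed with noise of scale \(\sigma\), the noisy density is \(\mathcal{N}(0, 1 + \sigma^2)\), whose score is \(-x/(1 + \sigma^2)\), so the fitted slope should recover exactly that:

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise = 200_000, 0.5

x = rng.standard_normal(n)          # clean data ~ N(0, 1)
eps = rng.standard_normal(n)
x_noisy = x + noise * eps           # perturbed data

# DSM regression target: -(x_noisy - x) / noise^2 = -eps / noise
target = -eps / noise

# Fit a linear score model s(x) = a * x + b by least squares
A = np.stack([x_noisy, np.ones(n)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, target, rcond=None)

# The noisy density N(0, 1 + noise^2) has score -x / (1 + noise^2)
print(a, -1 / (1 + noise**2))
```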