
SDE vs ODE: Mathematical Foundations of Score-based Diffusion

Stochastic vs Deterministic. A deep dive into Score-based SDEs and Probability Flow ODEs, the theoretical foundations of DDPM and DDIM.


TL;DR

  • SDE (Stochastic DE): Probabilistic paths with noise, theoretical basis of DDPM
  • ODE (Ordinary DE): Deterministic paths, basis of DDIM and Flow Matching
  • Probability Flow ODE: An ODE with the same marginal distribution as SDE
  • Key Difference: SDE = more diversity, less speed; ODE = less diversity, more speed

1. Why Differential Equations?

The Essence of Diffusion

Diffusion models are transformations between two distributions:

  • Forward: Data $p_{\text{data}}$ → Noise $\mathcal{N}(0, I)$
  • Reverse: Noise $\mathcal{N}(0, I)$ → Data $p_{\text{data}}$

Modeling this transformation in continuous time gives us differential equations.

Discrete vs Continuous

DDPM (discrete):

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z$$

Continuous-time SDE:

$$dx = f(x, t)\,dt + g(t)\,dw$$

The continuous-time view is more flexible and enables various sampler designs.

2. Forward SDE: From Data to Noise

Variance Preserving SDE (VP-SDE)

The continuous SDE corresponding to DDPM:

$$dx = -\frac{1}{2}\beta(t)x \, dt + \sqrt{\beta(t)} \, dw$$

Where:

  • $\beta(t)$: noise schedule (noise intensity over time)
  • $dw$: Wiener process (Brownian motion)
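As a quick numerical sanity check, the forward VP-SDE can be simulated with Euler-Maruyama: starting from unit variance, the variance stays close to 1 throughout. The linear `beta_min`/`beta_max` schedule below is an illustrative assumption, not something fixed by the SDE itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t, beta_min=0.1, beta_max=20.0):
    # Illustrative linear noise schedule (assumed values)
    return beta_min + t * (beta_max - beta_min)

# Euler-Maruyama simulation of dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw
n_particles, n_steps = 100_000, 1000
dt = 1.0 / n_steps
x = rng.standard_normal(n_particles)  # start at unit variance
for i in range(n_steps):
    t = i * dt
    x += -0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(n_particles)

print(round(float(x.var()), 2))  # stays close to 1: variance preserving
```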

Variance Exploding SDE (VE-SDE)

The SDE corresponding to SMLD/NCSN:

$$dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} \, dw$$

Where $\sigma(t)$ is the noise scale increasing over time.

Solution of Forward Process

For VP-SDE, the distribution at time $t$ is:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Where $\bar{\alpha}_t = e^{-\int_0^t \beta(s)ds}$

This exactly matches DDPM's forward process!
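A minimal sketch of sampling from this closed-form marginal, again assuming a linear $\beta(t)$ schedule (the `perturb` helper and the schedule constants are illustrative, not from the original papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    # bar_alpha(t) = exp(-int_0^t beta(s) ds) for an assumed linear schedule
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def perturb(x0, t):
    # One-shot forward jump: x_t = sqrt(bar_a) x0 + sqrt(1 - bar_a) eps
    a = alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = np.full(100_000, 2.0)   # toy data: a point mass at 2.0
xt = perturb(x0, t=0.5)
a = alpha_bar(0.5)
# Empirical mean/std match the closed-form N(sqrt(a) * 2, (1 - a) I)
print(round(float(xt.mean()), 2), round(float(xt.std()), 2))
```

No step-by-step simulation is needed: any noise level is reachable in a single jump, which is what makes diffusion training efficient.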

3. Reverse SDE: From Noise to Data

Anderson's Theorem

A remarkable fact (Anderson, 1982): the time reversal of the forward SDE is itself an SDE!

Forward:

$$dx = f(x, t)\,dt + g(t)\,dw$$

Reverse:

$$dx = \left[f(x, t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$

Where:

  • $\nabla_x \log p_t(x)$: Score function (the key!)
  • $d\bar{w}$: Reverse-time Wiener process

What is the Score Function?

For a noised sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, the conditional score is:

$$\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$

The score is the "gradient pointing toward data from current position."

Relationship between DDPM's noise prediction $\epsilon_\theta$ and score:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
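This identity is easy to verify numerically: the score of the conditional Gaussian and the rescaled noise agree exactly (a toy check with an arbitrarily chosen `a_bar`):

```python
import numpy as np

rng = np.random.default_rng(1)

a_bar = 0.3                              # an example value of bar_alpha_t
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
xt = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps

# Score of the conditional Gaussian p_t(x | x0) = N(sqrt(a) x0, (1 - a) I)
score_analytic = -(xt - np.sqrt(a_bar) * x0) / (1 - a_bar)

# Score recovered from the noise via s = -eps / sqrt(1 - bar_alpha_t)
score_from_eps = -eps / np.sqrt(1 - a_bar)

print(np.allclose(score_analytic, score_from_eps))  # True
```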

4. Probability Flow ODE

The Key Discovery

A crucial finding by Song et al. (2021):

There exists a **deterministic ODE** with the **same marginal distribution** $p_t(x)$ as the SDE!

$$dx = \left[f(x, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)\right]dt$$

The noise term $g(t)dw$ disappears, only the drift is modified.

Probability Flow ODE for VP-SDE

$$dx = \left[-\frac{1}{2}\beta(t)x - \frac{1}{2}\beta(t)\nabla_x \log p_t(x)\right]dt$$

Substituting score with $\epsilon_\theta$:

$$dx = \left[-\frac{1}{2}\beta(t)x + \frac{\beta(t)}{2\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x, t)\right]dt$$

This is identical to DDIM with $\eta=0$!
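A sketch of this drift as a NumPy function, with a sanity check: if the data distribution is already $\mathcal{N}(0, I)$, every marginal $p_t$ is also $\mathcal{N}(0, I)$, the optimal noise prediction is $\epsilon^*(x, t) = \sqrt{1-\bar{\alpha}_t}\,x$, and the probability flow drift vanishes. The `pf_ode_drift` helper and the schedule are illustrative assumptions:

```python
import numpy as np

def beta(t, beta_min=0.1, beta_max=20.0):
    # Assumed linear schedule
    return beta_min + t * (beta_max - beta_min)

def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def pf_ode_drift(x, t, eps_model):
    # dx/dt = -1/2 beta(t) x + beta(t) / (2 sqrt(1 - bar_a)) * eps_theta(x, t)
    b, a = beta(t), alpha_bar(t)
    return -0.5 * b * x + 0.5 * b / np.sqrt(1 - a) * eps_model(x, t)

# Optimal eps prediction when p_data = N(0, I): eps*(x, t) = sqrt(1 - bar_a) x
eps_opt = lambda x, t: np.sqrt(1 - alpha_bar(t)) * x
x = np.array([1.5, -0.3, 0.7])
print(np.allclose(pf_ode_drift(x, 0.5, eps_opt), 0.0))  # True
```

Zero drift means the ODE leaves $\mathcal{N}(0, I)$ untouched, exactly as the "same marginals" property demands.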

5. SDE vs ODE: Characteristic Comparison

Sampling Paths

| Property | SDE (Reverse) | ODE (Probability Flow) |
|---|---|---|
| Path | Stochastic (different each time) | Deterministic (always the same) |
| Noise | Added at each step | None |
| Diversity | High | Low (same $z$ → same $x$) |
| Speed | Slow (small steps needed) | Fast (large steps possible) |

Mathematical Relationship

```text
           SDE                          ODE
      ┌───────────┐              ┌─────────────┐
 z ~  │  Reverse  │         z ~  │ Probability │
N(0,I)│    SDE    │        N(0,I)│  Flow ODE   │
      └─────┬─────┘              └──────┬──────┘
            │                           │
            ▼                           ▼
        x ~ p_data                  x ~ p_data

      Same marginal distribution, different paths!
```

DDPM vs DDIM

| Model | Basis | Characteristics |
|---|---|---|
| DDPM | Reverse SDE | $\eta=1$, stochastic |
| DDIM | Probability Flow ODE | $\eta=0$, deterministic |
| DDIM (general) | Mixture of both | $0 \leq \eta \leq 1$ |

DDIM's $\eta$ parameter:

  • $\eta = 0$: Pure ODE (deterministic)
  • $\eta = 1$: Pure SDE (same as DDPM)
  • $0 < \eta < 1$: Interpolation
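The generalized DDIM update with the $\eta$ knob can be sketched as follows (the `ddim_step` helper is an illustrative NumPy transcription of the update in Song et al., 2021, not library code):

```python
import numpy as np

rng = np.random.default_rng(0)

def ddim_step(xt, eps, a_t, a_prev, eta=0.0):
    # Generalized DDIM update: eta = 0 gives the deterministic ODE step,
    # eta = 1 recovers DDPM-like stochastic sampling
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    x0_pred = (xt - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)   # predicted x0
    direction = np.sqrt(1 - a_prev - sigma**2) * eps         # direction to x_t
    noise = sigma * rng.standard_normal(xt.shape)
    return np.sqrt(a_prev) * x0_pred + direction + noise

xt = rng.standard_normal(4)
eps = rng.standard_normal(4)
# eta = 0 is deterministic: repeated calls agree exactly
s1 = ddim_step(xt, eps, a_t=0.5, a_prev=0.8, eta=0.0)
s2 = ddim_step(xt, eps, a_t=0.5, a_prev=0.8, eta=0.0)
print(np.allclose(s1, s2))  # True
```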

6. Score Matching: Learning the Score

Denoising Score Matching

Learning the score function directly is difficult. Instead, we use Denoising Score Matching:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

This is identical to DDPM's training objective!
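The objective can be estimated in a few lines of NumPy. As a sanity check, a model that always predicts zero noise should score $\mathbb{E}\|\epsilon\|^2 = 1$ per dimension (the schedule and the `dsm_loss` helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_bar(t):
    # Assumed linear-beta schedule with beta in [0.1, 20]
    return np.exp(-(0.1 * t + 0.5 * 19.9 * t**2))

def dsm_loss(eps_model, x0, n=200_000):
    # Monte Carlo estimate of L = E_{t, x0, eps} || eps - eps_theta(x_t, t) ||^2
    t = rng.uniform(1e-3, 1.0, size=n)
    a = alpha_bar(t)
    eps = rng.standard_normal(n)
    xt = np.sqrt(a) * x0 + np.sqrt(1 - a) * eps
    return float(np.mean((eps - eps_model(xt, t)) ** 2))

zero_model = lambda xt, t: np.zeros_like(xt)
loss = dsm_loss(zero_model, x0=rng.standard_normal(200_000))
print(round(loss, 1))  # ~1.0
```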

Equivalence of Score and Noise Prediction

$$\text{Score: } s_\theta(x, t) \approx \nabla_x \log p_t(x)$$

$$\text{Noise: } \epsilon_\theta(x, t) \approx \epsilon$$

Relationship:

$$s_\theta = -\frac{\epsilon_\theta}{\sigma_t}, \qquad \sigma_t = \sqrt{1-\bar{\alpha}_t}$$

Thus noise prediction and score prediction are equivalent up to a scale factor.

7. Numerical Solvers

SDE Solvers

Euler-Maruyama (most basic):

$$x_{t-\Delta t} = x_t + f(x_t, t)\Delta t + g(t)\sqrt{\Delta t} \cdot z$$

Predictor-Corrector (Song et al.):

  1. Predictor: Euler step
  2. Corrector: Refine with Langevin dynamics

ODE Solvers

Euler (1st order):

$$x_{t-\Delta t} = x_t + f(x_t, t)\Delta t$$

Heun (2nd order):

$$\tilde{x} = x_t + f(x_t, t)\Delta t$$

$$x_{t-\Delta t} = x_t + \frac{1}{2}\left[f(x_t, t) + f(\tilde{x}, t-\Delta t)\right]\Delta t$$
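The accuracy gap between the two is easy to see on a toy ODE (not the diffusion ODE itself; `euler_step` and `heun_step` below are generic illustrative helpers, written with increasing time for simplicity):

```python
import numpy as np

def euler_step(x, t, dt, f):
    return x + f(x, t) * dt

def heun_step(x, t, dt, f):
    # Predictor (Euler step), then trapezoidal correction
    x_pred = x + f(x, t) * dt
    return x + 0.5 * (f(x, t) + f(x_pred, t + dt)) * dt

# Toy ODE dx/dt = -x with exact solution x(1) = e^{-1}
f = lambda x, t: -x
x_euler = x_heun = 1.0
dt, n_steps = 0.1, 10
for i in range(n_steps):
    t = i * dt
    x_euler = euler_step(x_euler, t, dt, f)
    x_heun = heun_step(x_heun, t, dt, f)

exact = np.exp(-1.0)
print(abs(x_heun - exact) < abs(x_euler - exact))  # True
```

With the same step count, the second-order Heun step lands much closer to the exact solution, which is why higher-order solvers allow far fewer sampling steps.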

DPM-Solver (specialized higher-order solver):

  • Exploits structure of diffusion ODE
  • High quality with 10-20 steps

Solver Comparison

| Solver | Order | Steps | Characteristics |
|---|---|---|---|
| Euler-Maruyama | 1 | 1000+ | Basic SDE |
| DDPM | 1 | 1000 | Discrete SDE |
| DDIM | 1 | 50-100 | ODE |
| DPM-Solver | 2-3 | 10-25 | Higher-order ODE |
| DPM-Solver++ | 2-3 | 10-20 | Improved version |

8. Connection to Flow Matching

Conditional Flow Matching

Flow Matching is also ODE-based:

$$dx = v_\theta(x, t)\,dt$$

Differences:

  • Diffusion ODE: Drift derived from score
  • Flow Matching: Directly learn velocity
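A tiny numerical illustration of the straight-path construction used in conditional flow matching (the samples and variable names below are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Straight (optimal-transport style) conditional path:
#   x_t = (1 - t) x0 + t x1, with target velocity v = x1 - x0
x0 = rng.standard_normal(1000)          # noise samples
x1 = rng.standard_normal(1000) + 3.0    # toy "data" samples
t = 0.25
xt = (1 - t) * x0 + t * x1
v_target = x1 - x0                      # what v_theta(x_t, t) regresses onto

# Integrating dx = v dt from t to 1 along the straight path recovers x1 exactly
print(np.allclose(xt + (1 - t) * v_target, x1))  # True
```

Because the target path is a straight line, large integration steps stay on it, which is the intuition behind the few-step samplers built on flow matching.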

Same Result, Different Paths

Both transform $p_{\text{noise}} \to p_{\text{data}}$ but:

| Property | Diffusion ODE | Flow Matching |
|---|---|---|
| Path | Curved (score-based) | Straight (optimal transport) |
| Derivation | Derived from SDE | Directly defined |
| Training target | $\epsilon$ prediction | $v$ prediction |

9. Practical Selection Guide

When to Use SDE?

  • When diversity is important
  • When sufficient compute is available
  • When stochastic refinement is needed (e.g., inpainting)

When to Use ODE?

  • When speed is important
  • When deterministic results are needed (reproducibility)
  • When latent interpolation is needed

Choices of Modern Models

| Model | Choice | Reason |
|---|---|---|
| DALL-E 2 | SDE (DDPM) | Quality priority |
| Stable Diffusion | ODE (DDIM/DPM) | Speed-quality balance |
| SD3/FLUX | Flow ODE | Fast generation with straight paths |

10. Advanced Topics

Continuous Normalizing Flows (CNF)

From the ODE perspective, diffusion is a type of Normalizing Flow:

$$\log p_0(x_0) = \log p_T(x_T) + \int_0^T \text{div}\big(f(x_t, t)\big)\, dt$$

This enables likelihood computation as well.

Optimal Transport Perspective

Probability Flow ODE connects to Optimal Transport:

  • "Shortest path" between two distributions
  • Related to Wasserstein distance

Guidance in SDE vs ODE

Classifier-Free Guidance applies to both SDE and ODE:

$$\tilde{s}(x, t) = s(x, t) + w \cdot \left(s(x, t \mid c) - s(x, t)\right)$$
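The guidance rule itself is a one-liner; the `guided_score` helper below is an illustrative sketch:

```python
import numpy as np

def guided_score(s_uncond, s_cond, w):
    # Classifier-free guidance: extrapolate from the unconditional score
    # toward the conditional one with guidance weight w
    return s_uncond + w * (s_cond - s_uncond)

s_u = np.array([0.1, -0.2])
s_c = np.array([0.5, 0.0])
# w = 0 ignores the condition; w = 1 is the plain conditional score;
# w > 1 over-emphasizes the condition
print(np.allclose(guided_score(s_u, s_c, 0.0), s_u),
      np.allclose(guided_score(s_u, s_c, 1.0), s_c))  # True True
```

The same combination applies whether the result is fed into a reverse-SDE step or a probability-flow ODE step, since both consume the (modified) score.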

Conclusion

| Concept | SDE | ODE |
|---|---|---|
| Formula | $dx = f\,dt + g\,dw$ | $dx = f\,dt$ |
| Representative models | DDPM | DDIM, Flow Matching |
| Path | Stochastic | Deterministic |
| Advantages | Diversity, theoretical foundation | Speed, reproducibility |
| Disadvantages | Slow | Reduced diversity |

Key Insight: SDE and ODE solve the same problem in different ways. Thanks to Probability Flow ODE, we can maintain the theoretical advantages of SDE while gaining the practical benefits of ODE.

References

  1. Song, Y., et al. "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)
  2. Ho, J., et al. "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
  3. Song, J., et al. "Denoising Diffusion Implicit Models" (ICLR 2021)
  4. Lipman, Y., et al. "Flow Matching for Generative Modeling" (ICLR 2023)
  5. Lu, C., et al. "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022)
