ViBT: Vision Bridge Transformer and the Beginning of Noise-Free Generation (Paper Review)
An analysis of the core techniques and performance of ViBT, which transforms images and videos without passing through noise, using a Vision-to-Vision paradigm built on the Brownian Bridge.

ViBT: Vision Bridge Transformer and the Beginning of Noise-Free Generation
Introduction
"Why do we need to go through noise to edit an image?"
Existing Diffusion models always follow the Noise-to-Vision paradigm, even for conditional generation tasks. For image style transfer, editing, and Depth-to-Video alike, they first sample pure noise and then generate the result from it, feeding the source image in only as a condition.
But if you think about it, this seems strange. When the source and result are similar, why go through a noise state that destroys all information?
ViBT (Vision Bridge Transformer) starts from this question. Using a mathematical framework called Brownian Bridge, it models probabilistic paths that directly connect source to target. A Vision-to-Vision paradigm that transforms data to data without going through noise.
1. Core Problem: Why Go Through Noise?
1.1 Inefficiency of Existing Diffusion
Let's examine how existing Conditional Diffusion models work:
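In sketch form, the loop looks like the following (a minimal illustration written as a velocity-field integration for brevity; `v_theta` and `encode_cond` are hypothetical callables standing in for the denoising network and the condition encoder):

```python
import torch

def noise_to_vision(source, v_theta, encode_cond, steps=50):
    """Noise-to-Vision: integrate from pure noise back to data; the source
    image is only visible to the model through extra condition tokens."""
    cond = encode_cond(source)            # separate encoder forward pass
    x = torch.randn_like(source)          # start: all source info destroyed
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt                  # t goes 1 -> 0
        x = x - v_theta(x, t, cond) * dt  # Euler step; attention also pays for cond
    return x
```

Every step pays for the condition tokens, even though the sample itself starts with zero information about the source.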
Problems:
- Information Loss: Source image information is completely destroyed into noise and must be recovered through conditions.
- Non-intuitive Path: Even for style transfer with similar outputs, it goes through a completely different noise state.
- Inference Cost: A separate Condition Encoder is required, and its condition tokens inflate the attention computation at every step.
1.2 Need for a Vision-to-Vision Paradigm

ViBT's new perspective:
"If source and target are similar, why not just learn a path that directly connects them?"
This is the core idea of Bridge Models.
2. Brownian Bridge: Mathematical Foundation

2.1 What is Brownian Bridge?
Brownian Bridge is a stochastic process with fixed endpoints. While regular Brownian Motion is a "free random walk" with only a starting point, Brownian Bridge is a "constrained random walk" with both start and end points fixed.
Mathematical Definition:
Given source $x_0$ and target $x_1$, the intermediate state at time $t \in [0, 1]$ follows:

$$x_t = (1 - t)\,x_0 + t\,x_1 + \sqrt{t(1 - t)}\,\epsilon, \qquad \epsilon \sim N(0, I)$$

Key Properties:
- $t = 0$: Exactly $x_0$ (source)
- $t = 1$: Exactly $x_1$ (target)
- $t = 0.5$: Intermediate state with maximum variance ($t(1-t) = 1/4$)
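The marginal is easy to sample directly; here is a minimal PyTorch sketch of the definition above:

```python
import torch

def bridge_marginal(x0, x1, t):
    """Sample x_t from the Brownian Bridge pinned at x0 (t=0) and x1 (t=1):
    the mean interpolates linearly, the variance t*(1-t) peaks at t=0.5."""
    eps = torch.randn_like(x0)
    mean = (1.0 - t) * x0 + t * x1
    std = (t * (1.0 - t)) ** 0.5
    return mean + std * eps

# The endpoints are recovered exactly: the std is 0 at both t=0 and t=1.
x0, x1 = torch.zeros(4), torch.ones(4)
print(bridge_marginal(x0, x1, 0.0))   # exactly x0
print(bridge_marginal(x0, x1, 1.0))   # exactly x1
```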
2.2 Why is Bridge Effective?
Crucial difference from Diffusion:
| Aspect | Diffusion | Bridge |
|---|---|---|
| Start | Pure noise $z \sim N(0,I)$ | Source data $x_0$ |
| End | Target data $x_1$ | Target data $x_1$ |
| Information Flow | Noise → Data | Data → Data |
| Source Usage | Only as condition | Directly as path start |
Bridge models directly utilize source information as part of the path, making them more efficient for conditional generation.
3. ViBT's Technical Innovations
3.1 Problem: Instability in Large-Scale Training
Scaling Bridge models to 20B parameters causes serious problems.
Velocity Target Divergence:
The instantaneous velocity target of the Bridge is defined as:

$$v_t = \frac{x_1 - x_t}{1 - t} = (x_1 - x_0) - \sqrt{\frac{t}{1 - t}}\,\epsilon$$

As $t \to 1$, the denominator approaches 0, causing the velocity to diverge: the noise term blows up at an $O(1/\sqrt{1-t})$ rate, destabilizing the training loss.
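The blow-up is easy to verify numerically; a quick sanity check of the noise term's standard deviation $\sqrt{t/(1-t)}$ (plain Python, not the paper's code):

```python
# Std of the noise term in the velocity target v_t = (x1 - x0) - sqrt(t/(1-t)) * eps.
for t in [0.5, 0.9, 0.99, 0.999]:
    print(f"t={t:<6} velocity-noise std = {(t / (1 - t)) ** 0.5:.1f}")
# t=0.5    velocity-noise std = 1.0
# t=0.9    velocity-noise std = 3.0
# t=0.99   velocity-noise std = 9.9
# t=0.999  velocity-noise std = 31.6
```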
3.2 Solution: Stabilized Velocity Matching

ViBT's key contribution is introducing a normalization factor $\alpha_t$ that tracks the expected magnitude of the velocity target:

$$\alpha_t^2 = \frac{\|x_1 - x_0\|^2}{d} + \frac{t}{1 - t}$$

where $d$ is the latent dimension.
Stabilized Training Objective:

$$\mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_1,\,\epsilon}\left[\left\|\frac{v_\theta(x_t, t) - v_t}{\alpha_t}\right\|^2\right]$$

Effects:
- When $t$ is small: $\alpha_t$ is roughly constant, so the objective matches plain velocity matching (same as before)
- As $t \to 1$: $\alpha_t$ grows like $\sqrt{t/(1-t)}$ to cancel the diverging velocity
- Result: Balanced loss contributions across all timesteps
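A sketch of this objective in PyTorch, following the reconstruction above (the paper's exact form of $\alpha_t$ may differ; `v_theta` is a placeholder for the DiT velocity head):

```python
import torch

def stabilized_velocity_loss(v_theta, x0, x1):
    """Velocity matching with per-timestep normalization so that late
    timesteps (t -> 1) do not dominate the loss."""
    b = x0.shape[0]
    d = x0[0].numel()                                    # latent dimension
    t = torch.rand(b, device=x0.device).clamp(1e-4, 1 - 1e-4)
    tb = t.view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = (1 - tb) * x0 + tb * x1 + (tb * (1 - tb)).sqrt() * eps
    v_target = (x1 - x_t) / (1 - tb)                     # diverges as t -> 1
    # alpha_t ~ expected per-dim magnitude of v_target; grows like sqrt(t/(1-t))
    signal = ((x1 - x0) ** 2).flatten(1).sum(-1) / d
    alpha = (signal + t / (1 - t)).sqrt()
    ab = alpha.view(b, *([1] * (x0.dim() - 1)))
    return (((v_theta(x_t, t) - v_target) / ab) ** 2).mean()
```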
3.3 Variance-Corrected Sampling
Problems exist not only in training but also in inference.
Standard Euler-Maruyama Problem:
Standard Euler-Maruyama discretization injects noise of constant scale $\sqrt{\Delta t}$ at every step, ignoring the Brownian Bridge's variance characteristics. In a Bridge, variance should decrease as $t \to 1$ (the marginal variance is $t(1-t)$), but standard sampling doesn't reflect this.
ViBT's Corrected Sampling:
The key is multiplying the noise scale $s$ by $\sqrt{1 - t}$:

$$x_{t + \Delta t} = x_t + v_\theta(x_t, t)\,\Delta t + s\,\sqrt{(1 - t)\,\Delta t}\;\epsilon, \qquad \epsilon \sim N(0, I)$$

- Early ($t$ small): High stochasticity
- Late ($t \to 1$): Low stochasticity for smooth convergence
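A sketch of the corrected sampler under this discretization, with `s` the noise-scale hyperparameter studied in Section 6 (variable names are mine, not the paper's):

```python
import torch

@torch.no_grad()
def vibt_sample(v_theta, x0, steps=50, s=1.0):
    """Euler-Maruyama from source (t=0) to target (t=1); the injected noise
    is rescaled by sqrt(1 - t) so stochasticity dies out near the end."""
    x = x0.clone()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + v_theta(x, t) * dt                               # drift step
        x = x + s * ((1 - t) * dt) ** 0.5 * torch.randn_like(x)  # corrected noise
    return x
```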
4. Architecture and Training

4.1 Model Configuration
ViBT builds on the DiT (Diffusion Transformer) architecture:
Image Model (20B):
- Base: Qwen-Image-Editing
- Fine-tuning: LoRA (rank 128)
- Training: 20,000 iterations, 1 H100 GPU
Video Model (1.3B):
- Base: Wan 2.1
- Fine-tuning: Full parameter
- Training: 50,000 iterations, 4 H100 GPUs
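As a concrete illustration of the image model's adapter setup, a rank-128 LoRA configuration might look like this (a hypothetical sketch using HuggingFace `peft`; the target module names are an assumption, not from the paper):

```python
from peft import LoraConfig, get_peft_model

# Rank-128 LoRA over the DiT attention projections (target module names
# are an assumption; check the actual base model's layer naming).
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    lora_dropout=0.0,
)
# model = get_peft_model(base_dit_model, lora_config)  # base model loaded elsewhere
```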
4.2 Training Data
| Task | Data Scale | Source |
|---|---|---|
| Image Editing | ~6K pairs | Open Images + Qwen3-VL generated |
| Video Stylization | 10K videos | Ditto-1M subset |
| Depth-to-Video | ~1K videos | Wan 2.2 generated + Depth Anything V2 |
Remarkably, strong performance was achieved with very little data.
5. Experimental Results
5.1 Benchmark Performance

Image Editing (ImgEdit-Bench):
| Model | Average Score |
|---|---|
| InstructPix2Pix | 2.91 |
| FLUX Kontext | 3.71 |
| UniWorld | 3.20 |
| **ViBT** | 3.55 |
| **ViBT (s=0.5)** | **3.76** |
ViBT excels particularly in Object Addition (4.20) and Style Transfer (4.85).
Video Stylization:
| Model | CLIPIQA ↑ | MUSIQ ↑ |
|---|---|---|
| TokenFlow | 0.378 | 59.12 |
| InsV2V | 0.441 | 60.62 |
| RAVE | 0.413 | 62.53 |
| **ViBT** | **0.486** | **64.05** |
Depth-to-Video:
| Model | VBench ↑ | SSIM ↑ |
|---|---|---|
| ControlVideo | 0.48 | 0.312 |
| Control-A-Video | 0.56 | 0.369 |
| VideoComposer | 0.57 | 0.401 |
| **ViBT** | **0.71** | **0.429** |
5.2 Speed Comparison

One of ViBT's biggest advantages is inference speed:
| Resolution | Conditional DiT | ViBT | Speedup |
|---|---|---|---|
| Image (1024²) | 437ms | 192ms | **2.28×** |
| Video (720P, 10s) | 28,577ms | 7,097ms | **4.03×** |
Where the speedup comes from:
- No Condition Encoder needed
- No additional conditioning tokens
- About 50% token reduction
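A back-of-the-envelope check: self-attention cost is quadratic in sequence length, so removing condition tokens that roughly double the sequence cuts attention compute by about

$$\frac{(2n)^2}{n^2} = 4\times$$

which is consistent with the measured 4.03× on long video sequences, where attention dominates; the image-resolution speedup is smaller because other costs share the budget.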
6. Ablation Study: Impact of Noise Scale

One interesting finding is that the optimal noise scale differs by task.
| Noise Scale (s) | VBench Score |
|---|---|
| 0 (deterministic) | 0.604 |
| 0.5 | 0.709 |
| 1 | 0.709 |
| **2** | **0.711** |
| 4 | 0.482 |
Insights:
- $s = 0$ (fully deterministic): Performance degradation
- Around $s = 2$: Optimal for Depth-to-Video
- $s = 0.5$: Optimal for Image Editing
- $s = 4$: Excessive stochasticity causes performance drop
This contradicts prior work claiming "extremely small noise scales are optimal."
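Reproducing such a sweep with the sampler sketch from Section 3.3 is a few lines (`evaluate` is a hypothetical placeholder for a VBench-style scorer):

```python
# Hypothetical sweep over the noise scale s, using vibt_sample from Section 3.3.
for s in [0.0, 0.5, 1.0, 2.0, 4.0]:
    sample = vibt_sample(v_theta, x0, steps=50, s=s)
    print(s, evaluate(sample))  # evaluate(): placeholder quality metric
```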
7. Limitations and Future Directions
7.1 Current Limitations
- Task-specific noise scale tuning needed: No automatic method to find optimal scale yet.
- Complex structural changes: May have limitations when source and target are very different (e.g., completely different compositions).
7.2 Future Possibilities
- Universal Bridge Model: Handle various tasks with a single model
- Scaling further: Performance verification at 100B+ scale
- Real-time applications: Interactive editing through faster inference
8. Conclusion
ViBT proposes a paradigm shift in conditional generation:
- Noise-free Generation: Direct data-to-data transformation without going through noise
- Stabilized Training: Solving technical barriers for large-scale Bridge model training
- Efficiency: Up to 4× faster inference without Condition Encoder
Particularly impressive is achieving strong performance with very little training data (thousands of samples). This suggests Bridge models are inherently efficient structures for conditional generation.
ViBT's message that "generation is possible without noise" points to a new direction for future generative model research.
References
- Paper: Tan et al., "Vision Bridge Transformer at Scale", arXiv:2511.23199, 2025
- Project Page
- GitHub Repository
- HuggingFace Demo