
AdamW vs Lion: Save 33% GPU Memory While Keeping the Same Performance

How the Lion optimizer saves ~33% of GPU memory compared to AdamW, plus a hyperparameter tuning guide for real-world use. Use it wrong and you lose.


Lion vs AdamW: Is It Time to Dethrone the King?

TL;DR: Lion saves ~33% memory while delivering comparable performance. But without proper hyperparameter tuning, you're worse off than before.

1. The Evolution of Optimizers: From SGD to Lion

The history of deep learning optimizers is a journey to answer one question: "How can we converge faster and more reliably?"

1.1 The SGD Era (1950s~2010s)

Everything started with Stochastic Gradient Descent:

$$\theta_{t+1} = \theta_t - \eta \cdot g_t$$

Simple, but problematic:

  • Noisy gradients: Variance from mini-batch sampling
  • Uniform learning rate: Same update magnitude for all parameters
  • Saddle points: Hard to escape in high dimensions

1.2 The Rise of Momentum (1964)

Polyak's momentum was the first major improvement:

$$\begin{aligned}
v_t &= \gamma v_{t-1} + \eta g_t \\
\theta_{t+1} &= \theta_t - v_t
\end{aligned}$$

By accumulating past gradients, we add "inertia" to the optimization. This smooths out noisy gradients and reduces oscillation in narrow valleys.

1.3 The Adaptive Learning Rate Era (2011~)

AdaGrad (2011): Per-parameter learning rates

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t$$

Here, $G_t$ is the cumulative sum of squared gradients. Frequently updated parameters get smaller learning rates, while rarely updated ones maintain larger rates.

RMSprop (2012): AdaGrad with forgetting

$$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$

Using exponential moving average, older gradients are gradually forgotten.

1.4 Adam: Combining Two Ideas (2015)

Kingma & Ba's Adam merged momentum with adaptive learning rates:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(1st moment, momentum)} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(2nd moment, adaptive lr)} \\
\theta_t &= \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}$$

where $\hat{m}_t, \hat{v}_t$ are the bias-corrected moments.

Adam "just worked" for most tasks and quickly became the standard.

1.5 AdamW: Rediscovering Weight Decay (2017)

Loshchilov & Hutter discovered a critical issue: L2 regularization doesn't work properly in Adam.

The Problem: Adam includes weight decay in the gradient:

$$g_t = \nabla f(\theta) + \lambda \theta$$

Because the decay term is divided by $\sqrt{\hat{v}_t}$ along with the gradient, the adaptive learning rate counteracts the regularization: parameters with a small gradient history are decayed more, those with a large history less, so the effective weight decay is inconsistent across parameters.

The Solution (AdamW): Decouple weight decay from gradients:

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

This simple change significantly improved generalization, and AdamW became the new standard.
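To see the difference concretely, here is a stripped-down sketch that drops momentum and bias correction to isolate where the decay term enters. The function names are illustrative, not library APIs:

```python
import torch

def adam_l2_step(theta, grad, v, lr=1e-3, wd=0.01, eps=1e-8):
    # Coupled L2: the decay term is added to the gradient and is then
    # rescaled by the adaptive denominator, so its strength varies per parameter.
    g = grad + wd * theta
    return theta - lr * g / (v.sqrt() + eps)

def adamw_step_simplified(theta, grad, v, lr=1e-3, wd=0.01, eps=1e-8):
    # Decoupled weight decay: applied directly to the weights,
    # untouched by the adaptive scaling.
    return theta - lr * grad / (v.sqrt() + eps) - lr * wd * theta
```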

1.6 Lion: An Optimizer Discovered by AutoML (2023)

Google's approach was fascinating: Let a program discover it, not humans.

Key ideas from the Lion paper:

  1. Represent optimizers as "programs" (symbolic representation)
  2. Use evolutionary algorithms to search thousands of optimizers
  3. Evaluate on real tasks and select the best

The result was Lion. Surprisingly, it was simpler yet more effective than human-designed alternatives.

2. Lion's Core Ideas

2.1 Mathematical Comparison

AdamW Update:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
\end{aligned}$$

Lion Update:

$$\begin{aligned}
\theta_t &= \theta_{t-1} - \eta \lambda \theta_{t-1} \quad \text{(weight decay first)} \\
c_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
\theta_t &= \theta_t - \eta \cdot \mathrm{sign}(c_t) \\
m_t &= \beta_2 m_{t-1} + (1 - \beta_2) g_t
\end{aligned}$$
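A minimal single-tensor sketch of both update rules, written to mirror the equations above (illustrative helper functions, not a drop-in optimizer; bias correction is written explicitly for AdamW):

```python
import torch

def adamw_update(theta, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Two moment buffers: m (momentum) and v (squared gradients).
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    theta.add_(m_hat / (v_hat.sqrt() + eps) + wd * theta, alpha=-lr)

def lion_update(theta, g, m, lr=1e-4, betas=(0.9, 0.99), wd=0.1):
    # Single moment buffer m -> half the optimizer state of AdamW.
    theta.mul_(1 - lr * wd)                           # decoupled weight decay first
    c = m.mul(betas[0]).add(g, alpha=1 - betas[0])    # β₁ mixes the update direction
    theta.add_(torch.sign(c), alpha=-lr)              # constant-magnitude sign step
    m.mul_(betas[1]).add_(g, alpha=1 - betas[1])      # β₂ stores momentum for next step
```

Note that `lion_update` keeps only `m`; the missing `v` buffer is exactly where the memory saving discussed later comes from.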

2.2 Key Differences

| Aspect | AdamW | Lion |
|---|---|---|
| **Momentum storage** | `m` + `v` (2x memory) | `m` only (1x memory) |
| **Update magnitude** | Adaptive ($\hat{m}/\sqrt{\hat{v}}$) | Constant ($\pm \eta$) |
| **Weight decay** | Applied with update | Applied before update |
| **Bias correction** | Yes | No |
| **Default β** | (0.9, 0.999) | (0.9, 0.99) |

2.3 Lion's Unique Structure: Two Different β Values

The most distinctive aspect of Lion is that β₁ and β₂ serve different purposes:

**β₁ (0.9)**: used for computing the update direction
$c_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, and $\mathrm{sign}(c_t)$ determines this step's update direction.

**β₂ (0.99)**: used for storing momentum for the next step
$m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t$, which maintains a smoother momentum estimate.

Why this separation works isn't fully understood. Because the rule was discovered by AutoML rather than derived by hand, there's little theoretical explanation for why it helps. But empirically, it does.

2.4 Lion's sign() Operation

The core of Lion is the sign() operation:

$$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

This means ignoring gradient magnitude and using only direction.

Why this seems counterintuitive:

  • Gradient magnitude tells us "how much to move"
  • Throwing this away seems like information loss

But in practice (see the short demonstration after this list):

  1. All parameters get equal-magnitude updates
  2. Learning rate η determines step size
  3. Implicit regularization effect emerges
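A tiny numeric illustration of points 1 and 2 (the gradient values are arbitrary):

```python
import torch

g = torch.tensor([5.0, 0.3, -0.001, -2.0])  # gradients at very different scales
lr = 1e-4

sgd_update  = -lr * g                  # step size tracks gradient magnitude
lion_update = -lr * torch.sign(g)      # every coordinate moves by exactly ±lr

print(sgd_update)   # ≈ [-5.0e-04, -3.0e-05, 1.0e-07, 2.0e-04]
print(lion_update)  # ≈ [-1.0e-04, -1.0e-04, 1.0e-04, 1.0e-04]
```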

3. Why Does sign() Work? Theoretical Analysis

3.1 Implicit Regularization

The sign() operation affects gradients differently based on magnitude:

  • Large gradient (|g| ≫ 0): sign(g) · η = ±η (magnitude "suppressed" down to η)
  • Small gradient (|g| ≈ 0): sign(g) · η = ±η (magnitude "amplified" up to η)

This creates an effect similar to L∞ regularization. All parameter updates become uniform, preventing any single parameter from growing excessively.

3.2 Noise Robustness

Stochastic gradients can be viewed as "signal + noise":

$$g_t = \nabla f(\theta) + \epsilon_t$$

Properties of sign():

  • If |signal| > |noise|: sign(g) ≈ sign(signal)
  • Noise is ignored unless it's large enough to flip the direction

This enables more stable updates in noisy gradient environments.
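A quick Monte Carlo sanity check of this property, assuming Gaussian gradient noise (all numbers are illustrative):

```python
import torch

torch.manual_seed(0)
signal = 0.1  # true gradient component for one coordinate
for noise_std in [0.05, 0.1, 0.5]:
    noisy = signal + noise_std * torch.randn(100_000)
    flip_rate = (torch.sign(noisy) != 1.0).float().mean().item()
    print(f"noise_std={noise_std}: sign flipped in {flip_rate:.1%} of samples")

# The sign only flips when the noise is large enough to overturn the signal:
# rare while |signal| > noise_std, approaching 50% as noise dominates.
```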

3.3 Loss Landscape and Flat Minima

Sharp minima vs Flat minima hypothesis:

  • Sharp minima: Narrow valleys, poor generalization
  • Flat minima: Wide basins, good generalization

Effects of sign() updates:

  1. Moving with constant step size
  2. "Bouncing out" of sharp minima (step too large)
  3. Settling only in flat minima

This hypothesis can explain why Lion shows better generalization in some cases.

3.4 Analysis from a Preconditioner Perspective

Generalizing optimizers:

$$\theta_{t+1} = \theta_t - \eta \cdot P_t \cdot g_t$$

where $P_t$ is the preconditioner.

| Optimizer | Preconditioner $P_t$ |
|---|---|
| SGD | $I$ (identity) |
| Adam | $\text{diag}(1/\sqrt{\hat{v}_t})$ |
| Lion | $\text{diag}(\mathrm{sign}(c_t)/g_t)$ (effective) |

Geometric interpretation:

  • SGD: Explores loss surface as-is
  • Adam: Explores by "stretching/shrinking" loss surface (coordinate-wise scaling)
  • Lion: Moves with equal magnitude in all directions (isotropic in update magnitude)

4. Key Experimental Results from the Lion Paper

4.1 Image Classification (ImageNet)

The paper experimented with various models including ViT-B/16, ViT-L/16.

Key findings:

  • Lion performs similarly or slightly better than AdamW
  • Memory usage is definitively lower
  • More stable at large batch sizes (4K+)

4.2 Language Modeling

Experiments with GPT-2 style models:

Key findings:

  • Comparable perplexity to AdamW
  • Good training stability
  • Memory savings significant at LLM scale

4.3 Diffusion Models

On image generation tasks:

Key findings:

  • Similar FID scores to AdamW
  • Training might be slightly slower
  • Final quality is equivalent

4.4 Paper's Core Recommendations

Hyperparameter guidelines emphasized in the Lion paper:

| Hyperparameter | Lion vs AdamW |
|---|---|
| Learning rate | **3-10x smaller** |
| Weight decay | **3-10x larger** |
| β₁ | 0.9 (same) |
| β₂ | 0.99 (smaller than AdamW's 0.999) |

Important: these ratios vary by task. Vision tends toward 3x, NLP toward 10x.

5. Experiment: Comparison on CIFAR-10

5.1 Experimental Setup

```python
# AdamW configuration (default)
adamw_config = {
    "lr": 1e-3,
    "weight_decay": 0.01,
    "betas": (0.9, 0.999),
}

# Lion configuration (paper recommendation)
lion_config = {
    "lr": 1e-4,           # 1/10 of AdamW
    "weight_decay": 0.1,  # 10x AdamW
    "betas": (0.9, 0.99),
}
```

5.2 Results

| Metric | AdamW | Lion |
|---|---|---|
| Final Val Acc | 84.2% | 83.8% |
| Best Val Acc | 85.1% | 84.5% |
| Optimizer Memory | 2.4 MB | 1.2 MB |
| Generalization Gap | 4.2% | 3.8% |

5.3 Visualization

[Figure: AdamW vs Lion memory comparison]

6. Complete Hyperparameter Tuning Guide

6.1 Finding the Learning Rate

Lion's lr must be much smaller than AdamW's. Here's why:

  1. sign() fixes magnitude to 1
  2. AdamW's $1/\sqrt{\hat{v}_t}$ factor is usually < 1, reducing the effective lr
  3. Lion lacks this scaling, so lr itself must be reduced

LR Range Test approach:

```python
# 1. Find the optimal AdamW lr (using standard methods)
# 2. Set Lion lr to 1/3 ~ 1/10 of that value
# 3. Validate quickly on a small dataset

adamw_lr = 1e-3  # previously found value
lion_lr_candidates = [adamw_lr / 3, adamw_lr / 5, adamw_lr / 10]
```

6.2 Setting Weight Decay

Weight decay is more critical in Lion:

  1. Fixed update magnitude makes WD's relative impact larger
  2. Too small WD → overfitting
  3. Too large WD → underfitting

Empirical rules:

| Task | AdamW WD | Lion WD |
|---|---|---|
| Vision (ViT) | 0.05 | 0.5 |
| Vision (ResNet) | 1e-4 | 1e-3 |
| NLP (BERT) | 0.01 | 0.1 |
| NLP (GPT) | 0.1 | 1.0 |
| Diffusion | 0.01 | 0.05 |

6.3 Adjusting β Values

Default values (0.9, 0.99) work in most cases. However:

β₁ (momentum for update):

  • Lower (0.8): More sensitive to current gradient
  • Higher (0.95): More persistence of past direction

β₂ (momentum for storage):

  • Lower (0.95): Faster adaptation, less smooth
  • Higher (0.999): Slower adaptation, smoother

General guidelines:

  • Short training: Lower β₂ slightly (0.95~0.99)
  • Long training: Raise β₂ (0.99~0.999)

6.4 Relationship with Batch Size

Lion is more stable at large batch sizes:

| Batch Size | AdamW Stability | Lion Stability |
|---|---|---|
| 32-128 | Good | Moderate |
| 256-1024 | Good | Good |
| 2048-4096 | Moderate | Good |
| 8192+ | Caution needed | Good |

Reason:

  • sign() normalizes the update magnitude, so the step size stays bounded regardless of gradient scale
  • Large batches reduce gradient variance, which makes the sign estimate more reliable

6.5 AdamW to Lion Conversion Checklist

```python
def adamw_to_lion_config(adamw_config, task_type="general"):
    """
    Convert an AdamW configuration to a Lion starting point.

    Args:
        adamw_config: dict with lr, weight_decay, betas
        task_type: "vision", "nlp", "diffusion", "general"
    """
    factors = {
        "vision": {"lr_div": 3, "wd_mul": 10},
        "nlp": {"lr_div": 10, "wd_mul": 10},
        "diffusion": {"lr_div": 5, "wd_mul": 5},
        "general": {"lr_div": 7, "wd_mul": 7},
    }

    f = factors[task_type]

    return {
        "lr": adamw_config["lr"] / f["lr_div"],
        "weight_decay": adamw_config["weight_decay"] * f["wd_mul"],
        "betas": (0.9, 0.99),  # Lion defaults
    }

# Usage example
adamw = {"lr": 1e-3, "weight_decay": 0.01, "betas": (0.9, 0.999)}
lion = adamw_to_lion_config(adamw, task_type="nlp")
# lion = {"lr": 1e-4, "weight_decay": 0.1, "betas": (0.9, 0.99)}
```

6.6 Debugging Tuning Failures

Symptom 1: Loss diverges

  • Cause: lr too large
  • Fix: Reduce lr by 2-3x

Symptom 2: Training too slow

  • Cause: lr too small or WD too large
  • Fix: Increase lr by 1.5-2x or reduce WD

Symptom 3: Good train loss, bad val loss

  • Cause: WD too small (overfitting)
  • Fix: Increase WD by 2-3x

Symptom 4: Both train and val are bad

  • Cause: WD too large (underfitting)
  • Fix: Decrease WD by 2-3x

7. Practical Application Guide

7.1 When to Use Lion

  1. Memory constrained: Optimizer state is the bottleneck for LLM/Large Diffusion models
  2. Large batch training: Lion is more stable with large batches
  3. Time for tuning: Lion is hyperparameter-sensitive, so only adopt it if you can budget for a proper search

7.2 When to Avoid Lion

  1. Rapid prototyping: AdamW defaults usually work
  2. Small models/datasets: Memory savings are negligible
  3. Existing AdamW recipes: No reason to abandon proven settings

7.3 Using with Gradient Clipping

The Lion paper recommends gradient clipping:

```python
import torch
from lion_pytorch import Lion  # assuming the lion-pytorch package; any Lion with this signature works

# Lion + gradient clipping combination
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.1)

for batch in dataloader:
    loss = model(batch)
    loss.backward()

    # Gradient clipping (Lion paper recommendation)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    optimizer.zero_grad()
```

7.4 Combining with Learning Rate Schedulers

Lion works well with standard schedulers:

```python
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_epochs,
    eta_min=1e-6,  # smaller min_lr for Lion
)
```

Warmup note (a minimal warmup + cosine setup is sketched after this list):

  • Lion can use shorter warmup (compared to AdamW)
  • The paper uses 1-3% of total steps for warmup
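A sketch of such a setup with PyTorch's built-in schedulers. The step counts, the 1% warmup fraction, and the SGD stand-in optimizer are illustrative only; swap in your Lion optimizer in practice:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# A dummy parameter/optimizer stands in for the Lion setup shown above.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=1e-4)  # stand-in; use Lion in practice

total_steps = 100_000
warmup_steps = int(0.01 * total_steps)  # ~1% of total steps

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Call scheduler.step() once per optimizer step (not per epoch) with this setup.
```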

8. Real-World Impact of Memory Savings

8.1 Theoretical Calculation

Optimizer state memory = Number of parameters × Moments stored × dtype size

| Optimizer | Moments | 7B model (bf16) |
|---|---|---|
| SGD | 0 | 0 GB |
| SGD+momentum | 1 | 14 GB |
| Adam/AdamW | 2 | 28 GB |
| Lion | 1 | 14 GB |
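The table follows directly from the formula above; a quick sanity check in code (assuming, as the table does, that the moments are stored in bf16 at 2 bytes per value):

```python
def optimizer_state_gb(num_params: float, num_moments: int, bytes_per_value: int = 2) -> float:
    """Optimizer state memory = number of parameters x moments stored x dtype size."""
    return num_params * num_moments * bytes_per_value / 1e9

params_7b = 7e9
print(optimizer_state_gb(params_7b, num_moments=2))  # Adam/AdamW (m + v): 28.0 GB
print(optimizer_state_gb(params_7b, num_moments=1))  # Lion / SGD+momentum: 14.0 GB
```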

8.2 7B LLM Training Scenario

| Component | AdamW | Lion |
|---|---|---|
| Model (bf16) | 14 GB | 14 GB |
| Gradients | 14 GB | 14 GB |
| Optimizer m | 14 GB | 14 GB |
| Optimizer v | 14 GB | **0 GB** |
| **Total** | **56 GB** | **42 GB** |

25% memory reduction enables:

  • 2x batch size increase
  • Sequence length 4K → 8K
  • Training larger models on A100 80GB

8.3 Synergy with ZeRO

Lion + ZeRO-2 is highly efficient:

ZeRO-2 with AdamW (4 GPUs):
- Optimizer state per GPU: 28GB / 4 = 7GB

ZeRO-2 with Lion (4 GPUs):
- Optimizer state per GPU: 14GB / 4 = 3.5GB

Additional savings: 3.5GB per GPU
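The same arithmetic, generalized to any data-parallel world size (a pure bookkeeping sketch; the 28 GB and 14 GB figures are the bf16 optimizer-state totals from above):

```python
def zero2_optimizer_state_per_gpu_gb(total_state_gb: float, world_size: int) -> float:
    # ZeRO-2 shards optimizer state evenly across data-parallel ranks.
    return total_state_gb / world_size

for name, state_gb in [("AdamW", 28.0), ("Lion", 14.0)]:
    per_gpu = zero2_optimizer_state_per_gpu_gb(state_gb, world_size=4)
    print(f"{name}: {per_gpu} GB of optimizer state per GPU")
# AdamW: 7.0 GB, Lion: 3.5 GB -> 3.5 GB saved per GPU
```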

9. Comparison with Other Optimizers

9.1 LAMB (Layer-wise Adaptive Moments)

Developed for large batch BERT training:

  • Applies different lr per layer
  • Effective at very large batch sizes (32K+)
  • More complex implementation than Lion

9.2 Shampoo

Efficiently approximates 2nd order information:

  • Theoretically faster convergence
  • But high compute/memory cost
  • Rarely used in practice

9.3 Adafactor

Designed for Transformers:

  • Factorizes 2nd moment to save memory
  • More complex than Lion with similar memory benefits
  • Used for T5 training

9.4 Comparison Summary

| Optimizer | Memory | Convergence | Tuning Difficulty | Recommended For |
|---|---|---|---|---|
| SGD+M | 1x | Slow | Low | Vision, peak performance |
| AdamW | 2x | Fast | Low | Default choice |
| Lion | 1x | Medium | High | Memory constrained |
| LAMB | 2x | Fast | Medium | Very large batch |
| Adafactor | 1.5x | Medium | Medium | Transformers |

10. Conclusion: The Default is Default for a Reason

Lion is undeniably an interesting optimizer. The fact that AutoML discovered it, and that it's simple yet effective, makes it noteworthy.

But in practice:

  1. AdamW remains the safe choice: Years of validated settings and recipes
  2. Lion shines when memory is critical: Its value emerges in large-scale model training
  3. Unconditional switching is risky: Swapping to Lion without tuning likely degrades performance

10.1 When to Consider Lion?

Checklist:
□ Is GPU memory so tight that you're reducing batch size?
□ Does optimizer state consume a significant portion of total memory?
□ Do you have time to invest in hyperparameter tuning?
□ Are you training with large batches (1K+)?

If 3+ boxes are checked, Lion is worth trying.

10.2 A Practitioner's Perspective

"Every time a new optimizer comes out, people say 'this time it's different,' but AdamW keeps surviving. Lion will likely be no exception. That said, in specific situations where memory efficiency matters, it's definitely worth considering."

Lion's true value isn't "replacing AdamW" but rather providing a new option when you need to trade off memory and performance.

References

  1. Chen, X., et al. "Symbolic Discovery of Optimization Algorithms." arXiv:2302.06675 (2023)
  2. Loshchilov, I., & Hutter, F. "Decoupled Weight Decay Regularization." ICLR 2019
  3. Kingma, D. P., & Ba, J. "Adam: A Method for Stochastic Optimization." ICLR 2015
  4. You, Y., et al. "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes." ICLR 2020
  5. Shazeer, N., & Stern, M. "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost." ICML 2018