
AdamW vs Lion: Save 33% GPU Memory While Keeping the Same Performance

How the Lion optimizer saves ~33% of GPU memory compared to AdamW, plus a hyperparameter tuning guide for real-world use. Use it wrong and you lose.


Lion vs AdamW: Is It Time to Dethrone the King?

TL;DR: Lion saves ~33% memory while delivering comparable performance. But without proper hyperparameter tuning, you're worse off than before.

1. The Evolution of Optimizers: From SGD to Lion

The history of deep learning optimizers is a journey to answer one question: "How can we converge faster and more reliably?"

1.1 The SGD Era (1950s~2010s)

Everything started with Stochastic Gradient Descent:

$$\theta_{t+1} = \theta_t - \eta \cdot g_t$$

Simple, but problematic:

  • Noisy gradients: Variance from mini-batch sampling
  • Uniform learning rate: Same update magnitude for all parameters
  • Saddle points: Hard to escape in high dimensions

1.2 The Rise of Momentum (1964)

Polyak's momentum was the first major improvement:

$$\begin{aligned}
v_t &= \gamma v_{t-1} + \eta g_t \\
\theta_{t+1} &= \theta_t - v_t
\end{aligned}$$

By accumulating past gradients, we add "inertia" to the optimization. This smooths out noisy gradients and reduces oscillation in narrow valleys.

1.3 The Adaptive Learning Rate Era (2011~)

AdaGrad (2011): Per-parameter learning rates

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t$$

Here, $G_t$ is the cumulative sum of squared gradients. Frequently updated parameters get smaller learning rates, while rarely updated ones maintain larger rates.

RMSprop (2012): AdaGrad with forgetting

$$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$

Using exponential moving average, older gradients are gradually forgotten.

1.4 Adam: Combining Two Ideas (2015)

Kingma & Ba's Adam merged momentum with adaptive learning rates:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(1st moment, momentum)} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(2nd moment, adaptive lr)} \\
\theta_t &= \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}$$

where $\hat{m}_t, \hat{v}_t$ are the bias-corrected moments.

Adam "just worked" for most tasks and quickly became the standard.

1.5 AdamW: Rediscovering Weight Decay (2017)

Loshchilov & Hutter discovered a critical issue: L2 regularization doesn't work properly in Adam.

The Problem: Adam includes weight decay in the gradient:

$$g_t = \nabla f(\theta) + \lambda \theta$$

Because the decay term is divided by $\sqrt{\hat{v}_t}$ along with the gradient, the adaptive learning rate counteracts the regularization: parameters with a small gradient history are decayed more, those with a large history less, so the effective weight decay is inconsistent across parameters.

The Solution (AdamW): Decouple weight decay from gradients:

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

This simple change significantly improved generalization, and AdamW became the new standard.
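To see the difference concretely, here is a stripped-down sketch that drops momentum and bias correction to isolate where the decay term enters. The function names are illustrative, not library APIs:

```python
import torch

def adam_l2_step(theta, grad, v, lr=1e-3, wd=0.01, eps=1e-8):
    # Coupled L2: the decay term is added to the gradient and is then
    # rescaled by the adaptive denominator, so its strength varies per parameter.
    g = grad + wd * theta
    return theta - lr * g / (v.sqrt() + eps)

def adamw_step_simplified(theta, grad, v, lr=1e-3, wd=0.01, eps=1e-8):
    # Decoupled weight decay: applied directly to the weights,
    # untouched by the adaptive scaling.
    return theta - lr * grad / (v.sqrt() + eps) - lr * wd * theta
```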

1.6 Lion: An Optimizer Discovered by AutoML (2023)

Google's approach was fascinating: Let a program discover it, not humans.

Key ideas from the Lion paper:

  1. Represent optimizers as "programs" (symbolic representation)
  2. Use evolutionary algorithms to search thousands of optimizers
  3. Evaluate on real tasks and select the best

The result was Lion. Surprisingly, it was simpler yet more effective than human-designed alternatives.

2. Lion's Core Ideas

2.1 Mathematical Comparison

AdamW Update:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
\end{aligned}$$

Lion Update:

$$\begin{aligned}
\theta_t &= \theta_{t-1} - \eta \lambda \theta_{t-1} \quad \text{(weight decay first)} \\
c_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
\theta_t &= \theta_t - \eta \cdot \mathrm{sign}(c_t) \\
m_t &= \beta_2 m_{t-1} + (1 - \beta_2) g_t
\end{aligned}$$
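A minimal single-tensor sketch of both update rules, written to mirror the equations above (illustrative helper functions, not a drop-in optimizer; bias correction is written explicitly for AdamW):

```python
import torch

def adamw_update(theta, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Two moment buffers: m (momentum) and v (squared gradients).
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    theta.add_(m_hat / (v_hat.sqrt() + eps) + wd * theta, alpha=-lr)

def lion_update(theta, g, m, lr=1e-4, betas=(0.9, 0.99), wd=0.1):
    # Single moment buffer m -> half the optimizer state of AdamW.
    theta.mul_(1 - lr * wd)                           # decoupled weight decay first
    c = m.mul(betas[0]).add(g, alpha=1 - betas[0])    # β₁ mixes the update direction
    theta.add_(torch.sign(c), alpha=-lr)              # constant-magnitude sign step
    m.mul_(betas[1]).add_(g, alpha=1 - betas[1])      # β₂ stores momentum for next step
```

Note that `lion_update` keeps only `m`; the missing `v` buffer is exactly where the memory saving discussed later comes from.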

2.2 Key Differences

| Aspect | AdamW | Lion |
|---|---|---|
| **Momentum storage** | `m` + `v` (2x memory) | `m` only (1x memory) |
| **Update magnitude** | Adaptive ($\hat{m}/\sqrt{\hat{v}}$) | Constant ($\pm \eta$) |
| **Weight decay** | Applied with update | Applied before update |
| **Bias correction** | Yes | No |
| **Default β** | (0.9, 0.999) | (0.9, 0.99) |

2.3 Lion's Unique Structure: Two Different β Values

The most distinctive aspect of Lion is that β₁ and β₂ serve different purposes:

**β₁ (0.9)**: used for computing the update direction
$c_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, and $\mathrm{sign}(c_t)$ determines this step's update direction.

**β₂ (0.99)**: used for storing momentum for the next step
$m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t$, which maintains a smoother momentum estimate.

Why this separation works isn't fully understood. Because the rule was discovered by AutoML rather than derived by hand, there's little theoretical explanation for why it helps. But empirically, it does.

2.4 Lion's sign() Operation

The core of Lion is the sign() operation:

$$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

This means ignoring gradient magnitude and using only direction.

Why this seems counterintuitive:

  • Gradient magnitude tells us "how much to move"
  • Throwing this away seems like information loss

But in practice (see the short demonstration after this list):

  1. All parameters get equal-magnitude updates
  2. Learning rate η determines step size
  3. Implicit regularization effect emerges
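A tiny numeric illustration of points 1 and 2 (the gradient values are arbitrary):

```python
import torch

g = torch.tensor([5.0, 0.3, -0.001, -2.0])  # gradients at very different scales
lr = 1e-4

sgd_update  = -lr * g                  # step size tracks gradient magnitude
lion_update = -lr * torch.sign(g)      # every coordinate moves by exactly ±lr

print(sgd_update)   # ≈ [-5.0e-04, -3.0e-05, 1.0e-07, 2.0e-04]
print(lion_update)  # ≈ [-1.0e-04, -1.0e-04, 1.0e-04, 1.0e-04]
```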

3. Why Does sign() Work? Theoretical Analysis

3.1 Implicit Regularization

The sign() operation affects gradients differently based on magnitude:

  • Large gradient (|g| ≫ 0): sign(g) · η = ±η (magnitude "suppressed" down to η)
  • Small gradient (|g| ≈ 0): sign(g) · η = ±η (magnitude "amplified" up to η)

This creates an effect similar to L∞ regularization. All parameter updates become uniform, preventing any single parameter from growing excessively.

3.2 Noise Robustness

Stochastic gradients can be viewed as "signal + noise":

$$g_t = \nabla f(\theta) + \epsilon_t$$

Properties of sign():

  • If |signal| > |noise|: sign(g) ≈ sign(signal)
  • Noise is ignored unless it's large enough to flip the direction

This enables more stable updates in noisy gradient environments.
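A quick Monte Carlo sanity check of this property, assuming Gaussian gradient noise (all numbers are illustrative):

```python
import torch

torch.manual_seed(0)
signal = 0.1  # true gradient component for one coordinate
for noise_std in [0.05, 0.1, 0.5]:
    noisy = signal + noise_std * torch.randn(100_000)
    flip_rate = (torch.sign(noisy) != 1.0).float().mean().item()
    print(f"noise_std={noise_std}: sign flipped in {flip_rate:.1%} of samples")

# The sign only flips when the noise is large enough to overturn the signal:
# rare while |signal| > noise_std, approaching 50% as noise dominates.
```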

3.3 Loss Landscape and Flat Minima

Sharp minima vs Flat minima hypothesis:

  • Sharp minima: Narrow valleys, poor generalization
  • Flat minima: Wide basins, good generalization

Effects of sign() updates:

  1. Moving with constant step size
  2. "Bouncing out" of sharp minima (step too large)
  3. Settling only in flat minima

This hypothesis can explain why Lion shows better generalization in some cases.

3.4 Analysis from a Preconditioner Perspective

Generalizing optimizers:

$$\theta_{t+1} = \theta_t - \eta \cdot P_t \cdot g_t$$

where $P_t$ is the preconditioner.

| Optimizer | Preconditioner $P_t$ |
|---|---|
| SGD | $I$ (identity) |
| Adam | $\text{diag}(1/\sqrt{\hat{v}_t})$ |
| Lion | $\text{diag}(\mathrm{sign}(c_t)/g_t)$ (effective) |

Geometric interpretation:

  • SGD: Explores loss surface as-is
  • Adam: Explores by "stretching/shrinking" loss surface (coordinate-wise scaling)
  • Lion: Moves with equal magnitude in all directions (isotropic in update magnitude)

4. Key Experimental Results from the Lion Paper

4.1 Image Classification (ImageNet)

The paper experimented with various models including ViT-B/16, ViT-L/16.

Key findings:

  • Lion performs similarly or slightly better than AdamW
  • Memory usage is definitively lower
  • More stable at large batch sizes (4K+)

4.2 Language Modeling

Experiments with GPT-2 style models:

Key findings:

  • Comparable perplexity to AdamW
  • Good training stability
  • Memory savings significant at LLM scale

4.3 Diffusion Models

On image generation tasks:

Key findings:

  • Similar FID scores to AdamW
  • Training might be slightly slower
  • Final quality is equivalent

4.4 Paper's Core Recommendations

Hyperparameter guidelines emphasized in the Lion paper:

| Hyperparameter | Lion vs AdamW |
|---|---|
| Learning rate | **3-10x smaller** |
| Weight decay | **3-10x larger** |
| β₁ | 0.9 (same) |
| β₂ | 0.99 (smaller than AdamW's 0.999) |

Important: these ratios vary by task. Vision tends toward 3x, NLP toward 10x.

5. Experiment: Comparison on CIFAR-10

5.1 Experimental Setup

```python
# AdamW configuration (default)
adamw_config = {
    "lr": 1e-3,
    "weight_decay": 0.01,
    "betas": (0.9, 0.999),
}

# Lion configuration (paper recommendation)
lion_config = {
    "lr": 1e-4,           # 1/10 of AdamW
    "weight_decay": 0.1,  # 10x AdamW
    "betas": (0.9, 0.99),
}
```

5.2 Results

| Metric | AdamW | Lion |
|---|---|---|
| Final Val Acc | 84.2% | 83.8% |
| Best Val Acc | 85.1% | 84.5% |
| Optimizer Memory | 2.4 MB | 1.2 MB |
| Generalization Gap | 4.2% | 3.8% |

5.3 Visualization

[Figure: AdamW vs Lion memory comparison]

6. Complete Hyperparameter Tuning Guide

6.1 Finding the Learning Rate

Lion's lr must be much smaller than AdamW's. Here's why:

  1. sign() fixes magnitude to 1
  2. AdamW's $1/\sqrt{\hat{v}_t}$ factor is usually < 1, reducing the effective lr
  3. Lion lacks this scaling, so lr itself must be reduced

LR Range Test approach:

```python
# 1. Find the optimal AdamW lr (using standard methods)
# 2. Set Lion lr to 1/3 ~ 1/10 of that value
# 3. Validate quickly on a small dataset

adamw_lr = 1e-3  # previously found value
lion_lr_candidates = [adamw_lr / 3, adamw_lr / 5, adamw_lr / 10]
```

6.2 Setting Weight Decay

Weight decay is more critical in Lion:

  1. Fixed update magnitude makes WD's relative impact larger
  2. Too small WD → overfitting
  3. Too large WD → underfitting

Empirical rules:

| Task | AdamW WD | Lion WD |
|---|---|---|
| Vision (ViT) | 0.05 | 0.5 |
| Vision (ResNet) | 1e-4 | 1e-3 |
| NLP (BERT) | 0.01 | 0.1 |
| NLP (GPT) | 0.1 | 1.0 |
| Diffusion | 0.01 | 0.05 |

6.3 Adjusting β Values

Default values (0.9, 0.99) work in most cases. However:

β₁ (momentum for update):

  • Lower (0.8): More sensitive to current gradient
  • Higher (0.95): More persistence of past direction

β₂ (momentum for storage):

  • Lower (0.95): Faster adaptation, less smooth
  • Higher (0.999): Slower adaptation, smoother

General guidelines:

  • Short training: Lower β₂ slightly (0.95~0.99)
  • Long training: Raise β₂ (0.99~0.999)

6.4 Relationship with Batch Size

Lion is more stable at large batch sizes:

| Batch Size | AdamW Stability | Lion Stability |
|---|---|---|
| 32-128 | Good | Moderate |
| 256-1024 | Good | Good |
| 2048-4096 | Moderate | Good |
| 8192+ | Caution needed | Good |

Reason:

  • sign() normalizes the update magnitude, so the step size stays bounded regardless of gradient scale
  • Large batches reduce gradient variance, which makes the sign estimate more reliable

6.5 AdamW to Lion Conversion Checklist

```python
def adamw_to_lion_config(adamw_config, task_type="general"):
    """
    Convert an AdamW configuration to a Lion starting point.

    Args:
        adamw_config: dict with lr, weight_decay, betas
        task_type: "vision", "nlp", "diffusion", "general"
    """
    factors = {
        "vision": {"lr_div": 3, "wd_mul": 10},
        "nlp": {"lr_div": 10, "wd_mul": 10},
        "diffusion": {"lr_div": 5, "wd_mul": 5},
        "general": {"lr_div": 7, "wd_mul": 7},
    }

    f = factors[task_type]

    return {
        "lr": adamw_config["lr"] / f["lr_div"],
        "weight_decay": adamw_config["weight_decay"] * f["wd_mul"],
        "betas": (0.9, 0.99),  # Lion defaults
    }

# Usage example
adamw = {"lr": 1e-3, "weight_decay": 0.01, "betas": (0.9, 0.999)}
lion = adamw_to_lion_config(adamw, task_type="nlp")
# lion = {"lr": 1e-4, "weight_decay": 0.1, "betas": (0.9, 0.99)}
```

6.6 Debugging Tuning Failures

Symptom 1: Loss diverges

  • Cause: lr too large
  • Fix: Reduce lr by 2-3x

Symptom 2: Training too slow

  • Cause: lr too small or WD too large
  • Fix: Increase lr by 1.5-2x or reduce WD

Symptom 3: Good train loss, bad val loss

  • Cause: WD too small (overfitting)
  • Fix: Increase WD by 2-3x

Symptom 4: Both train and val are bad

  • Cause: WD too large (underfitting)
  • Fix: Decrease WD by 2-3x

7. Practical Application Guide

7.1 When to Use Lion

  1. Memory constrained: Optimizer state is the bottleneck for LLM/Large Diffusion models
  2. Large batch training: Lion is more stable with large batches
  3. Time for tuning: Lion is hyperparameter-sensitive, so only adopt it if you can budget for a proper search

7.2 When to Avoid Lion

  1. Rapid prototyping: AdamW defaults usually work
  2. Small models/datasets: Memory savings are negligible
  3. Existing AdamW recipes: No reason to abandon proven settings

7.3 Using with Gradient Clipping

The Lion paper recommends gradient clipping:

```python
import torch
from lion_pytorch import Lion  # assuming the lion-pytorch package; any Lion with this signature works

# Lion + gradient clipping combination
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.1)

for batch in dataloader:
    loss = model(batch)
    loss.backward()

    # Gradient clipping (Lion paper recommendation)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    optimizer.zero_grad()
```

7.4 Combining with Learning Rate Schedulers

Lion works well with standard schedulers:

```python
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_epochs,
    eta_min=1e-6,  # smaller min_lr for Lion
)
```

Warmup note (a minimal warmup + cosine setup is sketched after this list):

  • Lion can use shorter warmup (compared to AdamW)
  • The paper uses 1-3% of total steps for warmup
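A sketch of such a setup with PyTorch's built-in schedulers. The step counts, the 1% warmup fraction, and the SGD stand-in optimizer are illustrative only; swap in your Lion optimizer in practice:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# A dummy parameter/optimizer stands in for the Lion setup shown above.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=1e-4)  # stand-in; use Lion in practice

total_steps = 100_000
warmup_steps = int(0.01 * total_steps)  # ~1% of total steps

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Call scheduler.step() once per optimizer step (not per epoch) with this setup.
```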

8. Real-World Impact of Memory Savings

8.1 Theoretical Calculation

Optimizer state memory = Number of parameters × Moments stored × dtype size

| Optimizer | Moments | 7B model (bf16) |
|---|---|---|
| SGD | 0 | 0 GB |
| SGD+momentum | 1 | 14 GB |
| Adam/AdamW | 2 | 28 GB |
| Lion | 1 | 14 GB |
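The table follows directly from the formula above; a quick sanity check in code (assuming, as the table does, that the moments are stored in bf16 at 2 bytes per value):

```python
def optimizer_state_gb(num_params: float, num_moments: int, bytes_per_value: int = 2) -> float:
    """Optimizer state memory = number of parameters x moments stored x dtype size."""
    return num_params * num_moments * bytes_per_value / 1e9

params_7b = 7e9
print(optimizer_state_gb(params_7b, num_moments=2))  # Adam/AdamW (m + v): 28.0 GB
print(optimizer_state_gb(params_7b, num_moments=1))  # Lion / SGD+momentum: 14.0 GB
```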

8.2 7B LLM Training Scenario

| Component | AdamW | Lion |
|---|---|---|
| Model (bf16) | 14 GB | 14 GB |
| Gradients | 14 GB | 14 GB |
| Optimizer m | 14 GB | 14 GB |
| Optimizer v | 14 GB | **0 GB** |
| **Total** | **56 GB** | **42 GB** |

25% memory reduction enables:

  • 2x batch size increase
  • Sequence length 4K → 8K
  • Training larger models on A100 80GB

8.3 Synergy with ZeRO

Lion + ZeRO-2 is highly efficient:

ZeRO-2 with AdamW (4 GPUs):
- Optimizer state per GPU: 28GB / 4 = 7GB

ZeRO-2 with Lion (4 GPUs):
- Optimizer state per GPU: 14GB / 4 = 3.5GB

Additional savings: 3.5GB per GPU
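The same arithmetic, generalized to any data-parallel world size (a pure bookkeeping sketch; the 28 GB and 14 GB figures are the bf16 optimizer-state totals from above):

```python
def zero2_optimizer_state_per_gpu_gb(total_state_gb: float, world_size: int) -> float:
    # ZeRO-2 shards optimizer state evenly across data-parallel ranks.
    return total_state_gb / world_size

for name, state_gb in [("AdamW", 28.0), ("Lion", 14.0)]:
    per_gpu = zero2_optimizer_state_per_gpu_gb(state_gb, world_size=4)
    print(f"{name}: {per_gpu} GB of optimizer state per GPU")
# AdamW: 7.0 GB, Lion: 3.5 GB -> 3.5 GB saved per GPU
```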

9. Comparison with Other Optimizers

9.1 LAMB (Layer-wise Adaptive Moments)

Developed for large batch BERT training:

  • Applies different lr per layer
  • Effective at very large batch sizes (32K+)
  • More complex implementation than Lion

9.2 Shampoo

Efficiently approximates 2nd order information:

  • Theoretically faster convergence
  • But high compute/memory cost
  • Rarely used in practice

9.3 Adafactor

Designed for Transformers:

  • Factorizes 2nd moment to save memory
  • More complex than Lion with similar memory benefits
  • Used for T5 training

9.4 Comparison Summary

| Optimizer | Memory | Convergence | Tuning Difficulty | Recommended For |
|---|---|---|---|---|
| SGD+M | 1x | Slow | Low | Vision, peak performance |
| AdamW | 2x | Fast | Low | Default choice |
| Lion | 1x | Medium | High | Memory constrained |
| LAMB | 2x | Fast | Medium | Very large batch |
| Adafactor | 1.5x | Medium | Medium | Transformers |

10. Conclusion: The Default is Default for a Reason

Lion is undeniably an interesting optimizer. The fact that AutoML discovered it, and that it's simple yet effective, makes it noteworthy.

But in practice:

  1. AdamW remains the safe choice: Years of validated settings and recipes
  2. Lion shines when memory is critical: Its value emerges in large-scale model training
  3. Unconditional switching is risky: Swapping to Lion without tuning likely degrades performance

10.1 When to Consider Lion?

Checklist:
□ Is GPU memory so tight that you're reducing batch size?
□ Does optimizer state consume a significant portion of total memory?
□ Do you have time to invest in hyperparameter tuning?
□ Are you training with large batches (1K+)?

If 3+ boxes are checked, Lion is worth trying.

10.2 A Practitioner's Perspective

"Every time a new optimizer comes out, people say 'this time it's different,' but AdamW keeps surviving. Lion will likely be no exception. That said, in specific situations where memory efficiency matters, it's definitely worth considering."

Lion's true value isn't "replacing AdamW" but rather providing a new option when you need to trade off memory and performance.

References

  1. Chen, X., et al. "Symbolic Discovery of Optimization Algorithms." arXiv:2302.06675 (2023)
  2. Loshchilov, I., & Hutter, F. "Decoupled Weight Decay Regularization." ICLR 2019
  3. Kingma, D. P., & Ba, J. "Adam: A Method for Stochastic Optimization." ICLR 2015
  4. You, Y., et al. "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes." ICLR 2020
  5. Shazeer, N., & Stern, M. "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost." ICML 2018