Diffusion LLM Part 4: LLaDA 2.0 -> 2.1 -- Breaking 100B with MoE + Token Editing
MoE scaling, Token Editing (T2T+M2T), S-Mode/Q-Mode, RL Framework -- how LLaDA 2.X makes diffusion LLMs practical.

In Part 3, LLaDA showed that diffusion LLMs are viable by scaling Masked Diffusion into the 8B-parameter range. But practical challenges remained: inference speed lagged far behind AR models, and alignment training such as RLHF was absent.
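To make the decoding difference concrete, here is a toy sketch (not LLaDA's actual sampler; the "model" just picks random tokens, and names like tokens_per_step are illustrative assumptions). It contrasts AR decoding, which spends one forward pass per token, with masked-diffusion decoding, which starts from a fully masked sequence and unmasks several positions per pass.

```python
# Toy comparison of decoding step counts: AR vs. masked diffusion.
# Illustrative only -- the "model" fills positions at random.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
SEQ_LEN = 16
MASK = "<mask>"


def ar_decode(seq_len: int) -> tuple[list[str], int]:
    """Autoregressive decoding: one token (one forward pass) per step."""
    tokens, steps = [], 0
    for _ in range(seq_len):
        tokens.append(random.choice(VOCAB))  # stand-in for argmax over logits
        steps += 1
    return tokens, steps


def masked_diffusion_decode(seq_len: int, tokens_per_step: int = 4) -> tuple[list[str], int]:
    """Masked-diffusion decoding: start fully masked and unmask
    several positions per forward pass."""
    tokens = [MASK] * seq_len
    steps = 0
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Unmask k positions per pass (a real sampler would pick the
        # positions the model is most confident about).
        for i in random.sample(masked, min(tokens_per_step, len(masked))):
            tokens[i] = random.choice(VOCAB)
        steps += 1
    return tokens, steps


if __name__ == "__main__":
    _, ar_steps = ar_decode(SEQ_LEN)
    _, diff_steps = masked_diffusion_decode(SEQ_LEN)
    print(f"AR forward passes: {ar_steps}, diffusion forward passes: {diff_steps}")
```

Fewer passes does not automatically mean faster wall-clock decoding: each diffusion pass attends bidirectionally over the whole sequence and cannot reuse a KV cache the way AR decoding does, which is a large part of why LLaDA-style models initially lagged AR models in practice.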
In November 2025, Ant Group's InclusionAI began closing this gap with LLaDA 2.0. Then in February 2026, LLaDA 2.1 redefined the speed-quality tradeoff with an innovation called Token Editing.
This post covers the scaling journey from 8B to 100B, the move to an MoE architecture, and how Token Editing works under the hood.
LLaDA 2.0: The Leap to 100B