[Analyzing Nanochat] 6. Mastering LLM Training
Mastering LLM Training: From Data to Optimization
**Series**: Building an LLM with nanochat - Part 6/9
The 4 Key Ingredients of Training¶
We have built the model; now it is time to train it. Successful training needs:
1. Data: what should the model learn from?
2. Loss: how wrong is the model?
3. Optimizer: how do we improve it?
4. Schedule: when do we do what?
DataLoader: Efficient Batching¶
Turning raw tokenized text into training batches:
import numpy as np
import torch

class DataLoader:
    def __init__(self, filename, batch_size, seq_len):
        # 1. Load binary tokenized data (memory-mapped, so it is not read into RAM up front)
        self.tokens = np.memmap(filename, dtype=np.uint16, mode='r')
        self.batch_size = batch_size
        self.seq_len = seq_len
        # 2. Calculate how many batches we can make
        #    (each row needs seq_len + 1 tokens so the targets can be shifted by one)
        self.tokens_per_batch = batch_size * (seq_len + 1)
        self.num_batches = len(self.tokens) // self.tokens_per_batch
    def __iter__(self):
        # 3. Yield batches
        for i in range(self.num_batches):
            start = i * self.tokens_per_batch
            end = start + self.tokens_per_batch
            # Get chunk and reshape
            chunk = self.tokens[start:end]
            batch = torch.from_numpy(chunk.astype(np.int64))
            batch = batch.view(self.batch_size, self.seq_len + 1)
            # Input and target (shifted by 1)
            x = batch[:, :-1]   # [batch_size, seq_len]
            y = batch[:, 1:]    # [batch_size, seq_len]
            yield x, y
Example:
# FineWeb data: 7.4B tokens
# batch_size = 512, seq_len = 1024
dataloader = DataLoader('fineweb_train.bin', 512, 1024)
for x, y in dataloader:
    print(f"x shape: {x.shape}")  # [512, 1024]
    print(f"y shape: {y.shape}")  # [512, 1024]
    # x[0]: [15496, 995, 318, ...]  "Hello world is"
    # y[0]: [995, 318, 11, ...]     "world is ,"
    #       ↑ Shifted by 1!
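If you don't have the FineWeb shard on disk, here is a minimal sketch for running the loop above end to end; the toy_train.bin filename and the random token ids are made up purely for illustration:
import numpy as np

toy_tokens = np.random.randint(0, 50304, size=1_000_000).astype(np.uint16)  # fake token ids
toy_tokens.tofile('toy_train.bin')          # flat uint16 file, exactly the layout np.memmap expects

loader = DataLoader('toy_train.bin', batch_size=8, seq_len=128)
x, y = next(iter(loader))
print(x.shape, y.shape)                     # torch.Size([8, 128]) torch.Size([8, 128])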
Why shift by 1?
Language modeling = next token prediction
Input:  "The cat sat"
Target: "cat sat on"
         ↑ Predict the next token at each position
Training Loop: Forward-Backward-Update¶
The three core steps:
def train_step(model, x, y, optimizer):
    # 1. Forward: Compute loss
    logits, loss = model(x, y)
    # 2. Backward: Compute gradients
    loss.backward()
    # 3. Update: Apply gradients
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
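To see this loop work end to end on a CPU, here is a toy sketch. TinyLM, its sizes, and the random batch are all made up for illustration; it only mimics the (logits, loss) interface of the real GPT model:
import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, x, y):
        logits = self.head(self.emb(x))                                    # [B, T, vocab]
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        return logits, loss

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
x = torch.randint(0, 100, (4, 16))
y = torch.randint(0, 100, (4, 16))
for _ in range(20):
    loss = train_step(model, x, y, optimizer)
print(loss)   # below the initial ~4.6 (= ln 100): the model is memorizing this one batch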
Step 1: Forward Pass¶
logits, loss = model(x, y)
# logits shape: [batch_size, seq_len, vocab_size]
#             = [512, 1024, 50304]
# For each of 512*1024 = 524,288 positions:
# - Model predicts probability distribution over 50,304 tokens
# - Compare with ground truth
# - Compute the cross-entropy loss
Cross-Entropy Loss:
# Position i:
probs = softmax(logits[0, i, :])  # [50304] probabilities over the vocab
target = y[0, i]                  # scalar token id (e.g., 995 for " world")
# Loss at this position:
loss_i = -log(probs[target])
# Total loss = average over all positions
loss = mean(loss_i for all i)
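As a sanity check, here is a tiny self-contained snippet (shapes are made up) showing that PyTorch's F.cross_entropy computes exactly this mean of -log p(correct token) over all positions:
import torch
import torch.nn.functional as F

B, T, V = 2, 4, 10                          # tiny batch, sequence length, vocab
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Library call: flatten the B*T positions and average
loss_lib = F.cross_entropy(logits.view(-1, V), targets.view(-1))

# Manual: -log(softmax probability of the correct token) at every position, then mean
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(loss_lib.item(), loss_manual.item())  # identical (up to floating point)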
Step 2: Backward Pass¶
loss.backward()
# PyTorch magic: Computes ∂loss/∂w for EVERY weight!
# Example:
for name, param in model.named_parameters():
    print(f"{name}: grad shape {param.grad.shape}")
# Output:
# token_embedding.weight: grad shape [50304, 1280]
# blocks.0.attn.c_attn.weight: grad shape [1280, 3840]
# ...
# lm_head.weight: grad shape [50304, 1280]
What the gradient means:
# If weight = 0.5, grad = 0.3:
# → "If I increase weight by 0.01, loss increases by 0.003"
# → To reduce loss, DECREASE weight!
# Update rule:
weight_new = weight_old - learning_rate * grad
Step 3: Optimizer Step¶
optimizer.step()
# For each parameter:
# param.data -= lr * param.grad
# Example with lr=0.001:
# weight: 0.5000
# grad: 0.3000
# new weight: 0.5 - 0.001 * 0.3 = 0.4997
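The same arithmetic done by autograd on a single made-up weight; the loss 0.3 * w is contrived so that the gradient is exactly 0.3, matching the numbers above:
import torch

w = torch.tensor(0.5, requires_grad=True)
loss = 0.3 * w                       # d(loss)/dw = 0.3
loss.backward()
print(w.grad)                        # tensor(0.3000)

lr = 0.001
with torch.no_grad():
    w -= lr * w.grad                 # the same rule optimizer.step() applies to every parameter
print(w)                             # tensor(0.4997, requires_grad=True)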
Learning Rate Scheduling¶
The problem with a fixed LR:
# LR too high:
step 0: loss=4.5
step 1: loss=2.0  ← Good!
step 2: loss=5.3  ← Overshoot! ❌
step 3: loss=1.8
step 4: loss=6.1  ← Diverging! ❌
# LR too low:
step 0: loss=4.5
step 100: loss=4.48  ← Too slow! ❌
step 200: loss=4.46
The solution: a 3-phase schedule!
Phase 1: Warmup (0-10%)¶
def get_warmup_lr(step, warmup_steps, max_lr):
    return max_lr * (step + 1) / warmup_steps
# Example: max_lr=6e-4, warmup_steps=100
# step 0: lr = 6e-6
# step 50: lr = 3e-4
# step 99: lr = 6e-4 (warmup complete)
Why warmup?
# Start of training:
# - Weights are random
# - Gradients are chaotic
# - High LR → explosion! 💥
# With warmup:
# - Slowly increase LR
# - Model stabilizes
# - Then use full LR ✅
Phase 2: Constant (10-80%)¶
# Main training phase
lr = max_lr  # 6e-4
# Model learns most here!
Phase 3: Cosine Decay (80-100%)¶
import math

def get_decay_lr(step, decay_start, max_steps, max_lr, min_lr):
    progress = (step - decay_start) / (max_steps - decay_start)
    cosine_decay = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine_decay
# progress=0.0: lr = max_lr
# progress=0.5: lr = (max_lr + min_lr) / 2
# progress=1.0: lr = min_lr
The complete schedule:
def get_lr(step, max_steps, max_lr=6e-4, min_lr=6e-5):
    warmup_steps = int(0.1 * max_steps)
    decay_start = int(0.8 * max_steps)
    if step < warmup_steps:
        # Warmup
        return max_lr * (step + 1) / warmup_steps
    elif step < decay_start:
        # Constant
        return max_lr
    else:
        # Cosine decay
        progress = (step - decay_start) / (max_steps - decay_start)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return min_lr + (max_lr - min_lr) * cosine
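To see the three phases, we can sample the schedule at a few steps (max_steps=1000 is just an illustrative number):
max_steps = 1000
for step in [0, 50, 99, 500, 900, 999]:
    print(f"step {step:4d}: lr = {get_lr(step, max_steps):.2e}")
# step    0: lr = 6.00e-06   (warmup start)
# step   50: lr = 3.06e-04   (warming up)
# step   99: lr = 6.00e-04   (warmup done)
# step  500: lr = 6.00e-04   (constant phase)
# step  900: lr = 3.30e-04   (halfway through the cosine decay)
# step  999: lr = 6.00e-05   (≈ min_lr)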
Gradient Accumulation: Simulating a Large Batch¶
The GPU memory problem:
# Want: batch_size = 2048
# Reality: GPU OOM! (Out of Memory)
# GPU can only fit batch_size = 512
The solution: gradient accumulation!
# Accumulate gradients over 4 mini-batches
grad_accum_steps = 4
data_iter = iter(dataloader)  # the DataLoader above only defines __iter__, so make an iterator once
for step in range(num_steps):
    optimizer.zero_grad()
    # Accumulate
    for micro_step in range(grad_accum_steps):
        x, y = next(data_iter)
        # Forward
        logits, loss = model(x, y)
        loss = loss / grad_accum_steps  # Normalize!
        # Backward (gradients accumulate automatically)
        loss.backward()
    # Single update with accumulated gradients
    optimizer.step()
Why normalize?
# Without normalization:
grad = grad1 + grad2 + grad3 + grad4
# This is 4x larger than normal!
# Effective LR becomes 4x higher ❌
# With normalization:
grad = grad1/4 + grad2/4 + grad3/4 + grad4/4
# Average gradient ✅
# Effective LR stays the same
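A tiny check of this claim on a made-up linear model: accumulating the four normalized micro-batch gradients reproduces the gradient of one big batch.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_a = torch.nn.Linear(16, 1)
model_b = torch.nn.Linear(16, 1)
model_b.load_state_dict(model_a.state_dict())      # identical weights

big_x, big_y = torch.randn(32, 16), torch.randn(32, 1)

# (a) one big batch of 32
F.mse_loss(model_a(big_x), big_y).backward()

# (b) 4 micro-batches of 8, each loss divided by 4; .grad accumulates across backward() calls
for i in range(4):
    x, y = big_x[i*8:(i+1)*8], big_y[i*8:(i+1)*8]
    (F.mse_loss(model_b(x), y) / 4).backward()

print(torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6))  # True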
Memory savings:
# Normal (OOM):
batch_size = 2048
memory = 80 GB  ❌
# Gradient accumulation:
batch_size = 512 (4 times)
memory = 20 GB per step ✅
effective_batch = 512 * 4 = 2048
Muon Optimizer: nanochat's Secret Weapon¶
2.5x faster convergence than AdamW!
class Muon(torch.optim.Optimizer):
    def __init__(self, params, lr=0.02, momentum=0.95):
        defaults = dict(lr=lr, momentum=momentum)
        super().__init__(params, defaults)
    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                # 1. Momentum buffer
                if 'momentum_buffer' not in state:
                    buf = state['momentum_buffer'] = torch.zeros_like(grad)
                else:
                    buf = state['momentum_buffer']
                # 2. Update momentum
                buf.mul_(group['momentum']).add_(grad, alpha=1 - group['momentum'])
                # 3. Orthogonalize (for 2D matrices only)
                if buf.ndim == 2:
                    buf_orth = newton_schulz_orthogonalize(buf)
                else:
                    buf_orth = buf
                # 4. Aspect ratio scaling
                if p.ndim == 2:
                    scale = max(1.0, p.size(0) / p.size(1)) ** 0.5
                else:
                    scale = 1.0
                # 5. Update parameter
                p.add_(buf_orth, alpha=-group['lr'] * scale)
Newton-Schulz Orthogonalization:
def newton_schulz_orthogonalize(G, steps=5):
    """
    Make matrix orthogonal: G @ G.T ≈ I
    """
    # Normalize
    X = G / (G.norm() + 1e-7)
    # Iterate
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X
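A quick look at what the iteration actually does (the matrix below is constructed with known singular values purely for illustration; nanochat's real implementation uses a tuned quintic variant of this iteration, but the effect is the same): every singular value gets pushed toward 1, so the update becomes approximately orthogonal while keeping its direction structure.
import torch

torch.manual_seed(0)
# Build a 32x32 matrix with known singular values spread between 0.1 and 1.0
U, _ = torch.linalg.qr(torch.randn(32, 32))
V, _ = torch.linalg.qr(torch.randn(32, 32))
G = U @ torch.diag(torch.linspace(0.1, 1.0, 32)) @ V.T

X = newton_schulz_orthogonalize(G, steps=15)
print(torch.linalg.svdvals(G).min().item(), torch.linalg.svdvals(G).max().item())  # ~0.1, ~1.0
print(torch.linalg.svdvals(X).min().item(), torch.linalg.svdvals(X).max().item())  # both ≈ 1.0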
Why orthogonalization?
# Neural net weights work better when orthogonal:
# - Better gradient flow
# - More stable training
# - Faster convergence
# Empirical results:
AdamW: 1000 steps to loss=2.5
Muon:  400 steps to loss=2.5  (2.5x faster!)
Hybrid strategy:
# nanochat uses TWO optimizers:
from torch.optim import AdamW

# Muon for the 2D weight matrices (attention, MLP):
muon_params = [p for p in model.parameters() if p.ndim == 2]
muon = Muon(muon_params, lr=0.02)
# AdamW for everything that is not a 2D matrix:
# (simplification: the real nanochat also keeps the embedding and lm_head matrices in AdamW,
#  even though they are 2D)
other_params = [p for p in model.parameters() if p.ndim != 2]
adamw = AdamW(other_params, lr=6e-4)
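A quick sanity check of the split (illustrative only; the exact counts depend on the model config and on which matrices you route to Muon):
n_muon = sum(p.numel() for p in muon_params)
n_adamw = sum(p.numel() for p in other_params)
print(f"Muon:  {len(muon_params):3d} tensors, {n_muon/1e6:6.1f}M params")
print(f"AdamW: {len(other_params):3d} tensors, {n_adamw/1e6:6.1f}M params")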
Complete Training Script¶
Putting it all together:
def train(model, dataloader, num_steps=4681):
    # Setup optimizers
    muon_params = [p for p in model.parameters() if p.ndim == 2]
    other_params = [p for p in model.parameters() if p.ndim != 2]
    muon = Muon(muon_params, lr=0.02, momentum=0.95)
    adamw = AdamW(other_params, lr=6e-4)
    data_iter = iter(dataloader)  # the DataLoader only defines __iter__, so make an iterator once
    # Training loop
    for step in range(num_steps):
        # 1. Get learning rate
        lr_muon = get_lr(step, num_steps, max_lr=0.02, min_lr=0.002)  # keep the 10:1 max/min ratio
        lr_adamw = get_lr(step, num_steps, max_lr=6e-4)
        for pg in muon.param_groups:
            pg['lr'] = lr_muon
        for pg in adamw.param_groups:
            pg['lr'] = lr_adamw
        # 2. Gradient accumulation
        muon.zero_grad()
        adamw.zero_grad()
        total_loss = 0
        for micro_step in range(3):  # grad_accum=3
            x, y = next(data_iter)
            x, y = x.cuda(), y.cuda()
            # Forward
            with torch.amp.autocast('cuda', dtype=torch.bfloat16):
                logits, loss = model(x, y)
                loss = loss / 3
            # Backward
            loss.backward()
            total_loss += loss.item()
        # 3. Optimizer step
        muon.step()
        adamw.step()
        # 4. Logging
        if step % 10 == 0:
            print(f"Step {step}/{num_steps} | Loss: {total_loss:.4f} | "
                  f"LR: {lr_muon:.6f}")
# Run training
model = GPT(config).cuda()
dataloader = DataLoader('fineweb_train.bin', 512, 1024)
train(model, dataloader)
nanochat training results:
Step 0: Loss 10.8234 (random)
Step 100: Loss 6.4521
Step 500: Loss 4.2103
Step 1000: Loss 3.3452
Step 2000: Loss 2.8234
Step 4000: Loss 2.4125
Step 4681: Loss 2.3456 (final)
Time: 4 hours on 8×H100
Cost: ~$100
Key Takeaways¶
✅ DataLoader = efficient batch construction
- Binary format (fast loading)
- Shifted targets (next token prediction)
✅ Training Loop = Forward → Backward → Update
- Cross-entropy loss
- Automatic differentiation
- Gradient descent
✅ LR Schedule = Warmup → Constant → Decay
- Stable start
- Fast learning
- Smooth convergence
✅ Gradient Accumulation = simulating a large batch
- Saves memory
- Same effect as a big batch
✅ Muon = a faster optimizer
- Orthogonalized updates
- 2.5x faster convergence
Next Steps¶
Part 7, "3x Faster Training", will cover:
- Mixed precision (bfloat16)
- Flash Attention
- torch.compile
- MFU 계산
- Profiling
These optimization techniques will make training 3x faster! 🚀
---
📘 Note¶
This post is based on Andrej Karpathy's open-source nanochat project and was written to analyze and explain its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.
References:
- [nanochat/execution.py](https://github.com/karpathy/nanochat/blob/main/nanochat/execution.py)
- [Muon optimizer paper](https://arxiv.org/abs/2402.03432)
Tags: #Training #Optimizer #Muon #LearningRate #GradientAccumulation #nanochat