[Analyzing Nanochat] 6. Mastering LLM Training

Mastering LLM Training: From Data to Optimization

**Series**: Building an LLM with nanochat - Part 6/9

The Four Key Ingredients of Training

We built the model; now it's time to train it. What successful training needs:

1. Data: what should the model learn from?
2. Loss: how wrong are its predictions?
3. Optimizer: how do we improve the weights?
4. Schedule: when do we do what?

DataLoader: Efficient Batch Processing

Converting raw text into training batches:

  • import numpy as np
    import torch
    
    class DataLoader:
        def __init__(self, filename, batch_size, seq_len):
            # 1. Load binary tokenized data (memory-mapped, nothing is read eagerly)
            self.tokens = np.memmap(filename, dtype=np.uint16, mode='r')
    
            self.batch_size = batch_size
            self.seq_len = seq_len
    
            # 2. Calculate how many batches we can make
            # (each batch needs one extra token because targets are shifted by 1)
            tokens_per_batch = batch_size * seq_len
            self.num_batches = (len(self.tokens) - 1) // tokens_per_batch
    
        def __iter__(self):
            # 3. Yield batches
            for i in range(self.num_batches):
                start = i * self.batch_size * self.seq_len
                end = start + self.batch_size * self.seq_len + 1  # one extra token
    
                # Get chunk as int64 (PyTorch-friendly dtype)
                chunk = self.tokens[start:end]
                buf = torch.from_numpy(chunk.astype(np.int64))
    
                # Input and target (shifted by 1), both [batch_size, seq_len]
                x = buf[:-1].view(self.batch_size, self.seq_len)
                y = buf[1:].view(self.batch_size, self.seq_len)
    
                yield x, y

    Example:

  • # FineWeb 데이터: 7.4B 토큰
    # batch_size = 512, seq_len = 1024
    
    dataloader = DataLoader('fineweb_train.bin', 512, 1024)
    
    for x, y in dataloader:
        print(f"x shape: {x.shape}")  # [512, 1024]
        print(f"y shape: {y.shape}")  # [512, 1024]
    
        # x[0]: [15496, 995, 318, ...]  "Hello world is"
        # y[0]: [995, 318, 11, ...]     "world is ,"
        #       ↑ Shifted by 1!

    Why shift by 1?

  • Language modeling = Next token prediction
    
    Input:  "The cat sat"
    Target: "cat sat on"
             ↑ Predict next token at each position
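
    The same idea in code, as a minimal sketch with made-up token IDs (the exact values don't matter):

  • import torch
    
    # Made-up IDs standing in for "The cat sat on"
    tokens = torch.tensor([464, 3797, 3332, 319])
    
    x = tokens[:-1]   # "The cat sat"
    y = tokens[1:]    # "cat sat on"
    
    # At position i the model sees x[:i+1] and must predict y[i],
    # so a single sequence provides a training signal at every position.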

    Training Loop: Forward-Backward-Update

    Three core steps:

  • def train_step(model, x, y, optimizer):
        # 1. Forward: Compute loss
        logits, loss = model(x, y)
    
        # 2. Backward: Compute gradients
        loss.backward()
    
        # 3. Update: Apply gradients
        optimizer.step()
        optimizer.zero_grad()
    
        return loss.item()

    Step 1: Forward Pass

  • logits, loss = model(x, y)
    
    # logits shape: [batch_size, seq_len, vocab_size]
    #             = [512, 1024, 50304]
    
    # For each of 512*1024 = 524,288 positions:
    # - Model predicts probability distribution over 50,304 tokens
    # - Compare with ground truth
    # - Compute cross-entropy loss

    Cross-Entropy Loss:

  • import torch.nn.functional as F
    
    # Position i:
    logits_i = logits[0, i, :]    # [50304] raw scores (not yet probabilities)
    target_i = y[0, i]            # scalar (e.g., 995 for "world")
    
    # Loss at this position = negative log of the probability assigned to the target:
    probs = F.softmax(logits_i, dim=-1)
    loss_i = -probs[target_i].log()
    
    # Total loss = average over all positions:
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
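
    A quick self-contained check (random tensors standing in for real model outputs) that the per-position formula and F.cross_entropy agree:

  • import torch
    import torch.nn.functional as F
    
    torch.manual_seed(0)
    B, T, V = 2, 4, 10                    # tiny batch / sequence / vocab, for illustration
    logits = torch.randn(B, T, V)
    y = torch.randint(0, V, (B, T))
    
    # Per-position negative log-likelihood, averaged by hand
    log_probs = F.log_softmax(logits, dim=-1)
    manual = -log_probs.gather(-1, y.unsqueeze(-1)).mean()
    
    # Built-in cross-entropy over the flattened positions
    builtin = F.cross_entropy(logits.view(-1, V), y.view(-1))
    
    print(torch.allclose(manual, builtin))  # True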

    Step 2: Backward Pass

  • loss.backward()
    
    # PyTorch magic: Computes ∂loss/∂w for EVERY weight!
    
    # Example:
    for name, param in model.named_parameters():
        print(f"{name}: grad shape {param.grad.shape}")
    
    # Output:
    # token_embedding.weight: grad shape [50304, 1280]
    # blocks.0.attn.c_attn.weight: grad shape [1280, 3840]
    # ...
    # lm_head.weight: grad shape [50304, 1280]

    What the gradient means:

  • # If weight = 0.5, grad = 0.3:
    # → "If I increase weight by 0.01, loss increases by 0.003"
    # → To reduce loss, DECREASE weight!
    
    # Update rule:
    weight_new = weight_old - learning_rate * grad
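
    The same numbers, checked with autograd on a single toy weight (values chosen only for illustration):

  • import torch
    
    w = torch.tensor(0.5, requires_grad=True)
    loss = 0.3 * w        # a toy loss chosen so that d(loss)/dw = 0.3
    loss.backward()
    
    print(w.grad)         # tensor(0.3000)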

    Step 3: Optimizer Step

  • optimizer.step()
    
    # For each parameter:
    # param.data -= lr * param.grad
    
    # Example with lr=0.001:
    # weight: 0.5000
    # grad: 0.3000
    # new weight: 0.5 - 0.001 * 0.3 = 0.4997
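
    That update rule is exactly what plain SGD does; here is a tiny check (Muon and AdamW, used later, apply extra transformations on top of the raw gradient):

  • import torch
    
    w = torch.nn.Parameter(torch.tensor([0.5]))
    opt = torch.optim.SGD([w], lr=0.001)
    
    w.grad = torch.tensor([0.3])   # pretend backward() produced this gradient
    opt.step()
    
    print(w.item())   # ≈ 0.4997 = 0.5 - 0.001 * 0.3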

    Learning Rate Scheduling

    The problem with a fixed LR:

  • # LR too high:
    step 0: loss=4.5
    step 1: loss=2.0  ← Good!
    step 2: loss=5.3  ← Overshoot! ❌
    step 3: loss=1.8
    step 4: loss=6.1  ← Diverging! ❌
    
    # LR too low:
    step 0: loss=4.5
    step 100: loss=4.48  ← Too slow! ❌
    step 200: loss=4.46

    The solution: a 3-phase schedule!

    Phase 1: Warmup (0-10%)

  • def get_warmup_lr(step, warmup_steps, max_lr):
        return max_lr * (step + 1) / warmup_steps
    
    # Example: max_lr=6e-4, warmup_steps=100
    # step 0: lr = 6e-6
    # step 50: lr = 3e-4
    # step 100: lr = 6e-4

    Why warmup?

  • # Start of training:
    # - Weights are random
    # - Gradients are chaotic
    # - High LR → explosion! 💥
    
    # With warmup:
    # - Slowly increase LR
    # - Model stabilizes
    # - Then use full LR ✅

    Phase 2: Constant (10-80%)

  • # Main training phase
    lr = max_lr  # 6e-4
    
    # Model learns most here!

    Phase 3: Cosine Decay (80-100%)

  • import math
    
    def get_decay_lr(step, decay_start, max_steps, max_lr, min_lr):
        progress = (step - decay_start) / (max_steps - decay_start)
        cosine_decay = 0.5 * (1 + math.cos(math.pi * progress))
        return min_lr + (max_lr - min_lr) * cosine_decay
    
    # progress=0.0: lr = max_lr
    # progress=0.5: lr = (max_lr + min_lr) / 2
    # progress=1.0: lr = min_lr

    Complete schedule:

  • import math
    
    def get_lr(step, max_steps, max_lr=6e-4, min_lr=6e-5):
        warmup_steps = int(0.1 * max_steps)
        decay_start = int(0.8 * max_steps)
    
        if step < warmup_steps:
            # Warmup
            return max_lr * (step + 1) / warmup_steps
        elif step < decay_start:
            # Constant
            return max_lr
        else:
            # Cosine decay
            progress = (step - decay_start) / (max_steps - decay_start)
            cosine = 0.5 * (1 + math.cos(math.pi * progress))
            return min_lr + (max_lr - min_lr) * cosine
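
    A quick look at the resulting schedule for a hypothetical 1,000-step run, printing the LR at a few representative steps:

  • max_steps = 1000  # warmup ends at step 100, decay starts at step 800
    for step in (0, 99, 500, 900, 999):
        print(f"step {step:3d}: lr = {get_lr(step, max_steps):.2e}")
    
    # step   0: lr = 6.00e-06   (warmup starts near zero)
    # step  99: lr = 6.00e-04   (warmup reaches max_lr)
    # step 500: lr = 6.00e-04   (constant phase)
    # step 900: lr = 3.30e-04   (halfway through cosine decay)
    # step 999: lr = 6.00e-05   (ends at min_lr)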

    Gradient Accumulation: Simulating Large Batches

    The GPU memory problem:

  • # Want: batch_size = 2048
    # Reality: GPU OOM! (Out of Memory)
    
    # GPU can only fit batch_size = 512

    The solution: gradient accumulation!

  • # Accumulate gradients over 4 mini-batches
    grad_accum_steps = 4
    data_iter = iter(dataloader)   # DataLoader only defines __iter__, so wrap it once
    
    for step in range(num_steps):
        optimizer.zero_grad()
    
        # Accumulate
        for micro_step in range(grad_accum_steps):
            x, y = next(data_iter)
    
            # Forward
            logits, loss = model(x, y)
            loss = loss / grad_accum_steps  # Normalize!
    
            # Backward (gradients accumulate automatically)
            loss.backward()
    
        # Single update with accumulated gradients
        optimizer.step()

    Why normalize?

  • # Without normalization:
    grad = grad1 + grad2 + grad3 + grad4
    # This is 4x larger than normal!
    # Effective LR becomes 4x higher ❌
    
    # With normalization:
    grad = grad1/4 + grad2/4 + grad3/4 + grad4/4
    # Average gradient ✅
    # Effective LR stays the same
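
    A small self-contained check (toy linear model, mean-reduced MSE loss) that 4 normalized micro-batches produce the same gradient as one full batch:

  • import torch
    
    torch.manual_seed(0)
    w = torch.randn(4, 1, requires_grad=True)
    X, Y = torch.randn(8, 4), torch.randn(8, 1)
    
    # One full batch of 8 examples:
    ((X @ w - Y) ** 2).mean().backward()
    grad_full = w.grad.clone()
    
    # Same 8 examples as 4 micro-batches of 2, each loss divided by 4:
    w.grad = None
    for i in range(4):
        xb, yb = X[2*i:2*i + 2], Y[2*i:2*i + 2]
        (((xb @ w - yb) ** 2).mean() / 4).backward()
    
    print(torch.allclose(w.grad, grad_full, atol=1e-6))  # True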

    Memory savings:

  • # Normal (OOM):
    batch_size = 2048
    memory = 80 GB  ❌
    
    # Gradient accumulation:
    batch_size = 512 (4 times)
    memory = 20 GB per step ✅
    effective_batch = 512 * 4 = 2048

    Muon Optimizer: nanochat's Secret Weapon

    2.5x faster convergence than AdamW!

  • import torch
    
    class Muon(torch.optim.Optimizer):
        def __init__(self, params, lr=0.02, momentum=0.95):
            defaults = dict(lr=lr, momentum=momentum)
            super().__init__(params, defaults)
    
        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is None:
                        continue
    
                    grad = p.grad
                    state = self.state[p]
    
                    # 1. Momentum buffer
                    if 'momentum_buffer' not in state:
                        buf = state['momentum_buffer'] = torch.zeros_like(grad)
                    else:
                        buf = state['momentum_buffer']
    
                    # 2. Update momentum
                    buf.mul_(group['momentum']).add_(grad, alpha=1 - group['momentum'])
    
                    # 3. Orthogonalize (for 2D matrices only)
                    if buf.ndim == 2:
                        buf_orth = newton_schulz_orthogonalize(buf)
                    else:
                        buf_orth = buf
    
                    # 4. Aspect ratio scaling
                    if p.ndim == 2:
                        scale = max(1.0, p.size(0) / p.size(1)) ** 0.5
                    else:
                        scale = 1.0
    
                    # 5. Update parameter
                    p.add_(buf_orth, alpha=-group['lr'] * scale)

    Newton-Schulz Orthogonalization:

  • def newton_schulz_orthogonalize(G, steps=5):
        """
        Make matrix orthogonal: G @ G.T ≈ I
        """
        # Normalize
        X = G / (G.norm() + 1e-7)
    
        # Iterate
        for _ in range(steps):
            A = X @ X.T
            X = 1.5 * X - 0.5 * A @ X
    
        return X
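
    A quick way to see what the iteration does: it pushes the singular values of the normalized matrix toward 1, which is what "approximately orthogonal" means here. A small sketch on a random matrix:

  • import torch
    
    torch.manual_seed(0)
    G = torch.randn(64, 64)
    
    for steps in (1, 5, 30):
        X = newton_schulz_orthogonalize(G, steps=steps)
        s = torch.linalg.svdvals(X)
        print(f"steps={steps:2d}: singular values in [{s.min():.3f}, {s.max():.3f}]")
    
    # More steps → spectrum closer to 1 → X @ X.T closer to I.
    # Muon only needs a rough orthogonalization, so a few steps are enough in practice.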

    Why orthogonalization?

  • # Muon orthogonalizes the *update* (the momentum buffer), not the weights themselves.
    # This spreads the update more evenly across directions:
    # - Better gradient flow
    # - More stable training
    # - Faster convergence
    
    # Empirical results:
    AdamW: 1000 steps to loss=2.5
    Muon:  400 steps to loss=2.5  (2.5x faster!)

    Hybrid strategy:

  • # nanochat uses TWO optimizers.
    # Note: embeddings and lm_head are 2D matrices too, but they are trained
    # with AdamW, so they must be excluded from the Muon group.
    from torch.optim import AdamW
    
    def use_muon(name, p):
        return p.ndim == 2 and 'embedding' not in name and 'lm_head' not in name
    
    # Muon for hidden 2D matrices (attention, MLP):
    muon_params = [p for n, p in model.named_parameters() if use_muon(n, p)]
    muon = Muon(muon_params, lr=0.02)
    
    # AdamW for everything else (embeddings, lm_head, norms):
    other_params = [p for n, p in model.named_parameters() if not use_muon(n, p)]
    adamw = AdamW(other_params, lr=6e-4)
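
    An optional sanity check on the split (the exact counts depend on the model config from the earlier parts):

  • n_muon = sum(p.numel() for p in muon_params)
    n_other = sum(p.numel() for p in other_params)
    print(f"Muon: {n_muon/1e6:.1f}M params | AdamW: {n_other/1e6:.1f}M params")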

    Complete Training Script

    Putting it all together:

  • def train(model, dataloader, num_steps=4681):
        # Setup optimizers (same split as above: embeddings/lm_head go to AdamW, not Muon)
        muon_params = [p for n, p in model.named_parameters() if use_muon(n, p)]
        other_params = [p for n, p in model.named_parameters() if not use_muon(n, p)]
    
        muon = Muon(muon_params, lr=0.02, momentum=0.95)
        adamw = AdamW(other_params, lr=6e-4)
    
        # Training loop
        data_iter = iter(dataloader)
        for step in range(num_steps):
            # 1. Get learning rate
            lr_muon = get_lr(step, num_steps, max_lr=0.02)
            lr_adamw = get_lr(step, num_steps, max_lr=6e-4)
    
            for pg in muon.param_groups:
                pg['lr'] = lr_muon
            for pg in adamw.param_groups:
                pg['lr'] = lr_adamw
    
            # 2. Gradient accumulation
            muon.zero_grad()
            adamw.zero_grad()
    
            total_loss = 0
            for micro_step in range(3):  # grad_accum=3
                x, y = next(data_iter)
                x, y = x.cuda(), y.cuda()
    
                # Forward
                with torch.amp.autocast('cuda', dtype=torch.bfloat16):
                    logits, loss = model(x, y)
                    loss = loss / 3
    
                # Backward
                loss.backward()
                total_loss += loss.item()
    
            # 3. Optimizer step
            muon.step()
            adamw.step()
    
            # 4. Logging
            if step % 10 == 0:
                print(f"Step {step}/{num_steps} | Loss: {total_loss:.4f} | "
                      f"LR: {lr_muon:.6f}")
    
    # Run training
    model = GPT(config).cuda()
    dataloader = DataLoader('fineweb_train.bin', 512, 1024)
    train(model, dataloader)

    nanochat training results:

  • Step 0: Loss 10.8234 (random)
    Step 100: Loss 6.4521
    Step 500: Loss 4.2103
    Step 1000: Loss 3.3452
    Step 2000: Loss 2.8234
    Step 4000: Loss 2.4125
    Step 4681: Loss 2.3456 (final)
    
    Time: 4 hours on 8×H100
    Cost: ~$100

    Key Takeaways

    DataLoader = efficient batch generation
    - Binary format (fast loading)
    - Shifted targets (next token prediction)

    Training Loop = Forward → Backward → Update
    - Cross-entropy loss
    - Automatic differentiation
    - Gradient descent

    LR Schedule = Warmup → Constant → Decay
    - Stable start
    - Fast learning
    - Smooth convergence

    Gradient Accumulation = simulating large batches
    - Saves memory
    - Same training effect

    Muon = a faster optimizer
    - Orthogonalization
    - 2.5x faster convergence

    Next Steps

    What Part 7, "3x Faster Training", will cover:
    - Mixed precision (bfloat16)
    - Flash Attention
    - torch.compile
    - MFU calculation
    - Profiling

    Make training 3x faster with these optimization techniques! 🚀

    ---

    📘 Notes

    This post is based on Andrej Karpathy's open-source nanochat project and analyzes and explains its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.

    References:
    - [nanochat/execution.py](https://github.com/karpathy/nanochat/blob/main/nanochat/execution.py)
    - [Muon optimizer paper](https://arxiv.org/abs/2402.03432)

    태그: #Training #Optimizer #Muon #LearningRate #GradientAccumulation #nanochat