[Analyzing Nanochat] 6. Mastering LLM Training

Mastering LLM Training: From Data to Optimization

**Series**: Building an LLM with nanochat - Part 6/9

The Four Key Ingredients of Training

We built the model; now it's time to train it. What successful training needs:

1. Data: what should the model learn from?
2. Loss: how wrong are its predictions?
3. Optimizer: how do we improve the weights?
4. Schedule: when do we do what?

DataLoader: Efficient Batch Processing

Converting raw text into training batches:

  • import numpy as np
    import torch
    
    class DataLoader:
        def __init__(self, filename, batch_size, seq_len):
            # 1. Load binary tokenized data (memory-mapped, nothing is read eagerly)
            self.tokens = np.memmap(filename, dtype=np.uint16, mode='r')
    
            self.batch_size = batch_size
            self.seq_len = seq_len
    
            # 2. Calculate how many batches we can make
            # (each batch needs one extra token because targets are shifted by 1)
            tokens_per_batch = batch_size * seq_len
            self.num_batches = (len(self.tokens) - 1) // tokens_per_batch
    
        def __iter__(self):
            # 3. Yield batches
            for i in range(self.num_batches):
                start = i * self.batch_size * self.seq_len
                end = start + self.batch_size * self.seq_len + 1  # one extra token
    
                # Get chunk as int64 (PyTorch-friendly dtype)
                chunk = self.tokens[start:end]
                buf = torch.from_numpy(chunk.astype(np.int64))
    
                # Input and target (shifted by 1), both [batch_size, seq_len]
                x = buf[:-1].view(self.batch_size, self.seq_len)
                y = buf[1:].view(self.batch_size, self.seq_len)
    
                yield x, y

    Example:

  • # FineWeb 데이터: 7.4B 토큰
    # batch_size = 512, seq_len = 1024
    
    dataloader = DataLoader('fineweb_train.bin', 512, 1024)
    
    for x, y in dataloader:
        print(f"x shape: {x.shape}")  # [512, 1024]
        print(f"y shape: {y.shape}")  # [512, 1024]
    
        # x[0]: [15496, 995, 318, ...]  "Hello world is"
        # y[0]: [995, 318, 11, ...]     "world is ,"
        #       ↑ Shifted by 1!

    Why shift by 1?

  • Language modeling = Next token prediction
    
    Input:  "The cat sat"
    Target: "cat sat on"
             ↑ Predict next token at each position
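
    The same idea in code, as a minimal sketch with made-up token IDs (the exact values don't matter):

  • import torch
    
    # Made-up IDs standing in for "The cat sat on"
    tokens = torch.tensor([464, 3797, 3332, 319])
    
    x = tokens[:-1]   # "The cat sat"
    y = tokens[1:]    # "cat sat on"
    
    # At position i the model sees x[:i+1] and must predict y[i],
    # so a single sequence provides a training signal at every position.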

    Training Loop: Forward-Backward-Update

    Three core steps:

  • def train_step(model, x, y, optimizer):
        # 1. Forward: Compute loss
        logits, loss = model(x, y)
    
        # 2. Backward: Compute gradients
        loss.backward()
    
        # 3. Update: Apply gradients
        optimizer.step()
        optimizer.zero_grad()
    
        return loss.item()

    Step 1: Forward Pass

  • logits, loss = model(x, y)
    
    # logits shape: [batch_size, seq_len, vocab_size]
    #             = [512, 1024, 50304]
    
    # For each of 512*1024 = 524,288 positions:
    # - Model predicts probability distribution over 50,304 tokens
    # - Compare with ground truth
    # - Compute cross-entropy loss

    Cross-Entropy Loss:

  • import torch.nn.functional as F
    
    # Position i:
    logits_i = logits[0, i, :]    # [50304] raw scores (not yet probabilities)
    target_i = y[0, i]            # scalar (e.g., 995 for "world")
    
    # Loss at this position = negative log of the probability assigned to the target:
    probs = F.softmax(logits_i, dim=-1)
    loss_i = -probs[target_i].log()
    
    # Total loss = average over all positions:
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
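
    A quick self-contained check (random tensors standing in for real model outputs) that the per-position formula and F.cross_entropy agree:

  • import torch
    import torch.nn.functional as F
    
    torch.manual_seed(0)
    B, T, V = 2, 4, 10                    # tiny batch / sequence / vocab, for illustration
    logits = torch.randn(B, T, V)
    y = torch.randint(0, V, (B, T))
    
    # Per-position negative log-likelihood, averaged by hand
    log_probs = F.log_softmax(logits, dim=-1)
    manual = -log_probs.gather(-1, y.unsqueeze(-1)).mean()
    
    # Built-in cross-entropy over the flattened positions
    builtin = F.cross_entropy(logits.view(-1, V), y.view(-1))
    
    print(torch.allclose(manual, builtin))  # True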

    Step 2: Backward Pass

  • loss.backward()
    
    # PyTorch magic: Computes ∂loss/∂w for EVERY weight!
    
    # Example:
    for name, param in model.named_parameters():
        print(f"{name}: grad shape {param.grad.shape}")
    
    # Output:
    # token_embedding.weight: grad shape [50304, 1280]
    # blocks.0.attn.c_attn.weight: grad shape [1280, 3840]
    # ...
    # lm_head.weight: grad shape [50304, 1280]

    What the gradient means:

  • # If weight = 0.5, grad = 0.3:
    # → "If I increase weight by 0.01, loss increases by 0.003"
    # → To reduce loss, DECREASE weight!
    
    # Update rule:
    weight_new = weight_old - learning_rate * grad
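
    The same numbers, checked with autograd on a single toy weight (values chosen only for illustration):

  • import torch
    
    w = torch.tensor(0.5, requires_grad=True)
    loss = 0.3 * w        # a toy loss chosen so that d(loss)/dw = 0.3
    loss.backward()
    
    print(w.grad)         # tensor(0.3000)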

    Step 3: Optimizer Step

  • optimizer.step()
    
    # For each parameter:
    # param.data -= lr * param.grad
    
    # Example with lr=0.001:
    # weight: 0.5000
    # grad: 0.3000
    # new weight: 0.5 - 0.001 * 0.3 = 0.4997
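
    That update rule is exactly what plain SGD does; here is a tiny check (Muon and AdamW, used later, apply extra transformations on top of the raw gradient):

  • import torch
    
    w = torch.nn.Parameter(torch.tensor([0.5]))
    opt = torch.optim.SGD([w], lr=0.001)
    
    w.grad = torch.tensor([0.3])   # pretend backward() produced this gradient
    opt.step()
    
    print(w.item())   # ≈ 0.4997 = 0.5 - 0.001 * 0.3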

    Learning Rate Scheduling

    The problem with a fixed LR:

  • # LR too high:
    step 0: loss=4.5
    step 1: loss=2.0  ← Good!
    step 2: loss=5.3  ← Overshoot! ❌
    step 3: loss=1.8
    step 4: loss=6.1  ← Diverging! ❌
    
    # LR too low:
    step 0: loss=4.5
    step 100: loss=4.48  ← Too slow! ❌
    step 200: loss=4.46

    The solution: a 3-phase schedule!

    Phase 1: Warmup (0-10%)

  • def get_warmup_lr(step, warmup_steps, max_lr):
        return max_lr * (step + 1) / warmup_steps
    
    # Example: max_lr=6e-4, warmup_steps=100
    # step 0: lr = 6e-6
    # step 50: lr = 3e-4
    # step 100: lr = 6e-4

    Why warmup?

  • # Start of training:
    # - Weights are random
    # - Gradients are chaotic
    # - High LR → explosion! 💥
    
    # With warmup:
    # - Slowly increase LR
    # - Model stabilizes
    # - Then use full LR ✅

    Phase 2: Constant (10-80%)

  • # Main training phase
    lr = max_lr  # 6e-4
    
    # Model learns most here!

    Phase 3: Cosine Decay (80-100%)

  • import math
    
    def get_decay_lr(step, decay_start, max_steps, max_lr, min_lr):
        progress = (step - decay_start) / (max_steps - decay_start)
        cosine_decay = 0.5 * (1 + math.cos(math.pi * progress))
        return min_lr + (max_lr - min_lr) * cosine_decay
    
    # progress=0.0: lr = max_lr
    # progress=0.5: lr = (max_lr + min_lr) / 2
    # progress=1.0: lr = min_lr

    Complete schedule:

  • import math
    
    def get_lr(step, max_steps, max_lr=6e-4, min_lr=6e-5):
        warmup_steps = int(0.1 * max_steps)
        decay_start = int(0.8 * max_steps)
    
        if step < warmup_steps:
            # Warmup
            return max_lr * (step + 1) / warmup_steps
        elif step < decay_start:
            # Constant
            return max_lr
        else:
            # Cosine decay
            progress = (step - decay_start) / (max_steps - decay_start)
            cosine = 0.5 * (1 + math.cos(math.pi * progress))
            return min_lr + (max_lr - min_lr) * cosine
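
    A quick look at the resulting schedule for a hypothetical 1,000-step run, printing the LR at a few representative steps:

  • max_steps = 1000  # warmup ends at step 100, decay starts at step 800
    for step in (0, 99, 500, 900, 999):
        print(f"step {step:3d}: lr = {get_lr(step, max_steps):.2e}")
    
    # step   0: lr = 6.00e-06   (warmup starts near zero)
    # step  99: lr = 6.00e-04   (warmup reaches max_lr)
    # step 500: lr = 6.00e-04   (constant phase)
    # step 900: lr = 3.30e-04   (halfway through cosine decay)
    # step 999: lr = 6.00e-05   (ends at min_lr)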

    Gradient Accumulation: Simulating Large Batches

    The GPU memory problem:

  • # Want: batch_size = 2048
    # Reality: GPU OOM! (Out of Memory)
    
    # GPU can only fit batch_size = 512

    The solution: gradient accumulation!

  • # Accumulate gradients over 4 mini-batches
    grad_accum_steps = 4
    data_iter = iter(dataloader)   # DataLoader only defines __iter__, so wrap it once
    
    for step in range(num_steps):
        optimizer.zero_grad()
    
        # Accumulate
        for micro_step in range(grad_accum_steps):
            x, y = next(data_iter)
    
            # Forward
            logits, loss = model(x, y)
            loss = loss / grad_accum_steps  # Normalize!
    
            # Backward (gradients accumulate automatically)
            loss.backward()
    
        # Single update with accumulated gradients
        optimizer.step()

    Why normalize?

  • # Without normalization:
    grad = grad1 + grad2 + grad3 + grad4
    # This is 4x larger than normal!
    # Effective LR becomes 4x higher ❌
    
    # With normalization:
    grad = grad1/4 + grad2/4 + grad3/4 + grad4/4
    # Average gradient ✅
    # Effective LR stays the same
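
    A small self-contained check (toy linear model, mean-reduced MSE loss) that 4 normalized micro-batches produce the same gradient as one full batch:

  • import torch
    
    torch.manual_seed(0)
    w = torch.randn(4, 1, requires_grad=True)
    X, Y = torch.randn(8, 4), torch.randn(8, 1)
    
    # One full batch of 8 examples:
    ((X @ w - Y) ** 2).mean().backward()
    grad_full = w.grad.clone()
    
    # Same 8 examples as 4 micro-batches of 2, each loss divided by 4:
    w.grad = None
    for i in range(4):
        xb, yb = X[2*i:2*i + 2], Y[2*i:2*i + 2]
        (((xb @ w - yb) ** 2).mean() / 4).backward()
    
    print(torch.allclose(w.grad, grad_full, atol=1e-6))  # True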

    Memory savings:

  • # Normal (OOM):
    batch_size = 2048
    memory = 80 GB  ❌
    
    # Gradient accumulation:
    batch_size = 512 (4 times)
    memory = 20 GB per step ✅
    effective_batch = 512 * 4 = 2048

    Muon Optimizer: nanochat's Secret Weapon

    2.5x faster convergence than AdamW!

  • import torch
    
    class Muon(torch.optim.Optimizer):
        def __init__(self, params, lr=0.02, momentum=0.95):
            defaults = dict(lr=lr, momentum=momentum)
            super().__init__(params, defaults)
    
        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is None:
                        continue
    
                    grad = p.grad
                    state = self.state[p]
    
                    # 1. Momentum buffer
                    if 'momentum_buffer' not in state:
                        buf = state['momentum_buffer'] = torch.zeros_like(grad)
                    else:
                        buf = state['momentum_buffer']
    
                    # 2. Update momentum
                    buf.mul_(group['momentum']).add_(grad, alpha=1 - group['momentum'])
    
                    # 3. Orthogonalize (for 2D matrices only)
                    if buf.ndim == 2:
                        buf_orth = newton_schulz_orthogonalize(buf)
                    else:
                        buf_orth = buf
    
                    # 4. Aspect ratio scaling
                    if p.ndim == 2:
                        scale = max(1.0, p.size(0) / p.size(1)) ** 0.5
                    else:
                        scale = 1.0
    
                    # 5. Update parameter
                    p.add_(buf_orth, alpha=-group['lr'] * scale)

    Newton-Schulz Orthogonalization:

  • def newton_schulz_orthogonalize(G, steps=5):
        """
        Make matrix orthogonal: G @ G.T ≈ I
        """
        # Normalize
        X = G / (G.norm() + 1e-7)
    
        # Iterate
        for _ in range(steps):
            A = X @ X.T
            X = 1.5 * X - 0.5 * A @ X
    
        return X
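
    A quick way to see what the iteration does: it pushes the singular values of the normalized matrix toward 1, which is what "approximately orthogonal" means here. A small sketch on a random matrix:

  • import torch
    
    torch.manual_seed(0)
    G = torch.randn(64, 64)
    
    for steps in (1, 5, 30):
        X = newton_schulz_orthogonalize(G, steps=steps)
        s = torch.linalg.svdvals(X)
        print(f"steps={steps:2d}: singular values in [{s.min():.3f}, {s.max():.3f}]")
    
    # More steps → spectrum closer to 1 → X @ X.T closer to I.
    # Muon only needs a rough orthogonalization, so a few steps are enough in practice.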

    Why orthogonalization?

  • # Muon orthogonalizes the *update* (the momentum buffer), not the weights themselves.
    # This spreads the update more evenly across directions:
    # - Better gradient flow
    # - More stable training
    # - Faster convergence
    
    # Empirical results:
    AdamW: 1000 steps to loss=2.5
    Muon:  400 steps to loss=2.5  (2.5x faster!)

    Hybrid strategy:

  • # nanochat uses TWO optimizers.
    # Note: embeddings and lm_head are 2D matrices too, but they are trained
    # with AdamW, so they must be excluded from the Muon group.
    from torch.optim import AdamW
    
    def use_muon(name, p):
        return p.ndim == 2 and 'embedding' not in name and 'lm_head' not in name
    
    # Muon for hidden 2D matrices (attention, MLP):
    muon_params = [p for n, p in model.named_parameters() if use_muon(n, p)]
    muon = Muon(muon_params, lr=0.02)
    
    # AdamW for everything else (embeddings, lm_head, norms):
    other_params = [p for n, p in model.named_parameters() if not use_muon(n, p)]
    adamw = AdamW(other_params, lr=6e-4)
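
    An optional sanity check on the split (the exact counts depend on the model config from the earlier parts):

  • n_muon = sum(p.numel() for p in muon_params)
    n_other = sum(p.numel() for p in other_params)
    print(f"Muon: {n_muon/1e6:.1f}M params | AdamW: {n_other/1e6:.1f}M params")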

    Complete Training Script

    Putting it all together:

  • def train(model, dataloader, num_steps=4681):
        # Setup optimizers (same split as above: embeddings/lm_head go to AdamW, not Muon)
        muon_params = [p for n, p in model.named_parameters() if use_muon(n, p)]
        other_params = [p for n, p in model.named_parameters() if not use_muon(n, p)]
    
        muon = Muon(muon_params, lr=0.02, momentum=0.95)
        adamw = AdamW(other_params, lr=6e-4)
    
        # Training loop
        data_iter = iter(dataloader)
        for step in range(num_steps):
            # 1. Get learning rate
            lr_muon = get_lr(step, num_steps, max_lr=0.02)
            lr_adamw = get_lr(step, num_steps, max_lr=6e-4)
    
            for pg in muon.param_groups:
                pg['lr'] = lr_muon
            for pg in adamw.param_groups:
                pg['lr'] = lr_adamw
    
            # 2. Gradient accumulation
            muon.zero_grad()
            adamw.zero_grad()
    
            total_loss = 0
            for micro_step in range(3):  # grad_accum=3
                x, y = next(data_iter)
                x, y = x.cuda(), y.cuda()
    
                # Forward
                with torch.amp.autocast('cuda', dtype=torch.bfloat16):
                    logits, loss = model(x, y)
                    loss = loss / 3
    
                # Backward
                loss.backward()
                total_loss += loss.item()
    
            # 3. Optimizer step
            muon.step()
            adamw.step()
    
            # 4. Logging
            if step % 10 == 0:
                print(f"Step {step}/{num_steps} | Loss: {total_loss:.4f} | "
                      f"LR: {lr_muon:.6f}")
    
    # Run training
    model = GPT(config).cuda()
    dataloader = DataLoader('fineweb_train.bin', 512, 1024)
    train(model, dataloader)

    nanochat training results:

  • Step 0: Loss 10.8234 (random)
    Step 100: Loss 6.4521
    Step 500: Loss 4.2103
    Step 1000: Loss 3.3452
    Step 2000: Loss 2.8234
    Step 4000: Loss 2.4125
    Step 4681: Loss 2.3456 (final)
    
    Time: 4 hours on 8×H100
    Cost: ~$100

    Key Takeaways

    DataLoader = efficient batch generation
    - Binary format (fast loading)
    - Shifted targets (next token prediction)

    Training Loop = Forward → Backward → Update
    - Cross-entropy loss
    - Automatic differentiation
    - Gradient descent

    LR Schedule = Warmup → Constant → Decay
    - Stable start
    - Fast learning
    - Smooth convergence

    Gradient Accumulation = simulating large batches
    - Saves memory
    - Same training effect

    Muon = a faster optimizer
    - Orthogonalization
    - 2.5x faster convergence

    Next Steps

    What Part 7, "3x Faster Training", will cover:
    - Mixed precision (bfloat16)
    - Flash Attention
    - torch.compile
    - MFU calculation
    - Profiling

    Make training 3x faster with these optimization techniques! 🚀

    ---

    📘 Notes

    This post is based on Andrej Karpathy's open-source nanochat project and analyzes and explains its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.

    References:
    - [nanochat/execution.py](https://github.com/karpathy/nanochat/blob/main/nanochat/execution.py)
    - [Muon optimizer paper](https://arxiv.org/abs/2402.03432)

    태그: #Training #Optimizer #Muon #LearningRate #GradientAccumulation #nanochat