[Analyzing Nanochat] 5. Completing the GPT Model

Oct 18, 2025

Read time: 2 minutes

Assembling the Complete GPT Model

**Series**: Building LLMs with nanochat - Part 5/9

Completing the Transformer Block

What we have covered so far:
- ✅ Tokenization
- ✅ Self-Attention (Q, K, V)
- ✅ RoPE, MQA, Flash Attention

Now it's time to assemble the remaining pieces!

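Feed-Forward Network: Processing Information

If attention is "information gathering," the FFN is "information processing." Attention mixes information across positions; the FFN then transforms each position independently.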

import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        # 4x expansion → compression
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=False)
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=False)

    def forward(self, x):
        # Expand
        x = self.c_fc(x)  # (B, T, 4*n_embd)

        # ReLU² activation
        x = F.relu(x).square()  # (B, T, 4*n_embd)

        # Contract
        x = self.c_proj(x)  # (B, T, n_embd)

        return x

Why 4x expansion?

n_embd = 1280

# FFN structure:
1280 → 5120 → 1280
narrow → wide → narrow

# Wide hidden layer = more capacity!

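This expand-then-contract shape is often called an "inverted bottleneck" (a classic bottleneck narrows in the middle instead):

Think of it like:
entrance (narrow) → room (wide) → exit (narrow)

Narrow entrance: compact representation
Wide room: plenty of space for rich transformations
Narrow exit: compressed back down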
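
To make the widths concrete, here is a small sketch in plain Python (no PyTorch needed) that lists the two weight-matrix shapes of the FFN and counts their entries. The helper name `ffn_shapes` is hypothetical, and it assumes bias-free linear layers as in the `FeedForward` class above.

```python
def ffn_shapes(n_embd, expansion=4):
    """Return (c_fc shape, c_proj shape, total weights) for a bias-free FFN.

    Hypothetical helper for illustration; shapes are (in_features, out_features).
    """
    hidden = expansion * n_embd
    c_fc = (n_embd, hidden)      # expand: narrow -> wide
    c_proj = (hidden, n_embd)    # contract: wide -> narrow
    total = c_fc[0] * c_fc[1] + c_proj[0] * c_proj[1]
    return c_fc, c_proj, total


c_fc, c_proj, total = ffn_shapes(1280)
print(c_fc, c_proj, total)  # (1280, 5120) (5120, 1280) 13107200
```

With a 4x expansion each FFN holds 8 × n_embd² weights, so doubling the model width quadruples the FFN size.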

ReLU² vs ReLU

# Standard ReLU:
ReLU(x) = max(0, x)

# ReLU² (Squared ReLU):
ReLU²(x) = (max(0, x))²

Comparison:

x = [-2, -1, 0, 1, 2, 3]

ReLU(x)  = [0, 0, 0, 1, 2, 3]
ReLU²(x) = [0, 0, 0, 1, 4, 9]
#                    ^^^^^^^^^ Stronger activation!

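Advantages:
- Stronger non-linearity: large activations are amplified, small ones are suppressed
- A smoother gradient around 0 than plain ReLU
- Reported to work better empirically in some language-model experiments

nanochat uses it; the idea comes from the Primer paper (So et al., 2021). PaLM, by contrast, uses SwiGLU rather than ReLU².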
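
The difference is easiest to see next to the derivative. The sketch below, in plain Python, evaluates ReLU, ReLU², and the ReLU² gradient on the same inputs as the comparison above. The function names are my own and are not taken from nanochat.

```python
def relu(x):
    return max(0.0, float(x))

def relu_squared(x):
    return max(0.0, float(x)) ** 2

def relu_squared_grad(x):
    # d/dx max(0, x)^2 = 2 * max(0, x); the value at x = 0 is 0, so there is no jump
    return 2.0 * max(0.0, float(x))

xs = [-2, -1, 0, 1, 2, 3]
print([relu(x) for x in xs])               # [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
print([relu_squared(x) for x in xs])       # [0.0, 0.0, 0.0, 1.0, 4.0, 9.0]
print([relu_squared_grad(x) for x in xs])  # [0.0, 0.0, 0.0, 2.0, 4.0, 6.0]
```

Plain ReLU has a gradient that steps from 0 to 1 at the origin, whereas the ReLU² gradient grows linearly from 0.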

Normalization: LayerNorm vs RMSNorm

LayerNorm (older)

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()  # required before registering parameters
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps

    def forward(self, x):
        # Mean and variance
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)

        # Normalize
        x_norm = (x - mean) / torch.sqrt(var + self.eps)

        # Scale and shift (learnable)
        return self.gamma * x_norm + self.beta

Drawback: every call computes a mean and a variance and then applies the learnable gamma and beta, which is more work than necessary.

RMSNorm (modern)

def rms_norm(x, eps=1e-6):
    """Root Mean Square Normalization"""
    # Compute the RMS (no mean subtraction!)
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    # Divide by the RMS (this variant also has no learnable gamma or beta)
    return x / rms

Differences:

# LayerNorm:
1. Compute the mean
2. Compute the variance
3. (x - mean) / sqrt(var + eps)
4. gamma * ... + beta

# RMSNorm:
1. Compute the RMS (sqrt(mean(x²) + eps))
2. x / RMS
Done!

# RMSNorm is much simpler!

Example:

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

# LayerNorm:
mean = 2.5
std = 1.118   # population std, sqrt(1.25)
x_ln = (x - 2.5) / 1.118
     = [-1.34, -0.45, 0.45, 1.34]
# The mean becomes 0

# RMSNorm:
rms = sqrt(mean([1, 4, 9, 16])) = sqrt(7.5) = 2.74
x_rms = x / 2.74
      = [0.37, 0.73, 1.10, 1.46]
# The mean is not 0! (That's fine.)
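
The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch that leaves out the learnable gamma and beta; the list-based `layer_norm` and `rms_norm` here are illustrative stand-ins, not the PyTorch versions.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without gamma/beta: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # population variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without gamma: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 2) for v in layer_norm(x)])  # [-1.34, -0.45, 0.45, 1.34]
print([round(v, 2) for v in rms_norm(x)])    # [0.37, 0.73, 1.1, 1.46]
```

LayerNorm output sums to zero, while RMSNorm output keeps the sign and relative size of every input and only rescales it to unit RMS.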

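Why RMSNorm?

The RMSNorm paper's finding: re-centering (subtracting the mean) is not essential!
- Re-scaling by the RMS alone gives comparable performance
- It is faster and simpler
- Llama and nanochat both use it

Residual Connections: An Information Highway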

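# Without residual:
x = attention(x)
x = ffn(x)
# Information can be lost as it passes through each layer ❌

# With residual:
x = x + attention(x)  # ← the original information is preserved!
x = x + ffn(x)        # ← the original information is preserved!
# Information always has a path to flow through ✅

Why do we need them?

The problem with deep networks: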

# 32 layers without residual:
Layer 1: x₀ → x₁ = f₁(x₀)
Layer 2: x₁ → x₂ = f₂(x₁)
...
Layer 32: x₃₂ = f₃₂(...f₂(f₁(x₀))...)

# Gradient backpropagation:
∂Loss/∂x₀ = ∂Loss/∂x₃₂ × ∂f₃₂/∂x₃₁ × ∂f₃₁/∂x₃₀ × ... × ∂f₁/∂x₀
# A product of 32 factors: if each is small, the gradient vanishes! ❌

# With residual:
Layer 1: x₀ → x₁ = x₀ + f₁(x₀)
Layer 2: x₁ → x₂ = x₁ + f₂(x₁)
...
Layer 32: x₃₂ = x₀ + f₁(x₀) + f₂(x₁) + ... + f₃₂(x₃₁)

# Gradient:
∂Loss/∂x₀ = ∂Loss/∂x₃₂ × (1 + ∂f₃₂/∂x₃₁) × ... × (1 + ∂f₁/∂x₀)
# Expanding the product always leaves a "1" (identity) term → the gradient flows! ✅
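
A scalar toy model makes the effect visible. The sketch below assumes, purely for illustration, that every layer has the same small local slope of 0.1, and multiplies the per-layer factors for a 32-layer chain with and without residual connections. `chain_gradient` is a hypothetical helper.

```python
import math

def chain_gradient(local_grads, residual):
    """Multiply per-layer derivatives for a chain of scalar layers.

    Without residuals each layer contributes f'(x); with residuals it
    contributes 1 + f'(x), because d/dx (x + f(x)) = 1 + f'(x).
    """
    factors = [(1.0 + g) if residual else g for g in local_grads]
    return math.prod(factors)

local_grads = [0.1] * 32  # assumed toy value: every layer has slope 0.1
plain = chain_gradient(local_grads, residual=False)
with_residual = chain_gradient(local_grads, residual=True)
print(plain)          # about 1e-32
print(with_residual)  # about 21.1
```

Real layers have matrix-valued Jacobians rather than a single slope, but the same structure holds: the residual path contributes an identity term to every factor.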

Pre-norm vs Post-norm

There are two ways to place the normalization:

Post-norm (older, 2017 Transformer)

def post_norm_block(x):
    # Attention → Add → Norm
    x = norm(x + attention(x))

    # FFN → Add → Norm
    x = norm(x + ffn(x))

    return x

Pre-norm (modern, GPT-2+)

def pre_norm_block(x):
    # Norm → Attention → Add
    x = x + attention(norm(x))

    # Norm → FFN → Add
    x = x + ffn(norm(x))

    return x

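Why Pre-norm?

# Post-norm: the gradient has to pass through a norm on the residual path at every layer
# → Less stable

# Pre-norm: the gradient flows directly along the residual path
# → Stable!

# In practice:
Post-norm: more prone to loss spikes and divergence in deep models (often needs careful warmup)
Pre-norm: stable training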
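
The structural difference can be seen in a toy example, sketched here in plain Python. The `branch` function is a made-up stand-in for attention or the FFN (it just scales its input by 0.5), so this only illustrates where the normalization sits; it says nothing about training stability by itself.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def branch(x):
    # Toy stand-in for attention/FFN (an assumption for illustration)
    return [0.5 * v for v in x]

def post_norm_layer(x):
    # Add, then normalize: the residual stream itself is rescaled
    return rms_norm([a + b for a, b in zip(x, branch(x))])

def pre_norm_layer(x):
    # Normalize only the branch input: the residual stream passes through untouched
    return [a + b for a, b in zip(x, branch(rms_norm(x)))]

x_post = x_pre = [3.0, 4.0]
for _ in range(4):
    x_post = post_norm_layer(x_post)
    x_pre = pre_norm_layer(x_pre)

print(round(rms(x_post), 3))  # 1.0: the stream is renormalized every layer
print(round(rms(x_pre), 3))   # 5.536: the input plus four branch outputs of RMS 0.5
```

In the pre-norm version the original input is still present in the output unchanged, which is the identity path that the gradient can follow.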

Complete Transformer Block

Putting it all together:

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.attn = Attention(config)  # Multi-head with RoPE
        self.mlp = FeedForward(config.n_embd)  # FFN with ReLU²

    def forward(self, x):
        # Pre-norm architecture

        # 1. Attention block
        x = x + self.attn(rms_norm(x))

        # 2. FFN block
        x = x + self.mlp(rms_norm(x))

        return x

Simple, right? The key points are:
1. Norm → Attention → Add
2. Norm → FFN → Add

The Full GPT Model

Now we stack all the blocks:

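class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.config = config  # kept so helper methods can read n_layer etc.

        # Token embedding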
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config)
            for _ in range(config.n_layer)
        ])

        # Final norm
        self.ln_f = rms_norm

        # LM head (output projection)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # 1. Token embeddings (no position embedding - RoPE handles it!)
        x = self.token_embedding(idx)  # (B, T, n_embd)

        # 2. Transformer blocks
        for block in self.blocks:
            x = block(x)

        # 3. Final norm
        x = self.ln_f(x)

        # 4. LM head
        logits = self.lm_head(x)  # (B, T, vocab_size)

        # 5. Loss (if training)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)
            targets_flat = targets.view(B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss
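
The loss step flattens the (B, T, C) logits into B*T rows before applying cross-entropy. Here is a minimal sketch of that step in plain Python with nested lists instead of tensors. The `cross_entropy` function below is a hand-written stand-in for illustration, not PyTorch's `F.cross_entropy`.

```python
import math

def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood over rows of logits (one row per token)."""
    total = 0.0
    for row, target in zip(logits_rows, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
    return total / len(targets)

# Toy batch: B=2 sequences, T=3 tokens, C=5 vocabulary entries, all logits equal
B, T, C = 2, 3, 5
logits = [[[0.0] * C for _ in range(T)] for _ in range(B)]
targets = [[1, 4, 0], [2, 2, 3]]

# Same idea as logits.view(B*T, C) and targets.view(B*T)
logits_flat = [row for seq in logits for row in seq]
targets_flat = [t for seq in targets for t in seq]

loss = cross_entropy(logits_flat, targets_flat)
print(len(logits_flat), round(loss, 4))  # 6 1.6094
print(round(math.log(50304), 2))         # 10.83
```

When every token is equally likely the loss equals ln(C), so an untrained model with a 50304-entry vocabulary should start near 10.83.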

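Parameter distribution:

# Simplified d20 config from this post
# (vocab_size=50304, n_embd=1280, n_layer=20, no biases, full multi-head attention, untied LM head):
Total: ~522M parameters

Token embedding: 64.4M (12%)
Transformer blocks: 393.2M (75%)
  ├─ Attention: 131.1M (25%)
  └─ FFN: 262.1M (50%)
LM head: 64.4M (12%)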
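
As a cross-check, the weights can be counted directly from the configuration values used in this post (vocab_size=50304, n_embd=1280, n_layer=20). The sketch below is plain Python. `count_parameters` is a hypothetical helper, and it assumes bias-free layers, four n_embd × n_embd attention projections per layer, and a parameter-free norm.

```python
def count_parameters(vocab_size, n_embd, n_layer, expansion=4, tied=False):
    """Count weights for a bias-free GPT with full multi-head attention.

    Assumes Q, K, V and output projections are each n_embd x n_embd and that
    normalization has no learnable parameters (hypothetical helper).
    """
    embedding = vocab_size * n_embd
    attention = n_layer * 4 * n_embd * n_embd
    ffn = n_layer * 2 * expansion * n_embd * n_embd
    lm_head = 0 if tied else vocab_size * n_embd
    return {
        "embedding": embedding,
        "attention": attention,
        "ffn": ffn,
        "lm_head": lm_head,
        "total": embedding + attention + ffn + lm_head,
    }

counts = count_parameters(vocab_size=50304, n_embd=1280, n_layer=20)
for name, n in counts.items():
    print(f"{name:>10}: {n / 1e6:7.1f}M ({100 * n / counts['total']:.0f}%)")
```

Under these assumptions the total comes to about 522M, and the FFN holds twice as many weights as attention because of the 4x expansion.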

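Weight Tying: Sharing the Embedding

A small optimization:

# The token embedding and the LM head both map between tokens and n_embd-dimensional vectors
# → so their weights can be shared!

self.lm_head.weight = self.token_embedding.weight

# Parameters saved:
# Before: 64M (emb) + 64M (lm_head) = 128M
# After: 64M (shared) → 64M saved!

Note: nanochat's gpt.py lists untied embedding and LM head weights among its features, so treat this as a general technique rather than something nanochat applies.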

Initialization Strategy

Just creating nn.Linear() and leaving the default initialization is not enough:

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        # GPT-2 style initialization
        std = 0.02
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)

        # Extra scaling for residual projections
        # (assumes those layers were tagged with a NANOGPT_SCALE_INIT attribute
        # and that __init__ stored the config as self.config)
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            module.weight.data *= (2 * self.config.n_layer) ** -0.5

    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Why?

# Random init with std=1.0 (unit-variance input):
layer1_output = W₁ @ x  # std ≈ sqrt(fan_in)
layer2_output = W₂ @ layer1_output  # std ≈ fan_in
# Explodes! ❌

# Scaled init with std=0.02 (fan_in = 1280):
layer1_output = W₁ @ x  # std ≈ 0.02 * sqrt(1280) ≈ 0.72
layer2_output = W₂ @ layer1_output  # std ≈ 0.72² ≈ 0.51
# Stays on the order of 1. Stable! ✅

nanochat's Config
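
The scaling rule behind these numbers is that for unit-variance inputs the output std of `W @ x` is roughly `weight_std * sqrt(fan_in)`. The sketch below checks this with a small Monte Carlo estimate in plain Python. `output_std` is a hypothetical helper, and the sample size and seed are arbitrary choices.

```python
import math
import random

def output_std(weight_std, fan_in, n_out=400, seed=0):
    """Estimate std of y = W @ x for x ~ N(0, 1) and W ~ N(0, weight_std^2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    # Each output unit is a dot product of x with a fresh random weight row
    ys = [
        sum(rng.gauss(0.0, weight_std) * xi for xi in x)
        for _ in range(n_out)
    ]
    mean = sum(ys) / n_out
    return math.sqrt(sum((y - mean) ** 2 for y in ys) / n_out)

fan_in = 1280
est_unit = output_std(1.0, fan_in)
est_small = output_std(0.02, fan_in)
print(est_unit, math.sqrt(fan_in))          # estimate, then theory: sqrt(1280) ≈ 35.8
print(est_small, 0.02 * math.sqrt(fan_in))  # estimate, then theory: ≈ 0.72
```

With std=1.0 each layer multiplies the activation scale by about 36, while std=0.02 keeps it below 1 for this width.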

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50304
    depth: int = 20
    aspect_ratio: int = 64

    def __post_init__(self):
        self.n_layer = self.depth
        self.n_embd = self.depth * self.aspect_ratio  # 20 * 64 = 1280

        # Attention config
        self.n_head = max(1, (self.n_embd + 127) // 128)  # 10 heads
        self.head_dim = 128
        self.n_kv_head = self.n_head  # No MQA (full quality)

        # FFN config
        self.intermediate_size = 4 * self.n_embd  # 5120

# Usage:
config = GPTConfig(depth=20)
model = GPT(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# Parameters: ~522M (assuming bias-free Q/K/V/output projections in Attention and an untied LM head)
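
The `__post_init__` arithmetic can be tried for other depths without building a model. This is a minimal sketch in plain Python. `derive_dims` is a hypothetical helper that mirrors the width and head-count formulas above, and the depths 12 and 26 are chosen only as examples.

```python
def derive_dims(depth, aspect_ratio=64, head_dim=128):
    """Mirror the __post_init__ arithmetic above: returns (n_embd, n_head)."""
    n_embd = depth * aspect_ratio
    n_head = max(1, (n_embd + head_dim - 1) // head_dim)  # ceiling division
    return n_embd, n_head

for depth in (12, 20, 26):
    print(depth, derive_dims(depth))
# 12 (768, 6)
# 20 (1280, 10)
# 26 (1664, 13)
```

Because the head count uses ceiling division, an odd depth gives a width that is not a multiple of 128, so n_head × 128 would exceed n_embd. Even depths are the clean cases.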

A Hands-On Example

Build the model and test it:

# 1. Create model
config = GPTConfig(depth=20)
model = GPT(config).cuda()

# 2. Dummy input
batch_size = 4
seq_len = 128
idx = torch.randint(0, config.vocab_size, (batch_size, seq_len)).cuda()
targets = torch.randint(0, config.vocab_size, (batch_size, seq_len)).cuda()

# 3. Forward pass
logits, loss = model(idx, targets)

print(f"Input: {idx.shape}")        # [4, 128]
print(f"Logits: {logits.shape}")    # [4, 128, 50304]
print(f"Loss: {loss.item():.4f}")   # roughly ln(50304) ≈ 10.83 for an untrained model

# 4. Backward
loss.backward()
print(f"Gradients computed: {model.token_embedding.weight.grad.shape}")

Key Takeaways

✅ Feed-Forward Network
- 4x expansion → contraction
- ReLU² activation
- Handles information processing

✅ RMSNorm
- Simpler than LayerNorm
- No mean subtraction, no learnable gamma or beta
- Fast and effective

✅ Residual Connections
- Keep gradients flowing
- Make deep networks trainable
- Pre-norm is more stable

✅ Complete Block
- Norm → Attention → Add
- Norm → FFN → Add
- Repeat this N times!

✅ GPT Model
- Token embedding
- N × Transformer blocks
- Final norm + LM head
- Simple but powerful!

Next Steps

Part 6, "Mastering LLM Training," will cover:
- DataLoader implementation
- Training loop (forward/backward/optimize)
- Learning rate scheduling
- Gradient accumulation
- Muon optimizer

We've built the model, so now it's time to train it! 🚀

---

📘 Note

This post is based on Andrej Karpathy's open-source nanochat project and was written to analyze and explain its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.

References:
- [nanochat gpt.py](https://github.com/karpathy/nanochat/blob/main/nanochat/gpt.py)
- [RMSNorm paper](https://arxiv.org/abs/1910.07467)
- [Deep Residual Learning](https://arxiv.org/abs/1512.03385)

Tags: #GPT #Transformer #FFN #RMSNorm #ResidualConnection #nanochat