[Analyzing nanochat] 1. An LLM Journey Starting at $100

Oct 18, 2025

Building Your Own ChatGPT: An LLM Journey Starting at $100

**Series**: Building an LLM with nanochat - Part 1/10

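## Introduction

"How much does it cost to build an AI like ChatGPT?"

Many people assume it takes "millions of dollars," "a massive data center," and "months of time." But as of 2025, you can build your own conversational AI for just $100.

This series follows Andrej Karpathy's nanochat project and walks through implementing the whole pipeline by hand, from tokenizer training to web serving.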

## Why nanochat?

### 1. A complete pipeline

Most tutorials cover only the model architecture, but nanochat covers the full process a real deployment needs:

Text data
    ↓
Tokenizer training (Rust BPE)
    ↓
Pretraining (FineWeb dataset)
    ↓
Fine-tuning (SmolTalk conversation data)
    ↓
Reinforcement learning (GSM8K math problems)
    ↓
Evaluation (MMLU, HumanEval, etc.)
    ↓
Web serving (FastAPI + WebSocket)

### 2. Realistic cost

nanochat offers three budget options:

| Tier | Model | Parameters | Training time | Cost | Performance level |
|------|-------|------------|---------------|------|-------------------|
| 🥉 Bronze | d20 | 370M | 4 hours | $100 | Kindergarten level |
| 🥈 Silver | d26 | 770M | 12 hours | $300 | GPT-2 class |
| 🥇 Gold | d32 | 1.9B | 41 hours | $1,000 | Above GPT-2 |
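
As a quick sanity check on the table, dividing each tier's cost by its training time gives the implied hourly price of the training node. The sketch below uses only the numbers from the table; the `tiers` dictionary and the rounding are illustrative choices, not part of nanochat.

```python
# Implied hourly rate for each tier, using only the figures from the table above.
tiers = {
    "d20": {"hours": 4, "cost_usd": 100},
    "d26": {"hours": 12, "cost_usd": 300},
    "d32": {"hours": 41, "cost_usd": 1000},
}

def hourly_rate(tier):
    """Cost divided by training time, in USD per hour."""
    return tier["cost_usd"] / tier["hours"]

rates = {name: round(hourly_rate(t), 2) for name, t in tiers.items()}
for name, rate in rates.items():
    print(f"{name}: ~${rate}/hour")
```

All three tiers work out to roughly $24 to $25 per hour, which suggests the tiers differ mainly in how long the same kind of node runs rather than in the hardware itself.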

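### 3. Designed for learning

nanochat consists of only about 8,000 lines of clean code. For comparison:

- GPT-2 implementation (OpenAI): ~15,000 lines + legacy code
- Llama implementation (Meta): ~20,000 lines + complex dependencies
- nanochat: ~8,000 lines, modern techniques, clear structure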

## A look at the project structure

nanochat's directory layout is as follows:

nanochat/
├── nanochat/              # core library
│   ├── gpt.py            # GPT model (Transformer implementation)
│   ├── engine.py         # inference engine (KV cache)
│   ├── tokenizer.py      # tokenizer wrapper
│   ├── muon.py           # Muon optimizer
│   └── execution.py      # sandboxed execution of model-generated Python code
├── scripts/               # entry-point scripts
│   ├── base_train.py     # pretraining
│   ├── chat_sft.py       # fine-tuning
│   ├── chat_rl.py        # reinforcement learning
│   └── chat_eval.py      # evaluation
├── rustbpe/               # Rust BPE tokenizer
│   └── src/lib.rs        # 10x faster tokenizer
└── tasks/                 # evaluation tasks
    ├── mmlu.py           # knowledge evaluation
    └── gsm8k.py          # math reasoning evaluation

### Key files in detail

#### 1. gpt.py - the heart of the model (400 lines)

import torch.nn as nn

# Block and RMSNorm are assumed to be defined elsewhere in gpt.py.
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()  # must run before submodules are registered
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = RMSNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # targets is unused in this simplified version (no loss is computed)
        # Token embedding
        x = self.transformer.wte(idx)

        # Transformer blocks (attention + FFN)
        for block in self.transformer.h:
            x = block(x)

        # Final norm + output projection
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        return logits

Starting from this simple structure, we will add modern techniques such as RoPE, MQA, and Flash Attention one at a time.
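
The `ln_f` layer in the snippet above is an RMSNorm. As a minimal, dependency-free sketch of what that normalization computes on a single vector (assuming no learnable scale and an `eps` of 1e-6, both of which are assumptions here rather than values taken from nanochat):

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide a vector by its root-mean-square so its RMS becomes ~1.

    eps is a small constant for numerical stability (assumed value).
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms for v in x]

hidden = [3.0, -4.0, 0.0, 0.0]      # toy activation vector
normed = rms_norm(hidden)
rms_after = math.sqrt(sum(v * v for v in normed) / len(normed))
print(normed, rms_after)
```

Unlike LayerNorm, this does not subtract the mean; it only rescales, which makes it slightly cheaper to compute.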

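#### 2. muon.py - the secret weapon (250 lines)

One of nanochat's key ingredients is the Muon optimizer. Using Muon in place of Adam for the weight matrices gives:

- 2.5x faster convergence: fewer steps are needed to reach the same loss
- More stable training: Newton-Schulz orthogonalization stabilizes the gradient updates
- Specialized for matrices: optimized for the weight matrices in attention and MLP layers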

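import math
import torch

class Muon(torch.optim.Optimizer):
    def __init__(self, params, lr=0.02, momentum=0.95):
        # Default values here are illustrative, not taken from nanochat.
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, momentum = group['lr'], group['momentum']
            for p in group['params']:
                if p.grad is None:
                    continue

                # 1. Momentum update (one buffer per parameter)
                state = self.state[p]
                if 'buf' not in state:
                    state['buf'] = torch.zeros_like(p.grad)
                buf = state['buf'].mul_(momentum).add_(p.grad)

                # 2. Newton-Schulz orthogonalization (matrices only);
                #    orthogonalize() is assumed to be defined elsewhere in muon.py
                update, scale = buf, 1.0
                if p.ndim == 2:
                    update = orthogonalize(buf)
                    # 3. Aspect ratio scaling
                    scale = math.sqrt(max(1, p.size(0) / p.size(1)))

                p.add_(update, alpha=-lr * scale)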
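
The `orthogonalize` call above is not defined in the snippet. The sketch below illustrates the idea with the textbook cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, in plain Python on a small matrix. This is an assumption made for illustration: Muon implementations typically run a few steps of a tuned higher-order polynomial on GPU tensors, so treat the helper names, step count, and normalization here as hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the nearest (semi-)orthogonal matrix to g.

    Uses the classic cubic Newton-Schulz iteration, not the tuned
    variant used by real Muon code.
    """
    # Scale by the Frobenius norm so every singular value is <= 1,
    # which this iteration needs in order to converge.
    fro = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

grad = [[3.0, 1.0], [0.0, 2.0]]      # toy 2x2 "gradient"
q = orthogonalize(grad)
gram = matmul(q, transpose(q))       # should be close to the identity
print(gram)
```

The iteration pushes every singular value of the input toward 1 while keeping its singular vectors, so the update keeps the gradient's "direction" but discards its scale.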

## Calculating model size

nanochat sets the model size with a single hyperparameter, "depth":

def calculate_model_size(depth, aspect_ratio=64):
    """
    depth: number of layers (20, 26, 32, etc.)
    aspect_ratio: ratio of hidden_dim to depth
    """
    n_layer = depth
    model_dim = depth * aspect_ratio  # d20: 20 * 64 = 1280
    n_head = max(1, (model_dim + 127) // 128)  # head_dim = 128
    vocab_size = 50304

    # 1. Token Embedding
    embedding_params = vocab_size * model_dim

    # 2. Transformer Blocks
    # Attention: Q, K, V, O projections
    attn_params = 4 * (model_dim * model_dim)
    # FFN: 4x expansion
    mlp_params = 2 * (model_dim * 4 * model_dim)
    block_params = (attn_params + mlp_params) * n_layer

    # 3. LM Head
    lm_head_params = vocab_size * model_dim

    total = embedding_params + block_params + lm_head_params
    return total / 1e6  # Convert to millions

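Output:

d20 (depth=20): 522M parameters → $100, 4 hours
d26 (depth=26): 1031M parameters → $300, 12 hours
d32 (depth=32): 1817M parameters → $1000, 41 hours

These estimates assume vocab_size = 50304 and count only the embedding, attention, MLP, and LM-head weights, so they differ from the parameter counts in the tier table above.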
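
The function above can also be written in closed form. With V the vocabulary size, L the depth, and d = 64·L the model dimension, the count used by `calculate_model_size` reduces to the expression below. This only restates the code; it still ignores norm parameters and anything else the real model may contain.

```latex
\begin{aligned}
N(L) &= \underbrace{2Vd}_{\text{embedding + LM head}}
      + \underbrace{L\,(4d^2 + 8d^2)}_{\text{attention + MLP}} \\
     &= 2Vd + 12Ld^2, \qquad d = 64L \\
N(20) &= 2 \cdot 50304 \cdot 1280 + 12 \cdot 20 \cdot 1280^2 \\
      &= 128{,}778{,}240 + 393{,}216{,}000 = 521{,}994{,}240 \approx 522\text{M}
\end{aligned}
```

Because d grows with L, the block term scales as L³, which is why d32 comes out about 3.5x larger than d20 despite having only 1.6x as many layers.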

## Environment setup

### Required hardware

Minimum requirements:
- GPU: NVIDIA GPU (16GB+ VRAM)
  - RTX 4090 (24GB): best suited to individual learning
  - A100 (40/80GB): recommended in the cloud
- RAM: 32GB+
- Storage: 100GB+ (datasets + checkpoints)
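
A rough way to relate these VRAM figures to model size is to count bytes for the weights alone. The sketch below assumes bfloat16 storage (2 bytes per parameter) and reuses the parameter estimates from the model-size calculation in this post; gradients, optimizer state, and activations add substantially more during training and are not modeled here.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 uses 16 bits per value

def weight_memory_gib(n_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Memory needed to hold the weights only, in GiB."""
    return n_params * bytes_per_param / 1024**3

# Parameter estimates from the model-size calculation in this post.
estimates = {"d20": 522e6, "d26": 1031e6, "d32": 1817e6}
weights_gib = {name: round(weight_memory_gib(n), 2) for name, n in estimates.items()}
for name, gib in weights_gib.items():
    print(f"{name}: ~{gib} GiB of weights in bf16")
```

The weights themselves are only a few GiB even for d32, so in practice the memory budget is dominated by training state and batch size rather than by the parameter count alone.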

### Installing software

# 1. Python environment (3.10+)
python3 --version

# 2. PyTorch 2.0+ (CUDA 11.8+)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# 3. Dependencies
pip install numpy matplotlib tiktoken wandb tqdm requests

# 4. Clone nanochat
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# 5. Install Rust (for the tokenizer)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"  # make cargo available in the current shell

# 6. Build the Rust BPE tokenizer
cd rustbpe
cargo build --release

### Checking the GPU

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name}")
        print(f"    Memory: {props.total_memory / 1024**3:.1f} GB")
        print(f"    Compute: {props.major}.{props.minor}")

Expected output:

PyTorch version: 2.5.0+cu118
CUDA available: True
GPU count: 1
  GPU 0: NVIDIA A100-SXM4-80GB
    Memory: 79.2 GB
    Compute: 8.0

## Previewing the full pipeline

Running nanochat's speedrun.sh executes the whole pipeline automatically:

#!/bin/bash
# speedrun.sh - build a ChatGPT for $100

# Stage 1: Download data
python -m nanochat.dataset -n 8  # 8 shards (~2GB)

# Stage 2: Train the tokenizer
python -m scripts.tok_train --max_chars=2000000000

# Stage 3: Evaluate the tokenizer
python -m scripts.tok_eval

# Stage 4: Pretraining (the main event!)
python -m scripts.base_train \
    --depth 20 \
    --num_iterations 4681 \
    --batch_size 512

# Stage 5: Fine-tuning
python -m scripts.chat_sft --num_iterations 500

# Stage 6: Reinforcement learning
python -m scripts.chat_rl --num_iterations 200

# Stage 7: Evaluation
python -m scripts.chat_eval

Timeline:
- Tokenizer: 30 minutes
- Pretraining: 4 hours (on 8x H100)
- Fine-tuning: 1 hour
- Reinforcement learning: 30 minutes
- Total: ~6 hours

## Next steps

That covers the basic concepts and environment setup. Part 2, "Everything About Tokenizers," will cover:
- Character vs. word vs. BPE tokenization
- Implementing the Byte-Pair Encoding algorithm
- GPT-4 style regex splitting
- Designing special tokens
- Building a 10x faster tokenizer in Rust

## Key takeaways

✅ nanochat = a complete LLM pipeline
- Covers every stage from tokenizer to web serving
- Only about 8,000 lines of clean code

✅ Three budget options
- Bronze ($100): for learning
- Silver ($300): GPT-2 class
- Gold ($1,000): above GPT-2

✅ Modern techniques
- RoPE, MQA, Flash Attention
- Muon optimizer
- bfloat16 mixed precision

✅ Realistic time budget
- Full pipeline: about 6 hours
- Runs on cloud GPUs

See you in the next post! 🚀

---

## 📘 Note

This post is based on Andrej Karpathy's open-source nanochat project and was written to analyze and explain its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.

References:
- [nanochat GitHub](https://github.com/karpathy/nanochat)
- [Andrej Karpathy YouTube](https://www.youtube.com/@AndrejKarpathy)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)

Tags: #LLM #ChatGPT #Transformer #nanochat #DeepLearning #PyTorch #AIEducation