[Analyzing Nanochat] 9. Evaluation and Deployment

Evaluation and Deployment: Completing Your Own ChatGPT

**Series**: Building an LLM with nanochat - Part 9/9 (Final)

The End of the Journey

Congratulations! So far we have:

✅ Implemented a tokenizer (BPE)
✅ Built a GPT model (Attention, RoPE, Flash)
✅ Trained it ($100, 4 hours)
✅ Fine-tuned it (SFT + RL)

Now for the final step: let's evaluate and deploy the model!

Why Does Evaluation Matter?

  • Training loss: 4.5 → 2.3  # It went down!
    
    But is the model actually better? 🤔
    
    # We need standardized benchmarks!

    Loss alone cannot tell us what the model can actually do:
    - Does it have knowledge? (Facts)
    - Can it reason? (Reasoning)
    - Can it write code? (Coding)

    1. Perplexity: The Model's Confidence

    The most basic metric:

  • import torch
    import torch.nn.functional as F
    
    def compute_perplexity(model, text):
        # Encode the text into token ids (BPE tokenizer from Part 1)
        tokens = tokenize(text)
    
        # Forward
        logits = model(tokens[:-1])
        targets = tokens[1:]
    
        # Cross-entropy loss
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    
        # Perplexity = exp(loss)
        perplexity = torch.exp(loss)
        return perplexity.item()

    Interpretation:

  • Perplexity = 100:
      "Model is choosing from ~100 options at each step"
      → Not very confident
    
    Perplexity = 10:
      "Model is choosing from ~10 options"
      → More confident!
    
    Perplexity = 2:
      "Model is almost certain"
      → Very good! (near human level)

    Example:

  • text = "The cat sat on the mat"
    
    Bad model:
      P("cat"|"The") = 0.01  # Surprised!
      P("sat"|"cat") = 0.02  # Very surprised!
      Perplexity: 50
    
    Good model:
      P("cat"|"The") = 0.15  # Expected
      P("sat"|"cat") = 0.20  # Expected
      Perplexity: 5
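
    As a quick check, here is a minimal sketch that reproduces this arithmetic. The per-token probabilities are the illustrative numbers from the example above, extended with two assumed values; they are not real model outputs:

  • import math
    
    # Hypothetical per-token probabilities for the "good model" above
    probs = [0.15, 0.20, 0.18, 0.22]  # last two values are assumed for illustration
    
    # Perplexity = exp(average negative log-likelihood)
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    perplexity = math.exp(avg_nll)
    
    print(round(perplexity, 1))  # ≈ 5.4 → "choosing from ~5 options per step"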

    nanochat results:

  • Base model (after pretraining):
      Perplexity on FineWeb test: 8.2
    
    After SFT:
      Perplexity on FineWeb test: 8.5 (slightly worse - expected!)
      Perplexity on SmolTalk test: 3.1 (much better on chat!)

    2. MMLU: Knowledge Test

    MMLU = Massive Multitask Language Understanding

    57 subjects (high school → graduate level):
    - Elementary Mathematics
    - US History
    - Computer Science
    - Medical Genetics
    - International Law
    - ...

    Format:

  • Question: What is the capital of France?
    A) London
    B) Berlin
    C) Paris
    D) Madrid
    
    Correct: C

    Evaluation:

  • def eval_mmlu(model, dataset):
        correct = 0
        total = 0
    
        for question in dataset:
            # Format as prompt
            prompt = f"""
    Question: {question['question']}
    A) {question['choices'][0]}
    B) {question['choices'][1]}
    C) {question['choices'][2]}
    D) {question['choices'][3]}
    
    Answer:"""
    
            # Generate (single token!)
            response = model.generate(prompt, max_tokens=1)
    
            # Check
            if response.strip() == question['answer']:
                correct += 1
            total += 1
    
        accuracy = correct / total * 100
        return accuracy
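
    For reference, here is a single dataset item in the shape the sketch above assumes (the field names follow the sketch, not necessarily the official MMLU schema):

  • # Hypothetical MMLU-style item matching eval_mmlu's expected fields
    sample = {
        "question": "What is the capital of France?",
        "choices": ["London", "Berlin", "Paris", "Madrid"],
        "answer": "C",
    }
    
    # accuracy = eval_mmlu(model, [sample])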

    nanochat results:

  • Random guessing: 25% (4 choices)
    
    d20 (370M, $100):  28%  # Slightly better than random
    d26 (770M, $300):  32%
    d32 (1.9B, $1000): 36%
    
    # For comparison:
    GPT-2 (1.5B): ~30%
    GPT-3.5: 70%
    GPT-4: 86%
    
    # The larger nanochat models (d26, d32) beat GPT-2! 🎉

    Per-subject breakdown:

  • nanochat d20 performance:
    Elementary Math: 35%  (good!)
    History: 28%
    Computer Science: 31%
    Medical Knowledge: 24%  (struggles)
    Law: 26%

    3. GSM8K: Math Reasoning

    GSM8K = Grade School Math 8K

    Grade-school-level word problems:

  • Problem:
    "If John has 5 apples and buys 3 more,
    then gives 2 to Mary, how many does he have?"
    
    Solution:
    John starts: 5
    Buys: +3 → 8
    Gives: -2 → 6
    Answer: 6

    Evaluation:

  • def eval_gsm8k(model, dataset):
        correct = 0
    
        for problem in dataset:
            # Generate solution
            response = model.generate(
                problem['question'],
                max_tokens=200
            )
    
            # Extract answer (regex)
            predicted = extract_number(response)
            ground_truth = problem['answer']
    
            if predicted == ground_truth:
                correct += 1
    
        return correct / len(dataset) * 100

    Answer extraction:

  • import re
    
    def extract_number(text):
        # Look for patterns:
        # "The answer is 42"
        # "#### 42"
        # "= 42"
    
        patterns = [
            r'#### (\d+)',
            r'[Aa]nswer is (\d+)',
            r'= (\d+)',
        ]
    
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return int(match.group(1))
    
        return None
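
    A few quick sanity checks of the extraction patterns (illustrative strings only):

  • assert extract_number("Remaining: 12 - 5 = 7\n#### 7") == 7   # '#### N' pattern
    assert extract_number("The answer is 42") == 42               # 'answer is N' pattern
    assert extract_number("I am not sure.") is None               # no number found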

    nanochat results:

  • Before RL: 20%
    After RL: 35%  # +15% improvement!
    
    # For comparison:
    GPT-2: ~5%
    GPT-3.5: 57%
    GPT-4: 92%
    
    # RL training works! ✅

    Example outputs:

  • Problem: "A baker makes 12 cupcakes. He sells 5. How many left?"
    
    Bad response (before RL):
    "The baker has cupcakes. Some were sold."
    → Reward: 0.0
    
    Good response (after RL):
    "Let's solve step by step:
    Starting cupcakes: 12
    Sold: 5
    Remaining: 12 - 5 = 7
    #### 7"
    → Reward: 1.2 (correct + format!)
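
    A minimal sketch of how a combined reward like this could be computed. The weights are assumptions chosen for illustration (a 1.0 correctness reward plus a 0.2 format bonus reproduces the 1.2 above), not the exact nanochat values:

  • import re
    
    def gsm8k_reward(response: str, ground_truth: int) -> float:
        """Correctness reward plus a small bonus for the '#### <answer>' format."""
        reward = 0.0
        match = re.search(r'#### (\d+)', response)
        if match:
            reward += 0.2                          # format bonus (assumed weight)
            if int(match.group(1)) == ground_truth:
                reward += 1.0                      # correct answer (assumed weight)
        return reward
    
    # gsm8k_reward("Remaining: 12 - 5 = 7\n#### 7", 7) → returns 1.2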

    4. HumanEval: Code Generation

    HumanEval = Python function completion

    164 programming problems.

    Format:

  • def is_prime(n):
        """
        Return True if n is a prime number.
    
        >>> is_prime(7)
        True
        >>> is_prime(10)
        False
        """
        # Model must complete this function!

    Evaluation:

  • def eval_humaneval(model, dataset):
        passed = 0
    
        for problem in dataset:
            # Generate code
            code = model.generate(problem['prompt'], max_tokens=200)
    
            # Run test cases
            try:
                # Build the full function: prompt (stub + docstring) + generated completion,
                # and execute it in an isolated namespace (exec() inside a function
                # would not expose the new names to eval() otherwise)
                namespace = {}
                exec(problem['prompt'] + code, namespace)
    
                # Run all tests against that namespace
                for test in problem['tests']:
                    result = eval(test, namespace)
                    assert result, f"Test failed: {test}"
    
                passed += 1
            except Exception as e:
                # Code didn't work
                print(f"Failed: {e}")
    
        return passed / len(dataset) * 100
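
    For reference, one problem in the simplified shape the sketch above assumes. (The official HumanEval dataset instead ships a check() test function per problem; the list of boolean test expressions here is a simplification.)

  • # Hypothetical entry matching eval_humaneval's expected fields
    problem = {
        "prompt": 'def is_prime(n):\n    """Return True if n is a prime number."""\n',
        "tests": [
            "is_prime(7) == True",
            "is_prime(10) == False",
        ],
    }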

    nanochat results:

  • d20: 8%   # Very few!
    d32: 15%
    
    # For comparison:
    GPT-2: ~0%
    GPT-3.5: 48%
    GPT-4: 67%
    Codex: 72%
    
    # Coding is hard! Need more code training.

    Why So Low?

  • # nanochat training data:
    FineWeb: Mostly natural language text
    Code percentage: ~5%
    
    # Codex training data:
    GitHub: Mostly code
    Code percentage: ~80%
    
    → More code data = better coding!

    5. CORE: Aggregate Score

    CORE = nanochat's aggregate metric

  • # Simplified view: a weighted sum of benchmark scores
    CORE = sum([
        MMLU * 0.25,      # Knowledge
        GSM8K * 0.25,     # Math reasoning
        ARC * 0.20,       # Common sense
        HumanEval * 0.15, # Coding
        LM_score * 0.15   # Language modeling (perplexity converted to a higher-is-better score)
    ])

    nanochat CORE scores:

  • d20 (370M, $100):  40
    d26 (770M, $300):  50
    d32 (1.9B, $1000): 58
    
    # For comparison:
    GPT-2 (1.5B): ~45
    GPT-3.5: ~75
    GPT-4: ~90
    
    # d32 clearly beats GPT-2 at a comparable parameter count! 🎉

    Deployment: Web Serving

    Now let's put the model in front of users!

    FastAPI + WebSocket

  • from fastapi import FastAPI, WebSocket
    from nanochat.engine import Engine
    
    app = FastAPI()
    
    # Load model
    engine = Engine.from_pretrained('nanochat-d20')
    
    @app.websocket("/chat")
    async def chat(websocket: WebSocket):
        await websocket.accept()
    
        while True:
            # Receive message
            message = await websocket.receive_text()
    
            # Format prompt
            prompt = f"<|user_start|>{message}<|user_end|><|assistant_start|>"
    
            # Generate streaming
            async for token in engine.generate_stream(prompt):
                await websocket.send_json({
                    "token": token,
                    "done": False
                })
    
            # Send done signal
            await websocket.send_json({"done": True})
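
    To try this endpoint without the React frontend below, here is a minimal client sketch. It assumes the third-party `websockets` package and a server running on localhost:8000:

  • import asyncio
    import json
    import websockets  # pip install websockets
    
    async def chat_once(message: str):
        async with websockets.connect("ws://localhost:8000/chat") as ws:
            await ws.send(message)
            while True:
                data = json.loads(await ws.recv())
                if data.get("done"):
                    break
                print(data["token"], end="", flush=True)  # print tokens as they arrive
    
    asyncio.run(chat_once("Explain quantum computing"))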

    Streaming Generation

    Users watch tokens appear in real time as they are generated:

  • class Engine:
        async def generate_stream(self, prompt, max_tokens=500):
            tokens = self.tokenizer.encode(prompt)
            kv_cache = None
    
            for _ in range(max_tokens):
                # First step: prefill with the full prompt; afterwards, only the newest token
                input_tokens = tokens if kv_cache is None else tokens[-1:]
                logits, kv_cache = self.model(
                    input_tokens,
                    kv_cache=kv_cache
                )
    
                next_token = self.sample(logits)
                tokens.append(next_token)
    
                # Decode and yield
                text = self.tokenizer.decode([next_token])
    
                if next_token == self.tokenizer.eos_token_id:
                    break
    
                yield text  # Stream to user!
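
    The `self.sample` call above is left undefined in this post; below is a minimal temperature / top-k sampling sketch it could correspond to (the default values 0.8 and 50 are assumptions):

  • import torch
    
    def sample(logits, temperature=0.8, top_k=50):
        # Take the logits of the last position; temperature rescales confidence
        logits = logits[-1] / temperature
    
        # Keep only the top-k candidate tokens
        topk_vals, topk_idx = torch.topk(logits, top_k)
        probs = torch.softmax(topk_vals, dim=-1)
    
        # Sample one token id from the truncated distribution
        choice = torch.multinomial(probs, num_samples=1)
        return topk_idx[choice].item()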

    User experience:

  • User: "Explain quantum computing"
    
    Streaming output (as it appears):
    "Q" → "Qu" → "Qua" → "Quan" → "Quant" → "Quantum" → ...
    
    Instead of waiting 10 seconds for full response!

    React Frontend

  • const ws = new WebSocket('ws://localhost:8000/chat');
    
    function sendMessage(message) {
      ws.send(message);
    }
    
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
    
      if (data.done) {
        console.log('Response complete!');
      } else {
        // Append token to UI
        appendToChat(data.token);
      }
    };

    Docker Deployment

  • # Dockerfile
    FROM pytorch/pytorch:2.0.0-cuda11.8
    
    WORKDIR /app
    
    # Install dependencies
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    # Copy model
    COPY model/ /app/model/
    
    # Copy code
    COPY nanochat/ /app/nanochat/
    COPY scripts/ /app/scripts/
    
    # Run server
    CMD ["uvicorn", "scripts.chat_web:app", "--host", "0.0.0.0", "--port", "8000"]
  • # Build
    docker build -t nanochat:latest .
    
    # Run
    docker run -p 8000:8000 --gpus all nanochat:latest
    
    # Access at http://localhost:8000

    Production Checklist

  • # ✅ 1. Model optimization
    model = torch.compile(model, mode='max-autotune')
    model.eval()  # Disable dropout, etc.
    
    # ✅ 2. Batch inference (multiple users)
    def batch_generate(prompts):
        # Generate for multiple prompts in parallel
        # → Better GPU utilization
        ...
    
    # ✅ 3. KV cache pooling
    cache_pool = []
    def get_cache():
        return cache_pool.pop() if cache_pool else init_cache()
    
    # ✅ 4. Monitoring
    import time
    start = time.time()
    response = generate(prompt)
    latency = time.time() - start
    log_metric("latency", latency)
    
    # ✅ 5. Rate limiting
    from fastapi import Depends
    from fastapi_limiter.depends import RateLimiter
    @app.get("/chat", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
    
    # ✅ 6. Error handling
    try:
        response = generate(prompt)
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        return {"error": "Sorry, something went wrong"}

    Final Results

    Congratulations! Here is what we built:

  • ✅ A complete LLM pipeline
      - Tokenizer (Rust BPE)
      - GPT model (370M params)
      - Training (4 hours, $100)
      - Fine-tuning (SFT + RL)
      - Evaluation (CORE: 40)
      - Web serving
    
    ✅ Performance:
      - MMLU: 28%
      - GSM8K: 35%
      - HumanEval: 8%
      - Perplexity: 8.2
    
    ✅ Cost:
      - Total: $100
      - Time: ~6 hours
    
    ✅ Comparable to GPT-2!
      - With 1/4 the parameters
      - Built from scratch
      - Full understanding

    Next Steps

    You now have a model of your own! Here is what comes next:

    1. Experiment

  • # Try different architectures:
    - Deeper models (d26, d32)
    - Different aspect ratios
    - More/less heads
    
    # Try different data:
    - Domain-specific (medical, legal)
    - Multilingual
    - Code-heavy
    
    # Try different training:
    - Longer training
    - Different optimizers
    - Different schedules

    2. Improve

  • # Better tokenizer:
    - Larger vocabulary
    - Language-specific merges
    
    # Better training:
    - Curriculum learning
    - Data filtering
    - Hyperparameter tuning
    
    # Better evaluation:
    - More benchmarks
    - Human evaluation
    - A/B testing

    3. Share

  • # Open source!
    - Push to GitHub
    - Write documentation
    - Share on HuggingFace
    
    # Community:
    - Blog posts
    - Twitter/X
    - YouTube tutorials

    Key Takeaways

    Evaluation = measuring real capability
    - Perplexity: confidence
    - MMLU: knowledge
    - GSM8K: reasoning
    - HumanEval: coding
    - CORE: aggregate

    Deployment = real-world use
    - FastAPI + WebSocket
    - Streaming generation
    - Docker containerization

    nanochat d20:
    - 370M parameters
    - $100, 4 hours
    - CORE: 40
    - Comparable to GPT-2!

    Complete understanding:
    - Every line of code
    - Every design decision
    - Every optimization

    Closing Thoughts

    Through this series, we:

    1. Started with tokenization,
    2. Built a Transformer,
    3. Trained it,
    4. Optimized it,
    5. Fine-tuned it,
    6. Evaluated it,
    7. And deployed it!

    For just $100 and about 6 hours, we built a GPT-2-class model. 🎉

    LLMs are no longer magic. You now understand every piece of them.

    Go build something amazing! 🚀

    ---

    Series complete! 🎓

    All of the code is available on the [nanochat GitHub](https://github.com/karpathy/nanochat).

    Questions and feedback are welcome!

    📘 Note

    This post is based on Andrej Karpathy's open-source nanochat project and was written to analyze and explain its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.

    Tags: #Evaluation #MMLU #GSM8K #HumanEval #Deployment #FastAPI #nanochat #Complete