[Analyzing Nanochat] 9. Evaluation and Deployment

Evaluation and Deployment: Completing Your Own ChatGPT

**Series**: Building an LLM with nanochat - Part 9/9 (Final)

The End of the Journey

Congratulations! So far we have:

✅ Implemented a tokenizer (BPE)
✅ Built a GPT model (Attention, RoPE, Flash)
✅ Trained it ($100, 4 hours)
✅ Fine-tuned it (SFT + RL)

Now for the final step: let's evaluate and deploy the model!

Why Does Evaluation Matter?

  • Training loss: 4.5 → 2.3  # It went down!
    
    But is the model actually better? 🤔
    
    # We need standardized benchmarks!

    Loss alone cannot tell us what the model can actually do:
    - Does it have knowledge? (Facts)
    - Can it reason? (Reasoning)
    - Can it write code? (Coding)

    1. Perplexity: The Model's Confidence

    The most basic metric:

  • import torch
    import torch.nn.functional as F
    
    def compute_perplexity(model, text):
        # Encode the text into token ids (BPE tokenizer from Part 1)
        tokens = tokenize(text)
    
        # Forward
        logits = model(tokens[:-1])
        targets = tokens[1:]
    
        # Cross-entropy loss
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    
        # Perplexity = exp(loss)
        perplexity = torch.exp(loss)
        return perplexity.item()

    Interpretation:

  • Perplexity = 100:
      "Model is choosing from ~100 options at each step"
      → Not very confident
    
    Perplexity = 10:
      "Model is choosing from ~10 options"
      → More confident!
    
    Perplexity = 2:
      "Model is almost certain"
      → Very good! (near human level)

    Example:

  • text = "The cat sat on the mat"
    
    Bad model:
      P("cat"|"The") = 0.01  # Surprised!
      P("sat"|"cat") = 0.02  # Very surprised!
      Perplexity: 50
    
    Good model:
      P("cat"|"The") = 0.15  # Expected
      P("sat"|"cat") = 0.20  # Expected
      Perplexity: 5
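
    As a quick check, here is a minimal sketch that reproduces this arithmetic. The per-token probabilities are the illustrative numbers from the example above, extended with two assumed values; they are not real model outputs:

  • import math
    
    # Hypothetical per-token probabilities for the "good model" above
    probs = [0.15, 0.20, 0.18, 0.22]  # last two values are assumed for illustration
    
    # Perplexity = exp(average negative log-likelihood)
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    perplexity = math.exp(avg_nll)
    
    print(round(perplexity, 1))  # ≈ 5.4 → "choosing from ~5 options per step"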

    nanochat results:

  • Base model (after pretraining):
      Perplexity on FineWeb test: 8.2
    
    After SFT:
      Perplexity on FineWeb test: 8.5 (slightly worse - expected!)
      Perplexity on SmolTalk test: 3.1 (much better on chat!)

    2. MMLU: Knowledge Test

    MMLU = Massive Multitask Language Understanding

    57 subjects (high school → graduate level):
    - Elementary Mathematics
    - US History
    - Computer Science
    - Medical Genetics
    - International Law
    - ...

    Format:

  • Question: What is the capital of France?
    A) London
    B) Berlin
    C) Paris
    D) Madrid
    
    Correct: C

    Evaluation:

  • def eval_mmlu(model, dataset):
        correct = 0
        total = 0
    
        for question in dataset:
            # Format as prompt
            prompt = f"""
    Question: {question['question']}
    A) {question['choices'][0]}
    B) {question['choices'][1]}
    C) {question['choices'][2]}
    D) {question['choices'][3]}
    
    Answer:"""
    
            # Generate (single token!)
            response = model.generate(prompt, max_tokens=1)
    
            # Check
            if response.strip() == question['answer']:
                correct += 1
            total += 1
    
        accuracy = correct / total * 100
        return accuracy
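
    For reference, here is a single dataset item in the shape the sketch above assumes (the field names follow the sketch, not necessarily the official MMLU schema):

  • # Hypothetical MMLU-style item matching eval_mmlu's expected fields
    sample = {
        "question": "What is the capital of France?",
        "choices": ["London", "Berlin", "Paris", "Madrid"],
        "answer": "C",
    }
    
    # accuracy = eval_mmlu(model, [sample])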

    nanochat results:

  • Random guessing: 25% (4 choices)
    
    d20 (370M, $100):  28%  # Slightly better than random
    d26 (770M, $300):  32%
    d32 (1.9B, $1000): 36%
    
    # For comparison:
    GPT-2 (1.5B): ~30%
    GPT-3.5: 70%
    GPT-4: 86%
    
    # The larger nanochat models (d26, d32) beat GPT-2! 🎉

    Per-subject breakdown:

  • nanochat d20 performance:
    Elementary Math: 35%  (good!)
    History: 28%
    Computer Science: 31%
    Medical Knowledge: 24%  (struggles)
    Law: 26%

    3. GSM8K: Math Reasoning

    GSM8K = Grade School Math 8K

    Grade-school-level word problems:

  • Problem:
    "If John has 5 apples and buys 3 more,
    then gives 2 to Mary, how many does he have?"
    
    Solution:
    John starts: 5
    Buys: +3 → 8
    Gives: -2 → 6
    Answer: 6

    Evaluation:

  • def eval_gsm8k(model, dataset):
        correct = 0
    
        for problem in dataset:
            # Generate solution
            response = model.generate(
                problem['question'],
                max_tokens=200
            )
    
            # Extract answer (regex)
            predicted = extract_number(response)
            ground_truth = problem['answer']
    
            if predicted == ground_truth:
                correct += 1
    
        return correct / len(dataset) * 100

    Answer extraction:

  • import re
    
    def extract_number(text):
        # Look for patterns:
        # "The answer is 42"
        # "#### 42"
        # "= 42"
    
        patterns = [
            r'#### (\d+)',
            r'[Aa]nswer is (\d+)',
            r'= (\d+)',
        ]
    
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return int(match.group(1))
    
        return None
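
    A few quick sanity checks of the extraction patterns (illustrative strings only):

  • assert extract_number("Remaining: 12 - 5 = 7\n#### 7") == 7   # '#### N' pattern
    assert extract_number("The answer is 42") == 42               # 'answer is N' pattern
    assert extract_number("I am not sure.") is None               # no number found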

    nanochat results:

  • Before RL: 20%
    After RL: 35%  # +15% improvement!
    
    # For comparison:
    GPT-2: ~5%
    GPT-3.5: 57%
    GPT-4: 92%
    
    # RL training works! ✅

    Example outputs:

  • Problem: "A baker makes 12 cupcakes. He sells 5. How many left?"
    
    Bad response (before RL):
    "The baker has cupcakes. Some were sold."
    → Reward: 0.0
    
    Good response (after RL):
    "Let's solve step by step:
    Starting cupcakes: 12
    Sold: 5
    Remaining: 12 - 5 = 7
    #### 7"
    → Reward: 1.2 (correct + format!)
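
    A minimal sketch of how a combined reward like this could be computed. The weights are assumptions chosen for illustration (a 1.0 correctness reward plus a 0.2 format bonus reproduces the 1.2 above), not the exact nanochat values:

  • import re
    
    def gsm8k_reward(response: str, ground_truth: int) -> float:
        """Correctness reward plus a small bonus for the '#### <answer>' format."""
        reward = 0.0
        match = re.search(r'#### (\d+)', response)
        if match:
            reward += 0.2                          # format bonus (assumed weight)
            if int(match.group(1)) == ground_truth:
                reward += 1.0                      # correct answer (assumed weight)
        return reward
    
    # gsm8k_reward("Remaining: 12 - 5 = 7\n#### 7", 7) → returns 1.2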

    4. HumanEval: Code Generation

    HumanEval = Python function completion

    164 programming problems.

    Format:

  • def is_prime(n):
        """
        Return True if n is a prime number.
    
        >>> is_prime(7)
        True
        >>> is_prime(10)
        False
        """
        # Model must complete this function!

    Evaluation:

  • def eval_humaneval(model, dataset):
        passed = 0
    
        for problem in dataset:
            # Generate code
            code = model.generate(problem['prompt'], max_tokens=200)
    
            # Run test cases
            try:
                # Build the full function: prompt (stub + docstring) + generated completion,
                # and execute it in an isolated namespace (exec() inside a function
                # would not expose the new names to eval() otherwise)
                namespace = {}
                exec(problem['prompt'] + code, namespace)
    
                # Run all tests against that namespace
                for test in problem['tests']:
                    result = eval(test, namespace)
                    assert result, f"Test failed: {test}"
    
                passed += 1
            except Exception as e:
                # Code didn't work
                print(f"Failed: {e}")
    
        return passed / len(dataset) * 100
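
    For reference, one problem in the simplified shape the sketch above assumes. (The official HumanEval dataset instead ships a check() test function per problem; the list of boolean test expressions here is a simplification.)

  • # Hypothetical entry matching eval_humaneval's expected fields
    problem = {
        "prompt": 'def is_prime(n):\n    """Return True if n is a prime number."""\n',
        "tests": [
            "is_prime(7) == True",
            "is_prime(10) == False",
        ],
    }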

    nanochat results:

  • d20: 8%   # Very few!
    d32: 15%
    
    # For comparison:
    GPT-2: ~0%
    GPT-3.5: 48%
    GPT-4: 67%
    Codex: 72%
    
    # Coding is hard! Need more code training.

    Why So Low?

  • # nanochat training data:
    FineWeb: Mostly natural language text
    Code percentage: ~5%
    
    # Codex training data:
    GitHub: Mostly code
    Code percentage: ~80%
    
    → More code data = better coding!

    5. CORE: Aggregate Score

    CORE = nanochat's aggregate metric

  • # Simplified view: a weighted sum of benchmark scores
    CORE = sum([
        MMLU * 0.25,      # Knowledge
        GSM8K * 0.25,     # Math reasoning
        ARC * 0.20,       # Common sense
        HumanEval * 0.15, # Coding
        LM_score * 0.15   # Language modeling (perplexity converted to a higher-is-better score)
    ])

    nanochat CORE scores:

  • d20 (370M, $100):  40
    d26 (770M, $300):  50
    d32 (1.9B, $1000): 58
    
    # For comparison:
    GPT-2 (1.5B): ~45
    GPT-3.5: ~75
    GPT-4: ~90
    
    # d32 clearly beats GPT-2 at a comparable parameter count! 🎉

    Deployment: Web Serving

    Now let's put the model in front of users!

    FastAPI + WebSocket

  • from fastapi import FastAPI, WebSocket
    from nanochat.engine import Engine
    
    app = FastAPI()
    
    # Load model
    engine = Engine.from_pretrained('nanochat-d20')
    
    @app.websocket("/chat")
    async def chat(websocket: WebSocket):
        await websocket.accept()
    
        while True:
            # Receive message
            message = await websocket.receive_text()
    
            # Format prompt
            prompt = f"<|user_start|>{message}<|user_end|><|assistant_start|>"
    
            # Generate streaming
            async for token in engine.generate_stream(prompt):
                await websocket.send_json({
                    "token": token,
                    "done": False
                })
    
            # Send done signal
            await websocket.send_json({"done": True})
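
    To try this endpoint without the React frontend below, here is a minimal client sketch. It assumes the third-party `websockets` package and a server running on localhost:8000:

  • import asyncio
    import json
    import websockets  # pip install websockets
    
    async def chat_once(message: str):
        async with websockets.connect("ws://localhost:8000/chat") as ws:
            await ws.send(message)
            while True:
                data = json.loads(await ws.recv())
                if data.get("done"):
                    break
                print(data["token"], end="", flush=True)  # print tokens as they arrive
    
    asyncio.run(chat_once("Explain quantum computing"))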

    Streaming Generation

    Users watch tokens appear in real time as they are generated:

  • class Engine:
        async def generate_stream(self, prompt, max_tokens=500):
            tokens = self.tokenizer.encode(prompt)
            kv_cache = None
    
            for _ in range(max_tokens):
                # First step: prefill with the full prompt; afterwards, only the newest token
                input_tokens = tokens if kv_cache is None else tokens[-1:]
                logits, kv_cache = self.model(
                    input_tokens,
                    kv_cache=kv_cache
                )
    
                next_token = self.sample(logits)
                tokens.append(next_token)
    
                # Decode and yield
                text = self.tokenizer.decode([next_token])
    
                if next_token == self.tokenizer.eos_token_id:
                    break
    
                yield text  # Stream to user!
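
    The `self.sample` call above is left undefined in this post; below is a minimal temperature / top-k sampling sketch it could correspond to (the default values 0.8 and 50 are assumptions):

  • import torch
    
    def sample(logits, temperature=0.8, top_k=50):
        # Take the logits of the last position; temperature rescales confidence
        logits = logits[-1] / temperature
    
        # Keep only the top-k candidate tokens
        topk_vals, topk_idx = torch.topk(logits, top_k)
        probs = torch.softmax(topk_vals, dim=-1)
    
        # Sample one token id from the truncated distribution
        choice = torch.multinomial(probs, num_samples=1)
        return topk_idx[choice].item()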

    User experience:

  • User: "Explain quantum computing"
    
    Streaming output (as it appears):
    "Q" → "Qu" → "Qua" → "Quan" → "Quant" → "Quantum" → ...
    
    Instead of waiting 10 seconds for full response!

    React Frontend

  • const ws = new WebSocket('ws://localhost:8000/chat');
    
    function sendMessage(message) {
      ws.send(message);
    }
    
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
    
      if (data.done) {
        console.log('Response complete!');
      } else {
        // Append token to UI
        appendToChat(data.token);
      }
    };

    Docker Deployment

  • # Dockerfile
    FROM pytorch/pytorch:2.0.0-cuda11.8
    
    WORKDIR /app
    
    # Install dependencies
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    # Copy model
    COPY model/ /app/model/
    
    # Copy code
    COPY nanochat/ /app/nanochat/
    COPY scripts/ /app/scripts/
    
    # Run server
    CMD ["uvicorn", "scripts.chat_web:app", "--host", "0.0.0.0", "--port", "8000"]
  • # Build
    docker build -t nanochat:latest .
    
    # Run
    docker run -p 8000:8000 --gpus all nanochat:latest
    
    # Access at http://localhost:8000

    Production Checklist

  • # ✅ 1. Model optimization
    model = torch.compile(model, mode='max-autotune')
    model.eval()  # Disable dropout, etc.
    
    # ✅ 2. Batch inference (multiple users)
    def batch_generate(prompts):
        # Generate for multiple prompts in parallel
        # → Better GPU utilization
        ...
    
    # ✅ 3. KV cache pooling
    cache_pool = []
    def get_cache():
        return cache_pool.pop() if cache_pool else init_cache()
    
    # ✅ 4. Monitoring
    import time
    start = time.time()
    response = generate(prompt)
    latency = time.time() - start
    log_metric("latency", latency)
    
    # ✅ 5. Rate limiting
    from fastapi import Depends
    from fastapi_limiter.depends import RateLimiter
    @app.get("/chat", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
    
    # ✅ 6. Error handling
    try:
        response = generate(prompt)
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        return {"error": "Sorry, something went wrong"}

    Final Results

    Congratulations! Here is what we built:

  • ✅ A complete LLM pipeline
      - Tokenizer (Rust BPE)
      - GPT model (370M params)
      - Training (4 hours, $100)
      - Fine-tuning (SFT + RL)
      - Evaluation (CORE: 40)
      - Web serving
    
    ✅ Performance:
      - MMLU: 28%
      - GSM8K: 35%
      - HumanEval: 8%
      - Perplexity: 8.2
    
    ✅ Cost:
      - Total: $100
      - Time: ~6 hours
    
    ✅ Comparable to GPT-2!
      - With 1/4 the parameters
      - Built from scratch
      - Full understanding

    Next Steps

    You now have a model of your own! Here is what comes next:

    1. Experiment

  • # Try different architectures:
    - Deeper models (d26, d32)
    - Different aspect ratios
    - More/less heads
    
    # Try different data:
    - Domain-specific (medical, legal)
    - Multilingual
    - Code-heavy
    
    # Try different training:
    - Longer training
    - Different optimizers
    - Different schedules

    2. Improve

  • # Better tokenizer:
    - Larger vocabulary
    - Language-specific merges
    
    # Better training:
    - Curriculum learning
    - Data filtering
    - Hyperparameter tuning
    
    # Better evaluation:
    - More benchmarks
    - Human evaluation
    - A/B testing

    3. Share

  • # Open source!
    - Push to GitHub
    - Write documentation
    - Share on HuggingFace
    
    # Community:
    - Blog posts
    - Twitter/X
    - YouTube tutorials

    Key Takeaways

    Evaluation = measuring real capability
    - Perplexity: confidence
    - MMLU: knowledge
    - GSM8K: reasoning
    - HumanEval: coding
    - CORE: aggregate

    Deployment = real-world use
    - FastAPI + WebSocket
    - Streaming generation
    - Docker containerization

    nanochat d20:
    - 370M parameters
    - $100, 4 hours
    - CORE: 40
    - Comparable to GPT-2!

    Complete understanding:
    - Every line of code
    - Every design decision
    - Every optimization

    Closing Thoughts

    Through this series, we:

    1. Started with tokenization,
    2. Built a Transformer,
    3. Trained it,
    4. Optimized it,
    5. Fine-tuned it,
    6. Evaluated it,
    7. And deployed it!

    For just $100 and about 6 hours, we built a GPT-2-class model. 🎉

    LLMs are no longer magic. You now understand every piece of them.

    Go build something amazing! 🚀

    ---

    Series complete! 🎓

    All of the code is available on the [nanochat GitHub](https://github.com/karpathy/nanochat).

    Questions and feedback are welcome!

    📘 Note

    This post is based on Andrej Karpathy's open-source nanochat project and was written to analyze and explain its code structure and training process. The original code is distributed under the MIT License. All code examples have been simplified for educational purposes.

    Tags: #Evaluation #MMLU #GSM8K #HumanEval #Deployment #FastAPI #nanochat #Complete