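RAG Evaluation: Beyond Precision/Recall

"How do you know whether your RAG system is working well?" Precision/Recall alone cannot answer that. You also need to measure Faithfulness, Relevance, and Context Recall before the real quality becomes visible.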

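Why Traditional Metrics Fall Short

Traditional IR (Information Retrieval) metrics:

Metric	What it measures	Limitation in RAG
Precision@K	Share of relevant documents among the top K	Can be unrelated to answer quality
Recall@K	Share of all relevant documents that were retrieved	Needs ground truth, which is hard to obtain in practice
MRR	Reciprocal rank of the first relevant document	Uninformative when several documents are needed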
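
For reference, the three metrics in the table can be sketched in a few lines. This is a minimal illustration that assumes relevance labels are available as a set of document IDs; the IDs and function names below are hypothetical.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in ranked_ids[:k]) / k


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked_ids[:k]) / len(relevant)


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


ranked = ["d3", "d7", "d1", "d9"]  # hypothetical retrieval order
relevant = {"d1", "d3", "d5"}      # hypothetical relevance labels

p_at_3 = precision_at_k(ranked, relevant, k=3)
r_at_3 = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
```

MRR is the mean of `reciprocal_rank` over a set of queries. None of these functions ever looks at the generated answer, which is why they say nothing about answer quality.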

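Problem: these metrics cannot distinguish a case where retrieval went well but the answer is wrong from a case where retrieval was mediocre but the answer is fine.

Case 1: retrieval good, answer bad. Three relevant documents were retrieved (high Precision), but the LLM distorted their content in its answer (hallucination).

Case 2: retrieval partial, answer good. Only one relevant document was retrieved (low Precision), but the LLM produced an accurate answer from it.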

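The Three Axes of RAG Evaluation

A RAG system should be evaluated along three axes:

Query → Retrieval → Generation

Retrieval stage → Context Quality (Context Recall, Context Precision)
Generation stage → Answer Quality (Faithfulness, Answer Relevance)

1. Context Quality (retrieval quality)

How well do the retrieved documents fit the question?

Context Recall: was the information needed for the answer retrieved?
Context Precision: what share of the retrieved documents is actually useful?

2. Answer Quality

How good is the generated answer?

Faithfulness: is the answer grounded in the retrieved documents? (hallucination check)
Answer Relevance: does the answer address the question?

3. End-to-End Quality

The final quality of the whole pipeline.

Answer Correctness: is the answer actually correct? (requires ground truth)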
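
The text does not define how Answer Correctness is scored. One common, cheap proxy is token-overlap F1 against the ground truth, sketched below. This is an assumption made for illustration, not the definition used by any particular framework, and `token_f1` is a hypothetical name.

```python
import re
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = re.findall(r"\w+", prediction.lower())
    gold = re.findall(r"\w+", ground_truth.lower())
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


score = token_f1("Elon Musk is the CEO of Tesla.", "Elon Musk")
```

A correct but wordy answer scores well below 1.0 here because the extra tokens hurt precision. That is one reason lexical overlap is often replaced or complemented by an LLM-based or embedding-based comparison.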

Core Metrics in Detail

1. Faithfulness

Is every claim in the answer supported by the retrieved documents?

python

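from typing import List


def compute_faithfulness(answer: str, contexts: List[str]) -> float:
    """
    1. Extract the individual claims from the answer.
    2. Check whether each claim is supported by the contexts.
    3. Return the fraction of supported claims.
    """
    # extract_claims and is_supported_by_context are placeholders for
    # LLM-backed helpers; they are not defined in this article.
    claims = extract_claims(answer)
    if not claims:
        return 0.0  # no claims to verify

    supported = sum(
        1 for claim in claims if is_supported_by_context(claim, contexts)
    )
    return supported / len(claims)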

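Example:

Context: "Tesla cut prices by up to 20% on January 13, 2023."

Answer: "Tesla cut prices by 20% in January 2023, and as a result competitors lowered their prices too."

Claims:

"Tesla cut prices in January 2023" → supported ✓
"Prices were cut by 20%" → supported ✓
"Competitors lowered their prices too" → not in the context ✗

Faithfulness = 2/3 = 0.67

Why it matters: a low Faithfulness score means the answer contains claims the retrieved documents do not support, which signals a hallucination risk.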
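
The worked example can be reproduced with a toy support checker. The sketch below assumes the claims have already been extracted by hand, and it uses a naive keyword check (`keyword_support`, a hypothetical helper) in place of the LLM or NLI call a real system would use.

```python
import re
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Share of claims that the support checker accepts."""
    if not claims:
        return 0.0
    return sum(is_supported(c, contexts) for c in claims) / len(claims)


def keyword_support(claim: str, contexts: List[str]) -> bool:
    """Toy checker: every token of the claim must occur in a single context."""
    def tokenize(text: str) -> set:
        return set(re.findall(r"[\w%]+", text.lower()))

    claim_tokens = tokenize(claim)
    return any(claim_tokens <= tokenize(ctx) for ctx in contexts)


contexts = ["Tesla cut prices by up to 20% on January 13, 2023."]
claims = [
    "Tesla cut prices January 2023",  # stated in the context
    "prices cut 20%",                 # stated in the context
    "competitors cut prices",         # not in the context
]

score = faithfulness_score(claims, contexts, keyword_support)
```

Keyword matching is only a stand-in to make the arithmetic concrete. It cannot tell "up to 20%" from "20%", which is exactly the kind of nuance an LLM-based checker is expected to judge.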

2. Answer Relevance

Does the answer actually respond to the question?

python

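import numpy as np


def compute_answer_relevance(question: str, answer: str) -> float:
    """
    1. Generate questions back from the answer.
    2. Measure how similar they are to the original question.
    """
    # Guess the question from the answer alone.
    # generate_questions_from_answer and semantic_similarity are placeholders
    # (for example, an LLM call and an embedding cosine similarity).
    generated_questions = generate_questions_from_answer(answer, n=3)

    # Similarity of each regenerated question to the original question
    similarities = [
        semantic_similarity(question, gen_q)
        for gen_q in generated_questions
    ]

    return float(np.mean(similarities)) if similarities else 0.0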

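Example:

Question: "Who is the CEO of Tesla?"

Answer: "Tesla is an electric car company."

Generated Questions from Answer:

"What kind of company is Tesla?"
"What is an electric car company?"

Low similarity to the original question → low Answer Relevance

Why it matters: it catches cases where retrieval worked but the model ignored the question or answered something else.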
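
To make the similarity step concrete, here is a small sketch that swaps the embedding model for a bag-of-words cosine similarity. That substitution is an assumption made so the example runs with the standard library only, and the regenerated questions are written by hand rather than produced by an LLM.

```python
import math
import re
from collections import Counter
from typing import List


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (toy stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevance(question: str, generated_questions: List[str]) -> float:
    """Mean similarity between the question and the regenerated questions."""
    if not generated_questions:
        return 0.0
    sims = [cosine_bow(question, q) for q in generated_questions]
    return sum(sims) / len(sims)


question = "Who is the CEO of Tesla?"
off_topic = answer_relevance(
    question,
    ["What kind of company is Tesla?", "What is an electric car company?"],
)
on_topic = answer_relevance(
    question,
    ["Who is the CEO of Tesla?", "Who runs Tesla?"],
)
```

Word overlap rewards shared stop words such as "is" and "of", so the gap between the two scores is smaller than a semantic embedding would likely produce. The ordering is what the metric relies on.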

3. Context Recall

Do the retrieved documents contain the information needed for the correct answer?

python

from typing import List


def compute_context_recall(
    ground_truth: str,
    contexts: List[str]
) -> float:
    """
    1. Split the ground truth into its key statements.
    2. Check whether each statement is supported by a retrieved context.
    """
    # extract_statements and supports are placeholders for LLM-backed helpers.
    gt_statements = extract_statements(ground_truth)
    attributed = 0

    for statement in gt_statements:
        if any(supports(ctx, statement) for ctx in contexts):
            attributed += 1

    return attributed / len(gt_statements) if gt_statements else 0.0

Example:

Ground Truth: "Sam Altman was fired on November 17, 2023, and returned on November 22."

Contexts Retrieved:

[1] "Sam Altman was fired from OpenAI (2023-11-17)"
[2] "Microsoft CEO expressed support for Sam Altman"

Ground Truth Statements:

"Sam Altman fired 2023-11-17" → supported by Context 1 ✓
"Sam Altman returned 2023-11-22" → not found ✗

Context Recall = 1/2 = 0.5

Why it matters: it measures retrieval failure directly (the required document was never fetched).
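
The same example can be run end to end with a toy `supports` function. The version below assumes the ground truth has already been split into statements and treats a statement as supported when every one of its tokens appears in a single context. A real implementation would ask an LLM instead.

```python
import re
from typing import List


def tokens(text: str) -> set:
    """Lowercased word tokens; hyphens are kept so dates stay in one piece."""
    return set(re.findall(r"[\w-]+", text.lower()))


def supports(context: str, statement: str) -> bool:
    """Toy check: the context contains every token of the statement."""
    return tokens(statement) <= tokens(context)


def context_recall(gt_statements: List[str], contexts: List[str]) -> float:
    """Share of ground-truth statements backed by at least one context."""
    if not gt_statements:
        return 0.0
    hits = sum(any(supports(c, s) for c in contexts) for s in gt_statements)
    return hits / len(gt_statements)


contexts = [
    "Sam Altman was fired from OpenAI (2023-11-17)",
    "Microsoft CEO expressed support for Sam Altman",
]
gt_statements = [
    "Sam Altman fired 2023-11-17",     # covered by the first context
    "Sam Altman returned 2023-11-22",  # no context mentions the return
]

recall = context_recall(gt_statements, contexts)
```

The score depends only on the ground truth and the retrieved contexts, never on the generated answer, so a low value points at the retriever rather than the LLM.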

4. Context Precision

What share of the retrieved documents was actually used in the answer?

python

from typing import List


def compute_context_precision(
    question: str,
    answer: str,
    contexts: List[str]
) -> float:
    """
    Check whether each retrieved context contributed to the answer.
    """
    # contributes_to_answer is a placeholder for an LLM-backed judgment.
    useful = 0

    for ctx in contexts:
        if contributes_to_answer(ctx, question, answer):
            useful += 1

    return useful / len(contexts) if contexts else 0.0

Why it matters: too many noise documents confuse the LLM, which lowers answer quality.
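
The plain ratio above ignores where the useful contexts sit in the ranking. Some definitions of Context Precision, including the one described in the RAGAS documentation, are rank-aware, so useful contexts near the top score higher. The sketch below illustrates that idea under the assumption that per-context usefulness flags are already known; it is not the library's exact implementation.

```python
from typing import List


def plain_precision(useful_flags: List[bool]) -> float:
    """Share of retrieved contexts that are useful, regardless of order."""
    return sum(useful_flags) / len(useful_flags) if useful_flags else 0.0


def rank_weighted_precision(useful_flags: List[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a useful context."""
    hits = 0
    total = 0.0
    for k, is_useful in enumerate(useful_flags, start=1):
        if is_useful:
            hits += 1
            total += hits / k  # precision@k at this rank
    return total / hits if hits else 0.0


good_order = [True, True, False, False]  # useful contexts ranked first
bad_order = [False, False, True, True]   # useful contexts ranked last

plain_good = plain_precision(good_order)
plain_bad = plain_precision(bad_order)
weighted_good = rank_weighted_precision(good_order)
weighted_bad = rank_weighted_precision(bad_order)
```

Both orderings have the same plain precision, but the rank-weighted score penalizes the ordering that buries the useful contexts at the bottom.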

How the Metrics Relate

Question → Context Quality (Recall, Precision) → Answer Quality (Faithfulness, Relevance) → Answer Correctness

Situation	Context Recall	Faithfulness	Diagnosis
Retrieval failure	Low	High	Improve retrieval
Hallucination	High	Low	Improve the prompt or model
Both low	Low	Low	Review the whole pipeline
Ideal	High	High	Working as intended
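
The table translates directly into a small triage function. The 0.7 cutoff between "low" and "high" is an arbitrary assumption for illustration and should be tuned on your own data.

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Map the two scores onto the four rows of the diagnosis table."""
    recall_ok = context_recall >= threshold
    faithful_ok = faithfulness >= threshold
    if recall_ok and faithful_ok:
        return "ideal: working as intended"
    if not recall_ok and faithful_ok:
        return "retrieval failure: improve retrieval"
    if recall_ok and not faithful_ok:
        return "hallucination: improve the prompt or model"
    return "both low: review the whole pipeline"


verdict = diagnose(context_recall=0.4, faithfulness=0.9)
```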

In Practice: Using RAGAS

RAGAS is a framework for RAG evaluation that makes the metrics above easy to compute.

Installation and Basic Usage

python

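# pip install ragas
# Note: these metrics call an LLM under the hood, so model credentials
# (for example an OpenAI API key) must be configured before running.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Prepare the evaluation data.
# Column names follow the schema of older ragas releases; newer releases
# may expect different names, so check the documentation for your version.
eval_data = {
    "question": ["Who is the CEO of Tesla?"],
    "answer": ["Elon Musk is the CEO of Tesla."],
    "contexts": [["Elon Musk is the CEO and founder of Tesla."]],
    "ground_truth": ["Elon Musk"]  # needed by reference-based metrics such as Context Recall
}

dataset = Dataset.from_dict(eval_data)

# Run the evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ]
)

print(results)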

Batch Evaluation

python

from typing import List, Optional

import pandas as pd


def evaluate_rag_batch(
    questions: List[str],
    rag_system,
    ground_truths: Optional[List[str]] = None
) -> pd.DataFrame:
    """Evaluate a RAG system on a list of questions."""

    results = []
    for i, question in enumerate(questions):
        # Run the RAG system; query() is assumed to return (answer, contexts)
        answer, contexts = rag_system.query(question)

        # Per-question metrics
        result = {
            "question": question,
            "answer": answer,
            "faithfulness": compute_faithfulness(answer, contexts),
            "relevance": compute_answer_relevance(question, answer),
            "context_precision": compute_context_precision(
                question, answer, contexts
            ),
        }

        # Context Recall needs a reference answer for each question
        if ground_truths:
            result["context_recall"] = compute_context_recall(
                ground_truths[i], contexts
            )

        results.append(result)

    return pd.DataFrame(results)

Evaluating Without Ground Truth: LLM-as-Judge

When no ground truth is available, an LLM can act as the evaluator.

Faithfulness Evaluation

python

FAITHFULNESS_PROMPT = """
Given the context and answer, determine if each claim in the answer
is supported by the context.

Context:
{context}

Answer:
{answer}

For each claim in the answer, respond with:
- Claim: [the claim]
- Verdict: [Supported/Not Supported]
- Evidence: [quote from context if supported]

Finally, provide the overall faithfulness score (0-1).
"""

def llm_faithfulness(answer: str, context: str, llm) -> float:
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    response = llm.generate(prompt)
    return parse_faithfulness_score(response)

Answer Relevance Evaluation

python

RELEVANCE_PROMPT = """
Given the question and answer, rate how relevant the answer is
to the question on a scale of 0-1.

Question: {question}
Answer: {answer}

Consider:
- Does the answer address the question directly?
- Is the answer complete?
- Is there irrelevant information?

Score (0-1):
Reasoning:
"""

def llm_relevance(question: str, answer: str, llm) -> float:
    prompt = RELEVANCE_PROMPT.format(question=question, answer=answer)
    response = llm.generate(prompt)
    return parse_relevance_score(response)
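
Both judge functions rely on a score parser (`parse_faithfulness_score`, `parse_relevance_score`) that the text leaves undefined. A minimal sketch of one shared helper is shown below. It assumes the judge writes the final number on the same line as the word "score", and it returns `None` instead of guessing when no valid value is found. `parse_score` is a hypothetical name.

```python
import re
from typing import Optional


def parse_score(response: str) -> Optional[float]:
    """Return the last 0-1 number that follows the word 'score', else None."""
    # Drop the literal scale hint so its digits are not mistaken for the score.
    cleaned = response.replace("(0-1)", "")
    matches = re.findall(
        r"score[^\d\n]*(\d+(?:\.\d+)?)", cleaned, flags=re.IGNORECASE
    )
    if not matches:
        return None
    value = float(matches[-1])
    return value if 0.0 <= value <= 1.0 else None


faithfulness_reply = (
    "- Claim: Tesla cut prices\n"
    "- Verdict: Supported\n"
    "- Evidence: \"Tesla cut prices by up to 20%\"\n"
    "\n"
    "Overall faithfulness score (0-1): 0.67"
)
relevance_reply = "Score (0-1): 0.2\nReasoning: The answer does not name the CEO."

faithfulness_value = parse_score(faithfulness_reply)
relevance_value = parse_score(relevance_reply)
```

Free-form judge output is brittle to parse. Asking the judge for structured output such as JSON and validating it is usually more robust than a regular expression.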

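Evaluation Strategy: What to Measure and When

Metrics by Development Stage

Stage	Key metrics	Purpose
Prototype	Faithfulness, Relevance	Fast feedback
Retrieval tuning	Context Recall, Precision	Improve retrieval quality
Prompt tuning	Faithfulness	Reduce hallucination
Production	All metrics + Latency	End-to-end monitoring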

Building the Evaluation Set

python

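# Include a variety of question types
eval_set = {
    "simple": [  # answerable from a single document
        "Who is the CEO of Tesla?",
        "When was OpenAI founded?",
    ],
    "multi_hop": [  # needs several documents
        "What did the Microsoft CEO say when the OpenAI CEO was fired?",
    ],
    "temporal": [  # needs temporal reasoning
        "Who was CEO before Sam Altman returned?",
    ],
    "comparison": [  # comparison question
        "Between Tesla and BYD, which sold more in 2023?",
    ],
    "unanswerable": [  # cannot be answered from the corpus
        "What are Tesla's sales for 2025?",
    ]
}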

Automated Evaluation Pipeline

python

from datetime import datetime
from typing import Dict, List

import numpy as np


class RAGEvaluator:
    def __init__(self, rag_system, llm_judge):
        self.rag = rag_system
        self.judge = llm_judge  # available for LLM-as-judge metrics
        self.metrics_history = []

    def evaluate(self, eval_set: Dict[str, List[str]]) -> Dict:
        results = {}

        for category, questions in eval_set.items():
            category_results = []

            for question in questions:
                answer, contexts = self.rag.query(question)

                # Metric functions are the module-level ones defined earlier.
                metrics = {
                    "faithfulness": compute_faithfulness(answer, contexts),
                    "relevance": compute_answer_relevance(question, answer),
                    "context_precision": compute_context_precision(
                        question, answer, contexts
                    ),
                }
                category_results.append(metrics)

            # Average each metric within the category
            results[category] = {
                "avg_faithfulness": np.mean([r["faithfulness"] for r in category_results]),
                "avg_relevance": np.mean([r["relevance"] for r in category_results]),
                "avg_precision": np.mean([r["context_precision"] for r in category_results]),
            }

        # Keep a timestamped record so runs can be compared over time
        self.metrics_history.append({
            "timestamp": datetime.now(),
            "results": results
        })

        return results

    def compare_versions(self, v1_results: Dict, v2_results: Dict) -> Dict:
        """Compare two versions of a RAG system (positive delta: v2 is better)."""
        comparison = {}
        for category in v1_results:
            comparison[category] = {
                metric: v2_results[category][metric] - v1_results[category][metric]
                for metric in v1_results[category]
            }
        return comparison
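
A per-category delta is most useful when it feeds a regression check. The sketch below is a standalone variant of the comparison idea that lists metrics where the second version got worse. The 0.02 tolerance and the sample numbers are made-up assumptions for illustration.

```python
from typing import Dict, List, Tuple

Results = Dict[str, Dict[str, float]]


def find_regressions(
    v1: Results, v2: Results, tolerance: float = 0.02
) -> List[Tuple[str, str, float]]:
    """List (category, metric, delta) where v2 is worse than v1 by more than tolerance."""
    regressions = []
    for category, metrics in v1.items():
        for metric, old in metrics.items():
            new = v2.get(category, {}).get(metric)
            if new is None:
                continue  # metric missing in v2: skip instead of raising
            delta = new - old
            if delta < -tolerance:
                regressions.append((category, metric, round(delta, 4)))
    return regressions


# Hypothetical results in the same shape that evaluate() returns
v1 = {
    "simple": {"avg_faithfulness": 0.90, "avg_relevance": 0.85},
    "multi_hop": {"avg_faithfulness": 0.70, "avg_relevance": 0.80},
}
v2 = {
    "simple": {"avg_faithfulness": 0.92, "avg_relevance": 0.84},
    "multi_hop": {"avg_faithfulness": 0.55, "avg_relevance": 0.82},
}

regressions = find_regressions(v1, v2)
```

In this sample the overall picture looks stable, yet multi-hop faithfulness dropped by 0.15, which is the kind of change a single averaged number would hide.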

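Common Mistakes and Fixes

1. Over-reliance on Ground Truth

Problem: building ground truth is so laborious that evaluation gets skipped altogether.

Fix: Faithfulness and Relevance can be measured without ground truth.

2. The Averaging Trap

Problem: average Faithfulness is 0.8, but it is 0.3 for one question type.

Fix: evaluate each question type separately.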
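
A short sketch of the per-type breakdown, using made-up scores chosen so that the overall mean is 0.8 while one category sits at 0.3:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def mean_by_category(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average a metric separately for each question category."""
    grouped = defaultdict(list)
    for category, score in rows:
        grouped[category].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


# Hypothetical (category, faithfulness) pairs
rows = [("simple", 0.9)] * 5 + [("multi_hop", 0.3)]

overall = mean(score for _, score in rows)
per_category = mean_by_category(rows)
```

The overall figure looks healthy only because simple questions dominate the set. The per-category view exposes the weak spot.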

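3. Metric Gaming

Problem: answers are generated too conservatively just to push Faithfulness up.

Fix: evaluate Faithfulness together with Relevance, which catches answers that are too short or off-topic.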
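
One way to operationalize this check is to flag answers that score high on Faithfulness but low on Relevance. The thresholds and sample rows below are assumptions for illustration only.

```python
from typing import Dict, List


def flag_conservative_answers(
    rows: List[Dict], faith_min: float = 0.9, rel_max: float = 0.5
) -> List[str]:
    """Return questions whose answers are faithful but barely address the question."""
    return [
        row["question"]
        for row in rows
        if row["faithfulness"] >= faith_min and row["relevance"] <= rel_max
    ]


# Hypothetical per-question scores
rows = [
    {"question": "Who is the CEO of Tesla?", "faithfulness": 1.0, "relevance": 0.9},
    {"question": "When was OpenAI founded?", "faithfulness": 1.0, "relevance": 0.2},
    {"question": "Who runs BYD?", "faithfulness": 0.4, "relevance": 0.8},
]

flagged = flag_conservative_answers(rows)
```

A rising share of flagged questions after a prompt change suggests that the model is hedging or refusing rather than answering.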

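Conclusion

RAG evaluation should measure retrieval quality and answer quality separately.

Core metrics:

Context Quality: Context Recall (was the needed information retrieved?), Context Precision (was it retrieved without noise?)
Answer Quality: Faithfulness (is the answer free of hallucination?), Answer Relevance (does it answer the question?)

Practical recommendations:

During development: Faithfulness + Relevance (fast feedback)
Retrieval tuning: Context Recall (retrieval quality)
Production: all metrics + per-question-type analysis

With these four metrics you can pinpoint which part of a RAG system is causing the problem.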
