
RAG Evaluation: Beyond Precision/Recall

"How do I know if my RAG is working?" — Precision/Recall aren't enough. You need to measure Faithfulness, Relevance, and Context Recall to see the real quality.

Why Traditional Metrics Fall Short

Traditional IR (Information Retrieval) metrics:

| Metric | Measures | Limitation in RAG |
| --- | --- | --- |
| **Precision@K** | Relevant docs in top K | May not correlate with answer quality |
| **Recall@K** | Retrieved relevant docs / all relevant | Requires ground truth, often impractical |
| **MRR** | Rank of first relevant doc | Meaningless when multiple docs are needed |

Problem: These metrics can't distinguish good retrieval with a bad answer from mediocre retrieval with a good answer.

Case 1: Good Retrieval, Bad Answer — 3 relevant documents retrieved (High Precision), but the LLM distorts their content in the answer (hallucination)

Case 2: Mediocre Retrieval, Good Answer — only 1 relevant document retrieved (Low Precision), but that document was enough for an accurate answer

The Three Axes of RAG Evaluation

RAG systems should be evaluated on three axes:

Query → Retrieval → Generation

  • Retrieval stage → Context Quality (Context Recall, Context Precision)
  • Generation stage → Answer Quality (Faithfulness, Answer Relevance)

1. Context Quality

How well do retrieved documents match the question?

  • Context Recall: Was necessary information retrieved?
  • Context Precision: What fraction of retrieved docs are actually useful?

2. Answer Quality

How good is the generated answer?

  • Faithfulness: Is the answer grounded in retrieved documents? (Hallucination check)
  • Answer Relevance: Does the answer address the question?

3. End-to-End Quality

Final quality of the entire pipeline

  • Answer Correctness: Is the answer actually correct? (Requires ground truth)
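
Answer Correctness is the only metric here that needs a reference answer. A minimal sketch, assuming a hypothetical `embed()` helper that maps text to an embedding vector; frameworks like RAGAS additionally weigh factual overlap between extracted claims:

```python
import numpy as np

def compute_answer_correctness(answer: str, ground_truth: str, embed) -> float:
    """Rough correctness proxy: cosine similarity between the generated
    answer and the ground-truth answer (embed() is an assumed helper)."""
    a, g = embed(answer), embed(ground_truth)
    return float(np.dot(a, g) / (np.linalg.norm(a) * np.linalg.norm(g)))
```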

Core Metrics Deep Dive

1. Faithfulness

Are all claims in the answer supported by retrieved documents?
```python
from typing import List

def compute_faithfulness(answer: str, contexts: List[str]) -> float:
    """
    1. Extract individual claims from answer
    2. Check if each claim is supported by context
    3. Return ratio of supported claims
    """
    claims = extract_claims(answer)
    supported = 0

    for claim in claims:
        if is_supported_by_context(claim, contexts):
            supported += 1

    return supported / len(claims) if claims else 0
```

Example:

Context: "Tesla cut prices by up to 20% on January 13, 2023."

Answer: "Tesla cut prices by 20% in January 2023, which caused competitors to lower their prices too."

Claims:

  • "Tesla cut prices in January 2023" → Supported ✓
  • "Prices cut by 20%" → Supported ✓
  • "Competitors lowered prices" → Not in context ✗

Faithfulness = 2/3 = 0.67

Why it matters: Low Faithfulness = Hallucination risk
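
Applied to the example above (and assuming the `extract_claims` / `is_supported_by_context` helpers are backed by an LLM), a call would look like this:

```python
contexts = ["Tesla cut prices by up to 20% on January 13, 2023."]
answer = ("Tesla cut prices by 20% in January 2023, "
          "which caused competitors to lower their prices too.")

print(compute_faithfulness(answer, contexts))  # ~0.67 if 2 of 3 claims are supported
```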

2. Answer Relevance

Does the answer actually answer the question?
```python
import numpy as np

def compute_answer_relevance(question: str, answer: str) -> float:
    """
    1. Generate questions from the answer
    2. Measure similarity between original and generated questions
    """
    # Guess the question from just the answer
    generated_questions = generate_questions_from_answer(answer, n=3)

    # Similarity to original question
    similarities = [
        semantic_similarity(question, gen_q)
        for gen_q in generated_questions
    ]

    return np.mean(similarities)
```

Example:

Question: "Who is Tesla's CEO?"

Answer: "Tesla is an electric vehicle company."

Generated Questions from Answer:

  • "What kind of company is Tesla?"
  • "What are electric vehicle companies?"

Low similarity to original → Low Answer Relevance

Why it matters: Detects when the LLM ignored the question or gave a tangential answer
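
The `semantic_similarity` helper above is left abstract. One common choice is cosine similarity over sentence embeddings; the sketch below uses sentence-transformers, but any embedding model works:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # one possible backend

_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between sentence embeddings."""
    a, b = _model.encode([text_a, text_b])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```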

3. Context Recall

Is the information needed for the answer present in retrieved documents?
```python
from typing import List

def compute_context_recall(
    ground_truth: str,
    contexts: List[str]
) -> float:
    """
    1. Extract key statements from ground truth
    2. Check if each statement is supported by context
    """
    gt_statements = extract_statements(ground_truth)
    attributed = 0

    for statement in gt_statements:
        if any(supports(ctx, statement) for ctx in contexts):
            attributed += 1

    return attributed / len(gt_statements) if gt_statements else 0
```

Example:

Ground Truth: "Sam Altman was fired on November 17, 2023, and returned on November 22."

Contexts Retrieved:

  • [1] "Sam Altman fired from OpenAI (2023-11-17)"
  • [2] "Microsoft CEO expressed support for Sam Altman"

Ground Truth Statements:

  • "Sam Altman fired 2023-11-17" → Supported by Context 1 ✓
  • "Sam Altman returned 2023-11-22" → Not found ✗

Context Recall = 1/2 = 0.5

Why it matters: Directly measures retrieval failure (missing necessary docs)
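
The `supports()` check (and likewise `is_supported_by_context` and `contributes_to_answer` in the other metrics) is usually delegated to an LLM. A minimal sketch, assuming a hypothetical `judge_llm` client with a `generate()` method:

```python
SUPPORT_PROMPT = """Context:
{context}

Statement:
{statement}

Does the context support the statement? Answer only "yes" or "no"."""

def supports(context: str, statement: str) -> bool:
    """LLM yes/no judgment: does the context entail the statement?"""
    response = judge_llm.generate(  # judge_llm: your own LLM client (assumed)
        SUPPORT_PROMPT.format(context=context, statement=statement)
    )
    return response.strip().lower().startswith("yes")
```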

4. Context Precision

What fraction of retrieved documents actually contributed to the answer?
```python
from typing import List

def compute_context_precision(
    question: str,
    answer: str,
    contexts: List[str]
) -> float:
    """
    Check if each retrieved context contributed to the answer
    """
    useful = 0

    for ctx in contexts:
        if contributes_to_answer(ctx, question, answer):
            useful += 1

    return useful / len(contexts) if contexts else 0
```

Why it matters: Too much noise confuses the LLM → degrades answer quality

Relationship Between Metrics

Question → Context Quality (Recall, Precision) → Answer Quality (Faithfulness, Relevance) → Answer Correctness

| Situation | Context Recall | Faithfulness | Diagnosis |
| --- | --- | --- | --- |
| Retrieval failure | Low | High | Improve retrieval |
| Hallucination | High | Low | Improve prompt/model |
| Both low | Low | Low | Check entire pipeline |
| Ideal | High | High | Working correctly |
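
The table can be turned into an automated health check. A sketch with an illustrative threshold of 0.7 (tune it to your own data):

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Map metric scores onto the diagnosis table above."""
    low_recall = context_recall < threshold
    low_faithfulness = faithfulness < threshold
    if low_recall and low_faithfulness:
        return "Both low: check entire pipeline"
    if low_recall:
        return "Retrieval failure: improve retrieval"
    if low_faithfulness:
        return "Hallucination: improve prompt/model"
    return "Working correctly"
```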

Practical Implementation: Using RAGAS

RAGAS is a framework for RAG evaluation that computes these metrics easily.

Installation and Basic Usage

```python
# pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["Who is Tesla's CEO?"],
    "answer": ["Elon Musk is Tesla's CEO."],
    "contexts": [["Elon Musk is Tesla's CEO and founder."]],
    "ground_truth": ["Elon Musk"]  # Needed for Context Recall
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ]
)

print(results)
```
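
`results` prints the aggregate scores; for per-sample inspection, most RAGAS versions also expose a DataFrame conversion (check your installed version's API):

```python
# Per-sample scores as a DataFrame (API may differ slightly across versions)
df = results.to_pandas()
print(df[["faithfulness", "answer_relevancy", "context_recall", "context_precision"]])
```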

Batch Evaluation

```python
from typing import List

import pandas as pd

def evaluate_rag_batch(
    questions: List[str],
    rag_system,
    ground_truths: List[str] = None
) -> pd.DataFrame:
    """Evaluate RAG system on multiple questions"""

    results = []
    for i, question in enumerate(questions):
        # Run RAG
        answer, contexts = rag_system.query(question)

        # Evaluate
        result = {
            "question": question,
            "answer": answer,
            "faithfulness": compute_faithfulness(answer, contexts),
            "relevance": compute_answer_relevance(question, answer),
            "context_precision": compute_context_precision(
                question, answer, contexts
            ),
        }

        if ground_truths:
            result["context_recall"] = compute_context_recall(
                ground_truths[i], contexts
            )

        results.append(result)

    return pd.DataFrame(results)
```
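
Usage might look like the following, where `rag_system` stands for any object exposing a `query(question) -> (answer, contexts)` method as assumed above:

```python
questions = ["Who is Tesla's CEO?", "When was OpenAI founded?"]
ground_truths = ["Elon Musk is Tesla's CEO.", "OpenAI was founded in December 2015."]

report = evaluate_rag_batch(questions, rag_system, ground_truths)
print(report[["question", "faithfulness", "relevance", "context_precision"]].round(2))
```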

Evaluation Without Ground Truth: LLM-as-Judge

When ground truth is unavailable, use an LLM as evaluator.

Faithfulness Evaluation

```python
FAITHFULNESS_PROMPT = """
Given the context and answer, determine if each claim in the answer
is supported by the context.

Context:
{context}

Answer:
{answer}

For each claim in the answer, respond with:
- Claim: [the claim]
- Verdict: [Supported/Not Supported]
- Evidence: [quote from context if supported]

Finally, provide the overall faithfulness score (0-1).
"""

def llm_faithfulness(answer: str, context: str, llm) -> float:
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    response = llm.generate(prompt)
    return parse_faithfulness_score(response)
```
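
`parse_faithfulness_score` is left abstract; a simple fallback is to grab the last 0-to-1 number in the judge's response (the relevance parser below can follow the same pattern):

```python
import re

def parse_faithfulness_score(response: str) -> float:
    """Take the last number between 0 and 1 found in the judge's output."""
    matches = re.findall(r"(?:0?\.\d+|[01](?:\.0+)?)", response)
    return float(matches[-1]) if matches else 0.0
```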

Answer Relevance Evaluation

```python
RELEVANCE_PROMPT = """
Given the question and answer, rate how relevant the answer is
to the question on a scale of 0-1.

Question: {question}
Answer: {answer}

Consider:
- Does the answer address the question directly?
- Is the answer complete?
- Is there irrelevant information?

Score (0-1):
Reasoning:
"""

def llm_relevance(question: str, answer: str, llm) -> float:
    prompt = RELEVANCE_PROMPT.format(question=question, answer=answer)
    response = llm.generate(prompt)
    return parse_relevance_score(response)
```

Evaluation Strategy: When to Measure What

Metrics by Development Stage

| Stage | Primary Metrics | Purpose |
| --- | --- | --- |
| **Prototype** | Faithfulness, Relevance | Quick feedback |
| **Retrieval Tuning** | Context Recall, Precision | Improve retrieval |
| **Prompt Tuning** | Faithfulness | Reduce hallucination |
| **Production** | All + Latency | Comprehensive monitoring |

Evaluation Set Design

```python
# Include diverse question types
eval_set = {
    "simple": [  # Single doc sufficient
        "Who is Tesla's CEO?",
        "When was OpenAI founded?",
    ],
    "multi_hop": [  # Multiple docs needed
        "What did Microsoft's CEO say when OpenAI's CEO was fired?",
    ],
    "temporal": [  # Time reasoning required
        "Who was CEO before Sam Altman returned?",
    ],
    "comparison": [  # Comparison questions
        "Which sold more in 2023, Tesla or BYD?",
    ],
    "unanswerable": [  # Cannot be answered
        "What are Tesla's 2025 sales figures?",
    ]
}
```

Automated Evaluation Pipeline

```python
import numpy as np
from datetime import datetime
from typing import Dict, List

class RAGEvaluator:
    # compute_faithfulness / compute_relevance / compute_precision are assumed
    # to wrap the metric functions defined earlier in this post.
    def __init__(self, rag_system, llm_judge):
        self.rag = rag_system
        self.judge = llm_judge
        self.metrics_history = []

    def evaluate(self, eval_set: Dict[str, List[str]]) -> Dict:
        results = {}

        for category, questions in eval_set.items():
            category_results = []

            for question in questions:
                answer, contexts = self.rag.query(question)

                metrics = {
                    "faithfulness": self.compute_faithfulness(answer, contexts),
                    "relevance": self.compute_relevance(question, answer),
                    "context_precision": self.compute_precision(
                        question, answer, contexts
                    ),
                }
                category_results.append(metrics)

            results[category] = {
                "avg_faithfulness": np.mean([r["faithfulness"] for r in category_results]),
                "avg_relevance": np.mean([r["relevance"] for r in category_results]),
                "avg_precision": np.mean([r["context_precision"] for r in category_results]),
            }

        self.metrics_history.append({
            "timestamp": datetime.now(),
            "results": results
        })

        return results

    def compare_versions(self, v1_results: Dict, v2_results: Dict) -> Dict:
        """Compare two versions of RAG system"""
        comparison = {}
        for category in v1_results:
            comparison[category] = {
                metric: v2_results[category][metric] - v1_results[category][metric]
                for metric in v1_results[category]
            }
        return comparison
```
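
Putting it together, comparing two versions of a system might look like this (`rag_v1`, `rag_v2`, and `judge_llm` are placeholders for your own components; `eval_set` is the dictionary defined above):

```python
evaluator = RAGEvaluator(rag_v1, judge_llm)
v1_results = evaluator.evaluate(eval_set)

evaluator.rag = rag_v2  # swap in the new retrieval/generation stack
v2_results = evaluator.evaluate(eval_set)

# Positive deltas mean v2 improved on that metric for that category
print(evaluator.compare_versions(v1_results, v2_results))
```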

Common Mistakes and Solutions

1. Ground Truth Dependency

Problem: Ground truth is too hard to create, so no evaluation happens

Solution: Start with Faithfulness and Answer Relevance, which don't require ground truth

2. Average Trap

Problem: Average Faithfulness is 0.8 but 0.3 on specific question types

Solution: Evaluate separately by question type

3. Metric Gaming

Problem: Making answers overly conservative to increase Faithfulness

Solution: Evaluate with Relevance too (detects too-short or tangential answers)
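
A simple guard is to monitor a combined score, for example the harmonic mean of Faithfulness and Relevance, which stays high only when both are high:

```python
def combined_score(faithfulness: float, relevance: float) -> float:
    """Harmonic mean: gaming one metric drags the combined score down."""
    if faithfulness + relevance == 0:
        return 0.0
    return 2 * faithfulness * relevance / (faithfulness + relevance)
```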

Conclusion

RAG evaluation must separately measure retrieval quality and answer quality.

Core Metrics:

  • Context Quality — Context Recall (Was necessary info retrieved?), Context Precision (Was retrieval noise-free?)
  • Answer Quality — Faithfulness (No hallucination?), Answer Relevance (Did it answer the question?)

Practical Recommendations:

  1. During development: Faithfulness + Relevance (quick feedback)
  2. Retrieval tuning: Context Recall (retrieval quality)
  3. Production: All metrics + per-category analysis

These four metrics let you diagnose exactly where your RAG system is failing.
