
RAG Evaluation: Beyond Precision/Recall

"How do I know if my RAG is working?" — Precision/Recall aren't enough. You need to measure Faithfulness, Relevance, and Context Recall to see the real quality.


Why Traditional Metrics Fall Short

Traditional IR (Information Retrieval) metrics:

| Metric | Measures | Limitation in RAG |
| --- | --- | --- |
| **Precision@K** | Relevant docs in top K | May not correlate with answer quality |
| **Recall@K** | Retrieved relevant docs / all relevant | Requires ground truth, often impractical |
| **MRR** | Rank of first relevant doc | Meaningless when multiple docs are needed |

Problem: These metrics can't distinguish good retrieval followed by a bad answer from mediocre retrieval followed by a good answer.

Case 1: Good Retrieval, Bad Answer — three relevant documents retrieved (high precision), but the LLM distorts their content in the answer (hallucination)

Case 2: Mediocre Retrieval, Good Answer — only one relevant document retrieved (low precision), but that document was enough for an accurate answer
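
To make the failure concrete, here is a minimal sketch with hypothetical data: Precision@K rewards Case 1 and penalizes Case 2, even though Case 2 is the one that produced a correct answer.

```python
def precision_at_k(relevance_flags: list, k: int) -> float:
    """relevance_flags[i] is True if the i-th retrieved doc is relevant."""
    return sum(relevance_flags[:k]) / k

# Hypothetical cases mirroring the two scenarios above
case_1 = {"relevance_flags": [True, True, True], "answer_is_faithful": False}   # hallucinated answer
case_2 = {"relevance_flags": [True, False, False], "answer_is_faithful": True}  # accurate answer

print(precision_at_k(case_1["relevance_flags"], 3))  # 1.00 -> looks "good"
print(precision_at_k(case_2["relevance_flags"], 3))  # 0.33 -> looks "bad"
# Yet only case_2 produced a trustworthy answer.
```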

The Three Axes of RAG Evaluation

RAG systems should be evaluated on three axes:

Query → Retrieval → Generation

  • Retrieval stage → Context Quality (Context Recall, Context Precision)
  • Generation stage → Answer Quality (Faithfulness, Answer Relevance)

1. Context Quality

How well do retrieved documents match the question?

  • Context Recall: Was necessary information retrieved?
  • Context Precision: What fraction of retrieved docs are actually useful?

2. Answer Quality

How good is the generated answer?

  • Faithfulness: Is the answer grounded in retrieved documents? (Hallucination check)
  • Answer Relevance: Does the answer address the question?

3. End-to-End Quality

Final quality of the entire pipeline

  • Answer Correctness: Is the answer actually correct? (Requires ground truth)

Core Metrics Deep Dive

1. Faithfulness

Are all claims in the answer supported by retrieved documents?
```python
from typing import List

def compute_faithfulness(answer: str, contexts: List[str]) -> float:
    """
    1. Extract individual claims from answer
    2. Check if each claim is supported by context
    3. Return ratio of supported claims
    """
    # extract_claims / is_supported_by_context are helper functions
    # (typically NLI- or LLM-backed); a naive sketch of the latter follows below.
    claims = extract_claims(answer)
    supported = 0

    for claim in claims:
        if is_supported_by_context(claim, contexts):
            supported += 1

    return supported / len(claims) if claims else 0.0
```
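
The helpers above are deliberately abstract. As a rough, purely illustrative stand-in (production systems typically use an NLI model or an LLM judge, as in the LLM-as-Judge section later), `is_supported_by_context` could be approximated with token overlap:

```python
from typing import List

def is_supported_by_context(claim: str, contexts: List[str], threshold: float = 0.6) -> bool:
    """Naive token-overlap check; the 0.6 threshold is an arbitrary assumption."""
    claim_tokens = set(claim.lower().split())
    for ctx in contexts:
        ctx_tokens = set(ctx.lower().split())
        overlap = len(claim_tokens & ctx_tokens) / max(len(claim_tokens), 1)
        if overlap >= threshold:
            return True
    return False
```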

Example:

Context: "Tesla cut prices by up to 20% on January 13, 2023."

Answer: "Tesla cut prices by 20% in January 2023, which caused competitors to lower their prices too."

Claims:

  • "Tesla cut prices in January 2023" → Supported ✓
  • "Prices cut by 20%" → Supported ✓
  • "Competitors lowered prices" → Not in context ✗

Faithfulness = 2/3 = 0.67

Why it matters: Low Faithfulness = Hallucination risk

2. Answer Relevance

Does the answer actually answer the question?
```python
import numpy as np

def compute_answer_relevance(question: str, answer: str) -> float:
    """
    1. Generate questions from the answer
    2. Measure similarity between original and generated questions
    """
    # Guess the question from just the answer
    # (generate_questions_from_answer is an LLM-backed helper, not shown)
    generated_questions = generate_questions_from_answer(answer, n=3)

    # Similarity to original question
    similarities = [
        semantic_similarity(question, gen_q)
        for gen_q in generated_questions
    ]

    return float(np.mean(similarities))
```
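
One way to implement `semantic_similarity` is with sentence embeddings and cosine similarity; a minimal sketch (the model name is an assumption, any embedding model will do):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def semantic_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the two sentence embeddings."""
    emb_a, emb_b = _embedder.encode([text_a, text_b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```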

Example:

Question: "Who is Tesla's CEO?"

Answer: "Tesla is an electric vehicle company."

Generated Questions from Answer:

  • "What kind of company is Tesla?"
  • "What are electric vehicle companies?"

Low similarity to original → Low Answer Relevance

Why it matters: Detects when LLM ignored the question or gave tangential answer

3. Context Recall

Is the information needed for the answer present in retrieved documents?
```python
from typing import List

def compute_context_recall(
    ground_truth: str,
    contexts: List[str]
) -> float:
    """
    1. Extract key statements from ground truth
    2. Check if each statement is supported by context
    """
    # extract_statements / supports are LLM-backed helpers (not shown)
    gt_statements = extract_statements(ground_truth)
    attributed = 0

    for statement in gt_statements:
        if any(supports(ctx, statement) for ctx in contexts):
            attributed += 1

    return attributed / len(gt_statements) if gt_statements else 0.0
```

Example:

Ground Truth: "Sam Altman was fired on November 17, 2023, and returned on November 22."

Contexts Retrieved:

  • [1] "Sam Altman fired from OpenAI (2023-11-17)"
  • [2] "Microsoft CEO expressed support for Sam Altman"

Ground Truth Statements:

  • "Sam Altman fired 2023-11-17" → Supported by Context 1 ✓
  • "Sam Altman returned 2023-11-22" → Not found ✗

Context Recall = 1/2 = 0.5

Why it matters: Directly measures retrieval failure (missing necessary docs)

4. Context Precision

What fraction of retrieved documents actually contributed to the answer?
```python
from typing import List

def compute_context_precision(
    question: str,
    answer: str,
    contexts: List[str]
) -> float:
    """
    Check whether each retrieved context contributed to the answer
    """
    # contributes_to_answer is an LLM-backed helper (not shown)
    useful = 0

    for ctx in contexts:
        if contributes_to_answer(ctx, question, answer):
            useful += 1

    return useful / len(contexts) if contexts else 0.0
```

Why it matters: Too much noise confuses the LLM → degrades answer quality

Relationship Between Metrics

Question → Context Quality (Recall, Precision) → Answer Quality (Faithfulness, Relevance) → Answer Correctness

| Situation | Context Recall | Faithfulness | Diagnosis |
| --- | --- | --- | --- |
| Retrieval failure | Low | High | Improve retrieval |
| Hallucination | High | Low | Improve prompt/model |
| Both low | Low | Low | Check entire pipeline |
| Ideal | High | High | Working correctly |
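
This table translates directly into a simple triage rule. A minimal sketch (the 0.7 cutoff is an arbitrary assumption; calibrate it on your own data):

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Map the two scores onto the diagnosis table above."""
    recall_ok = context_recall >= threshold
    faithful_ok = faithfulness >= threshold
    if not recall_ok and faithful_ok:
        return "Retrieval failure: improve retrieval"
    if recall_ok and not faithful_ok:
        return "Hallucination: improve prompt/model"
    if not recall_ok and not faithful_ok:
        return "Both low: check the entire pipeline"
    return "Working correctly"
```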

Practical Implementation: Using RAGAS

RAGAS is a framework for RAG evaluation that computes these metrics easily.

Installation and Basic Usage

```python
# pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["Who is Tesla's CEO?"],
    "answer": ["Elon Musk is Tesla's CEO."],
    "contexts": [["Elon Musk is Tesla's CEO and founder."]],
    "ground_truth": ["Elon Musk"]  # Needed for Context Recall
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation (RAGAS uses an LLM judge under the hood,
# so the corresponding API key must be configured)
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
    ]
)

print(results)
```
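
The returned object holds the aggregate scores; in recent RAGAS versions it can typically also be converted to a DataFrame for per-sample error analysis (the exact API may vary by version):

```python
# Per-sample scores for error analysis (API may differ across RAGAS versions)
df = results.to_pandas()
print(df.head())
```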

Batch Evaluation

```python
from typing import List, Optional

import pandas as pd

def evaluate_rag_batch(
    questions: List[str],
    rag_system,
    ground_truths: Optional[List[str]] = None
) -> pd.DataFrame:
    """Evaluate a RAG system on multiple questions"""

    results = []
    for i, question in enumerate(questions):
        # Run RAG
        answer, contexts = rag_system.query(question)

        # Evaluate
        result = {
            "question": question,
            "answer": answer,
            "faithfulness": compute_faithfulness(answer, contexts),
            "relevance": compute_answer_relevance(question, answer),
            "context_precision": compute_context_precision(
                question, answer, contexts
            ),
        }

        if ground_truths:
            result["context_recall"] = compute_context_recall(
                ground_truths[i], contexts
            )

        results.append(result)

    return pd.DataFrame(results)
```
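
Assuming a RAG object that exposes query(question) -> (answer, contexts), the interface used above (my_rag below is a placeholder), usage looks like:

```python
# Hypothetical wiring; my_rag is whatever RAG pipeline you are testing
df = evaluate_rag_batch(
    questions=["Who is Tesla's CEO?", "When was OpenAI founded?"],
    rag_system=my_rag,
)
print(df[["question", "faithfulness", "relevance", "context_precision"]])
```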

Evaluation Without Ground Truth: LLM-as-Judge

When ground truth is unavailable, use an LLM as evaluator.

Faithfulness Evaluation

```python
FAITHFULNESS_PROMPT = """
Given the context and answer, determine if each claim in the answer
is supported by the context.

Context:
{context}

Answer:
{answer}

For each claim in the answer, respond with:
- Claim: [the claim]
- Verdict: [Supported/Not Supported]
- Evidence: [quote from context if supported]

Finally, provide the overall faithfulness score (0-1).
"""

def llm_faithfulness(answer: str, context: str, llm) -> float:
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    response = llm.generate(prompt)
    return parse_faithfulness_score(response)
```
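
parse_faithfulness_score is left undefined above. Since the prompt asks the judge to state the overall score last, a minimal sketch can grab the final number in the response (a structured output format such as JSON is more robust in practice):

```python
import re

def parse_faithfulness_score(response: str) -> float:
    """Extract the last number in the judge's response and clamp it to [0, 1]."""
    matches = re.findall(r"\d*\.?\d+", response)
    if not matches:
        raise ValueError("No numeric score found in judge response")
    return max(0.0, min(1.0, float(matches[-1])))
```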

Answer Relevance Evaluation

```python
RELEVANCE_PROMPT = """
Given the question and answer, rate how relevant the answer is
to the question on a scale of 0-1.

Question: {question}
Answer: {answer}

Consider:
- Does the answer address the question directly?
- Is the answer complete?
- Is there irrelevant information?

Score (0-1):
Reasoning:
"""

def llm_relevance(question: str, answer: str, llm) -> float:
    prompt = RELEVANCE_PROMPT.format(question=question, answer=answer)
    response = llm.generate(prompt)
    return parse_relevance_score(response)
```

Evaluation Strategy: When to Measure What

Metrics by Development Stage

| Stage | Primary Metrics | Purpose |
| --- | --- | --- |
| **Prototype** | Faithfulness, Relevance | Quick feedback |
| **Retrieval Tuning** | Context Recall, Precision | Improve retrieval |
| **Prompt Tuning** | Faithfulness | Reduce hallucination |
| **Production** | All + Latency | Comprehensive monitoring |

Evaluation Set Design

```python
# Include diverse question types
eval_set = {
    "simple": [  # Single doc sufficient
        "Who is Tesla's CEO?",
        "When was OpenAI founded?",
    ],
    "multi_hop": [  # Multiple docs needed
        "What did Microsoft's CEO say when OpenAI's CEO was fired?",
    ],
    "temporal": [  # Time reasoning required
        "Who was CEO before Sam Altman returned?",
    ],
    "comparison": [  # Comparison questions
        "Which sold more in 2023, Tesla or BYD?",
    ],
    "unanswerable": [  # Cannot be answered
        "What are Tesla's 2025 sales figures?",
    ]
}
```

Automated Evaluation Pipeline

```python
from datetime import datetime
from typing import Dict, List

import numpy as np

class RAGEvaluator:
    def __init__(self, rag_system, llm_judge):
        self.rag = rag_system
        self.judge = llm_judge
        self.metrics_history = []

    def evaluate(self, eval_set: Dict[str, List[str]]) -> Dict:
        results = {}

        for category, questions in eval_set.items():
            category_results = []

            for question in questions:
                answer, contexts = self.rag.query(question)

                # The compute_* methods wrap the metric functions defined earlier
                # (or their LLM-as-Judge variants via self.judge)
                metrics = {
                    "faithfulness": self.compute_faithfulness(answer, contexts),
                    "relevance": self.compute_relevance(question, answer),
                    "context_precision": self.compute_precision(
                        question, answer, contexts
                    ),
                }
                category_results.append(metrics)

            results[category] = {
                "avg_faithfulness": np.mean([r["faithfulness"] for r in category_results]),
                "avg_relevance": np.mean([r["relevance"] for r in category_results]),
                "avg_precision": np.mean([r["context_precision"] for r in category_results]),
            }

        self.metrics_history.append({
            "timestamp": datetime.now(),
            "results": results
        })

        return results

    def compare_versions(self, v1_results: Dict, v2_results: Dict) -> Dict:
        """Compare two versions of the RAG system (positive delta = improvement)"""
        comparison = {}
        for category in v1_results:
            comparison[category] = {
                metric: v2_results[category][metric] - v1_results[category][metric]
                for metric in v1_results[category]
            }
        return comparison
```
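
A sketch of how this class might be wired up; my_rag and my_judge are hypothetical objects implementing the interfaces assumed above:

```python
evaluator = RAGEvaluator(rag_system=my_rag, llm_judge=my_judge)

baseline = evaluator.evaluate(eval_set)
# ...change the retriever, prompt, or model, then re-evaluate...
candidate = evaluator.evaluate(eval_set)

# Per-category metric deltas (positive = the new version improved)
print(evaluator.compare_versions(baseline, candidate))
```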

Common Mistakes and Solutions

1. Ground Truth Dependency

Problem: Creating ground truth is so costly that evaluation never happens

Solution: Start with Faithfulness and Answer Relevance, which don't require ground truth

2. Average Trap

Problem: Average Faithfulness is 0.8, but it drops to 0.3 on specific question types

Solution: Evaluate separately by question type, as in the group-by sketch below
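
Assuming the per-question results carry a category label (a hypothetical setup, e.g. added when building the eval set), the breakdown is one line on the DataFrame returned by evaluate_rag_batch:

```python
# results_df: the DataFrame from evaluate_rag_batch, with a "category"
# column added per question (hypothetical setup)
per_type = results_df.groupby("category")[["faithfulness", "relevance"]].mean()
print(per_type)  # a 0.8 overall average can hide a 0.3 category
```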

3. Metric Gaming

Problem: Making answers overly conservative to increase Faithfulness

Solution: Evaluate with Relevance too (detects too-short or tangential answers)

Conclusion

RAG evaluation must separately measure retrieval quality and answer quality.

Core Metrics:

  • Context Quality — Context Recall (Was necessary info retrieved?), Context Precision (Was retrieval noise-free?)
  • Answer Quality — Faithfulness (No hallucination?), Answer Relevance (Did it answer the question?)

Practical Recommendations:

  1. During development: Faithfulness + Relevance (quick feedback)
  2. Retrieval tuning: Context Recall (retrieval quality)
  3. Production: All metrics + per-category analysis

These four metrics let you diagnose exactly where your RAG system is failing.
