The RAG Triad in 2026: Testing with LLM & DeepEval

Deep dives into LLM reliability, evaluation pipelines, and AI workflow orchestration - practical solutions from a Systems Reliability Architect’s perspective.

The Metrics: A Quick Refresher

Before we write code, let's clarify what we are measuring.

Context Recall ("The Net"): Did your retrieval system find the relevant chunk at all? If the answer is in document #50 but you only retrieved the top 5, your Recall is zero.

Context Precision ("The Ranking"): Is the relevant chunk at the top? If the answer is in chunk #1, your precision is perfect. If it's in chunk #5 (buried under 4 irrelevant chunks), your precision drops.

Faithfulness ("The Anchor"): Is the LLM's answer derived solely from the retrieved context? This is your hallucination safety net.

Scenario 1: Precision vs. Recall (The "Needle in the Haystack")

High recall with low precision is dangerous - it means you are flooding GPT-5/DeepSeek/Gemini with noise, increasing latency and cost. High precision with low recall is useless - you are missing the answer entirely.

Here is how to test for both.

Python

import pytest from deepeval import assert_test from deepeval.test_case import LLMTestCase from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric # We use GPT-5 as the judge for our metrics MODEL = "gpt-5" def test_retrieval_quality(): # 1. The Scenario # User asks about "Project Manhattan" input_prompt = "Who was the lead physicist on the Manhattan Project?" expected_output = "J. Robert Oppenheimer" # 2. The Retrieval Simulation # Ideally, our retriever fetches relevant chunks. # Let's simulate a case where the answer is retrieved but buried (Rank 3). retrieved_context = [ "The Manhattan Project cost $2 billion.", # Irrelevant "Los Alamos was the primary site.", # Irrelevant "J. Robert Oppenheimer led the Los Alamos laboratory.", # RELEVANT (Buried) "Trinity was the code name of the first test." # Irrelevant ] test_case = LLMTestCase( input=input_prompt, actual_output="J. Robert Oppenheimer", # What our RAG generated expected_output=expected_output, # The ground truth retrieval_context=retrieved_context ) # 3. The Metrics # Thresholds are strict: we want high recall (found it) and high precision (ranked it). recall_metric = ContextualRecallMetric( threshold=0.7, model=MODEL, include_reason=True ) precision_metric = ContextualPrecisionMetric( threshold=0.5, # Lower threshold because it was rank 3, not rank 1 model=MODEL, include_reason=True ) # 4. The Assertion assert_test(test_case, [recall_metric, precision_metric])

Why this matters

If you run this test, Context Recall will pass (the answer is in the list). However, Context Precision will be lower than 1.0 because the relevant chunk wasn't at the top. This tells you your re-ranker needs work, even if your retriever is fine.

Scenario 2: The "Poisoned Context" Test (Faithfulness)

This is the most critical test for production RAG systems. We are going to intentionally feed GPT-5/DeepSeek/Gemini irrelevant "poison" and assert that it ignores it.

If we ask about the moon, and the context talks about cheese, GPT-5 should answer based only on factual reality (or refuse to answer), depending on your system prompt. But specifically for Faithfulness, we want to ensure the model doesn't hallucinate an answer from the bad context.

Python

from deepeval.metrics import FaithfulnessMetric def test_poisoned_context_handling(): # 1. The Poison Scenario input_prompt = "What is the capital of France?" # We inject POISON into the context. # Completely irrelevant information. poisoned_context = [ "The capital of Mars is Elon City.", "France is known for good cheese.", "Paris is a character in Romeo and Juliet." ] # The Model's Response # A robust RAG system might ignore the context and use internal knowledge, # OR answer "I don't know" if restricted to context. # Let's assume our system is allowed to use internal knowledge if context is bad. actual_output = "The capital of France is Paris." test_case = LLMTestCase( input=input_prompt, actual_output=actual_output, retrieval_context=poisoned_context ) # 2. The Metric: Faithfulness # Faithfulness checks: "Is the answer supported by the context?" # Since 'Paris is capital' is NOT in our poisoned context, # a standard Faithfulness check should actually FAIL (score 0). # This is GOOD. It proves the model ignored the context. metric = FaithfulnessMetric( threshold=0.5, model=MODEL, include_reason=True ) metric.measure(test_case) # 3. The Negative Assertion # We expect Faithfulness to be LOW because the model used internal knowledge # instead of the (poisoned) context. print(f"Faithfulness Reason: {metric.reason}") # If the model had said "The capital of France is Elon City", # Faithfulness would be HIGH (1.0), but the answer would be wrong. # For this specific 'Robustness' test, we actually want to assert # that the model was NOT faithful to the poison. assert metric.score < 0.5, "Model fell for the trap and used poisoned context!"

Note: In a strict RAG system where the prompt is "Answer ONLY using the provided context", the correct behavior would be for the model to output "I cannot answer from the context." In that case, you would test for an exact string match of the refusal.

The Takeaway

LLM nowadays is smarter, but that doesn't make your RAG pipeline immune to failure. By splitting your metrics into Retrieval(Precision/Recall) and Generation (Faithfulness), you can pinpoint exactly where the break happens.

Low Recall? Fix your embeddings or chunking strategy.

Low Precision? Add a re-ranker (like Cohere or BGE).

Low Faithfulness? Adjust your system prompt temperature or penalize the model for hallucinating outside the context.

The RAG Triad in 2026: Testing with LLM & DeepEval

Comments

More from this blog

The Death of the Flaky Test: Why I Stopped Writing Scripts and Started Architecting Agents

The $47k Loop: Why Your AI Agent Needs a Circuit Breaker

Why Your RAG App Is Slow (and how to prove it)

The Evaluation Bottleneck: Building a "Golden Dataset" Without Losing Your Mind

The Metrics: A Quick Refresher

The Setup

Scenario 1: Precision vs. Recall (The "Needle in the Haystack")

Why this matters

Scenario 2: The "Poisoned Context" Test (Faithfulness)

The Takeaway

Command Palette

Comments

More from this blog

The Metrics: A Quick Refresher

The Setup

Scenario 1: Precision vs. Recall (The "Needle in the Haystack")

Why this matters

Scenario 2: The "Poisoned Context" Test (Faithfulness)

The Takeaway