The Automated Confidentiality Tripwire

Series: "When Models Talk Too Much - Auditing and Securing LLMs Against Data Leakage"

Hello, and welcome back! If you’ve been following our journey, you know we’ve spent time dissecting how large language models can inadvertently disclose sensitive information. Now, it's time to put on our engineering hats and answer the most crucial question: How do we stop it?

Let’s dive into the hands-on blueprint for building an integrated, production-ready leakage detection system https://github.com/iddimov/security-reverse-proxy-for-llm

The shift from testing deterministic code to auditing non-deterministic LLM output requires a complete upgrade of our Continuous Integration/Continuous Deployment (CI/CD) pipelines. This isn't just about adding a few checks; it’s about creating a living, automated quality system that treats confidentiality as a core, measurable feature.

1. Expanding the Scope: Moving Beyond Simple PII

The initial instinct when securing data is to scan for Personally Identifiable Information (PII) - things like phone numbers, addresses, and credit card details. This is essential, and we call it lexical safety. But in the world of generative AI, the risk surface is much wider. Our automated checks must account for two advanced forms of leakage:

RAG Context Contamination: Retrieval-Augmented Generation (RAG) is wonderful for grounded responses, but it involves feeding the model dynamic internal documents (emails, corporate strategies, config files). If this proprietary data wasn’t properly filtered upstream before being embedded, the model can inadvertently summarize or reveal it in an output - a major risk to internal knowledge.
System Prompt Disclosure: Every LLM needs a "system prompt" to define its personality, limits, and rules. If an attacker can coax the model into revealing these instructions, they gain a blueprint for bypassing established operational controls. Security experts are clear: the prompt shouldn't contain secrets, but revealing internal guardrails is still a serious security violation.

To address these sophisticated, semantic problems, we need a dual-layered approach that combines traditional pattern matching with modern AI evaluation techniques.

2. Layer 1: The Centralized PII Protection Hub

Integrating PII scrubbing logic into every single microservice that calls an external LLM is a recipe for operational inconsistency and compliance headaches . The elegant engineering solution is to centralize this protection using a Reverse Proxy.

Implementing PII Scanning with FastAPI and Presidio

We can establish a lightweight server - using a framework like FastAPI - sits between all our internal applications and the LLM provider's API. This proxy acts as a mandatory checkpoint.

Interception: The FastAPI server intercepts all API calls destined for the LLM endpoint (e.g., /v1/chat/completions) .
Scrubbing: It uses the open-source Microsoft Presidio SDK to perform highly accurate lexical analysis. Presidio's Analyzer identifies PII using regex, Named Entity Recognition (NER), and rule-based logic .
Anonymization: Presidio’s Anonymizer then redacts or replaces the sensitive data with context-preserving placeholders .
Forwarding: Only the sanitized, PII-free request is sent to the external LLM .

This centralized architecture guarantees every request meets compliance standards, massively simplifying our auditing process.

3. Layer 2: The Secret Weapon of Semantic Security

Lexical checks catch names and numbers. But how do we catch a cleverly paraphrased corporate secret? We pivot from looking for patterns to looking for meaning.

Vector Similarity: The Knowledge Guardrail

The trick lies in using Vector Similarity to measure the conceptual overlap between the model’s output and our confidential knowledge base .

Establish a Secret Knowledge Base: We take all sensitive, proprietary data - the system prompt text, key RAG source chunks, and internal documents - and convert them into high-dimensional vectors (embeddings). This becomes our "Secret Vector Store".
Test and Embed Output: During QA, send adversarial prompts (tests designed to force a leak) to the LLM. The model’s generated response is also converted into a vector.
Threshold Check: We calculate the Cosine Similarity (a measure of orientation between vectors) between the LLM output vector and every vector in our Secret Store .

If the similarity score exceeds a strict, pre-defined threshold (say, 0.90), we know the output is semantically too close to a known secret, and the test triggers an automated security violation . This technique is also powerful for improving RAG health by detecting and filtering redundant chunks during ingestion.

4. The Continuous Quality Checkpoint: Integration and Automation

Traditional QA models that demand a binary "Pass/Fail" are not compatible with the non-deterministic nature of LLMs. To manage this, we adopt acceptance bands - defining an acceptable range for risk scores rather than demanding an exact match.

We integrate our security pipeline using specialized MLOps tools:

Framework/Tool	Role in the Pipeline	Security Layer	Detection Focus
LangChain/LangSmith	Evaluation Harness & Observability	Internal Tracing	Running security datasets, identifying which component (agent, retriever) caused a leak
Playwright	API Black-Box Testing	External Validation	Sending adversarial requests to the deployed service and validating the final API response integrity
LLM Guard / Giskard	Runtime Filters and Scoring	Output Processing	Real-time PII scanning, prompt injection detection, and providing numerical risk scores

This layered approach ensures we know if the external boundary holds (Playwright) and why it failed internally (LangSmith tracing).

The Final Step: The Anonymize/Deanonymize Vault

In environments with high regulatory oversight (like finance or legal), we don't just need to block PII; we sometimes need to process it and then restore it for the end-user (e.g., summarizing a court transcript).

This is solved with the secure Vault Pattern using tools like LLM Guard and Langfuse:

Anonymize (Input): The input is scanned, PII is redacted and mapped to secure placeholders, and the original data is stored in a temporary, secure Vault.
LLM Call: The sanitized input is processed by the LLM.
Deanonymize (Output): The model’s response is scanned, the placeholders are identified, and the original PII is restored from the Vault before being delivered to the user.

This entire process is tracked for auditing, allowing us to measure the latency and accuracy cost of every security step.

5. Sample Code Deep Dive: The Leakage Test Runner

To make this actionable, we encapsulate all these checks into a single, executable class that runs as a mandatory CI/CD Security Gate. This runner translates our architectural requirements into quantifiable exit codes.

The key is enforcing the Hard Gate - the moment a security metric exceeds our acceptable risk band, we halt the deployment.

# LLM Leakage CI Check Hook (Conceptual)

from llm_test_runner import LLMLeakageTestRunner 
import sys

# 1. Define acceptable risk levels (Acceptance Bands)
PII_RISK_TOLERANCE = 0.40      # Max acceptable PII risk score from Presidio/LLM Guard 
SEMANTIC_RISK_TOLERANCE = 0.90 # Max acceptable semantic similarity to any known secret

# 2. Initialize Runner (Points to centralized FastAPI Proxy)
runner = LLMLeakageTestRunner(api_url="http://proxy.ci.corp/v1/chat/completions")

# 3. Execute Adversarial Test Suite
leakage_detected = False
for prompt in adversarial_prompts:
    
    # Check A: Lexical Security (PII)
    pii_score = runner.run_lexical_test(prompt)
    if pii_score > PII_RISK_TOLERANCE:
        print(f"SECURITY VIOLATION: PII risk score {pii_score} exceeds {PII_RISK_TOLERANCE}")
        leakage_detected = True
    
    # Check B: Semantic Security (Knowledge/RAG/Prompt)
    semantic_score = runner.run_semantic_test(prompt)
    if semantic_score > SEMANTIC_RISK_TOLERANCE:
        print(f"SECURITY VIOLATION: Semantic similarity score {semantic_score} exceeds {SEMANTIC_RISK_TOLERANCE}")
        leakage_detected = True
        
# 4. CI/CD Gate Decision
if leakage_detected:
    print("Deployment blocked: Security violation detected. Halting promotion.")
    sys.exit(1) # Non-zero exit code stops the CI pipeline
else:
    print("Security checks passed within acceptable risk bands. Proceeding to deployment.")
    sys.exit(0)

6. Conclusion: Confidentiality as Code

Automating leakage detection transforms confidentiality from a hopeful aspiration into a concrete, auditable engineering practice. By combining the speed of lexical scanning (Presidio) with the deep understanding of semantic analysis (Vector Similarity), and integrating these checks into a unified CI/CD harness (LangSmith, Playwright), we create a truly modern quality assurance system.

The future of responsible AI deployment depends on codifying these protections. When security metrics are treated as non-deterministic acceptance bands, we gain the confidence to innovate rapidly while ensuring our models remain trustworthy and compliant. Happy engineering!

The Automated Confidentiality Tripwire

1. Expanding the Scope: Moving Beyond Simple PII

2. Layer 1: The Centralized PII Protection Hub

Implementing PII Scanning with FastAPI and Presidio

3. Layer 2: The Secret Weapon of Semantic Security

Vector Similarity: The Knowledge Guardrail

4. The Continuous Quality Checkpoint: Integration and Automation

The Final Step: The Anonymize/Deanonymize Vault

5. Sample Code Deep Dive: The Leakage Test Runner

6. Conclusion: Confidentiality as Code

Comments

More from this blog

The Death of the Flaky Test: Why I Stopped Writing Scripts and Started Architecting Agents

The RAG Triad in 2026: Testing with LLM & DeepEval

The $47k Loop: Why Your AI Agent Needs a Circuit Breaker

Why Your RAG App Is Slow (and how to prove it)

The Evaluation Bottleneck: Building a "Golden Dataset" Without Losing Your Mind

Command Palette

1. Expanding the Scope: Moving Beyond Simple PII

2. Layer 1: The Centralized PII Protection Hub

Implementing PII Scanning with FastAPI and Presidio

3. Layer 2: The Secret Weapon of Semantic Security

Vector Similarity: The Knowledge Guardrail

4. The Continuous Quality Checkpoint: Integration and Automation

The Final Step: The Anonymize/Deanonymize Vault

5. Sample Code Deep Dive: The Leakage Test Runner

6. Conclusion: Confidentiality as Code

Comments

More from this blog