Production RAG Systems Need More Than Retrieval Demos

A production RAG system should be treated as a retrieval and evaluation pipeline with explicit failure modes, not as a prompt wrapper around a vector store.

  • RAG
  • Evaluation
  • Vector Search
  • Production AI

Problem

Many teams describe a retrieval-augmented generation system as if it were a single feature. In practice, the system behaves more like a chain of narrow dependencies: chunking, indexing, retrieval, reranking, context assembly, generation, and output validation.
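The stage chain above can be sketched as explicit, separately observable functions. This is a minimal illustrative sketch, not a specific framework's API: the stage names, the toy substring retrieval, and the stubbed generation step are all placeholders standing in for real components.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineTrace:
    """Records what each stage produced, so a failure can be localized
    to chunking, retrieval, or generation instead of blamed on the prompt."""
    stages: dict = field(default_factory=dict)

def run_pipeline(query: str, corpus: list[str]) -> tuple[str, PipelineTrace]:
    trace = PipelineTrace()

    # Chunking: naive sentence split, standing in for a real chunker.
    chunks = [c for doc in corpus for c in doc.split(". ")]
    trace.stages["chunking"] = len(chunks)

    # Indexing + retrieval: toy substring match in place of vector search.
    index = {i: c for i, c in enumerate(chunks)}
    hits = [c for c in index.values() if query.lower() in c.lower()]
    trace.stages["retrieval"] = len(hits)

    # Reranking + context assembly: shortest passages first, top 3 kept.
    ranked = sorted(hits, key=len)[:3]
    context = "\n".join(ranked)

    # Generation: stubbed; a real system would call a model with `context`.
    answer = f"Answer based on {len(ranked)} passages."
    trace.stages["generation"] = bool(answer)
    return answer, trace
```

The point of the trace is the dependency structure: when the final answer is wrong, the per-stage counts show whether retrieval returned nothing, chunking destroyed the passage, or generation failed despite good context.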

What breaks in production

  • Retrieval recall drops when indexing assumptions stop matching query behavior.
  • Latency budgets get consumed by avoidable search and post-processing steps.
  • Prompt changes mask underlying retrieval quality problems instead of fixing them.
  • Teams optimize for demo success rather than observed production usefulness.
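The latency failure mode in particular is easy to instrument. A minimal sketch, assuming a total per-request budget in milliseconds; the stage names and the `timed` / `check_budget` helpers are illustrative, not part of any particular observability library.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, ledger: dict):
    """Record wall-clock milliseconds spent in a stage into `ledger`."""
    start = time.perf_counter()
    yield
    ledger[stage] = (time.perf_counter() - start) * 1000

def check_budget(ledger: dict, budget_ms: float) -> list[str]:
    """If total latency exceeds the budget, return stages sorted by cost,
    worst first, so the team knows where the budget is being consumed."""
    if sum(ledger.values()) <= budget_ms:
        return []
    return sorted(ledger, key=ledger.get, reverse=True)
```

Wrapping each stage in `timed` makes "latency budgets get consumed by avoidable steps" a measurable claim rather than a suspicion.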

Practical design approach

Treat the system as a pipeline with explicit checkpoints.

  • Define the user task and what constitutes a useful answer.
  • Measure retrieval quality before discussing final answer quality.
  • Track whether ranking, chunking, or source freshness is actually limiting the system.
  • Keep each layer observable enough that the team can explain failures.
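Measuring retrieval quality before answer quality usually starts with recall@k over a small labeled query set. A minimal sketch under that assumption; the function names and the list-of-runs input shape are illustrative choices, not a standard benchmark harness.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over a labeled query set: each run pairs the ranked
    retrieved IDs for one query with that query's known-relevant IDs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

If this number is low, no amount of prompt tuning downstream will fix the system, which is exactly why the checkpoint belongs before any discussion of final answer quality.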

Tradeoffs

The best retrieval setup is not always the one with the most complex architecture. In regulated or operational environments, simpler systems with clearer evaluation boundaries often outperform more elaborate stacks because engineers can reason about them when something goes wrong.

Production lesson

RAG becomes credible when retrieval quality, latency, and operational risk are measured directly. If those signals are missing, the system is not production-ready even if the generated answers look impressive in a demo.

Related projects

Case studies where these tradeoffs showed up in practice.

Project · Legal Tech · Public Sector AI

CNJ / PNUD

PEDRO Precedent Discovery Platform

Data Scientist · Jul 2022 - May 2023

National-scale precedent discovery initiative for CNJ and PNUD, combining FastAPI services, unsupervised NLP, semantic grouping, and governed experimentation to systematize qualified precedents from Brazil's highest courts.

Primary impact

Enabled discovery of more than 30 precedent categories across long-form judicial decisions.

  • FastAPI
  • NLP
  • Semantic Similarity
  • MLflow
  • Legal Tech

Key outcomes

  • More than 30 precedent categories identified through semantic workflows
  • AI services integrated with CNJ systems through REST APIs
Project · Legal Tech · Public Sector AI

PGDF

OSIRIS Legal-Fiscal AI Workflows

Data Scientist · May 2023 - May 2024

AI delivery for PGDF legal-fiscal operations, spanning production APIs, supervised and semi-supervised models, active learning, and early LLM exploration for document-heavy institutional workflows.

Primary impact

Brought governed ML workflows and production APIs into legal-fiscal operations, while designing active-learning paths for longer-term model adaptation.

  • FastAPI
  • Active Learning
  • MLflow
  • DVC
  • LLM

Key outcomes

  • Production APIs connected model outputs to PGDF internal systems
  • Active-learning loop designed to reduce model drift over time

Next step

Want the delivery context behind this line of thinking?

The project pages show where these technical decisions had to work inside real institutions, teams, and operational constraints.