Building RAG Systems That Actually Work
Retrieval-Augmented Generation (RAG) sounds simple: embed your documents, search for relevant chunks, and pass them to an LLM. In practice, most RAG systems deliver disappointing results.
Why Most RAG Systems Fail
Problem 1: Bad Chunking
# Bad: Naive fixed-size chunking loses context at chunk boundaries
chunks = document.split_by_tokens(512)

# Good: Semantic chunking with overlap
chunks = semantic_chunker.chunk(
    document,
    max_tokens=512,
    overlap=50,
    preserve_sentences=True,
)
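For illustration, here is a TypeScript sketch of the same idea: sentence-preserving chunks with a token overlap carried between consecutive chunks. countTokens and splitSentences are rough stand-ins, not a specific tokenizer or sentence splitter.

// Placeholder helpers -- swap in a real tokenizer / sentence splitter.
const countTokens = (s: string): number => Math.ceil(s.length / 4); // rough approximation
const splitSentences = (s: string): string[] =>
  s.split(/(?<=[.!?])\s+/).filter(Boolean);

// Sentence-aware chunking with ~overlapTokens of trailing context carried forward.
function chunkBySentences(
  text: string,
  maxTokens = 512,
  overlapTokens = 50
): string[] {
  const sentences = splitSentences(text);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const t = countTokens(sentence);
    if (currentTokens + t > maxTokens && current.length > 0) {
      chunks.push(current.join(' '));
      // Carry trailing sentences forward until we have ~overlapTokens of overlap
      const overlap: string[] = [];
      let overlapCount = 0;
      for (let i = current.length - 1; i >= 0 && overlapCount < overlapTokens; i--) {
        overlap.unshift(current[i]);
        overlapCount += countTokens(current[i]);
      }
      current = overlap;
      currentTokens = overlapCount;
    }
    current.push(sentence);
    currentTokens += t;
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}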
Problem 2: Poor Retrieval
Simple vector similarity isn't enough. You need:
- Hybrid search (vector + keyword); see the sketch after this list
- Reranking with cross-encoders
- Query expansion for better recall
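A minimal sketch of the hybrid-search half, fusing results with a weighted sum of max-normalized scores. vectorSearch and keywordSearch are placeholders for your vector DB and keyword (e.g. BM25) backends, not a specific library's API.

interface ScoredChunk { id: string; text: string; score: number; }

// Placeholder search backends -- wire these to your vector DB and keyword index.
declare function vectorSearch(query: string, limit: number): Promise<ScoredChunk[]>;
declare function keywordSearch(query: string, limit: number): Promise<ScoredChunk[]>;

async function hybridSearch(
  query: string,
  opts: { vectorWeight: number; keywordWeight: number; limit: number }
): Promise<ScoredChunk[]> {
  const [vec, kw] = await Promise.all([
    vectorSearch(query, opts.limit),
    keywordSearch(query, opts.limit),
  ]);

  // Normalize each result list by its max score so the weights are comparable.
  const normalize = (results: ScoredChunk[]): Map<string, ScoredChunk> => {
    const max = Math.max(...results.map(r => r.score), 1e-9);
    const byId = new Map<string, ScoredChunk>();
    for (const r of results) byId.set(r.id, { ...r, score: r.score / max });
    return byId;
  };
  const v = normalize(vec);
  const k = normalize(kw);

  // Weighted fusion: a chunk found by both backends gets credit from both.
  const fused = new Map<string, ScoredChunk>();
  for (const [id, chunk] of [...v.entries(), ...k.entries()]) {
    const score =
      opts.vectorWeight * (v.get(id)?.score ?? 0) +
      opts.keywordWeight * (k.get(id)?.score ?? 0);
    fused.set(id, { ...chunk, score });
  }
  return [...fused.values()].sort((a, b) => b.score - a.score).slice(0, opts.limit);
}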
Problem 3: Context Window Management
You retrieve 10 chunks but can only fit 5 in the prompt. Which do you keep?
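One straightforward answer: keep the highest-scoring chunks that still fit the budget. A minimal sketch, assuming each chunk already carries a relevance score and a token count; the selectChunks call in the query pipeline below layers a diversity term on top of the same idea.

interface RankedChunk { text: string; score: number; tokens: number; }

// Greedy selection: take chunks in score order until the token budget is spent.
function fitToBudget(chunks: RankedChunk[], maxTokens: number): RankedChunk[] {
  const selected: RankedChunk[] = [];
  let used = 0;
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    if (used + chunk.tokens <= maxTokens) {
      selected.push(chunk);
      used += chunk.tokens;
    }
  }
  return selected;
}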
Production Architecture
Document Processing Pipeline
interface DocumentPipeline {
  ingest: (doc: Document) => void;
  chunk: (doc: Document) => Chunk[];
  embed: (chunks: Chunk[]) => Embedding[];
  index: (embeddings: Embedding[]) => void;
}
// Example: Process a company knowledge base.
// DocumentPipeline here is a concrete class implementing the interface above
// (an interface alone can't be instantiated with `new`).
const pipeline = new DocumentPipeline({
  chunking: {
    strategy: 'semantic',
    maxTokens: 512,
    overlap: 50
  },
  embedding: {
    model: 'text-embedding-3-large',
    dimensions: 1536
  },
  index: {
    vectorDB: 'pinecone',
    namespace: 'kb-v1'
  }
});
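Running documents through the stages might look like the sketch below; loadDocuments and ingestKnowledgeBase are hypothetical helpers, and the stage methods come from the DocumentPipeline interface above.

// Hypothetical loader for the source documents (file system, CMS, wiki, ...).
declare function loadDocuments(path: string): Promise<Document[]>;

async function ingestKnowledgeBase(path: string) {
  const docs = await loadDocuments(path);
  for (const doc of docs) {
    const chunks = pipeline.chunk(doc);        // semantic chunks with overlap
    const embeddings = pipeline.embed(chunks); // one vector per chunk
    pipeline.index(embeddings);                // upsert into the vector DB namespace
  }
}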
Query Pipeline
// Multi-stage retrieval for accuracy
async function retrieve(query: string) {
  // Stage 1: Initial retrieval (vector + keyword)
  const candidates = await hybridSearch(query, {
    vectorWeight: 0.7,
    keywordWeight: 0.3,
    limit: 50
  });

  // Stage 2: Rerank with cross-encoder
  const reranked = await rerank(query, candidates, {
    model: 'cross-encoder-ms-marco',
    limit: 10
  });

  // Stage 3: Select best chunks within context window
  const selected = selectChunks(reranked, {
    maxTokens: 4000,
    diversityPenalty: 0.3
  });

  return selected;
}
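The selectChunks step is budgeted greedy selection with a diversity penalty, similar in spirit to maximal marginal relevance. Here is a minimal sketch of one way to implement it, assuming each candidate carries a rerank score, a token count, and an embedding; the cosine similarity and the exact penalty form are illustrative assumptions, not the production implementation.

interface Candidate { text: string; score: number; tokens: number; embedding: number[]; }

const cosine = (a: number[], b: number[]) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
};

// Greedily pick the chunk whose (score - diversityPenalty * max similarity to
// already-selected chunks) is highest, while staying under the token budget.
function selectChunks(
  candidates: Candidate[],
  opts: { maxTokens: number; diversityPenalty: number }
): Candidate[] {
  const selected: Candidate[] = [];
  const remaining = [...candidates];
  let used = 0;

  while (remaining.length > 0) {
    let bestIdx = -1;
    let bestValue = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const c = remaining[i];
      if (used + c.tokens > opts.maxTokens) continue;
      const redundancy = selected.length
        ? Math.max(...selected.map(s => cosine(c.embedding, s.embedding)))
        : 0;
      const value = c.score - opts.diversityPenalty * redundancy;
      if (value > bestValue) { bestValue = value; bestIdx = i; }
    }
    if (bestIdx === -1) break; // nothing left fits the budget
    const [chosen] = remaining.splice(bestIdx, 1);
    selected.push(chosen);
    used += chosen.tokens;
  }
  return selected;
}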
Generation with Citations
const response = await llm.generate({
  system: `Answer based only on provided context.
Cite sources using [1], [2], etc.`,
  context: chunks.map((c, i) => `[${i + 1}] ${c.text}`),
  query: userQuery
});

// Extract and verify citations
const citations = extractCitations(response);
const verified = verifyCitations(citations, chunks);
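extractCitations and verifyCitations can be as simple as a regex over the bracketed markers plus a range check against the chunks actually sent. A minimal sketch follows; the bodies are assumptions that merely match the names used above.

// Pull out every distinct [n] marker from the model's answer text.
function extractCitations(responseText: string): number[] {
  const matches = responseText.matchAll(/\[(\d+)\]/g);
  return [...new Set([...matches].map(m => parseInt(m[1], 10)))];
}

// A citation is valid only if it points at a chunk we actually provided.
function verifyCitations(citations: number[], chunks: { text: string }[]) {
  return citations.map(n => ({
    citation: n,
    valid: n >= 1 && n <= chunks.length,
    source: n >= 1 && n <= chunks.length ? chunks[n - 1].text : null,
  }));
}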
Performance Benchmarks
From production systems we've built:
Retrieval Quality:
- Precision@5: 92%
- Recall@10: 87%
- MRR: 0.89
Latency:
- Query time: p50 180ms, p99 420ms
- End-to-end: p50 1.2s, p99 2.8s
Cost:
- $0.003 per query (embedding + inference)
- Vector DB: $200-800/month for 10M chunks
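At that per-query rate, a deployment handling 100K queries/day (the scale mentioned at the end of this post) works out to roughly 100,000 × $0.003 ≈ $300/day, or about $9K/month in embedding and inference spend, before vector DB hosting.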
Common Use Cases
1. Internal Knowledge Base
- 50K+ company documents
- Instant answers for support team
- 70% ticket deflection rate
2. Product Documentation Search
- Multi-product search
- Code examples + explanations
- 85% satisfaction score
3. Research Assistant
- Academic paper search
- Synthesize findings across papers
- Save 10+ hours/week per researcher
When NOT to Use RAG
RAG isn't always the answer:
- Small knowledge bases (under 100 docs): Fine-tuning may be better
- Highly structured data: Use SQL + text-to-SQL instead
- Real-time data: retrieval only surfaces what was last indexed, not live values
What's Next
Advanced topics we'll cover:
- Multi-hop reasoning across documents
- Image + text RAG for multimodal search
- Continuous learning from user feedback
- Cost optimization at scale
Need help building RAG infrastructure? We've deployed systems handling 100K+ queries/day. Book a call to discuss your use case.