Comprehensive guide to implementing RAG systems including vector database selection, chunking strategies, embedding models, and retrieval optimization. Use when building RAG systems, implementing semantic search, optimizing retrieval quality, or debugging RAG performance issues.
Add this skill
npx mdskills install applied-artificial-intelligence/rag-implementationComprehensive RAG tutorial with examples, patterns, and trade-offs across all major tools
1---2name: rag-implementation3description: Comprehensive guide to implementing RAG systems including vector database selection, chunking strategies, embedding models, and retrieval optimization. Use when building RAG systems, implementing semantic search, optimizing retrieval quality, or debugging RAG performance issues.4---56# RAG Implementation Patterns78Comprehensive guide to implementing Retrieval-Augmented Generation (RAG) systems including vector database selection, chunking strategies, embedding models, retrieval optimization, and production deployment patterns.910---1112## Quick Reference1314**When to use this skill:**15- Building RAG/semantic search systems16- Implementing document retrieval pipelines17- Optimizing vector database performance18- Debugging retrieval quality issues19- Choosing between vector database options20- Designing chunking strategies21- Implementing hybrid search2223**Technologies covered:**24- Vector DBs: Qdrant, Pinecone, Chroma, Weaviate, Milvus25- Embeddings: OpenAI, Sentence Transformers, Cohere26- Frameworks: LangChain, LlamaIndex, Haystack2728---2930## Part 1: Vector Database Selection3132### Database Comparison Matrix3334| Database | Best For | Deployment | Performance | Cost |35|----------|----------|------------|-------------|------|36| **Qdrant** | Self-hosted, production | Docker/K8s | Excellent (Rust) | Free (self-host) |37| **Pinecone** | Managed, rapid prototyping | Cloud | Excellent | Pay-per-use |38| **Chroma** | Local development, embedded | In-process | Good (Python) | Free |39| **Weaviate** | Complex schemas, GraphQL | Docker/Cloud | Excellent (Go) | Free + Cloud |40| **Milvus** | Large-scale, distributed | K8s | Excellent (C++) | Free (self-host) |4142### Qdrant Setup (Recommended for Production)4344```python45from qdrant_client import QdrantClient46from qdrant_client.models import Distance, VectorParams, PointStruct4748# Initialize client (local or cloud)49client = QdrantClient(url="http://localhost:6333") # or cloud URL5051# Create collection52client.create_collection(53 collection_name="documents",54 vectors_config=VectorParams(55 size=1536, # OpenAI text-embedding-3-small dimension56 distance=Distance.COSINE # or DOT, EUCLID57 )58)5960# Insert vectors with payload61client.upsert(62 collection_name="documents",63 points=[64 PointStruct(65 id=1,66 vector=[0.1, 0.2, ...], # 1536 dimensions67 payload={68 "text": "Document content",69 "source": "doc.pdf",70 "page": 1,71 "metadata": {...}72 }73 )74 ]75)7677# Search78results = client.search(79 collection_name="documents",80 query_vector=[0.1, 0.2, ...],81 limit=5,82 score_threshold=0.7 # Minimum similarity83)84```8586### Pinecone Setup (Managed Service)8788```python89from pinecone import Pinecone, ServerlessSpec9091# Initialize92pc = Pinecone(api_key="your-key")9394# Create index95pc.create_index(96 name="documents",97 dimension=1536,98 metric="cosine",99 spec=ServerlessSpec(cloud="aws", region="us-east-1")100)101102# Get index103index = pc.Index("documents")104105# Upsert vectors106index.upsert(vectors=[107 ("doc1", [0.1, 0.2, ...], {"text": "...", "source": "..."})108])109110# Query111results = index.query(112 vector=[0.1, 0.2, ...],113 top_k=5,114 include_metadata=True115)116```117118---119120## Part 2: Chunking Strategies121122### Strategy 1: Fixed-Size Chunking (Simple, Fast)123124```python125def fixed_size_chunking(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:126 """127 Split text into fixed-size chunks with overlap.128129 Pros: Simple, predictable chunk sizes130 Cons: May break mid-sentence, poor semantic boundaries131 """132 words = text.split()133 chunks = []134135 for i in range(0, len(words), chunk_size - overlap):136 chunk = ' '.join(words[i:i + chunk_size])137 chunks.append(chunk)138139 return chunks140141# Usage142chunks = fixed_size_chunking(document, chunk_size=512, overlap=50)143```144145**When to use:**146- Simple documents (logs, transcripts)147- Prototyping/MVP148- Consistent token budgets needed149150### Strategy 2: Semantic Chunking (Better Quality)151152```python153from langchain.text_splitter import RecursiveCharacterTextSplitter154155def semantic_chunking(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:156 """157 Split on semantic boundaries (paragraphs, sentences).158159 Pros: Preserves meaning, better retrieval quality160 Cons: Variable chunk sizes, slower processing161 """162 splitter = RecursiveCharacterTextSplitter(163 chunk_size=chunk_size,164 chunk_overlap=overlap,165 separators=["\n\n", "\n", ". ", " ", ""], # Priority order166 length_function=len167 )168169 return splitter.split_text(text)170171# Usage172chunks = semantic_chunking(document, chunk_size=1000, overlap=200)173```174175**When to use:**176- Long-form documents (articles, books, reports)177- Quality > speed178- Natural language content179180### Strategy 3: Hierarchical Chunking (Best for Structured Docs)181182```python183def hierarchical_chunking(document: dict) -> list[dict]:184 """185 Chunk based on document structure (sections, subsections).186187 Pros: Preserves hierarchy, enables parent-child retrieval188 Cons: Requires structured input, more complex189 """190 chunks = []191192 for section in document['sections']:193 # Parent chunk (section summary)194 chunks.append({195 'text': section['title'] + '\n' + section['summary'],196 'type': 'parent',197 'section_id': section['id']198 })199200 # Child chunks (paragraphs)201 for para in section['paragraphs']:202 chunks.append({203 'text': para,204 'type': 'child',205 'parent_id': section['id']206 })207208 return chunks209```210211**When to use:**212- Technical documentation213- Books with TOC214- Legal documents215- Need to preserve context hierarchy216217### Strategy 4: Sliding Window (Maximum Context Preservation)218219```python220def sliding_window_chunking(text: str, window_size: int = 512, stride: int = 256) -> list[str]:221 """222 Overlapping windows for maximum context.223224 Pros: No information loss at boundaries225 Cons: Storage overhead (duplicate content)226 """227 words = text.split()228 chunks = []229230 for i in range(0, len(words) - window_size + 1, stride):231 chunk = ' '.join(words[i:i + window_size])232 chunks.append(chunk)233234 return chunks235```236237**When to use:**238- Critical retrieval accuracy needed239- Short queries need broader context240- Storage cost not a concern241242---243244## Part 3: Embedding Models245246### Model Selection Guide247248| Model | Dimensions | Speed | Quality | Cost | Use Case |249|-------|-----------|-------|---------|------|----------|250| **OpenAI text-embedding-3-small** | 1536 | Fast | Excellent | $0.02/1M tokens | Production, general purpose |251| **OpenAI text-embedding-3-large** | 3072 | Medium | Best | $0.13/1M tokens | High-quality retrieval |252| **all-MiniLM-L6-v2** | 384 | Very fast | Good | Free | Self-hosted, prototyping |253| **all-mpnet-base-v2** | 768 | Fast | Very good | Free | Self-hosted, quality |254| **Cohere embed-english-v3.0** | 1024 | Fast | Excellent | $0.10/1M tokens | Semantic search focus |255256### OpenAI Embeddings (Recommended)257258```python259from openai import OpenAI260261client = OpenAI(api_key="your-key")262263def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:264 """265 Generate embeddings using OpenAI.266267 Batch size: Up to 2048 inputs per request268 Rate limits: Check tier limits269 """270 response = client.embeddings.create(271 model=model,272 input=texts273 )274275 return [item.embedding for item in response.data]276277# Usage278chunks = ["chunk 1", "chunk 2", ...]279embeddings = get_embeddings(chunks)280```281282### Sentence Transformers (Self-Hosted)283284```python285from sentence_transformers import SentenceTransformer286287# Load model (cached after first download)288model = SentenceTransformer('all-MiniLM-L6-v2')289290def get_embeddings_local(texts: list[str]) -> list[list[float]]:291 """292 Generate embeddings locally (no API costs).293294 GPU recommended for batches > 100295 CPU acceptable for small batches296 """297 return model.encode(texts, show_progress_bar=True).tolist()298299# Usage300embeddings = get_embeddings_local(chunks)301```302303---304305## Part 4: Retrieval Optimization306307### Technique 1: Hybrid Search (Dense + Sparse)308309```python310from qdrant_client.models import Filter, FieldCondition, MatchValue311312def hybrid_search(query: str, query_vector: list[float], top_k: int = 10):313 """314 Combine dense (vector) and sparse (keyword) search.315316 Dense: Semantic similarity317 Sparse: Exact keyword matches318 """319 # Dense search320 dense_results = client.search(321 collection_name="documents",322 query_vector=query_vector,323 limit=top_k * 2 # Get more candidates324 )325326 # Sparse search (BM25 via metadata)327 sparse_results = client.search(328 collection_name="documents",329 query_filter=Filter(330 must=[331 FieldCondition(332 key="text",333 match=MatchValue(value=query)334 )335 ]336 ),337 limit=top_k * 2338 )339340 # Merge and re-rank341 combined = merge_results(dense_results, sparse_results, weights=(0.7, 0.3))342 return combined[:top_k]343```344345### Technique 2: Query Expansion346347```python348def expand_query(query: str) -> list[str]:349 """350 Generate query variations for better recall.351352 Techniques:353 - Synonym expansion354 - Question reformulation355 - Entity extraction356 """357 from openai import OpenAI358 client = OpenAI()359360 response = client.chat.completions.create(361 model="gpt-4",362 messages=[{363 "role": "system",364 "content": "Generate 3 alternative phrasings of the user's query."365 }, {366 "role": "user",367 "content": query368 }]369 )370371 expanded = [query] + response.choices[0].message.content.split('\n')372 return expanded373374# Usage375queries = expand_query("How to train neural networks?")376# → ["How to train neural networks?",377# "What are neural network training techniques?",378# "Neural network optimization methods",379# "Deep learning model training"]380```381382### Technique 3: Reranking383384```python385from sentence_transformers import CrossEncoder386387# Load cross-encoder (better than bi-encoder for reranking)388reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')389390def rerank_results(query: str, results: list[dict], top_k: int = 5) -> list[dict]:391 """392 Rerank initial results using cross-encoder.393394 More accurate but slower than initial retrieval395 Use on top 20-50 candidates only396 """397 # Score each query-document pair398 pairs = [(query, result['text']) for result in results]399 scores = reranker.predict(pairs)400401 # Combine scores with results402 for result, score in zip(results, scores):403 result['rerank_score'] = float(score)404405 # Sort and return top_k406 reranked = sorted(results, key=lambda x: x['rerank_score'], reverse=True)407 return reranked[:top_k]408```409410### Technique 4: Metadata Filtering411412```python413def filtered_search(414 query_vector: list[float],415 filters: dict,416 top_k: int = 5417):418 """419 Filter search by metadata (date, category, author, etc.)420421 Pre-filter: Faster but may miss results422 Post-filter: More results but slower423 """424 from qdrant_client.models import Filter, FieldCondition, Range425426 # Build filter conditions427 conditions = []428429 if 'date_range' in filters:430 conditions.append(431 FieldCondition(432 key="date",433 range=Range(434 gte=filters['date_range']['start'],435 lte=filters['date_range']['end']436 )437 )438 )439440 if 'category' in filters:441 conditions.append(442 FieldCondition(443 key="category",444 match=MatchValue(value=filters['category'])445 )446 )447448 # Search with filters449 results = client.search(450 collection_name="documents",451 query_vector=query_vector,452 query_filter=Filter(must=conditions) if conditions else None,453 limit=top_k454 )455456 return results457```458459---460461## Part 5: Context Management462463### Pattern 1: Retrieved Context Optimization464465```python466def optimize_context(query: str, retrieved_docs: list[str], max_tokens: int = 4000) -> str:467 """468 Optimize retrieved context to fit within LLM context window.469470 Strategies:471 1. Relevance-based truncation472 2. Extractive summarization473 3. Overlap removal474 """475 # Sort by relevance476 sorted_docs = sorted(retrieved_docs, key=lambda d: d['score'], reverse=True)477478 # Build context within token budget479 context_parts = []480 total_tokens = 0481482 for doc in sorted_docs:483 doc_tokens = estimate_tokens(doc['text'])484485 if total_tokens + doc_tokens <= max_tokens:486 context_parts.append(f"[Source: {doc['source']}]\n{doc['text']}")487 total_tokens += doc_tokens488 else:489 # Truncate last document to fit490 remaining = max_tokens - total_tokens491 truncated = truncate_to_tokens(doc['text'], remaining)492 context_parts.append(f"[Source: {doc['source']}]\n{truncated}")493 break494495 return "\n\n---\n\n".join(context_parts)496```497498### Pattern 2: Citation Tracking499500```python501def generate_with_citations(query: str, context: str, sources: list[dict]) -> dict:502 """503 Generate answer with citation tracking.504505 Returns:506 - answer: Generated text507 - citations: List of source documents used508 """509 from openai import OpenAI510 client = OpenAI()511512 # Create source map513 source_map = {i+1: source for i, source in enumerate(sources)}514 numbered_context = "\n\n".join([515 f"[{i+1}] {source['text']}"516 for i, source in enumerate(sources)517 ])518519 response = client.chat.completions.create(520 model="gpt-4",521 messages=[{522 "role": "system",523 "content": "Answer using the provided sources. Cite sources as [1], [2], etc."524 }, {525 "role": "user",526 "content": f"Context:\n{numbered_context}\n\nQuestion: {query}"527 }]528 )529530 answer = response.choices[0].message.content531532 # Extract citations from answer533 import re534 cited_nums = set(map(int, re.findall(r'\[(\d+)\]', answer)))535 cited_sources = [source_map[num] for num in cited_nums if num in source_map]536537 return {538 'answer': answer,539 'citations': cited_sources,540 'num_sources_used': len(cited_sources)541 }542```543544---545546## Part 6: Production Best Practices547548### Caching Strategy549550```python551from functools import lru_cache552import hashlib553554class EmbeddingCache:555 """Cache embeddings to avoid recomputation."""556557 def __init__(self, cache_size: int = 10000):558 self.cache = {}559 self.max_size = cache_size560561 def get_or_compute(self, text: str, embed_fn) -> list[float]:562 # Create cache key563 key = hashlib.sha256(text.encode()).hexdigest()564565 if key in self.cache:566 return self.cache[key]567568 # Compute and cache569 embedding = embed_fn(text)570571 if len(self.cache) >= self.max_size:572 # Evict oldest (FIFO)573 self.cache.pop(next(iter(self.cache)))574575 self.cache[key] = embedding576 return embedding577578# Usage579cache = EmbeddingCache()580embedding = cache.get_or_compute(text, lambda t: get_embeddings([t])[0])581```582583### Async Processing584585```python586import asyncio587from typing import List588589async def process_documents_async(documents: List[str], batch_size: int = 100):590 """591 Process large document sets asynchronously.592593 Benefits:594 - 10-50x faster for I/O-bound operations595 - Better resource utilization596 - Scalable to millions of documents597 """598 async def process_batch(batch):599 embeddings = await get_embeddings_async(batch)600 await upsert_to_db_async(batch, embeddings)601602 # Split into batches603 batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]604605 # Process batches concurrently606 await asyncio.gather(*[process_batch(batch) for batch in batches])607608# Usage609asyncio.run(process_documents_async(documents))610```611612### Monitoring & Observability613614```python615import time616from dataclasses import dataclass617from datetime import datetime618619@dataclass620class RAGMetrics:621 """Track RAG system performance."""622 query_count: int = 0623 avg_retrieval_time: float = 0.0624 avg_generation_time: float = 0.0625 cache_hit_rate: float = 0.0626 avg_num_results: float = 0.0627628class RAGMonitor:629 def __init__(self):630 self.metrics = RAGMetrics()631 self.query_times = []632633 def log_query(self, retrieval_time: float, generation_time: float, num_results: int):634 self.metrics.query_count += 1635 self.query_times.append({636 'timestamp': datetime.now(),637 'retrieval_time': retrieval_time,638 'generation_time': generation_time,639 'num_results': num_results640 })641642 # Update averages643 self.metrics.avg_retrieval_time = sum(644 q['retrieval_time'] for q in self.query_times645 ) / len(self.query_times)646647 self.metrics.avg_generation_time = sum(648 q['generation_time'] for q in self.query_times649 ) / len(self.query_times)650651 def get_metrics(self) -> dict:652 return {653 'total_queries': self.metrics.query_count,654 'avg_retrieval_ms': self.metrics.avg_retrieval_time * 1000,655 'avg_generation_ms': self.metrics.avg_generation_time * 1000,656 'p95_retrieval_ms': self._percentile([q['retrieval_time'] for q in self.query_times], 95) * 1000657 }658```659660---661662## Part 7: Common Pitfalls & Solutions663664### Pitfall 1: Chunk Size Too Small/Large665666**Problem:** Small chunks lack context, large chunks reduce retrieval precision667668**Solution:**669```python670# Experiment with chunk sizes671chunk_sizes = [256, 512, 1024, 2048]672for size in chunk_sizes:673 chunks = semantic_chunking(document, chunk_size=size)674 # Evaluate retrieval quality675 recall = evaluate_retrieval(chunks, test_queries)676 print(f"Size {size}: Recall {recall:.2f}")677678# Typical sweet spot: 512-1024 tokens679```680681### Pitfall 2: Poor Embedding Model Choice682683**Problem:** Model not suited for domain (e.g., code search with general model)684685**Solution:**686```python687# Use domain-specific models688domain_models = {689 'code': 'microsoft/codebert-base',690 'medical': 'dmis-lab/biobert-v1.1',691 'legal': 'nlpaueb/legal-bert-base-uncased',692 'general': 'text-embedding-3-small'693}694695model = domain_models.get(your_domain, 'text-embedding-3-small')696```697698### Pitfall 3: No Query Optimization699700**Problem:** User queries don't match document phrasing701702**Solution:** Implement query expansion + rewriting703```python704def optimize_query(raw_query: str) -> str:705 """Transform user query to better match documents."""706 # Example: "how 2 train NN" → "neural network training methods"707 # Use LLM to rewrite poorly-formed queries708 pass709```710711### Pitfall 4: Ignoring Metadata712713**Problem:** Returning irrelevant results due to lack of filtering714715**Solution:** Always store rich metadata716```python717payload = {718 'text': chunk,719 'source': 'doc.pdf',720 'page': 5,721 'date': '2024-01-15',722 'category': 'engineering',723 'author': 'John Doe',724 'confidence': 0.95 # Document quality score725}726```727728---729730## Quick Decision Trees731732### "Which vector DB should I use?"733734```735Need managed service?736 YES → Pinecone (easy) or Weaviate Cloud737 NO → Continue738739Need distributed/high-scale?740 YES → Milvus or Weaviate741 NO → Continue742743Self-hosting on Docker?744 YES → Qdrant (best performance/features)745 NO → Chroma (embedded, simple)746```747748### "Which chunking strategy?"749750```751Document type?752 Structured (docs, books) → Hierarchical chunking753 Unstructured (chat, logs) → Fixed-size chunking754 Mixed → Semantic chunking755756Quality requirement?757 Critical → Sliding window (overlap 50%)758 Standard → Semantic (overlap 20%)759 Fast/cheap → Fixed-size (overlap 10%)760```761762### "Which embedding model?"763764```765Budget?766 No limits → text-embedding-3-large767 Cost-sensitive → all-mpnet-base-v2 (self-hosted)768769Quality requirement?770 Best → text-embedding-3-large771 Good → text-embedding-3-small or Cohere772 Acceptable → all-MiniLM-L6-v2773```774775---776777## Example: Complete RAG Pipeline778779```python780from qdrant_client import QdrantClient781from openai import OpenAI782from langchain.text_splitter import RecursiveCharacterTextSplitter783784class RAGPipeline:785 def __init__(self):786 self.qdrant = QdrantClient(url="http://localhost:6333")787 self.openai = OpenAI()788 self.splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)789790 def ingest_document(self, text: str, metadata: dict):791 """Ingest and index a document."""792 # 1. Chunk793 chunks = self.splitter.split_text(text)794795 # 2. Embed796 embeddings = self.openai.embeddings.create(797 model="text-embedding-3-small",798 input=chunks799 ).data800801 # 3. Store802 points = [803 PointStruct(804 id=i,805 vector=emb.embedding,806 payload={'text': chunk, **metadata}807 )808 for i, (chunk, emb) in enumerate(zip(chunks, embeddings))809 ]810811 self.qdrant.upsert(collection_name="docs", points=points)812813 def query(self, question: str, top_k: int = 5) -> str:814 """Query with RAG."""815 # 1. Embed query816 query_emb = self.openai.embeddings.create(817 model="text-embedding-3-small",818 input=[question]819 ).data[0].embedding820821 # 2. Retrieve822 results = self.qdrant.search(823 collection_name="docs",824 query_vector=query_emb,825 limit=top_k826 )827828 # 3. Build context829 context = "\n\n".join([r.payload['text'] for r in results])830831 # 4. Generate832 response = self.openai.chat.completions.create(833 model="gpt-4",834 messages=[{835 "role": "system",836 "content": f"Answer based on this context:\n{context}"837 }, {838 "role": "user",839 "content": question840 }]841 )842843 return response.choices[0].message.content844845# Usage846rag = RAGPipeline()847rag.ingest_document(document_text, {'source': 'manual.pdf'})848answer = rag.query("How do I configure the system?")849```850851---852853## Resources854855- **Qdrant Docs:** https://qdrant.tech/documentation/856- **Pinecone Docs:** https://docs.pinecone.io/857- **OpenAI Embeddings:** https://platform.openai.com/docs/guides/embeddings858- **LangChain RAG:** https://python.langchain.com/docs/use_cases/question_answering/859- **Sentence Transformers:** https://www.sbert.net/860861---862863**Skill version:** 1.0.0864**Last updated:** 2025-10-25865**Maintained by:** Applied Artificial Intelligence866
Full transparency — inspect the skill content before installing.