How do I install RAG Implementation?

Install RAG Implementation with a single command: npx mdskills install applied-artificial-intelligence/rag-implementation. This downloads the skill files into your project and your AI agent picks them up automatically.
What platforms support RAG Implementation?

RAG Implementation works with Claude Code, Claude Desktop, Cursor, Vscode Copilot, Windsurf, Continue Dev, Codex, Gemini Cli, Amp, Roo Code, Goose, Opencode, Trae, Qodo, Command Code. Skills use the open SKILL.md format which is compatible with any AI coding agent that reads markdown instructions.
← Back to skills
RAG Implementation

Name: RAG Implementation: AI Agent Skill
Rating: 7 (1 reviews)
Author: applied-artificial-intelligence
Verified
SKILL + PLUGINAI & Machine LearningIntermediate
Comprehensive guide to implementing RAG systems including vector database selection, chunking strategies, embedding models, and retrieval optimization. Use when building RAG systems, implementing semantic search, optimizing retrieval quality, or debugging RAG performance issues.
by @applied-artificial-intelligence0Updated 2/20/2026
Add this skill
npx mdskills install applied-artificial-intelligence/rag-implementation
Fork & Edit
Skill Advisor7.0
Comprehensive RAG tutorial with examples, patterns, and trade-offs across all major tools
+Provides complete code examples for multiple RAG patterns and vector stores
+Covers advanced techniques like hybrid search, reranking, and contextual compression
+Includes evaluation metrics, best practices, and troubleshooting guidance
-Reads more like documentation than actionable agent instructions
-Lacks clear trigger conditions for when to apply specific patterns
SKILL.md
Edit in Browser
1---
2name: rag-implementation
3description: Comprehensive guide to implementing RAG systems including vector database selection, chunking strategies, embedding models, and retrieval optimization. Use when building RAG systems, implementing semantic search, optimizing retrieval quality, or debugging RAG performance issues.
4---
5 
6# RAG Implementation Patterns
7 
8Comprehensive guide to implementing Retrieval-Augmented Generation (RAG) systems including vector database selection, chunking strategies, embedding models, retrieval optimization, and production deployment patterns.
9 
10---
11 
12## Quick Reference
13 
14**When to use this skill:**
15- Building RAG/semantic search systems
16- Implementing document retrieval pipelines
17- Optimizing vector database performance
18- Debugging retrieval quality issues
19- Choosing between vector database options
20- Designing chunking strategies
21- Implementing hybrid search
22 
23**Technologies covered:**
24- Vector DBs: Qdrant, Pinecone, Chroma, Weaviate, Milvus
25- Embeddings: OpenAI, Sentence Transformers, Cohere
26- Frameworks: LangChain, LlamaIndex, Haystack
27 
28---
29 
30## Part 1: Vector Database Selection
31 
32### Database Comparison Matrix
33 
34| Database | Best For | Deployment | Performance | Cost |
35|----------|----------|------------|-------------|------|
36| **Qdrant** | Self-hosted, production | Docker/K8s | Excellent (Rust) | Free (self-host) |
37| **Pinecone** | Managed, rapid prototyping | Cloud | Excellent | Pay-per-use |
38| **Chroma** | Local development, embedded | In-process | Good (Python) | Free |
39| **Weaviate** | Complex schemas, GraphQL | Docker/Cloud | Excellent (Go) | Free + Cloud |
40| **Milvus** | Large-scale, distributed | K8s | Excellent (C++) | Free (self-host) |
41 
42### Qdrant Setup (Recommended for Production)
43 
44```python
45from qdrant_client import QdrantClient
46from qdrant_client.models import Distance, VectorParams, PointStruct
47 
48# Initialize client (local or cloud)
49client = QdrantClient(url="http://localhost:6333")  # or cloud URL
50 
51# Create collection
52client.create_collection(
53    collection_name="documents",
54    vectors_config=VectorParams(
55        size=1536,  # OpenAI text-embedding-3-small dimension
56        distance=Distance.COSINE  # or DOT, EUCLID
57    )
58)
59 
60# Insert vectors with payload
61client.upsert(
62    collection_name="documents",
63    points=[
64        PointStruct(
65            id=1,
66            vector=[0.1, 0.2, ...],  # 1536 dimensions
67            payload={
68                "text": "Document content",
69                "source": "doc.pdf",
70                "page": 1,
71                "metadata": {...}
72            }
73        )
74    ]
75)
76 
77# Search
78results = client.search(
79    collection_name="documents",
80    query_vector=[0.1, 0.2, ...],
81    limit=5,
82    score_threshold=0.7  # Minimum similarity
83)
84```
85 
86### Pinecone Setup (Managed Service)
87 
88```python
89from pinecone import Pinecone, ServerlessSpec
90 
91# Initialize
92pc = Pinecone(api_key="your-key")
93 
94# Create index
95pc.create_index(
96    name="documents",
97    dimension=1536,
98    metric="cosine",
99    spec=ServerlessSpec(cloud="aws", region="us-east-1")
100)
101 
102# Get index
103index = pc.Index("documents")
104 
105# Upsert vectors
106index.upsert(vectors=[
107    ("doc1", [0.1, 0.2, ...], {"text": "...", "source": "..."})
108])
109 
110# Query
111results = index.query(
112    vector=[0.1, 0.2, ...],
113    top_k=5,
114    include_metadata=True
115)
116```
117 
118---
119 
120## Part 2: Chunking Strategies
121 
122### Strategy 1: Fixed-Size Chunking (Simple, Fast)
123 
124```python
125def fixed_size_chunking(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
126    """
127    Split text into fixed-size chunks with overlap.
128 
129    Pros: Simple, predictable chunk sizes
130    Cons: May break mid-sentence, poor semantic boundaries
131    """
132    words = text.split()
133    chunks = []
134 
135    for i in range(0, len(words), chunk_size - overlap):
136        chunk = ' '.join(words[i:i + chunk_size])
137        chunks.append(chunk)
138 
139    return chunks
140 
141# Usage
142chunks = fixed_size_chunking(document, chunk_size=512, overlap=50)
143```
144 
145**When to use:**
146- Simple documents (logs, transcripts)
147- Prototyping/MVP
148- Consistent token budgets needed
149 
150### Strategy 2: Semantic Chunking (Better Quality)
151 
152```python
153from langchain.text_splitter import RecursiveCharacterTextSplitter
154 
155def semantic_chunking(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
156    """
157    Split on semantic boundaries (paragraphs, sentences).
158 
159    Pros: Preserves meaning, better retrieval quality
160    Cons: Variable chunk sizes, slower processing
161    """
162    splitter = RecursiveCharacterTextSplitter(
163        chunk_size=chunk_size,
164        chunk_overlap=overlap,
165        separators=["\n\n", "\n", ". ", " ", ""],  # Priority order
166        length_function=len
167    )
168 
169    return splitter.split_text(text)
170 
171# Usage
172chunks = semantic_chunking(document, chunk_size=1000, overlap=200)
173```
174 
175**When to use:**
176- Long-form documents (articles, books, reports)
177- Quality > speed
178- Natural language content
179 
180### Strategy 3: Hierarchical Chunking (Best for Structured Docs)
181 
182```python
183def hierarchical_chunking(document: dict) -> list[dict]:
184    """
185    Chunk based on document structure (sections, subsections).
186 
187    Pros: Preserves hierarchy, enables parent-child retrieval
188    Cons: Requires structured input, more complex
189    """
190    chunks = []
191 
192    for section in document['sections']:
193        # Parent chunk (section summary)
194        chunks.append({
195            'text': section['title'] + '\n' + section['summary'],
196            'type': 'parent',
197            'section_id': section['id']
198        })
199 
200        # Child chunks (paragraphs)
201        for para in section['paragraphs']:
202            chunks.append({
203                'text': para,
204                'type': 'child',
205                'parent_id': section['id']
206            })
207 
208    return chunks
209```
210 
211**When to use:**
212- Technical documentation
213- Books with TOC
214- Legal documents
215- Need to preserve context hierarchy
216 
217### Strategy 4: Sliding Window (Maximum Context Preservation)
218 
219```python
220def sliding_window_chunking(text: str, window_size: int = 512, stride: int = 256) -> list[str]:
221    """
222    Overlapping windows for maximum context.
223 
224    Pros: No information loss at boundaries
225    Cons: Storage overhead (duplicate content)
226    """
227    words = text.split()
228    chunks = []
229 
230    for i in range(0, len(words) - window_size + 1, stride):
231        chunk = ' '.join(words[i:i + window_size])
232        chunks.append(chunk)
233 
234    return chunks
235```
236 
237**When to use:**
238- Critical retrieval accuracy needed
239- Short queries need broader context
240- Storage cost not a concern
241 
242---
243 
244## Part 3: Embedding Models
245 
246### Model Selection Guide
247 
248| Model | Dimensions | Speed | Quality | Cost | Use Case |
249|-------|-----------|-------|---------|------|----------|
250| **OpenAI text-embedding-3-small** | 1536 | Fast | Excellent | $0.02/1M tokens | Production, general purpose |
251| **OpenAI text-embedding-3-large** | 3072 | Medium | Best | $0.13/1M tokens | High-quality retrieval |
252| **all-MiniLM-L6-v2** | 384 | Very fast | Good | Free | Self-hosted, prototyping |
253| **all-mpnet-base-v2** | 768 | Fast | Very good | Free | Self-hosted, quality |
254| **Cohere embed-english-v3.0** | 1024 | Fast | Excellent | $0.10/1M tokens | Semantic search focus |
255 
256### OpenAI Embeddings (Recommended)
257 
258```python
259from openai import OpenAI
260 
261client = OpenAI(api_key="your-key")
262 
263def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
264    """
265    Generate embeddings using OpenAI.
266 
267    Batch size: Up to 2048 inputs per request
268    Rate limits: Check tier limits
269    """
270    response = client.embeddings.create(
271        model=model,
272        input=texts
273    )
274 
275    return [item.embedding for item in response.data]
276 
277# Usage
278chunks = ["chunk 1", "chunk 2", ...]
279embeddings = get_embeddings(chunks)
280```
281 
282### Sentence Transformers (Self-Hosted)
283 
284```python
285from sentence_transformers import SentenceTransformer
286 
287# Load model (cached after first download)
288model = SentenceTransformer('all-MiniLM-L6-v2')
289 
290def get_embeddings_local(texts: list[str]) -> list[list[float]]:
291    """
292    Generate embeddings locally (no API costs).
293 
294    GPU recommended for batches > 100
295    CPU acceptable for small batches
296    """
297    return model.encode(texts, show_progress_bar=True).tolist()
298 
299# Usage
300embeddings = get_embeddings_local(chunks)
301```
302 
303---
304 
305## Part 4: Retrieval Optimization
306 
307### Technique 1: Hybrid Search (Dense + Sparse)
308 
309```python
310from qdrant_client.models import Filter, FieldCondition, MatchValue
311 
312def hybrid_search(query: str, query_vector: list[float], top_k: int = 10):
313    """
314    Combine dense (vector) and sparse (keyword) search.
315 
316    Dense: Semantic similarity
317    Sparse: Exact keyword matches
318    """
319    # Dense search
320    dense_results = client.search(
321        collection_name="documents",
322        query_vector=query_vector,
323        limit=top_k * 2  # Get more candidates
324    )
325 
326    # Sparse search (BM25 via metadata)
327    sparse_results = client.search(
328        collection_name="documents",
329        query_filter=Filter(
330            must=[
331                FieldCondition(
332                    key="text",
333                    match=MatchValue(value=query)
334                )
335            ]
336        ),
337        limit=top_k * 2
338    )
339 
340    # Merge and re-rank
341    combined = merge_results(dense_results, sparse_results, weights=(0.7, 0.3))
342    return combined[:top_k]
343```
344 
345### Technique 2: Query Expansion
346 
347```python
348def expand_query(query: str) -> list[str]:
349    """
350    Generate query variations for better recall.
351 
352    Techniques:
353    - Synonym expansion
354    - Question reformulation
355    - Entity extraction
356    """
357    from openai import OpenAI
358    client = OpenAI()
359 
360    response = client.chat.completions.create(
361        model="gpt-4",
362        messages=[{
363            "role": "system",
364            "content": "Generate 3 alternative phrasings of the user's query."
365        }, {
366            "role": "user",
367            "content": query
368        }]
369    )
370 
371    expanded = [query] + response.choices[0].message.content.split('\n')
372    return expanded
373 
374# Usage
375queries = expand_query("How to train neural networks?")
376# → ["How to train neural networks?",
377#    "What are neural network training techniques?",
378#    "Neural network optimization methods",
379#    "Deep learning model training"]
380```
381 
382### Technique 3: Reranking
383 
384```python
385from sentence_transformers import CrossEncoder
386 
387# Load cross-encoder (better than bi-encoder for reranking)
388reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
389 
390def rerank_results(query: str, results: list[dict], top_k: int = 5) -> list[dict]:
391    """
392    Rerank initial results using cross-encoder.
393 
394    More accurate but slower than initial retrieval
395    Use on top 20-50 candidates only
396    """
397    # Score each query-document pair
398    pairs = [(query, result['text']) for result in results]
399    scores = reranker.predict(pairs)
400 
401    # Combine scores with results
402    for result, score in zip(results, scores):
403        result['rerank_score'] = float(score)
404 
405    # Sort and return top_k
406    reranked = sorted(results, key=lambda x: x['rerank_score'], reverse=True)
407    return reranked[:top_k]
408```
409 
410### Technique 4: Metadata Filtering
411 
412```python
413def filtered_search(
414    query_vector: list[float],
415    filters: dict,
416    top_k: int = 5
417):
418    """
419    Filter search by metadata (date, category, author, etc.)
420 
421    Pre-filter: Faster but may miss results
422    Post-filter: More results but slower
423    """
424    from qdrant_client.models import Filter, FieldCondition, Range
425 
426    # Build filter conditions
427    conditions = []
428 
429    if 'date_range' in filters:
430        conditions.append(
431            FieldCondition(
432                key="date",
433                range=Range(
434                    gte=filters['date_range']['start'],
435                    lte=filters['date_range']['end']
436                )
437            )
438        )
439 
440    if 'category' in filters:
441        conditions.append(
442            FieldCondition(
443                key="category",
444                match=MatchValue(value=filters['category'])
445            )
446        )
447 
448    # Search with filters
449    results = client.search(
450        collection_name="documents",
451        query_vector=query_vector,
452        query_filter=Filter(must=conditions) if conditions else None,
453        limit=top_k
454    )
455 
456    return results
457```
458 
459---
460 
461## Part 5: Context Management
462 
463### Pattern 1: Retrieved Context Optimization
464 
465```python
466def optimize_context(query: str, retrieved_docs: list[str], max_tokens: int = 4000) -> str:
467    """
468    Optimize retrieved context to fit within LLM context window.
469 
470    Strategies:
471    1. Relevance-based truncation
472    2. Extractive summarization
473    3. Overlap removal
474    """
475    # Sort by relevance
476    sorted_docs = sorted(retrieved_docs, key=lambda d: d['score'], reverse=True)
477 
478    # Build context within token budget
479    context_parts = []
480    total_tokens = 0
481 
482    for doc in sorted_docs:
483        doc_tokens = estimate_tokens(doc['text'])
484 
485        if total_tokens + doc_tokens <= max_tokens:
486            context_parts.append(f"[Source: {doc['source']}]\n{doc['text']}")
487            total_tokens += doc_tokens
488        else:
489            # Truncate last document to fit
490            remaining = max_tokens - total_tokens
491            truncated = truncate_to_tokens(doc['text'], remaining)
492            context_parts.append(f"[Source: {doc['source']}]\n{truncated}")
493            break
494 
495    return "\n\n---\n\n".join(context_parts)
496```
497 
498### Pattern 2: Citation Tracking
499 
500```python
501def generate_with_citations(query: str, context: str, sources: list[dict]) -> dict:
502    """
503    Generate answer with citation tracking.
504 
505    Returns:
506    - answer: Generated text
507    - citations: List of source documents used
508    """
509    from openai import OpenAI
510    client = OpenAI()
511 
512    # Create source map
513    source_map = {i+1: source for i, source in enumerate(sources)}
514    numbered_context = "\n\n".join([
515        f"[{i+1}] {source['text']}"
516        for i, source in enumerate(sources)
517    ])
518 
519    response = client.chat.completions.create(
520        model="gpt-4",
521        messages=[{
522            "role": "system",
523            "content": "Answer using the provided sources. Cite sources as [1], [2], etc."
524        }, {
525            "role": "user",
526            "content": f"Context:\n{numbered_context}\n\nQuestion: {query}"
527        }]
528    )
529 
530    answer = response.choices[0].message.content
531 
532    # Extract citations from answer
533    import re
534    cited_nums = set(map(int, re.findall(r'\[(\d+)\]', answer)))
535    cited_sources = [source_map[num] for num in cited_nums if num in source_map]
536 
537    return {
538        'answer': answer,
539        'citations': cited_sources,
540        'num_sources_used': len(cited_sources)
541    }
542```
543 
544---
545 
546## Part 6: Production Best Practices
547 
548### Caching Strategy
549 
550```python
551from functools import lru_cache
552import hashlib
553 
554class EmbeddingCache:
555    """Cache embeddings to avoid recomputation."""
556 
557    def __init__(self, cache_size: int = 10000):
558        self.cache = {}
559        self.max_size = cache_size
560 
561    def get_or_compute(self, text: str, embed_fn) -> list[float]:
562        # Create cache key
563        key = hashlib.sha256(text.encode()).hexdigest()
564 
565        if key in self.cache:
566            return self.cache[key]
567 
568        # Compute and cache
569        embedding = embed_fn(text)
570 
571        if len(self.cache) >= self.max_size:
572            # Evict oldest (FIFO)
573            self.cache.pop(next(iter(self.cache)))
574 
575        self.cache[key] = embedding
576        return embedding
577 
578# Usage
579cache = EmbeddingCache()
580embedding = cache.get_or_compute(text, lambda t: get_embeddings([t])[0])
581```
582 
583### Async Processing
584 
585```python
586import asyncio
587from typing import List
588 
589async def process_documents_async(documents: List[str], batch_size: int = 100):
590    """
591    Process large document sets asynchronously.
592 
593    Benefits:
594    - 10-50x faster for I/O-bound operations
595    - Better resource utilization
596    - Scalable to millions of documents
597    """
598    async def process_batch(batch):
599        embeddings = await get_embeddings_async(batch)
600        await upsert_to_db_async(batch, embeddings)
601 
602    # Split into batches
603    batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]
604 
605    # Process batches concurrently
606    await asyncio.gather(*[process_batch(batch) for batch in batches])
607 
608# Usage
609asyncio.run(process_documents_async(documents))
610```
611 
612### Monitoring & Observability
613 
614```python
615import time
616from dataclasses import dataclass
617from datetime import datetime
618 
619@dataclass
620class RAGMetrics:
621    """Track RAG system performance."""
622    query_count: int = 0
623    avg_retrieval_time: float = 0.0
624    avg_generation_time: float = 0.0
625    cache_hit_rate: float = 0.0
626    avg_num_results: float = 0.0
627 
628class RAGMonitor:
629    def __init__(self):
630        self.metrics = RAGMetrics()
631        self.query_times = []
632 
633    def log_query(self, retrieval_time: float, generation_time: float, num_results: int):
634        self.metrics.query_count += 1
635        self.query_times.append({
636            'timestamp': datetime.now(),
637            'retrieval_time': retrieval_time,
638            'generation_time': generation_time,
639            'num_results': num_results
640        })
641 
642        # Update averages
643        self.metrics.avg_retrieval_time = sum(
644            q['retrieval_time'] for q in self.query_times
645        ) / len(self.query_times)
646 
647        self.metrics.avg_generation_time = sum(
648            q['generation_time'] for q in self.query_times
649        ) / len(self.query_times)
650 
651    def get_metrics(self) -> dict:
652        return {
653            'total_queries': self.metrics.query_count,
654            'avg_retrieval_ms': self.metrics.avg_retrieval_time * 1000,
655            'avg_generation_ms': self.metrics.avg_generation_time * 1000,
656            'p95_retrieval_ms': self._percentile([q['retrieval_time'] for q in self.query_times], 95) * 1000
657        }
658```
659 
660---
661 
662## Part 7: Common Pitfalls & Solutions
663 
664### Pitfall 1: Chunk Size Too Small/Large
665 
666**Problem:** Small chunks lack context, large chunks reduce retrieval precision
667 
668**Solution:**
669```python
670# Experiment with chunk sizes
671chunk_sizes = [256, 512, 1024, 2048]
672for size in chunk_sizes:
673    chunks = semantic_chunking(document, chunk_size=size)
674    # Evaluate retrieval quality
675    recall = evaluate_retrieval(chunks, test_queries)
676    print(f"Size {size}: Recall {recall:.2f}")
677 
678# Typical sweet spot: 512-1024 tokens
679```
680 
681### Pitfall 2: Poor Embedding Model Choice
682 
683**Problem:** Model not suited for domain (e.g., code search with general model)
684 
685**Solution:**
686```python
687# Use domain-specific models
688domain_models = {
689    'code': 'microsoft/codebert-base',
690    'medical': 'dmis-lab/biobert-v1.1',
691    'legal': 'nlpaueb/legal-bert-base-uncased',
692    'general': 'text-embedding-3-small'
693}
694 
695model = domain_models.get(your_domain, 'text-embedding-3-small')
696```
697 
698### Pitfall 3: No Query Optimization
699 
700**Problem:** User queries don't match document phrasing
701 
702**Solution:** Implement query expansion + rewriting
703```python
704def optimize_query(raw_query: str) -> str:
705    """Transform user query to better match documents."""
706    # Example: "how 2 train NN" → "neural network training methods"
707    # Use LLM to rewrite poorly-formed queries
708    pass
709```
710 
711### Pitfall 4: Ignoring Metadata
712 
713**Problem:** Returning irrelevant results due to lack of filtering
714 
715**Solution:** Always store rich metadata
716```python
717payload = {
718    'text': chunk,
719    'source': 'doc.pdf',
720    'page': 5,
721    'date': '2024-01-15',
722    'category': 'engineering',
723    'author': 'John Doe',
724    'confidence': 0.95  # Document quality score
725}
726```
727 
728---
729 
730## Quick Decision Trees
731 
732### "Which vector DB should I use?"
733 
734```
735Need managed service?
736  YES → Pinecone (easy) or Weaviate Cloud
737  NO  → Continue
738 
739Need distributed/high-scale?
740  YES → Milvus or Weaviate
741  NO  → Continue
742 
743Self-hosting on Docker?
744  YES → Qdrant (best performance/features)
745  NO  → Chroma (embedded, simple)
746```
747 
748### "Which chunking strategy?"
749 
750```
751Document type?
752  Structured (docs, books) → Hierarchical chunking
753  Unstructured (chat, logs) → Fixed-size chunking
754  Mixed → Semantic chunking
755 
756Quality requirement?
757  Critical → Sliding window (overlap 50%)
758  Standard → Semantic (overlap 20%)
759  Fast/cheap → Fixed-size (overlap 10%)
760```
761 
762### "Which embedding model?"
763 
764```
765Budget?
766  No limits → text-embedding-3-large
767  Cost-sensitive → all-mpnet-base-v2 (self-hosted)
768 
769Quality requirement?
770  Best → text-embedding-3-large
771  Good → text-embedding-3-small or Cohere
772  Acceptable → all-MiniLM-L6-v2
773```
774 
775---
776 
777## Example: Complete RAG Pipeline
778 
779```python
780from qdrant_client import QdrantClient
781from openai import OpenAI
782from langchain.text_splitter import RecursiveCharacterTextSplitter
783 
784class RAGPipeline:
785    def __init__(self):
786        self.qdrant = QdrantClient(url="http://localhost:6333")
787        self.openai = OpenAI()
788        self.splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
789 
790    def ingest_document(self, text: str, metadata: dict):
791        """Ingest and index a document."""
792        # 1. Chunk
793        chunks = self.splitter.split_text(text)
794 
795        # 2. Embed
796        embeddings = self.openai.embeddings.create(
797            model="text-embedding-3-small",
798            input=chunks
799        ).data
800 
801        # 3. Store
802        points = [
803            PointStruct(
804                id=i,
805                vector=emb.embedding,
806                payload={'text': chunk, **metadata}
807            )
808            for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
809        ]
810 
811        self.qdrant.upsert(collection_name="docs", points=points)
812 
813    def query(self, question: str, top_k: int = 5) -> str:
814        """Query with RAG."""
815        # 1. Embed query
816        query_emb = self.openai.embeddings.create(
817            model="text-embedding-3-small",
818            input=[question]
819        ).data[0].embedding
820 
821        # 2. Retrieve
822        results = self.qdrant.search(
823            collection_name="docs",
824            query_vector=query_emb,
825            limit=top_k
826        )
827 
828        # 3. Build context
829        context = "\n\n".join([r.payload['text'] for r in results])
830 
831        # 4. Generate
832        response = self.openai.chat.completions.create(
833            model="gpt-4",
834            messages=[{
835                "role": "system",
836                "content": f"Answer based on this context:\n{context}"
837            }, {
838                "role": "user",
839                "content": question
840            }]
841        )
842 
843        return response.choices[0].message.content
844 
845# Usage
846rag = RAGPipeline()
847rag.ingest_document(document_text, {'source': 'manual.pdf'})
848answer = rag.query("How do I configure the system?")
849```
850 
851---
852 
853## Resources
854 
855- **Qdrant Docs:** https://qdrant.tech/documentation/
856- **Pinecone Docs:** https://docs.pinecone.io/
857- **OpenAI Embeddings:** https://platform.openai.com/docs/guides/embeddings
858- **LangChain RAG:** https://python.langchain.com/docs/use_cases/question_answering/
859- **Sentence Transformers:** https://www.sbert.net/
860 
861---
862 
863**Skill version:** 1.0.0
864**Last updated:** 2025-10-25
865**Maintained by:** Applied Artificial Intelligence
866
Full transparency — inspect the skill content before installing.