Select and optimize embedding models for semantic search and RAG applications. Use when choosing embedding models, implementing chunking strategies, or optimizing embedding quality for specific domains.
Add this skill:

```
npx mdskills install sickn33/embedding-strategies
```

Comprehensive embedding guide with practical templates for multiple models and chunking strategies.
---
name: embedding-strategies
description: Select and optimize embedding models for semantic search and RAG applications. Use when choosing embedding models, implementing chunking strategies, or optimizing embedding quality for specific domains.
---

# Embedding Strategies

Guide to selecting and optimizing embedding models for vector search applications.

## Do not use this skill when

- The task is unrelated to embedding strategies
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Choosing embedding models for RAG
- Optimizing chunking strategies
- Fine-tuning embeddings for domains
- Comparing embedding model performance
- Reducing embedding dimensions
- Handling multilingual content

## Core Concepts

### 1. Embedding Model Comparison

| Model | Dimensions | Max Tokens | Best For |
|-------|------------|------------|----------|
| **text-embedding-3-large** | 3072 | 8191 | High accuracy |
| **text-embedding-3-small** | 1536 | 8191 | Cost-effective |
| **voyage-2** | 1024 | 4000 | Code, legal |
| **bge-large-en-v1.5** | 1024 | 512 | Open source |
| **all-MiniLM-L6-v2** | 384 | 256 | Fast, lightweight |
| **multilingual-e5-large** | 1024 | 512 | Multi-language |

### 2. Embedding Pipeline

```
Document → Chunking → Preprocessing → Embedding Model → Vector
               ↓              ↓               ↓
        [Overlap, Size] [Clean, Normalize] [API/Local]
```

## Templates

### Template 1: OpenAI Embeddings

```python
from openai import OpenAI
from typing import List, Optional

client = OpenAI()


def get_embeddings(
    texts: List[str],
    model: str = "text-embedding-3-small",
    dimensions: Optional[int] = None
) -> List[List[float]]:
    """Get embeddings from OpenAI, batching large lists."""
    batch_size = 100
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        kwargs = {"input": batch, "model": model}
        if dimensions:
            kwargs["dimensions"] = dimensions

        response = client.embeddings.create(**kwargs)
        all_embeddings.extend(item.embedding for item in response.data)

    return all_embeddings


def get_embedding(text: str, **kwargs) -> List[float]:
    """Get a single embedding."""
    return get_embeddings([text], **kwargs)[0]


def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
    """Get an embedding with reduced dimensions (Matryoshka truncation)."""
    return get_embedding(
        text,
        model="text-embedding-3-small",
        dimensions=dimensions
    )
```

### Template 2: Local Embeddings with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
from typing import List
import numpy as np


class LocalEmbedder:
    """Local embedding with sentence-transformers."""

    def __init__(
        self,
        model_name: str = "BAAI/bge-large-en-v1.5",
        device: str = "cuda"
    ):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name, device=device)

    def embed(
        self,
        texts: List[str],
        normalize: bool = True,
        show_progress: bool = False
    ) -> np.ndarray:
        """Embed texts with optional normalization."""
        return self.model.encode(
            texts,
            normalize_embeddings=normalize,
            show_progress_bar=show_progress,
            convert_to_numpy=True
        )

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a query with a BGE-style prefix."""
        # BGE models benefit from a retrieval instruction on the query side
        if "bge" in self.model_name.lower():
            query = f"Represent this sentence for searching relevant passages: {query}"
        return self.embed([query])[0]

    def embed_documents(self, documents: List[str]) -> np.ndarray:
        """Embed documents for indexing."""
        return self.embed(documents)


# E5 models expect "query:" / "passage:" instruction prefixes
class E5Embedder:
    def __init__(self, model_name: str = "intfloat/multilingual-e5-large"):
        self.model = SentenceTransformer(model_name)

    def embed_query(self, query: str) -> np.ndarray:
        return self.model.encode(f"query: {query}")

    def embed_document(self, document: str) -> np.ndarray:
        return self.model.encode(f"passage: {document}")
```

### Template 3: Chunking Strategies

```python
from typing import List, Tuple
import re


def chunk_by_tokens(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    tokenizer=None
) -> List[str]:
    """Chunk text by token count with overlap."""
    import tiktoken
    tokenizer = tokenizer or tiktoken.get_encoding("cl100k_base")

    tokens = tokenizer.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(tokenizer.decode(tokens[start:end]))
        start = end - chunk_overlap

    return chunks


def chunk_by_sentences(
    text: str,
    max_chunk_size: int = 1000,
    min_chunk_size: int = 100
) -> List[str]:
    """Chunk text by sentences, respecting size limits."""
    import nltk
    sentences = nltk.sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        sentence_size = len(sentence)

        if current_size + sentence_size > max_chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_size = 0

        current_chunk.append(sentence)
        current_size += sentence_size

    if current_chunk:
        tail = " ".join(current_chunk)
        # Merge an undersized final chunk into the previous one
        if chunks and len(tail) < min_chunk_size:
            chunks[-1] = f"{chunks[-1]} {tail}"
        else:
            chunks.append(tail)

    return chunks


def chunk_by_semantic_sections(
    text: str,
    headers_pattern: str = r'^#{1,3}\s+.+$'
) -> List[Tuple[str, str]]:
    """Chunk markdown by headers, preserving hierarchy."""
    lines = text.split('\n')
    chunks = []
    current_header = ""
    current_content = []

    for line in lines:
        if re.match(headers_pattern, line):
            if current_content:
                chunks.append((current_header, '\n'.join(current_content)))
            current_header = line
            current_content = []
        else:
            current_content.append(line)

    if current_content:
        chunks.append((current_header, '\n'.join(current_content)))

    return chunks


def recursive_character_splitter(
    text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    separators: List[str] = None
) -> List[str]:
    """LangChain-style recursive splitter."""
    separators = separators or ["\n\n", "\n", ". ", " ", ""]

    def split_text(text: str, separators: List[str]) -> List[str]:
        if not text:
            return []

        separator = separators[0]
        remaining_separators = separators[1:]

        if separator == "":
            # Character-level split
            return [
                text[i:i + chunk_size]
                for i in range(0, len(text), chunk_size - chunk_overlap)
            ]

        splits = text.split(separator)
        chunks = []
        current_chunk = []
        current_length = 0

        for split in splits:
            split_length = len(split) + len(separator)

            if current_length + split_length > chunk_size and current_chunk:
                chunk_text = separator.join(current_chunk)

                # Recursively split if still too large
                if len(chunk_text) > chunk_size and remaining_separators:
                    chunks.extend(split_text(chunk_text, remaining_separators))
                else:
                    chunks.append(chunk_text)

                # Start new chunk with overlap
                overlap_splits = []
                overlap_length = 0
                for s in reversed(current_chunk):
                    if overlap_length + len(s) <= chunk_overlap:
                        overlap_splits.insert(0, s)
                        overlap_length += len(s)
                    else:
                        break
                current_chunk = overlap_splits
                current_length = overlap_length

            current_chunk.append(split)
            current_length += split_length

        if current_chunk:
            chunks.append(separator.join(current_chunk))

        return chunks

    return split_text(text, separators)
```

### Template 4: Domain-Specific Embedding Pipeline

```python
import re
from typing import List, Optional


class DomainEmbeddingPipeline:
    """Pipeline for domain-specific embeddings."""

    def __init__(
        self,
        embedding_model: str = "text-embedding-3-small",
        chunk_size: int = 512,
        chunk_overlap: int = 50,
        preprocessing_fn=None
    ):
        self.embedding_model = embedding_model
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.preprocess = preprocessing_fn or self._default_preprocess

    def _default_preprocess(self, text: str) -> str:
        """Default preprocessing."""
        # Collapse excessive whitespace
        text = re.sub(r'\s+', ' ', text)
        # Remove special characters
        text = re.sub(r'[^\w\s.,!?-]', '', text)
        return text.strip()

    def process_documents(
        self,
        documents: List[dict],
        id_field: str = "id",
        content_field: str = "content",
        metadata_fields: Optional[List[str]] = None
    ) -> List[dict]:
        """Process documents for vector storage.

        Uses chunk_by_tokens (Template 3) and get_embeddings (Template 1).
        """
        processed = []

        for doc in documents:
            content = doc[content_field]
            doc_id = doc[id_field]

            # Preprocess
            cleaned = self.preprocess(content)

            # Chunk
            chunks = chunk_by_tokens(
                cleaned,
                self.chunk_size,
                self.chunk_overlap
            )

            # Create embeddings
            embeddings = get_embeddings(chunks, self.embedding_model)

            # Create records
            for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
                record = {
                    "id": f"{doc_id}_chunk_{i}",
                    "document_id": doc_id,
                    "chunk_index": i,
                    "text": chunk,
                    "embedding": embedding
                }

                # Add metadata
                if metadata_fields:
                    for field in metadata_fields:
                        if field in doc:
                            record[field] = doc[field]

                processed.append(record)

        return processed


# Code-specific pipeline
class CodeEmbeddingPipeline:
    """Specialized pipeline for code embeddings."""

    def __init__(self, model: str = "voyage-code-2"):
        self.model = model

    def chunk_code(self, code: str, language: str) -> List[dict]:
        """Chunk code by functions/classes (sketch).

        Parse with tree-sitter, extract functions, classes, and methods,
        and return chunks with surrounding context.
        """
        raise NotImplementedError

    def embed_with_context(self, chunk: str, context: str) -> List[float]:
        """Embed code with surrounding context."""
        combined = f"Context: {context}\n\nCode:\n{chunk}"
        return get_embedding(combined, model=self.model)
```

### Template 5: Embedding Quality Evaluation

```python
import numpy as np
from typing import List


def evaluate_retrieval_quality(
    queries: List[str],
    relevant_docs: List[List[str]],   # relevant doc IDs per query
    retrieved_docs: List[List[str]],  # retrieved doc IDs per query
    k: int = 10
) -> dict:
    """Evaluate embedding quality for retrieval."""

    def precision_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        retrieved_k = retrieved[:k]
        relevant_retrieved = len(set(retrieved_k) & relevant)
        return relevant_retrieved / k

    def recall_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        retrieved_k = retrieved[:k]
        relevant_retrieved = len(set(retrieved_k) & relevant)
        return relevant_retrieved / len(relevant) if relevant else 0.0

    def mrr(relevant: set, retrieved: List[str]) -> float:
        for i, doc in enumerate(retrieved):
            if doc in relevant:
                return 1 / (i + 1)
        return 0.0

    def ndcg_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        dcg = sum(
            1 / np.log2(i + 2) if doc in relevant else 0
            for i, doc in enumerate(retrieved[:k])
        )
        ideal_dcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
        return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

    metrics = {
        f"precision@{k}": [],
        f"recall@{k}": [],
        "mrr": [],
        f"ndcg@{k}": []
    }

    for relevant, retrieved in zip(relevant_docs, retrieved_docs):
        relevant_set = set(relevant)
        metrics[f"precision@{k}"].append(precision_at_k(relevant_set, retrieved, k))
        metrics[f"recall@{k}"].append(recall_at_k(relevant_set, retrieved, k))
        metrics["mrr"].append(mrr(relevant_set, retrieved))
        metrics[f"ndcg@{k}"].append(ndcg_at_k(relevant_set, retrieved, k))

    return {name: float(np.mean(values)) for name, values in metrics.items()}


def compute_embedding_similarity(
    embeddings1: np.ndarray,
    embeddings2: np.ndarray,
    metric: str = "cosine"
) -> np.ndarray:
    """Compute a similarity matrix between two embedding sets."""
    if metric == "cosine":
        # Normalize, then a dot product equals cosine similarity
        norm1 = embeddings1 / np.linalg.norm(embeddings1, axis=1, keepdims=True)
        norm2 = embeddings2 / np.linalg.norm(embeddings2, axis=1, keepdims=True)
        return norm1 @ norm2.T
    elif metric == "euclidean":
        from scipy.spatial.distance import cdist
        # Negate distance so that larger means more similar
        return -cdist(embeddings1, embeddings2, metric='euclidean')
    elif metric == "dot":
        return embeddings1 @ embeddings2.T
    else:
        raise ValueError(f"Unknown metric: {metric}")
```

## Best Practices

### Do's

- **Match model to use case** - Code vs prose vs multilingual
- **Chunk thoughtfully** - Preserve semantic boundaries
- **Normalize embeddings** - For cosine similarity
- **Batch requests** - More efficient than one-by-one
- **Cache embeddings** - Avoid recomputing

### Don'ts

- **Don't ignore token limits** - Truncation loses info
- **Don't mix embedding models** - Incompatible spaces
- **Don't skip preprocessing** - Garbage in, garbage out
- **Don't over-chunk** - Lose context

## Resources

- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
- [Sentence Transformers](https://www.sbert.net/)
- [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard)
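## Worked Example: Retrieval Metrics by Hand

The retrieval metrics in Template 5 are easy to sanity-check on a tiny example. This stdlib-only sketch (the document IDs are made up for illustration) recomputes precision@k, recall@k, and MRR for a single query without numpy:

```python
def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    # Fraction of the top-k retrieved IDs that are relevant
    return len(set(retrieved[:k]) & relevant) / k


def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    # Fraction of all relevant IDs that appear in the top-k
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0


def mrr(relevant: set, retrieved: list) -> float:
    # Reciprocal rank of the first relevant hit
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1 / (i + 1)
    return 0.0


relevant = {"d1", "d4"}
retrieved = ["d3", "d1", "d9", "d4", "d7"]

print(precision_at_k(relevant, retrieved, 5))  # 2 of 5 relevant → 0.4
print(recall_at_k(relevant, retrieved, 5))     # both relevant found → 1.0
print(mrr(relevant, retrieved))                # first hit at rank 2 → 0.5
```

Hand-checking a few queries like this before running a full evaluation catches off-by-one errors in ranking code early.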
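## Worked Example: Why Normalize Embeddings

The "normalize embeddings" best practice can also be demonstrated in plain Python: after L2-normalization, a bare dot product returns the same score as full cosine similarity, which is why vector stores can use cheap dot-product search over normalized vectors. The 2-D vectors here are toy values, not real embeddings:

```python
import math


def normalize(v: list) -> list:
    # Scale to unit length (L2 norm = 1)
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list, b: list) -> float:
    # Full cosine similarity: dot product over the product of norms
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]

# Dot product of normalized vectors equals cosine of the originals
print(round(cosine(a, b), 6))                     # 0.96
print(round(dot(normalize(a), normalize(b)), 6))  # 0.96
```

This is the same identity the `cosine` branch of `compute_embedding_similarity` in Template 5 relies on.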
Full transparency — inspect the skill content before installing.