Install: `npx mdskills install applied-artificial-intelligence/llm-evaluation`

Comprehensive evaluation framework with metrics, testing patterns, and statistical validation.
---
name: llm-evaluation
description: LLM evaluation and testing patterns including prompt testing, hallucination detection, benchmark creation, and quality metrics. Use when testing LLM applications, validating prompt quality, implementing systematic evaluation, or measuring LLM performance.
---

# LLM Evaluation & Testing

Comprehensive guide to evaluating and testing LLM applications, covering prompt testing, output validation, hallucination detection, benchmark creation, A/B testing, and quality metrics.

---

## Quick Reference

**When to use this skill:**
- Testing LLM application outputs
- Validating prompt quality and consistency
- Detecting hallucinations and factual errors
- Creating evaluation benchmarks
- A/B testing prompts or models
- Implementing continuous evaluation (CI/CD)
- Measuring retrieval quality (for RAG)
- Debugging unexpected LLM behavior

**Metrics covered:**
- Traditional: BLEU, ROUGE, BERTScore, Perplexity
- LLM-as-Judge: GPT-4 evaluation, rubric-based scoring
- Task-specific: Exact match, F1, accuracy, recall
- Quality: Toxicity, bias, coherence, relevance

---

## Part 1: Evaluation Fundamentals

### The LLM Evaluation Challenge

**Why LLM evaluation is hard:**
1. **Subjective quality** - "Good" output varies by use case
2. **No single ground truth** - Multiple valid answers exist
3. **Context-dependent** - The same output can be good or bad in different scenarios
4. **Expensive to label** - Human evaluation doesn't scale
5. **Adversarial brittleness** - Small prompt changes can cause large output changes

**Solution: Multi-layered evaluation**
```
Layer 1: Automated Metrics (fast, scalable)
        ↓
Layer 2: LLM-as-Judge (flexible, nuanced)
        ↓
Layer 3: Human Review (gold standard, expensive)
```

### Evaluation Dataset Structure

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalExample:
    """Single evaluation example."""
    input: str                      # User input / prompt
    expected_output: Optional[str]  # Gold standard (if one exists)
    context: Optional[str]          # Additional context (for RAG)
    metadata: dict                  # Category, difficulty, etc.

@dataclass
class EvalResult:
    """Evaluation result for one example."""
    example_id: str
    actual_output: str
    scores: dict                    # {'metric_name': score}
    passed: bool
    failure_reason: Optional[str]

# Example dataset
eval_dataset = [
    EvalExample(
        input="What is the capital of France?",
        expected_output="Paris",
        context=None,
        metadata={'category': 'factual', 'difficulty': 'easy'}
    ),
    EvalExample(
        input="Explain quantum entanglement",
        expected_output=None,  # No single correct answer
        context=None,
        metadata={'category': 'explanation', 'difficulty': 'hard'}
    )
]
```

---

## Part 2: Traditional Metrics

### Metric 1: Exact Match (Simplest)

```python
def exact_match(predicted: str, expected: str, case_sensitive: bool = False) -> float:
    """
    Binary metric: 1.0 if match, 0.0 otherwise.

    Use for: Classification, short answers, structured output
    Limitations: Too strict for generation tasks
    """
    predicted = predicted.strip()
    expected = expected.strip()
    if not case_sensitive:
        predicted = predicted.lower()
        expected = expected.lower()

    return 1.0 if predicted == expected else 0.0

# Example
score = exact_match("Paris", "paris")                 # 1.0
score = exact_match("The capital is Paris", "Paris")  # 0.0
```

### Metric 2: ROUGE (Recall-Oriented)

```python
from rouge_score import rouge_scorer

def compute_rouge(predicted: str, expected: str) -> dict:
    """
    ROUGE metrics for text overlap.

    ROUGE-1: Unigram overlap
    ROUGE-2: Bigram overlap
    ROUGE-L: Longest common subsequence

    Use for: Summarization, translation
    Limitations: Doesn't capture semantics
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(expected, predicted)  # score(target, prediction)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Example
scores = compute_rouge(
    predicted="Paris is the capital of France",
    expected="The capital of France is Paris"
)
# {'rouge1': 0.82, 'rouge2': 0.67, 'rougeL': 0.82}
```

### Metric 3: BERTScore (Semantic Similarity)

```python
from typing import List

from bert_score import score as bert_score

def compute_bertscore(predicted: List[str], expected: List[str]) -> dict:
    """
    Semantic similarity using BERT embeddings.

    Better than ROUGE for:
    - Paraphrases
    - Semantic equivalence
    - Generation quality

    Returns: Precision, Recall, F1
    """
    P, R, F1 = bert_score(predicted, expected, lang="en", verbose=False)

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }

# Example
scores = compute_bertscore(
    predicted=["The capital of France is Paris"],
    expected=["Paris is France's capital city"]
)
# {'precision': 0.94, 'recall': 0.91, 'f1': 0.92}
```

### Metric 4: Perplexity (Fluency)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def compute_perplexity(text: str, model_name: str = "gpt2") -> float:
    """
    Perplexity: How "surprised" is the model by this text?

    Lower = More likely/fluent
    Use for: Fluency, naturalness
    Limitations: Doesn't measure correctness
    """
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    perplexity = torch.exp(loss).item()
    return perplexity

# Example
ppl = compute_perplexity("Paris is the capital of France")   # Low (fluent)
ppl2 = compute_perplexity("Capital France the is Paris of")  # High (awkward)
```

---

## Part 3: LLM-as-Judge Evaluation

### Pattern 1: Rubric-Based Scoring

```python
import json

from openai import OpenAI

client = OpenAI()

EVALUATION_PROMPT = """
You are an expert evaluator. Score the assistant's response on a scale of 1-5 for each criterion:

**Criteria:**
1. **Accuracy**: Is the information factually correct?
2. **Completeness**: Does it fully answer the question?
3. **Clarity**: Is it easy to understand?
4. **Conciseness**: Is it appropriately brief?

**Response to evaluate:**
{response}

**Expected answer (reference):**
{expected}

Provide scores in JSON format:
{{
    "accuracy": <1-5>,
    "completeness": <1-5>,
    "clarity": <1-5>,
    "conciseness": <1-5>,
    "reasoning": "Brief explanation"
}}
"""

def llm_judge_score(response: str, expected: str) -> dict:
    """
    Use an LLM as judge with rubric scoring.

    Pros: Flexible, nuanced, scales well
    Cons: Costs money, potential bias, slower
    """
    prompt = EVALUATION_PROMPT.format(response=response, expected=expected)

    completion = client.chat.completions.create(
        model="gpt-4o",  # JSON mode requires a JSON-mode-capable model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    scores = json.loads(completion.choices[0].message.content)
    return scores

# Example
scores = llm_judge_score(
    response="Paris is the capital of France, located in the north-central part of the country.",
    expected="Paris"
)
# {'accuracy': 5, 'completeness': 5, 'clarity': 5, 'conciseness': 3, 'reasoning': '...'}
```

### Pattern 2: Binary Pass/Fail Evaluation

```python
PASS_FAIL_PROMPT = """
Evaluate if the assistant's response is acceptable.

**Question:** {question}
**Response:** {response}
**Criteria:** {criteria}

Return ONLY "PASS" or "FAIL" followed by a one-sentence reason.
"""

def binary_eval(question: str, response: str, criteria: str) -> tuple[bool, str]:
    """
    Simple pass/fail evaluation.

    Use for: Unit tests, regression tests, CI/CD
    """
    prompt = PASS_FAIL_PROMPT.format(
        question=question,
        response=response,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0  # Minimize randomness (not fully deterministic)
    )

    result = completion.choices[0].message.content
    passed = result.startswith("PASS")
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return passed, reason

# Example
passed, reason = binary_eval(
    question="What is the capital of France?",
    response="The capital is Paris",
    criteria="Response must mention Paris"
)
# (True, "Response correctly identifies Paris as the capital")
```

### Pattern 3: Pairwise Comparison (A/B Testing)

```python
PAIRWISE_PROMPT = """
Compare two responses to the same question. Which is better?

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

**Criteria:** {criteria}

Return ONLY: "A", "B", or "TIE", followed by a one-sentence explanation.
"""

def pairwise_comparison(
    question: str,
    response_a: str,
    response_b: str,
    criteria: str = "Overall quality, accuracy, and helpfulness"
) -> tuple[str, str]:
    """
    A/B test two responses.

    Use for: Prompt engineering, model comparison
    """
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        response_a=response_a,
        response_b=response_b,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )

    result = completion.choices[0].message.content
    winner = result.split()[0].strip(".,:")  # "A", "B", or "TIE"
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return winner, reason

# Example
winner, reason = pairwise_comparison(
    question="Explain quantum computing",
    response_a="Quantum computers use qubits instead of bits...",
    response_b="Quantum computing is complex. It uses quantum mechanics."
)
# ("A", "Response A provides more detail and explanation")
```

---

## Part 4: Hallucination Detection

### Method 1: Grounding Check

```python
def check_grounding(response: str, context: str) -> dict:
    """
    Verify the response is grounded in the provided context.

    Critical for RAG systems.
    """
    GROUNDING_PROMPT = """
    Context: {context}

    Response: {response}

    Is the response fully supported by the context? Answer with:
    - "GROUNDED": All claims supported
    - "PARTIALLY_GROUNDED": Some claims unsupported
    - "NOT_GROUNDED": Contains unsupported claims

    List any unsupported claims.
    """

    prompt = GROUNDING_PROMPT.format(context=context, response=response)

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    result = completion.choices[0].message.content
    status = result.split("\n")[0].strip()
    unsupported = [line for line in result.split("\n")[1:] if line.strip()]

    return {
        'grounding_status': status,
        'unsupported_claims': unsupported,
        'is_hallucination': status != "GROUNDED"
    }
```

### Method 2: Factuality Check (External Verification)

```python
def check_factuality(claim: str, use_search: bool = True) -> dict:
    """
    Verify factual claims using external sources.

    Options:
    1. Web search + verification
    2. Knowledge base lookup
    3. Cross-reference with a trusted source
    """
    if not use_search:
        raise NotImplementedError("Only web-search verification is implemented here")

    # Use web search to gather evidence
    from tavily import TavilyClient
    tavily = TavilyClient(api_key="your-key")

    # Search for evidence (the response dict contains a 'results' list)
    results = tavily.search(claim, max_results=3)["results"]

    # Ask the LLM to verify the claim against the search results
    VERIFY_PROMPT = """
    Claim: {claim}

    Search results:
    {results}

    Is the claim supported by these sources? Answer: TRUE, FALSE, or UNCERTAIN.
    Explanation:
    """

    prompt = VERIFY_PROMPT.format(
        claim=claim,
        results="\n\n".join([r['content'] for r in results])
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    result = completion.choices[0].message.content
    is_factual = result.strip().startswith("TRUE")

    return {
        'claim': claim,
        'factual': is_factual,
        'evidence': results,
        'explanation': result
    }
```

### Method 3: Self-Consistency Check

```python
def self_consistency_check(question: str, num_samples: int = 5) -> dict:
    """
    Generate multiple responses and check for consistency.

    If the model is confident, responses should be consistent.
    Inconsistency suggests hallucination risk.
    """
    responses = []

    for _ in range(num_samples):
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0.7  # Some randomness
        )
        responses.append(completion.choices[0].message.content)

    # Compute pairwise similarity
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(responses)
    similarities = cosine_similarity(vectors)

    # Average pairwise similarity (exclude the diagonal of self-similarities)
    n = len(responses)
    avg_similarity = (similarities.sum() - n) / (n * (n - 1))

    return {
        'responses': responses,
        'avg_similarity': avg_similarity,
        'is_consistent': avg_similarity > 0.7,  # Threshold
        'confidence': 'high' if avg_similarity > 0.85 else 'medium' if avg_similarity > 0.7 else 'low'
    }
```

---

## Part 5: RAG-Specific Evaluation

### Retrieval Quality Metrics

```python
def evaluate_retrieval(query: str, retrieved_docs: List[dict], relevant_doc_ids: List[str]) -> dict:
    """
    Evaluate retrieval quality using IR metrics.

    Precision: What % of retrieved docs are relevant?
    Recall: What % of relevant docs were retrieved?
    MRR: Mean Reciprocal Rank
    NDCG: Normalized Discounted Cumulative Gain (not computed in this sketch)
    """
    retrieved_ids = [doc['id'] for doc in retrieved_docs]

    # Precision
    true_positives = len(set(retrieved_ids) & set(relevant_doc_ids))
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0

    # Recall
    recall = true_positives / len(relevant_doc_ids) if relevant_doc_ids else 0.0

    # F1
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    # MRR (Mean Reciprocal Rank): reciprocal rank of the first relevant doc
    mrr = 0.0
    for i, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_doc_ids:
            mrr = 1.0 / i
            break

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'mrr': mrr,
        'num_retrieved': len(retrieved_ids),
        'num_relevant_retrieved': true_positives
    }
```

### End-to-End RAG Evaluation

```python
def evaluate_rag_pipeline(
    question: str,
    generated_answer: str,
    retrieved_docs: List[dict],
    ground_truth: str,
    relevant_doc_ids: List[str]
) -> dict:
    """
    Comprehensive RAG evaluation.

    1. Retrieval quality (precision, recall)
    2. Answer quality (ROUGE, BERTScore)
    3. Answer grounding (hallucination check)
    4. Citation accuracy
    """
    # 1. Retrieval metrics
    retrieval_scores = evaluate_retrieval(question, retrieved_docs, relevant_doc_ids)

    # 2. Answer quality
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])

    rouge_scores = compute_rouge(generated_answer, ground_truth)
    bert_scores = compute_bertscore([generated_answer], [ground_truth])

    # 3. Grounding check
    grounding = check_grounding(generated_answer, context)

    # 4. LLM-as-judge overall quality
    judge_scores = llm_judge_score(generated_answer, ground_truth)

    return {
        'retrieval': retrieval_scores,
        'answer_quality': {
            'rouge': rouge_scores,
            'bertscore': bert_scores
        },
        'grounding': grounding,
        'llm_judge': judge_scores,
        'overall_pass': (
            retrieval_scores['f1'] > 0.5 and
            grounding['grounding_status'] == "GROUNDED" and
            judge_scores['accuracy'] >= 4
        )
    }
```

---

## Part 6: Prompt Testing Frameworks

### Framework 1: Regression Test Suite

```python
class PromptTestSuite:
    """
    Unit tests for prompts (like pytest for LLMs).
    """

    def __init__(self):
        self.tests = []
        self.results = []

    def add_test(self, name: str, input: str, criteria: str):
        """Add a test case."""
        self.tests.append({
            'name': name,
            'input': input,
            'criteria': criteria
        })

    def run(self, generate_fn):
        """Run all tests with the given generation function."""
        for test in self.tests:
            response = generate_fn(test['input'])
            passed, reason = binary_eval(
                question=test['input'],
                response=response,
                criteria=test['criteria']
            )

            self.results.append({
                'test_name': test['name'],
                'passed': passed,
                'reason': reason,
                'response': response
            })

        return self.results

    def summary(self) -> dict:
        """Get test summary."""
        total = len(self.results)
        passed = sum(1 for r in self.results if r['passed'])

        return {
            'total_tests': total,
            'passed': passed,
            'failed': total - passed,
            'pass_rate': passed / total if total > 0 else 0.0
        }

# Usage
suite = PromptTestSuite()
suite.add_test("capital_france", "What is the capital of France?", "Must mention Paris")
suite.add_test("capital_germany", "What is the capital of Germany?", "Must mention Berlin")

def my_generate(prompt):
    # Your LLM call
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

results = suite.run(my_generate)
print(suite.summary())
# {'total_tests': 2, 'passed': 2, 'failed': 0, 'pass_rate': 1.0}
```

### Framework 2: A/B Testing Framework

```python
class ABTest:
    """
    A/B test prompts or models.
    """

    def __init__(self, test_cases: List[dict]):
        self.test_cases = test_cases
        self.results = []

    def run(self, generate_a, generate_b):
        """Compare two generation functions."""
        for test in self.test_cases:
            response_a = generate_a(test['input'])
            response_b = generate_b(test['input'])

            winner, reason = pairwise_comparison(
                question=test['input'],
                response_a=response_a,
                response_b=response_b
            )

            self.results.append({
                'input': test['input'],
                'response_a': response_a,
                'response_b': response_b,
                'winner': winner,
                'reason': reason
            })

        return self.results

    def summary(self) -> dict:
        """Aggregate results."""
        total = len(self.results)
        a_wins = sum(1 for r in self.results if r['winner'] == 'A')
        b_wins = sum(1 for r in self.results if r['winner'] == 'B')
        ties = sum(1 for r in self.results if r['winner'] == 'TIE')

        return {
            'total_comparisons': total,
            'a_wins': a_wins,
            'b_wins': b_wins,
            'ties': ties,
            'a_win_rate': a_wins / total if total > 0 else 0.0,
            'statistical_significance': self._check_significance(a_wins, b_wins)
        }

    def _check_significance(self, a_wins, b_wins):
        """Simple binomial test for statistical significance."""
        from scipy.stats import binomtest  # binom_test was removed in SciPy 1.12
        decisive = a_wins + b_wins  # Exclude ties from the test
        if decisive == 0:
            return False
        # H0: Both equally good (p=0.5)
        p_value = binomtest(max(a_wins, b_wins), decisive, 0.5).pvalue
        return p_value < 0.05  # Significant at 95% confidence
```

---

## Part 7: Production Monitoring

### Continuous Evaluation Pipeline

```python
import logging
from datetime import datetime

class ProductionMonitor:
    """
    Monitor LLM performance in production.
    """

    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate
        self.metrics = []
        self.logger = logging.getLogger(__name__)

    def log_interaction(self, user_input: str, model_output: str, metadata: dict):
        """Log an interaction for evaluation."""
        import random

        # Sample a fraction of traffic for evaluation
        if random.random() < self.sample_rate:
            # Run automated checks
            toxicity = self._check_toxicity(model_output)
            perplexity = compute_perplexity(model_output)

            metric = {
                'timestamp': datetime.now().isoformat(),
                'user_input': user_input,
                'model_output': model_output,
                'toxicity_score': toxicity,
                'perplexity': perplexity,
                'latency_ms': metadata.get('latency_ms'),
                'model_version': metadata.get('model_version')
            }

            self.metrics.append(metric)

            # Alert if anomaly detected
            if toxicity > 0.5:
                self.logger.warning(f"High toxicity detected: {toxicity}")

    def _check_toxicity(self, text: str) -> float:
        """Check for toxic content."""
        from detoxify import Detoxify
        model = Detoxify('original')  # In production, load once and cache
        results = model.predict(text)
        return max(results.values())  # Max toxicity score

    def get_metrics(self) -> dict:
        """Aggregate metrics."""
        if not self.metrics:
            return {}

        latencies = [m['latency_ms'] for m in self.metrics if m.get('latency_ms')]

        return {
            'total_interactions': len(self.metrics),
            'avg_toxicity': sum(m['toxicity_score'] for m in self.metrics) / len(self.metrics),
            'avg_perplexity': sum(m['perplexity'] for m in self.metrics) / len(self.metrics),
            'avg_latency_ms': sum(latencies) / len(latencies) if latencies else 0.0,
            'high_toxicity_rate': sum(1 for m in self.metrics if m['toxicity_score'] > 0.5) / len(self.metrics)
        }
```

---

## Part 8: Best Practices

### Practice 1: Layered Evaluation Strategy

```python
def check_toxicity(text: str) -> float:
    """Max toxicity score via Detoxify (0.0-1.0)."""
    from detoxify import Detoxify
    return max(Detoxify('original').predict(text).values())

# Layer 1: Fast, cheap automated checks
def quick_checks(response: str) -> bool:
    """Run fast automated checks."""
    # Length check
    if len(response) < 10:
        return False

    # Toxicity check
    if check_toxicity(response) > 0.5:
        return False

    # Basic coherence (perplexity)
    if compute_perplexity(response) > 100:
        return False

    return True

# Layer 2: LLM-as-judge (selective)
def llm_evaluation(response: str, criteria: str) -> float:
    """Run LLM evaluation on a subset of traffic."""
    scores = llm_judge_score(response, criteria)  # Criteria text used as the reference here
    numeric = [v for v in scores.values() if isinstance(v, (int, float))]
    return sum(numeric) / len(numeric)  # Average rubric score (skip 'reasoning')

# Layer 3: Human review (expensive, critical cases)
def flag_for_human_review(response: str, confidence: float) -> bool:
    """Determine if human review is needed."""
    return (
        confidence < 0.7 or
        len(response) > 1000 or          # Long responses
        "uncertain" in response.lower()  # Model uncertainty
    )

# Combined pipeline
def evaluate_response(question: str, response: str) -> dict:
    # Layer 1: Quick checks
    if not quick_checks(response):
        return {'status': 'failed_quick_checks', 'human_review': False}

    # Layer 2: LLM judge
    score = llm_evaluation(response, "accuracy and helpfulness")
    confidence = score / 5.0

    # Layer 3: Human review decision
    needs_human = flag_for_human_review(response, confidence)

    return {
        'status': 'passed' if score >= 3.5 else 'failed',
        'score': score,
        'confidence': confidence,
        'human_review': needs_human
    }
```

### Practice 2: Version Your Prompts

```python
import hashlib
from datetime import datetime

class PromptVersion:
    """Track prompt versions for A/B testing and rollback."""

    def __init__(self):
        self.versions = {}
        self.active_version = None

    def register(self, name: str, prompt_template: str, metadata: dict = None):
        """Register a prompt version."""
        version_id = hashlib.md5(prompt_template.encode()).hexdigest()[:8]

        self.versions[version_id] = {
            'name': name,
            'template': prompt_template,
            'metadata': metadata or {},
            'created_at': datetime.now(),
            'metrics': {'total_uses': 0, 'avg_score': 0.0}
        }

        return version_id

    def use(self, version_id: str, **kwargs) -> str:
        """Use a specific prompt version."""
        if version_id not in self.versions:
            raise ValueError(f"Unknown version: {version_id}")

        version = self.versions[version_id]
        version['metrics']['total_uses'] += 1

        return version['template'].format(**kwargs)

    def update_metrics(self, version_id: str, score: float):
        """Update performance metrics for a version."""
        version = self.versions[version_id]
        current_avg = version['metrics']['avg_score']
        total_uses = version['metrics']['total_uses']

        # Running average
        new_avg = ((current_avg * (total_uses - 1)) + score) / total_uses
        version['metrics']['avg_score'] = new_avg

# Usage
pm = PromptVersion()

v1 = pm.register(
    name="question_answering_v1",
    prompt_template="Answer this question: {question}",
    metadata={'author': 'alice', 'date': '2024-01-01'}
)

v2 = pm.register(
    name="question_answering_v2",
    prompt_template="You are a helpful assistant. Answer: {question}",
    metadata={'author': 'bob', 'date': '2024-01-15'}
)

# A/B test
prompt = pm.use(v1, question="What is AI?")  # e.g. route 50% of traffic to v1
response = my_generate(prompt)               # Generation function from Part 6
score = llm_evaluation(response, "accuracy and helpfulness")
pm.update_metrics(v1, score)
```

---

## Quick Decision Trees

### "Which evaluation method should I use?"

```
Have ground truth labels?
  YES → ROUGE, BERTScore, Exact Match
  NO  → LLM-as-judge, Human review

Evaluating factual correctness?
  YES → Grounding check, Factuality verification
  NO  → Subjective quality → LLM-as-judge

Need fast feedback (CI/CD)?
  YES → Binary pass/fail tests
  NO  → Comprehensive multi-metric evaluation

Budget constraints?
  Tight    → Automated metrics only
  Moderate → LLM-as-judge + sampling
  No limit → Human review gold standard
```

### "How to detect hallucinations?"

```
Have source documents (RAG)?
  YES → Grounding check against context
  NO  → Continue

Can verify with search?
  YES → Factuality check with web search
  NO  → Continue

Check model confidence?
  YES → Self-consistency check (multiple samples)
  NO  → Flag for human review
```

---

## Resources

- **ROUGE:** https://github.com/google-research/google-research/tree/master/rouge
- **BERTScore:** https://github.com/Tiiiger/bert_score
- **OpenAI Evals:** https://github.com/openai/evals
- **LangChain Evaluation:** https://python.langchain.com/docs/guides/evaluation/
- **Ragas (RAG eval):** https://github.com/explodinggradients/ragas

---

**Skill version:** 1.0.0
**Last updated:** 2025-10-25
**Maintained by:** Applied Artificial Intelligence