Install: `npx mdskills install applied-artificial-intelligence/llm-evaluation`

Comprehensive evaluation framework with metrics, testing patterns, and statistical validation.
---
name: llm-evaluation
description: LLM evaluation and testing patterns including prompt testing, hallucination detection, benchmark creation, and quality metrics. Use when testing LLM applications, validating prompt quality, implementing systematic evaluation, or measuring LLM performance.
---

# LLM Evaluation & Testing

Comprehensive guide to evaluating and testing LLM applications, covering prompt testing, output validation, hallucination detection, benchmark creation, A/B testing, and quality metrics.

---

## Quick Reference

**When to use this skill:**
- Testing LLM application outputs
- Validating prompt quality and consistency
- Detecting hallucinations and factual errors
- Creating evaluation benchmarks
- A/B testing prompts or models
- Implementing continuous evaluation (CI/CD)
- Measuring retrieval quality (for RAG)
- Debugging unexpected LLM behavior

**Metrics covered:**
- Traditional: BLEU, ROUGE, BERTScore, Perplexity
- LLM-as-Judge: GPT-4 evaluation, rubric-based scoring
- Task-specific: Exact match, F1, accuracy, recall
- Quality: Toxicity, bias, coherence, relevance

---

## Part 1: Evaluation Fundamentals

### The LLM Evaluation Challenge

**Why LLM evaluation is hard:**
1. **Subjective quality** - "Good" output varies by use case
2. **No single ground truth** - Multiple valid answers exist
3. **Context-dependent** - The same output can be good or bad in different scenarios
4. **Expensive to label** - Human evaluation doesn't scale
5. **Adversarial brittleness** - Small prompt changes can cause large output changes

**Solution: Multi-layered evaluation**
```
Layer 1: Automated Metrics (fast, scalable)
        ↓
Layer 2: LLM-as-Judge (flexible, nuanced)
        ↓
Layer 3: Human Review (gold standard, expensive)
```

### Evaluation Dataset Structure

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalExample:
    """Single evaluation example."""
    input: str                      # User input / prompt
    expected_output: Optional[str]  # Gold standard (if one exists)
    context: Optional[str]          # Additional context (for RAG)
    metadata: dict                  # Category, difficulty, etc.

@dataclass
class EvalResult:
    """Evaluation result for one example."""
    example_id: str
    actual_output: str
    scores: dict                    # {'metric_name': score}
    passed: bool
    failure_reason: Optional[str]

# Example dataset
eval_dataset = [
    EvalExample(
        input="What is the capital of France?",
        expected_output="Paris",
        context=None,
        metadata={'category': 'factual', 'difficulty': 'easy'}
    ),
    EvalExample(
        input="Explain quantum entanglement",
        expected_output=None,  # No single correct answer
        context=None,
        metadata={'category': 'explanation', 'difficulty': 'hard'}
    )
]
```

---

## Part 2: Traditional Metrics

### Metric 1: Exact Match (Simplest)

```python
def exact_match(predicted: str, expected: str, case_sensitive: bool = False) -> float:
    """
    Binary metric: 1.0 if match, 0.0 otherwise.

    Use for: Classification, short answers, structured output
    Limitations: Too strict for generation tasks
    """
    predicted = predicted.strip()
    expected = expected.strip()
    if not case_sensitive:
        predicted = predicted.lower()
        expected = expected.lower()

    return 1.0 if predicted == expected else 0.0

# Example
score = exact_match("Paris", "paris")                 # 1.0
score = exact_match("The capital is Paris", "Paris")  # 0.0
```

### Metric 2: ROUGE (Recall-Oriented)

```python
from rouge_score import rouge_scorer

def compute_rouge(predicted: str, expected: str) -> dict:
    """
    ROUGE metrics for text overlap.

    ROUGE-1: Unigram overlap
    ROUGE-2: Bigram overlap
    ROUGE-L: Longest common subsequence

    Use for: Summarization, translation
    Limitations: Doesn't capture semantics
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(expected, predicted)  # score(target, prediction)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Example
scores = compute_rouge(
    predicted="Paris is the capital of France",
    expected="The capital of France is Paris"
)
# {'rouge1': 0.82, 'rouge2': 0.67, 'rougeL': 0.82}
```

### Metric 3: BERTScore (Semantic Similarity)

```python
from typing import List

from bert_score import score as bert_score

def compute_bertscore(predicted: List[str], expected: List[str]) -> dict:
    """
    Semantic similarity using BERT embeddings.

    Better than ROUGE for:
    - Paraphrases
    - Semantic equivalence
    - Generation quality

    Returns: Precision, Recall, F1
    """
    P, R, F1 = bert_score(predicted, expected, lang="en", verbose=False)

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }

# Example
scores = compute_bertscore(
    predicted=["The capital of France is Paris"],
    expected=["Paris is France's capital city"]
)
# {'precision': 0.94, 'recall': 0.91, 'f1': 0.92}
```

### Metric 4: Perplexity (Fluency)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def compute_perplexity(text: str, model_name: str = "gpt2") -> float:
    """
    Perplexity: How "surprised" is the model by this text?

    Lower = More likely/fluent
    Use for: Fluency, naturalness
    Limitations: Doesn't measure correctness
    """
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    perplexity = torch.exp(loss).item()
    return perplexity

# Example
ppl = compute_perplexity("Paris is the capital of France")   # Low (fluent)
ppl2 = compute_perplexity("Capital France the is Paris of")  # High (awkward)
```

---

## Part 3: LLM-as-Judge Evaluation

### Pattern 1: Rubric-Based Scoring

```python
import json

from openai import OpenAI

client = OpenAI()

EVALUATION_PROMPT = """
You are an expert evaluator. Score the assistant's response on a scale of 1-5 for each criterion:

**Criteria:**
1. **Accuracy**: Is the information factually correct?
2. **Completeness**: Does it fully answer the question?
3. **Clarity**: Is it easy to understand?
4. **Conciseness**: Is it appropriately brief?

**Response to evaluate:**
{response}

**Expected answer (reference):**
{expected}

Provide scores in JSON format:
{{
    "accuracy": <1-5>,
    "completeness": <1-5>,
    "clarity": <1-5>,
    "conciseness": <1-5>,
    "reasoning": "Brief explanation"
}}
"""

def llm_judge_score(response: str, expected: str) -> dict:
    """
    Use an LLM as judge with rubric scoring.

    Pros: Flexible, nuanced, scales well
    Cons: Costs money, potential bias, slower
    """
    prompt = EVALUATION_PROMPT.format(response=response, expected=expected)

    completion = client.chat.completions.create(
        model="gpt-4o",  # JSON mode requires a JSON-mode-capable model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    scores = json.loads(completion.choices[0].message.content)
    return scores

# Example
scores = llm_judge_score(
    response="Paris is the capital of France, located in the north-central part of the country.",
    expected="Paris"
)
# {'accuracy': 5, 'completeness': 5, 'clarity': 5, 'conciseness': 3, 'reasoning': '...'}
```

### Pattern 2: Binary Pass/Fail Evaluation

```python
PASS_FAIL_PROMPT = """
Evaluate if the assistant's response is acceptable.

**Question:** {question}
**Response:** {response}
**Criteria:** {criteria}

Return ONLY "PASS" or "FAIL" followed by a one-sentence reason.
"""

def binary_eval(question: str, response: str, criteria: str) -> tuple[bool, str]:
    """
    Simple pass/fail evaluation.

    Use for: Unit tests, regression tests, CI/CD
    """
    prompt = PASS_FAIL_PROMPT.format(
        question=question,
        response=response,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0  # Minimize randomness (not fully deterministic)
    )

    result = completion.choices[0].message.content
    passed = result.startswith("PASS")
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return passed, reason

# Example
passed, reason = binary_eval(
    question="What is the capital of France?",
    response="The capital is Paris",
    criteria="Response must mention Paris"
)
# (True, "Response correctly identifies Paris as the capital")
```

### Pattern 3: Pairwise Comparison (A/B Testing)

```python
PAIRWISE_PROMPT = """
Compare two responses to the same question. Which is better?

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

**Criteria:** {criteria}

Return ONLY: "A", "B", or "TIE", followed by a one-sentence explanation.
"""

def pairwise_comparison(
    question: str,
    response_a: str,
    response_b: str,
    criteria: str = "Overall quality, accuracy, and helpfulness"
) -> tuple[str, str]:
    """
    A/B test two responses.

    Use for: Prompt engineering, model comparison
    """
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        response_a=response_a,
        response_b=response_b,
        criteria=criteria
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )

    result = completion.choices[0].message.content
    winner = result.split()[0].strip(".,:")  # "A", "B", or "TIE"
    reason = result.split(":", 1)[1].strip() if ":" in result else result

    return winner, reason

# Example
winner, reason = pairwise_comparison(
    question="Explain quantum computing",
    response_a="Quantum computers use qubits instead of bits...",
    response_b="Quantum computing is complex. It uses quantum mechanics."
)
# ("A", "Response A provides more detail and explanation")
```

---

## Part 4: Hallucination Detection

### Method 1: Grounding Check

```python
def check_grounding(response: str, context: str) -> dict:
    """
    Verify the response is grounded in the provided context.

    Critical for RAG systems.
    """
    GROUNDING_PROMPT = """
    Context: {context}

    Response: {response}

    Is the response fully supported by the context? Answer with:
    - "GROUNDED": All claims supported
    - "PARTIALLY_GROUNDED": Some claims unsupported
    - "NOT_GROUNDED": Contains unsupported claims

    List any unsupported claims.
    """

    prompt = GROUNDING_PROMPT.format(context=context, response=response)

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    result = completion.choices[0].message.content
    status = result.split("\n")[0].strip()
    unsupported = [line for line in result.split("\n")[1:] if line.strip()]

    return {
        'grounding_status': status,
        'unsupported_claims': unsupported,
        'is_hallucination': status != "GROUNDED"
    }
```

### Method 2: Factuality Check (External Verification)

```python
def check_factuality(claim: str, use_search: bool = True) -> dict:
    """
    Verify factual claims using external sources.

    Options:
    1. Web search + verification
    2. Knowledge base lookup
    3. Cross-reference with a trusted source
    """
    if not use_search:
        raise NotImplementedError("Only web-search verification is implemented here")

    # Use web search to gather evidence
    from tavily import TavilyClient
    tavily = TavilyClient(api_key="your-key")

    # Search for evidence (the response dict contains a 'results' list)
    results = tavily.search(claim, max_results=3)["results"]

    # Ask the LLM to verify the claim against the search results
    VERIFY_PROMPT = """
    Claim: {claim}

    Search results:
    {results}

    Is the claim supported by these sources? Answer: TRUE, FALSE, or UNCERTAIN.
    Explanation:
    """

    prompt = VERIFY_PROMPT.format(
        claim=claim,
        results="\n\n".join([r['content'] for r in results])
    )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    result = completion.choices[0].message.content
    is_factual = result.strip().startswith("TRUE")

    return {
        'claim': claim,
        'factual': is_factual,
        'evidence': results,
        'explanation': result
    }
```

### Method 3: Self-Consistency Check

```python
def self_consistency_check(question: str, num_samples: int = 5) -> dict:
    """
    Generate multiple responses and check for consistency.

    If the model is confident, responses should be consistent.
    Inconsistency suggests hallucination risk.
    """
    responses = []

    for _ in range(num_samples):
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0.7  # Some randomness
        )
        responses.append(completion.choices[0].message.content)

    # Compute pairwise similarity
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(responses)
    similarities = cosine_similarity(vectors)

    # Average pairwise similarity (exclude the diagonal of self-similarities)
    n = len(responses)
    avg_similarity = (similarities.sum() - n) / (n * (n - 1))

    return {
        'responses': responses,
        'avg_similarity': avg_similarity,
        'is_consistent': avg_similarity > 0.7,  # Threshold
        'confidence': 'high' if avg_similarity > 0.85 else 'medium' if avg_similarity > 0.7 else 'low'
    }
```

---

## Part 5: RAG-Specific Evaluation

### Retrieval Quality Metrics

```python
def evaluate_retrieval(query: str, retrieved_docs: List[dict], relevant_doc_ids: List[str]) -> dict:
    """
    Evaluate retrieval quality using IR metrics.

    Precision: What % of retrieved docs are relevant?
    Recall: What % of relevant docs were retrieved?
    MRR: Mean Reciprocal Rank
    NDCG: Normalized Discounted Cumulative Gain (not computed in this sketch)
    """
    retrieved_ids = [doc['id'] for doc in retrieved_docs]

    # Precision
    true_positives = len(set(retrieved_ids) & set(relevant_doc_ids))
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0

    # Recall
    recall = true_positives / len(relevant_doc_ids) if relevant_doc_ids else 0.0

    # F1
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    # MRR (Mean Reciprocal Rank): reciprocal rank of the first relevant doc
    mrr = 0.0
    for i, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_doc_ids:
            mrr = 1.0 / i
            break

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'mrr': mrr,
        'num_retrieved': len(retrieved_ids),
        'num_relevant_retrieved': true_positives
    }
```

### End-to-End RAG Evaluation

```python
def evaluate_rag_pipeline(
    question: str,
    generated_answer: str,
    retrieved_docs: List[dict],
    ground_truth: str,
    relevant_doc_ids: List[str]
) -> dict:
    """
    Comprehensive RAG evaluation.

    1. Retrieval quality (precision, recall)
    2. Answer quality (ROUGE, BERTScore)
    3. Answer grounding (hallucination check)
    4. Citation accuracy
    """
    # 1. Retrieval metrics
    retrieval_scores = evaluate_retrieval(question, retrieved_docs, relevant_doc_ids)

    # 2. Answer quality
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])

    rouge_scores = compute_rouge(generated_answer, ground_truth)
    bert_scores = compute_bertscore([generated_answer], [ground_truth])

    # 3. Grounding check
    grounding = check_grounding(generated_answer, context)

    # 4. LLM-as-judge overall quality
    judge_scores = llm_judge_score(generated_answer, ground_truth)

    return {
        'retrieval': retrieval_scores,
        'answer_quality': {
            'rouge': rouge_scores,
            'bertscore': bert_scores
        },
        'grounding': grounding,
        'llm_judge': judge_scores,
        'overall_pass': (
            retrieval_scores['f1'] > 0.5 and
            grounding['grounding_status'] == "GROUNDED" and
            judge_scores['accuracy'] >= 4
        )
    }
```

---

## Part 6: Prompt Testing Frameworks

### Framework 1: Regression Test Suite

```python
class PromptTestSuite:
    """
    Unit tests for prompts (like pytest for LLMs).
    """

    def __init__(self):
        self.tests = []
        self.results = []

    def add_test(self, name: str, input: str, criteria: str):
        """Add a test case."""
        self.tests.append({
            'name': name,
            'input': input,
            'criteria': criteria
        })

    def run(self, generate_fn):
        """Run all tests with the given generation function."""
        for test in self.tests:
            response = generate_fn(test['input'])
            passed, reason = binary_eval(
                question=test['input'],
                response=response,
                criteria=test['criteria']
            )

            self.results.append({
                'test_name': test['name'],
                'passed': passed,
                'reason': reason,
                'response': response
            })

        return self.results

    def summary(self) -> dict:
        """Get test summary."""
        total = len(self.results)
        passed = sum(1 for r in self.results if r['passed'])

        return {
            'total_tests': total,
            'passed': passed,
            'failed': total - passed,
            'pass_rate': passed / total if total > 0 else 0.0
        }

# Usage
suite = PromptTestSuite()
suite.add_test("capital_france", "What is the capital of France?", "Must mention Paris")
suite.add_test("capital_germany", "What is the capital of Germany?", "Must mention Berlin")

def my_generate(prompt):
    # Your LLM call
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

results = suite.run(my_generate)
print(suite.summary())
# {'total_tests': 2, 'passed': 2, 'failed': 0, 'pass_rate': 1.0}
```

### Framework 2: A/B Testing Framework

```python
class ABTest:
    """
    A/B test prompts or models.
    """

    def __init__(self, test_cases: List[dict]):
        self.test_cases = test_cases
        self.results = []

    def run(self, generate_a, generate_b):
        """Compare two generation functions."""
        for test in self.test_cases:
            response_a = generate_a(test['input'])
            response_b = generate_b(test['input'])

            winner, reason = pairwise_comparison(
                question=test['input'],
                response_a=response_a,
                response_b=response_b
            )

            self.results.append({
                'input': test['input'],
                'response_a': response_a,
                'response_b': response_b,
                'winner': winner,
                'reason': reason
            })

        return self.results

    def summary(self) -> dict:
        """Aggregate results."""
        total = len(self.results)
        a_wins = sum(1 for r in self.results if r['winner'] == 'A')
        b_wins = sum(1 for r in self.results if r['winner'] == 'B')
        ties = sum(1 for r in self.results if r['winner'] == 'TIE')

        return {
            'total_comparisons': total,
            'a_wins': a_wins,
            'b_wins': b_wins,
            'ties': ties,
            'a_win_rate': a_wins / total if total > 0 else 0.0,
            'statistical_significance': self._check_significance(a_wins, b_wins)
        }

    def _check_significance(self, a_wins, b_wins):
        """Simple binomial test for statistical significance."""
        from scipy.stats import binomtest  # binom_test was removed in SciPy 1.12
        decisive = a_wins + b_wins  # Exclude ties from the test
        if decisive == 0:
            return False
        # H0: Both equally good (p=0.5)
        p_value = binomtest(max(a_wins, b_wins), decisive, 0.5).pvalue
        return p_value < 0.05  # Significant at 95% confidence
```

---

## Part 7: Production Monitoring

### Continuous Evaluation Pipeline

```python
import logging
from datetime import datetime

class ProductionMonitor:
    """
    Monitor LLM performance in production.
    """

    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate
        self.metrics = []
        self.logger = logging.getLogger(__name__)

    def log_interaction(self, user_input: str, model_output: str, metadata: dict):
        """Log an interaction for evaluation."""
        import random

        # Sample a fraction of traffic for evaluation
        if random.random() < self.sample_rate:
            # Run automated checks
            toxicity = self._check_toxicity(model_output)
            perplexity = compute_perplexity(model_output)

            metric = {
                'timestamp': datetime.now().isoformat(),
                'user_input': user_input,
                'model_output': model_output,
                'toxicity_score': toxicity,
                'perplexity': perplexity,
                'latency_ms': metadata.get('latency_ms'),
                'model_version': metadata.get('model_version')
            }

            self.metrics.append(metric)

            # Alert if anomaly detected
            if toxicity > 0.5:
                self.logger.warning(f"High toxicity detected: {toxicity}")

    def _check_toxicity(self, text: str) -> float:
        """Check for toxic content."""
        from detoxify import Detoxify
        model = Detoxify('original')  # In production, load once and cache
        results = model.predict(text)
        return max(results.values())  # Max toxicity score

    def get_metrics(self) -> dict:
        """Aggregate metrics."""
        if not self.metrics:
            return {}

        latencies = [m['latency_ms'] for m in self.metrics if m.get('latency_ms')]

        return {
            'total_interactions': len(self.metrics),
            'avg_toxicity': sum(m['toxicity_score'] for m in self.metrics) / len(self.metrics),
            'avg_perplexity': sum(m['perplexity'] for m in self.metrics) / len(self.metrics),
            'avg_latency_ms': sum(latencies) / len(latencies) if latencies else 0.0,
            'high_toxicity_rate': sum(1 for m in self.metrics if m['toxicity_score'] > 0.5) / len(self.metrics)
        }
```

---

## Part 8: Best Practices

### Practice 1: Layered Evaluation Strategy

```python
def check_toxicity(text: str) -> float:
    """Max toxicity score via Detoxify (0.0-1.0)."""
    from detoxify import Detoxify
    return max(Detoxify('original').predict(text).values())

# Layer 1: Fast, cheap automated checks
def quick_checks(response: str) -> bool:
    """Run fast automated checks."""
    # Length check
    if len(response) < 10:
        return False

    # Toxicity check
    if check_toxicity(response) > 0.5:
        return False

    # Basic coherence (perplexity)
    if compute_perplexity(response) > 100:
        return False

    return True

# Layer 2: LLM-as-judge (selective)
def llm_evaluation(response: str, criteria: str) -> float:
    """Run LLM evaluation on a subset of traffic."""
    scores = llm_judge_score(response, criteria)  # Criteria text used as the reference here
    numeric = [v for v in scores.values() if isinstance(v, (int, float))]
    return sum(numeric) / len(numeric)  # Average rubric score (skip 'reasoning')

# Layer 3: Human review (expensive, critical cases)
def flag_for_human_review(response: str, confidence: float) -> bool:
    """Determine if human review is needed."""
    return (
        confidence < 0.7 or
        len(response) > 1000 or          # Long responses
        "uncertain" in response.lower()  # Model uncertainty
    )

# Combined pipeline
def evaluate_response(question: str, response: str) -> dict:
    # Layer 1: Quick checks
    if not quick_checks(response):
        return {'status': 'failed_quick_checks', 'human_review': False}

    # Layer 2: LLM judge
    score = llm_evaluation(response, "accuracy and helpfulness")
    confidence = score / 5.0

    # Layer 3: Human review decision
    needs_human = flag_for_human_review(response, confidence)

    return {
        'status': 'passed' if score >= 3.5 else 'failed',
        'score': score,
        'confidence': confidence,
        'human_review': needs_human
    }
```

### Practice 2: Version Your Prompts

```python
import hashlib
from datetime import datetime

class PromptVersion:
    """Track prompt versions for A/B testing and rollback."""

    def __init__(self):
        self.versions = {}
        self.active_version = None

    def register(self, name: str, prompt_template: str, metadata: dict = None):
        """Register a prompt version."""
        version_id = hashlib.md5(prompt_template.encode()).hexdigest()[:8]

        self.versions[version_id] = {
            'name': name,
            'template': prompt_template,
            'metadata': metadata or {},
            'created_at': datetime.now(),
            'metrics': {'total_uses': 0, 'avg_score': 0.0}
        }

        return version_id

    def use(self, version_id: str, **kwargs) -> str:
        """Use a specific prompt version."""
        if version_id not in self.versions:
            raise ValueError(f"Unknown version: {version_id}")

        version = self.versions[version_id]
        version['metrics']['total_uses'] += 1

        return version['template'].format(**kwargs)

    def update_metrics(self, version_id: str, score: float):
        """Update performance metrics for a version."""
        version = self.versions[version_id]
        current_avg = version['metrics']['avg_score']
        total_uses = version['metrics']['total_uses']

        # Running average
        new_avg = ((current_avg * (total_uses - 1)) + score) / total_uses
        version['metrics']['avg_score'] = new_avg

# Usage
pm = PromptVersion()

v1 = pm.register(
    name="question_answering_v1",
    prompt_template="Answer this question: {question}",
    metadata={'author': 'alice', 'date': '2024-01-01'}
)

v2 = pm.register(
    name="question_answering_v2",
    prompt_template="You are a helpful assistant. Answer: {question}",
    metadata={'author': 'bob', 'date': '2024-01-15'}
)

# A/B test
prompt = pm.use(v1, question="What is AI?")  # e.g. route 50% of traffic to v1
response = my_generate(prompt)               # Generation function from Part 6
score = llm_evaluation(response, "accuracy and helpfulness")
pm.update_metrics(v1, score)
```

---

## Quick Decision Trees

### "Which evaluation method should I use?"

```
Have ground truth labels?
  YES → ROUGE, BERTScore, Exact Match
  NO  → LLM-as-judge, Human review

Evaluating factual correctness?
  YES → Grounding check, Factuality verification
  NO  → Subjective quality → LLM-as-judge

Need fast feedback (CI/CD)?
  YES → Binary pass/fail tests
  NO  → Comprehensive multi-metric evaluation

Budget constraints?
  Tight    → Automated metrics only
  Moderate → LLM-as-judge + sampling
  No limit → Human review gold standard
```

### "How to detect hallucinations?"

```
Have source documents (RAG)?
  YES → Grounding check against context
  NO  → Continue

Can verify with search?
  YES → Factuality check with web search
  NO  → Continue

Check model confidence?
  YES → Self-consistency check (multiple samples)
  NO  → Flag for human review
```

---

## Resources

- **ROUGE:** https://github.com/google-research/google-research/tree/master/rouge
- **BERTScore:** https://github.com/Tiiiger/bert_score
- **OpenAI Evals:** https://github.com/openai/evals
- **LangChain Evaluation:** https://python.langchain.com/docs/guides/evaluation/
- **Ragas (RAG eval):** https://github.com/explodinggradients/ragas

---

**Skill version:** 1.0.0
**Last updated:** 2025-10-25
**Maintained by:** Applied Artificial Intelligence