How do I install Agent Evaluation?

Install Agent Evaluation with a single command: npx mdskills install sickn33/agent-evaluation. This downloads the skill files into your project and your AI agent picks them up automatically.

What platforms support Agent Evaluation?

Agent Evaluation works with Claude Code, Claude Desktop, Cursor, Vscode Copilot, Windsurf, Continue Dev, Codex, Gemini Cli, Amp, Roo Code, Goose, Opencode, Trae, Qodo, Command Code. Skills use the open SKILL.md format which is compatible with any AI coding agent that reads markdown instructions.

← Back to skills

Agent Evaluation

Name: Agent Evaluation: AI Agent Skill
Rating: 4 (1 reviews)
Author: sickn33

Testing & QAIntermediate

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

by @sickn330Updated 2/20/2026

Add this skill

npx mdskills install sickn33/agent-evaluation

Fork & Edit

Skill Advisor4.0

Identifies critical patterns for LLM agent evaluation but lacks actionable instructions

+Identifies key anti-patterns like single-run testing and output string matching
+Highlights critical sharp edges like benchmark-production gaps and data leakage
-Provides no executable instructions or procedures for conducting evaluations
-Declares filesystem write permissions without apparent justification

SKILL.md

Edit in Browser

1---
2name: agent-evaluation
3description: "Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent."
4source: vibeship-spawner-skills (Apache 2.0)
5---
6 
7# Agent Evaluation
8 
9You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
10production. You've learned that evaluating LLM agents is fundamentally different from
11testing traditional software—the same input can produce different outputs, and "correct"
12often has no single answer.
13 
14You've built evaluation frameworks that catch issues before production: behavioral regression
15tests, capability assessments, and reliability metrics. You understand that the goal isn't
16100% test pass rate—it
17 
18## Capabilities
19 
20- agent-testing
21- benchmark-design
22- capability-assessment
23- reliability-metrics
24- regression-testing
25 
26## Requirements
27 
28- testing-fundamentals
29- llm-fundamentals
30 
31## Patterns
32 
33### Statistical Test Evaluation
34 
35Run tests multiple times and analyze result distributions
36 
37### Behavioral Contract Testing
38 
39Define and test agent behavioral invariants
40 
41### Adversarial Testing
42 
43Actively try to break agent behavior
44 
45## Anti-Patterns
46 
47### ❌ Single-Run Testing
48 
49### ❌ Only Happy Path Tests
50 
51### ❌ Output String Matching
52 
53## ⚠️ Sharp Edges
54 
55| Issue | Severity | Solution |
56|-------|----------|----------|
57| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
58| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
59| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
60| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
61 
62## Related Skills
63 
64Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`
65

Full transparency — inspect the skill content before installing.