Testing and benchmarking LLM agents, covering behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
Add this skill
npx mdskills install sickn33/agent-evaluation

Identifies critical patterns for LLM agent evaluation but lacks actionable instructions.