A feedback loop for people building AI skills and MCP servers. You're building a skill, an MCP server, or a custom prompt strategy that's supposed to make an AI coding assistant better at a specific job. But how do you know it actually works? How do you know your latest commit made things better and not worse? Pitlane gives you the answer: define the tasks your skill should help with, set up a baseline without your skill and a challenger with it, and measure the difference.
Add this skill:

```bash
npx mdskills install pitlane-ai/testing-with-pitlane
```

A comprehensive eval design guide with actionable setup, assertion strategy, and clear anti-patterns.
---
name: testing-with-pitlane
description: >
  Design and create pitlane eval benchmarks that measure whether an AI coding
  skill or MCP server actually improves assistant performance. Use when the user
  wants to test a skill, evaluate an MCP server, create a pitlane eval YAML,
  benchmark an AI assistant, or compare baseline vs challenger configurations.
  Covers eval design, assertion strategy, fixture setup, and result interpretation.
category: testing
tags:
  - pitlane
  - evaluation
  - benchmarking
  - skills
  - mcp
---

# Testing skills and MCP servers with pitlane

Help users design pitlane eval benchmarks that answer one question: "Does my skill or MCP server actually make the assistant better?"

## What pitlane is

Pitlane is an eval harness for AI coding assistants. You define coding tasks in a YAML file, run them against a baseline assistant (without your skill) and a challenger (with your skill), and pitlane tells you which one did better. It tracks pass rates, quality scores, time, tokens, and cost.

Think of it as A/B testing for skills and MCP servers.

## Setup

Install pitlane if the user doesn't have it yet:

```bash
# install the pitlane CLI
uv tool install pitlane --from git+https://github.com/pitlane-ai/pitlane.git
pitlane <command>
# or run it without installing, via uvx
uvx --from git+https://github.com/pitlane-ai/pitlane.git pitlane <command>
```

The user also needs at least one AI coding assistant CLI installed. Pitlane supports four assistants:

| Type | CLI | Cheap model for iteration |
|------|-----|---------------------------|
| `claude-code` | `claude` | `haiku` |
| `mistral-vibe` | `vibe` | `devstral-small` |
| `opencode` | `opencode` | `minimax-m2.5-free` (free) |
| `bob` | `bob` | N/A |

Scaffold a new eval project with `pitlane init`.
This creates an `eval.yaml` and a `fixtures/empty/` directory to get started.

## How an eval YAML works

An eval file has two sections: `assistants` (who runs the tasks) and `tasks` (what they do and how to check the results).

`pitlane schema generate` prints the schema for the eval file. Use it to discover advanced capabilities.

```yaml
assistants:
  baseline:
    type: claude-code  # or mistral-vibe, opencode
    args:
      model: haiku
  with-skill:
    type: claude-code
    args:
      model: haiku
    skills:
      - source: org/repo   # GitHub org/repo for the skill
        skill: skill-name  # optional, only if the repo has multiple skills

tasks:
  - name: my-task
    prompt: "What the assistant should do"
    workdir: ./fixtures/my-task  # copied fresh for each run
    timeout: 300
    assertions:
      - file_exists: "output.py"
      - command_succeeds: "python output.py"
      - file_contains: { path: "output.py", pattern: "def main" }
```

For MCP servers instead of skills, pass the config through assistant args:

```yaml
# Claude Code
with-mcp:
  type: claude-code
  args:
    model: haiku
    mcp_config: ./mcp-config.json

# Mistral Vibe
with-mcp:
  type: mistral-vibe
  args:
    model: devstral-small
    mcp_servers:
      - name: my-server
        url: http://localhost:3000
```

### Assertions

Deterministic (prefer these):

- `file_exists: "path"` -- does the file exist?
- `file_contains: { path: "file", pattern: "regex" }` -- does it match a regex?
- `command_succeeds: "cmd"` -- does the command exit 0?
- `command_fails: "cmd"` -- does it exit non-zero?
- `custom_script: "python validate.py"` -- run a validation script (also supports an advanced form with `script`, `args`, `timeout`, and `expected_exit_code`)

Similarity (requires `pip install pitlane[similarity]`):

- `rouge: { actual: "file", expected: "./refs/golden.md", metric: "rougeL", min_score: 0.35 }` -- topic coverage; fast; good for docs
- `bleu: { actual: "file", expected: "./refs/golden.md", min_score: 0.15 }` -- phrase matching; fast; good for docs but bad for code
- `bertscore: { actual: "file", expected: "./refs/golden.md", min_score: 0.75 }` -- semantic similarity; slow; works for docs and code
- `cosine_similarity: { actual: "file", expected: "./refs/golden.tf", min_score: 0.7 }` -- overall meaning; slow; best for code and configs

Any assertion can take a `weight` (default 1.0) to make it count more in the `weighted_score`.

### Running

```bash
pitlane run eval.yaml                            # run everything
pitlane run eval.yaml --task my-task             # one task only
pitlane run eval.yaml --only-assistants baseline # run only these assistants (comma-separated)
pitlane run eval.yaml --skip-assistants baseline # skip these assistants (comma-separated)
pitlane run eval.yaml --parallel 4               # run tasks in parallel
pitlane run eval.yaml --repeat 5                 # repeat for statistical confidence
pitlane run eval.yaml --verbose                  # stream debug output
```

These options can be combined, e.g. `pitlane run eval.yaml --task my-task --only-assistants baseline --repeat 5 --parallel 4`.

Results go to `runs/<timestamp>/` with `report.html` (side-by-side comparison), `results.json`, and `debug.log`.

## Eval design

This is the hard part. The syntax above is straightforward; designing evals that produce real signal is not.

Before writing any YAML, work through these questions with the user.

### What is the knowledge delta?

The skill or MCP server gives the assistant something it doesn't have by default. The eval has to target that gap directly.

- What can the assistant do WITH the skill that it can't do WITHOUT? Test that.
- What's the smallest task where the skill makes a difference? Start there, not with a complex integration test.
- Would a strong model pass this task anyway? If so, the task measures the model, not the skill.

### Isolate the variable

The only difference between baseline and challenger should be the skill or MCP server.
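Concretely, an isolated pair reuses the same `type` and `args`, with the `skills` block as the only delta (a sketch reusing the `org/repo` placeholder from the earlier example):

```yaml
assistants:
  baseline:
    type: claude-code
    args:
      model: haiku
  with-skill:
    type: claude-code   # same type
    args:
      model: haiku      # same model; the skills block is the only difference
    skills:
      - source: org/repo
```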
Everything else stays identical: same model, same prompt (word for word), same fixture directory, same timeout.

If the challenger prompt says "use the MCP tool to...", you're testing prompt engineering, not the skill.

### Start cheap, validate expensive

Use a cheaper or weaker model during development. Weaker models amplify the skill's effect because they struggle more without help, which makes the delta easier to see. Switch to a stronger model for final benchmarks to confirm the skill helps capable models too.

## Assertion strategy

### Layer your assertions

Design assertions in rough order of importance:

1. Did the assistant produce the expected files? (`file_exists`)
2. Does the output actually work? (`command_succeeds`, `file_contains`)
3. How close is it to a known-good solution? (similarity metrics, `custom_script`)

Weight them accordingly. A passing test suite (`command_succeeds` at weight 3.0) matters more than a README existing (`file_exists` at weight 1.0).

### Prefer deterministic assertions

Similarity metrics are tempting but noisy. Reach for deterministic assertions first:

- Instead of `cosine_similarity` on `main.tf`, use `file_contains` to check for specific module sources.
- Instead of `rouge` on a README, use `file_contains` to check that key sections exist.
- Save similarity for genuinely open-ended outputs like free-form documentation.

When you do use similarity, set `min_score` conservatively. Scores vary between runs. If your threshold is tight, `--repeat 5` will show you the variance.

### Custom scripts for complex validation

When `file_contains` and `command_succeeds` aren't expressive enough, write a custom script.
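For example, a `custom_script` target might look like the sketch below (the `deploy.json` file name and the fields it checks are illustrative, not part of pitlane; only the exit-code contract matters):

```python
# validate.py -- example custom_script target.
# Exit code 0 means the assertion passes; non-zero means it fails.
import json
import sys
from pathlib import Path


def check(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is valid."""
    problems = []
    # multi-step validation: JSON already parsed, now check field constraints
    if not isinstance(config.get("replicas"), int) or config.get("replicas", 0) < 1:
        problems.append("'replicas' must be a positive integer")
    # conditional logic: if TLS is enabled, a cert path must be present
    tls = config.get("tls", {})
    if tls.get("enabled") and not tls.get("cert"):
        problems.append("tls.enabled is true but tls.cert is missing")
    return problems


if __name__ == "__main__":
    problems = check(json.loads(Path("deploy.json").read_text()))
    for p in problems:
        print(f"FAIL: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Wire it in with `custom_script: "python validate.py"`, as in the assertions list above.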
Place it in the fixture directory (it gets copied to the workspace) or reference it with a relative path.

Good candidates: multi-step validation (parse JSON, check field relationships), domain-specific checks (API response format, dependency graphs), or conditional logic (if file A exists, B must contain X).

## NEVER

- NEVER write prompts that hint at the skill's existence. The prompt describes the task, not how to solve it. If your prompt says "use module X from registry Y", you're testing instruction-following, not the skill.
- NEVER test with only one task. A single task can pass or fail for unrelated reasons. Use 3+ tasks at different difficulty levels.
- NEVER skip the baseline. Without one, you can't attribute results to the skill. "It passes" means nothing if it also passes without the skill.
- NEVER use tight similarity thresholds without `--repeat`. A `min_score` of 0.7 that passes once and fails twice is not a passing assertion. Run at least 3 repeats for similarity-based evals.
- NEVER put golden references in the fixture root. They go in `refs/`, which pitlane excludes from the workspace. If the assistant can see the expected output, the eval is meaningless.
- NEVER cram everything into one mega-task. "Create a full application with auth, database, API, and tests" measures general coding ability, not your skill. Split into focused tasks that isolate what the skill improves.
- NEVER assume `command_succeeds` means the code is correct. `python main.py` exiting 0 says nothing about output correctness. Combine with `file_contains` or pipe output through a validation script.
- NEVER set timeouts too tight during development. Start with 300-600s. Tighten after you know the baseline completion time.
A timeout failure tells you nothing about skill quality.

## Fixture design

### Empty vs pre-populated

Use an empty fixture (`fixtures/empty/` with a `.gitkeep`) when your skill helps with greenfield creation like scaffolding or boilerplate. Use a pre-populated fixture when the skill helps with modification or enhancement of existing code. Seed it with realistic starter files.

### The refs/ directory

Golden references live in a `refs/` subdirectory inside each fixture. Pitlane excludes `refs/` when copying the fixture to the workspace, so the assistant never sees them. They're used as targets for similarity assertions (`expected: "./refs/expected-main.tf"`) and as documentation of what "good" looks like for human reviewers.

Write reference files by hand or curate them from known-good outputs. Don't generate them with the same model you're testing.

## Interpreting results

### Metrics that matter

- `assertion_pass_rate` compares baseline vs challenger directly. If both hit 100%, the tasks are too easy. If the baseline is already at 80%+, the skill has less room to prove itself.
- `weighted_score` is more nuanced than `assertion_pass_rate` when you're using weights and similarity metrics. Compare the delta between baseline and challenger.
- `cost_usd` and `wall_clock_seconds` track the trade-off. A skill that improves quality at 3x the cost may not be worth it.

### Reading the results

| Signal | What it means |
|--------|---------------|
| Baseline low, challenger high | The skill is working |
| Both high | Tasks are too easy; add harder ones |
| Both low | Tasks may be too hard, or the skill needs work |
| Baseline high, challenger low | The skill is hurting; investigate |
| High variance across `--repeat` runs | Assertions or prompts need tightening |

### The iteration loop

This is TDD applied to skills:

1. Write tasks and assertions the skill should help with. Run the baseline.
It should struggle.
2. Run with the skill. If it doesn't improve, the skill needs work, not the eval.
3. Tighten assertions, add edge cases, increase difficulty. Repeat.

If the baseline already passes everything, the eval isn't testing the skill's value. Go back to "what is the knowledge delta?"
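One pass through this loop, using only the flags from the Running section:

```bash
# 1. run the baseline alone; it should struggle
pitlane run eval.yaml --only-assistants baseline

# 2. run the full comparison, baseline vs challenger
pitlane run eval.yaml

# 3. after tightening assertions, check that the delta is stable
pitlane run eval.yaml --repeat 5 --parallel 4
```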
Full transparency — inspect the skill content before installing.