A feedback loop for people building AI skills and MCP servers. You're building a skill, an MCP server, or a custom prompt strategy that's supposed to make an AI coding assistant better at a specific job. But how do you know it actually works? How do you know your latest commit made things better and not worse? Pitlane gives you the answer: define the tasks your skill should help with, set up a baseline without your skill and a challenger with it, and measure the difference.
Add this skill:

```bash
npx mdskills install pitlane-ai/testing-with-pitlane
```

A comprehensive eval design guide with actionable setup, assertion strategy, and clear anti-patterns.
---
name: testing-with-pitlane
description: >
  Design and create pitlane eval benchmarks that measure whether an AI coding
  skill or MCP server actually improves assistant performance. Use when the user
  wants to test a skill, evaluate an MCP server, create a pitlane eval YAML,
  benchmark an AI assistant, or compare baseline vs challenger configurations.
  Covers eval design, assertion strategy, fixture setup, and result interpretation.
category: testing
tags:
  - pitlane
  - evaluation
  - benchmarking
  - skills
  - mcp
---

# Testing skills and MCP servers with pitlane

Help users design pitlane eval benchmarks that answer one question: "Does my skill or MCP server actually make the assistant better?"

## What pitlane is

Pitlane is an eval harness for AI coding assistants. You define coding tasks in a YAML file, run them against a baseline assistant (without your skill) and a challenger (with your skill), and pitlane tells you which one did better. It tracks pass rates, quality scores, time, tokens, and cost.

Think of it as A/B testing for skills and MCP servers.

## Setup

Install pitlane if the user doesn't have it yet:

```bash
# install the pitlane CLI
uv tool install pitlane --from git+https://github.com/pitlane-ai/pitlane.git
pitlane <command>
# or run it without installing, via uvx
uvx --from git+https://github.com/pitlane-ai/pitlane.git pitlane <command>
```

The user also needs at least one AI coding assistant CLI installed. Pitlane supports four assistants:

| Type | CLI | Cheap model for iteration |
|------|-----|---------------------------|
| `claude-code` | `claude` | `haiku` |
| `mistral-vibe` | `vibe` | `devstral-small` |
| `opencode` | `opencode` | `minimax-m2.5-free` (free) |
| `bob` | `bob` | N/A |

Scaffold a new eval project with `pitlane init`.
This creates an `eval.yaml` and a `fixtures/empty/` directory to get started.

## How an eval YAML works

An eval file has two sections: `assistants` (who runs the tasks) and `tasks` (what they do and how to check the results).

`pitlane schema generate` prints the schema for the eval file. Use it to discover advanced capabilities.

```yaml
assistants:
  baseline:
    type: claude-code  # or mistral-vibe, opencode
    args:
      model: haiku
  with-skill:
    type: claude-code
    args:
      model: haiku
    skills:
      - source: org/repo   # GitHub org/repo for the skill
        skill: skill-name  # optional, only if the repo has multiple skills

tasks:
  - name: my-task
    prompt: "What the assistant should do"
    workdir: ./fixtures/my-task  # copied fresh for each run
    timeout: 300
    assertions:
      - file_exists: "output.py"
      - command_succeeds: "python output.py"
      - file_contains: { path: "output.py", pattern: "def main" }
```

For MCP servers instead of skills, pass the config through assistant args:

```yaml
# Claude Code
with-mcp:
  type: claude-code
  args:
    model: haiku
    mcp_config: ./mcp-config.json

# Mistral Vibe
with-mcp:
  type: mistral-vibe
  args:
    model: devstral-small
    mcp_servers:
      - name: my-server
        url: http://localhost:3000
```

### Assertions

Deterministic (prefer these):

- `file_exists: "path"` -- does the file exist?
- `file_contains: { path: "file", pattern: "regex" }` -- does it match a regex?
- `command_succeeds: "cmd"` -- does the command exit 0?
- `command_fails: "cmd"` -- does it exit non-zero?
- `custom_script: "python validate.py"` -- run a validation script (also supports an advanced form with `script`, `args`, `timeout`, and `expected_exit_code`)

Similarity (requires `pip install pitlane[similarity]`):

- `rouge: { actual: "file", expected: "./refs/golden.md", metric: "rougeL", min_score: 0.35 }` -- topic coverage; fast; good for docs
- `bleu: { actual: "file", expected: "./refs/golden.md", min_score: 0.15 }` -- phrase matching; fast; good for docs but bad for code
- `bertscore: { actual: "file", expected: "./refs/golden.md", min_score: 0.75 }` -- semantic similarity; slow; works for docs and code
- `cosine_similarity: { actual: "file", expected: "./refs/golden.tf", min_score: 0.7 }` -- overall meaning; slow; best for code and configs

Any assertion can take a `weight` (default 1.0) to make it count more in the `weighted_score`.

### Running

```bash
pitlane run eval.yaml                            # run everything
pitlane run eval.yaml --task my-task             # one task only
pitlane run eval.yaml --only-assistants baseline # run only these assistants (comma-separated)
pitlane run eval.yaml --skip-assistants baseline # skip these assistants (comma-separated)
pitlane run eval.yaml --parallel 4               # run tasks in parallel
pitlane run eval.yaml --repeat 5                 # repeat for statistical confidence
pitlane run eval.yaml --verbose                  # stream debug output
```

These options can be combined, e.g. `pitlane run eval.yaml --task my-task --only-assistants baseline --repeat 5 --parallel 4`.

Results go to `runs/<timestamp>/` with `report.html` (side-by-side comparison), `results.json`, and `debug.log`.

## Eval design

This is the hard part. The syntax above is straightforward; designing evals that produce real signal is not.

Before writing any YAML, work through these questions with the user.

### What is the knowledge delta?

The skill or MCP server gives the assistant something it doesn't have by default. The eval has to target that gap directly.

- What can the assistant do WITH the skill that it can't do WITHOUT? Test that.
- What's the smallest task where the skill makes a difference? Start there, not with a complex integration test.
- Would a strong model pass this task anyway? If so, the task measures the model, not the skill.

### Isolate the variable

The only difference between baseline and challenger should be the skill or MCP server.
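Concretely, an isolated pair reuses the same `type` and `args`, with the `skills` block as the only delta (a sketch reusing the `org/repo` placeholder from the earlier example):

```yaml
assistants:
  baseline:
    type: claude-code
    args:
      model: haiku
  with-skill:
    type: claude-code   # same type
    args:
      model: haiku      # same model; the skills block is the only difference
    skills:
      - source: org/repo
```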
Everything else stays identical: same model, same prompt (word for word), same fixture directory, same timeout.

If the challenger prompt says "use the MCP tool to...", you're testing prompt engineering, not the skill.

### Start cheap, validate expensive

Use a cheaper or weaker model during development. Weaker models amplify the skill's effect because they struggle more without help, which makes the delta easier to see. Switch to a stronger model for final benchmarks to confirm the skill helps capable models too.

## Assertion strategy

### Layer your assertions

Design assertions in rough order of importance:

1. Did the assistant produce the expected files? (`file_exists`)
2. Does the output actually work? (`command_succeeds`, `file_contains`)
3. How close is it to a known-good solution? (similarity metrics, `custom_script`)

Weight them accordingly. A passing test suite (`command_succeeds` at weight 3.0) matters more than a README existing (`file_exists` at weight 1.0).

### Prefer deterministic assertions

Similarity metrics are tempting but noisy. Reach for deterministic assertions first:

- Instead of `cosine_similarity` on `main.tf`, use `file_contains` to check for specific module sources.
- Instead of `rouge` on a README, use `file_contains` to check that key sections exist.
- Save similarity for genuinely open-ended outputs like free-form documentation.

When you do use similarity, set `min_score` conservatively. Scores vary between runs. If your threshold is tight, `--repeat 5` will show you the variance.

### Custom scripts for complex validation

When `file_contains` and `command_succeeds` aren't expressive enough, write a custom script.
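For example, a `custom_script` target might look like the sketch below (the `deploy.json` file name and the fields it checks are illustrative, not part of pitlane; only the exit-code contract matters):

```python
# validate.py -- example custom_script target.
# Exit code 0 means the assertion passes; non-zero means it fails.
import json
import sys
from pathlib import Path


def check(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is valid."""
    problems = []
    # multi-step validation: JSON already parsed, now check field constraints
    if not isinstance(config.get("replicas"), int) or config.get("replicas", 0) < 1:
        problems.append("'replicas' must be a positive integer")
    # conditional logic: if TLS is enabled, a cert path must be present
    tls = config.get("tls", {})
    if tls.get("enabled") and not tls.get("cert"):
        problems.append("tls.enabled is true but tls.cert is missing")
    return problems


if __name__ == "__main__":
    problems = check(json.loads(Path("deploy.json").read_text()))
    for p in problems:
        print(f"FAIL: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Wire it in with `custom_script: "python validate.py"`, as in the assertions list above.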
Place it in the fixture directory (it gets copied to the workspace) or reference it with a relative path.

Good candidates: multi-step validation (parse JSON, check field relationships), domain-specific checks (API response format, dependency graphs), or conditional logic (if file A exists, B must contain X).

## NEVER

- NEVER write prompts that hint at the skill's existence. The prompt describes the task, not how to solve it. If your prompt says "use module X from registry Y", you're testing instruction-following, not the skill.
- NEVER test with only one task. A single task can pass or fail for unrelated reasons. Use 3+ tasks at different difficulty levels.
- NEVER skip the baseline. Without one, you can't attribute results to the skill. "It passes" means nothing if it also passes without the skill.
- NEVER use tight similarity thresholds without `--repeat`. A `min_score` of 0.7 that passes once and fails twice is not a passing assertion. Run at least 3 repeats for similarity-based evals.
- NEVER put golden references in the fixture root. They go in `refs/`, which pitlane excludes from the workspace. If the assistant can see the expected output, the eval is meaningless.
- NEVER cram everything into one mega-task. "Create a full application with auth, database, API, and tests" measures general coding ability, not your skill. Split into focused tasks that isolate what the skill improves.
- NEVER assume `command_succeeds` means the code is correct. `python main.py` exiting 0 says nothing about output correctness. Combine with `file_contains` or pipe output through a validation script.
- NEVER set timeouts too tight during development. Start with 300-600s. Tighten after you know the baseline completion time.
A timeout failure tells you nothing about skill quality.

## Fixture design

### Empty vs pre-populated

Use an empty fixture (`fixtures/empty/` with a `.gitkeep`) when your skill helps with greenfield creation like scaffolding or boilerplate. Use a pre-populated fixture when the skill helps with modification or enhancement of existing code. Seed it with realistic starter files.

### The refs/ directory

Golden references live in a `refs/` subdirectory inside each fixture. Pitlane excludes `refs/` when copying the fixture to the workspace, so the assistant never sees them. They're used as targets for similarity assertions (`expected: "./refs/expected-main.tf"`) and as documentation of what "good" looks like for human reviewers.

Write reference files by hand or curate them from known-good outputs. Don't generate them with the same model you're testing.

## Interpreting results

### Metrics that matter

- `assertion_pass_rate` compares baseline vs challenger directly. If both hit 100%, the tasks are too easy. If the baseline is already at 80%+, the skill has less room to prove itself.
- `weighted_score` is more nuanced than `assertion_pass_rate` when you're using weights and similarity metrics. Compare the delta between baseline and challenger.
- `cost_usd` and `wall_clock_seconds` track the trade-off. A skill that improves quality at 3x the cost may not be worth it.

### Reading the results

| Signal | What it means |
|--------|---------------|
| Baseline low, challenger high | The skill is working |
| Both high | Tasks are too easy; add harder ones |
| Both low | Tasks may be too hard, or the skill needs work |
| Baseline high, challenger low | The skill is hurting; investigate |
| High variance across `--repeat` runs | Assertions or prompts need tightening |

### The iteration loop

This is TDD applied to skills:

1. Write tasks and assertions the skill should help with. Run the baseline.
It should struggle.
2. Run with the skill. If it doesn't improve, the skill needs work, not the eval.
3. Tighten assertions, add edge cases, increase difficulty. Repeat.

If the baseline already passes everything, the eval isn't testing the skill's value. Go back to "what is the knowledge delta?"
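One pass through this loop, using only the flags from the Running section:

```bash
# 1. run the baseline alone; it should struggle
pitlane run eval.yaml --only-assistants baseline

# 2. run the full comparison, baseline vs challenger
pitlane run eval.yaml

# 3. after tightening assertions, check that the delta is stable
pitlane run eval.yaml --repeat 5 --parallel 4
```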
Full transparency — inspect the skill content before installing.