Add this skill
npx mdskills install pitlane-ai/testing-with-pitlane
Comprehensive eval design guide with actionable setup, assertion strategy, and clear anti-patterns
A feedback loop for people building AI skills and MCP servers.
You're building a skill, an MCP server, or a custom prompt strategy that's supposed to make an AI coding assistant better at a specific job. But how do you know it actually works? How do you know your latest commit made things better and not worse?
Pitlane gives you the answer. Define the tasks your skill should help with, set up a baseline (assistant without your skill) and a challenger (assistant with your skill), and race them. The results tell you with numbers, not vibes, whether your work is paying off.
In motorsport, the pit lane is where engineers tune the car between laps. Swap a part, adjust the setup, check the telemetry, see if the next lap is faster.
Building skills and MCP servers works the same way.
Pitlane is the telemetry system. You build the skill, pitlane tells you if it's working.
junit.xml) for native CI test reporting
You'll need uv, a fast Python package installer.
Install on macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install on Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Note: Windows should work but has not been extensively tested. If you encounter issues or can help validate Windows support, contributions are welcome!
Install pitlane:
uv tool install pitlane --from git+https://github.com/pitlane-ai/pitlane.git
Or run without installing:
uvx --from git+https://github.com/pitlane-ai/pitlane.git pitlane run examples/simple-codegen-eval.yaml
Before running your first evaluation, make sure an AI coding assistant is installed and authenticated on your machine. Choose one of the supported assistants and install its CLI tool following the official documentation; most assistants will prompt you to log in on first use.
The example below uses OpenCode because it's free and requires no API key, but you can use any supported assistant by editing the YAML file.
Initialize a project with example benchmarks:
pitlane init --with-examples
Run the evaluation:
pitlane run examples/simple-codegen-eval.yaml
Results appear in runs/ with an HTML report showing pass rates and metrics.
Want to use a different assistant? Edit examples/simple-codegen-eval.yaml and uncomment your preferred assistant configuration. See Supported Assistants for options.
Need help designing benchmarks? Install the pitlane skill for AI-guided assistance:
npx skills add pitlane-ai/pitlane
Your AI assistant can then help you create effective eval benchmarks. See Writing Benchmarks for details.
| Assistant | Type | Status |
|---|---|---|
| Claude Code | claude-code | ✅ Tested |
| Mistral Vibe | mistral-vibe | ✅ Tested |
| OpenCode | opencode | ✅ Tested |
| Bob | bob | ✅ Tested |
For the latest status and additional assistants in development, see PR #32.
Want to add support for another assistant? Contributions are welcome! See the Contributing Guide for instructions on implementing new assistants.
Run all tasks against all configured assistants:
pitlane run examples/simple-codegen-eval.yaml
Run specific tasks or assistants:
Run a single task:
pitlane run examples/simple-codegen-eval.yaml --task hello-world-python
Run specific assistants (comma-separated):
pitlane run examples/simple-codegen-eval.yaml --only-assistants claude-baseline
Skip assistants:
pitlane run examples/simple-codegen-eval.yaml --skip-assistants claude-baseline
Combine filters:
pitlane run examples/simple-codegen-eval.yaml --task hello-world-python --only-assistants claude-baseline
Speed up multi-task benchmarks:
pitlane run examples/simple-codegen-eval.yaml --parallel 4
Run tasks multiple times to measure consistency and get aggregated statistics:
pitlane run examples/simple-codegen-eval.yaml --repeat 5
This runs each task 5 times and reports avg/min/max/stddev for all metrics in the HTML report.
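As a rough illustration of what those aggregates mean, here is a minimal sketch that summarizes one metric across repeated runs. The sample values and the `summarize` helper are hypothetical, and the use of the sample standard deviation is an assumption; pitlane's own calculation may differ.

```python
import statistics


def summarize(samples):
    """Return avg/min/max/stddev for one metric across repeated runs."""
    return {
        "avg": statistics.mean(samples),
        "min": min(samples),
        "max": max(samples),
        # Assumption: sample standard deviation, which needs at least two runs.
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# Hypothetical pass rates from five repeats of a single task.
pass_rates = [1.0, 0.5, 1.0, 1.0, 0.5]
summary = summarize(pass_rates)
print(summary)
```

A high stddev relative to the average suggests the assistant's behaviour on that task is unstable, so a single run would not be a reliable signal.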
Every run creates debug.log with detailed execution information. To stream output to the terminal in real time:
pitlane run examples/simple-codegen-eval.yaml --verbose
All assertions include detailed logging to help diagnose failures.
Press Ctrl+C to stop a run. You'll get a partial HTML report with results from completed tasks.
By default, report.html opens in your browser after each run. To disable this:
pitlane run examples/simple-codegen-eval.yaml --no-open
The same flag works when regenerating a report:
pitlane report runs/2024-01-01_12-00-00 --no-open
Initialize new benchmark project:
pitlane init
Initialize with example benchmarks:
pitlane init --with-examples
Generate JSON Schema for YAML validation:
pitlane schema generate
Install VS Code YAML validation (safe, with preview):
pitlane schema install
Regenerate HTML report from existing junit.xml:
pitlane report runs/2024-01-01_12-00-00
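Since the report is built from junit.xml, a CI job could also read that file directly, for example to fail a build when the pass rate drops. The sketch below assumes the conventional JUnit layout (`testcase` elements with optional `failure` or `error` children); the sample XML and the `junit_pass_rate` helper are hypothetical, and the exact structure pitlane writes is not shown here.

```python
import xml.etree.ElementTree as ET


def junit_pass_rate(xml_text):
    """Fraction of <testcase> elements without a <failure> or <error> child."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = [
        case for case in cases
        if case.find("failure") is not None or case.find("error") is not None
    ]
    return (len(cases) - len(failed)) / len(cases) if cases else 0.0


# Hypothetical sample in the conventional JUnit shape.
SAMPLE = """<testsuites>
  <testsuite name="claude-baseline">
    <testcase name="file_exists"/>
    <testcase name="command_succeeds"><failure message="exit code 1"/></testcase>
  </testsuite>
</testsuites>"""

rate = junit_pass_rate(SAMPLE)
print(f"pass rate: {rate:.0%}")
```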
Benchmarks are YAML files with two sections: assistants and tasks.
Need help designing effective benchmarks? Install the pitlane skill for AI-guided assistance:
npx skills add pitlane-ai/pitlane
Your AI assistant can help you design eval benchmarks that actually measure whether your skills or MCP servers improve performance.
assistants:
  claude-baseline:
    type: claude-code
    args:
      model: haiku
tasks:
  - name: hello-world-python
    prompt: "Create a Python script called hello.py that prints 'Hello, World!'"
    workdir: ./fixtures/empty
    timeout: 120
    assertions:
      - file_exists: "hello.py"
      - command_succeeds: "python3 hello.py"
      - file_contains: { path: "hello.py", pattern: "Hello, World!" }
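To make the three assertions in this example concrete, here is a minimal sketch of what each check amounts to when run against a task's working directory. The function names mirror the assertion keys but are illustrative only; this is not pitlane's implementation.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def file_exists(workdir, path):
    """True if the file is present in the working directory."""
    return (Path(workdir) / path).is_file()


def command_succeeds(workdir, command, timeout=60):
    """True if the shell command exits with code 0 when run in workdir."""
    result = subprocess.run(
        command, shell=True, cwd=workdir, capture_output=True, timeout=timeout
    )
    return result.returncode == 0


def file_contains(workdir, path, pattern):
    """True if the regex pattern matches somewhere in the file."""
    target = Path(workdir) / path
    return target.is_file() and re.search(pattern, target.read_text()) is not None


# Simulate an assistant that produced hello.py in a throwaway workdir.
with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "hello.py").write_text("print('Hello, World!')\n")
    results = [
        file_exists(workdir, "hello.py"),
        command_succeeds(workdir, f'"{sys.executable}" hello.py'),
        file_contains(workdir, "hello.py", "Hello, World!"),
    ]
print(results)
```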
Each assistant defines how to run a model:
assistants:
  # Baseline configuration
  claude-baseline:
    type: claude-code
    args:
      model: haiku
  # With skills/MCP
  claude-with-skill:
    type: claude-code
    args:
      model: haiku
    skills:
      # Remote: GitHub reference (installed via npx skills add)
      - source: org/repo
        skill: my-skill-name
      # Local: directory path (for development/testing)
      - source: ./path/to/my-skill
Each task specifies:
- name: Unique identifier
- prompt: Instructions for the assistant
- workdir: Fixture directory (copied for each run)
- timeout: Maximum seconds
- assertions: Checks to verify success

assertions:
  # File exists
  - file_exists: "main.py"
  # Command succeeds (exit code 0)
  - command_succeeds: "python main.py"
  # Command fails (non-zero exit)
  - command_fails: "python main.py --invalid"
  # File contains pattern (regex)
  - file_contains:
      path: "main.py"
      pattern: "def main\\(\\):"
  # Custom script validation
  - custom_script:
      script: "./validate.sh"
      interpreter: "bash"
      timeout: 30
      expected_exit_code: 0
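The doubled backslashes in the file_contains pattern matter: inside a double-quoted YAML string, `\\(` becomes `\(`, which the regex engine reads as a literal parenthesis. The sketch below shows the difference, assuming Python-style regex search semantics (an assumption; the matching mode pitlane uses is not stated here).

```python
import re

# After YAML parsing, "def main\\(\\):" becomes the regex  def main\(\):
escaped = "def main\\(\\):"
# Without escaping, () is an empty group, so this regex only matches "def main:".
unescaped = "def main():"

source = "import sys\n\ndef main():\n    print('ok')\n"

escaped_matches = re.search(escaped, source) is not None
unescaped_matches = re.search(unescaped, source) is not None
print(escaped_matches, unescaped_matches)
```

If a pattern is meant as plain text rather than a regex, escaping every special character (parentheses, dots, brackets) avoids this kind of silent mismatch.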
When you need more complex validation logic than simple commands provide, use custom_script to run a dedicated test script. This is useful for multi-step validation, complex parsing, or reusable test logic.
Simple form (expects exit code 0):
- custom_script: "scripts/validate_output.sh"
- custom_script: "python scripts/validate.py"
- custom_script: "node scripts/check.js"
Advanced form with options:
- custom_script:
    script: "python scripts/validate_output.py"
    args: ["--strict", "--format=json"]
    timeout: 30
    expected_exit_code: 0
Options:
- script: Shell command to execute (e.g., python script.py, node script.js, ./script.sh)
- args: List of arguments to pass to the script (optional)
- timeout: Maximum seconds to wait for completion (default: 60)
- expected_exit_code: Exit code that indicates success (default: 0)

The script field is executed as a shell command in the workdir, so you can use any interpreter:
- python validate.py or python3 validate.py
- node check.js
- ./validate.sh (must have shebang and be executable)

This works like command_succeeds, but with more control over timeout and exit codes. Your script receives the workdir as its working directory, so it can access generated files directly. The assertion passes if the script exits with the expected code.
Example validation script (scripts/validate_tf.sh):
#!/bin/bash
# Check if Terraform config is valid and contains required resources
terraform validate || exit 1
grep -q "aws_s3_bucket" main.tf || exit 2
exit 0
Use it in your eval:
- custom_script: "scripts/validate_tf.sh"
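The same pattern works for a Python validator referenced as `python scripts/validate.py`. The sketch below is illustrative: `config.json` and the required `name` key are hypothetical stand-ins for whatever your task is supposed to generate. Returning a distinct exit code per failure makes the cause easier to spot in the logs.

```python
import json
import tempfile
from pathlib import Path


def validate(workdir):
    """Return 0 on success, or a distinct non-zero code for each failure."""
    config = Path(workdir) / "config.json"  # hypothetical generated file
    if not config.is_file():
        return 1
    try:
        data = json.loads(config.read_text())
    except json.JSONDecodeError:
        return 2
    if not isinstance(data, dict) or "name" not in data:
        return 3
    return 0


# A real script would end with: sys.exit(validate(Path.cwd()))
# Here the function is exercised against a throwaway directory instead.
with tempfile.TemporaryDirectory() as workdir:
    missing = validate(workdir)
    Path(workdir, "config.json").write_text('{"name": "demo"}')
    ok = validate(workdir)
print(missing, ok)
```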
When exact matching isn't practical, use similarity metrics:
assertions:
  # ROUGE: topic coverage (good for docs)
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.35
  # BLEU: phrase matching (good for docs, not code)
  - bleu:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.2
  # BERTScore: semantic similarity (good for docs/code)
  - bertscore:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.75
  # Cosine similarity: overall meaning (good for code/configs)
  - cosine_similarity:
      actual: "variables.tf"
      expected: "./refs/expected-vars.tf"
      min_score: 0.7
Choosing metrics:
| Metric | Question | Speed | Best For |
|---|---|---|---|
| rouge | Same topics? | Fast | Documentation coverage |
| bleu | Same phrases? | Fast | Documentation phrasing |
| bertscore | Same meaning? | Slow | Semantic preservation |
| cosine_similarity | Same subject? | Slow | Code/config similarity |
Use deterministic assertions first. Add similarity metrics when you need fuzzy matching.
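To build intuition for what a rougeL threshold such as 0.35 means, here is a simplified sketch of ROUGE-L as an F1 score over the longest common subsequence of whitespace-separated tokens. Real implementations typically add proper tokenization and stemming, and the scoring library pitlane uses is not specified here, so treat the numbers as illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(actual, expected):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    a, e = actual.lower().split(), expected.lower().split()
    lcs = lcs_length(a, e)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(e)
    return 2 * precision * recall / (precision + recall)


expected = "install the tool then run the evaluation"
actual = "install the tool and run an evaluation"
score = rouge_l_f1(actual, expected)
print(round(score, 3), score >= 0.35)
```

Because the score rewards shared word order rather than exact equality, a paraphrased README can pass while an unrelated one scores near zero.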
Make some assertions count more:
assertions:
  - file_exists: "main.tf"
  - command_succeeds: "terraform validate"
    weight: 3.0  # 3x more important
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.3
    weight: 2.0
Results include both assertion_pass_rate (binary) and weighted_score (continuous).
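The exact formulas are not spelled out here, so the following is only one plausible reading: the pass rate counts assertions that passed, while the weighted score is a weight-averaged value in which deterministic checks contribute 0 or 1 and similarity checks contribute their raw score. The `grade` helper and the sample results are hypothetical.

```python
def grade(results):
    """results: list of dicts with 'passed' (bool), 'score' (0..1), 'weight'."""
    total_weight = sum(r["weight"] for r in results)
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    # Assumption: weighted mean of per-assertion scores.
    weighted = sum(r["score"] * r["weight"] for r in results) / total_weight
    return pass_rate, weighted


# Hypothetical outcome for the three assertions above.
results = [
    {"name": "file_exists", "passed": True, "score": 1.0, "weight": 1.0},
    {"name": "terraform validate", "passed": False, "score": 0.0, "weight": 3.0},
    {"name": "rouge", "passed": True, "score": 0.42, "weight": 2.0},
]
pass_rate, weighted = grade(results)
print(round(pass_rate, 3), round(weighted, 3))
```

Under this reading, two of three assertions pass, yet the weighted score stays low because the heavily weighted `terraform validate` check failed, which is the point of weighting.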
The examples/ directory contains working benchmarks you can use as starting points:
- simple-codegen-eval.yaml: Minimal example with deterministic assertions
- similarity-codegen-eval.yaml: Demonstrates all similarity metrics
- terraform-module-eval.yaml: Real-world Terraform evaluation
- weighted-grading-eval.yaml: Weighted assertions and continuous scoring

Treat benchmarks like tests. This lets you iterate on what "good" means without guessing.
Enable YAML validation:
pitlane schema install
This adds JSON Schema validation to .vscode/settings.json with preview and backup.
Manual setup:
{
  "yaml.schemas": {
    "./schemas/pitlane.schema.json": [
      "eval.yaml",
      "examples/*.yaml",
      "**/*eval*.y*ml"
    ]
  },
  "yaml.validate": true
}
Generate schema and docs:
pitlane schema generate
This outputs:
- schemas/pitlane.schema.json
- docs/schema.md

Execution is not sandboxed. Pitlane runs assistants directly on your system using their native CLIs. While this provides full functionality and realistic testing conditions, it means assistants have the same file system and network access as any other process you run.
Recommended precautions:
The native CLI approach is intentional—it ensures pitlane tests assistants in real-world conditions. But like any development tool that executes code, reasonable precautions are advisable.
See CONTRIBUTING.md for development setup, testing guidelines, and how to submit changes.
Apache 2.0
Install Testing With Pitlane with a single command:
npx mdskills install pitlane-ai/testing-with-pitlane
This downloads the skill files into your project and your AI agent picks them up automatically.
Testing With Pitlane works with Claude Code, Claude Desktop, Cursor, VS Code Copilot, Windsurf, Continue Dev, Gemini CLI, Amp, Roo Code, and Goose. Skills use the open SKILL.md format, which is compatible with any AI coding agent that reads markdown instructions.