Autonomous Experiment Loops for Claude Code — Let AI optimize while you sleep
Edit code → commit → run benchmark → measure metric → keep improvement or revert → repeat forever.
Works for any optimization target: LLM training loss, test speed, bundle size, build time, Lighthouse scores, and more.
Inspired by Karpathy's autoresearch, pi-autoresearch, and litesearch.
This plugin provides two skills that work together. Autoresearch is the core engine (works for any metric), and Autoresearch ML extends it with GPU-specific templates for LLM training.
Autoresearch: a domain-agnostic autonomous experiment loop. Its append-only log, autoresearch.jsonl, survives context resets and session restarts.
Autoresearch ML: specialized for LLM training with NVIDIA GPUs. Extends the core Autoresearch skill; a fixed validation metric (val_bpb) enables fair comparison across architectures.
Use npx skills to install skills directly:
```bash
# Install all skills
npx skills add proyecto26/autoresearch-ai-plugin

# Install specific skills
npx skills add proyecto26/autoresearch-ai-plugin --skill autoresearch autoresearch-ml

# List available skills
npx skills add proyecto26/autoresearch-ai-plugin --list
```
This automatically installs to your .claude/skills/ directory.
Install via Claude Code's built-in plugin system:
```bash
# Add the marketplace
/plugin marketplace add proyecto26/autoresearch-ai-plugin

# Install the plugin
/plugin install autoresearch-ai-plugin
```
Or clone the repository and copy the skills into place:
```bash
git clone https://github.com/proyecto26/autoresearch-ai-plugin.git
cp -r autoresearch-ai-plugin/skills/* .claude/skills/
```
Add as a submodule for easy updates:
```bash
git submodule add https://github.com/proyecto26/autoresearch-ai-plugin.git .claude/autoresearch-ai-plugin
```
Then reference skills from .claude/autoresearch-ai-plugin/skills/.
"Run autoresearch to optimize my test suite"
Triggers Autoresearch to set up a benchmark loop, measure test runtime, and iteratively optimize your test configuration.
"Start an experiment loop to reduce bundle size"
Triggers Autoresearch to measure your build output and autonomously try tree-shaking, code splitting, and dependency optimizations.
"Set up ML autoresearch with my RTX 4090"
Triggers Autoresearch ML to copy the training assets, prepare data, and begin autonomous LLM pretraining experiments.
"Optimize val_bpb autonomously overnight"
Triggers Autoresearch ML to run 5-minute training experiments in a loop, keeping architecture and hyperparameter improvements.
"What's the autoresearch status?"
Shows a summary of the current session: total runs, kept improvements, best metric, confidence score.
```mermaid
flowchart TD
    A[User triggers autoresearch] --> B[Setup Phase]
    B --> B1[Define goal, metric, command, files in scope]
    B1 --> B2[Create autoresearch.md + autoresearch.sh]
    B2 --> B3[Run baseline → Record in autoresearch.jsonl]
    B3 --> C[Experiment Loop]
    C --> D[Read past results + ASI annotations]
    D --> E[Choose experimental change]
    E --> F[Edit files → git commit]
    F --> G[Run benchmark: bash autoresearch.sh]
    G --> H[Parse METRIC lines from output]
    H --> I{autoresearch.checks.sh?}
    I -- Yes --> J[Run correctness checks]
    I -- No --> K{Metric improved?}
    J -- Pass --> K
    J -- Fail --> L[Revert commit]
    K -- Yes --> M[KEEP commit]
    K -- No/Equal --> L
    M --> N[Log to autoresearch.jsonl with ASI]
    L --> N
    N --> O[Update autoresearch.md with learnings]
    O --> C
    style A fill:#4a9eff,color:#fff
    style M fill:#22c55e,color:#fff
    style L fill:#ef4444,color:#fff
    style C fill:#f59e0b,color:#fff
```
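The parse step of the loop reads `METRIC name=value` lines from whatever the benchmark prints. A minimal sketch of that extraction (the bundled parse-metrics.sh may be implemented differently, and the sample output below is illustrative):

```shell
# Extract "METRIC name=value" pairs from benchmark output.
output='running tests...
METRIC total_ms=4230
METRIC compile_ms=1200
done'

printf '%s\n' "$output" | awk '$1 == "METRIC" { print $2 }'
# prints:
# total_ms=4230
# compile_ms=1200
```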
Context resets? No problem. autoresearch.jsonl + autoresearch.md contain everything needed to resume — including ASI annotations from discarded experiments.
The plugin includes a PreToolUse hook that automatically blocks modification of sensitive files during the experiment loop. This prevents accidental changes that would invalidate experiment comparisons.
Protected files:
| File | Why it's protected |
|---|---|
| prepare.py | Fixed evaluation harness, tokenizer, dataloader — modifying it invalidates all comparisons |
| autoresearch.sh | Benchmark script — changing it mid-loop breaks metric comparability |
| autoresearch.checks.sh | Correctness checks — weakening them mid-loop undermines quality guarantees |
| parse-metrics.sh | Plugin utility script |
| log-experiment.sh | Plugin utility script |
Allowed files: train.py and all other project files remain fully editable.
Setup-phase aware: Protected files can be created during initial setup (file doesn't exist yet) but cannot be modified once they exist. This allows the normal setup flow where Claude writes autoresearch.sh for the first time.
If Claude attempts to modify an existing protected file, the hook blocks the operation and returns feedback explaining why, so Claude can adjust its approach.
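The setup-phase rule can be illustrated with a small sketch. This is a hypothetical helper, not the shipped hooks/protect-files.sh, which may be implemented differently:

```shell
#!/usr/bin/env bash
# Sketch: a protected file may be created while it does not yet exist
# (setup phase), but not modified once it exists.
PROTECTED="prepare.py autoresearch.sh autoresearch.checks.sh parse-metrics.sh log-experiment.sh"

is_blocked() {
  target="$1"
  base=$(basename "$target")
  for f in $PROTECTED; do
    if [ "$base" = "$f" ] && [ -e "$target" ]; then
      return 0   # protected file already exists: block the edit
    fi
  done
  return 1       # not protected, or being created for the first time
}

if is_blocked "$1"; then
  echo "Blocked: $1 is protected during the experiment loop" >&2
  exit 2         # exit code 2 denies the tool call and surfaces the message
fi
```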
Create .claude/autoresearch-ai-plugin.local.md in your project root for persistent settings:
```yaml
---
enabled: true
max_iterations: 50
working_dir: "/path/to/project"
benchmark_timeout: 600
checks_timeout: 300
---
```
| Field | Default | Description |
|---|---|---|
| enabled | true | Whether autoresearch is active |
| max_iterations | 0 (unlimited) | Stop after N experiments |
| working_dir | current directory | Override directory for experiment files |
| benchmark_timeout | 600 | Benchmark timeout in seconds |
| checks_timeout | 300 | Correctness checks timeout in seconds |
This file is per-project and should not be committed (add .claude/*.local.md to .gitignore).
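Because the settings live in plain YAML frontmatter, they can be read with standard tools. This extraction is a hypothetical sketch, not necessarily how the plugin parses the file:

```shell
# Read max_iterations from the local config's YAML frontmatter.
cfg=".claude/autoresearch-ai-plugin.local.md"
max_iters=$(sed -n '/^---$/,/^---$/p' "$cfg" | sed -n 's/^max_iterations:[[:space:]]*//p')
echo "${max_iters:-0}"   # falls back to 0 (unlimited) when unset
```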
| File | Purpose |
|---|---|
| autoresearch.md | Living session doc — goal, metrics, scope, learnings |
| autoresearch.sh | Benchmark script outputting METRIC name=value lines |
| autoresearch.checks.sh | Optional correctness checks (tests, lint, types) |
| autoresearch.jsonl | Append-only experiment log with ASI (survives restarts) |
| autoresearch.ideas.md | Optional backlog of experiment ideas |
| .claude/autoresearch-ai-plugin.local.md | Optional persistent configuration |
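The only contract autoresearch.sh must honor is printing METRIC lines; everything else is project-specific. A minimal sketch, where the command being timed is a placeholder:

```shell
#!/usr/bin/env bash
# Minimal benchmark sketch: time a workload and emit a METRIC line.
# "your_benchmark_command" is a placeholder, not a real command.
set -euo pipefail

start=$(date +%s%N)                      # nanoseconds (GNU date)
your_benchmark_command >/dev/null 2>&1 || true
end=$(date +%s%N)

echo "METRIC total_ms=$(( (end - start) / 1000000 ))"
```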
Each experiment is logged as a single JSON line in autoresearch.jsonl:
```json
{"run":5,"commit":"abc1234","metric":4230,"metrics":{"compile_ms":1200},"status":"keep","description":"parallelized tests","timestamp":1700000000,"segment":0,"confidence":2.3,"asi":{"hypothesis":"parallel tests reduce wall time","next_action_hint":"try worker pool tuning"}}
```
A config header is written once at setup:
```json
{"type":"config","name":"Optimize tests","metricName":"total_ms","metricUnit":"ms","bestDirection":"lower"}
```
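Because the log is plain JSONL, it can be queried with ordinary tools. A hypothetical one-liner to find the best kept run, assuming a numeric metric where lower is better (jq would be cleaner where available):

```shell
# Prefix each kept run with its metric, sort numerically, take the best.
grep '"status":"keep"' autoresearch.jsonl \
  | sed -E 's/.*"metric":([0-9.]+).*/\1 &/' \
  | sort -n | head -n1 | cut -d' ' -f2-
```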
The autoresearch-ml skill includes a complete LLM pretraining setup in assets/:
| File | Role |
|---|---|
| prepare.py | Data download, BPE tokenizer training, dataloader with best-fit packing |
| train.py | GPT model with Flash Attention 3, RoPE, sliding window attention, MuonAdamW |
| program.md | Self-contained agent instructions for the autonomous ML loop |
| pyproject.toml | Python dependencies (PyTorch 2.9.1 + CUDA 12.8) |
| Tier | GPUs | VRAM |
|---|---|---|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB |
| Consumer+ | RTX 3090, RTX 4090 | 24GB |
| Enthusiast | RTX 5090 | 32GB |
| Datacenter | A100, H100 | 40-80GB |
Consumer GPUs use gradient checkpointing, built-in attention (no Flash Attention dependency), and automatic fp32 fallback for Pascal architectures.
```
autoresearch-ai-plugin/
├── .claude-plugin/
│   ├── plugin.json               # Plugin manifest
│   └── marketplace.json          # Marketplace configuration
├── hooks/
│   ├── hooks.json                # Hook configuration (PreToolUse file protection)
│   └── protect-files.sh          # Blocks modification of sensitive files
├── tests/
│   └── run-tests.sh              # E2E test suite (27 tests)
└── skills/
    ├── autoresearch/             # Generic experiment loop
    │   ├── SKILL.md              # Core skill — edit/measure/keep/discard cycle
    │   ├── scripts/
    │   │   ├── parse-metrics.sh  # Extract METRIC lines from benchmark output
    │   │   └── log-experiment.sh # Append results to autoresearch.jsonl
    │   ├── references/
    │   │   ├── confidence-scoring.md  # MAD-based noise analysis
    │   │   └── best-practices.md      # Benchmark tips, ASI patterns, experiment strategies
    │   └── examples/
    │       ├── autoresearch.sh        # Example benchmark script (portable)
    │       ├── autoresearch.checks.sh # Example correctness checks
    │       └── autoresearch.md        # Example session document
    └── autoresearch-ml/          # ML/GPU specialization (extends autoresearch)
        ├── SKILL.md              # ML skill — GPU setup, training workflow
        ├── references/
        │   └── gpu-training-guide.md  # CUDA config, OOM fixes, perf tuning
        └── assets/
            ├── prepare.py        # Data prep (download, tokenizer, dataloader)
            ├── train.py          # GPT model + training loop
            ├── program.md        # Agent instructions for ML loop
            └── pyproject.toml    # Python deps (PyTorch + CUDA)
```
Run the E2E test suite to verify all scripts and hooks work correctly:
```bash
bash tests/run-tests.sh
```
Covers 27 tests: metric parsing, JSONL logging (including JSON escaping and validation), file protection hooks (blocking, setup allowance, subdirectory bypass), Python asset compilation, and shell syntax checks.
This project is free and open source. Sponsors help keep it maintained and growing.
Become a Sponsor | Sponsorship Program
When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated ❤️.
You can learn more about how you can contribute to this project in the contribution guide.
Made with ❤️ by Proyecto 26 - Changing the world with small contributions.
One hand can accomplish great things, but many can take you into space and beyond! 🌌
Together we do more, together we are more ❤️
Install Autoresearch ML with a single command:
```bash
npx mdskills install proyecto26/autoresearch-ml
```
This downloads the skill files into your project, and your AI agent picks them up automatically.
Autoresearch ML works with Claude Code, Claude Desktop, Cursor, VS Code Copilot, Windsurf, Continue Dev, Codex, Gemini CLI, Amp, Roo Code, Goose, Opencode, Trae, Qodo, and Command Code. Skills use the open SKILL.md format, which is compatible with any AI coding agent that reads markdown instructions.