Edit code → commit → run benchmark → measure metric → keep improvement or revert → repeat forever. Works for any optimization target: LLM training loss, test suite runtime, bundle size, build time, Lighthouse scores, and more. Inspired by Karpathy's autoresearch, pi-autoresearch, and litesearch. This plugin provides two skills that work together: autoresearch is the core engine (works for any metric), and autoresearch-ml specializes it for single-GPU LLM pretraining.
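The core idea fits in a few lines. A minimal, illustrative sketch in Python (not the skill's actual implementation), assuming a benchmark script named `autoresearch.sh` that prints `METRIC val_bpb=<value>` lines as described further down:

```python
# Illustrative sketch only: keep a commit when the metric improves,
# revert it otherwise. Edit the target file before each iteration.
import re
import subprocess

best = float("inf")  # val_bpb: lower is better
while True:
    # ... edit train.py here, then commit the candidate change ...
    subprocess.run(["git", "commit", "-am", "experiment"], check=True)
    out = subprocess.run(["bash", "autoresearch.sh"],
                         capture_output=True, text=True).stdout
    match = re.search(r"METRIC val_bpb=([\d.]+)", out)
    if match and float(match.group(1)) < best:
        best = float(match.group(1))  # keep: the branch advances
    else:
        subprocess.run(["git", "revert", "--no-edit", "HEAD"], check=True)
```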
Add this skill:

```bash
npx mdskills install proyecto26/autoresearch-ml
```

Comprehensive autonomous ML training loop with excellent GPU support and detailed experiment tracking.
---
name: autoresearch-ml
description: >-
  Autonomous LLM training optimization with GPU support. Runs 5-minute training
  experiments, measures val_bpb, keeps improvements or reverts — repeat forever.
  Use this skill when the user asks to "train a model autonomously",
  "optimize LLM training", "run ML experiments", "autoresearch with GPU",
  "optimize val_bpb", "autonomous ML training", "LLM pretraining loop",
  "setup ML autoresearch", "GPU training experiments", "fine-tune a model",
  "pretrain from scratch", "speed up training", "lower my loss",
  "GPU optimization", "CUDA training", or mentions "train.py", "prepare.py",
  "bits per byte", "val_bpb", "NVIDIA GPU training", "RTX training",
  "H100 training", "autonomous model training", "consumer GPU training",
  "low VRAM training". Always use this skill when the user wants to autonomously
  optimize any ML training metric.
version: 0.2.0
---

# Autoresearch ML: Autonomous LLM Training Optimization

An autonomous experiment loop for single-GPU LLM pretraining. Edit `train.py` → commit → run 5-minute training → measure `val_bpb` → keep improvement or revert → **repeat forever**.

This skill is self-contained — it includes everything needed to set up and run the loop.

## Setup Phase

### 1. Copy Template Assets

Copy the bundled training template to the project directory:

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
```

### 2. Install and Prepare

```bash
uv sync            # Install dependencies
uv run prepare.py  # Download data shards, train tokenizer (~2 min)
```

### 3. Verify GPU

```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### 4. Initialize the Experiment Session

1. Create a branch: `git checkout -b autoresearch/<tag>-<date>`
2. Ensure session files are gitignored (critical — `git revert` will fail if tracked):
   ```bash
   echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
   git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
   ```
3. Read `prepare.py` and `train.py` thoroughly to understand the codebase
4. Write `autoresearch.md` — a living session document recording goal, metrics, files in scope, constraints, and learnings
5. Write `autoresearch.sh` — the benchmark script (see Benchmark Script section below)
6. Commit session files
7. Run baseline: `bash autoresearch.sh`
8. Parse metrics from output (lines matching `METRIC name=value`)
9. Record baseline in `autoresearch.jsonl` (see the example lines after this list):
   - First write a config header: `{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}`
   - Then record the baseline result
10. Begin the experiment loop
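For instance, the first two lines of `autoresearch.jsonl` could look like this. The field values here are illustrative; the record schema is the one defined in the Logging section below:

```json
{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}
{"run":0,"commit":"abc1234","metric":1.012,"metrics":{"peak_memory_mb":41000,"mfu_percent":38.2},"status":"keep","description":"baseline","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"unmodified template as reference point"}}
```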
## The Experiment Loop

**LOOP FOREVER. Never ask "should I continue?" — just keep going.**

The user might be asleep, away from the computer, or expects you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder — re-read `train.py` for new angles, try combining previous near-misses, try more radical architectural changes.

Each iteration:

```
1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from output
7. If output is empty (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat
```

### Decision Rules

- **val_bpb improved (lower)** → `keep` (commit stays, branch advances)
- **val_bpb equal or worse** → `discard` (run `git revert $(git rev-parse HEAD) --no-edit`)
- **Crash (OOM, CUDA error, NaN loss)** → `discard` (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- **Simpler code for equal val_bpb** → `keep` (removing complexity is a win)
- **Catastrophic VRAM increase** → consider `discard` even if val_bpb improved slightly

### Simplicity Criterion

All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.

### Constraints

- **Fixed 5-minute time budget.** All experiments are directly comparable — the wall clock is the equalizer.
- **Single file modification.** Only `train.py` changes; `prepare.py` is immutable. This ensures fair comparison (same data, same evaluation).
- **VRAM is a soft constraint.** Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
- **No new packages.** You can only use what's already in `pyproject.toml`.
- **Timeout:** If a run exceeds 10 minutes, kill it and treat it as a crash.

### Don't Thrash

If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read `train.py` for new angles. Try a fundamentally different approach.

### Handling User Messages

If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.

## Logging to autoresearch.jsonl

Each experiment appends one JSON line:

```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the shared logging script:

```bash
bash ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

Parse metrics from benchmark output:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/scripts/parse-metrics.sh
```

Valid statuses: `keep`, `discard`, `crash`, `checks_failed`

## ASI (Actionable Side Information)

ASI is structured annotation per experiment that **survives reverts**. When code changes are discarded, only the description and ASI remain — they are the only structured memory of what happened.

Record ASI for every experiment:

```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```

## Resuming After Context Reset

If `autoresearch.jsonl` and `autoresearch.md` exist in the working directory:

1. Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
2. Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations
3. Check git log to verify current branch state matches expected state
4. If git state is dirty (unclean shutdown), revert uncommitted changes
5. Resume the loop from where it left off — no re-setup needed
6. **Resume immediately** — do not ask "should I continue?"

## Confidence Scoring

After 3+ experiments, assess whether improvements are real or noise:

- Compute the **Median Absolute Deviation (MAD)** of all metric values as a noise floor
- **Confidence = |best improvement| / MAD**
- ≥2.0× → likely real improvement
- 1.0–2.0× → marginal, could be noise
- <1.0× → within noise floor

ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
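This score is easy to compute directly from the logged metric values. A minimal sketch with illustrative numbers (the bundled scripts may compute it differently):

```python
# Confidence = |best improvement| / MAD, per the rules above.
from statistics import median

def confidence(history: list[float]) -> float:
    """history[0] is the baseline run; lower metric values are better."""
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1e-9  # noise floor
    return abs(history[0] - min(history)) / mad

# Baseline 1.012 bpb, best 0.993 bpb: confidence ≈ 9.5 (likely real).
print(confidence([1.012, 1.010, 0.998, 1.011, 0.993]))
```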
## Template Architecture

### prepare.py (FIXED — never modify)

- **Data download:** Fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
- **Tokenizer training:** BPE tokenizer (8192 vocab) using rustbpe/tiktoken
- **Dataloader:** Best-fit document packing with 100% token utilization, BOS-aligned
- **Evaluation:** `evaluate_bpb()` computes bits-per-byte (vocab-size-independent metric)

Key constants: `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, `VOCAB_SIZE = 8192`

### train.py (MODIFIED BY AGENT — the only editable file)

- **Model:** GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
- **Optimizer:** Hybrid MuonAdamW (Muon for matrices, AdamW for everything else)
- **Training:** Gradient accumulation, LR schedules (warmup/flat/warmdown), fixed time budget

Editable: `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, `TOTAL_BATCH_SIZE`, learning rates, LR schedule phases, and the full model architecture.

## GPU Requirements

### Supported GPU Tiers

| Tier | GPUs | VRAM | Notes |
|------|------|------|-------|
| **Consumer** | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| **Consumer+** | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| **Enthusiast** | RTX 5090 | 32GB | Excellent — larger models possible |
| **Datacenter** | A100, H100 | 40-80GB | Original development target |

### Consumer GPU Adaptations

For GPUs with limited VRAM (< 16GB), apply these changes to `train.py` during the first experiment (a sketch of adaptations 1-2 follows the list):

1. **Remove Flash Attention 3 import and dependency** — the top-level `from kernels import get_kernel` block (lines 20-24) runs unconditionally at startup and will fail on non-Hopper GPUs. Replace the entire block and the `fa3.flash_attn_func()` call in `CausalSelfAttention.forward()` with `torch.nn.functional.scaled_dot_product_attention`. Also remove `kernels` from `pyproject.toml` and run `uv sync` again.
2. **Enable gradient checkpointing** — use `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False` to trade ~30% compute for ~50% VRAM savings
3. **Auto-scale model size** — reduce `DEPTH` and `DEVICE_BATCH_SIZE` to fit the VRAM budget (see table below)
4. **Cap evaluation steps** — scale eval batch count by available VRAM (30-100 steps)
5. **fp32 fallback** — use fp32 instead of bf16 for Pascal GPUs (compute capability < 7.5). Change the autocast dtype and disable bf16-specific optimizations.
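A minimal sketch of adaptations 1 and 2. The helper names (`sdpa_attention`, `run_blocks`) are illustrative placeholders, not the actual `train.py` symbols:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def sdpa_attention(q, k, v):
    # Drop-in replacement for fa3.flash_attn_func() on non-Hopper GPUs.
    # Note: SDPA expects (batch, n_heads, seq, head_dim); FA3 takes
    # (batch, seq, n_heads, head_dim), so a transpose may be needed.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def run_blocks(blocks, x):
    # Non-reentrant gradient checkpointing: recomputes activations in
    # the backward pass, trading ~30% compute for ~50% activation VRAM.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```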
### VRAM Auto-Scaling Guide

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|-------------|-------|--------|------------|------------|---------|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |

**Note:** `n_embd` must be a multiple of `HEAD_DIM` (default 128). Config search: start with the largest depth that fits, then reduce `DEVICE_BATCH_SIZE` and finally `MAX_SEQ_LEN` if you hit OOM. A sketch of this selection logic follows.
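The table translates directly into a lookup. A hedged sketch (the constants mirror the table; nothing here is part of the bundled scripts):

```python
import torch

# (vram_gb, DEPTH, n_embd, DEVICE_BATCH_SIZE, MAX_SEQ_LEN) per the table
CONFIGS = [
    (4, 2, 128, 4, 512), (8, 4, 256, 8, 1024), (12, 6, 384, 16, 1024),
    (16, 8, 512, 32, 2048), (24, 8, 512, 128, 2048),
    (32, 12, 768, 128, 2048), (80, 16, 1024, 128, 2048),
]

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
# Pick the largest budget that fits. A 24GB card reports ~23.6GB, so
# this selects the 16GB row, which errs on the conservative side.
budget, depth, n_embd, batch, seq_len = max(
    (c for c in CONFIGS if c[0] <= vram_gb), key=lambda c: c[0],
    default=CONFIGS[0],
)
print(f"DEPTH={depth}, n_embd={n_embd}, batch={batch}, seq_len={seq_len}")
```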
## Experiment Strategies

1. **Architecture:** Layer count, attention patterns, embedding dimensions, activation functions
2. **Optimizer:** Learning rates (per-parameter), schedule phases, momentum, weight decay
3. **Attention:** Window sizes, sliding window configs, full vs. local attention
4. **Batch size:** Trade-off between gradient quality and steps-per-budget
5. **Initialization:** Weight init schemes, residual scaling parameters
6. **Advanced:** Value embeddings, softcapped logits, GQA

## Metric: Bits Per Byte (BPB)

BPB measures how well the model compresses text, normalized by byte count. It is vocabulary-size-independent, so all architectures are directly comparable. Lower is better. See `references/gpu-training-guide.md` for the formula and interpretation table.

## Benchmark Script

Use this as `autoresearch.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

uv run train.py > run.log 2>&1

val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")

echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files

| File | Purpose |
|------|---------|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |

## Additional Resources

- **`references/gpu-training-guide.md`** — Detailed GPU setup, CUDA configuration, OOM troubleshooting, BPB formula, and performance tuning
- **`assets/prepare.py`** — Data preparation (download, tokenizer, dataloader, evaluation)
- **`assets/train.py`** — Model architecture and training loop
- **`assets/program.md`** — Self-contained agent instructions for the ML loop
- **`assets/pyproject.toml`** — Python dependencies (PyTorch, Flash Attention, etc.)

Full transparency — inspect the skill content before installing.