Edit code → commit → run benchmark → measure metric → keep improvement or revert → repeat forever. Works for any optimization target: LLM training loss, test suite runtime, bundle size, build time, Lighthouse scores, and more. Inspired by Karpathy's autoresearch, pi-autoresearch, and litesearch. This plugin provides two skills that work together: autoresearch is the core engine (works for any metric), and autoresearch-ml specializes it for single-GPU LLM pretraining.
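The core idea fits in a few lines. A minimal, illustrative sketch in Python (not the skill's actual implementation), assuming a benchmark script named `autoresearch.sh` that prints `METRIC val_bpb=<value>` lines as described further down:

```python
# Illustrative sketch only: keep a commit when the metric improves,
# revert it otherwise. Edit the target file before each iteration.
import re
import subprocess

best = float("inf")  # val_bpb: lower is better
while True:
    # ... edit train.py here, then commit the candidate change ...
    subprocess.run(["git", "commit", "-am", "experiment"], check=True)
    out = subprocess.run(["bash", "autoresearch.sh"],
                         capture_output=True, text=True).stdout
    match = re.search(r"METRIC val_bpb=([\d.]+)", out)
    if match and float(match.group(1)) < best:
        best = float(match.group(1))  # keep: the branch advances
    else:
        subprocess.run(["git", "revert", "--no-edit", "HEAD"], check=True)
```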
Add this skill:

```bash
npx mdskills install proyecto26/autoresearch-ml
```

Comprehensive autonomous ML training loop with excellent GPU support and detailed experiment tracking.
---
name: autoresearch-ml
description: >-
  Autonomous LLM training optimization with GPU support. Runs 5-minute training
  experiments, measures val_bpb, keeps improvements or reverts — repeat forever.
  Use this skill when the user asks to "train a model autonomously",
  "optimize LLM training", "run ML experiments", "autoresearch with GPU",
  "optimize val_bpb", "autonomous ML training", "LLM pretraining loop",
  "setup ML autoresearch", "GPU training experiments", "fine-tune a model",
  "pretrain from scratch", "speed up training", "lower my loss",
  "GPU optimization", "CUDA training", or mentions "train.py", "prepare.py",
  "bits per byte", "val_bpb", "NVIDIA GPU training", "RTX training",
  "H100 training", "autonomous model training", "consumer GPU training",
  "low VRAM training". Always use this skill when the user wants to autonomously
  optimize any ML training metric.
version: 0.2.0
---

# Autoresearch ML: Autonomous LLM Training Optimization

An autonomous experiment loop for single-GPU LLM pretraining. Edit `train.py` → commit → run 5-minute training → measure `val_bpb` → keep improvement or revert → **repeat forever**.

This skill is self-contained — it includes everything needed to set up and run the loop.

## Setup Phase

### 1. Copy Template Assets

Copy the bundled training template to the project directory:

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
```

### 2. Install and Prepare

```bash
uv sync            # Install dependencies
uv run prepare.py  # Download data shards, train tokenizer (~2 min)
```

### 3. Verify GPU

```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### 4. Initialize the Experiment Session

1. Create a branch: `git checkout -b autoresearch/<tag>-<date>`
2. Ensure session files are gitignored (critical — `git revert` will fail if tracked):
   ```bash
   echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
   git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
   ```
3. Read `prepare.py` and `train.py` thoroughly to understand the codebase
4. Write `autoresearch.md` — a living session document recording goal, metrics, files in scope, constraints, and learnings
5. Write `autoresearch.sh` — the benchmark script (see Benchmark Script section below)
6. Commit session files
7. Run baseline: `bash autoresearch.sh`
8. Parse metrics from output (lines matching `METRIC name=value`)
9. Record baseline in `autoresearch.jsonl` (see the example lines after this list):
   - First write a config header: `{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}`
   - Then record the baseline result
10. Begin the experiment loop
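For instance, the first two lines of `autoresearch.jsonl` could look like this. The field values here are illustrative; the record schema is the one defined in the Logging section below:

```json
{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}
{"run":0,"commit":"abc1234","metric":1.012,"metrics":{"peak_memory_mb":41000,"mfu_percent":38.2},"status":"keep","description":"baseline","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"unmodified template as reference point"}}
```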
## The Experiment Loop

**LOOP FOREVER. Never ask "should I continue?" — just keep going.**

The user might be asleep, away from the computer, or expects you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder — re-read `train.py` for new angles, try combining previous near-misses, try more radical architectural changes.

Each iteration:

```
1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from output
7. If output is empty (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat
```

### Decision Rules

- **val_bpb improved (lower)** → `keep` (commit stays, branch advances)
- **val_bpb equal or worse** → `discard` (run `git revert $(git rev-parse HEAD) --no-edit`)
- **Crash (OOM, CUDA error, NaN loss)** → `discard` (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- **Simpler code for equal val_bpb** → `keep` (removing complexity is a win)
- **Catastrophic VRAM increase** → consider `discard` even if val_bpb improved slightly

### Simplicity Criterion

All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.

### Constraints

- **Fixed 5-minute time budget.** All experiments are directly comparable — the wall clock is the equalizer.
- **Single file modification.** Only `train.py` changes; `prepare.py` is immutable. This ensures fair comparison (same data, same evaluation).
- **VRAM is a soft constraint.** Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
- **No new packages.** You can only use what's already in `pyproject.toml`.
- **Timeout:** If a run exceeds 10 minutes, kill it and treat it as a crash.

### Don't Thrash

If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read `train.py` for new angles. Try a fundamentally different approach.

### Handling User Messages

If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.

## Logging to autoresearch.jsonl

Each experiment appends one JSON line:

```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the shared logging script:

```bash
bash ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

Parse metrics from benchmark output:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/scripts/parse-metrics.sh
```

Valid statuses: `keep`, `discard`, `crash`, `checks_failed`

## ASI (Actionable Side Information)

ASI is structured annotation per experiment that **survives reverts**. When code changes are discarded, only the description and ASI remain — they are the only structured memory of what happened.

Record ASI for every experiment:

```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```

## Resuming After Context Reset

If `autoresearch.jsonl` and `autoresearch.md` exist in the working directory:

1. Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
2. Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations
3. Check git log to verify current branch state matches expected state
4. If git state is dirty (unclean shutdown), revert uncommitted changes
5. Resume the loop from where it left off — no re-setup needed
6. **Resume immediately** — do not ask "should I continue?"

## Confidence Scoring

After 3+ experiments, assess whether improvements are real or noise:

- Compute the **Median Absolute Deviation (MAD)** of all metric values as a noise floor
- **Confidence = |best improvement| / MAD**
- ≥2.0× → likely real improvement
- 1.0–2.0× → marginal, could be noise
- <1.0× → within noise floor

ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
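This score is easy to compute directly from the logged metric values. A minimal sketch with illustrative numbers (the bundled scripts may compute it differently):

```python
# Confidence = |best improvement| / MAD, per the rules above.
from statistics import median

def confidence(history: list[float]) -> float:
    """history[0] is the baseline run; lower metric values are better."""
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1e-9  # noise floor
    return abs(history[0] - min(history)) / mad

# Baseline 1.012 bpb, best 0.993 bpb: confidence ≈ 9.5 (likely real).
print(confidence([1.012, 1.010, 0.998, 1.011, 0.993]))
```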
## Template Architecture

### prepare.py (FIXED — never modify)

- **Data download:** Fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
- **Tokenizer training:** BPE tokenizer (8192 vocab) using rustbpe/tiktoken
- **Dataloader:** Best-fit document packing with 100% token utilization, BOS-aligned
- **Evaluation:** `evaluate_bpb()` computes bits-per-byte (vocab-size-independent metric)

Key constants: `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, `VOCAB_SIZE = 8192`

### train.py (MODIFIED BY AGENT — the only editable file)

- **Model:** GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
- **Optimizer:** Hybrid MuonAdamW (Muon for matrices, AdamW for everything else)
- **Training:** Gradient accumulation, LR schedules (warmup/flat/warmdown), fixed time budget

Editable: `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, `TOTAL_BATCH_SIZE`, learning rates, LR schedule phases, and the full model architecture.

## GPU Requirements

### Supported GPU Tiers

| Tier | GPUs | VRAM | Notes |
|------|------|------|-------|
| **Consumer** | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| **Consumer+** | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| **Enthusiast** | RTX 5090 | 32GB | Excellent — larger models possible |
| **Datacenter** | A100, H100 | 40-80GB | Original development target |

### Consumer GPU Adaptations

For GPUs with limited VRAM (< 16GB), apply these changes to `train.py` during the first experiment (a sketch of adaptations 1-2 follows the list):

1. **Remove Flash Attention 3 import and dependency** — the top-level `from kernels import get_kernel` block (lines 20-24) runs unconditionally at startup and will fail on non-Hopper GPUs. Replace the entire block and the `fa3.flash_attn_func()` call in `CausalSelfAttention.forward()` with `torch.nn.functional.scaled_dot_product_attention`. Also remove `kernels` from `pyproject.toml` and run `uv sync` again.
2. **Enable gradient checkpointing** — use `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False` to trade ~30% compute for ~50% VRAM savings
3. **Auto-scale model size** — reduce `DEPTH` and `DEVICE_BATCH_SIZE` to fit the VRAM budget (see table below)
4. **Cap evaluation steps** — scale eval batch count by available VRAM (30-100 steps)
5. **fp32 fallback** — use fp32 instead of bf16 for Pascal GPUs (compute capability < 7.5). Change the autocast dtype and disable bf16-specific optimizations.
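A minimal sketch of adaptations 1 and 2. The helper names (`sdpa_attention`, `run_blocks`) are illustrative placeholders, not the actual `train.py` symbols:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def sdpa_attention(q, k, v):
    # Drop-in replacement for fa3.flash_attn_func() on non-Hopper GPUs.
    # Note: SDPA expects (batch, n_heads, seq, head_dim); FA3 takes
    # (batch, seq, n_heads, head_dim), so a transpose may be needed.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def run_blocks(blocks, x):
    # Non-reentrant gradient checkpointing: recomputes activations in
    # the backward pass, trading ~30% compute for ~50% activation VRAM.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```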
### VRAM Auto-Scaling Guide

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|-------------|-------|--------|------------|------------|---------|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |

**Note:** `n_embd` must be a multiple of `HEAD_DIM` (default 128). Config search: start with the largest depth that fits, then reduce `DEVICE_BATCH_SIZE` and finally `MAX_SEQ_LEN` if you hit OOM. A sketch of this selection logic follows.
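The table translates directly into a lookup. A hedged sketch (the constants mirror the table; nothing here is part of the bundled scripts):

```python
import torch

# (vram_gb, DEPTH, n_embd, DEVICE_BATCH_SIZE, MAX_SEQ_LEN) per the table
CONFIGS = [
    (4, 2, 128, 4, 512), (8, 4, 256, 8, 1024), (12, 6, 384, 16, 1024),
    (16, 8, 512, 32, 2048), (24, 8, 512, 128, 2048),
    (32, 12, 768, 128, 2048), (80, 16, 1024, 128, 2048),
]

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
# Pick the largest budget that fits. A 24GB card reports ~23.6GB, so
# this selects the 16GB row, which errs on the conservative side.
budget, depth, n_embd, batch, seq_len = max(
    (c for c in CONFIGS if c[0] <= vram_gb), key=lambda c: c[0],
    default=CONFIGS[0],
)
print(f"DEPTH={depth}, n_embd={n_embd}, batch={batch}, seq_len={seq_len}")
```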
## Experiment Strategies

1. **Architecture:** Layer count, attention patterns, embedding dimensions, activation functions
2. **Optimizer:** Learning rates (per-parameter), schedule phases, momentum, weight decay
3. **Attention:** Window sizes, sliding window configs, full vs. local attention
4. **Batch size:** Trade-off between gradient quality and steps-per-budget
5. **Initialization:** Weight init schemes, residual scaling parameters
6. **Advanced:** Value embeddings, softcapped logits, GQA

## Metric: Bits Per Byte (BPB)

BPB measures how well the model compresses text, normalized by byte count. It is vocabulary-size-independent, so all architectures are directly comparable. Lower is better. See `references/gpu-training-guide.md` for the formula and interpretation table.

## Benchmark Script

Use this as `autoresearch.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

uv run train.py > run.log 2>&1

val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")

echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files

| File | Purpose |
|------|---------|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |

## Additional Resources

- **`references/gpu-training-guide.md`** — Detailed GPU setup, CUDA configuration, OOM troubleshooting, BPB formula, and performance tuning
- **`assets/prepare.py`** — Data preparation (download, tokenizer, dataloader, evaluation)
- **`assets/train.py`** — Model architecture and training loop
- **`assets/program.md`** — Self-contained agent instructions for the ML loop
- **`assets/pyproject.toml`** — Python dependencies (PyTorch, Flash Attention, etc.)

Full transparency — inspect the skill content before installing.