---
name: hugging-face-evaluation
description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
---

# Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

## Integration with HF Ecosystem
- **Model Cards**: Updates model-index metadata for leaderboard integration
- **Artificial Analysis**: Direct API integration for benchmark imports
- **Papers with Code**: Compatible with their model-index specification
- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration
- **vLLM**: Efficient GPU inference for custom model evaluation
- **lighteval**: Hugging Face's evaluation library with vLLM/accelerate backends
- **inspect-ai**: UK AI Safety Institute's evaluation framework

# Version
1.3.0

# Dependencies

## Core Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- re (built-in)

## Inference Provider Evaluation
- inspect-ai>=0.3.0
- inspect-evals
- openai

## vLLM Custom Model Evaluation (GPU required)
- lighteval[accelerate,vllm]>=0.6.0
- vllm>=0.4.0
- torch>=2.0.0
- transformers>=4.40.0
- accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.

# IMPORTANT: Using This Skill

## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:**

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

**If open PRs exist:**
1. **DO NOT create a new PR** - this creates duplicate work for maintainers
2. **Warn the user** that open PRs already exist
3. **Show the user** the existing PR URLs so they can review them
4. Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.

---

> **All paths are relative to the directory containing this SKILL.md file.**
> Before running any script, first `cd` to that directory or use the full path.

**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`:
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```
Key workflow (matches CLI help):

1) `get-prs` → check for existing open PRs first
2) `inspect-tables` → find table numbers/columns
3) `extract-readme --table N` → prints YAML by default
4) add `--apply` (push) or `--create-pr` to write changes

# Core Capabilities

## 1. Inspect and Extract Evaluation Tables from README
- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output).
Use `--model-name-override` only with exact column header text.
- **YAML Generation**: Convert selected table to model-index YAML format
- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)

## 2. Import from Artificial Analysis
- **API Integration**: Fetch benchmark scores directly from Artificial Analysis
- **Automatic Formatting**: Convert API responses to model-index format
- **Metadata Preservation**: Maintain source attribution and URLs
- **PR Creation**: Automatically create pull requests with evaluation updates

## 3. Model-Index Management
- **YAML Generation**: Create properly formatted model-index entries
- **Merge Support**: Add evaluations to existing model cards without overwriting
- **Validation**: Ensure compliance with the Papers with Code specification
- **Batch Operations**: Process multiple models efficiently

## 4. Run Evaluations on HF Jobs (Inference Providers)
- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library
- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
- **Zero-Config**: No Dockerfiles or Space management required
- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job
- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI

## 5. Run Custom Model Evaluations with vLLM (NEW)

⚠️ **Important:** This approach requires a device with `uv` installed and sufficient GPU memory.
**Benefits:** No need for the `hf_jobs()` MCP tool; scripts can be run directly in the terminal.
**When to use:** The user is working directly on a local machine with an available GPU.

### Before running the script

- Check the script path
- Check that `uv` is installed
- Check that a GPU is available with `nvidia-smi`

### Running the script

```bash
uv run scripts/lighteval_vllm_uv.py --model <model-id> --tasks "leaderboard|mmlu|5"
```

### Features

- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)
- **lighteval Framework**: Hugging Face's evaluation library with Open LLM Leaderboard tasks
- **inspect-ai Framework**: UK AI Safety Institute's evaluation library
- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure

# Usage Instructions

The skill includes Python scripts in `scripts/` to perform operations.

### Prerequisites
- Preferred: use `uv run` (PEP 723 header auto-installs deps)
- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- Set the `HF_TOKEN` environment variable with a write-access token
- For Artificial Analysis: set the `AA_API_KEY` environment variable
- `.env` is loaded automatically if `python-dotenv` is installed

### Method 1: Extract from README (CLI workflow)

Recommended flow (matches `--help`):
```bash
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply  # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr  # open a PR
```

Validation checklist:
- YAML is printed by default; compare it against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

### Method 2: Import from Artificial Analysis

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.

**Basic Usage:**
```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**With Environment File:**
```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**Create Pull Request:**
```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

### Method 3: Run Evaluation Job

Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.

**Direct CLI Usage:**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu"
```

**GPU Example (A10G):**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "gsm8k"
```

**Python Helper (optional):**
```bash
uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

### Method 4: Run Custom Model Evaluation with vLLM

Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.

#### When to Use vLLM Evaluation (vs Inference Providers)

| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|--------------|----------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |

#### Option A: lighteval with vLLM Backend

lighteval is Hugging Face's evaluation library, supporting Open LLM Leaderboard tasks.

**Standalone (local GPU):**
```bash
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate

# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"
```

**lighteval Task Format:**
Tasks use the format `suite|task|num_fewshot`:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

**Finding Available Tasks:**
The complete list of available lighteval tasks can be found at:
https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt

This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter.
For example:
- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`

#### Option B: inspect-ai with vLLM Backend

inspect-ai is the UK AI Safety Institute's evaluation framework.

**Standalone (local GPU):**
```bash
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf

# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --task mmlu
```

**Available inspect-ai Tasks:**
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation

#### Option C: Python Helper Script

The helper script auto-selects hardware and simplifies job submission:

```bash
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4

# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```

**Hardware Recommendations:**

| Model Size | Recommended Hardware |
|------------|----------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |

### Commands Reference

**Top-level help and version:**
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

**Inspect Tables (start here):**
```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

**Extract from README:**
```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

**Import from Artificial Analysis:**
```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

**View / Validate:**
```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

**Check Open PRs (ALWAYS run before --create-pr):**
```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

**Run Evaluation Job (Inference Providers):**
```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "task-name"
```

or use the Python helper:

```bash
uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

**Run vLLM Evaluation (Custom Models):**
```bash
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "mmlu"

# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

### Model-Index Format

The generated model-index follows this structure:

```yaml
model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
          - name: HumanEval
            type: humaneval
            value: 72.5
        source:
          name: Source Name
          url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the `source.url` field.

### Error Handling
- **Table Not Found**: The script reports when no evaluation tables are detected
- **Invalid Format**: Clear error messages for malformed tables
- **API Errors**: Retry logic for transient Artificial Analysis API failures
- **Token Issues**: Validation before attempting updates
- **Merge Conflicts**: Preserves existing model-index entries when adding new ones
- **Space Creation**: Handles naming conflicts and hardware request failures gracefully

### Best Practices

1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates
2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
5. **Verify extracted values**: Compare YAML output against the README table manually
6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
8. **Create PRs for others**: Use `--create-pr` when updating models you don't own
9. **One model per repo**: Only add the main model's results to model-index
10. **No markdown in YAML names**: The model name field in YAML should be plain text

### Model Name Matching

When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**:

- Removes markdown formatting (bold `**`, links `[]()`)
- Normalizes names (lowercase, replace `-` and `_` with spaces)
- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
- Only extracts if tokens match exactly (handles different word orders and separators)
- Fails if no exact match is found (rather than guessing from similar names)

**For column-based tables** (benchmarks as rows, models as columns):
- Finds the column header matching the model name
- Extracts scores from that column only

**For transposed tables** (models as rows, benchmarks as columns):
- Finds the row in the first column matching the model name
- Extracts all benchmark scores from that row only

This ensures only the correct model's scores are extracted, never unrelated models or training
checkpoints.

### Common Patterns

**Update Your Own Model:**
```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply
```

**Update Someone Else's Model (Full Workflow):**
```bash
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"

# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr

# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless the user explicitly confirms
```

**Import Fresh Benchmarks:**
```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "anthropic/claude-sonnet-4"

# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "anthropic/claude-sonnet-4" \
  --create-pr
```

### Troubleshooting

**Issue**: "No evaluation tables found in README"
- **Solution**: Check if the README contains markdown tables with numeric scores

**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: The script will display the available models.
Use `--model-name-override` with the exact name from the list
- **Example**: `--model-name-override "**Olmo 3-32B**"`

**Issue**: "AA_API_KEY not set"
- **Solution**: Set the environment variable or add it to a `.env` file

**Issue**: "Token does not have write access"
- **Solution**: Ensure `HF_TOKEN` has write permissions for the repository

**Issue**: "Model not found in Artificial Analysis"
- **Solution**: Verify that creator-slug and model-name match the API values

**Issue**: "Payment required for hardware"
- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware

**Issue**: "vLLM out of memory" or CUDA OOM
- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

**Issue**: "Model architecture not supported by vLLM"
- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers

**Issue**: "Trust remote code required"
- **Solution**: Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

**Issue**: "Chat template not found"
- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template

### Integration Examples

**Python Script Integration:**
```python
import subprocess

def update_model_evaluations(repo_id):
    """Update a model card with evaluations extracted from its README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```
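**Model Name Matching (illustrative sketch):**

The exact normalized token matching described under "Model Name Matching" can be sketched in a few lines of Python. This is a minimal illustration of the documented behavior only — `normalize_tokens` and `is_match` are hypothetical names, not functions exposed by the skill's scripts:

```python
import re

def normalize_tokens(name: str) -> frozenset[str]:
    """Strip markdown, lowercase, and split a model name into a token set."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # [text](url) -> text
    name = name.replace("**", "").replace("`", "")        # drop bold / code marks
    name = re.sub(r"[-_]", " ", name.lower())             # separators -> spaces
    return frozenset(name.split())

def is_match(model_name: str, candidate: str) -> bool:
    """Exact token-set match: word order and separators don't matter."""
    return normalize_tokens(model_name) == normalize_tokens(candidate)
```

Under this scheme `"OLMo-3-32B"` matches both `"**Olmo 3 32B**"` and a markdown link like `"[Olmo-3-32B](https://example.com)"`, but not `"Olmo 3 7B"` — similar names fail rather than being guessed.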