---
name: hugging-face-evaluation
description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
---

# Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

## Integration with HF Ecosystem
- **Model Cards**: Updates model-index metadata for leaderboard integration
- **Artificial Analysis**: Direct API integration for benchmark imports
- **Papers with Code**: Compatible with their model-index specification
- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration
- **vLLM**: Efficient GPU inference for custom model evaluation
- **lighteval**: Hugging Face's evaluation library with vLLM/accelerate backends
- **inspect-ai**: UK AI Safety Institute's evaluation framework

# Version
1.3.0

# Dependencies

## Core Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- re (built-in)

## Inference Provider Evaluation
- inspect-ai>=0.3.0
- inspect-evals
- openai

## vLLM Custom Model Evaluation (GPU required)
- lighteval[accelerate,vllm]>=0.6.0
- vllm>=0.4.0
- torch>=2.0.0
- transformers>=4.40.0
- accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.

# IMPORTANT: Using This Skill

## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:**

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

**If open PRs exist:**
1. **DO NOT create a new PR** - this creates duplicate work for maintainers
2. **Warn the user** that open PRs already exist
3. **Show the user** the existing PR URLs so they can review them
4. Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.

---

> **All paths are relative to the directory containing this SKILL.md file.**
> Before running any script, first `cd` to that directory or use the full path.

**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`:
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```
Key workflow (matches CLI help):

1) `get-prs` → check for existing open PRs first
2) `inspect-tables` → find table numbers/columns
3) `extract-readme --table N` → prints YAML by default
4) add `--apply` (push) or `--create-pr` to write changes

# Core Capabilities

## 1. Inspect and Extract Evaluation Tables from README
- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output).
Use `--model-name-override` only with exact column header text.
- **YAML Generation**: Convert selected table to model-index YAML format
- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)

## 2. Import from Artificial Analysis
- **API Integration**: Fetch benchmark scores directly from Artificial Analysis
- **Automatic Formatting**: Convert API responses to model-index format
- **Metadata Preservation**: Maintain source attribution and URLs
- **PR Creation**: Automatically create pull requests with evaluation updates

## 3. Model-Index Management
- **YAML Generation**: Create properly formatted model-index entries
- **Merge Support**: Add evaluations to existing model cards without overwriting
- **Validation**: Ensure compliance with the Papers with Code specification
- **Batch Operations**: Process multiple models efficiently

## 4. Run Evaluations on HF Jobs (Inference Providers)
- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library
- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
- **Zero-Config**: No Dockerfiles or Space management required
- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job
- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI

## 5. Run Custom Model Evaluations with vLLM (NEW)

⚠️ **Important:** This approach requires a device with `uv` installed and sufficient GPU memory.
**Benefits:** No need for the `hf_jobs()` MCP tool; scripts can be run directly in the terminal.
**When to use:** The user is working directly on a local machine with an available GPU.

### Before running the script

- Check the script path
- Check that `uv` is installed
- Check that a GPU is available with `nvidia-smi`

### Running the script

```bash
uv run scripts/lighteval_vllm_uv.py --model <model-id> --tasks "leaderboard|mmlu|5"
```

### Features

- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)
- **lighteval Framework**: Hugging Face's evaluation library with Open LLM Leaderboard tasks
- **inspect-ai Framework**: UK AI Safety Institute's evaluation library
- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure

# Usage Instructions

The skill includes Python scripts in `scripts/` to perform operations.

### Prerequisites
- Preferred: use `uv run` (PEP 723 header auto-installs deps)
- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- Set the `HF_TOKEN` environment variable with a write-access token
- For Artificial Analysis: set the `AA_API_KEY` environment variable
- `.env` is loaded automatically if `python-dotenv` is installed

### Method 1: Extract from README (CLI workflow)

Recommended flow (matches `--help`):
```bash
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply  # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr  # open a PR
```

Validation checklist:
- YAML is printed by default; compare it against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

### Method 2: Import from Artificial Analysis

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.

**Basic Usage:**
```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**With Environment File:**
```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**Create Pull Request:**
```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

### Method 3: Run Evaluation Job

Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.

**Direct CLI Usage:**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu"
```

**GPU Example (A10G):**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "gsm8k"
```

**Python Helper (optional):**
```bash
uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

### Method 4: Run Custom Model Evaluation with vLLM

Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.

#### When to Use vLLM Evaluation (vs Inference Providers)

| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|--------------|----------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |

#### Option A: lighteval with vLLM Backend

lighteval is Hugging Face's evaluation library, supporting Open LLM Leaderboard tasks.

**Standalone (local GPU):**
```bash
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate

# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"
```

**lighteval Task Format:**
Tasks use the format `suite|task|num_fewshot`:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

**Finding Available Tasks:**
The complete list of available lighteval tasks can be found at:
https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt

This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter.
For example:
- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`

#### Option B: inspect-ai with vLLM Backend

inspect-ai is the UK AI Safety Institute's evaluation framework.

**Standalone (local GPU):**
```bash
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf

# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --task mmlu
```

**Available inspect-ai Tasks:**
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation

#### Option C: Python Helper Script

The helper script auto-selects hardware and simplifies job submission:

```bash
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4

# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```

**Hardware Recommendations:**

| Model Size | Recommended Hardware |
|------------|----------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |

### Commands Reference

**Top-level help and version:**
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

**Inspect Tables (start here):**
```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

**Extract from README:**
```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

**Import from Artificial Analysis:**
```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

**View / Validate:**
```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

**Check Open PRs (ALWAYS run before --create-pr):**
```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

**Run Evaluation Job (Inference Providers):**
```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "task-name"
```

or use the Python helper:

```bash
uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

**Run vLLM Evaluation (Custom Models):**
```bash
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "mmlu"

# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

### Model-Index Format

The generated model-index follows this structure:

```yaml
model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
          - name: HumanEval
            type: humaneval
            value: 72.5
        source:
          name: Source Name
          url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the `source.url` field.

### Error Handling
- **Table Not Found**: The script reports when no evaluation tables are detected
- **Invalid Format**: Clear error messages for malformed tables
- **API Errors**: Retry logic for transient Artificial Analysis API failures
- **Token Issues**: Validation before attempting updates
- **Merge Conflicts**: Preserves existing model-index entries when adding new ones
- **Space Creation**: Handles naming conflicts and hardware request failures gracefully

### Best Practices

1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates
2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
5. **Verify extracted values**: Compare YAML output against the README table manually
6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
8. **Create PRs for others**: Use `--create-pr` when updating models you don't own
9. **One model per repo**: Only add the main model's results to model-index
10. **No markdown in YAML names**: The model name field in YAML should be plain text

### Model Name Matching

When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**:

- Removes markdown formatting (bold `**`, links `[]()`)
- Normalizes names (lowercase, replace `-` and `_` with spaces)
- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
- Only extracts if tokens match exactly (handles different word orders and separators)
- Fails if no exact match is found (rather than guessing from similar names)

**For column-based tables** (benchmarks as rows, models as columns):
- Finds the column header matching the model name
- Extracts scores from that column only

**For transposed tables** (models as rows, benchmarks as columns):
- Finds the row in the first column matching the model name
- Extracts all benchmark scores from that row only

This ensures only the correct model's scores are extracted, never unrelated models or training
checkpoints.

### Common Patterns

**Update Your Own Model:**
```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply
```

**Update Someone Else's Model (Full Workflow):**
```bash
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"

# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr

# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless the user explicitly confirms
```

**Import Fresh Benchmarks:**
```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "anthropic/claude-sonnet-4"

# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "anthropic/claude-sonnet-4" \
  --create-pr
```

### Troubleshooting

**Issue**: "No evaluation tables found in README"
- **Solution**: Check if the README contains markdown tables with numeric scores

**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: The script will display the available models.
Use `--model-name-override` with the exact name from the list
- **Example**: `--model-name-override "**Olmo 3-32B**"`

**Issue**: "AA_API_KEY not set"
- **Solution**: Set the environment variable or add it to a `.env` file

**Issue**: "Token does not have write access"
- **Solution**: Ensure `HF_TOKEN` has write permissions for the repository

**Issue**: "Model not found in Artificial Analysis"
- **Solution**: Verify that creator-slug and model-name match the API values

**Issue**: "Payment required for hardware"
- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware

**Issue**: "vLLM out of memory" or CUDA OOM
- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

**Issue**: "Model architecture not supported by vLLM"
- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers

**Issue**: "Trust remote code required"
- **Solution**: Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

**Issue**: "Chat template not found"
- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template

### Integration Examples

**Python Script Integration:**
```python
import subprocess

def update_model_evaluations(repo_id):
    """Update a model card with evaluations extracted from its README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```
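**Model Name Matching (illustrative sketch):**

The exact normalized token matching described under "Model Name Matching" can be sketched in a few lines of Python. This is a minimal illustration of the documented behavior only — `normalize_tokens` and `is_match` are hypothetical names, not functions exposed by the skill's scripts:

```python
import re

def normalize_tokens(name: str) -> frozenset[str]:
    """Strip markdown, lowercase, and split a model name into a token set."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # [text](url) -> text
    name = name.replace("**", "").replace("`", "")        # drop bold / code marks
    name = re.sub(r"[-_]", " ", name.lower())             # separators -> spaces
    return frozenset(name.split())

def is_match(model_name: str, candidate: str) -> bool:
    """Exact token-set match: word order and separators don't matter."""
    return normalize_tokens(model_name) == normalize_tokens(candidate)
```

Under this scheme `"OLMo-3-32B"` matches both `"**Olmo 3 32B**"` and a markdown link like `"[Olmo-3-32B](https://example.com)"`, but not `"Olmo 3 7B"` — similar names fail rather than being guessed.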