Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
Add this skill:

```bash
npx mdskills install huggingface/hugging-face-datasets
```

Comprehensive dataset management with SQL querying, multi-format support, and robust templates.
---
name: hugging-face-datasets
description: Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
---

# Overview
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

## Integration with HF MCP Server
- **Use HF MCP Server for**: Dataset discovery, search, and metadata retrieval
- **Use This Skill for**: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting

# Version
2.1.0

# Dependencies
This skill uses PEP 723 scripts with inline dependency management. Scripts auto-install their requirements when run with `uv run scripts/script_name.py`.

- uv (Python package manager)
- Getting Started: See "Usage Instructions" below for PEP 723 usage

# Core Capabilities

## 1. Dataset Lifecycle Management
- **Initialize**: Create new dataset repositories with proper structure
- **Configure**: Store detailed configuration including system prompts and metadata
- **Stream Updates**: Add rows efficiently without downloading entire datasets

## 2. SQL-Based Dataset Querying (NEW)
Query any Hugging Face dataset using DuckDB SQL via `scripts/sql_manager.py`:
- **Direct Queries**: Run SQL on datasets using the `hf://` protocol
- **Schema Discovery**: Describe dataset structure and column types
- **Data Sampling**: Get random samples for exploration
- **Aggregations**: Count, histogram, and unique-value analysis
- **Transformations**: Filter, join, and reshape data with SQL
- **Export & Push**: Save results locally or push to new Hub repos

## 3. Multi-Format Dataset Support
Supports diverse dataset types through a template system:
- **Chat/Conversational**: Chat templating, multi-turn dialogues, tool usage examples
- **Text Classification**: Sentiment analysis, intent detection, topic classification
- **Question-Answering**: Reading comprehension, factual QA, knowledge bases
- **Text Completion**: Language modeling, code completion, creative writing
- **Tabular Data**: Structured data for regression/classification tasks
- **Custom Formats**: Flexible schema definition for specialized needs

## 4. Quality Assurance Features
- **JSON Validation**: Ensures data integrity during uploads
- **Batch Processing**: Efficient handling of large datasets
- **Error Recovery**: Graceful handling of upload failures and conflicts

# Usage Instructions

The skill includes two Python scripts that use PEP 723 inline dependency management (see the sketch after the prerequisites below):

> **All paths are relative to the directory containing this SKILL.md file.**
> Scripts are run with: `uv run scripts/script_name.py [arguments]`

- `scripts/dataset_manager.py` - Dataset creation and management
- `scripts/sql_manager.py` - SQL-based dataset querying and transformation

### Prerequisites
- `uv` package manager installed
- `HF_TOKEN` environment variable set to a token with write access
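For reference, a PEP 723 inline-metadata header looks like the following. This is a minimal sketch; the dependency list is illustrative, not the one the skill's scripts actually declare:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "duckdb",           # illustrative; the real scripts declare their own
#     "huggingface-hub",
# ]
# ///

# uv reads the block above and installs the listed packages into an
# ephemeral environment before executing the script body.
print("dependencies resolved by uv")
```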
---

# SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The `hf://` protocol provides direct access to any public dataset (or to private datasets with a token).

## Quick Start

```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```

## SQL Query Syntax

Use `data` as the table name in your SQL; it gets replaced with the actual `hf://` path:

```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) AS cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
-- (DuckDB lists are 1-indexed; MMLU's answer column is 0-based)
SELECT question, choices[answer + 1] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```
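Behind this placeholder, the script only has to point DuckDB at the dataset's auto-converted Parquet files. A minimal sketch of the idea, assuming a simple string substitution; the path, config, and substitution logic here are illustrative, not `sql_manager.py`'s exact implementation:

```python
import duckdb

# Illustrative only: how a "data" placeholder can resolve to an hf:// glob.
hf_path = "hf://datasets/cais/mmlu@~parquet/all/test/*.parquet"  # assumed config/split
user_sql = "SELECT subject, COUNT(*) AS cnt FROM data GROUP BY subject"

# Quote the Parquet glob and swap it in for the placeholder table name.
resolved = user_sql.replace("FROM data", f"FROM '{hf_path}'")
duckdb.sql(resolved).show()
```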
## Common Operations

### 1. Explore Dataset Structure
```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in a column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```

### 2. Filter and Transform
```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using the transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```

### 3. Create Subsets and Push to Hub
```bash
# Query and push to a new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```

### 4. Export to Local Files
```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```

### 5. Working with Dataset Configs/Splits
```bash
# Specify a config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify a split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```

### 6. Raw SQL with Full Paths
For complex queries or joining datasets:
```bash
uv run scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"
```

## Python API Usage

```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")

sql.close()
```

## HF Path Format

DuckDB uses the `hf://` protocol to access datasets:
```
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
```

Examples:
- `hf://datasets/cais/mmlu@~parquet/default/train/*.parquet`
- `hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet`

The `@~parquet` revision provides auto-converted Parquet files for any dataset format.
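Since these are ordinary Parquet globs, you can also query them with DuckDB directly, outside this skill. A minimal sketch, assuming a recent DuckDB release (where `hf://` paths are handled by the autoloaded httpfs extension):

```python
import duckdb

# Read the auto-converted Parquet files for a dataset split directly.
duckdb.sql("""
    SELECT subject, COUNT(*) AS cnt
    FROM 'hf://datasets/cais/mmlu@~parquet/default/train/*.parquet'
    GROUP BY subject
    ORDER BY cnt DESC
""").show()

# Private datasets additionally need a token configured
# (e.g. DuckDB's HUGGINGFACE secret type).
```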
## Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                   -- String length
regexp_replace(col, '\n', '')    -- Regex replace
regexp_matches(col, 'pattern')   -- Regex match
LOWER(col), UPPER(col)           -- Case conversion

-- List functions (DuckDB lists are 1-indexed)
choices[1]                       -- First element
array_length(choices)            -- List length
unnest(choices)                  -- Expand list to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                  -- Random sample of 10 rows
USING SAMPLE 10 (RESERVOIR, 42)  -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```

---

# Dataset Creation (dataset_manager.py)

### Recommended Workflow

**1. Discovery (Use HF MCP Server):**
```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

**2. Creation (Use This Skill):**
```bash
# Initialize new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

**3. Content Management (Use This Skill):**
```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```

### Template-Based Data Structures

**1. Chat Template (`--template chat`)**
```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```

**2. Classification Template (`--template classification`)**
```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```

**3. QA Template (`--template qa`)**
```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```

**4. Completion Template (`--template completion`)**
```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```

**5. Tabular Template (`--template tabular`)**
```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```
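Rows that follow these templates can also be built programmatically and handed to `add_rows`. A minimal sketch; the repo id and example rows are placeholders:

```python
import json
import subprocess

# Build QA-template rows in code instead of hand-editing JSON files.
rows = [
    {"question": "What is AI?", "answer": "Artificial Intelligence...",
     "answer_type": "factual", "difficulty": "easy"},
    {"question": "Why use Parquet?", "answer": "Columnar storage...",
     "answer_type": "explanatory", "difficulty": "medium"},
]

# Pass the serialized rows to the add_rows command shown above.
subprocess.run(
    ["uv", "run", "scripts/dataset_manager.py", "add_rows",
     "--repo_id", "your-username/dataset-name",
     "--template", "qa",
     "--rows_json", json.dumps(rows)],
    check=True,
)
```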
### Advanced System Prompt Template

For high-quality training data generation:
```text
You are an AI assistant expert at using MCP tools effectively.

## MCP SERVER DEFINITIONS
[Define available servers and tools]

## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]

## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
```

### Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage.

**Available Example Sets:**
- `training_examples.json` - MCP tool usage examples (debugging, project setup, database analysis)
- `diverse_training_examples.json` - Broader scenarios including:
  - **Educational Chat** - Explaining programming concepts, tutorials
  - **Git Workflows** - Feature branches, version control guidance
  - **Code Analysis** - Performance optimization, architecture review
  - **Content Generation** - Professional writing, creative brainstorming
  - **Codebase Navigation** - Legacy code exploration, systematic analysis
  - **Conversational Support** - Problem-solving, technical discussions

**Using Different Example Sets:**
```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```

### Commands Reference

**List Available Templates:**
```bash
uv run scripts/dataset_manager.py list_templates
```

**Quick Setup (Recommended):**
```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

**Manual Setup:**
```bash
# Initialize repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

**View Dataset Statistics:**
```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```

### Error Handling
- **Repository exists**: The script notifies you and continues with configuration
- **Invalid JSON**: Clear error message with parsing details
- **Network issues**: Automatic retry for transient failures (see the sketch below)
- **Token permissions**: Validated before operations begin
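The scripts' actual retry logic is not shown here; as a rough illustration of the backoff pattern (function and parameter names are hypothetical):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Retry a flaky operation with exponential backoff (illustrative only)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == attempts:
                raise  # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Transient failure ({exc!r}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Usage with a hypothetical upload callable:
# with_retries(lambda: upload_rows(repo_id, rows), attempts=5)
```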
---

# Combined Workflow Examples

## Example 1: Create Training Subset from Existing Dataset
```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```

## Example 2: Transform and Reshape Data
```bash
# Transform MMLU to QA format with the correct answer extracted
# (DuckDB lists are 1-indexed; MMLU's answer column is 0-based)
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer + 1] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```

## Example 3: Merge Multiple Dataset Splits
```bash
# Export multiple splits and combine
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```

## Example 4: Quality Filtering
```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```

## Example 5: Create Custom Training Dataset
```bash
# 1. Query source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```