Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
Add this skill:

```bash
npx mdskills install huggingface/hugging-face-datasets
```

Comprehensive dataset management with SQL querying, multi-format support, and robust templates.
---
name: hugging-face-datasets
description: Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
---

# Overview
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

## Integration with HF MCP Server
- **Use HF MCP Server for**: Dataset discovery, search, and metadata retrieval
- **Use This Skill for**: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting

# Version
2.1.0

# Dependencies
This skill uses PEP 723 scripts with inline dependency management. Scripts auto-install their requirements when run with `uv run scripts/script_name.py`.

- uv (Python package manager)
- Getting Started: See "Usage Instructions" below for PEP 723 usage

# Core Capabilities

## 1. Dataset Lifecycle Management
- **Initialize**: Create new dataset repositories with proper structure
- **Configure**: Store detailed configuration including system prompts and metadata
- **Stream Updates**: Add rows efficiently without downloading entire datasets

## 2. SQL-Based Dataset Querying (NEW)
Query any Hugging Face dataset using DuckDB SQL via `scripts/sql_manager.py`:
- **Direct Queries**: Run SQL on datasets using the `hf://` protocol
- **Schema Discovery**: Describe dataset structure and column types
- **Data Sampling**: Get random samples for exploration
- **Aggregations**: Count, histogram, and unique-value analysis
- **Transformations**: Filter, join, and reshape data with SQL
- **Export & Push**: Save results locally or push to new Hub repos

## 3. Multi-Format Dataset Support
Supports diverse dataset types through a template system:
- **Chat/Conversational**: Chat templating, multi-turn dialogues, tool usage examples
- **Text Classification**: Sentiment analysis, intent detection, topic classification
- **Question-Answering**: Reading comprehension, factual QA, knowledge bases
- **Text Completion**: Language modeling, code completion, creative writing
- **Tabular Data**: Structured data for regression/classification tasks
- **Custom Formats**: Flexible schema definition for specialized needs

## 4. Quality Assurance Features
- **JSON Validation**: Ensures data integrity during uploads
- **Batch Processing**: Efficient handling of large datasets
- **Error Recovery**: Graceful handling of upload failures and conflicts

# Usage Instructions

The skill includes two Python scripts that use PEP 723 inline dependency management (see the sketch after the prerequisites below):

> **All paths are relative to the directory containing this SKILL.md file.**
> Scripts are run with: `uv run scripts/script_name.py [arguments]`

- `scripts/dataset_manager.py` - Dataset creation and management
- `scripts/sql_manager.py` - SQL-based dataset querying and transformation

### Prerequisites
- `uv` package manager installed
- `HF_TOKEN` environment variable set to a token with write access
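For reference, a PEP 723 inline-metadata header looks like the following. This is a minimal sketch; the dependency list is illustrative, not the one the skill's scripts actually declare:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "duckdb",           # illustrative; the real scripts declare their own
#     "huggingface-hub",
# ]
# ///

# uv reads the block above and installs the listed packages into an
# ephemeral environment before executing the script body.
print("dependencies resolved by uv")
```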
---

# SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The `hf://` protocol provides direct access to any public dataset (or to private datasets with a token).

## Quick Start

```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```

## SQL Query Syntax

Use `data` as the table name in your SQL; it gets replaced with the actual `hf://` path:

```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) AS cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
-- (DuckDB lists are 1-indexed; MMLU's answer column is 0-based)
SELECT question, choices[answer + 1] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```
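Behind this placeholder, the script only has to point DuckDB at the dataset's auto-converted Parquet files. A minimal sketch of the idea, assuming a simple string substitution; the path, config, and substitution logic here are illustrative, not `sql_manager.py`'s exact implementation:

```python
import duckdb

# Illustrative only: how a "data" placeholder can resolve to an hf:// glob.
hf_path = "hf://datasets/cais/mmlu@~parquet/all/test/*.parquet"  # assumed config/split
user_sql = "SELECT subject, COUNT(*) AS cnt FROM data GROUP BY subject"

# Quote the Parquet glob and swap it in for the placeholder table name.
resolved = user_sql.replace("FROM data", f"FROM '{hf_path}'")
duckdb.sql(resolved).show()
```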
## Common Operations

### 1. Explore Dataset Structure
```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in a column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```

### 2. Filter and Transform
```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using the transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```

### 3. Create Subsets and Push to Hub
```bash
# Query and push to a new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```

### 4. Export to Local Files
```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```

### 5. Working with Dataset Configs/Splits
```bash
# Specify a config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify a split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```

### 6. Raw SQL with Full Paths
For complex queries or joining datasets:
```bash
uv run scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"
```

## Python API Usage

```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")

sql.close()
```

## HF Path Format

DuckDB uses the `hf://` protocol to access datasets:
```
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
```

Examples:
- `hf://datasets/cais/mmlu@~parquet/default/train/*.parquet`
- `hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet`

The `@~parquet` revision provides auto-converted Parquet files for any dataset format.
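Since these are ordinary Parquet globs, you can also query them with DuckDB directly, outside this skill. A minimal sketch, assuming a recent DuckDB release (where `hf://` paths are handled by the autoloaded httpfs extension):

```python
import duckdb

# Read the auto-converted Parquet files for a dataset split directly.
duckdb.sql("""
    SELECT subject, COUNT(*) AS cnt
    FROM 'hf://datasets/cais/mmlu@~parquet/default/train/*.parquet'
    GROUP BY subject
    ORDER BY cnt DESC
""").show()

# Private datasets additionally need a token configured
# (e.g. DuckDB's HUGGINGFACE secret type).
```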
## Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                   -- String length
regexp_replace(col, '\n', '')    -- Regex replace
regexp_matches(col, 'pattern')   -- Regex match
LOWER(col), UPPER(col)           -- Case conversion

-- List functions (DuckDB lists are 1-indexed)
choices[1]                       -- First element
array_length(choices)            -- List length
unnest(choices)                  -- Expand list to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                  -- Random sample of 10 rows
USING SAMPLE 10 (RESERVOIR, 42)  -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```

---

# Dataset Creation (dataset_manager.py)

### Recommended Workflow

**1. Discovery (Use HF MCP Server):**
```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

**2. Creation (Use This Skill):**
```bash
# Initialize new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

**3. Content Management (Use This Skill):**
```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```

### Template-Based Data Structures

**1. Chat Template (`--template chat`)**
```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```

**2. Classification Template (`--template classification`)**
```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```

**3. QA Template (`--template qa`)**
```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```

**4. Completion Template (`--template completion`)**
```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```

**5. Tabular Template (`--template tabular`)**
```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```
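Rows that follow these templates can also be built programmatically and handed to `add_rows`. A minimal sketch; the repo id and example rows are placeholders:

```python
import json
import subprocess

# Build QA-template rows in code instead of hand-editing JSON files.
rows = [
    {"question": "What is AI?", "answer": "Artificial Intelligence...",
     "answer_type": "factual", "difficulty": "easy"},
    {"question": "Why use Parquet?", "answer": "Columnar storage...",
     "answer_type": "explanatory", "difficulty": "medium"},
]

# Pass the serialized rows to the add_rows command shown above.
subprocess.run(
    ["uv", "run", "scripts/dataset_manager.py", "add_rows",
     "--repo_id", "your-username/dataset-name",
     "--template", "qa",
     "--rows_json", json.dumps(rows)],
    check=True,
)
```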
### Advanced System Prompt Template

For high-quality training data generation:
```text
You are an AI assistant expert at using MCP tools effectively.

## MCP SERVER DEFINITIONS
[Define available servers and tools]

## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]

## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
```

### Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage.

**Available Example Sets:**
- `training_examples.json` - MCP tool usage examples (debugging, project setup, database analysis)
- `diverse_training_examples.json` - Broader scenarios including:
  - **Educational Chat** - Explaining programming concepts, tutorials
  - **Git Workflows** - Feature branches, version control guidance
  - **Code Analysis** - Performance optimization, architecture review
  - **Content Generation** - Professional writing, creative brainstorming
  - **Codebase Navigation** - Legacy code exploration, systematic analysis
  - **Conversational Support** - Problem-solving, technical discussions

**Using Different Example Sets:**
```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```

### Commands Reference

**List Available Templates:**
```bash
uv run scripts/dataset_manager.py list_templates
```

**Quick Setup (Recommended):**
```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

**Manual Setup:**
```bash
# Initialize repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

**View Dataset Statistics:**
```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```

### Error Handling
- **Repository exists**: The script notifies you and continues with configuration
- **Invalid JSON**: Clear error message with parsing details
- **Network issues**: Automatic retry for transient failures (see the sketch below)
- **Token permissions**: Validated before operations begin
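The scripts' actual retry logic is not shown here; as a rough illustration of the backoff pattern (function and parameter names are hypothetical):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Retry a flaky operation with exponential backoff (illustrative only)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == attempts:
                raise  # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Transient failure ({exc!r}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Usage with a hypothetical upload callable:
# with_retries(lambda: upload_rows(repo_id, rows), attempts=5)
```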
---

# Combined Workflow Examples

## Example 1: Create Training Subset from Existing Dataset
```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```

## Example 2: Transform and Reshape Data
```bash
# Transform MMLU to QA format with the correct answer extracted
# (DuckDB lists are 1-indexed; MMLU's answer column is 0-based)
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer + 1] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```

## Example 3: Merge Multiple Dataset Splits
```bash
# Export multiple splits and combine
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```

## Example 4: Quality Filtering
```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```

## Example 5: Create Custom Training Dataset
```bash
# 1. Query source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```