Hugging Face Transformers best practices including model loading, tokenization, fine-tuning workflows, and inference optimization. Use when working with transformer models, fine-tuning LLMs, implementing NLP tasks, or optimizing transformer inference.
Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.
from transformers import AutoModel, AutoTokenizer
# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# For specific tasks
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=3 # For 3-class classification
)
from transformers import AutoConfig, AutoModel
# Modify configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2 # Custom dropout
config.attention_probs_dropout_prob = 0.2
# Load model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
# Or create model from scratch with config
model = AutoModel.from_config(config)
from transformers import AutoModel, BitsAndBytesConfig
import torch
# 8-bit quantization (50% memory reduction)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto" # Automatic device placement
)
# 4-bit quantization (75% memory reduction)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
model = AutoModel.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=quantization_config,
device_map="auto"
)
# Save model locally
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")
# Load from local path
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ✅ CORRECT: All required arguments
tokens = tokenizer(
text,
padding=True, # Pad to longest in batch
truncation=True, # Truncate to max_length
max_length=512, # Maximum sequence length
return_tensors="pt" # Return PyTorch tensors
)
# Access components
input_ids = tokens['input_ids'] # Token IDs
attention_mask = tokens['attention_mask'] # Padding mask
token_type_ids = tokens.get('token_type_ids') # Segment IDs (BERT)
# ❌ WRONG: Missing critical arguments
tokens = tokenizer(text) # No padding, truncation, or tensor format!
# Tokenize multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
tokens = tokenizer(
texts,
padding=True, # Pad all to longest in batch
truncation=True,
max_length=128,
return_tensors="pt"
)
# Result shape: [batch_size, max_length]
print(tokens['input_ids'].shape) # torch.Size([3, max_len_in_batch])
# Add special tokens
tokenizer.add_special_tokens({
'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
})
# Resize model embeddings to match
model.resize_token_embeddings(len(tokenizer))
# Encode with special tokens preserved
text = "Hello [CUSTOM] world"
tokens = tokenizer(text, add_special_tokens=True)
# Decode the full sequence back to text
# (without return_tensors, input_ids is a flat list of token IDs)
decoded = tokenizer.decode(tokens['input_ids'], skip_special_tokens=False)
# Text classification (single sequence)
tokens = tokenizer(
"This movie was great!",
padding="max_length",
truncation=True,
max_length=128,
return_tensors="pt"
)
# Question answering (pair of sequences)
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."
tokens = tokenizer(
question,
context,
padding="max_length",
truncation="only_second", # Only truncate context
max_length=384,
return_tensors="pt"
)
# Text generation (decoder-only models)
prompt = "Once upon a time"
tokens = tokenizer(prompt, return_tensors="pt")
# No padding needed for generation input
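For *batched* generation with decoder-only models, padding is needed after all, and it should go on the left so the model continues from real tokens rather than from pad tokens. A minimal sketch (GPT-2 used here as a small illustrative checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
tokenizer.padding_side = "left"             # pad on the LEFT for generation

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Once upon a time", "The meaning of life is"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With right padding (the default), the model would be asked to continue from pad tokens, which degrades completions silently.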
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments
)
from datasets import load_dataset
# 1. Load dataset
dataset = load_dataset("glue", "mrpc")
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 3. Tokenize dataset
def tokenize_function(examples):
return tokenizer(
examples["sentence1"],
examples["sentence2"],
padding="max_length",
truncation=True,
max_length=128
)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 4. Define training arguments
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",  # named "evaluation_strategy" in older transformers versions
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=100,
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
)
# 5. Define metrics
import evaluate  # datasets.load_metric is deprecated; use the evaluate library
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
# 6. Create Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
compute_metrics=compute_metrics,
)
# 7. Train
trainer.train()
# 8. Save
trainer.save_model("./fine-tuned-model")
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit for memory efficiency
    device_map="auto"
)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8, # LoRA rank
lora_alpha=32, # LoRA alpha
lora_dropout=0.1,
target_modules=["q_proj", "v_proj"], # Which layers to adapt
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%
# Train with Trainer (same as before)
# Only LoRA parameters are updated!
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the torch optimizer
from transformers import get_scheduler
# Prepare dataloaders: format the tokenized dataset as torch tensors first
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=16)
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)
# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=500,
num_training_steps=num_training_steps
)
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for epoch in range(num_epochs):
model.train()
for batch in train_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
# Evaluation
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
with torch.no_grad():
outputs = model(**batch)
# Compute metrics
from transformers import pipeline
# Load pipeline
classifier = pipeline(
"text-classification",
model="distilbert-base-uncased-finetuned-sst-2-english"
)
# Single prediction
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Batch prediction
results = classifier([
"Great service!",
"Terrible experience",
"Average quality"
])
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
result = qa_pipeline(
question="What is the capital of France?",
context="France is a country in Europe. Its capital is Paris, a beautiful city."
)
# e.g. {'score': 0.98, 'start': 46, 'end': 51, 'answer': 'Paris'}
generator = pipeline("text-generation", model="gpt2")
outputs = generator(
"Once upon a time",
max_length=50,
num_return_sequences=3,
temperature=0.7,
top_k=50,
top_p=0.95,
do_sample=True
)
for output in outputs:
print(output['generated_text'])
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
"This is a course about Python programming.",
candidate_labels=["education", "technology", "business", "sports"]
)
# {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
# ❌ SLOW: Process one at a time
for text in texts:
output = model(**tokenizer(text, return_tensors="pt"))
# ✅ FAST: Process in batches
batch_size = 32
for i in range(0, len(texts), batch_size):
batch = texts[i:i+batch_size]
inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated
scaler = GradScaler("cuda")
for batch in dataloader:
    optimizer.zero_grad()
    # Forward pass in mixed precision
    with autocast("cuda"):
outputs = model(**batch)
loss = outputs.loss
# Backward pass with scaled gradients
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
# Export to ONNX (export=True converts the PyTorch checkpoint on load)
ort_model = ORTModelForSequenceClassification.from_pretrained("bert-base-uncased", export=True)
ort_model.save_pretrained("./onnx-model")
# Reload the exported ONNX model (faster inference)
ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")
# Inference (2-3x faster)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = ort_model(**inputs)
import torch
# Quantize model to int8
quantized_model = torch.quantization.quantize_dynamic(
model,
{torch.nn.Linear}, # Quantize Linear layers
dtype=torch.qint8
)
# 4x smaller model, 2-3x faster inference on CPU
Problem: RuntimeError: CUDA out of memory
Solutions:
# Solution 1: Reduce batch size
training_args = TrainingArguments(
per_device_train_batch_size=8, # Was 32
gradient_accumulation_steps=4, # Effective batch = 8*4 = 32
)
# Solution 2: Use gradient checkpointing
model.gradient_checkpointing_enable()
# Solution 3: Use 8-bit model
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)
# Solution 4: Clear cache
import torch
torch.cuda.empty_cache()
Problem: Tokenization is the bottleneck
Solutions:
# Solution 1: Use fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Solution 2: Tokenize dataset once, cache it
tokenized_dataset = dataset.map(
tokenize_function,
batched=True,
num_proc=4, # Parallel processing
remove_columns=dataset.column_names,
load_from_cache_file=True # Cache results
)
# Solution 3: Tokenize many texts in one call (passing a list is already batched)
tokens = tokenizer(
    texts,  # list of strings
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
# Note: batched=True / batch_size are dataset.map() arguments, not tokenizer arguments
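A related optimization not shown above: pad dynamically per batch with `DataCollatorWithPadding` instead of padding every example to a global `max_length`. A sketch:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# tokenize WITHOUT padding; the collator pads each batch to its own longest sequence
features = [
    tokenizer("short text"),
    tokenizer("a somewhat longer example sentence for comparison"),
]
batch = collator(features)
# batch["input_ids"].shape[1] matches the longest sequence in this batch,
# not a fixed max_length, which saves compute on short batches
```

Pass the collator as `data_collator=` to `Trainer` or `collate_fn=` to a `DataLoader`.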
Problem: Model outputs different results for same input
Solution:
# Set seeds for reproducibility
import random
import numpy as np
import torch
def set_seed(seed=42):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
set_seed(42)
# Disable dropout during inference
model.eval()
# Use deterministic generation
# Greedy decoding is fully deterministic
outputs = model.generate(inputs, do_sample=False)

# OR seed the sampler first; generate() itself has no seed argument
from transformers import set_seed
set_seed(42)
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=1.0,
    top_k=50,
)
Problem: IndexError: index out of range in self
Solution:
# ✅ ALWAYS provide attention mask
tokens = tokenizer(
text,
padding=True,
truncation=True,
return_tensors="pt",
return_attention_mask=True # Explicit (usually default)
)
# Use it in model forward
outputs = model(
input_ids=tokens['input_ids'],
attention_mask=tokens['attention_mask'] # Don't forget this!
)
# For custom padding
attention_mask = (input_ids != tokenizer.pad_token_id).long()
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Set pad token (GPT doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
# Generation
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,             # Beam search
    early_stopping=True,
    no_repeat_ngram_size=2,  # Prevent repetition
    # temperature / top_p only take effect when do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
# T5 expects task prefix
input_text = "translate English to German: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
**inputs,
max_length=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# "Wie geht es dir?"
import torch
from transformers import BertForMaskedLM, BertTokenizer
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Masked language modeling
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")
# Get predictions for [MASK]
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = outputs.logits[0, mask_token_index, :]
# Top 5 predictions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
print(tokenizer.decode([token]))
# capital, city, center, heart, ...
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel
import uvicorn
app = FastAPI()
# Load model once at startup
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
class TextInput(BaseModel):
text: str
@app.post("/classify")
def classify_text(input: TextInput):  # sync def: FastAPI runs it in a threadpool, so the blocking model call doesn't stall the event loop
result = classifier(input.text)[0]
return {
"label": result['label'],
"confidence": result['score']
}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
import asyncio
from typing import List
class BatchPredictor:
def __init__(self, model, tokenizer, max_batch_size=32):
self.model = model
self.tokenizer = tokenizer
self.max_batch_size = max_batch_size
self.queue = []
self.lock = asyncio.Lock()
async def predict(self, text: str):
async with self.lock:
future = asyncio.Future()
self.queue.append((text, future))
if len(self.queue) >= self.max_batch_size:
await self._process_batch()
return await future
async def _process_batch(self):
if not self.queue:
return
texts, futures = zip(*self.queue)
self.queue = []
        # Process batch (note: partial batches are only flushed when the queue
        # fills; production code should also flush on a timer)
        inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        results = outputs.logits.argmax(dim=-1).tolist()
# Return results
for future, result in zip(futures, results):
future.set_result(result)
Task type?
Classification → BERT, RoBERTa, DeBERTa
Generation → GPT-2, GPT-Neo, LLaMA
Translation/Summarization → T5, BART, mT5
Question Answering → BERT, DeBERTa, RoBERTa
Performance vs Speed?
Best performance → Large models (355M+ params)
Balanced → Base models (110M params)
Fast inference → Distilled models (66M params)
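The decision tree above can be expressed as a small lookup; the checkpoint names below are common community defaults (illustrative assumptions, not the only valid choices):

```python
# Illustrative task-to-checkpoint defaults, keyed by task and speed/quality priority
TASK_DEFAULTS = {
    "classification": {
        "best": "microsoft/deberta-v3-large",
        "balanced": "roberta-base",
        "fast": "distilbert-base-uncased",
    },
    "generation": {
        "best": "meta-llama/Llama-2-7b-hf",
        "balanced": "gpt2-medium",
        "fast": "gpt2",
    },
    "summarization": {
        "best": "facebook/bart-large-cnn",
        "balanced": "t5-base",
        "fast": "t5-small",
    },
}

def pick_checkpoint(task: str, priority: str = "balanced") -> str:
    if task not in TASK_DEFAULTS:
        raise ValueError(f"unsupported task: {task}")
    return TASK_DEFAULTS[task][priority]

print(pick_checkpoint("classification", "fast"))  # distilbert-base-uncased
```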
Do you have labeled training data you control?
YES → Full fine-tuning or LoRA
NO → Few-shot prompting
Dataset size?
Large (>10K examples) → Full fine-tuning
Medium (1K-10K) → LoRA or full fine-tuning
Small (<1K) → LoRA or prompt engineering
Compute available?
Limited → LoRA (4-bit quantized)
Moderate → LoRA (8-bit)
High → Full fine-tuning
Skill version: 1.0.0 Last updated: 2025-10-25 Maintained by: Applied Artificial Intelligence