---
name: huggingface-transformers
description: Hugging Face Transformers best practices including model loading, tokenization, fine-tuning workflows, and inference optimization. Use when working with transformer models, fine-tuning LLMs, implementing NLP tasks, or optimizing transformer inference.
---

# Hugging Face Transformers Best Practices

Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.

---

## Quick Reference

**When to use this skill:**
- Loading and using pre-trained transformers (BERT, GPT, T5, LLaMA, etc.)
- Fine-tuning models on custom data
- Implementing NLP tasks (classification, QA, generation, etc.)
- Optimizing inference (quantization, ONNX, etc.)
- Debugging tokenization issues
- Using Hugging Face pipelines
- Deploying transformers to production

**Models covered:**
- Encoders: BERT, RoBERTa, DeBERTa, ALBERT
- Decoders: GPT-2, GPT-Neo, LLaMA, Mistral
- Encoder-Decoders: T5, BART, Flan-T5
- Vision: ViT, CLIP (Stable Diffusion lives in the companion `diffusers` library)

---

## Part 1: Model Loading Patterns

### Pattern 1: Basic Model Loading

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# For specific tasks
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # For 3-class classification
)
```

### Pattern 2: Loading with Specific Configuration

```python
from transformers import AutoConfig, AutoModel

# Modify configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2  # Custom dropout
config.attention_probs_dropout_prob = 0.2

# Load pretrained weights with the custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

# Or create a randomly initialized model from the config
model = AutoModel.from_config(config)
```

### Pattern 3: Loading Quantized Models (Memory Efficient)

```python
from transformers import AutoModel, BitsAndBytesConfig
import torch

# 8-bit quantization (~50% memory reduction)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"  # Automatic device placement
)

# 4-bit quantization (~75% memory reduction)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
```

### Pattern 4: Loading from Local Path

```python
# Save model locally
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load from local path
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
```

---

## Part 2: Tokenization Best Practices

### Critical Tokenization Patterns

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ✅ CORRECT: All required arguments
tokens = tokenizer(
    text,
    padding=True,        # Pad to longest in batch
    truncation=True,     # Truncate to max_length
    max_length=512,      # Maximum sequence length
    return_tensors="pt"  # Return PyTorch tensors
)

# Access components
input_ids = tokens['input_ids']                # Token IDs
attention_mask = tokens['attention_mask']      # Padding mask
token_type_ids = tokens.get('token_type_ids')  # Segment IDs (BERT)

# ❌ WRONG: Missing critical arguments
tokens = tokenizer(text)  # No padding, truncation, or tensor format!
```

### Batch Tokenization

```python
# Tokenize multiple texts efficiently
texts = ["First text", "Second text", "Third text"]

tokens = tokenizer(
    texts,
    padding=True,  # Pad all to longest in batch
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Result shape: [batch_size, max_length]
print(tokens['input_ids'].shape)  # torch.Size([3, max_len_in_batch])
```

### Special Token Handling

```python
# Add special tokens
tokenizer.add_special_tokens({
    'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
})

# Resize model embeddings to match
model.resize_token_embeddings(len(tokenizer))

# Encode with special tokens preserved
text = "Hello [CUSTOM] world"
tokens = tokenizer(text, add_special_tokens=True)

# Decode (without return_tensors, input_ids is a flat list of IDs, not a batch)
decoded = tokenizer.decode(tokens['input_ids'], skip_special_tokens=False)
```

### Tokenization for Different Tasks

```python
# Text classification (single sequence)
tokens = tokenizer(
    "This movie was great!",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Question answering (pair of sequences)
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."

tokens = tokenizer(
    question,
    context,
    padding="max_length",
    truncation="only_second",  # Only truncate the context
    max_length=384,
    return_tensors="pt"
)

# Text generation (decoder-only models)
prompt = "Once upon a time"
tokens = tokenizer(prompt, return_tensors="pt")
# No padding needed for a single generation input
```

---

## Part 3: Fine-Tuning Workflows

### Pattern 1: Simple Fine-Tuning with Trainer

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset

# 1. Load dataset
dataset = load_dataset("glue", "mrpc")

# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 4. Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# 5. Define metrics (datasets.load_metric is deprecated; use the evaluate library)
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Save
trainer.save_model("./fine-tuned-model")
```

### Pattern 2: LoRA Fine-Tuning (Parameter-Efficient)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Load base model in 8-bit for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,            # LoRA rank
    lora_alpha=32,  # LoRA alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%

# Train with Trainer (same as before)
# Only LoRA parameters are updated!
```

### Pattern 3: Custom Training Loop

```python
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding, get_scheduler

# Drop raw text columns and rename the label so the model accepts the batch
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Prepare dataloaders (the collator turns dataset rows into padded tensors)
data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(
    tokenized_datasets["train"], batch_size=16, shuffle=True, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=16, collate_fn=data_collator
)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps
)

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        # Compute metrics
```

---

## Part 4: Pipeline Usage (High-Level API)

### Text Classification Pipeline

```python
from transformers import pipeline

# Load pipeline
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Single prediction
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch prediction
results = classifier([
    "Great service!",
    "Terrible experience",
    "Average quality"
])
```

### Question Answering Pipeline

```python
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

result = qa_pipeline(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris, a beautiful city."
)
# {'score': 0.98, 'start': 46, 'end': 51, 'answer': 'Paris'}
```

### Text Generation Pipeline

```python
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Once upon a time",
    max_length=50,
    num_return_sequences=3,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

for output in outputs:
    print(output['generated_text'])
```

### Zero-Shot Classification Pipeline

```python
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "This is a course about Python programming.",
    candidate_labels=["education", "technology", "business", "sports"]
)
# {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
```

---

## Part 5: Inference Optimization

### Optimization 1: Batch Processing

```python
# ❌ SLOW: Process one at a time
for text in texts:
    output = model(**tokenizer(text, return_tensors="pt"))

# ✅ FAST: Process in batches
batch_size = 32
for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)
```

### Optimization 2: Mixed Precision (AMP)

```python
# On recent PyTorch, prefer torch.amp.autocast("cuda") / torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast():
        outputs = model(**batch)
        loss = outputs.loss

    # Backward pass with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

### Optimization 3: ONNX Export

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Export to ONNX (Optimum converts the checkpoint on load)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", export=True
)
ort_model.save_pretrained("./onnx-model")

# Later: load the saved ONNX model directly (faster inference)
ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")

# Inference (often 2-3x faster)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = ort_model(**inputs)
```

### Optimization 4: Dynamic Quantization

```python
import torch

# Quantize model to int8
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize Linear layers
    dtype=torch.qint8
)

# ~4x smaller model, 2-3x faster inference on CPU
```

---

## Part 6: Common Issues & Solutions

### Issue 1: CUDA Out of Memory

**Problem:** `RuntimeError: CUDA out of memory`

**Solutions:**

```python
# Solution 1: Reduce batch size
training_args = TrainingArguments(
    per_device_train_batch_size=8,   # Was 32
    gradient_accumulation_steps=4,   # Effective batch = 8*4 = 32
)

# Solution 2: Use gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 3: Use 8-bit model
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)

# Solution 4: Clear cache
import torch
torch.cuda.empty_cache()
```

### Issue 2: Slow Tokenization

**Problem:** Tokenization is the bottleneck

**Solutions:**

```python
# Solution 1: Use fast (Rust-backed) tokenizers — the default when available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Solution 2: Tokenize dataset once, cache it
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,  # Parallel processing
    remove_columns=dataset.column_names,
    load_from_cache_file=True  # Cache results
)

# Solution 3: Tokenize many texts in one call instead of one at a time
# (pass a list; the tokenizer has no batched=/batch_size= arguments —
# those belong to datasets.map)
tokens = tokenizer(
    texts,  # List of strings
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

### Issue 3: Inconsistent Results

**Problem:** Model outputs different results for same input

**Solution:**

```python
# Set seeds for reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Disable dropout during inference
model.eval()

# Use deterministic generation
outputs = model.generate(
    **inputs,
    do_sample=False  # Greedy decoding
)

# OR sample with seeds set beforehand
# (generate() has no seed argument; call set_seed() first)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_k=50
)
```

### Issue 4: Attention Mask Errors

**Problem:** `IndexError: index out of range in self`

**Solution:**

```python
# ✅ ALWAYS provide attention mask
tokens = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Explicit (usually default)
)

# Use it in model forward
outputs = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']  # Don't forget this!
)

# For custom padding
attention_mask = (input_ids != tokenizer.pad_token_id).long()
```

---

## Part 7: Model-Specific Patterns

### GPT Models (Decoder-Only)

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set pad token (GPT doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Generation
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,            # Beam search
    early_stopping=True,
    no_repeat_ngram_size=2  # Prevent repetition
)
# Note: temperature/top_p only take effect with do_sample=True;
# beam search ignores them

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### T5 Models (Encoder-Decoder)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# T5 expects a task prefix
input_text = "translate English to German: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=50
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# "Wie geht es dir?"
```

### BERT Models (Encoder-Only)

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Masked language modeling
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for [MASK]
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = outputs.logits[0, mask_token_index, :]

# Top 5 predictions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))
# capital, city, center, heart, ...
```

---

## Part 8: Production Deployment

### FastAPI Serving Pattern

```python
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel
import uvicorn

app = FastAPI()

# Load model once at startup
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

class TextInput(BaseModel):
    text: str

@app.post("/classify")
async def classify_text(input: TextInput):
    result = classifier(input.text)[0]
    return {
        "label": result['label'],
        "confidence": result['score']
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

### Batch Inference Optimization

```python
import asyncio
import torch

class BatchPredictor:
    def __init__(self, model, tokenizer, max_batch_size=32):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, text: str):
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((text, future))

            if len(self.queue) >= self.max_batch_size:
                await self._process_batch()

        return await future

    async def _process_batch(self):
        if not self.queue:
            return

        texts, futures = zip(*self.queue)
        self.queue = []

        # Process batch
        inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        results = outputs.logits.argmax(dim=-1).tolist()

        # Return results
        for future, result in zip(futures, results):
            future.set_result(result)
```

---

## Quick Decision Trees

### "Which model should I use?"

```
Task type?
  Classification → BERT, RoBERTa, DeBERTa
  Generation → GPT-2, GPT-Neo, LLaMA
  Translation/Summarization → T5, BART, mT5
  Question Answering → BERT, DeBERTa, RoBERTa

Performance vs Speed?
  Best performance → Large models (355M+ params)
  Balanced → Base models (110M params)
  Fast inference → Distilled models (66M params)
```

### "How should I fine-tune?"

```
Have full dataset control?
  YES → Full fine-tuning or LoRA
  NO → Few-shot prompting

Dataset size?
  Large (>10K examples) → Full fine-tuning
  Medium (1K-10K) → LoRA or full fine-tuning
  Small (<1K) → LoRA or prompt engineering

Compute available?
  Limited → LoRA (4-bit quantized)
  Moderate → LoRA (8-bit)
  High → Full fine-tuning
```

---

## Resources

- **Hugging Face Docs:** https://huggingface.co/docs/transformers/
- **Model Hub:** https://huggingface.co/models
- **PEFT (LoRA):** https://huggingface.co/docs/peft/
- **Optimum:** https://huggingface.co/docs/optimum/
- **Datasets:** https://huggingface.co/docs/datasets/

---

**Skill version:** 1.0.0
**Last updated:** 2025-10-25
**Maintained by:** Applied Artificial Intelligence