Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
Add this skill:
npx mdskills install anthropics/pdf-processing

Comprehensive PDF operations guide with clear code examples and multiple library options.
---
name: pdf
description: Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
license: Proprietary. LICENSE.txt has complete terms
---

# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page of the input to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

#### Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

#### Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

### pdfplumber - Text and Table Extraction

#### Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

#### Advanced Table Extraction
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables (writing .xlsx requires openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```

### reportlab - Create PDFs

#### Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```

#### Subscripts and Superscripts

**IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.

Instead, use ReportLab's XML markup tags in Paragraph objects:
```python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])

# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
```

For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.

## Command-Line Tools

### pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```

### qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```

### pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

## Common Tasks

### Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
# Also requires the Tesseract and Poppler binaries to be installed on the system
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

### Add Watermark
```python
from pypdf import PdfReader, PdfWriter

# Load the watermark page from an existing PDF
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

### Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# With -j, JPEG-encoded images are written as output_prefix-000.jpg, output_prefix-001.jpg, etc.
# Images stored in other encodings are written as .ppm or .pbm files.
```

### Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |

## Next Steps

- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md
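
The Subscripts and Superscripts section says that canvas-drawn text needs its font size and position adjusted by hand. The following is a minimal sketch of one way to do that, not a prescribed method. The helper names `place_runs` and `draw_runs`, the `(text, kind)` run format, and the scale and offset ratios (0.6, 0.2, 0.4) are all illustrative assumptions; tune the ratios for your font. The layout step is kept separate from drawing so it can be checked without rendering a PDF.

```python
from typing import Callable, List, Tuple

# Hypothetical input format: (text, kind), where kind is "normal", "sub", or "super"
Run = Tuple[str, str]
# Output: (text, x, y, font_size), ready for canvas.setFont + canvas.drawString
Placed = Tuple[str, float, float, float]


def place_runs(
    runs: List[Run],
    x: float,
    y: float,
    size: float,
    measure: Callable[[str, str, float], float],
    font: str = "Helvetica",
    scale: float = 0.6,       # assumed size of sub/superscripts relative to body text
    sub_drop: float = 0.2,    # assumed downward shift for subscripts, in body font sizes
    super_rise: float = 0.4,  # assumed upward shift for superscripts, in body font sizes
) -> List[Placed]:
    """Compute where each run goes, advancing x by the measured width of each run."""
    placed: List[Placed] = []
    for text, kind in runs:
        if kind == "normal":
            run_size, run_y = size, y
        elif kind == "sub":
            run_size, run_y = size * scale, y - size * sub_drop
        elif kind == "super":
            run_size, run_y = size * scale, y + size * super_rise
        else:
            raise ValueError(f"unknown run kind: {kind!r}")
        placed.append((text, x, run_y, run_size))
        x += measure(text, font, run_size)
    return placed


def draw_runs(c, runs: List[Run], x: float, y: float,
              size: float = 12, font: str = "Helvetica") -> None:
    """Draw the runs on a reportlab canvas `c` using plain ASCII digits, no Unicode scripts."""
    # Imported here so the layout helper above can be used without reportlab installed
    from reportlab.pdfbase.pdfmetrics import stringWidth

    for text, run_x, run_y, run_size in place_runs(runs, x, y, size, stringWidth, font):
        c.setFont(font, run_size)
        c.drawString(run_x, run_y, text)
```

With a canvas created as in Basic PDF Creation, a call such as `draw_runs(c, [("H", "normal"), ("2", "sub"), ("O", "normal")], 100, 700)` followed by `c.save()` would draw H2O with a lowered, smaller "2".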