Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use when Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
Add this skill:
npx mdskills install sickn33/pdf-official

Comprehensive reference guide with solid code examples for common PDF operations.
---
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use when Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
license: Proprietary. LICENSE.txt has complete terms
---

# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page to its own file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

#### Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

#### Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

### pdfplumber - Text and Table Extraction

#### Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

#### Advanced Table Extraction
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables (writing .xlsx requires openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```

### reportlab - Create PDFs

#### Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```

## Command-Line Tools

### pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```

### qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```

### pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

## Common Tasks

### Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

### Add Watermark
```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

### Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# With -j, JPEG-encoded images are saved as output_prefix-000.jpg, output_prefix-001.jpg, etc.
# Images in other encodings are saved as .ppm or .pbm files.
```

### Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |

## Next Steps

- For advanced pypdfium2 usage, see reference.md
- For JavaScript libraries (pdf-lib), see reference.md
- If you need to fill out a PDF form, follow the instructions in forms.md
- For troubleshooting guides, see reference.md
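
The pypdf split example in this guide writes one page per file, while the qpdf commands select ranges such as `1-5`. As a minimal sketch of bridging the two, a range string could be converted into zero-based indices suitable for `reader.pages`. The helper name `parse_page_ranges` and its range syntax are hypothetical illustrations, not part of pypdf or qpdf.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based range string such as "1-5,8" into 0-based page indices.

    Hypothetical helper for illustration; it is not part of pypdf.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # Tolerate stray commas such as "1-3,,8"
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start  # A bare "8" means the single page 8
        if start < 1 or end > page_count or start > end:
            raise ValueError(
                f"Invalid page range {part!r} for a {page_count}-page document"
            )
        # Convert the inclusive 1-based range to 0-based indices
        indices.extend(range(start - 1, end))
    return indices
```

With pypdf, the returned indices could then drive `writer.add_page(reader.pages[i])` for each `i`, using `len(reader.pages)` as `page_count`, to write a multi-page chunk instead of one file per page.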