---
name: batch-translate
description: Batch process books through the complete pipeline - generate cropped images for split pages, OCR all pages, then translate with context. Use when asked to process, OCR, translate, or batch process one or more books.
---

# Batch Book Translation Workflow

Process books through the complete pipeline: Crop → OCR → Translate

## Roadmap Reference

See `.claude/ROADMAP.md` for the translation priority list.

**Priority 1 = UNTRANSLATED** - These are highest priority for processing:
- Kircher encyclopedias (Oedipus, Musurgia, Ars Magna Lucis)
- Fludd: Utriusque Cosmi Historia
- Theatrum Chemicum, Musaeum Hermeticum
- Cardano: De Subtilitate
- Della Porta: Magia Naturalis
- Lomazzo, Poliziano, Landino

```bash
# Get roadmap with priorities
curl -s "https://sourcelibrary.org/api/books/roadmap" | jq '.books[] | select(.priority == 1) | {title, notes}'
```

Roadmap source: `src/app/api/books/roadmap/route.ts`

## Overview

This workflow handles the full processing pipeline for historical book scans:
1. **Generate Cropped Images** - For split two-page spreads, extract individual pages
2. **OCR** - Extract text from page images using Gemini vision
3. **Translate** - Translate OCR'd text with prior page context for continuity

## API Endpoints

| Endpoint | Purpose |
|----------|---------|
| `GET /api/books` | List all books |
| `GET /api/books/BOOK_ID` | Get book with all pages |
| `POST /api/jobs/queue-books` | Queue pages for Lambda worker processing (primary path) |
| `GET /api/jobs` | List processing jobs |
| `POST /api/jobs/JOB_ID/retry` | Retry failed pages in a job |
| `POST /api/jobs/JOB_ID/cancel` | Cancel a running job |
| `POST /api/books/BOOK_ID/batch-ocr-async` | Submit Gemini Batch API OCR job (50% cheaper, ~24h) |
| `POST /api/books/BOOK_ID/batch-translate-async` | Submit Gemini Batch API translation job |

## Processing Options

### Option 1: Lambda Workers via Job System (Primary Path)

The primary processing path uses AWS Lambda workers via SQS queues. Each page is processed independently with automatic job tracking.

```bash
# Queue OCR for a book's pages
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "ocr"}'

# Queue translation
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "translation"}'

# Queue image extraction
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "image_extraction"}'
```

**IMPORTANT: Always use `gemini-3-flash-preview` for all OCR and translation tasks. Do NOT use `gemini-2.5-flash`.**

### Option 2: Gemini Batch API (50% Cheaper, Automated Pipeline)

The post-import-pipeline cron uses the Gemini Batch API for automated processing of newly imported books.
Results arrive in ~24 hours at 50% cost.

| Job Type | API | Model | Cost |
|----------|-----|-------|------|
| Single page | Realtime (Lambda) | gemini-3-flash-preview | Full price |
| batch_ocr | Batch API | gemini-3-flash-preview | **50% off** |
| batch_translate | Batch API | gemini-3-flash-preview | **50% off** |

## OCR Output Format

OCR uses **Markdown output** with semantic tags:

### Markdown Formatting
- `# ## ###` for headings (bigger text = bigger heading)
- `**bold**`, `*italic*` for emphasis
- `->centered text<-` for centered lines (NOT for headings)
- `> blockquotes` for quotes/prayers
- `---` for dividers
- Tables only for actual tabular data

### Metadata Tags (hidden from readers)
| Tag | Purpose |
|-----|---------|
| `<lang>X</lang>` | Detected language |
| `<page-num>N</page-num>` | Page/folio number |
| `<header>X</header>` | Running headers |
| `<sig>X</sig>` | Printer's marks (A2, B1) |
| `<meta>X</meta>` | Hidden metadata |
| `<warning>X</warning>` | Quality issues |
| `<vocab>X</vocab>` | Key terms for indexing |

### Inline Annotations (visible to readers)
| Tag | Purpose |
|-----|---------|
| `<margin>X</margin>` | Marginal notes (before paragraph) |
| `<gloss>X</gloss>` | Interlinear annotations |
| `<insert>X</insert>` | Boxed text, additions |
| `<unclear>X</unclear>` | Illegible readings |
| `<note>X</note>` | Interpretive notes |
| `<term>X</term>` | Technical vocabulary |
| `<image-desc>X</image-desc>` | Describe illustrations |

### Critical OCR Rules
1. Preserve original spelling, capitalization, punctuation
2. Page numbers/headers/signatures go in metadata tags only
3. IGNORE partial text at edges (from facing page in spread)
4. Describe images/diagrams with `<image-desc>`, never tables
5. End with `<vocab>key terms, names, concepts</vocab>`

## Step 1: Analyze Book Status

First, check what work is needed for a book:

```bash
# Get book and analyze page status
curl -s "https://sourcelibrary.org/api/books/BOOK_ID" > /tmp/book.json

# Count pages by status (IMPORTANT: check length > 0, not just existence - empty strings are truthy!)
jq '{
  title: .title,
  total_pages: (.pages | length),
  split_pages: [.pages[] | select(.crop)] | length,
  needs_crop: [.pages[] | select(.crop) | select(.cropped_photo | not)] | length,
  has_ocr: [.pages[] | select((.ocr.data // "") | length > 0)] | length,
  needs_ocr: [.pages[] | select((.ocr.data // "") | length == 0)] | length,
  has_translation: [.pages[] | select((.translation.data // "") | length > 0)] | length,
  needs_translation: [.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length
}' /tmp/book.json
```

### Detecting Bad OCR

Pages that were OCR'd before cropped images were generated have incorrect OCR (it contains both pages of the spread).
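The telltale sign is OCR text that describes the spread itself. A quick stub check of that heuristic (the page records below are invented for illustration; requires only `jq`):

```shell
# Two stub pages: one contaminated by the facing page, one clean.
echo '[{"crop": {"xStart": 0}, "ocr": {"data": "Left page of a two-page spread: Caput I..."}},
       {"crop": {"xStart": 0}, "ocr": {"data": "Caput primum. De natura rerum."}}]' \
  | jq '[.[] | select(.ocr.data | test("two-page|spread"; "i"))] | length'
# prints 1 (only the contaminated page matches)
```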
Detect these:

```bash
# Find pages with crop data + OCR but missing cropped_photo at OCR time
# These often contain "two-page" or "spread" in the OCR text
jq '[.pages[] | select(.crop) | select(.ocr.data) |
  select(.ocr.data | test("two-page|spread"; "i"))] | length' /tmp/book.json
```

## Step 2: Generate Cropped Images

For books with split two-page spreads, generate individual page images:

```bash
# Get page IDs needing crops
CROP_IDS=$(jq '[.pages[] | select(.crop) | select(.cropped_photo | not) | .id]' /tmp/book.json)

# Create crop job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"generate_cropped_images\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"page_ids\": $CROP_IDS
  }"
```

Process the job:

```bash
# Trigger processing (40 pages per request, auto-continues)
curl -s -X POST "https://sourcelibrary.org/api/jobs/JOB_ID/process"
```

## Step 3: OCR Pages

### Option A: Using Job System (for large batches)

```bash
# Get page IDs needing OCR (check for empty strings, not just null)
OCR_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length == 0) | .id]' /tmp/book.json)

# Create OCR job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"batch_ocr\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"model\": \"gemini-3-flash-preview\",
    \"language\": \"Latin\",
    \"page_ids\": $OCR_IDS
  }"
```

### Option B: Using Lambda Workers with Page IDs

```bash
# OCR specific pages (including overwrite)
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{
    "bookIds": ["BOOK_ID"],
    "action": "ocr",
    "pageIds": ["PAGE_ID_1", "PAGE_ID_2"],
    "overwrite": true
  }'
```

Lambda workers automatically use `cropped_photo` when available.

## Step 4: Translate Pages

### Option A: Using Job System

```bash
# Get page IDs needing translation (must have OCR content, check for empty strings)
TRANS_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0) | .id]' /tmp/book.json)

# Create translation job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"batch_translate\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"model\": \"gemini-3-flash-preview\",
    \"language\": \"Latin\",
    \"page_ids\": $TRANS_IDS
  }"
```

### Option B: Using Lambda Workers (Recommended)

The Lambda FIFO queue automatically provides previous-page context for translation continuity:

```bash
# Queue translation for pages that have OCR but no translation
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "translation"}'
```

The translation Lambda worker processes pages sequentially via the FIFO queue and fetches the previous page's translation for context.

## Complete Book Processing Script

Process a single book through the full pipeline using Lambda workers:

```bash
#!/bin/bash
BOOK_ID="YOUR_BOOK_ID"
BASE_URL="https://sourcelibrary.org"

# 1. Fetch book data
echo "Fetching book..."
BOOK=$(curl -s "$BASE_URL/api/books/$BOOK_ID")
TITLE=$(echo "$BOOK" | jq -r '.title[0:40]')
echo "Processing: $TITLE"

# 2. Queue OCR (Lambda workers handle all pages automatically)
NEEDS_OCR=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length == 0)] | length')
if [ "$NEEDS_OCR" != "0" ]; then
  echo "Queueing OCR for $NEEDS_OCR pages..."
  curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
    -H "Content-Type: application/json" \
    -d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"ocr\"}"
  echo "OCR job queued!"
fi

# 3. Queue translation (after OCR completes — check /jobs page)
NEEDS_TRANS=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length')
if [ "$NEEDS_TRANS" != "0" ]; then
  echo "Queueing translation for $NEEDS_TRANS pages..."
  curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
    -H "Content-Type: application/json" \
    -d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"translation\"}"
  echo "Translation job queued!"
fi

echo "Jobs queued! Monitor progress at $BASE_URL/jobs"
```

## Fixing Bad OCR

When pages were OCR'd before cropped images existed, they contain text from both pages. Fix with:

```bash
# 1. Generate cropped images first (Step 2 above)

# 2. Find pages with bad OCR
BAD_OCR_IDS=$(jq '[.pages[] | select(.crop) | select(.ocr.data) |
  select(.ocr.data | test("two-page|spread"; "i")) | .id]' /tmp/book.json)

# 3. Re-OCR with overwrite via Lambda workers
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": [\"BOOK_ID\"], \"action\": \"ocr\", \"pageIds\": $BAD_OCR_IDS, \"overwrite\": true}"
```

## Processing All Books

Use the Lambda worker job system for bulk processing:

```bash
#!/bin/bash
BASE_URL="https://sourcelibrary.org"

# Get all book IDs
BOOK_IDS=$(curl -s "$BASE_URL/api/books" | jq '[.[].id]')

# Queue OCR for all books (Lambda workers handle parallelism and rate limiting)
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": $BOOK_IDS, \"action\": \"ocr\"}"

# After OCR completes, queue translation
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": $BOOK_IDS, \"action\": \"translation\"}"
```

Monitor progress at https://sourcelibrary.org/jobs

## Monitoring Progress

Check overall library status:

```bash
curl -s "https://sourcelibrary.org/api/books" | jq '[.[] | {
  title: .title[0:30],
  pages: .pages_count,
  ocr: .ocr_count,
  translated: .translation_count
}] | sort_by(-.pages)'
```

## Troubleshooting

### Empty Strings vs Null (CRITICAL)
In jq, empty strings `""` are truthy!
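The pitfall is easy to demonstrate with a stub page list (invented records; requires only `jq`):

```shell
# Three stub pages: empty-string OCR, real OCR, and no OCR record at all.
PAGES='[{"ocr": {"data": ""}}, {"ocr": {"data": "Caput I"}}, {"ocr": null}]'

# Naive truthiness check: the empty string slips through.
echo "$PAGES" | jq '[.[] | select(.ocr.data)] | length'                       # prints 2

# Length check: counts only pages with real OCR content.
echo "$PAGES" | jq '[.[] | select((.ocr.data // "") | length > 0)] | length'  # prints 1
```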
This means:
- `select(.ocr.data)` matches pages with `""` (WRONG)
- `select(.ocr.data | not)` does NOT match pages with `""` (WRONG)
- Use `select((.ocr.data // "") | length == 0)` to find missing/empty OCR
- Use `select((.ocr.data // "") | length > 0)` to find pages WITH OCR content

### Rate Limits (429 errors)

#### Gemini API Tiers
| Tier | RPM | How to Qualify |
|------|-----|----------------|
| Free | 15 | Default |
| Tier 1 | 300 | Enable billing + $50 spend |
| Tier 2 | 1000 | $250 spend |
| Tier 3 | 2000 | $1000 spend |

#### Optimal Sleep Times by Tier
| Tier | Max RPM | Safe Sleep Time | Effective Rate |
|------|---------|-----------------|----------------|
| Free | 15 | 4.0s | ~15/min |
| Tier 1 | 300 | 0.4s | ~150/min |
| Tier 2 | 1000 | 0.12s | ~500/min |
| Tier 3 | 2000 | 0.06s | ~1000/min |

**Note:** Use ~50% of the max rate to leave headroom for bursts.

#### API Key Rotation
The system supports multiple API keys for higher throughput:
- Set `GEMINI_API_KEY` (primary)
- Set `GEMINI_API_KEY_2`, `GEMINI_API_KEY_3`, ... up to `GEMINI_API_KEY_10`
- Keys rotate automatically with a 60s cooldown after a rate limit

With N keys at Tier 1 you get N × 300 max RPM, or roughly N × 150 safe requests/min at the 50% headroom rate.

### Function Timeouts
- Jobs have `maxDuration=300s` on Vercel Pro
- If hitting timeouts, reduce `CROP_CHUNK_SIZE` in job processing

### Missing Cropped Photos
- Check if the crop job completed successfully
- Verify the page has `crop` data with `xStart` and `xEnd`
- Re-run crop generation for specific pages

### Bad OCR Detection
Look for these patterns in OCR text indicating the wrong image was used:
- "two-page spread"
- "left page" / "right page" descriptions
- Duplicate text blocks
- References to facing pages
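These patterns can be folded into one case-insensitive scan. A self-contained sketch on a stub book (invented records; a real run would point at the `/tmp/book.json` saved in Step 1):

```shell
# Stub book: one page contaminated by the facing page, one clean page.
cat > /tmp/demo_book.json <<'EOF'
{"pages": [
  {"id": "p1", "ocr": {"data": "The left page shows a woodcut; the right page begins Caput I."}},
  {"id": "p2", "ocr": {"data": "Caput primum. De natura rerum."}}
]}
EOF

# Emit the IDs of pages whose OCR matches any bad-OCR pattern.
jq -r '[.pages[]
        | select((.ocr.data // "") | test("two-page spread|left page|right page|facing page"; "i"))
        | .id] | .[]' /tmp/demo_book.json
# prints p1
```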