Add this skill: `npx mdskills install kreuzberg-dev/kreuzberg`

Comprehensive guide covering installation, API usage, configuration, and error handling across multiple languages.

---
name: kreuzberg
description: >-
  Extract text, tables, metadata, and images from 75+ document formats
  (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
  Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
  Rust, or CLI. Covers installation, extraction (sync/async), configuration
  (OCR, chunking, output format), batch processing, error handling, and plugins.
license: MIT
metadata:
  author: kreuzberg-dev
  version: "1.0"
  repository: https://github.com/kreuzberg-dev/kreuzberg
---

# Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 75+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)

## Installation

### Python
```bash
pip install kreuzberg
# Optional OCR backends (quoted so shells such as zsh do not expand the brackets):
pip install "kreuzberg[easyocr]"    # EasyOCR
pip install "kreuzberg[paddleocr]"  # PaddleOCR
```

### Node.js
```bash
npm install @kreuzberg/node
```

### Rust
```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
```

### CLI
```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```

## Quick Start

### Python (Async)
```python
from kreuzberg import extract_file

# Run inside an async function (or an asyncio-aware REPL).
result = await extract_file("document.pdf")
print(result.content)   # extracted text
print(result.metadata)  # document metadata
print(result.tables)    # extracted tables
```

### Python (Sync)
```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
```

### Node.js
```typescript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```

### Node.js (Sync)
```typescript
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');
```

### Rust (Async)
```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Rust (Sync, requires the `tokio-runtime` feature)
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### CLI
```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```

## Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

### Python (snake_case)
```python
from kreuzberg import (
    extract_file,
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

# Run inside an async function.
result = await extract_file("document.pdf", config=config)
```

### Node.js (camelCase)
```typescript
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
  ocr: { backend: 'tesseract', language: 'eng' },
  pdfOptions: { passwords: ['secret123'] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: 'markdown',
};

// The second argument is the MIME type; pass null to skip it.
const result = await extractFile('document.pdf', null, config);
```

### Rust (snake_case)
```rust
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

// Call from an async context.
let result = extract_file("document.pdf", None, &config).await?;
```

### Config File (TOML)
```toml
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]
```

```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```

## Batch Processing

### Python
```python
from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async (run inside an async function)
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")
```

### Node.js
```typescript
import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
```

### Rust (requires the `tokio-runtime` feature)
```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```

### CLI
```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```

## OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

### Backends
- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install "kreuzberg[easyocr]"`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): `pip install "kreuzberg[paddleocr]"`. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.

### Language Codes
```python
from kreuzberg import ExtractionConfig, OcrConfig

config = ExtractionConfig(ocr=OcrConfig(language="eng"))      # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))  # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))      # All installed
```

### Force OCR
```python
from kreuzberg import ExtractionConfig

config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
```

## ExtractionResult Fields

| Field | Python | Node.js | Rust | Description |
|-------|--------|---------|------|-------------|
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |

## Error Handling

### Python
```python
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

# Catch the specific errors first; KreuzbergError is the catch-all and goes last.
try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
```

### Node.js
```typescript
import {
  extractFile, KreuzbergError, ParsingError,
  OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
  const result = await extractFile('file.pdf');
} catch (e) {
  if (e instanceof ParsingError) { /* ... */ }
  else if (e instanceof OcrError) { /* ... */ }
  else if (e instanceof ValidationError) { /* ... */ }
  else if (e instanceof KreuzbergError) { /* ... */ }
}
```

### Rust
```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
```

## Common Pitfalls

1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)`. `mimeType` is the second argument (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).

## Supported Formats (Summary)

| Category | Extensions |
|----------|-----------|
| **PDF** | `.pdf` |
| **Word** | `.docx`, `.odt` |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
| **eBooks** | `.epub`, `.fb2` |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
| **Email** | `.eml`, `.msg` |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |

See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.

## Additional Resources

Detailed reference files for specific topics:

- **[Python API Reference](references/python-api.md)**: All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)**: All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)**: All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)**: All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)**: TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)**: All 75+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)**: Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)**: Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg
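
The `max_chars` and `max_overlap` chunking settings used throughout this guide can be pictured as a sliding window over the extracted text. The sketch below is only an illustration of that idea: `sliding_chunks` is a hypothetical helper, not part of Kreuzberg, and it assumes plain fixed-size character windows. Kreuzberg's own chunker may split on different boundaries, so treat the output as a mental model rather than a prediction of `result.chunks`.

```python
def sliding_chunks(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Hypothetical illustration only (not a Kreuzberg API): consecutive
    windows share max_overlap characters.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in the range [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    # With max_chars=4 and max_overlap=1, each chunk repeats the last
    # character of the previous one.
    for chunk in sliding_chunks("abcdefghij", max_chars=4, max_overlap=1):
        print(chunk)
```

Under this model, a larger `max_overlap` keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.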