Kreuzberg is a free, open-source AI agent skill.

How do I install Kreuzberg?

Install Kreuzberg with a single command: npx mdskills install kreuzberg-dev/kreuzberg. This downloads the skill files into your project and your AI agent picks them up automatically.

What platforms support Kreuzberg?

Kreuzberg works with Claude Code, Claude Desktop, Cursor, VS Code Copilot, Windsurf, Continue Dev, Codex, Gemini CLI, Amp, Roo Code, Goose, OpenCode, Trae, Qodo, and Command Code. Skills use the open SKILL.md format, which is compatible with any AI coding agent that reads markdown instructions.

Kreuzberg

Name: Kreuzberg: AI Agent Skill
Rating: 8 (1 review)
Author: kreuzberg-dev

Verified

File Processing · Intermediate

by @kreuzberg-dev · Updated 2/24/2026

Skill Advisor: 8.0

Comprehensive guide covering installation, API usage, configuration, and error handling across multiple languages.

+ Provides clear code examples for Python, Node.js, Rust, and the CLI, with both sync and async patterns
+ Includes a common pitfalls section that helps prevent typical integration mistakes
+ Covers configuration extensively, with OCR backend, chunking, and batch processing examples
- Does not include trigger conditions or agent decision-making context for when to use this skill
- Lacks examples of handling specific extraction scenarios or output validation patterns

SKILL.md

---
name: kreuzberg
description: >-
  Extract text, tables, metadata, and images from 75+ document formats
  (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
  Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
  Rust, or CLI. Covers installation, extraction (sync/async), configuration
  (OCR, chunking, output format), batch processing, error handling, and plugins.
license: MIT
metadata:
  author: kreuzberg-dev
  version: "1.0"
  repository: https://github.com/kreuzberg-dev/kreuzberg
---

# Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 75+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)

## Installation

### Python
```bash
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR
pip install kreuzberg[paddleocr]  # PaddleOCR
```

### Node.js
```bash
npm install @kreuzberg/node
```

### Rust
```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake
```

### CLI
```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```

## Quick Start

### Python (Async)
```python
import asyncio

from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)       # extracted text
    print(result.metadata)      # document metadata
    print(result.tables)        # extracted tables

asyncio.run(main())
```

### Python (Sync)
```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
```

### Node.js
```typescript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```

### Node.js (Sync)
```typescript
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');
```

### Rust (Async)
```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Rust (Sync, requires the `tokio-runtime` feature)
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### CLI
```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```

## Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

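To make the naming convention concrete, here is a minimal sketch that renames the keys of a snake_case config dictionary to the camelCase form used in the Node.js examples. The helpers `snake_to_camel` and `to_camel_config` are hypothetical names written for this illustration and are not part of Kreuzberg. The mapping is purely mechanical, so it does not cover fields whose names differ between bindings (for example the Rust chunking fields shown further down).

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case key such as 'max_chars' to camelCase ('maxChars')."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def to_camel_config(config):
    """Recursively rename dict keys; lists and scalar values pass through unchanged."""
    if isinstance(config, dict):
        return {snake_to_camel(key): to_camel_config(value) for key, value in config.items()}
    if isinstance(config, list):
        return [to_camel_config(item) for item in config]
    return config


# Plain-dict stand-in for the snake_case configuration (hypothetical helper input).
python_style = {
    "ocr": {"backend": "tesseract", "language": "eng"},
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}

node_style = to_camel_config(python_style)
print(node_style)
```

Only keys are renamed; values such as `"markdown"` or the password list are left untouched.
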
### Python (snake_case)
```python
from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig, extract_file,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

# await requires an async context
result = await extract_file("document.pdf", config=config)
```

### Node.js (camelCase)
```typescript
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    ocr: { backend: 'tesseract', language: 'eng' },
    pdfOptions: { passwords: ['secret123'] },
    chunking: { maxChars: 1000, maxOverlap: 200 },
    outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);
```

### Rust (snake_case)
```rust
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;
```

### Config File (TOML)
```toml
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]
```

```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```

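Because config files are expected to use snake_case keys, a small pre-flight check can catch naming slips before a file reaches the CLI. The sketch below works on an already-parsed dictionary and is an illustration only: `lint_config`, `SNAKE_CASE`, and `CHUNKING_RENAMES` are hypothetical names, not Kreuzberg APIs. It also assumes that the chunking names flagged under Common Pitfalls (`max_characters`, `overlap`) are wrong in config files as well.

```python
import re

# A key is accepted when it is lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Assumption: these names are wrong inside [chunking] in config files, mirroring the Python pitfall.
CHUNKING_RENAMES = {"max_characters": "max_chars", "overlap": "max_overlap"}


def lint_config(config: dict, path: str = "") -> list:
    """Return human-readable problems for suspicious keys in a parsed config."""
    problems = []
    for key, value in config.items():
        full = f"{path}.{key}" if path else key
        if not SNAKE_CASE.match(key):
            problems.append(f"{full}: not snake_case")
        if path == "chunking" and key in CHUNKING_RENAMES:
            problems.append(f"{full}: use {CHUNKING_RENAMES[key]}")
        if isinstance(value, dict):
            problems.extend(lint_config(value, full))
    return problems


good = {"output_format": "markdown", "chunking": {"max_chars": 1000, "max_overlap": 200}}
bad = {"outputFormat": "markdown", "chunking": {"max_characters": 1000, "max_overlap": 200}}

print(lint_config(good))
print(lint_config(bad))
```

On Python 3.11 or newer, the dictionary can come from the standard library's `tomllib.loads` applied to the contents of `kreuzberg.toml`.
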
## Batch Processing

### Python
```python
from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")
```

### Node.js
```typescript
import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
```

### Rust (requires the `tokio-runtime` feature)
```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```

### CLI
```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```

## OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

### Backends
- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): `pip install kreuzberg[paddleocr]`. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.

### Language Codes
```python
config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
```

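When the language list comes from user input, the combined string can be built programmatically. This is a small sketch that follows the `+` separator from the `eng+deu` example above; `tesseract_languages` is a hypothetical helper, not a Kreuzberg function, and it does not check whether a code is actually installed.

```python
def tesseract_languages(codes):
    """Join language codes with '+', dropping blanks and duplicates while keeping order."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


language = tesseract_languages(["eng", " deu", "eng", ""])
print(language)
```

The resulting string can then be passed as the `language` value of the OCR configuration.
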
### Force OCR
```python
config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
```

## ExtractionResult Fields

| Field | Python | Node.js | Rust | Description |
|-------|--------|---------|------|-------------|
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |

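Several of these fields are only populated when the matching option is enabled, so code that reads them should tolerate missing values. The following sketch summarizes a result defensively using the Python field names from the table. It assumes that disabled fields are either absent or `None` and that populated ones are sized collections; the real behavior may differ. `summarize` is a hypothetical helper, and `SimpleNamespace` stands in for a real result so the example runs without Kreuzberg installed.

```python
from types import SimpleNamespace

# Python attribute names taken from the table above.
OPTIONAL_FIELDS = (
    "tables", "detected_languages", "chunks",
    "images", "elements", "pages", "keywords",
)


def summarize(result) -> dict:
    """Count what a result carries, treating missing or None fields as empty."""
    summary = {
        "mime_type": getattr(result, "mime_type", None),
        "chars": len(getattr(result, "content", "") or ""),
    }
    for name in OPTIONAL_FIELDS:
        value = getattr(result, name, None)
        summary[name] = len(value) if value else 0
    return summary


# Stand-in object, not a real ExtractionResult.
fake_result = SimpleNamespace(
    content="Hello world",
    mime_type="application/pdf",
    tables=[{"markdown": "| a |"}],
    chunks=None,
)

summary = summarize(fake_result)
print(summary)
```
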
## Error Handling

### Python
```python
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
```

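When many files are processed one at a time, it is often more useful to record failures than to stop at the first one. The sketch below shows one way to do that around any extractor callable. `extract_many` and `fake_extract` are hypothetical names introduced here; the stand-in extractor exists only so the example runs without Kreuzberg installed.

```python
def extract_many(paths, extract, expected=(Exception,)):
    """Call extract(path) for each path; return (results, failures) keyed by path."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except expected as exc:
            # Keep the exception type so parse, OCR, and validation errors stay distinguishable.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(path):
    """Stand-in extractor: fails for '.bad' files, returns placeholder text otherwise."""
    if path.endswith(".bad"):
        raise ValueError("cannot parse")
    return f"text of {path}"


ok, failed = extract_many(["a.pdf", "b.bad"], fake_extract)
print(ok)
print(failed)
```

With Kreuzberg, the same wrapper could presumably be called with `extract_file_sync` as the extractor and `expected=(KreuzbergError,)`, so that unrelated exceptions still propagate.
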
### Node.js
```typescript
import {
    extractFile, KreuzbergError, ParsingError,
    OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
    const result = await extractFile('file.pdf');
} catch (e) {
    if (e instanceof ParsingError) { /* ... */ }
    else if (e instanceof OcrError) { /* ... */ }
    else if (e instanceof ValidationError) { /* ... */ }
    else if (e instanceof MissingDependencyError) { /* ... */ }
    else if (e instanceof KreuzbergError) { /* ... */ }
}
```

### Rust
```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
```

## Common Pitfalls

1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)`, where `mimeType` is the second argument (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).

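The two CLI flags in pitfall 5 are easy to mix up when commands are assembled in code. The sketch below builds an argument list that keeps them apart and rejects values passed to the wrong flag. `build_extract_command` is a hypothetical helper, the allowed values are copied from the pitfall text, and combining both flags in one invocation is an assumption rather than documented behavior.

```python
import shlex

# Allowed values as listed in pitfall 5.
CLI_FORMATS = {"text", "json"}                            # --format: how the CLI prints results
CONTENT_FORMATS = {"plain", "markdown", "djot", "html"}   # --output-format: extracted content format


def build_extract_command(path, cli_format=None, content_format=None):
    """Return an argv list for `kreuzberg extract`, validating each flag's value."""
    argv = ["kreuzberg", "extract", path]
    if cli_format is not None:
        if cli_format not in CLI_FORMATS:
            raise ValueError(f"--format expects one of {sorted(CLI_FORMATS)}, got {cli_format!r}")
        argv += ["--format", cli_format]
    if content_format is not None:
        if content_format not in CONTENT_FORMATS:
            raise ValueError(
                f"--output-format expects one of {sorted(CONTENT_FORMATS)}, got {content_format!r}"
            )
        argv += ["--output-format", content_format]
    return argv


command = build_extract_command("document.pdf", cli_format="json", content_format="markdown")
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting problems with unusual file names.
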
## Supported Formats (Summary)

| Category | Extensions |
|----------|-----------|
| **PDF** | `.pdf` |
| **Word** | `.docx`, `.odt` |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
| **eBooks** | `.epub`, `.fb2` |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
| **Email** | `.eml`, `.msg` |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |

See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.

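A table like this can be used to sort incoming files before extraction. The sketch below groups paths by category using a deliberately small subset of the rows above; `CATEGORIES`, `categorize`, and `group_by_category` are hypothetical names. A file landing in "Other" only means it is not in this subset, not that Kreuzberg cannot handle it.

```python
from pathlib import Path

# Subset of the summary table; extend with further rows as needed.
CATEGORIES = {
    "PDF": {".pdf"},
    "Word": {".docx", ".odt"},
    "Presentations": {".pptx", ".ppt", ".ppsx"},
    "Email": {".eml", ".msg"},
    "Archives": {".zip", ".tar", ".tgz", ".gz", ".7z"},
}

# Invert to a flat lookup from extension to category.
EXTENSION_TO_CATEGORY = {
    ext: category for category, extensions in CATEGORIES.items() for ext in extensions
}


def categorize(path):
    """Return the category for a path based on its final extension, or None."""
    return EXTENSION_TO_CATEGORY.get(Path(path).suffix.lower())


def group_by_category(paths):
    """Group paths by category; anything outside the subset goes under 'Other'."""
    groups = {}
    for path in paths:
        groups.setdefault(categorize(path) or "Other", []).append(path)
    return groups


groups = group_by_category(["a.pdf", "Report.PDF", "mail.eml", "notes.xyz"])
print(groups)
```

Matching is case-insensitive and looks only at the last suffix, so `backup.tar.gz` is classified through `.gz`.
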
## Additional Resources

Detailed reference files for specific topics:

- **[Python API Reference](references/python-api.md)**: All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)**: All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)**: All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)**: All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)**: TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)**: All 75+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)**: Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)**: Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg