# Content Core

**Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries, all through a unified interface with multiple integration options.

## What You Can Do

**Extract content from anywhere:**
- **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
- **Media** - Videos (MP4, AVI, MOV) with automatic transcription
- **Audio** - MP3, WAV, M4A with speech-to-text conversion
- **Web** - Any URL with intelligent content extraction
- **Images** - JPG, PNG, TIFF with OCR text recognition
- **Archives** - ZIP, TAR, GZ with content analysis

**Process with AI:**
- **Clean & format** extracted content automatically
- **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
- **Context-aware processing** - explain to a child, technical summary, action items
- **Smart engine selection** - automatically chooses the best extraction method

## Multiple Ways to Use

### Command Line (Zero Install)
```bash
# Extract content from any source
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" ccore document.pdf

# Generate AI summaries
uvx --from "content-core" csum video.mp4 --context "bullet points"
```

### Claude Desktop Integration
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.

### Raycast Extension
Smart auto-detection commands:
- **Extract Content** - Full interface with format options
- **Summarize Content** - 9 summary styles available
- **Quick Extract** - Instant clipboard extraction

### macOS Right-Click Integration
Right-click any file in Finder → Services → Extract or Summarize content instantly.

### Python Library
```python
import content_core as cc

# Extract from any source (run inside an async function or a notebook)
result = await cc.extract("https://example.com/article")
summary = await cc.summarize_content(result, context="explain to a child")
```

## Key Features

* **Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
* **Smart Engine Selection:**
  * **URLs:** Firecrawl → Jina → Crawl4AI (optional) → BeautifulSoup fallback chain
  * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
  * **Media:** OpenAI Whisper transcription
  * **Images:** OCR with multiple engine support
* **Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
* **Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
* **Zero-Install Options:** Use `uvx` for instant access without installation
* **AI-Powered Processing:** LLM integration for content cleaning and summarization
* **Asynchronous:** Built with `asyncio` for efficient processing
* **Pure Python Implementation:** No system dependencies required - simplified installation across all platforms

## Getting Started

### Installation

Install Content Core using `pip` - **no system dependencies required!**

```bash
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
pip install content-core

# With enhanced document processing (adds Docling)
# Extras are quoted so shells such as zsh do not expand the brackets
pip install "content-core[docling]"

# With local browser-based URL extraction (adds Crawl4AI)
# Note: Requires Playwright browsers (~300MB). Run:
pip install "content-core[crawl4ai]"
python -m playwright install --with-deps

# Full installation (with all optional features)
pip install "content-core[docling,crawl4ai]"
```

> **Note:** The core installation uses pure Python implementations and doesn't require system libraries like libmagic, ensuring consistent, hassle-free installation across Windows, macOS, and Linux. Optional features like Crawl4AI (browser automation) may require additional system dependencies.

Alternatively, if you're developing locally:

```bash
# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core

# Install with uv
uv sync
```

### Command-Line Interface

Content Core provides three CLI commands for extracting, cleaning, and summarizing content: `ccore`, `cclean`, and `csum`. These commands support input from text, URLs, files, or piped data (e.g., via `cat file | command`).

**Zero-install usage with uvx:**
```bash
# Extract content
uvx --from "content-core" ccore https://example.com

# Clean content
uvx --from "content-core" cclean "messy content"

# Summarize content
uvx --from "content-core" csum "long text" --context "bullet points"
```

#### ccore - Extract Content

Extracts content from text, URLs, or files, with optional formatting.

Usage:
```bash
ccore [-f|--format xml|json|text] [-d|--debug] [content]
```
Options:
- `-f`, `--format`: Output format (xml, json, or text). Default: text.
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content (text, URL, or file path). If omitted, reads from stdin.

Examples:

```bash
# Extract from a URL as text
ccore https://example.com

# Extract from a file as JSON
ccore -f json document.pdf

# Extract from piped text as XML
echo "Sample text" | ccore --format xml
```

#### cclean - Clean Content

Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths.

Usage:

```bash
cclean [-d|--debug] [content]
```

Options:
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

```bash
# Clean a text string
cclean " messy text "

# Clean piped JSON
echo '{"content": " messy text "}' | cclean

# Clean content from a URL
cclean https://example.com

# Clean a file's content
cclean document.txt
```

#### csum - Summarize Content

Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.

Usage:

```bash
csum [--context "context text"] [-d|--debug] [content]
```

Options:
- `--context`: Context for summarization (e.g., "explain to a child"). Default: none.
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

```bash
# Summarize text
csum "AI is transforming industries."

# Summarize with context
csum --context "in bullet points" "AI is transforming industries."

# Summarize piped content
cat article.txt | csum --context "one sentence"

# Summarize content from URL
csum https://example.com

# Summarize a file's content
csum document.txt
```

## Quick Start

You can quickly integrate `content-core` into your Python projects to extract, clean, and summarize content from various sources.

```python
import content_core as cc

# The calls below use `await`, so run them inside an async function
# (or in a notebook, which supports top-level await).

# Extract content from a URL, file, or text
result = await cc.extract("https://example.com/article")

# Clean messy content
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")

# Summarize content with optional context
summary = await cc.summarize_content("long article text", context="explain to a child")

# Extract audio with custom speech-to-text model
from content_core.common import ProcessSourceInput
result = await cc.extract(ProcessSourceInput(
    file_path="interview.mp3",
    audio_provider="openai",
    audio_model="whisper-1"
))
```

## Documentation

For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our [Usage Documentation](docs/usage.md).

## MCP Server Integration

Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.

<a href="https://glama.ai/mcp/servers/@lfnovo/content-core">
  <img width="380" height="200" src="https://glama.ai/mcp/servers/@lfnovo/content-core/badge" />
</a>

### Quick Setup with Claude Desktop

```bash
# Install Content Core (MCP server included)
pip install content-core

# Or use directly with uvx (no installation required)
uvx --from "content-core" content-core-mcp
```

Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
```

For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).

## Enhanced PDF Processing

Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.

### Key Improvements

- **Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
- **Automatic Table Detection**: Tables converted to markdown format for LLM consumption
- **Quality Text Rendering**: Better ligature, whitespace, and image-text integration
- **Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)

### Configuration for Scientific Documents

For documents with heavy mathematical content, enable OCR enhancement:

```yaml
# In cc_config.yaml
extraction:
  pymupdf:
    enable_formula_ocr: true  # Enable OCR for formula-heavy pages
    formula_threshold: 3      # Min formulas per page to trigger OCR
    ocr_fallback: true        # Graceful fallback if OCR fails
```

```python
# Runtime configuration
from content_core.config import set_pymupdf_ocr_enabled
set_pymupdf_ocr_enabled(True)
```

### Requirements for OCR Enhancement

```bash
# Install Tesseract OCR (optional, for formula enhancement)
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr
```

**Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.

## macOS Services Integration

Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.

### Available Services

Create **4 convenient services** for different workflows:

- **Extract Content → Clipboard** - Quick copy for immediate pasting
- **Extract Content → TextEdit** - Review before using
- **Summarize Content → Clipboard** - Quick summary copying
- **Summarize Content → TextEdit** - Formatted summary with headers

### Quick Setup

1. **Install uv** (if not already installed):
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

2. **Create services manually** using Automator (5 minutes setup)

### Usage

**Right-click any supported file** in Finder → **Services** → Choose your option:

- **PDFs, Word docs** - Instant text extraction
- **Videos, audio files** - Automatic transcription
- **Images** - OCR text recognition
- **Web content** - Clean text extraction
- **Multiple files** - Batch processing support

### Features

- **Zero-install processing**: Uses `uvx` for isolated execution
- **Multiple output options**: Clipboard or TextEdit display
- **System notifications**: Visual feedback on completion
- **Wide format support**: 20+ file types supported
- **Batch processing**: Handle multiple files at once
- **Keyboard shortcuts**: Assignable hotkeys for power users

For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).

## Raycast Extension

Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.

### Quick Setup

**From Raycast Store** (coming soon):
1. Open Raycast and search for "Content Core"
2. Install the extension by `luis_novo`
3. Configure API keys in preferences

**Manual Installation**:
1. Download the extension from the repository
2. Open Raycast → "Import Extension"
3. Select the `raycast-content-core` folder

### Commands

**Extract Content** - Smart URL/file detection with full interface
- Auto-detects URLs vs file paths in real-time
- Multiple output formats (Text, JSON, XML)
- Drag & drop support for files
- Rich results view with metadata

**Summarize Content** - AI-powered summaries with customizable styles
- 9 different summary styles (bullet points, executive summary, etc.)
- Auto-detects source type with visual feedback
- One-click snippet creation and quicklinks

**Quick Extract** - Instant extraction to clipboard
- Type → Tab → Paste source → Enter
- No UI, works directly from command bar
- Perfect for quick workflows

### Features

- **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
- **Zero Installation**: Uses `uvx` for Content Core execution
- **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
- **All File Types**: Documents, videos, audio, images, archives
- **Visual Feedback**: Real-time type detection with icons

For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).

## Using with LangChain

For users integrating with the [LangChain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your LangChain agents and chains.

You can import and use these tools like any other LangChain tool. For example:

```python
from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
from langchain.agents import initialize_agent, AgentType

# `llm` must be a LangChain-compatible model instance that you have already created
tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Extract the content from https://example.com and then summarize it.")
```

Refer to the source code in `src/content_core/tools` for specific tool implementations and usage details.

## Basic Usage

The core functionality revolves around the `extract_content` function.

```python
import asyncio
from content_core.extraction import extract_content

async def main():
    # Extract from raw text
    text_data = await extract_content({"content": "This is my sample text content."})
    print(text_data)

    # Extract from a URL (uses 'auto' engine by default)
    url_data = await extract_content({"url": "https://www.example.com"})
    print(url_data)

    # Extract from a local video file (gets transcript, engine='auto' by default)
    video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
    print(video_data)

    # Extract from a local markdown file (engine='auto' by default)
    md_data = await extract_content({"file_path": "path/to/your/document.md"})
    print(md_data)

    # Per-execution override with Docling for documents
    doc_data = await extract_content({
        "file_path": "path/to/your/document.pdf",
        "document_engine": "docling",
        "output_format": "html"
    })
    print(doc_data)

    # Per-execution override with Firecrawl for URLs
    url_data = await extract_content({
        "url": "https://www.example.com",
        "url_engine": "firecrawl"
    })
    print(url_data)

if __name__ == "__main__":
    asyncio.run(main())
```

(See `src/content_core/notebooks/run.ipynb` for more detailed examples.)

## Docling Integration

Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).

### Enabling Docling

Docling is an optional dependency and is not part of the basic installation. With the default `auto` document engine, the fallback chain listed under Key Features tries Docling first, so if you don't want to use it, set the document engine to `simple`. To force Docling, set the engine to `docling` as shown here.

#### Via configuration file

In your `cc_config.yaml` or custom config, set:
```yaml
extraction:
  document_engine: docling  # 'auto' (default), 'simple', or 'docling'
  url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
  firecrawl:
    api_url: null           # Custom API URL for self-hosted Firecrawl
  docling:
    output_format: markdown # markdown | html | json
```

#### Programmatically in Python

```python
import content_core as cc
from content_core.config import set_document_engine, set_url_engine, set_docling_output_format

# switch document engine to Docling
set_document_engine("docling")

# switch URL engine to Firecrawl
set_url_engine("firecrawl")

# choose output format: 'markdown', 'html', or 'json'
set_docling_output_format("html")

# subsequent extractions use the engines configured above
# (run inside an async function or a notebook)
result = await cc.extract("document.pdf")
```

## Configuration

Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or `.env` files, loaded automatically via `python-dotenv`.

Example `.env`:

```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here

# Engine Selection (optional)
CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina, crawl4ai

# Audio Processing (optional)
CCORE_AUDIO_CONCURRENCY=3   # Number of concurrent audio transcriptions (1-10, default: 3)

# Esperanto Timeout Configuration (optional)
ESPERANTO_LLM_TIMEOUT=300   # Language model timeout in seconds (default: 300, max: 3600)
ESPERANTO_STT_TIMEOUT=3600  # Speech-to-text timeout in seconds (default: 3600, max: 3600)
```

### Engine Selection via Environment Variables

For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:

- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`, `crawl4ai`)
- **`CCORE_AUDIO_CONCURRENCY`**: Number of concurrent audio transcriptions (1-10, default: 3)

These variables take precedence over config file settings and provide explicit control for different deployment scenarios.

### Audio Processing Configuration

Content Core processes long audio files by splitting them into segments and transcribing them in parallel for improved performance. You can control the concurrency level to balance speed with API rate limits:

- **Default**: 3 concurrent transcriptions
- **Range**: 1-10 concurrent transcriptions
- **Configuration**: Set via `CCORE_AUDIO_CONCURRENCY` environment variable or `extraction.audio.concurrency` in `cc_config.yaml`

Higher concurrency values can speed up processing of long audio/video files but may hit API rate limits. Lower values are more conservative and suitable for accounts with lower API quotas.

### Retry Configuration

Content Core includes automatic retry logic for transient failures in external operations (network requests, API calls, transcription). Retries use exponential backoff with jitter to handle temporary issues gracefully.

**Supported operations:**
- `youtube` - YouTube video title and transcript fetching (5 retries, 2-60s backoff)
- `url_api` - URL extraction via Jina/Firecrawl APIs (3 retries, 1-30s backoff)
- `url_network` - Network operations like HEAD requests, BeautifulSoup (3 retries, 0.5-10s backoff)
- `audio` - Audio transcription API calls (3 retries, 2-30s backoff)
- `llm` - LLM API calls for cleanup/summary (3 retries, 1-30s backoff)
- `download` - Remote file downloads (3 retries, 1-15s backoff)

**Environment variable overrides:**
```bash
# Override retry settings per operation type
CCORE_YOUTUBE_MAX_RETRIES=10     # Max retry attempts (1-20)
CCORE_YOUTUBE_BASE_DELAY=3       # Base delay in seconds (0.1-60)
CCORE_YOUTUBE_MAX_DELAY=120      # Max delay in seconds (1-300)

# Same pattern for other operations:
CCORE_URL_API_MAX_RETRIES=5
CCORE_AUDIO_MAX_RETRIES=5
CCORE_LLM_MAX_RETRIES=5
CCORE_DOWNLOAD_MAX_RETRIES=5
```

For detailed configuration, see our [Usage Documentation](docs/usage.md#retry-configuration).

### Proxy Configuration

Content Core supports HTTP/HTTPS proxy configuration through standard environment variables, consistent with most HTTP clients.

**Quick Start:**

```bash
# Set standard proxy environment variables
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080

# With authentication
export HTTP_PROXY=http://user:password@proxy.example.com:8080

# Bypass proxy for specific hosts
export NO_PROXY=localhost,127.0.0.1,internal.example.com
```

All Content Core network requests automatically use these environment variables.

**Supported Services:**
- All aiohttp requests (URL extraction, downloads)
- YouTube transcript/title fetching (pytubefix, youtube-transcript-api)
- Crawl4AI browser automation
- Esperanto AI models (LLM, speech-to-text)

**Note:** Firecrawl does not support client-side proxy configuration. Configure proxy on the Firecrawl server side instead.

For detailed configuration, see our [Usage Documentation](docs/usage.md#proxy-configuration).

### Timeout Configuration

Content Core uses the Esperanto library for AI model interactions and supports configurable timeouts for different operations. Timeouts prevent requests from hanging indefinitely and ensure reliable processing.

**Configuration Methods** (in priority order):

1. **Config Files** (highest priority): Set in `cc_config.yaml` or `models_config.yaml`
2. **Environment Variables**: Provide global defaults via `ESPERANTO_LLM_TIMEOUT` and `ESPERANTO_STT_TIMEOUT` when a timeout isn't specified in configuration files

**Default Timeouts:**

- **Speech-to-Text**: 3600 seconds (1 hour) - for very long audio files
- **Language Models**: 300-600 seconds - for content processing operations
- **Cleanup Model**: 600 seconds (10 minutes) - handles large content with 8000 max tokens
- **Summary Model**: 300 seconds (5 minutes) - for content summarization

**Environment Variable Overrides:**

```bash
# Override language model timeout globally (used when config files omit a timeout)
export ESPERANTO_LLM_TIMEOUT=300

# Override speech-to-text timeout globally (used when config files omit a timeout)
export ESPERANTO_STT_TIMEOUT=3600
```

**Valid Range:** 1 to 3600 seconds (1 hour maximum)

For more details on Esperanto timeout configuration, see the [Esperanto documentation](https://github.com/lfnovo/esperanto/blob/main/docs/advanced/timeout-configuration.md).

### Custom Prompt Templates

Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.

Example `.env` with custom prompt path:

```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
PROMPT_PATH=/path/to/your/custom/prompts
```

When a prompt template is requested, Content Core first looks in the custom directory specified by `PROMPT_PATH` (if it is set and the directory exists). If the template is not found there, it falls back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.

## Development

To set up a development environment:

```bash
# Clone the repository
git clone <repository-url>
cd content-core

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync --group dev

# Run tests
make test

# Lint code
make lint

# See all commands
make help
```

## License

This project is licensed under the [MIT License](LICENSE). See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please see our [Contributing Guide](CONTRIBUTING.md) for more details on how to get started.
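
## Appendix: Retry Backoff Sketch

The Retry Configuration section says only that retries use exponential backoff with jitter, bounded by a base delay and a maximum delay. The sketch below shows one common way such delays are computed (the "full jitter" variant). It is an illustration under that assumption, not Content Core's actual implementation: the function name `backoff_delays` and its parameters are hypothetical and do not exist in the library.

```python
import random


def backoff_delays(max_retries, base_delay, max_delay, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Hypothetical helper for illustration only; not part of content-core.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles on every attempt but never exceeds max_delay.
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # Full jitter: pick a delay uniformly between 0 and the ceiling,
        # so concurrent clients do not all retry at the same moment.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Values mirror the documented `youtube` defaults (5 retries, 2-60s backoff).
    for attempt, delay in enumerate(backoff_delays(5, 2.0, 60.0), start=1):
        print(f"retry {attempt}: sleep {delay:.2f}s")
```

Under this scheme, raising `CCORE_YOUTUBE_BASE_DELAY` would raise the ceiling of the early attempts, while `CCORE_YOUTUBE_MAX_DELAY` would cap the later ones.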