<!-- mcp-name: io.github.D4Vinci/Scrapling -->

<h1 align="center">
    <a href="https://scrapling.readthedocs.io">
    <picture>
      <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
      <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
    </picture>
    </a>
    <br>
    <small>Effortless Web Scraping for the Modern Web</small>
</h1>

<p align="center">
    <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_AR.md">العربية</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_ES.md">Español</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_DE.md">Deutsch</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_CN.md">简体中文</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_JP.md">日本語</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_RU.md">Русский</a>
    <br/>
    <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
        <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
    <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
        <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
    <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
        <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
    <br/>
    <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
        <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
    </a>
    <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
        <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
    </a>
    <br/>
    <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
        <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
</p>

<p align="center">
    <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>Selection methods</strong></a>
    ·
    <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Fetchers</strong></a>
    ·
    <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
    ·
    <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a>
    ·
    <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
    ·
    <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>MCP</strong></a>
</p>

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming.
Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website under the radar!
products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = page.css('.product', adaptive=True)   # Later, if the website structure changes, pass `adaptive=True` to find them!
```
Or scale up to full crawls:
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```

# Platinum Sponsors

# Sponsors

<!-- sponsors -->

<a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
<a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
<a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
<a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
<a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>

<a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs and blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
<a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>

<!-- /sponsors -->

<i><sub>Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci) and choose the tier that suits you!</sub></i>

---

## Key Features

### Spiders — A Full Crawling Framework
- 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
- ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
- 🔄 **Multi-Session Support**: A unified interface for HTTP requests and stealthy headless browsers in a single spider — route requests to different sessions by ID.
- 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
- 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats — ideal for UIs, pipelines, and long-running crawls.
- 🛡️ **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic.
- 📦 **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()` / `result.items.to_jsonl()`.

### Advanced Website Fetching with Session Support
- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. It can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, supporting Playwright's Chromium and Google's Chrome.
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. It can easily bypass all types of Cloudflare's Turnstile/Interstitial challenges with automation.
- **Session Management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
- **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
- **Async Support**: Complete async support across all fetchers, plus dedicated async session classes.

### Adaptive Scraping & AI Integration
- 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 **Find Similar Elements**: Automatically locate elements similar to ones you have already found.
- 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc.), thereby speeding up operations and reducing costs by minimizing token usage.
  ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))

### High-Performance & Battle-Tested Architecture
- 🚀 **Lightning Fast**: Optimized performance that outperforms most Python scraping libraries.
- 🔋 **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint.
- ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
- 🏗️ **Battle-Tested**: Not only does Scrapling have 92% test coverage and full type-hint coverage, it has also been used daily by hundreds of Web Scrapers over the past year.

### Developer/Web Scraper Friendly Experience
- 🎯 **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up the development of Web Scraping scripts, such as converting curl commands to Scrapling requests and viewing request results in your browser.
- 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
- 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
- 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
- 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
- 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** on each change.
- 🔋 **Ready Docker Image**: With each release, a Docker image containing all browsers is automatically built and pushed.

## Getting Started

Let's give you a quick glimpse of what Scrapling can do without diving too deep.

### Basic Usage
HTTP requests with session support:
```python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
```
Advanced stealth mode:
```python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keeps the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use the one-off request style; it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
```
Full browser automation:
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keeps the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector, if you prefer it

# Or use the one-off request style; it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
```

### Spiders
Build full crawlers with concurrent requests, multiple session types, and pause/resume:
```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }

        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
Use multiple session types in a single spider:
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback
```
Pause and resume long crawls with checkpoints by running the spider like this:
```python
QuotesSpider(crawldir="./crawl_data").start()
```
Press Ctrl+C to pause gracefully — progress is saved automatically.
Later, when you start the spider again, pass the same `crawldir`, and it will resume from where it stopped.

### Advanced Parsing & Navigation
```python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')                        # CSS selector
quotes = page.xpath('//div[@class="quote"]')       # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find elements by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```
You can also use the parser directly, without fetching a website, like this:
```python
from scrapling.parser import Selector

page = Selector("<html>...</html>")
```
And it works in precisely the same way!

### Async Session Management Examples
```python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']

    for url in urls:
        task = session.fetch(url)
        tasks.append(task)

    print(session.get_pool_stats())  # Optional - the status of the browser tab pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
```

## CLI & Interactive Shell

Scrapling includes a powerful command-line interface:

[Watch the demo on asciinema](https://asciinema.org/a/736339)

Launch the interactive Web Scraping shell:
```bash
scrapling shell
```
Extract pages to a file directly, without programming (the content inside the `body` tag is extracted by default). If the output file name ends with `.txt`, the text content of the target is extracted; if it ends with `.md`, you get a Markdown representation of the HTML content; and if it ends with `.html`, the HTML content itself is saved.
```bash
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```

> [!NOTE]
> There are many additional features, such as the MCP server and the interactive Web Scraping shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/).

## Performance Benchmarks

Scrapling isn't just powerful—it's also blazing fast.
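Before the numbers, a note on methodology: figures like these are typically collected by timing many repetitions of the same extraction and averaging. A minimal harness in that spirit, using only the standard library (an illustrative sketch, not the project's `benchmarks.py`; the 5000-element document and the `html.parser`-based extractor are stand-ins):

```python
import timeit
from html.parser import HTMLParser

# Build a document with 5000 nested elements, mirroring the benchmark setup below
html = "<html><body>" + "<div>" * 5000 + "text" + "</div>" * 5000 + "</body></html>"

class TextCollector(HTMLParser):
    """Collects all text nodes, standing in for a real parser's text extraction."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(document: str) -> str:
    collector = TextCollector()
    collector.feed(document)
    return "".join(collector.chunks)

# Average over repeated runs (the real benchmarks average 100+ runs)
runs = 20
total = timeit.timeit(lambda: extract_text(html), number=runs)
print(f"avg: {total / runs * 1000:.2f} ms per run")
```

Swapping `extract_text` for each library's extraction call and keeping everything else fixed is what makes the per-library timings comparable.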
The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries.

### Text Extraction Speed Test (5000 nested elements)

| # | Library           | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling         | 2.02      | 1.0x         |
| 2 | Parsel/Scrapy     | 2.04      | 1.01x        |
| 3 | Raw Lxml          | 2.54      | 1.257x       |
| 4 | PyQuery           | 24.17     | ~12x         |
| 5 | Selectolax        | 82.63     | ~41x         |
| 6 | MechanicalSoup    | 1549.71   | ~767.1x      |
| 7 | BS4 with Lxml     | 1584.31   | ~784.3x      |
| 8 | BS4 with html5lib | 3391.91   | ~1679.1x     |

### Element Similarity & Text Search Performance

Scrapling's adaptive element-finding capabilities significantly outperform the alternatives:

| Library     | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
| Scrapling   | 2.39      | 1.0x         |
| AutoScraper | 12.45     | 5.209x       |

> All benchmarks represent averages of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology.

## Installation

Scrapling requires Python 3.10 or higher:

```bash
pip install scrapling
```

This installation includes only the parser engine and its dependencies, without any fetcher or command-line dependencies.

### Optional Dependencies

1. If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install the fetchers' dependencies and their browser dependencies as follows:
   ```bash
   pip install "scrapling[fetchers]"

   scrapling install
   ```

   This downloads all browsers, along with their system dependencies and fingerprint-manipulation dependencies.

2. Extra features:
   - Install the MCP server feature:
     ```bash
     pip install "scrapling[ai]"
     ```
   - Install the shell features (the Web Scraping shell and the `extract` command):
     ```bash
     pip install "scrapling[shell]"
     ```
   - Install everything:
     ```bash
     pip install "scrapling[all]"
     ```
   Remember that you need to install the browser dependencies with `scrapling install` after installing any of these extras (if you haven't already).

### Docker
You can also pull a Docker image with all extras and browsers from DockerHub:
```bash
docker pull pyd4vinci/scrapling
```
Or download it from the GitHub registry:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
This image is automatically built and pushed via GitHub Actions from the repository's main branch.

## Contributing

We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.

## Disclaimer

> [!CAUTION]
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.

## License

This work is licensed under the BSD-3-Clause License.

## Acknowledgments

This project includes code adapted from:
- Parsel (BSD License), used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule

---
<div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>