scrapy-mcp
A headless web-scraping MCP server built on Scrapy, providing tools for polite fetching, CSS/XPath extraction, link/table extraction, sitemap and robots.txt reading, and bounded asynchronous crawls.
A headless web-scraping MCP server built on Scrapy. It exposes Scrapy's scraping primitives — polite fetching, CSS/XPath extraction, link and table extraction, sitemap and robots.txt reading, and bounded asynchronous crawls — as MCP tools an agent can call over stdio.
- Headless, no rendering. Pages are fetched and parsed as HTML; no browser, no JavaScript execution. This keeps the footprint tiny — it runs comfortably on weak machines.
- Reactor-safe. Every operation runs in a short-lived Scrapy subprocess, so Twisted's reactor never lives inside the asyncio MCP server (no ReactorNotRestartable), and memory is reclaimed after each call.
- Polite by default. Obeys robots.txt, throttles with AutoThrottle, and enforces hard page/depth caps so a crawl can't run away.
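The reactor-safe point can be illustrated with a minimal sketch. This is not scrapy-mcp's actual code: `run_isolated` and the inline child script are hypothetical, and the child only echoes JSON where a real worker would run a Scrapy spider. It shows the general pattern of an asyncio parent launching a fresh Python process per call and reading a single JSON result from its stdout.

```python
import asyncio
import json
import sys

# Stand-in for a worker that would start Twisted's reactor and run one spider.
# ASSUMPTION: here it only echoes its argument back as JSON, for illustration.
CHILD_SCRIPT = (
    "import json, sys; "
    "print(json.dumps({'url': sys.argv[1], 'ok': True}))"
)


async def run_isolated(url: str, timeout: float = 60.0) -> dict:
    """Run one operation in a short-lived child process and return its JSON result."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_SCRIPT, url,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Enforce the wall-clock cap: kill the child and reap it before re-raising.
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(stderr.decode(errors="replace"))
    return json.loads(stdout)


def demo() -> dict:
    return asyncio.run(run_isolated("https://example.com/"))
```

Because the child exits after every call, any reactor it started dies with it, and the operating system reclaims its memory.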
Install / run
Run straight from PyPI with uv — no install step:
uvx scrapy-mcp
Or install it:
uv pip install scrapy-mcp
scrapy-mcp
The server speaks MCP over stdio. Point any MCP client at it. For Claude Desktop, add this to
claude_desktop_config.json:
{
  "mcpServers": {
    "scrapy": {
      "command": "uvx",
      "args": ["scrapy-mcp"]
    }
  }
}
Tools
| Tool | What it does |
|---|---|
| fetch_page(url, format, max_bytes, obey_robots) | Fetch one page as markdown (default), text, or html. |
| extract(url, selectors, obey_robots) | Pull structured fields with CSS/XPath selectors. |
| extract_tables(url, max_tables, obey_robots) | Extract every HTML <table> as {headers, rows}. |
| extract_links(url, same_domain, pattern, limit, obey_robots) | List de-duplicated links on a page. |
| get_sitemap(url, limit, obey_robots) | Read a sitemap (gzip + sitemap-index aware). |
| check_robots(url, user_agent) | Is a URL crawlable? Returns the crawl-delay and sitemaps. |
| start_crawl(start_url, allow_patterns, deny_patterns, max_pages, max_depth, same_domain, selectors, ...) | Start a bounded BFS crawl; returns a job_id. |
| crawl_status(job_id) | State + pages scraped for a crawl. |
| crawl_results(job_id, cursor, limit) | Page through a crawl's scraped items. |
| cancel_crawl(job_id) | Stop a running crawl; keep results so far. |
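For a feel of what a robots.txt check involves, here is a standard-library sketch built on `urllib.robotparser`. It is an illustration only and does not claim to be how check_robots is implemented; the `robots_summary` helper, its return keys, and the sample robots.txt are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so no network is needed.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""


def robots_summary(robots_txt: str, url: str, user_agent: str = "*") -> dict:
    """Report whether url may be fetched, plus the crawl-delay and sitemaps."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(user_agent, url),
        "crawl_delay": parser.crawl_delay(user_agent),  # None if not declared
        "sitemaps": parser.site_maps() or [],           # site_maps() needs Python 3.8+
    }
```

Calling `robots_summary(SAMPLE_ROBOTS, "https://example.com/private/x")` reports the URL as disallowed, while a path outside /private/ is allowed.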
Selector format (extract / start_crawl)
selectors maps an output field to a selector. Each value is either a CSS string (first
match) or an object for more control:
{
  "title": "h1::text",
  "price": "span.price::text",
  "all_links": {"css": "a::attr(href)", "all": true},
  "first_heading": {"xpath": "//h1/text()"}
}
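One way to read this format is as a small normalization step that turns each value into an explicit triple. The sketch below is hypothetical: `SelectorSpec` and `normalize_selector` are illustrative names, and preferring css when an object carries both keys is an assumption, not documented behavior.

```python
from typing import NamedTuple, Union


class SelectorSpec(NamedTuple):
    kind: str          # "css" or "xpath"
    query: str
    all_matches: bool  # True -> list of every match, False -> first match only


def normalize_selector(value: Union[str, dict]) -> SelectorSpec:
    """Turn one selectors value into an explicit (kind, query, all_matches) triple."""
    if isinstance(value, str):
        # A bare string is a CSS selector that yields the first match.
        return SelectorSpec("css", value, False)
    want_all = bool(value.get("all", False))
    if "css" in value:  # ASSUMPTION: css wins if both keys are present
        return SelectorSpec("css", value["css"], want_all)
    if "xpath" in value:
        return SelectorSpec("xpath", value["xpath"], want_all)
    raise ValueError("selector object needs a 'css' or 'xpath' key")


selectors = {
    "title": "h1::text",
    "all_links": {"css": "a::attr(href)", "all": True},
    "first_heading": {"xpath": "//h1/text()"},
}
normalized = {field: normalize_selector(v) for field, v in selectors.items()}
```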
"all": true returns every match as a list; otherwise the first match is returned.
Crawls are asynchronous
start_crawl returns immediately with a job_id. The crawl runs as a detached worker that
streams results to disk, so it survives a server restart. Poll crawl_status(job_id), then
read items with crawl_results(job_id) (safe to call mid-crawl for partial results). Jobs are
stored under the system temp dir and reclaimed after 7 days (configurable).
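The poll-then-page workflow might look like the sketch below. Everything about the payloads is an assumption made for illustration: the `"state"`, `"items"`, and `"next_cursor"` field names and the `"running"` value are not taken from the tool's documented output, `call_tool` stands in for whatever your MCP client exposes, and `make_fake_server` is an in-memory stub used only to exercise the loop.

```python
import time
from typing import Any, Callable, Dict, List

ToolCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def wait_and_collect(call_tool: ToolCall, job_id: str,
                     poll_seconds: float = 2.0, page_size: int = 50) -> List[dict]:
    """Poll crawl_status until the job stops running, then page through crawl_results."""
    # ASSUMPTION: the status payload has a "state" field; "running" means in progress.
    while call_tool("crawl_status", {"job_id": job_id}).get("state") == "running":
        time.sleep(poll_seconds)

    items: List[dict] = []
    cursor = 0
    while True:
        page = call_tool("crawl_results",
                         {"job_id": job_id, "cursor": cursor, "limit": page_size})
        # ASSUMPTION: "items" and "next_cursor" are illustrative field names.
        items.extend(page.get("items", []))
        next_cursor = page.get("next_cursor")
        if next_cursor is None:
            return items
        cursor = next_cursor


def make_fake_server(all_items: List[dict], polls_until_done: int = 2) -> ToolCall:
    """In-memory stand-in for an MCP session, used only to exercise the loop."""
    remaining = {"polls": polls_until_done}

    def call_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if name == "crawl_status":
            remaining["polls"] -= 1
            return {"state": "running" if remaining["polls"] > 0 else "finished"}
        if name == "crawl_results":
            start, end = args["cursor"], args["cursor"] + args["limit"]
            return {"items": all_items[start:end],
                    "next_cursor": end if end < len(all_items) else None}
        raise ValueError(f"unexpected tool: {name}")

    return call_tool
```

Check the real response shapes from crawl_status and crawl_results before adapting this; only the tool names and their job_id, cursor, and limit parameters come from the table above.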
Configuration
All settings are optional environment variables (sensible, polite defaults tuned for a weak
host). They're how you tune a uvx scrapy-mcp deployment.
| Variable | Default | Meaning |
|---|---|---|
| SCRAPY_MCP_USER_AGENT | scrapy-mcp/<version> … | User-Agent header. |
| SCRAPY_MCP_OBEY_ROBOTS | true | Obey robots.txt. |
| SCRAPY_MCP_DOWNLOAD_DELAY | 0.5 | Seconds between requests to a host. |
| SCRAPY_MCP_CONCURRENT_REQUESTS | 8 | Global concurrency. |
| SCRAPY_MCP_CONCURRENT_REQUESTS_PER_DOMAIN | 4 | Per-host concurrency. |
| SCRAPY_MCP_DOWNLOAD_TIMEOUT | 30 | Per-request timeout (s). |
| SCRAPY_MCP_RETRY_TIMES | 2 | Retries on transient failures. |
| SCRAPY_MCP_AUTOTHROTTLE | true | Adapt delay to server latency. |
| SCRAPY_MCP_MAX_BYTES | 50000 | Max characters returned per page (then truncated). |
| SCRAPY_MCP_REQUEST_TIMEOUT | 60 | Wall-clock cap for a blocking single fetch (s). |
| SCRAPY_MCP_DEFAULT_MAX_PAGES / _MAX_PAGES_CAP | 50 / 1000 | Crawl page default / hard cap. |
| SCRAPY_MCP_DEFAULT_MAX_DEPTH / _MAX_DEPTH_CAP | 2 / 10 | Crawl depth default / hard cap. |
| SCRAPY_MCP_JOB_DIR | <tmp>/scrapy_mcp_jobs | Where crawl jobs are stored. |
| SCRAPY_MCP_JOB_TTL_DAYS | 7 | Delete crawl jobs older than this (0 disables). |
| SCRAPY_MCP_LOG_LEVEL | ERROR | Scrapy log level (to stderr). |
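As a sketch of how these variables could be passed to a uvx launch, the snippet below builds a client config entry with an `env` map. It assumes your MCP client accepts per-server environment variables under an `env` key, as Claude Desktop's config does; `server_entry` is a hypothetical helper and the override values are arbitrary examples.

```python
import json


def _env_value(value: object) -> str:
    """Render booleans as lowercase true/false, matching the defaults in the table."""
    return str(value).lower() if isinstance(value, bool) else str(value)


def server_entry(**overrides: object) -> dict:
    """Build an mcpServers config whose env holds SCRAPY_MCP_* overrides."""
    env = {f"SCRAPY_MCP_{key.upper()}": _env_value(value)
           for key, value in overrides.items()}
    return {
        "mcpServers": {
            "scrapy": {"command": "uvx", "args": ["scrapy-mcp"], "env": env}
        }
    }


# Example: a slower, quieter-on-the-network profile with more verbose logging.
config = server_entry(download_delay=1.0, concurrent_requests=4, log_level="INFO")
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into the client's config file in place of the plain entry shown in the install section.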
Development
uv venv
uv pip install -e ".[dev]"
uv run pytest # unit tests (no network)
uv build # build wheel + sdist into dist/
License
MIT © Eitan Hadar