webcrawl-mcp
Lightweight MCP server for web scraping, search, and crawling. Uses local trafilatura/DuckDuckGo by default with optional Firecrawl fallback for transport-blocked pages.
A lightweight MCP server that gives Claude Code (or any MCP client) the ability to scrape, search, map, and crawl the web — using free, open-source libraries. Firecrawl is supported as an optional fallback for JS-heavy sites when you have a key.
Why
Most scraping doesn't actually need a headless browser. trafilatura handles the ~80% case (articles, docs, blogs) locally, which is faster and keeps external API usage to a minimum. This server routes the easy stuff through local extraction and only falls back to Firecrawl when content quality is genuinely poor.
Tools
| Tool | Purpose |
|---|---|
| webcrawl_scrape | Fetch a single URL → {content, source} (markdown + provenance) |
| webcrawl_search | DuckDuckGo search (optionally scrape results, each with provenance) |
| webcrawl_map | Discover same-domain URLs from a starting page |
| webcrawl_crawl | BFS crawl multiple pages (each result includes provenance) |
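To make "same-domain" and "BFS" concrete, here is a minimal sketch of a breadth-first crawl restricted to the starting page's domain. It is an illustration, not the server's actual code: the function name `bfs_crawl`, the `get_links` callback, and the `max_pages` parameter are all hypothetical, and link fetching is injected so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],  # hypothetical: returns hrefs found on a page
    max_pages: int = 10,                         # hypothetical page budget
) -> List[str]:
    """Visit pages breadth-first, following only links on the start URL's domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" so duplicates collapse.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A fake four-page site stands in for real HTTP fetches.
fake_site = {
    "https://example.com/": ["/a", "/b", "https://other.org/x"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": ["/a#top"],
    "https://example.com/c": [],
}
order = bfs_crawl("https://example.com/", lambda u: fake_site.get(u, []))
print(order)
```

Pages are visited level by level, the off-domain link is skipped, and `/a#top` is treated as the already-seen `/a`.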
The source field on scraped content is one of static_http, static_http_retry, firecrawl_transport_fallback, or firecrawl_quality_fallback — see Fallback behavior.
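If you post-process tool results yourself, the source value tells you whether a page was extracted locally or came through Firecrawl. The sketch below assumes a result is a plain dict with content and source keys, as in the table above; the helper name `used_firecrawl` is hypothetical and not part of the package.

```python
# The four provenance values listed above, grouped by where the content came from.
LOCAL_SOURCES = frozenset({"static_http", "static_http_retry"})
FIRECRAWL_SOURCES = frozenset(
    {"firecrawl_transport_fallback", "firecrawl_quality_fallback"}
)


def used_firecrawl(result: dict) -> bool:
    """Return True if a scrape result was produced by a Firecrawl fallback."""
    source = result["source"]
    if source not in LOCAL_SOURCES | FIRECRAWL_SOURCES:
        raise ValueError(f"unexpected source value: {source!r}")
    return source in FIRECRAWL_SOURCES


# Example results shaped like {content, source}; the content strings are made up.
print(used_firecrawl({"content": "# Title", "source": "static_http"}))
print(used_firecrawl({"content": "# Title", "source": "firecrawl_quality_fallback"}))
```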
Install
pip install webcrawl-mcp
Requires Python 3.12+.
Quick smoke test (should print Webcrawl MCP server running then exit cleanly with Ctrl-C):
webcrawl-mcp
Install from source (for development)
git clone https://github.com/andyliszewski/webcrawl-mcp.git
cd webcrawl-mcp
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .
Configure your MCP client
Claude Code
Create .mcp.json in your project root (or merge into ~/.claude/settings.json):
{
  "mcpServers": {
    "webcrawl": {
      "command": "uvx",
      "args": ["webcrawl-mcp"]
    }
  }
}
This uses uvx to run the package in a temporary environment — no manual install needed. If uvx is unavailable, install via pip install webcrawl-mcp and use "command": "webcrawl-mcp" with no args instead.
For a source checkout (development), point command at your venv's Python and use "args": ["-m", "webcrawl_mcp"].
Then in a Claude Code session run /mcp — you should see webcrawl listed with four tools.
Claude Desktop
Same JSON shape, placed in claude_desktop_config.json (see Anthropic's docs for the OS-specific path). Restart Claude Desktop after editing.
Other MCP clients (Cursor, Cline, Continue, Zed, etc.)
The command / args / env shape is standardized. Consult your client's MCP docs for where to put it.
Verify it's working
Ask your agent something like:
Use the webcrawl_scrape tool to fetch https://docs.python.org/3/library/asyncio.html and summarize the first section.
You should see the tool invocation in the client UI, followed by a summary grounded in the live page content. If nothing happens, see Troubleshooting.
Environment variables
All optional:
| Variable | Default | Purpose |
|---|---|---|
| USER_AGENT | Mozilla/5.0 (compatible; WebcrawlMCP/1.0; …) | HTTP User-Agent |
| REQUEST_TIMEOUT | 30 | Seconds before request timeout |
| FIRECRAWL_API_KEY | (unset) | If set, enables Firecrawl fallback for low-quality extractions. Leave unset for a fully free setup. |
| FIRECRAWL_API_URL | https://api.firecrawl.dev/v1 | Firecrawl endpoint |
| FALLBACK_ON_TRANSPORT_ERROR | false | If true and FIRECRAWL_API_KEY is set, route bot-blocked statuses (403, 429, 503) to Firecrawl instead of raising. Opt-in. |
| POLITE_MODE | true | On a 429 with a parseable Retry-After, retry the original request once after the indicated wait (capped at REQUEST_TIMEOUT) before falling through. |
Set these inside the env block of your MCP config, not in your shell — MCP servers run under the client, not your terminal.
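As an illustration of where the env block sits, the following sketch builds a config dict in Python and prints it as JSON. The helper name `build_server_entry` and the chosen values (a 15-second timeout) are assumptions for the example only; the command / args / env keys are the ones described above.

```python
import json
from typing import Mapping, Optional, Sequence


def build_server_entry(
    command: str,
    args: Optional[Sequence[str]] = None,
    env: Optional[Mapping[str, object]] = None,
) -> dict:
    """Build one server entry with optional args and env (hypothetical helper)."""
    entry: dict = {"command": command}
    if args:
        entry["args"] = list(args)
    if env:
        # Environment variables are strings, so coerce values such as 15 to "15".
        # Pass booleans as the literal strings "true"/"false", not Python bools.
        entry["env"] = {key: str(value) for key, value in env.items()}
    return entry


config = {
    "mcpServers": {
        "webcrawl": build_server_entry(
            "uvx",
            ["webcrawl-mcp"],
            {"REQUEST_TIMEOUT": 15, "POLITE_MODE": "true"},
        )
    }
}
print(json.dumps(config, indent=2))
```

The printed JSON has the same shape as the .mcp.json shown earlier, with an added env object next to command and args.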
Fallback behavior
The scraper distinguishes extraction-quality failure from transport failure and reports which path produced the content via the source field.
Extraction-quality path (default):
- trafilatura extracts main content from HTML → source: static_http.
- If that fails or returns <200 chars, markdownify converts the raw HTML.
- If the result is still low-quality and FIRECRAWL_API_KEY is set, Firecrawl is used as a last resort → source: firecrawl_quality_fallback.
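The cascade above can be sketched as a small function. This is a reading of the three steps, not the package's implementation, and it makes two assumptions the text does not confirm: that "still low-quality" reuses the 200-character threshold, and that markdownify output is reported as static_http. The extractors are passed in as callables (hypothetical parameters `primary`, `secondary`, `remote`) so the sketch runs without the real libraries; in practice `primary` might wrap `trafilatura.extract` and `secondary` might wrap `markdownify.markdownify`.

```python
from typing import Callable, Optional

MIN_CHARS = 200  # threshold named in the first fallback step

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    primary: Extractor,                                     # e.g. a trafilatura wrapper
    secondary: Extractor,                                   # e.g. a markdownify wrapper
    remote: Optional[Callable[[], Optional[str]]] = None,   # set only when a key exists
    min_chars: int = MIN_CHARS,
) -> dict:
    """Try local extractors in order, then an optional remote last resort."""
    text = primary(html)
    if text and len(text) >= min_chars:
        return {"content": text, "source": "static_http"}

    # Primary failed or was too short: convert the raw HTML instead.
    text = secondary(html) or ""

    # ASSUMPTION: "still low-quality" is judged by the same length threshold.
    if len(text) < min_chars and remote is not None:
        remote_text = remote()
        if remote_text:
            return {"content": remote_text, "source": "firecrawl_quality_fallback"}

    # ASSUMPTION: the secondary path keeps the static_http label.
    return {"content": text, "source": "static_http"}


# Stub extractors stand in for the real libraries.
result = extract_with_fallback(
    "<p>hi</p>",
    primary=lambda html: None,
    secondary=lambda html: "hi",
    remote=lambda: "full page text from the remote service",
)
print(result["source"])
```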
Transport path (opt-in):
If a request returns 403, 429, or 503 (typical bot-blocking responses):
- With POLITE_MODE=true (default), a 429 carrying a Retry-After header gets one bounded retry of the original request → source: static_http_retry on success.
- With FALLBACK_ON_TRANSPORT_ERROR=true and FIRECRAWL_API_KEY set, the request routes to Firecrawl instead of raising → source: firecrawl_transport_fallback.
- Otherwise, the transport error is raised to the caller (current behavior).
Without a Firecrawl key, the tool is fully self-contained and free; FALLBACK_ON_TRANSPORT_ERROR is a no-op without a key.
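The transport rules can be read as a small decision function. The sketch below is one interpretation under stated assumptions, not the server's code: the names `next_step` and `parse_retry_after` are hypothetical, only the delta-seconds form of Retry-After is parsed (an HTTP-date value is treated as unparseable here), and the polite retry is tried before the Firecrawl route, following the "before falling through" wording in the environment variable table.

```python
from typing import Optional, Tuple

BOT_BLOCK_STATUSES = frozenset({403, 429, 503})


def parse_retry_after(value: Optional[str]) -> Optional[float]:
    """Return the wait in seconds for a delta-seconds Retry-After, else None."""
    if value is None:
        return None
    try:
        seconds = float(value.strip())
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch
    return seconds if seconds >= 0 else None


def next_step(
    status: int,
    retry_after: Optional[str] = None,
    *,
    already_retried: bool = False,
    polite_mode: bool = True,
    fallback_on_transport_error: bool = False,
    has_firecrawl_key: bool = False,
    request_timeout: float = 30.0,
) -> Tuple[str, Optional[float]]:
    """Decide what to do with a response status: retry, use Firecrawl, or raise."""
    if status not in BOT_BLOCK_STATUSES:
        return ("not_bot_blocked", None)

    # One bounded retry for a 429 with a parseable Retry-After.
    if polite_mode and status == 429 and not already_retried:
        wait = parse_retry_after(retry_after)
        if wait is not None:
            return ("retry", min(wait, request_timeout))

    # Opt-in Firecrawl route; a no-op without a key.
    if fallback_on_transport_error and has_firecrawl_key:
        return ("firecrawl", None)

    return ("raise", None)


print(next_step(429, "120"))  # wait is capped at request_timeout
print(next_step(403, fallback_on_transport_error=True, has_firecrawl_key=True))
```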
Troubleshooting
Client doesn't list webcrawl under MCP servers.
The command path is almost always the problem. For a source checkout, it must be an absolute path to the Python binary inside your venv, not just python. Test it in a terminal: /path/you/configured -m webcrawl_mcp should print Webcrawl MCP server running.
ModuleNotFoundError: webcrawl_mcp.
Either pip install -e . wasn't run in the venv that command points to, or PYTHONPATH is missing or wrong. Check that both point at the same checkout.
Python version mismatch.
Requires 3.12+. python --version inside your venv should report ≥ 3.12. If not, recreate the venv with a newer Python.
Scrapes return very little text.
Some sites render with JavaScript and can't be extracted statically. Either set FIRECRAWL_API_KEY to enable the fallback path, or accept that this tool isn't the right fit for that specific site.
Search is slow or rate-limited.
DuckDuckGo throttles bursty querying. Space out searches, or reduce num_results.
Responsible use
This tool is for fetching public web content for research, coding assistance, and similar legitimate uses. You are responsible for:
- Respecting each target site's Terms of Service and robots.txt.
- Not overloading servers; the built-in per-domain rate limiter helps, but don't circumvent it.
- Complying with applicable laws around automated access and data use in your jurisdiction.
License
MIT — see LICENSE.
Acknowledgements
Built on top of trafilatura, httpx, beautifulsoup4, markdownify, ddgs, and fastmcp. Firecrawl integration uses the public Firecrawl API (not affiliated).