mcp-common-crawl
MCP server for Common Crawl CDX that enables backlink discovery, expired domain finding, and competitor gap analysis without requiring API keys.
Built by Artur Ferreira @ The GEO Lab · @TheGEO_Lab · LinkedIn · Reddit
MCP server for Common Crawl CDX: backlink discovery, expired domain finder, competitor gap analysis. Free alternative to Ahrefs/Semrush backlink APIs ($100+/month).
Tools
| Tool | Description |
|---|---|
| `discover_backlinks` | Find backlinks to any domain across 3 CC indexes |
| `find_expired` | Search for expired/parked domains in a niche via CC CDX |
| `check_domain` | Deep single-domain check: live/expired/parked status plus CC page count |
| `competitor_gap` | Find domains linking to competitors but not to you |
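The `competitor_gap` row amounts to a set difference over referring domains. The sketch below is a minimal illustration of that idea, assuming the backlink results have already been reduced to plain lists of referring domains. The names `normalizeDomain` and `linkGap` and the input shapes are hypothetical and are not taken from the server's code.

```typescript
// Hypothetical helpers for illustration; not part of mcp-common-crawl's API.

// Lower-case the domain and drop a leading "www." so variants compare equal.
function normalizeDomain(domain: string): string {
  return domain.trim().toLowerCase().replace(/^www\./, "");
}

// Returns each domain that links to at least one competitor but not to you,
// mapped to the competitors it links to.
function linkGap(
  ownReferrers: string[],
  competitorReferrers: Record<string, string[]>,
): Map<string, string[]> {
  const own = new Set(ownReferrers.map(normalizeDomain));
  const gap = new Map<string, string[]>();
  for (const [competitor, referrers] of Object.entries(competitorReferrers)) {
    for (const raw of referrers) {
      const domain = normalizeDomain(raw);
      if (own.has(domain)) continue; // already links to you: not a gap
      const linked = gap.get(domain) ?? [];
      if (!linked.includes(competitor)) linked.push(competitor);
      gap.set(domain, linked);
    }
  }
  return gap;
}
```

Domains that appear under several competitors are often the most promising outreach targets, which is why the sketch keeps the list of competitors per domain rather than a flat set.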
Features
- Production-tested: patterns used in production at TheGEOLab
Install
# Claude Code
claude mcp add common-crawl -- npx mcp-common-crawl
# Or in .mcp.json
{
  "mcpServers": {
    "common-crawl": {
      "command": "npx",
      "args": ["mcp-common-crawl"]
    }
  }
}
No API Keys Required
Common Crawl is a free, open web archive. No API keys and no paid tiers. The public index server is a shared resource and may throttle heavy use, so keep request volume modest.
Usage
> find backlinks to thegeolab.net using Common Crawl
> search for expired domains in the "seo tools" niche
> check if example.com is expired or parked
> find link gap between my site and competitors
Important Notes
- Uses native `fetch()` for CC CDX (axios returns 404 on CC CDX, a known issue)
- Queries the 3 most recent CC indexes for best coverage
- Expired domain detection: a connection that fails with ECONNREFUSED or ENOTFOUND is treated as expired; parked domains are detected by matching the response against known parked-page patterns
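The detection rule in the last note can be sketched as a small pure function. This is an illustration under stated assumptions, not the server's implementation: the `PARKED_PATTERNS` phrases are made-up examples (the real pattern list is not documented here), the `unknown` status for other error codes is an addition for completeness, and the names `ProbeResult` and `classifyDomain` are hypothetical.

```typescript
type DomainStatus = "live" | "expired" | "parked" | "unknown";

// Hypothetical input: either a network error code or the fetched page body.
interface ProbeResult {
  errorCode?: string;
  body?: string;
}

// Error codes the notes above treat as "expired".
const EXPIRED_CODES = new Set(["ECONNREFUSED", "ENOTFOUND"]);

// Assumed example phrases; the server's actual parked-page patterns may differ.
const PARKED_PATTERNS: RegExp[] = [
  /this domain is for sale/i,
  /buy this domain/i,
  /domain (is )?parked/i,
];

function classifyDomain(probe: ProbeResult): DomainStatus {
  if (probe.errorCode !== undefined) {
    // Only the two documented codes count as expired; anything else is unknown.
    return EXPIRED_CODES.has(probe.errorCode) ? "expired" : "unknown";
  }
  const body = probe.body ?? "";
  return PARKED_PATTERNS.some((pattern) => pattern.test(body)) ? "parked" : "live";
}
```

In Node, a failed `fetch()` typically rejects with a `TypeError` whose `cause` carries the system error code, so a caller could pass that code in as `errorCode`. Keeping the classification separate from the network call makes the rule easy to test without touching real domains.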
Attributions & Licence
Built and maintained by Artur Ferreira @ TheGEOLab.
Email: artur@thegeolab.net
Best Practice Attribution
This MCP server was built following the open source Best Practice Approach: reading community work for inspiration, then writing original content, and crediting every source.
Based on:
- Model Context Protocol specification by Anthropic
- MCP SDK (MIT)
Data source:
- Common Crawl: free, open web archive (non-profit)
- Common Crawl CDX API: index search endpoint
Backlink analysis concepts inspired by:
- Ahrefs: backlink discovery and competitor gap methodology
- Semrush: backlink analytics and domain comparison
- Majestic: historic backlink index concepts
Technical decisions:
- Native `fetch()` used instead of axios for CC CDX queries (axios returns 404 on CC CDX from inside Express, a persistent debugging issue documented in geolab-backlinks)
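As a rough illustration of the native `fetch()` approach, the sketch below queries the most recent Common Crawl indexes for captures matching a URL pattern. It assumes the public index server's `collinfo.json` listing (with a `cdx-api` field per index, newest first) and the `url`, `output`, and `limit` query parameters. All function names are hypothetical, and the server's real query logic may differ.

```typescript
// Assumed shape of one entry in https://index.commoncrawl.org/collinfo.json
interface CrawlIndex {
  id: string;        // e.g. "CC-MAIN-2024-10"
  "cdx-api": string; // e.g. "https://index.commoncrawl.org/CC-MAIN-2024-10-index"
}

// Take the first `count` entries, assuming the listing is ordered newest first.
function latestIndexes(all: CrawlIndex[], count = 3): CrawlIndex[] {
  return all.slice(0, count);
}

// Build the CDX query URL for one index.
function buildCdxUrl(index: CrawlIndex, urlPattern: string, limit = 100): string {
  const params = new URLSearchParams({
    url: urlPattern,
    output: "json",
    limit: String(limit),
  });
  return `${index["cdx-api"]}?${params.toString()}`;
}

// With output=json the response is one JSON object per line, not a JSON array.
function parseCdxLines(body: string): Array<Record<string, string>> {
  return body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Query the most recent indexes one after another and merge the records.
async function queryLatestIndexes(urlPattern: string): Promise<Array<Record<string, string>>> {
  const listing = await fetch("https://index.commoncrawl.org/collinfo.json");
  const indexes = latestIndexes((await listing.json()) as CrawlIndex[]);
  const records: Array<Record<string, string>> = [];
  for (const index of indexes) {
    const res = await fetch(buildCdxUrl(index, urlPattern));
    if (!res.ok) continue; // skip indexes that answer with an error status
    records.push(...parseCdxLines(await res.text()));
  }
  return records;
}
```

A call such as `queryLatestIndexes("example.com/*")` would return the merged capture records. The loop is sequential rather than parallel to stay gentle on the shared index server.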
All server code is original writing. No files were copied or adapted from any source. MIT licence.
Found this useful? ⭐ Star the repo and connect: thegeolab.net · @TheGEO_Lab · LinkedIn · Reddit
Related Repos
- claude-code-mcps: All 5 MCP servers in one collection
- mcp-seo-auditor: On-page SEO audit + JSON-LD validation
- mcp-serp-intel: SERP weak spots, PAA trees, intent comparison
- mcp-common-crawl: Free backlink discovery via Common Crawl
- mcp-gsc-advanced: GSC cannibalization, rank changes
- mcp-wordpress-setup: WordPress MCP server setup guide
Licence
MIT (see LICENSE)