
Crawl4Claude
Documentation Scraper & MCP Server

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
🚀 Features
Core Functionality
- 🌐 Universal Documentation Scraper: Works with any documentation website
- 📊 Structured Database: SQLite database with full-text search capabilities
- 🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
- 📝 LLM-Optimized Output: Ready-to-use context files for AI applications
- ⚙️ Configuration-Driven: Single config file controls all settings
Advanced Tools
- 🔍 Query Interface: Command-line tool for searching and analyzing scraped content
- 🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
- 📋 Auto-Configuration: Automatic MCP setup file generation
- 📈 Progress Tracking: Detailed logging and error handling
- 💾 Resumable Crawls: Smart caching for interrupted crawls
📋 Prerequisites
- Python 3.8 or higher
- Internet connection
- ~500MB free disk space per documentation site
🛠️ Quick Start
1. Installation
```bash
# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt
```
2. Configure Your Target
Edit `config.py` to set your documentation site:
```python
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}
```
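Since `config.py` is a plain Python module, you can sanity-check your settings before kicking off a crawl. A quick, optional check (only the keys shown above are assumed):

```python
# check_config.py - quick sanity check of the scraper settings (illustrative only)
from urllib.parse import urlparse

from config import SCRAPER_CONFIG

url = SCRAPER_CONFIG["base_url"]
assert urlparse(url).scheme in ("http", "https"), f"base_url looks wrong: {url}"
print(f"Will crawl {url} into ./{SCRAPER_CONFIG['output_dir']} "
      f"(max {SCRAPER_CONFIG['max_pages']} pages)")
```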
3. Run the Scraper
```bash
python docs_scraper.py
```
4. Query Your Documentation
```bash
# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats
```
5. Set Up Claude Integration
```bash
# Generate MCP configuration files
python utils/gen_mcp.py
```
Then follow the instructions to add the generated config to Claude Desktop.
🏗️ Project Structure
```
📁 documentation-scraper/
├── 📄 config.py                        # Central configuration file
├── 🕷️ docs_scraper.py                  # Main scraper script
├── 🔍 query_docs.py                    # Query and analysis tool
├── 🤖 mcp_docs_server.py               # MCP server for Claude integration
├── 📋 requirements.txt                 # Python dependencies
├── 📁 utils/                           # Debug and utility tools
│   ├── 🛠️ gen_mcp.py                   # Generate MCP config files
│   ├── 🧪 debug_scraper.py             # Test scraper functionality
│   ├── 🔧 debug_mcp_server.py          # Debug MCP server
│   ├── 🎯 debug_mcp_client.py          # Test MCP tools directly
│   ├── 📡 debug_mcp_server_protocol.py # Test MCP via JSON-RPC
│   └── 🌐 debug_site_content.py        # Debug content extraction
├── 📁 docs_db/                         # Generated documentation database
│   ├── 📊 documentation.db             # SQLite database
│   ├── 📄 documentation.json           # JSON export
│   ├── 📋 scrape_summary.json          # Statistics
│   └── 📁 llm_context/                 # LLM-ready context files
└── 📁 mcp/                             # Generated MCP configuration
    ├── 🔧 run_mcp_server.bat           # Windows launcher script
    └── ⚙️ claude_mcp_config.json       # Claude Desktop config
```
⚙️ Configuration
Main Configuration (`config.py`)
The entire system is controlled by a single configuration file:
```python
# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
```
Environment Overrides
You can override any setting with environment variables:
```bash
export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
```
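Under the hood, an override like this usually amounts to consulting the environment before falling back to `config.py`. A minimal sketch of that pattern (the server's actual variable handling may differ):

```python
import os

from config import MCP_CONFIG, SCRAPER_CONFIG

# Environment variables win over config.py defaults (illustrative pattern only)
db_path = os.environ.get("DOCS_DB_PATH", "docs_db/documentation.db")
base_url = os.environ.get("DOCS_BASE_URL", SCRAPER_CONFIG["base_url"])
print(f"Serving {base_url} docs from {db_path} as '{MCP_CONFIG['server_name']}'")
```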
🤖 Claude Desktop Integration
Automatic Setup
1. Generate the configuration files: `python utils/gen_mcp.py`
2. Copy the generated config to Claude Desktop:
   - Windows: `%APPDATA%\Claude\claude_desktop_config.json`
   - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
3. Restart Claude Desktop
Manual Setup
If you prefer manual setup, add this to your Claude Desktop config:
```json
{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
```
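The most common pitfall with manual setup is relative paths, because Claude Desktop launches the server from its own working directory. Below is a small, hypothetical helper (`make_config.py` is not part of the project; `utils/gen_mcp.py` already automates this) that prints the snippet with absolute paths:

```python
# make_config.py - hypothetical helper that prints a Claude Desktop MCP entry
# with absolute paths; utils/gen_mcp.py is the supported way to do this.
import json
from pathlib import Path

project = Path(__file__).resolve().parent
config = {
    "mcpServers": {
        "docs": {
            "command": "python",
            "args": [str(project / "mcp_docs_server.py")],
            "cwd": str(project),
            "env": {"DOCS_DB_PATH": str(project / "docs_db" / "documentation.db")},
        }
    }
}
print(json.dumps(config, indent=2))
```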
Available MCP Tools
Once connected, Claude can use these tools:
- 🔍 search_documentation: Search for content across all documentation
- 📚 get_documentation_sections: List all available sections
- 📄 get_page_content: Get full content of specific pages
- 🗂️ browse_section: Browse pages within a section
- 📊 get_documentation_stats: Get database statistics
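As a rough illustration of what one of these tools can look like server-side, here is a minimal sketch using the official MCP Python SDK's FastMCP helper and a plain LIKE query against the pages table described below. The shipped mcp_docs_server.py is more capable (FTS5 search, snippets, configurable limits) and may be structured quite differently.

```python
import os
import sqlite3

from mcp.server.fastmcp import FastMCP  # assumes the official MCP Python SDK

mcp = FastMCP("docs-server")
DB_PATH = os.environ.get("DOCS_DB_PATH", "docs_db/documentation.db")

@mcp.tool()
def search_documentation(query: str, limit: int = 10) -> list[dict]:
    """Return pages whose title or content mentions the query."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT title, url, section FROM pages "
            "WHERE title LIKE ? OR content LIKE ? LIMIT ?",
            (f"%{query}%", f"%{query}%", limit),
        ).fetchall()
    return [{"title": t, "url": u, "section": s} for t, u, s in rows]

if __name__ == "__main__":
    mcp.run()  # serves over stdio for Claude Desktop
```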
🔧 Command Line Tools
Documentation Scraper
```bash
# Run a scrape; all settings (target site, limits, delays) come from config.py
python docs_scraper.py
```
Query Tool
```bash
# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export a section to a file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use a custom database
python query_docs.py --db "custom/path/docs.db" --search "example"
```
Debug Tools
```bash
# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py
```
📊 Database Schema
Pages Table
```sql
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);
```
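Because the schema is this simple, ad-hoc analysis needs nothing beyond the standard library. For example, a per-section page and word-count summary, similar in spirit to `query_docs.py --stats`:

```python
import sqlite3

conn = sqlite3.connect("docs_db/documentation.db")
rows = conn.execute(
    "SELECT section, COUNT(*) AS pages, SUM(word_count) AS words "
    "FROM pages GROUP BY section ORDER BY words DESC"
).fetchall()
for section, pages, words in rows:
    print(f"{section or '(none)':30} {pages:5d} pages  {words or 0:8d} words")
conn.close()
```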
Full-Text Search
```sql
-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';
```
Or use the query tool:
```bash
python query_docs.py --search "your search term"
```
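The same index can be queried directly from Python. A minimal sketch, assuming `pages_fts` is an FTS5 table whose rowid lines up with `pages.id` (check your database if the join fails):

```python
import sqlite3

conn = sqlite3.connect("docs_db/documentation.db")
rows = conn.execute(
    """
    SELECT p.title,
           p.url,
           snippet(pages_fts, -1, '[', ']', '…', 12) AS excerpt
    FROM pages_fts
    JOIN pages AS p ON p.id = pages_fts.rowid
    WHERE pages_fts MATCH ?
    ORDER BY bm25(pages_fts)
    LIMIT 10
    """,
    ("authentication",),
).fetchall()
for title, url, excerpt in rows:
    print(f"{title}\n  {url}\n  {excerpt}\n")
conn.close()
```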
🎯 Example Use Cases
1. Documentation Analysis
```bash
# Get an overview of the documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md
```
2. AI Integration with Claude
Once MCP is set up, ask Claude things like:
- "Search the documentation for authentication examples"
- "What sections are available in the documentation?"
- "Show me the content for the API reference page"
3. Custom Applications
```python
import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown
    FROM pages
    WHERE section = 'tutorials'
      AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data
```
🔍 Debugging and Testing
Test Scraper Before Full Run
```bash
python utils/debug_scraper.py
```
Validate Content Extraction
```bash
python utils/debug_site_content.py
```
Test MCP Integration
```bash
# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py
```
📈 Performance and Optimization
Scraping Performance
- Start small: Use `max_pages=50` for testing (see the sketch below)
- Adjust depth: `max_depth=2` covers most content efficiently
- Rate limiting: Increase `delay_between_requests` if getting blocked
- Caching: Enabled by default for resumable crawls
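A conservative trial-run profile using those settings might look like this (values are illustrative):

```python
# config.py - trial-run profile before committing to a full crawl (illustrative values)
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 2,                 # shallow pass covers most content
    "max_pages": 50,                # small sample to validate extraction
    "delay_between_requests": 1.0,  # be gentle while testing
}
```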
Database Performance
- Full-text search: Automatic FTS5 index for fast searching
- Indexing: Optimized indexes on URL and section columns
- Word counts: Pre-calculated for quick statistics
MCP Performance
- Configurable limits: Set appropriate search and section limits (one way to enforce them is sketched below)
- Snippet length: Adjust snippet size for optimal response times
- Connection pooling: Efficient database connections
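Enforcing the configured limits inside a tool handler comes down to a small clamp; a sketch of the pattern (the real server may handle this differently):

```python
from typing import Optional

from config import MCP_CONFIG

def clamp_limit(requested: Optional[int]) -> int:
    """Clamp a client-supplied result limit to the configured bounds."""
    limit = requested or MCP_CONFIG["default_search_limit"]
    return max(1, min(limit, MCP_CONFIG["max_search_limit"]))

print(clamp_limit(None))  # -> 10 (default_search_limit)
print(clamp_limit(500))   # -> 50 (capped at max_search_limit)
```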
🌐 Supported Documentation Sites
This scraper works with most documentation websites including:
- Static sites: Hugo, Jekyll, MkDocs, Docusaurus
- Documentation platforms: GitBook, Notion, Confluence
- API docs: Swagger/OpenAPI documentation
- Wiki-style: MediaWiki, TiddlyWiki
- Custom sites: Any site with consistent HTML structure
Site-Specific Configuration
Customize URL filtering and content extraction for your target site:
```python
URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',    # Skip API endpoint docs
        r'/edit/',   # Skip edit pages
        r'\.pdf$',   # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
```
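Applying these rules is plain regex work. A minimal sketch of the idea; the function names here are illustrative, not the scraper's actual API:

```python
import re
from urllib.parse import urlparse

from config import CONTENT_FILTER_CONFIG, URL_FILTER_CONFIG

def should_crawl(url: str) -> bool:
    """Accept a URL only if its domain is allowed and no skip pattern matches."""
    if urlparse(url).netloc not in URL_FILTER_CONFIG["allowed_domains"]:
        return False
    return not any(re.search(p, url) for p in URL_FILTER_CONFIG["skip_patterns"])

def clean_content(text: str) -> str:
    """Strip boilerplate phrases such as 'Edit this page' from extracted text."""
    for pattern in CONTENT_FILTER_CONFIG["remove_patterns"]:
        text = re.sub(pattern, "", text)
    return text
```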
🤝 Contributing
We welcome contributions! Here are some areas where you can help:
- New export formats: PDF, EPUB, Word documents
- Enhanced content filtering: Better noise removal
- Additional debug tools: More comprehensive testing
- Documentation: Improve guides and examples
- Performance optimizations: Faster scraping and querying
⚠️ Responsible Usage
- Respect robots.txt: Check the target site's robots.txt file before crawling (see the sketch below)
- Rate limiting: Use appropriate delays between requests
- Terms of service: Respect the documentation site's terms
- Fair use: Use for educational, research, or personal purposes
- Attribution: Credit the original documentation source
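Checking robots.txt before a crawl takes only the standard library; a quick sketch using the placeholder site from the examples above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://docs.example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("*", "https://docs.example.com/tutorials/"):
    print("Crawling this path is allowed for generic user agents")
else:
    print("Disallowed by robots.txt - adjust skip_patterns or skip the crawl")
```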
📄 License
This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.
🎉 Getting Started Examples
Example 1: Scrape Python Documentation
```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}
```
Example 2: Scrape API Documentation
```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}
```
Example 3: Corporate Documentation
```python
# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}
```
Happy Documenting! 📚✨
For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.
🙏 Attribution
This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.
<a href="https://github.com/unclecode/crawl4ai"> <img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/> </a>
Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀
Check out Crawl4AI:
- Repository: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com
- Discord Community: https://discord.gg/jP8KfhDhyN