Crawl4AI MCP Server
Enables advanced web crawling and content extraction with JavaScript support, AI-powered analysis, PDF/Office document processing, YouTube transcript extraction, Google search integration, and multi-format data export capabilities.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library. This server provides advanced web crawling, content extraction, and AI-powered analysis capabilities through the standardized MCP interface.
Key Features
Core Capabilities
Complete JavaScript Support
This feature set enables comprehensive handling of JavaScript-heavy websites:
- Full Playwright Integration - React, Vue, and Angular SPAs fully supported
- Dynamic Content Loading - Auto-waits for content to load
- Custom JavaScript Execution - Run custom scripts on pages
- DOM Element Waiting - wait_for_selector for specific elements
- Human-like Browsing Simulation - Bypass basic anti-bot measures
Recommended settings for JavaScript-heavy sites (timeout is in seconds; 30-60 is a sensible range):
{
  "wait_for_js": true,
  "simulate_user": true,
  "timeout": 60,
  "generate_markdown": true
}
- Advanced Web Crawling with complete JavaScript execution support
- Deep Crawling with configurable depth and multiple strategies (BFS, DFS, Best-First)
- AI-Powered Content Extraction using LLM-based analysis
- File Processing with Microsoft MarkItDown integration
- PDF, Office documents, ZIP archives, and more
- Automatic file format detection and conversion
- Batch processing of archive contents
- YouTube Transcript Extraction (youtube-transcript-api v1.1.0+)
- No authentication required - works out of the box
- Stable and reliable transcript extraction
- Support for both auto-generated and manual captions
- Multi-language support with priority settings
- Timestamped segment information and clean text output
- Batch processing for multiple videos
- Entity Extraction with 9 built-in patterns including emails, phones, URLs, and dates
- Intelligent Content Filtering (BM25, pruning, LLM-based)
- Content Chunking for large document processing
- Screenshot Capture and media extraction
Advanced Features
- Google Search Integration with genre-based filtering and metadata extraction
- 31 search genres (academic, programming, news, etc.)
- Automatic title and snippet extraction from search results
- Safe search enabled by default for security
- Batch search capabilities with result analysis
- Multiple Extraction Strategies include CSS selectors, XPath, regex patterns, and LLM-based extraction
- Browser Automation supports custom user agents, headers, cookies, and authentication
- Caching System with multiple modes for performance optimization
- Custom JavaScript Execution for dynamic content interaction
- Structured Data Export in multiple formats (JSON, Markdown, HTML)
Installation
Quick Setup
Linux/macOS:
./setup.sh
Windows:
setup_windows.bat
Manual Installation
- Create and activate virtual environment:
python3 -m venv venv
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate.bat # Windows
- Install Python dependencies:
pip install -r requirements.txt
- Install Playwright browser dependencies (Linux/WSL):
sudo apt-get update
sudo apt-get install libnss3 libnspr4 libasound2 libatk-bridge2.0-0 libdrm2 libgtk-3-0 libgbm1
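Depending on the environment, Playwright's browser binaries may also need to be downloaded (a standard Playwright step; the setup scripts may already handle this for you):
python -m playwright install chromium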
Docker Deployment (Recommended)
For production deployments and easy setup, we provide multiple Docker container variants optimized for different use cases:
Quick Start with Docker
# Create shared network
docker network create shared_net
# CPU-optimized deployment (recommended for most users)
docker-compose -f docker-compose.cpu.yml build
docker-compose -f docker-compose.cpu.yml up -d
# Verify deployment
docker-compose -f docker-compose.cpu.yml ps
Container Variants
| Variant | Use Case | Build Time | Image Size | Memory Limit |
|---|---|---|---|---|
| CPU-Optimized | VPS, local development | 6-9 min | ~1.5-2GB | 1GB |
| Lightweight | Resource-constrained environments | 4-6 min | ~1-1.5GB | 512MB |
| Standard | Full features (may include CUDA) | 8-12 min | ~2-3GB | 2GB |
| RunPod Serverless | Cloud auto-scaling deployment | 6-9 min | ~1.5-2GB | 1GB |
Docker Features
- Network Isolation: No localhost exposure; only accessible via shared_net
- CPU Optimization: CUDA-free builds for 50-70% smaller containers
- Auto-scaling: RunPod serverless deployment with automated CI/CD
- Health Monitoring: Built-in health checks and monitoring
- Security: Non-root user, resource limits, safe mode enabled
Quick Commands
# Build all variants for comparison
./scripts/build-cpu-containers.sh
# Deploy to RunPod (automated via GitHub Actions)
# Image: docker.io/gemneye/crawl4ai-runpod-serverless:latest
# Health check
docker exec crawl4ai-mcp-cpu python -c "from crawl4ai_mcp.server import mcp; print('Healthy')"
For complete Docker documentation, see Docker Guide.
Usage
Start the MCP Server
STDIO transport (default):
python -m crawl4ai_mcp.server
HTTP transport:
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
MCP Command Registration (Claude Code CLI)
You can register this MCP server with Claude Code CLI. The following methods are available:
Using .mcp.json Configuration (Recommended)
- Create or update .mcp.json in your project directory:
{
"mcpServers": {
"crawl4ai": {
"command": "/home/user/prj/crawl/venv/bin/python",
"args": ["-m", "crawl4ai_mcp.server"],
"env": {
"FASTMCP_LOG_LEVEL": "DEBUG"
}
}
}
}
- Run claude mcp or start Claude Code from the project directory
Alternative: Command Line Registration
# Register the MCP server with claude command
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
--cwd /path/to/your/crawl4ai-mcp-project
# With environment variables
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
--cwd /path/to/your/crawl4ai-mcp-project \
-e FASTMCP_LOG_LEVEL=DEBUG
# With project scope (shared with team)
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
--cwd /path/to/your/crawl4ai-mcp-project \
--scope project
HTTP Transport (For Remote Access)
# First start the HTTP server
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
# Then register the HTTP endpoint
claude mcp add crawl4ai-http --transport http --url http://127.0.0.1:8000/mcp
# Or with Pure StreamableHTTP (recommended)
./scripts/start_pure_http_server.sh
claude mcp add crawl4ai-pure-http --transport http --url http://127.0.0.1:8000/mcp
Verification
# List registered MCP servers
claude mcp list
# Test the connection
claude mcp test crawl4ai
# Remove if needed
claude mcp remove crawl4ai
Setting API Keys (Optional for LLM Features)
# Add with environment variables for LLM functionality
claude mcp add crawl4ai "python -m crawl4ai_mcp.server" \
--cwd /path/to/your/crawl4ai-mcp-project \
-e OPENAI_API_KEY=your_openai_key \
-e ANTHROPIC_API_KEY=your_anthropic_key
Claude Desktop Integration
Pure StreamableHTTP Usage (Recommended)
1. Start Server by running the startup script:
   ./scripts/start_pure_http_server.sh
2. Apply Configuration using one of these methods:
   - Copy configs/claude_desktop_config_pure_http.json to Claude Desktop's config directory
   - Or add the following to your existing config:
     { "mcpServers": { "crawl4ai-pure-http": { "url": "http://127.0.0.1:8000/mcp" } } }
3. Restart Claude Desktop to apply settings
4. Start Using the tools - crawl4ai tools are now available in chat
Traditional STDIO Usage
1. Copy the configuration:
   cp configs/claude_desktop_config.json ~/.config/claude-desktop/claude_desktop_config.json
2. Restart Claude Desktop to enable the crawl4ai tools
Configuration File Locations
Windows:
%APPDATA%\Claude\claude_desktop_config.json
macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
Linux:
~/.config/claude-desktop/claude_desktop_config.json
HTTP API Access
This MCP server supports multiple HTTP protocols, allowing you to choose the optimal implementation for your use case.
Pure StreamableHTTP (Recommended)
Pure JSON HTTP protocol without Server-Sent Events (SSE)
Server Startup
# Method 1: Using startup script
./scripts/start_pure_http_server.sh
# Method 2: Direct startup
python examples/simple_pure_http_server.py --host 127.0.0.1 --port 8000
# Method 3: Background startup
nohup python examples/simple_pure_http_server.py --port 8000 > server.log 2>&1 &
Claude Desktop Configuration
{
"mcpServers": {
"crawl4ai-pure-http": {
"url": "http://127.0.0.1:8000/mcp"
}
}
}
Usage Steps
1. Start Server: ./scripts/start_pure_http_server.sh
2. Apply Configuration: Use configs/claude_desktop_config_pure_http.json
3. Restart Claude Desktop: Apply settings
Verification
# Health check
curl http://127.0.0.1:8000/health
# Complete test
python examples/pure_http_test.py
Legacy HTTP (SSE Implementation)
Traditional FastMCP StreamableHTTP protocol (with SSE)
Server Startup
# Method 1: Command line
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8001
# Method 2: Environment variables
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8001
python -m crawl4ai_mcp.server
Claude Desktop Configuration
{
"mcpServers": {
"crawl4ai-legacy-http": {
"url": "http://127.0.0.1:8001/mcp"
}
}
}
Protocol Comparison
| Feature | Pure StreamableHTTP | Legacy HTTP (SSE) | STDIO |
|---|---|---|---|
| Response Format | Plain JSON | Server-Sent Events | JSON-RPC over stdio |
| Configuration Complexity | Low (URL only) | Low (URL only) | High (Process management) |
| Debug Ease | High (curl compatible) | Medium (SSE parser needed) | Low |
| Independence | High | High | Low |
| Performance | High | Medium | High |
HTTP Usage Examples
Pure StreamableHTTP
# Initialize
SESSION_ID=$(curl -s -X POST http://127.0.0.1:8000/mcp/initialize \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":"init","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}' \
-D- | grep -i mcp-session-id | cut -d' ' -f2 | tr -d '\r')
# Execute tool
curl -X POST http://127.0.0.1:8000/mcp \
-H "Content-Type: application/json" \
-H "mcp-session-id: $SESSION_ID" \
-d '{"jsonrpc":"2.0","id":"crawl","method":"tools/call","params":{"name":"crawl_url","arguments":{"url":"https://example.com"}}}'
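The same two-step flow from Python, as a minimal sketch (it assumes the endpoints behave exactly as in the curl example above and uses the third-party requests package):

import requests

BASE = "http://127.0.0.1:8000"

# Step 1: initialize and capture the session ID from the response headers
init = requests.post(
    f"{BASE}/mcp/initialize",
    json={
        "jsonrpc": "2.0", "id": "init", "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05", "capabilities": {},
            "clientInfo": {"name": "test", "version": "1.0.0"},
        },
    },
)
session_id = init.headers["mcp-session-id"]

# Step 2: call a tool, passing the session ID back as a header
result = requests.post(
    f"{BASE}/mcp",
    headers={"mcp-session-id": session_id},
    json={
        "jsonrpc": "2.0", "id": "crawl", "method": "tools/call",
        "params": {"name": "crawl_url", "arguments": {"url": "https://example.com"}},
    },
)
print(result.json())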
Legacy HTTP
curl -X POST "http://127.0.0.1:8001/tools/crawl_url" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "generate_markdown": true}'
Detailed Documentation
- Pure StreamableHTTP: docs/PURE_STREAMABLE_HTTP.md
- HTTP Server Usage: docs/HTTP_SERVER_USAGE.md
- Legacy HTTP API: docs/HTTP_API_GUIDE.md
Starting the HTTP Server
Method 1: Command Line
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
Method 2: Environment Variables
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8000
python -m crawl4ai_mcp.server
Method 3: Docker (if available)
docker run -p 8000:8000 crawl4ai-mcp --transport http --port 8000
Basic Endpoint Information
Once running, the HTTP API provides:
- Base URL: http://127.0.0.1:8000
- OpenAPI Documentation: http://127.0.0.1:8000/docs
- Tool Endpoints: http://127.0.0.1:8000/tools/{tool_name}
- Resource Endpoints: http://127.0.0.1:8000/resources/{resource_uri}
All MCP tools (crawl_url, intelligent_extract, process_file, etc.) are accessible via HTTP POST requests with JSON payloads matching the tool parameters.
Quick HTTP Example
curl -X POST "http://127.0.0.1:8000/tools/crawl_url" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "generate_markdown": true}'
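The same call from Python, as a minimal sketch against the /tools/{tool_name} endpoint described above (uses the requests package):

import requests

# POST a JSON payload that matches the tool's parameters
resp = requests.post(
    "http://127.0.0.1:8000/tools/crawl_url",
    json={"url": "https://example.com", "generate_markdown": True},
)
print(resp.json())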
For detailed HTTP API documentation, examples, and integration guides, see the HTTP API Guide.
Tool Selection Guide
Choose the Right Tool for Your Task
| Use Case | Recommended Tool | Key Features |
|---|---|---|
| Single webpage | crawl_url | Basic crawling, JS support |
| Multiple pages (up to 5) | deep_crawl_site | Site mapping, link following |
| Search + Crawling | search_and_crawl | Google search + auto-crawl |
| Difficult sites | crawl_url_with_fallback | Multiple retry strategies |
| Extract specific data | intelligent_extract | AI-powered extraction |
| Find patterns | extract_entities | Emails, phones, URLs, etc. |
| Structured data | extract_structured_data | CSS/XPath/LLM schemas |
| File processing | process_file | PDF, Office, ZIP conversion |
| YouTube content | extract_youtube_transcript | Subtitle extraction |
Performance Guidelines
- Deep Crawling: Limited to 5 pages max (stability focused)
- Batch Processing: Concurrent limits enforced
- Timeout Calculation: pages × base_timeout recommended (see the sketch after this list)
- Large Files: 100MB maximum size limit
- Retry Strategy: Manual retry recommended on first failure
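A minimal sketch of that timeout rule (the helper name and the 30-second base are illustrative assumptions; tune per site):

def recommended_timeout(pages: int, base_timeout: int = 30) -> int:
    """Scale the overall timeout with the number of pages to crawl."""
    return pages * base_timeout

print(recommended_timeout(5))  # a 5-page deep crawl -> 150 seconds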
Best Practices
For JavaScript-Heavy Sites:
- Always use wait_for_js: true
- Set simulate_user: true for better compatibility
- Increase timeout to 30-60 seconds
- Use wait_for_selector for specific elements
For AI Features:
- Configure LLM settings with get_llm_config_info
- Fall back to non-AI tools if the LLM is unavailable
- Use intelligent_extract for semantic understanding
MCP Tools
crawl_url
Advanced web crawling with deep crawling support and intelligent filtering.
Key Parameters:
- url: Target URL to crawl
- max_depth: Maximum crawling depth (None for single page)
- crawl_strategy: Strategy type ('bfs', 'dfs', 'best_first')
- content_filter: Filter type ('bm25', 'pruning', 'llm')
- chunk_content: Enable content chunking for large documents
- execute_js: Custom JavaScript code execution
- user_agent: Custom user agent string
- headers: Custom HTTP headers
- cookies: Authentication cookies
deep_crawl_site
Dedicated tool for comprehensive site mapping and recursive crawling.
Parameters:
- url: Starting URL
- max_depth: Maximum crawling depth (recommended: 1-3)
- max_pages: Maximum number of pages to crawl
- crawl_strategy: Crawling strategy ('bfs', 'dfs', 'best_first')
- url_pattern: URL filter pattern (e.g., 'docs', 'blog')
- score_threshold: Minimum relevance score (0.0-1.0)
intelligent_extract
AI-powered content extraction with advanced filtering and analysis.
Parameters:
- url: Target URL
- extraction_goal: Description of extraction target
- content_filter: Filter type for content quality
- use_llm: Enable LLM-based intelligent extraction
- llm_provider: LLM provider (openai, claude, etc.)
- custom_instructions: Detailed extraction instructions
extract_entities
High-speed entity extraction using regex patterns.
Built-in Entity Types:
- emails: Email addresses
- phones: Phone numbers
- urls: URLs and links
- dates: Date formats
- ips: IP addresses
- social_media: Social media handles (@username, #hashtag)
- prices: Price information
- credit_cards: Credit card numbers
- coordinates: Geographic coordinates
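For illustration, a tiny regex-based extractor in the same spirit (the patterns below are simplified stand-ins, not the server's actual built-ins):

import re

# Simplified illustrative patterns; the server's built-ins are more robust
PATTERNS = {
    "emails": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "urls": r"https?://[^\s\"'<>]+",
    "phones": r"\+?\d[\d\s().-]{7,}\d",
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by entity type."""
    return {name: re.findall(rx, text) for name, rx in PATTERNS.items()}

sample = "Reach info@example.com or +1 (555) 123-4567; docs at https://example.com/help"
print(extract_entities(sample))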
extract_structured_data
Traditional structured data extraction using CSS/XPath selectors or LLM schemas.
batch_crawl
Parallel processing of multiple URLs with unified reporting.
crawl_url_with_fallback
Robust crawling with multiple fallback strategies for maximum reliability.
process_file
File Processing: Convert various file formats to Markdown using Microsoft MarkItDown.
Parameters:
- url: File URL (PDF, Office, ZIP, etc.)
- max_size_mb: Maximum file size limit (default: 100MB)
- extract_all_from_zip: Extract all files from ZIP archives
- include_metadata: Include file metadata in response
Supported Formats:
- PDF: .pdf
- Microsoft Office: .docx, .pptx, .xlsx, .xls
- Archives: .zip
- Web/Text: .html, .htm, .txt, .md, .csv, .rtf
- eBooks: .epub
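Conversion is delegated to Microsoft's MarkItDown; as a standalone sketch, the core step looks like this (assuming a local file path; the tool itself fetches from a URL first):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.pdf")  # the same call handles .docx, .xlsx, .pptx, .zip, ...
print(result.text_content[:500])   # converted Markdown text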
get_supported_file_formats
Format Information: Get a comprehensive list of supported file formats and their capabilities.
extract_youtube_transcript
YouTube Processing: Extract transcripts from YouTube videos with language preferences and translation using youtube-transcript-api v1.1.0+.
Stable and reliable - no authentication required.
Parameters:
- url: YouTube video URL
- languages: Preferred languages in order of preference (default: ["ja", "en"])
- translate_to: Target language for translation (optional)
- include_timestamps: Include timestamps in transcript
- preserve_formatting: Preserve original formatting
- include_metadata: Include video metadata
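For reference, a bare-bones transcript fetch with youtube-transcript-api v1.x looks roughly like this (a sketch of the library's v1 API; the tool adds URL parsing, translation, and metadata on top):

from youtube_transcript_api import YouTubeTranscriptApi

api = YouTubeTranscriptApi()
# languages is a priority list; the first available transcript wins
transcript = api.fetch("VIDEO_ID", languages=["ja", "en"])

for snippet in transcript:
    print(f"[{snippet.start:6.1f}s] {snippet.text}")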
batch_extract_youtube_transcripts
Batch YouTube Processing: Extract transcripts from multiple YouTube videos in parallel.
Enhanced performance with controlled concurrency for stable batch processing.
Parameters:
- urls: List of YouTube video URLs
- languages: Preferred languages list
- translate_to: Target language for translation (optional)
- include_timestamps: Include timestamps in transcript
- max_concurrent: Maximum concurrent requests (1-5, default: 3)
get_youtube_video_info
YouTube Info: Get available transcript information for a YouTube video without extracting the full transcript.
Parameters:
- video_url: YouTube video URL
Returns:
- Available transcript languages
- Manual/auto-generated distinction
- Translatable language information
search_google
Google Search: Perform Google search with genre filtering and metadata extraction.
Parameters:
- query: Search query string
- num_results: Number of results to return (1-100, default: 10)
- language: Search language (default: "en")
- region: Search region (default: "us")
- search_genre: Content genre filter (optional)
- safe_search: Safe search enabled (always True for security)
Features:
- Automatic title and snippet extraction from search results
- 31 available search genres for content filtering
- URL classification and domain analysis
- Safe search enforced by default
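The search layer builds on googlesearch-python; a bare-bones single query looks roughly like this (a sketch; advanced=True returning objects with url, title, and description is how recent versions of that library behave):

from googlesearch import search

for result in search("python machine learning tutorial",
                     num_results=10, lang="en", advanced=True):
    print(result.title, "-", result.url)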
batch_search_google
Batch Google Search: Perform multiple Google searches with comprehensive analysis.
Parameters:
- queries: List of search queries
- num_results_per_query: Results per query (1-100, default: 10)
- max_concurrent: Maximum concurrent searches (1-5, default: 3)
- language: Search language (default: "en")
- region: Search region (default: "us")
- search_genre: Content genre filter (optional)
Returns:
- Individual search results for each query
- Cross-query analysis and statistics
- Domain distribution and result type analysis
search_and_crawl
Integrated Search+Crawl: Perform Google search and automatically crawl top results.
Parameters:
- search_query: Google search query
- num_search_results: Number of search results (1-20, default: 5)
- crawl_top_results: Number of top results to crawl (1-10, default: 3)
- extract_media: Extract media from crawled pages
- generate_markdown: Generate markdown content
- search_genre: Content genre filter (optional)
Returns:
- Complete search metadata and crawled content
- Success rates and processing statistics
- Integrated analysis of search and crawl results
get_search_genres
Search Genres: Get a comprehensive list of available search genres and their descriptions.
Returns:
- 31 available search genres with descriptions
- Categorized genre lists (Academic, Technical, News, etc.)
- Usage examples for each genre type
Resources
- uri://crawl4ai/config: Default crawler configuration options
- uri://crawl4ai/examples: Usage examples and sample requests
Prompts
- crawl_website_prompt: Guided website crawling workflows
- analyze_crawl_results_prompt: Crawl result analysis
- batch_crawl_setup_prompt: Batch crawling setup
Configuration Examples
Google Search Examples
Basic Google Search
{
"query": "python machine learning tutorial",
"num_results": 10,
"language": "en",
"region": "us"
}
Genre-Filtered Search
{
"query": "machine learning research",
"num_results": 15,
"search_genre": "academic",
"language": "en"
}
Batch Search with Analysis
{
"queries": [
"python programming tutorial",
"web development guide",
"data science introduction"
],
"num_results_per_query": 5,
"max_concurrent": 3,
"search_genre": "education"
}
Integrated Search and Crawl
{
"search_query": "python official documentation",
"num_search_results": 10,
"crawl_top_results": 5,
"extract_media": false,
"generate_markdown": true,
"search_genre": "documentation"
}
Basic Deep Crawling
{
"url": "https://docs.example.com",
"max_depth": 2,
"max_pages": 20,
"crawl_strategy": "bfs"
}
AI-Driven Content Extraction
{
"url": "https://news.example.com",
"extraction_goal": "article summary and key points",
"content_filter": "llm",
"use_llm": true,
"custom_instructions": "Extract main article content, summarize key points, and identify important quotes"
}
File Processing Examples
PDF Document Processing
{
"url": "https://example.com/document.pdf",
"max_size_mb": 50,
"include_metadata": true
}
Office Document Processing
{
"url": "https://example.com/report.docx",
"max_size_mb": 25,
"include_metadata": true
}
ZIP Archive Processing
{
"url": "https://example.com/documents.zip",
"max_size_mb": 100,
"extract_all_from_zip": true,
"include_metadata": true
}
Automatic File Detection
The crawl_url tool automatically detects file formats and routes to appropriate processing:
{
"url": "https://example.com/mixed-content.pdf",
"generate_markdown": true
}
YouTube Video Processing Examples
Stable youtube-transcript-api v1.1.0+ integration - no setup required.
Basic Transcript Extraction
{
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
"languages": ["ja", "en"],
"include_timestamps": true,
"include_metadata": true
}
Auto-Translation Feature
{
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
"languages": ["en"],
"translate_to": "ja",
"include_timestamps": false
}
Batch Video Processing
{
"urls": [
"https://www.youtube.com/watch?v=VIDEO_ID1",
"https://www.youtube.com/watch?v=VIDEO_ID2",
"https://youtu.be/VIDEO_ID3"
],
"languages": ["ja", "en"],
"max_concurrent": 3
}
Automatic YouTube Detection
The crawl_url tool automatically detects YouTube URLs and extracts transcripts:
{
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
"generate_markdown": true
}
Video Information Lookup
{
"video_url": "https://www.youtube.com/watch?v=VIDEO_ID"
}
Entity Extraction
{
"url": "https://company.com/contact",
"entity_types": ["emails", "phones", "social_media"],
"include_context": true,
"deduplicate": true
}
Authenticated Crawling
{
"url": "https://private.example.com",
"auth_token": "Bearer your-token",
"cookies": {"session_id": "abc123"},
"headers": {"X-API-Key": "your-key"}
}
Project Structure
crawl4ai_mcp/
├── __init__.py           # Package initialization
├── server.py             # Main MCP server (1,184+ lines)
├── strategies.py         # Additional extraction strategies
└── suppress_output.py    # Output suppression utilities
config/
├── claude_desktop_config_windows.json  # Claude Desktop config (Windows)
├── claude_desktop_config_script.json   # Script-based config
└── claude_desktop_config.json          # Basic config
docs/
├── README_ja.md              # Japanese documentation
├── setup_instructions_ja.md  # Detailed setup guide
└── troubleshooting_ja.md     # Troubleshooting guide
scripts/
├── setup.sh             # Linux/macOS setup
├── setup_windows.bat    # Windows setup
└── run_server.sh        # Server startup script
Troubleshooting
Common Issues
ModuleNotFoundError:
- Ensure virtual environment is activated
- Verify PYTHONPATH is set correctly
- Install dependencies:
pip install -r requirements.txt
Playwright Browser Errors:
- Install system dependencies: sudo apt-get install libnss3 libnspr4 libasound2
- For WSL: Ensure X11 forwarding or headless mode
JSON Parsing Errors:
- Resolved: Output suppression implemented in latest version
- All crawl4ai verbose output is now properly suppressed
For detailed troubleshooting, see docs/troubleshooting_ja.md.
Supported Formats & Capabilities
Web Content
- Static Sites: HTML, CSS, JavaScript
- Dynamic Sites: React, Vue, Angular SPAs
- Complex Sites: JavaScript-heavy, async loading
- Protected Sites: Basic auth, cookies, custom headers
Media & Files
- Videos: YouTube (transcript auto-extraction)
- Documents: PDF, Word, Excel, PowerPoint, ZIP
- Archives: Automatic extraction and processing
- Text: Markdown, CSV, RTF, plain text
Search & Data
- Google Search: 31 genre filters available
- Entity Extraction: Emails, phones, URLs, dates
- Structured Data: CSS/XPath/LLM-based extraction
- Batch Processing: Multiple URLs simultaneously
Limitations & Important Notes
Known Limitations
- Authentication Sites: Cannot bypass login requirements
- reCAPTCHA Protected: Limited success on heavily protected sites
- Rate Limiting: Manual interval management recommended
- Automatic Retry: Not implemented - manual retry needed
- Deep Crawling: 5 page maximum for stability
Regional & Language Support
- Multi-language Sites: Full Unicode support
- Regional Search: Configurable region settings
- Character Encoding: Automatic detection
- Japanese Content: Complete support
Error Handling Strategy
- First Failure → Immediate manual retry
- Timeout Issues → Increase timeout settings
- Persistent Problems → Use crawl_url_with_fallback
- Alternative Approach → Try a different tool
Common Workflows
Research & Analysis
1. Competitive Analysis: search_and_crawl → intelligent_extract
2. Site Auditing: crawl_url → extract_entities
3. Content Research: search_google → batch_crawl
4. Deep Analysis: deep_crawl_site → structured extraction
Typical Success Patterns
- E-commerce Sites: Use simulate_user: true
- News Sites: Enable wait_for_js for dynamic content
- Documentation: Use deep_crawl_site with URL patterns
- Social Media: Extract entities for contact information
Performance Features
- Intelligent Caching: 15-minute self-cleaning cache with multiple modes
- Async Architecture: Built on asyncio for high performance
- Memory Management: Adaptive concurrency based on system resources
- Rate Limiting: Configurable delays and request throttling
- Parallel Processing: Concurrent crawling of multiple URLs
Security Features
- Output Suppression provides complete isolation of crawl4ai output from MCP JSON
- Authentication Support includes token-based and cookie authentication
- Secure Headers offer custom header support for API access
- Error Isolation includes comprehensive error handling with helpful suggestions
Dependencies
- crawl4ai>=0.3.0 - Advanced web crawling library
- fastmcp>=0.1.0 - MCP server framework
- pydantic>=2.0.0 - Data validation and serialization
- markitdown>=0.0.1a2 - File processing and conversion (Microsoft)
- googlesearch-python>=1.3.0 - Google search functionality
- aiohttp>=3.8.0 - Asynchronous HTTP client for metadata extraction
- beautifulsoup4>=4.12.0 - HTML parsing for title/snippet extraction
- youtube-transcript-api>=1.1.0 - Stable YouTube transcript extraction
- asyncio - Asynchronous programming support (standard library)
- typing-extensions - Extended type hints
YouTube Features Status:
- YouTube transcript extraction is stable and reliable with v1.1.0+
- No authentication or API keys required
- Works out of the box after installation
License
MIT License
Contributing
This project implements the Model Context Protocol specification. It is compatible with any MCP-compliant client and built with the FastMCP framework for easy extension and modification.
DXT Package Available
One-click installation for Claude Desktop users
This MCP server is available as a DXT (Desktop Extensions) package for easy installation:
- DXT Package: dxt-packages/crawl4ai-dxt-correct/
- Installation Guide: dxt-packages/README_DXT_PACKAGES.md
- Creation Guide: dxt-packages/DXT_CREATION_GUIDE.md
- Troubleshooting: dxt-packages/DXT_TROUBLESHOOTING_GUIDE.md
Simply drag and drop the .dxt file into Claude Desktop for instant setup.
Additional Documentation
Infrastructure & Deployment
- Docker Guide - Complete Docker containerization guide with multiple variants
- Architecture - Technical architecture, design decisions, and container infrastructure
- Build & Deployment - Build processes, CI/CD pipeline, and deployment strategies
- Configuration - Environment variables, Docker settings, and performance tuning
- Deployment Playbook - Production deployment procedures and troubleshooting
Development & Contributing
- Contributing Guide - Docker development workflow and contribution guidelines
- RunPod Deployment - Serverless cloud deployment on RunPod
- GitHub Actions - Automated CI/CD pipeline documentation
API & Integration
- Pure StreamableHTTP - Recommended HTTP transport protocol
- HTTP Server Usage - HTTP API server configuration
- Legacy HTTP API - Detailed HTTP API documentation
Localization & Support
- Japanese Documentation - Complete feature documentation in Japanese
- Japanese Troubleshooting - Troubleshooting guide in Japanese