Crawl4AI RAG MCP Server

Enables AI assistants to crawl websites, store the extracted web content with vector embeddings for semantic search, and retrieve information through natural-language queries with tag-based filtering and intelligent content cleaning.

A high-performance Retrieval-Augmented Generation (RAG) system using Crawl4AI for web content extraction, sqlite-vec for vector storage, and MCP integration for AI assistants.

Summary

This system provides a production-ready RAG solution that combines:

  • Crawl4AI for intelligent web content extraction with markdown conversion
  • SQLite with sqlite-vec for vector storage and semantic search
  • RAM Database Mode for 10-50x faster query performance
  • MCP Server for AI assistant integration (LM-Studio, Claude Desktop, etc.)
  • REST API for bidirectional communication and remote access
  • Security Layer with input sanitization and domain blocking

Quick Start

Option 1: Local Development

  1. Clone and set up:
git clone https://github.com/Rob-P-Smith/mcpragcrawl4ai.git
cd mcpragcrawl4ai
python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
pip install -r requirements.txt
  2. Start the Crawl4AI service:
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
  3. Configure the environment:
# Create .env file
cat > .env << EOF
IS_SERVER=true
USE_MEMORY_DB=true
LOCAL_API_KEY=dev-api-key
CRAWL4AI_URL=http://localhost:11235
EOF
  4. Run the MCP server:
python3 core/rag_processor.py
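
You can smoke-test the running server by sending it a JSON-RPC 2.0 initialize request over stdio. A minimal sketch, assuming the standard newline-delimited MCP stdio framing:

import json, subprocess

# Spawn the MCP server and exchange one JSON-RPC 2.0 message over stdio
proc = subprocess.Popen(
    ["python3", "core/rag_processor.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
request = {
    "jsonrpc": "2.0", "id": 1, "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke-test", "version": "0.0.1"},
    },
}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()
print(proc.stdout.readline())  # server's initialize response
proc.terminate()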

Option 2: Docker Server Deployment

  1. Deploy the full server (REST API + MCP):
cd mcpragcrawl4ai
docker compose -f deployments/server/docker-compose.yml up -d
  2. Test the deployment:
curl http://localhost:8080/health

See the Deployment Guide for complete deployment options.

Architecture

Core Components

  • MCP Server (core/rag_processor.py) - JSON-RPC 2.0 protocol handler
  • RAG Database (core/data/storage.py) - SQLite + sqlite-vec vector storage with RAM mode support
  • Content Cleaner (core/data/content_cleaner.py) - Navigation removal and quality filtering
  • Sync Manager (core/data/sync_manager.py) - RAM database differential sync with virtual table support
  • Crawler (core/operations/crawler.py) - Web crawling with DFS algorithm and content extraction
  • Defense Layer (core/data/dbdefense.py) - Input sanitization and security
  • REST API (api/api.py) - FastAPI server with 15+ endpoints
  • Auth System (api/auth.py) - API key authentication and rate limiting
  • Recrawl Utility (core/utilities/recrawl_utility.py) - Batch URL recrawling via API with concurrent processing

Database Schema

  • crawled_content - Web content with markdown, embeddings, and metadata
  • content_vectors - Vector embeddings (sqlite-vec vec0 virtual table with rowid support; see the sketch after this list)
  • sessions - User session tracking for temporary content
  • blocked_domains - Domain blocklist with wildcard patterns
  • _sync_tracker - Change tracking for RAM database differential sync (memory mode only)
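
For illustration, the vector side of this schema can be built with sqlite-vec's vec0 module. A minimal sketch (column names simplified, not the project's exact DDL):

import sqlite3
import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)  # load the vec0 extension
db.enable_load_extension(False)

# Content table plus a vec0 virtual table addressed by rowid, per the schema above
db.execute("CREATE TABLE crawled_content (id INTEGER PRIMARY KEY, url TEXT, markdown TEXT)")
db.execute("CREATE VIRTUAL TABLE content_vectors USING vec0(embedding float[384])")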

Technology Stack

  • Python 3.11+ with asyncio for concurrent operations
  • SQLite with sqlite-vec extension for vector similarity search
  • SentenceTransformers (all-MiniLM-L6-v2) for embedding generation; see the example after this list
  • langdetect for language detection and filtering
  • FastAPI for REST API with automatic OpenAPI documentation
  • Crawl4AI for intelligent web content extraction with fit_markdown
  • Docker for containerized deployment
  • aiohttp for async HTTP requests in utilities
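
For reference, the embedding step looks roughly like this; all-MiniLM-L6-v2 yields the 384-dimensional vectors stored in content_vectors:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Encode a query into a 384-dimensional vector for similarity search
vec = model.encode("list comprehensions in Python")
print(vec.shape)  # (384,)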

Key Features

Performance

  • RAM Database Mode: In-memory SQLite with differential sync for 10-50x faster queries (see the sketch after this list)
  • Vector Search: 384-dimensional embeddings using all-MiniLM-L6-v2 for semantic search
  • Batch Crawling: High-performance batch processing with retry logic and progress tracking
  • Content Optimization: 70-80% storage reduction through intelligent cleaning and filtering
  • Efficient Storage: fit_markdown conversion and content chunking for optimal retrieval
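
A minimal sketch of the RAM database idea using SQLite's backup API (the filename is hypothetical; the real implementation adds differential sync and virtual-table handling rather than full flushes):

import sqlite3

disk = sqlite3.connect("rag.db")  # hypothetical on-disk database
mem = sqlite3.connect(":memory:")
disk.backup(mem)  # load the full database into RAM
# ...serve queries from `mem` for much faster reads...
mem.backup(disk)  # flush back to disk (real code syncs only changes)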

Functionality

  • Deep Crawling: DFS-based multi-page crawling with depth and page limits (sketched after this list)
  • Content Cleaning: Automatic removal of navigation, boilerplate, and low-quality content
  • Language Filtering: Automatic detection and filtering of non-English content
  • Semantic Search: Vector similarity search with tag filtering and deduplication
  • Target Search: Intelligent search with automatic tag expansion
  • Content Management: Full CRUD operations with retention policies and session management
  • Batch Recrawling: Concurrent URL recrawling via API with rate limiting and progress tracking

Security

  • Input Sanitization: Comprehensive SQL injection defense and input validation
  • Domain Blocking: Wildcard-based domain blocking with social media and adult content filters (see the sketch after this list)
  • API Authentication: API key-based authentication with rate limiting
  • Safe Crawling: Automatic detection and blocking of forbidden content
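
Wildcard domain blocking can be approximated with fnmatch patterns; a sketch with a hypothetical blocklist:

import fnmatch

BLOCKED = ["facebook.com", "*.facebook.com", "*.doubleclick.net"]  # hypothetical entries

def is_blocked(hostname: str) -> bool:
    # A hostname is rejected if it matches any wildcard pattern
    return any(fnmatch.fnmatch(hostname, pattern) for pattern in BLOCKED)

print(is_blocked("www.facebook.com"))  # True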

Integration

  • MCP Server: Full MCP protocol support for AI assistant integration
  • REST API: Complete REST API with 15+ endpoints for all operations
  • Bidirectional Mode: Server mode (host API) and client mode (forward to remote)
  • Docker Deployment: Production-ready containerized deployment

Quick Usage Examples

Via MCP (in LM-Studio/Claude Desktop)

crawl_and_remember("https://docs.python.org/3/tutorial/", tags="python, tutorial")
search_memory("list comprehensions", tags="python", limit=5)
target_search("async programming best practices", initial_limit=5, expanded_limit=20)
get_database_stats()

Via REST API

# Crawl and store content
curl -X POST http://localhost:8080/api/v1/crawl/store \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.python.org/3/tutorial/", "tags": "python, tutorial"}'

# Semantic search
curl -X POST http://localhost:8080/api/v1/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "list comprehensions", "tags": "python", "limit": 5}'

# Get database stats
curl http://localhost:8080/api/v1/stats \
  -H "Authorization: Bearer YOUR_API_KEY"

Via Python Client

import asyncio
from api.api import Crawl4AIClient

async def main():
    client = Crawl4AIClient("http://localhost:8080", "YOUR_API_KEY")
    result = await client.crawl_and_store("https://example.com", tags="example")
    search_results = await client.search("python tutorials", limit=10)
    stats = await client.get_database_stats()

asyncio.run(main())

Performance Metrics

With RAM database mode enabled:

  • Search queries: 20-50ms (vs 200-500ms disk mode)
  • Batch crawling: 2,000+ URLs successfully processed
  • Database size: 215MB (2,296 pages, 8,196 embeddings)
  • Sync overhead: <100ms for differential sync (idle: 5s, periodic: 5min)
  • Sync reliability: 100% success rate with virtual table support
  • Memory usage: ~500MB for full in-memory database
  • Storage optimization: 70-80% reduction through content cleaning
