Crawl4AI MCP Server

Crawl4AI MCP Server

Crawl4AI MCP Server is an intelligent information retrieval server offering robust search capabilities and LLM-optimized web content understanding, utilizing multi-engine search and intelligent content extraction to efficiently gather and comprehend internet information.

weidwonder

Browser Automation
Search
Local
Python
Visit Server

README

Crawl4AI MCP Server

smithery badge

这是一个基于MCP (Model Context Protocol)的智能信息获取服务器,为AI助手系统提供强大的搜索能力和面向LLM优化的网页内容理解功能。通过多引擎搜索和智能内容提取,帮助AI系统高效获取和理解互联网信息,将网页内容转换为最适合LLM处理的格式。

特性

  • 🔍 强大的多引擎搜索能力,支持DuckDuckGo和Google
  • 📚 面向LLM优化的网页内容提取,智能过滤非核心内容
  • 🎯 专注信息价值,自动识别和保留关键内容
  • 📝 多种输出格式,支持引用溯源
  • 🚀 基于FastMCP的高性能异步设计

安装

方式1: 大部分的安装场景

  1. 确保您的系统满足以下要求:

    • Python >= 3.9
    • 建议使用专门的虚拟环境
  2. 克隆仓库:

git clone https://github.com/yourusername/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
  1. 创建并激活虚拟环境:
python -m venv crawl4ai_env
source crawl4ai_env/bin/activate  # Linux/Mac
# 或
.\crawl4ai_env\Scripts\activate  # Windows
  1. 安装依赖:
pip install -r requirements.txt
  1. 安装playwright浏览器:
playwright install

方式2: 安装到Claude桌面客户端 via Smithery

通过 Smithery 将 Crawl4AI MCP 的 Claude 桌面端服务安装自动配置至您本地的 Claude 伸展中心:

npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude

使用方法

服务器提供以下工具:

search

强大的网络搜索工具,支持多个搜索引擎:

  • DuckDuckGo搜索(默认): 无需API密钥,全面处理AbstractText、Results和RelatedTopics
  • Google搜索: 需要配置API密钥,提供精准搜索结果
  • 支持同时使用多个引擎获取更全面的结果

参数说明:

  • query: 搜索查询字符串
  • num_results: 返回结果数量(默认10)
  • engine: 搜索引擎选择
    • "duckduckgo": DuckDuckGo搜索(默认)
    • "google": Google搜索(需要API密钥)
    • "all": 同时使用所有可用的搜索引擎

示例:

# DuckDuckGo搜索(默认)
{
    "query": "python programming",
    "num_results": 5
}

# 使用所有可用引擎
{
    "query": "python programming",
    "num_results": 5,
    "engine": "all"
}

read_url

面向LLM优化的网页内容理解工具,提供智能内容提取和格式转换:

  • markdown_with_citations: 包含内联引用的Markdown(默认),保持信息溯源
  • fit_markdown: 经过LLM优化的精简内容,去除冗余信息
  • raw_markdown: 基础HTML→Markdown转换
  • references_markdown: 单独的引用/参考文献部分
  • fit_html: 生成fit_markdown的过滤后HTML
  • markdown: 默认Markdown格式

示例:

{
    "url": "https://example.com",
    "format": "markdown_with_citations"
}

示例:

# DuckDuckGo搜索(默认)
{
    "query": "python programming",
    "num_results": 5
}

# Google搜索
{
    "query": "python programming",
    "num_results": 5,
    "engine": "google"
}

如需使用Google搜索,需要在config.json中配置API密钥:

{
    "google": {
        "api_key": "your-api-key",
        "cse_id": "your-cse-id"
    }
}

LLM内容优化

服务器采用了一系列针对LLM的内容优化策略:

  • 智能内容识别: 自动识别并保留文章主体、关键信息段落
  • 噪音过滤: 自动过滤导航栏、广告、页脚等对理解无帮助的内容
  • 信息完整性: 保留URL引用,支持信息溯源
  • 长度优化: 使用最小词数阈值(10)过滤无效片段
  • 格式优化: 默认输出markdown_with_citations格式,便于LLM理解和引用

开发说明

项目结构:

crawl4ai_mcp_server/
├── src/
│   ├── index.py      # 服务器主实现
│   └── search.py     # 搜索功能实现
├── config_demo.json  # 配置文件示例
├── pyproject.toml    # 项目配置
├── requirements.txt  # 依赖列表
└── README.md        # 项目文档

配置说明

  1. 复制配置示例文件:
cp config_demo.json config.json
  1. 如需使用Google搜索,在config.json中配置API密钥:
{
    "google": {
        "api_key": "your-google-api-key",
        "cse_id": "your-google-cse-id"
    }
}

更新日志

  • 2025.02.08: 添加搜索功能,支持DuckDuckGo(默认)和Google搜索
  • 2025.02.07: 重构项目结构,使用FastMCP实现,优化依赖管理
  • 2025.02.07: 优化内容过滤配置,提高token效率并保持URL完整性

许可证

MIT License

贡献

欢迎提交Issue和Pull Request!

作者

  • Owner: weidwonder
  • Coder: Claude Sonnet 3.5
    • 100% Code wrote by Claude. Cost: $9 ($2 for code writing, $7 cost for Debuging😭)
    • 3 hours time cost. 0.5 hours for code writing, 0.5 hours for env preparing, 2 hours for debuging.😭

致谢

感谢所有为项目做出贡献的开发者!

特别感谢:

  • Crawl4ai 项目提供的优秀网页内容提取技术支持

Recommended Servers

playwright-mcp

playwright-mcp

A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.

Official
Featured
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.

Official
Featured
Local
TypeScript
Kagi MCP Server

Kagi MCP Server

An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.

Official
Featured
Python
Exa Search

Exa Search

A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.

Official
Featured
Excel MCP Server

Excel MCP Server

A Model Context Protocol server that enables AI assistants to read from and write to Microsoft Excel files, supporting formats like xlsx, xlsm, xltx, and xltm.

Featured
Local
Go
Playwright MCP Server

Playwright MCP Server

Provides a server utilizing Model Context Protocol to enable human-like browser automation with Playwright, allowing control over browser actions such as navigation, element interaction, and scrolling.

Featured
Local
TypeScript
@kazuph/mcp-fetch

@kazuph/mcp-fetch

Model Context Protocol server for fetching web content and processing images. This allows Claude Desktop (or any MCP client) to fetch web content and handle images appropriately.

Featured
Local
JavaScript
Claude Code MCP

Claude Code MCP

An implementation of Claude Code as a Model Context Protocol server that enables using Claude's software engineering capabilities (code generation, editing, reviewing, and file operations) through the standardized MCP interface.

Featured
Local
JavaScript
Apple MCP Server

Apple MCP Server

Enables interaction with Apple apps like Messages, Notes, and Contacts through the MCP protocol to send messages, search, and open app content using natural language.

Featured
Local
TypeScript
DuckDuckGo MCP Server

DuckDuckGo MCP Server

A Model Context Protocol (MCP) server that provides web search capabilities through DuckDuckGo, with additional features for content fetching and parsing.

Featured
Python